Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
Table 10.2 Results of the single-block regression models. PCovR is Principal Covariates Regression, U-PLS is unfold-PLS, MCovR is multiway covariates regression. The 3,2,3 components for MCovR refer to the components for the three modes of Tucker3. For more explanation, see text.
Table 10.3 Results of the multiway multiblock models. MB-PLS is multiblock PLS, MWMBCovR is multiway multiblock covariates regression. For more explanation, see text.
Table 10.4 SO-PLS-PM results for wine data. The four columns of numbers correspond to the explained variances for the models for the endogenous blocks B, C, D, and E (the numbers in parentheses represent the number of components used). Source: (Romano et al., 2019). Reproduced with permission from Wiley.
Table 11.1 R packages on CRAN having one or more multiblock methods.
Table 11.2 MATLAB toolboxes and functions having one or more multiblock methods.
Table 11.3 Python packages having one or more multiblock methods.
Table 11.4 Commercial software having one or more multiblock methods.
1 Introduction
1.1 Scope of the Book
In many areas of the natural and life sciences, data sets are collected that consist of multiple blocks of data measured on the same or similar systems. Examples are abundant: in genomics, it is becoming increasingly common to measure gene expression, protein abundances and metabolite levels on the same biological system (Clish et al., 2004; Heijne et al., 2005; Kleemann et al., 2007; Curtis et al., 2012; Brink-Jensen et al., 2013; Franzosa et al., 2015). In sensory science, the interest is often in relations between the chemical and sensory properties of the samples involved as well as consumer liking of the same samples (Næs et al., 2010). In chemistry, different types of instruments are sometimes used to characterise different properties of the same set of samples (de Juan and Tauler, 2006). In cohort studies, it is increasingly popular to perform the same type of measurements in different cohorts to confirm results and perform meta-analyses. In the (bio-)chemical process industry, plant-wide measurements collected by several sensors in the plant are available (Lopes et al., 2002). Clinical trials are often supported by auxiliary measurements such as gene expression and cytokines to characterise immune responses (Coccia et al., 2018). Challenge tests to establish the health status of individuals usually contain multiple types of data collected for the same individuals as a function of time (Wopereis et al., 2009; Pellis et al., 2012; Kardinaal et al., 2015). All these examples show that simple, single-block data sets are becoming less common.
Unfortunately, there is no consensus yet about terminology regarding the structure of such data sets and the related research questions. In bioinformatics, the terms data fusion or data integration are often used, where the latter further distinguishes N- and P-integration (N means that the blocks share the same samples, P that they share the same variables) as well as horizontal and vertical integration. In psychometrics, the terms multiset and multigroup data analysis are used; in chemometrics, the term multiblock data analysis is used; and in the computational sciences and machine learning, the terms multiview or multitable data analysis are used. We will encounter all these terms in this book, but we will use the term multiblock as much as possible.
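The distinction between N- and P-integration can be made concrete with a minimal sketch in plain Python. It assumes each block is stored with explicit sample (row) and variable (column) labels. The `Block` class, the `shared_mode` function and all numbers are hypothetical illustrations, not taken from the book or from any package.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One data block: a matrix with labelled rows (samples) and columns (variables)."""
    name: str
    samples: list    # row labels
    variables: list  # column labels
    values: list     # list of rows, each of length len(variables)


def shared_mode(blocks):
    """Return 'N' if all blocks share the same samples, 'P' if they share the
    same variables, 'both' if they share both, and 'none' otherwise."""
    first = blocks[0]
    same_samples = all(b.samples == first.samples for b in blocks)
    same_variables = all(b.variables == first.variables for b in blocks)
    if same_samples and same_variables:
        return "both"
    if same_samples:
        return "N"
    if same_variables:
        return "P"
    return "none"


# Made-up example: two omics blocks measured on the same three samples.
genes = Block("genes", ["s1", "s2", "s3"], ["g1", "g2"],
              [[0.1, 1.2], [0.4, 0.9], [0.3, 1.1]])
metabolites = Block("metabolites", ["s1", "s2", "s3"], ["m1", "m2", "m3"],
                    [[5.0, 2.1, 0.7], [4.8, 2.4, 0.6], [5.3, 1.9, 0.8]])

# Made-up example: the same two variables measured in two different cohorts.
cohort_a = Block("cohort_a", ["a1", "a2"], ["bmi", "glucose"],
                 [[22.5, 5.1], [27.0, 6.3]])
cohort_b = Block("cohort_b", ["b1", "b2", "b3"], ["bmi", "glucose"],
                 [[24.1, 5.4], [30.2, 7.0], [21.8, 4.9]])

print(shared_mode([genes, metabolites]))  # blocks share their samples
print(shared_mode([cohort_a, cohort_b]))  # blocks share their variables
```

Comparing labels rather than only matrix dimensions is a deliberate choice in this sketch: two blocks with the same number of rows do not necessarily describe the same samples.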
In Elaboration 1.1 we define the terms concerning data sets that we will use throughout this book. Sometimes, we will sidestep this to some extent to make connections between fields. In those places we will clarify exactly what we mean.
ELABORATION 1.1
Glossary of terms
Data set: The total collection of all data that is under consideration for a particular problem.
Data block: One block of data organised in a matrix (array) with rows and columns as a part of a data set.
Multiblock data set: The organisation of the data set in blocks of data.
Multiblock data analysis: The process of analysing the whole multiblock data set simultaneously using multiblock methods.
Object, Subject, Sample: Entity for which measurements are obtained. They can be random drawings from a population and/or they can come from an experimental design. The general term is a sample but if these samples pertain to human beings they may be called subjects. They constitute the row entries of a matrix.
Variable: A measured property of an entity collected in the columns of a matrix; this is called a feature in machine learning.
Measurement scale: The scale on which a variable is measured (ratio, interval, ordinal, or nominal-scaled).
Homogeneous versus heterogeneous data: If a data set contains blocks of data all measured on the same scale then this is called homogeneous data; if not, then the data are called heterogeneous. In most cases, homogeneous data will refer to blocks containing quantitative data (at least interval-scaled).
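The homogeneous/heterogeneous distinction in the glossary can be illustrated with a short Python sketch. It makes two simplifying assumptions that are not from the book: each block carries a single measurement scale for all its variables, and "the same scale" is read strictly, so a mix of interval and ratio blocks counts as heterogeneous here. The function name `classify_data_set` and the example data sets are hypothetical.

```python
# The four measurement scales named in the glossary.
SCALES = {"nominal", "ordinal", "interval", "ratio"}


def classify_data_set(block_scales):
    """Label a data set 'homogeneous' or 'heterogeneous'.

    block_scales maps a block name to the single measurement scale assumed
    for every variable in that block (a simplification for illustration).
    """
    if not block_scales:
        raise ValueError("need at least one block")
    distinct = set(block_scales.values())
    unknown = distinct - SCALES
    if unknown:
        raise ValueError(f"unknown measurement scale(s): {sorted(unknown)}")
    # Strict reading: homogeneous only if every block uses the same scale.
    return "homogeneous" if len(distinct) == 1 else "heterogeneous"


# Hypothetical data sets; block names and scales are made up.
omics = {"transcripts": "ratio", "metabolites": "ratio"}
sensory = {"chemistry": "ratio", "liking_rank": "ordinal"}

print(classify_data_set(omics))
print(classify_data_set(sensory))
```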
Elaboration 1.1 suggests a consistent vocabulary to be used in the book. However, the difference between variables and objects is not always that clear (for examples, see Chapter 8 on complex relations). We will try, however, to remain as consistent as possible and give extra explanations of terms at the appropriate places. In the rest of this chapter we will delineate our potential audience. We will give some examples of why multiblock methods are necessary and give an overview of the types of problems encountered. Moreover, we will give some history and discuss briefly some fundamental concepts which we need in the rest of the book. We end by giving the notation which we will use in this book and a list of abbreviations.
1.2 Potential Audience
Our ambition is to serve different types of audiences. The first set of users consists of practitioners in the natural and life sciences, such as in bioinformatics, sensometrics, chemometrics, statistics, and machine learning. They will mainly be interested in the question of how to perform multiblock data analysis and which method to use in which data analysis situation. They may benefit from reading the main text and studying the examples. The second set of users consists of method developers. They want to know what is already available and spot niches for further development; apart from the main text and the examples, they may also be interested in the elaborations. The final set of users consists of computer scientists and software developers. They want to know which methods are worthwhile to build software for and may also study the algorithms.
We will try to serve all groups. This means that we will explain most of the methods in a rather detailed manner (especially in Parts II and III) and will also pay attention to validation and visualisation to encourage proper interpretation. At the end of the book in Chapter 11, we describe multiblock toolboxes and packages in R, MATLAB and Python and showcase the accompanying R package multiblock which includes many of the methods described in this book.
1.3 Types of Data and Analyses
1.3.1 Supervised and Unsupervised Analyses
In any multiblock data analysis, we first have to choose between the two main paradigms: unsupervised and supervised analysis. Unsupervised analysis refers to explorative analysis that looks for structure and connections in the data, either in a single data block or across data blocks, typically using dimension reduction including maximisation/minimisation of some criterion combined with orthogonalisation, or by clustering