Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
Multivariate data analysis becomes numerically stable and statistically robust if the components are chosen in a suitable way.
Empirical validation of the models becomes manageable.
The effect of measurement noise is reduced.
Outliers can often be detected by visual inspection of the associated subspace projections provided by the extracted components.
1.3.4 Indirect Versus Direct Data
When discussing types of data, it is useful to distinguish between direct and indirect data. Direct data are always in the form of a matrix or table containing measurements of variables on a set of samples. Indirect data, or derived data, are always in the form of variables × variables or samples × samples matrices. Examples of such data are cross-products of matrices of direct data, covariances, distances and the like. The main focus in this book is on direct data, although we will discuss some indirect methods as well. There are three reasons for this focus. First, it limits the scope. Second, analyses of direct data are usually easier to understand and interpret. Third, in many applications of multiblock data analysis in the natural and life sciences, direct data are available. For a more formal description of this distinction, see Section 2.2.1.
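As a small illustration of the distinction, the sketch below starts from a direct data matrix (samples × variables) and derives three indirect matrices from it. The data are random numbers, and the variable names (`X`, `cov`, `gram`, `dist`) are chosen here for illustration only; they are not notation from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Direct data: samples in rows, variables in columns (synthetic values).
n_samples, n_variables = 6, 4
X = rng.normal(size=(n_samples, n_variables))

# Column-centre the data before forming cross-products.
Xc = X - X.mean(axis=0)

# Indirect data, variables x variables: the covariance matrix.
cov = Xc.T @ Xc / (n_samples - 1)

# Indirect data, samples x samples: the cross-product (Gram) matrix.
gram = Xc @ Xc.T

# Indirect data, samples x samples: Euclidean distances between samples.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

print(X.shape, cov.shape, gram.shape, dist.shape)
```

Note that the indirect matrices are always square and symmetric, whereas the direct matrix keeps the original samples-by-variables layout.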
1.3.5 Heterogeneous Fusion
The final property of data we need to present is whether all blocks in the data set are measured on the same scale or not, i.e., if the data set is homogeneous or heterogeneous. These concepts are explained in more detail in Chapter 2 (Section 2.2.2). Briefly, if all blocks contain measurements on the same scale, e.g., they are all numerical or quantitative data, then the resulting problem will be called homogeneous fusion. If they are not of the same scale, e.g., a mixture of quantitative and binary measurements, then the problem is called heterogeneous fusion. We will discuss both of these in this book although most methods are made for homogeneous data.
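A minimal sketch of this distinction, under a deliberately crude assumption: a block is labelled binary when it contains only the values 0 and 1, and quantitative otherwise. The helper names `block_scale` and `fusion_type` are hypothetical, and the formal treatment of measurement scales in Section 2.2.2 is richer than this two-way split.

```python
import numpy as np

def block_scale(block):
    """Label a block 'binary' if it holds only 0/1 values, else 'quantitative'."""
    values = np.unique(np.asarray(block))
    return "binary" if np.isin(values, [0, 1]).all() else "quantitative"

def fusion_type(blocks):
    """Return 'homogeneous' if all blocks share one scale, else 'heterogeneous'."""
    scales = {block_scale(b) for b in blocks}
    return "homogeneous" if len(scales) == 1 else "heterogeneous"

rng = np.random.default_rng(1)
quant_a = rng.normal(size=(5, 3))           # quantitative block
quant_b = rng.normal(size=(5, 2))           # another quantitative block
binary_c = rng.integers(0, 2, size=(5, 4))  # binary block (0/1 entries)

print(fusion_type([quant_a, quant_b]))   # both blocks quantitative
print(fusion_type([quant_a, binary_c]))  # quantitative mixed with binary
```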
1.4 Examples
This section contains some examples of multiblock data analysis problems in different fields of the natural and life sciences. It serves to give an idea about which types of questions are asked and which types of data sets are available. A full explanation of the methods used is given in the following chapters. These examples are only appetisers!
1.4.1 Metabolomics
Metabolomics is the part of the life sciences concerned with measuring and studying the behaviour of metabolites (small biochemical compounds) in biological systems. The field has grown considerably in the last 20 years, with conferences and dedicated journals. A large part of the applications concerns finding biomarkers for diseases, which translates into finding the metabolites that discriminate between groups of objects (e.g., control versus diseased subjects). Elaboration 1.3 shows some of the terms used in metabolomics research.
ELABORATION 1.3
Terms in metabolomics and proteomics
Biomarkers: Chemical compounds (e.g., metabolites) that mark a difference between conditions, e.g., between healthy and diseased persons.
GC-MS: Gas chromatography–mass spectrometry. A separation method coupled to a mass spectrometer used a lot in advanced chemical analyses of volatile compounds.
LC-MS: Liquid chromatography–mass spectrometry. A separation method coupled to a mass spectrometer used a lot in advanced chemical analyses for a large diversity of chemical compounds.
Metabolome: The set of all metabolites of a biological organism responsible for its metabolism.
NMR: Nuclear magnetic resonance. A fast chemical analysis method giving a fingerprint of a sample and concentrations of chemical compounds.
Proteomics: The study and measurements of proteins in biological organisms. Proteins are mostly enzymes catalysing metabolic reactions.
There are several multiblock data analysis challenges in metabolomics. It is increasingly popular to measure different sets of chemically related metabolites on the same samples using different instrumental protocols (Smilde et al., 2005b; Pellis et al., 2012; Kardinaal et al., 2015). These blocks of data (each block pertaining to one instrumental protocol) then need to be combined to arrive at a global view on metabolism. Metabolites can also be measured in different compartments, such as in blood, urine, liver, muscle, kidney (Fazelzadeh et al., 2016). This also generates multiblock data analysis problems. Metabolites are converted in biochemical reactions catalysed by enzymes (proteins). Hence, it is also worthwhile in some cases to measure proteins and combine those with metabolomics measurements (Wopereis et al., 2009). Plants are complex organisms with a rich variety of metabolites. The metabolism of plants is influenced by environmental conditions, such as temperature and light. Example 1.1 illustrates this.
Example 1.1: Metabolomics example: plant science data
This metabolomics example comes from a larger study in plant sciences (Caldana et al., 2011). The goal of the study was to investigate changes in metabolism and gene-expression of Arabidopsis related to growth under different light and temperature conditions. To this end, time-resolved experiments were performed. The design of the data set is shown in Figure 1.3. It is not a fully crossed design, but for each cell in the design gene-expression and metabolomics measurements were performed at 19 time points. We will only use the metabolomics measurements, which comprised around 65 identified metabolites, and only the part measured at 21 °C (the third line in the table below). This results in four blocks of data (21-D, 21-LL, 21-L and 21-HL), each consisting of 19 rows (time points) and 65 columns (measured metabolites). Hence, we only study the factors light and time (the factor temperature is kept constant).
Figure 1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.
A first impression of the variation in metabolite levels can be obtained by performing a principal component analysis (PCA) on the data (see Figure 1.4(a)), where we have concatenated all four blocks (21-D, 21-LL, 21-L and 21-HL) below each other. The colour coding is according to the light conditions, and this figure shows that there is systematic variation in the data associated with the factor light. A more advanced analysis of these data uses a multiblock data analysis method that takes into account the underlying experimental design, such as ANOVA-simultaneous component analysis (ASCA, see Chapter 6). Figure 1.4(b) shows the scores on the first ASCA interaction component, which clearly show a time-dependent contrast between dark and high light conditions. The original data set also comprises gene-expression measurements, which makes the problem even more challenging.
Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.
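The PCA step of Example 1.1 can be sketched as follows. This is only an illustration of stacking four 19 × 65 blocks and extracting two components: the numbers are random stand-ins rather than the plant data, the preprocessing (column-centring only) is an assumption, and the names `blocks`, `light`, `scores` and `loadings` are chosen here for convenience.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins for the four light-condition blocks at 21 degrees:
# 19 time points (rows) by 65 metabolites (columns), filled with random numbers.
labels = ["21-D", "21-LL", "21-L", "21-HL"]
blocks = {name: rng.normal(loc=i, size=(19, 65)) for i, name in enumerate(labels)}

# Concatenate the blocks below each other: shared columns, stacked rows.
X = np.vstack([blocks[name] for name in labels])
light = np.repeat(labels, 19)  # light condition of each row, usable for colouring

# PCA via the singular value decomposition of the column-centred matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                  # scores on the first two components
loadings = Vt[:2].T                        # corresponding loadings
explained = s[:2] ** 2 / np.sum(s ** 2)    # fraction of variance per component

print(X.shape, scores.shape, loadings.shape)
```

Plotting the two columns of `scores` against each other, coloured by `light`, would give a score plot of the kind described for Figure 1.4(a). This sketch ignores the experimental design; a design-aware method such as ASCA is treated in Chapter 6.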
1.4.2 Genomics
Genomics