Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs


       Multivariate data analysis becomes numerically stable and statistically robust if the components are chosen in a suitable way.

       Empirical validation of the models becomes manageable.

       The effect of measurement noise is reduced.

       Outliers can often be detected by visual inspection of the associated subspace projections provided by the extracted components.
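
      The noise-reduction point in the list above can be illustrated with a minimal sketch. It assumes simulated data (not data from the book) with two underlying components plus random noise, and uses a truncated singular value decomposition as a stand-in for component extraction. All names and sizes are hypothetical.

```python
import numpy as np

# Simulated data (an assumption for illustration): 50 samples, 20 variables,
# generated from 2 underlying components plus measurement noise.
rng = np.random.default_rng(42)
n_samples, n_variables, n_components = 50, 20, 2

scores = rng.normal(size=(n_samples, n_components))
loadings = rng.normal(size=(n_components, n_variables))
X_true = scores @ loadings
X_noisy = X_true + 0.1 * rng.normal(size=X_true.shape)

# Extract components with a truncated SVD (no centring, to keep the sketch short).
U, s, Vt = np.linalg.svd(X_noisy, full_matrices=False)
X_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]

# Compare the distance to the noise-free data before and after truncation.
err_noisy = np.linalg.norm(X_noisy - X_true)
err_hat = np.linalg.norm(X_hat - X_true)
print(err_hat < err_noisy)
```

      In this setting the two-component reconstruction lies closer to the noise-free data than the raw measurements do, because most of the noise falls outside the retained subspace.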

      1.3.4 Indirect Versus Direct Data

      When discussing types of data, it is useful to distinguish between direct and indirect data. Direct data are always in the form of a matrix or table containing measurements of variables on a set of samples. Indirect (or derived) data are always in the form of variables × variables or samples × samples matrices. Examples of such data are cross-products of matrices of direct data, covariances, distances and the like. The main focus in this book is on direct data, although we will discuss some methods for indirect data as well. There are three reasons for this focus. First, it limits the scope of the book. Second, analyses of direct data are usually easier to understand and interpret. Third, in many applications of multiblock data analysis in the natural and life sciences, direct data are available. For a more formal description of this distinction, see Section 2.2.1.
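
      As a minimal sketch of this distinction, the following derives several indirect data matrices from one direct data matrix. The data are random numbers and the variable names are hypothetical; only standard NumPy calls are used.

```python
import numpy as np

# Hypothetical direct data: 5 samples (rows) measured on 3 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Column-centre so that the cross-product is proportional to the covariance.
Xc = X - X.mean(axis=0)

# Indirect data of size variables x variables (3 x 3).
cross_product = Xc.T @ Xc
covariance = cross_product / (X.shape[0] - 1)

# Indirect data of size samples x samples (5 x 5).
gram = Xc @ Xc.T
diff = X[:, None, :] - X[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distances between samples

print(covariance.shape, gram.shape, distances.shape)
```

      The direct matrix keeps the samples and variables visible side by side, whereas each derived matrix summarises only one of the two modes.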

      1.3.5 Heterogeneous Fusion

      The final property of data we need to present is whether all blocks in the data set are measured on the same scale or not, i.e., whether the data set is homogeneous or heterogeneous. These concepts are explained in more detail in Chapter 2 (Section 2.2.2). Briefly, if all blocks contain measurements on the same scale, e.g., they are all quantitative (numerical) data, then the resulting problem is called homogeneous fusion. If they are not on the same scale, e.g., a mixture of quantitative and binary measurements, then the problem is called heterogeneous fusion. We will discuss both of these in this book, although most methods are designed for homogeneous data.
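
      The distinction can be sketched in a few lines of code. The helper functions below are hypothetical and deliberately crude: a block is labelled binary if it contains only the values 0 and 1, and quantitative otherwise. The formal treatment in Section 2.2.2 is richer than this.

```python
import numpy as np

def measurement_scale(block):
    """Hypothetical helper: 'binary' if the block holds only 0/1 values, else 'quantitative'."""
    values = np.unique(np.asarray(block))
    return "binary" if np.isin(values, [0, 1]).all() else "quantitative"

def fusion_type(blocks):
    """Label a set of blocks as a homogeneous or heterogeneous fusion problem."""
    scales = {measurement_scale(block) for block in blocks}
    return "homogeneous" if len(scales) == 1 else "heterogeneous"

# Simulated blocks measured on the same 6 samples (illustrative data only).
rng = np.random.default_rng(1)
concentrations = rng.normal(size=(6, 4))    # quantitative
intensities = rng.normal(size=(6, 3))       # quantitative
presence = rng.integers(0, 2, size=(6, 5))  # binary (0 or 1)

print(fusion_type([concentrations, intensities]))
print(fusion_type([concentrations, presence]))
```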

      1.4 Examples

      This section contains some examples of multiblock data analysis problems in different fields of the natural and life sciences. It serves to give an idea about which types of questions are asked and which types of data sets are available. A full explanation of the methods used is given in the following chapters. These examples are only appetisers!

      1.4.1 Metabolomics

       ELABORATION 1.3

      Terms in metabolomics and proteomics

      Biomarkers: Chemical compounds (e.g., metabolites) that mark a difference between conditions, e.g., between healthy and diseased persons.

      GC-MS: Gas chromatography–mass spectrometry. A separation method coupled to a mass spectrometer used a lot in advanced chemical analyses of volatile compounds.

      LC-MS: Liquid chromatography–mass spectrometry. A separation method coupled to a mass spectrometer used a lot in advanced chemical analyses for a large diversity of chemical compounds.

      Metabolome: The set of all metabolites of a biological organism responsible for its metabolism.

      NMR: Nuclear magnetic resonance. A fast chemical analysis method giving a fingerprint of a sample and concentrations of chemical compounds.

      Proteomics: The study and measurements of proteins in biological organisms. Proteins are mostly enzymes catalysing metabolic reactions.

      There are several multiblock data analysis challenges in metabolomics. It is increasingly popular to measure different sets of chemically related metabolites on the same samples using different instrumental protocols (Smilde et al., 2005b; Pellis et al., 2012; Kardinaal et al., 2015). These blocks of data (each block pertaining to one instrumental protocol) then need to be combined to arrive at a global view on metabolism. Metabolites can also be measured in different compartments, such as in blood, urine, liver, muscle, kidney (Fazelzadeh et al., 2016). This also generates multiblock data analysis problems. Metabolites are converted in biochemical reactions catalysed by enzymes (proteins). Hence, it is also worthwhile in some cases to measure proteins and combine those with metabolomics measurements (Wopereis et al., 2009). Plants are complex organisms with a rich variety of metabolites. The metabolism of plants is influenced by environmental conditions, such as temperature and light. Example 1.1 illustrates this.

      Example 1.1: Metabolomics example: plant science data

      Figure 1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.

      

      Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.

      1.4.2 Genomics
