Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs


CLD=common/local/distinct, LS=least squares, ML=maximum likelihood, ED=eigendecomposition, MC=maximising correlations/covariances. The abbreviations for the methods follow the same order as the sections. For abbreviations (or descriptions) of the methods, see Section 1.11.

       Table 10.2 Results of the single-block regression models. PCovR is Principal Covariates Regression, U-PLS is unfold-PLS, MCovR is multiway covariates regression. The 3,2,3 components for MCovR refer to the components for the three modes of Tucker3. For more explanation, see text.

       Table 10.3 Results of the multiway multiblock models. MB-PLS is multiblock PLS, MWMBCovR is multiway multiblock covariates regression. For more explanation, see text.

       Table 11.1 R packages on CRAN having one or more multiblock methods.

       Table 11.2 MATLAB toolboxes and functions having one or more multiblock methods.

       Table 11.3 Python packages having one or more multiblock methods.

Part I Introductory Concepts and Theory

      1.1 Scope of the Book

      In many areas of the natural and life sciences, data sets are collected that consist of multiple blocks of data measured on the same or similar systems. Examples are abundant. In genomics, for instance, it is becoming increasingly common to measure gene-expression, protein abundances and metabolite levels on the same biological system (Clish et al., 2004; Heijne et al., 2005; Kleemann et al., 2007; Curtis et al., 2012; Brink-Jensen et al., 2013; Franzosa et al., 2015). In sensory science, the interest is often in relations between the chemical and sensory properties of the samples involved as well as consumer liking of the same samples (Næs et al., 2010). In chemistry, different types of instruments are sometimes utilised to characterise different properties of the same set of samples (de Juan and Tauler, 2006). In cohort studies, it is increasingly popular to perform the same type of measurements in different cohorts to confirm results and perform meta-analyses. In the (bio-)chemical process industry, plant-wide measurements collected by several sensors in the plant are available (Lopes et al., 2002). Clinical trials are often supported by auxiliary measurements such as gene-expression and cytokines to characterise immune responses (Coccia et al., 2018). Challenge tests to establish the health status of individuals usually contain multiple types of data collected for the same individuals as a function of time (Wopereis et al., 2009; Pellis et al., 2012; Kardinaal et al., 2015). All these examples show that simple data sets are becoming less common.

      In Elaboration 1.1 we define the terms concerning data sets that we will use throughout this book. Occasionally we will depart from these definitions to some extent in order to make connections between fields. In those places we will clarify exactly what we mean.

       ELABORATION 1.1

      Glossary of terms

      Elaboration 1.1 suggests a consistent vocabulary to be used in the book. However, the difference between variables and objects is not always clear (for examples, see Chapter 8 on complex relations). We will nevertheless try to remain as consistent as possible and give extra explanations of terms at the appropriate places. In the rest of this chapter we will delineate our potential audience, give some examples of why multiblock methods are necessary, and give an overview of the types of problems encountered. Moreover, we will give some history and briefly discuss some fundamental concepts which we need in the rest of the book. We end by giving the notation which we will use in this book and a list of abbreviations.

      1.2 Potential Audience

      Our ambition is to serve different types of audiences. The first set of users consists of practitioners in the natural and life sciences, in fields such as bioinformatics, sensometrics, chemometrics, statistics, and machine learning. They will mainly be interested in the question of how to perform multiblock data analysis and which method to use in which data analysis situation. They may benefit from reading the main text and studying the examples. The second set of users consists of method developers. They want to know what is already available and to spot niches for further development; apart from the main text and the examples, they may also be interested in the elaborations. The final set of users consists of computer scientists and software developers. They want to know which methods are worthwhile to build software for and may also study the algorithms.

      1.3 Types of Data and Analyses

      1.3.1 Supervised and Unsupervised Analyses

      In any multiblock data analysis, we first have to choose between the two main paradigms: unsupervised and supervised analysis. Unsupervised analysis refers to explorative analysis that looks for structure and connections in the data, either in a single data block or across data blocks, typically using dimension reduction including maximisation/minimisation of some criterion combined with orthogonalisation, or by clustering
