Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
ELABORATION 1.4
Terms in genomics
CNA: Many biological organisms have several copies of the same gene (see Figure 1.5). Copy number aberration (CNA) quantifies this (see Example 1.2).
Epi-genetics: DNA can be modified chemically thereby regulating expression of the corresponding genes. This chemical modification of the DNA is called epi-genetics (see Figure 1.5).
Genetics: Biological organisms have DNA encoding their genetic make-up. Genetics studies this DNA.
Methylation: A methyl group can be attached to the DNA, affecting transcription. This is a part of epi-genetics (see Figure 1.5).
Mutation: DNA consists of four types of nucleotides (A, T, G, and C) containing the genetic code. Some of these nucleotides may be mutated, e.g., change from A to T. If this happens for a single nucleotide then this is called a single nucleotide polymorphism (SNP, see Figure 1.5).
RNAseq: The modern way of measuring gene-expression or the RNA of a biological organism. There are many types of RNA of which messenger-RNA (mRNA) is the most studied one.
Transcriptomics: Genes are transcribed to RNA and transcriptomics concerns the analysis of these transcripts.
Figure 1.5 Idea of copy number variation (a), methylation (b), and mutation (c) of the DNA. For (a) and (c): Source: Adapted from Koch et al., 2012.
Genomics is a very active field with many multiblock data analysis challenges due to the rapid development of measuring techniques. Whereas gene-expression was formerly measured with micro-arrays, this technology has been overtaken by next generation sequencing (mRNAseq, miRNAseq, siRNAseq, and scRNAseq, to name a few). This has led to open-access repositories containing genomics data of very different types, e.g., in cancer research (Tomczak et al., 2015), and these repositories are often the basis for generating new multiblock data analysis methods (Aben et al., 2016, 2018; Song et al., 2018). Other examples combine genomics data with data from non-omics techniques such as medical imaging, e.g., for treatment response predictions.
Example 1.2: Genetics example
In cancer research, cell-lines derived from tumour tissue are often used (Iorio et al., 2016). Many measurements of these cell-lines are made available in public databases. Such measurements may consist of measured RNA-levels (ratio-scaled values), but also of measurements related to mutations (so-called single nucleotide polymorphisms or SNPs), which are on/off measurements and intrinsically of a binary nature.
One of the possible genetic determinants is the copy number of a gene, see Figure 1.5(a); such a gene may be duplicated. An extra layer of gene-regulation is provided by methylation of certain nucleotides of the genome (see Figure 1.5(b)). If a nucleotide is methylated, then transcription of the corresponding gene cannot occur; this area of genetics is called epi-genetics. There are different ways of expressing methylation, but the simplest one is a yes or no indicating whether a specific site is methylated. At a certain position on the genome, one nucleotide may have been changed (see Figure 1.5(c)). Such data are also binary, since there is either a SNP or no SNP at a certain position on the genome. Hence, treating such data in a multiblock fusion setting requires specialised methods, see Chapter 5.
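To make the mixed measurement scales concrete, the following minimal sketch builds three simulated blocks that share the same cell-lines as rows: one ratio-scaled block (RNA levels) and two binary blocks (SNPs and methylation). All names, sizes, and values here are hypothetical and chosen only for illustration; this is not the authors' data or method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cell_lines = 6  # hypothetical number of shared samples (rows)

# Block 1: RNA levels, ratio-scaled (positive, continuous); simulated values.
rna = rng.lognormal(mean=2.0, sigma=0.5, size=(n_cell_lines, 4))
# Block 2: SNP present (1) or absent (0) at each of 5 hypothetical positions.
snp = rng.integers(0, 2, size=(n_cell_lines, 5))
# Block 3: site methylated (1) or not (0) at each of 3 hypothetical sites.
methylation = rng.integers(0, 2, size=(n_cell_lines, 3))

blocks = {"rna": rna, "snp": snp, "methylation": methylation}


def describe_block(x):
    """Report the shape of a block and whether it holds only 0/1 values."""
    is_binary = bool(np.isin(x, (0, 1)).all())
    return {"shape": x.shape, "scale": "binary" if is_binary else "ratio"}


summary = {name: describe_block(x) for name, x in blocks.items()}
for name, info in summary.items():
    print(name, info)
```

The blocks have the same number of rows but different numbers of columns and different measurement scales, which is why simply concatenating them and applying a least-squares method may not be appropriate.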
1.4.3 Systems Biology
Taking it one step further in terms of omics measurements, we enter the area of systems biology. The general idea of systems biology is to describe biological systems as a network of interacting biochemical compounds. Often, the interactions in such networks show emerging behaviour which cannot be understood from studying single biochemical compounds (Bruggeman and Westerhoff, 2007).
There are basically two approaches to systems biology: top-down and bottom-up (Shahzad and Loor, 2012). In bottom-up approaches, fundamental models are made of parts of biochemical systems and, subsequently, parameters in those models are fitted to data. In top-down systems biology, many types of omics data are collected and these are combined into one holistic analysis. The latter goes under different names: intra- and inter-omics analysis, cross-omics analysis, statistical integration, statistical data fusion to name a few (Tayrac et al., 2009; Richards et al., 2010; Richards and Holmes, 2014). In all these top-down applications, multiblock data analysis is important. See also Elaboration 1.5 for more explanation.
ELABORATION 1.5
Terms in systems biology
Biological networks: In biological organisms, biochemical compounds act together in networks of activity. An example is a metabolic network describing all the conversions taking place in the metabolism of a cell.
Bottom-up: Approach in which detailed biochemical knowledge of a biological system is used to build mathematical models of that system (e.g., in terms of sets of differential equations). Such models are necessarily limited in size; they describe only a small part of the system.
Emerging property: Property of a system which cannot be understood from its single actors. Temperature is an example of an emerging property of a system containing a large number of molecules that interact.
Microbiome: The whole set of micro-organisms in and around a biological host. The gut-microbiome is the most famous example; essential for humans to metabolise food.
Top-down: Approach in which many measurements are performed on the same biological system and empirical modelling is subsequently used to model that system. These models usually contain many biochemical compounds but are much less detailed than the bottom-up models.
An intriguing new development in systems biology is to involve microbiome measurements of the biological system (Franzosa et al., 2015). This has sparked many studies in different areas of medicine, such as inflammatory bowel disease (Huang et al., 2014) and cancer (Weir et al., 2013). It is also highly relevant for nutritional and food studies (Jacobs et al., 2009; Van Duynhoven et al., 2010; Moco et al., 2012). In all these cases, the microbiome data are combined with other omics data generating multiblock data analysis problems.
1.4.4 Chemistry
Multiblock data analysis problems arise in different parts of chemistry. A very active area is analytical chemistry, with two very prominent topics. The first one is multivariate curve resolution, where the general idea is to mathematically resolve chemical mixtures into the underlying pure chemical components and their concentration profiles (Tauler et al., 1995; de Juan and Tauler, 2006). Many different types of multiblock data analyses are performed in this area, with a special emphasis on applying domain-specific constraints. The second application area is calibration, where the purpose is to obtain concentrations from instrumental analysis methods. Multiblock data analysis methods are used in this area as well (Næs et al., 2013). A spectroscopy example is given in Example 1.3.
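As an illustration of the curve resolution idea, the sketch below assumes the common bilinear model X ≈ C Sᵀ, where the rows of X are measured mixture spectra, C holds concentration profiles, and S holds pure-component spectra. It alternates least-squares updates of C and S and enforces non-negativity by clipping, which is a deliberately crude stand-in for the domain-specific constraints mentioned above. The function name `mcr_als`, the simulated Gaussian spectra, and all sizes are assumptions made for this example, not the implementation used in the cited references.

```python
import numpy as np


def mcr_als(X, n_components, n_iter=200, seed=0):
    """Hypothetical helper: alternating least squares for X ~ C @ S.T
    with non-negativity imposed by clipping after each update."""
    rng = np.random.default_rng(seed)
    # Random non-negative starting guess for the pure spectra (J x K).
    S = rng.random((X.shape[1], n_components))
    for _ in range(n_iter):
        # Least-squares update of concentrations given S, then clip at zero.
        C = np.clip(X @ np.linalg.pinv(S.T), 0.0, None)
        # Least-squares update of spectra given C, then clip at zero.
        S = np.clip(X.T @ np.linalg.pinv(C.T), 0.0, None)
    return C, S


# Simulated data: 10 mixtures of 2 components measured at 50 channels.
channels = np.linspace(0.0, 1.0, 50)
pure_spectra = np.column_stack([
    np.exp(-((channels - 0.3) ** 2) / 0.005),
    np.exp(-((channels - 0.7) ** 2) / 0.005),
])
concentrations = np.random.default_rng(1).random((10, 2))
X = concentrations @ pure_spectra.T

C_hat, S_hat = mcr_als(X, n_components=2)
relative_residual = np.linalg.norm(X - C_hat @ S_hat.T) / np.linalg.norm(X)
print("relative residual:", relative_residual)
```

Note that the resolved profiles are in general not unique: components can come out in a different order and with a different scaling than the simulated ones, so C_hat and S_hat should not be compared column by column with the true profiles without first matching and normalising them.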
ELABORATION 1.6
Terms in chemistry
Multivariate curve resolution: Part of chemometrics that tries to mathematically resolve mixtures of chemicals into their individual compounds.
Multivariate calibration: Part of chemometrics that deals with predicting properties (e.g., concentrations) from spectroscopic measurement. The idea is to replace a slow, expensive measurement technique (the reference method) by a fast, cheaper, and often non-destructive one (a spectroscopic measurement).
Process chemometrics: Part of chemometrics devoted to processes; such as process analysis,