Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
Another area in chemistry which is populated with multiblock data analysis problems is process chemometrics (MacGregor et al., 1994; Wise and Gallagher, 1996; Kourti et al., 1995; Lopes et al., 2002). The general problem is how to combine multiple chemical process measurements for process understanding and statistical process monitoring.
Example 1.3: Chemistry example: Raman spectroscopy data
The data set was first published in a study containing both Raman and near infrared spectroscopy measurements of emulsions (Afseth et al., 2005). For the Raman data, 1096 Raman shifts, from 1770 cm⁻¹ to 675 cm⁻¹, were recorded for 69 emulsions containing a mixture of proteins, water, and fats (see Figure 1.6). Two reference values are used as responses: polyunsaturated fatty acids (PUFA) as percentage of total sample weight (0.3–11.5%) and as percentage of fats in sample (2.2–61.6%). The reference values have a correlation of R = 0.73, i.e. R² = 0.54, meaning that around half of the variation in PUFA content is due to the variation in total fat content. The aim of the original study was to be able to quantify the PUFA percentages using only spectroscopy to enable quick, cheap, and non-destructive measurements.
Figure 1.6 Plot of the Raman spectra used in predicting the fat content. The dashed lines show the split of the data set into multiple blocks.
In this book, we will concentrate on the Raman block, as it dominated completely in a previous multiblock data analysis study (Liland et al., 2016), and instead split it into suitable wavelength regions, here at 1350 cm⁻¹ and 1100 cm⁻¹. This is done to explore the predictive power of the different wavelength regions. This data set will be analysed using several of the supervised methods in this book to see what each of them emphasises. In general, we see that the predictive models mostly leverage the variables corresponding to molecular vibrations associated with lipids and degrees of saturation, and that these models can reproduce the reference values with high precision.
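The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shift axis is taken to be evenly spaced between 1770 cm⁻¹ and 675 cm⁻¹, the spectra are random placeholders rather than the real measurements, the helper name `split_blocks` is hypothetical, and assigning the boundary shifts to the higher-shift block is an arbitrary choice.

```python
import numpy as np

# Assumption: 1096 evenly spaced Raman shifts from 1770 down to 675 cm^-1.
shifts = np.linspace(1770.0, 675.0, 1096)

# Placeholder spectra (random numbers, NOT the real data): 69 samples x 1096 shifts.
rng = np.random.default_rng(0)
X = rng.normal(size=(69, shifts.size))


def split_blocks(X, shifts, cuts=(1350.0, 1100.0)):
    """Split the columns of X into three blocks by Raman shift.

    A shift equal to a cut point is placed in the higher-shift block
    (an arbitrary convention chosen for this sketch).
    """
    upper, lower = cuts
    masks = {
        "high": shifts >= upper,
        "mid": (shifts < upper) & (shifts >= lower),
        "low": shifts < lower,
    }
    return {name: X[:, mask] for name, mask in masks.items()}


blocks = split_blocks(X, shifts)
for name, block in blocks.items():
    print(name, block.shape)
```

Each block keeps all 69 samples as rows, so the blocks share the sample mode and can be passed together to a multiblock method.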
1.4.5 Sensory Science
Sensory and consumer science is an important discipline in the assessment of food quality. It consists of a large number of measurement methods for determining the descriptive properties of products as well as the consumer liking of the same products (Lawless and Heymann, 2010). Often a product will be characterised by a number of different data types, ranging from classical descriptive sensory analysis using predefined attributes and a trained sensory panel to consumer-based characterisation using, for instance, the check-all-that-apply (CATA) method (Varela and Ares, 2012). The data sets will generally consist of a substantial number of attributes and a relatively moderate number of samples. Of special interest is estimating relations between data blocks related to liking and product characterisation. A large number of methods have been developed for this purpose, as will be discussed in Chapters 7, 8 and 10 in this book (see Næs et al. (2010) for an overview). An example of a typical data structure and its related questions in sensory science is given in Example 1.4.
ELABORATION 1.7
Terms in sensory analysis
Consumer liking: For hedonic sensory methods, a consumer panel is used. The consumers can be asked how much they like the different products and how willing they are to buy the products tested.
Sensory panel: For assessing product quality, it is common to use a sensory panel consisting of a number of trained assessors who assess the intensity of a number of relevant sensory attributes on a predefined scale.
Sensory attribute: The measurements performed by the sensory panel, such as sweetness, hardness, and acidity (depending on the type of product).
Rapid sensory methods: There exist a number of so-called rapid sensory methods, for instance, projective mapping, sorting, and CATA. For the latter, all participants are asked to tick, for each product, the relevant attributes on a predefined list. This gives a table of 0s and 1s for each participant.
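As a small illustration of the CATA table of 0s and 1s, the sketch below builds one binary products-by-attributes table per participant. The product labels, the ticks, and the helper name `cata_table` are hypothetical; summing the tables into tick counts is one common way of aggregating such data and is shown here as an assumption, not as the method used in this book.

```python
import numpy as np

products = ["P1", "P2", "P3"]             # hypothetical product labels
attributes = ["sweet", "hard", "acidic"]  # hypothetical attribute list

# Hypothetical ticks: for each participant, product -> set of ticked attributes.
responses = [
    {"P1": {"sweet"}, "P2": {"hard", "acidic"}, "P3": set()},
    {"P1": {"sweet", "acidic"}, "P2": {"hard"}, "P3": {"sweet"}},
]


def cata_table(ticks):
    """Return a products x attributes table of 0s and 1s for one participant."""
    table = np.zeros((len(products), len(attributes)), dtype=int)
    for i, product in enumerate(products):
        for j, attribute in enumerate(attributes):
            table[i, j] = int(attribute in ticks.get(product, set()))
    return table


# Stack into a participants x products x attributes array.
tables = np.stack([cata_table(r) for r in responses])

# Assumed aggregation: number of participants ticking each attribute per product.
counts = tables.sum(axis=0)
print(counts)
```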
Example 1.4: Sensory example: consumer liking
A typical multiblock data structure that occurs in consumer science is depicted in Figure 1.7. The context is typically product development, where the interest is in understanding the relations between descriptive information on a number of prototype samples and the consumer liking of the same samples. In addition, there is interest in interpreting the liking patterns in terms of consumer characteristics, to better understand which consumer groups prefer which products (see e.g., Næs et al. (2018)). Based on this type of information, the product developer can more easily design products that better fit the consumer needs and liking patterns. As can be seen from the figure, both chemical attributes and sensory properties/attributes, obtained by a trained sensory panel, can be of interest for describing the products. A number of different liking scores can also be of interest, for instance related to taste and texture (Menichelli et al., 2013), as depicted by the stack of data blocks for liking. Analysing this so-called L-shape data structure sheds light on, for instance, which sensory attributes drive liking, which samples are the most liked, what characterises these samples, and what characterises the different consumer groups with different preference patterns.
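The L-shape structure can be made concrete by writing down the three kinds of blocks and the modes they share. The sketch below is illustrative only: all dimensions are hypothetical, the values are random placeholders, the 1 to 9 liking scale is an assumption, and the orientation of the blocks may differ from the layout in Figure 1.7.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: products, sensory attributes, consumers, consumer traits.
n_products, n_sensory, n_consumers, n_traits = 8, 12, 100, 5

# Products x sensory attributes (descriptive information from a trained panel).
X_sensory = rng.normal(size=(n_products, n_sensory))
# Products x consumers (liking scores; integer 1..9 scale assumed).
Y_liking = rng.integers(1, 10, size=(n_products, n_consumers))
# Consumers x traits (consumer characteristics).
Z_consumer = rng.normal(size=(n_consumers, n_traits))

# The "L" arises because the liking block shares its product mode with the
# descriptive block and its consumer mode with the consumer block.
assert X_sensory.shape[0] == Y_liking.shape[0]
assert Y_liking.shape[1] == Z_consumer.shape[0]

# A first look at "which samples are the most liked": mean liking per product.
mean_liking = Y_liking.mean(axis=1)
most_liked = int(np.argmax(mean_liking))
print("most liked product index:", most_liked)
```

Several liking blocks (for instance taste and texture) would simply be additional arrays with the same shape as `Y_liking`.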
1.5 Goals of Analyses
Many goals of multiblock data analysis can be envisaged. In current practice, these goals are usually implicit. By making these goals explicit it will become necessary to also make explicit the global optimisation criterion or, when such a criterion is difficult to formulate, to carefully think about the whole data analysis procedure and which method to choose. Several general goals will be discussed briefly.
Exploratory analysis: One of the most obvious goals of multiblock data analysis is exploration, which is a part of unsupervised analysis. By plotting the weights, scores, and loadings, summaries of the data are obtained which can be interpreted and possibly analysed further using visualisation tools.
Predictive models: Another obvious goal is to try to predict the variation in one data block using several other data blocks; this is a part of supervised analysis. The idea is then that using multiple predictive blocks gives a better prediction for future samples.
Finding topologies: In the case of complex data, data blocks can be placed in different relationships. The arrangement of blocks as dependent or independent may be a purpose of the analysis. We call such an arrangement a topology. In that case, it would be useful to have a strategy for deciding on the topology that fits the data best.
Common versus distinct variation: There can be common and distinct variation in the multiple data blocks (see Section 1.8). This separation into types of variation greatly simplifies subsequent interpretation of the results.
Treatment effects: The effect of a treatment can be measured in different blocks of data. The interest is usually in what the main effect of a treatment is on the measurements in the different blocks of data.
Individual differences: Apart from group differences, individual differences are also of interest. This can be for personalised medicine, nutritional interventions, or consumer behaviour. Multiblock data analysis may help to find such differences and thereby facilitate population stratification and sub-typing.
Mixed goals: In real-life applications, a mixture of goals is usually present. It may be that a treatment has been given which expresses itself differently in the common and distinct variation. Moreover, interest may be in the main effects of treatments but also in individual differences in treatment effects.
1.6