Multiblock Data Fusion in Statistics and Machine Learning. Tormod Næs
1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.
A first impression of the variation in metabolite levels can be obtained by performing a principal component analysis (PCA) on the data (see Figure 1.4(a)), where we have concatenated all four blocks (21-D, 21-LL, 21-L and 21-HL) below each other. The colour coding follows the light conditions, and the figure shows that there is systematic variation in the data associated with the factor light. A more advanced analysis uses a multiblock data analysis method that takes the underlying experimental design into account, such as ANOVA-simultaneous component analysis (ASCA, see Chapter 6). Figure 1.4(b) shows the scores on the first ASCA interaction component, which clearly show a time-dependent contrast between dark and high light conditions. The original data set also comprises gene-expression measurements, which makes the problem even more challenging.
Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.
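The stacking-then-PCA step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the block sizes, the simulated values and all variable names are hypothetical stand-ins, not the plant data, and PCA is computed here through a plain singular value decomposition of the column-centred matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not the real study): 10 samples per light
# condition and 50 metabolites shared by all four blocks.
conditions = ["D", "LL", "L", "HL"]
n_samples, n_metabolites = 10, 50

# Simulated stand-in blocks; each condition gets its own mean shift so
# that the synthetic data contain a light effect.
blocks = {
    name: rng.normal(loc=shift, scale=1.0, size=(n_samples, n_metabolites))
    for shift, name in enumerate(conditions)
}

# Concatenate the blocks below each other: rows are samples,
# columns are the shared metabolites.
X = np.vstack([blocks[name] for name in conditions])
labels = np.repeat(conditions, n_samples)

# Column-centre the stacked matrix and compute PCA via the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]            # scores on the first two components
loadings = Vt[:2].T                  # corresponding loadings
explained = s**2 / np.sum(s**2)      # fraction of variance per component

# Average first-component score per light condition.
for name in conditions:
    print(name, f"{scores[labels == name, 0].mean():.2f}")
```

Colouring a scatter plot of `scores` by `labels` would give a picture of the same kind as Figure 1.4(a); the design-aware ASCA analysis is a different method and is not attempted here.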
1.4.2 Genomics
Genomics covers the area of life science research related to the genomes of biological organisms. In a broad sense, it may cover many aspects of the genome, such as the transcriptome, genetics and epi-genetics. In a narrower sense, it may encompass only the measurement of the genomic transcripts, e.g., using RNAseq (see Elaboration 1.4 for an explanation of terms).
ELABORATION 1.4
Terms in genomics
CNA: Many biological organisms have several copies of the same gene (see Figure 1.5). Copy number aberration (CNA) quantifies this (see Example 1.2).
Epi-genetics: DNA can be modified chemically thereby regulating expression of the corresponding genes. This chemical modification of the DNA is called epi-genetics (see Figure 1.5).
Genetics: Biological organisms have DNA encoding their genetic make-up. Genetics studies this DNA.
Methylation: A methyl group can be attached to the DNA, affecting transcription. This is a part of epi-genetics (see Figure 1.5).
Mutation: DNA consists of four types of nucleotides (A, T, G, and C) containing the genetic code. Some of these nucleotides may be mutated, e.g., change from A to T. If this happens for a single nucleotide then this is called a single nucleotide polymorphism (SNP, see Figure 1.5).
RNAseq: The modern way of measuring gene-expression or the RNA of a biological organism. There are many types of RNA of which messenger-RNA (mRNA) is the most studied one.
Transcriptomics: Genes are transcribed to RNA and transcriptomics concerns the analysis of these transcripts.
Figure 1.5 Idea of copy number variation (a), methylation (b), and mutation (c) of the DNA. For (a) and (c): Source: Adapted from Koch et al., 2012.
Genomics is a very active field with many multiblock data analysis challenges due to the rapid development of measuring techniques. Whereas gene-expression was formerly measured with micro-arrays, this technology has been overtaken by next generation sequencing (mRNAseq, miRNAseq, siRNAseq and scRNAseq, to name a few). This has led to open-access repositories containing genomics data of very different types, e.g., in cancer research (Tomczak et al., 2015), which are often the basis for generating new multiblock data analysis methods (Aben et al., 2016, 2018; Song et al., 2018). Other examples combine genomics data with data from non-omics techniques such as medical imaging, e.g., for treatment response predictions.
Example 1.2: Genetics example
In cancer research, cell-lines derived from tumour tissue are often used (Iorio et al., 2016). Many measurements on these cell-lines are available in public databases. Such measurements may consist of measured RNA-levels (ratio-scaled values), but also of measurements related to mutations (so-called single nucleotide polymorphisms or SNPs), which are on/off measurements and thus intrinsically binary.
One of the possible genetic determinants is the copy number of a gene (see Figure 1.5(a)); such a gene may be duplicated. An extra layer of gene-regulation is provided by methylation of certain nucleotides of the genome (see Figure 1.5(b)). If a nucleotide is methylated, then transcription of the corresponding gene cannot occur; this area of genetics is called epi-genetics. There are different ways of expressing methylation, but the simplest is a yes or no indicating whether a specific site is methylated. At a certain position on the genome, one nucleotide may have been changed (see Figure 1.5(c)). This is obviously binary, since there is either a SNP or no SNP at a given position on the genome. Hence, treating such data in a multiblock fusion setting requires specialised methods, see Chapter 5.
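To make the mix of measurement scales concrete, the following sketch holds three blocks measured on the same cell-lines and labels each block as binary or quantitative. All of it is an assumption made for illustration: the block names, the sizes, the simulated values and the helper `is_binary` are hypothetical, and the check is not one of the specialised fusion methods of Chapter 5.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cell_lines = 8  # hypothetical number of cell-lines (shared rows)

# Simulated stand-in blocks: RNA-levels are positive, ratio-scaled values;
# SNP and methylation are coded 1 (present) or 0 (absent).
blocks = {
    "rna": rng.lognormal(mean=2.0, sigma=0.5, size=(n_cell_lines, 6)),
    "snp": rng.integers(0, 2, size=(n_cell_lines, 5)),
    "methylation": rng.integers(0, 2, size=(n_cell_lines, 4)),
}


def is_binary(block):
    """Return True if every entry of the block is 0 or 1."""
    return bool(np.isin(block, (0, 1)).all())


# Record the measurement scale of each block before choosing a method.
scales = {
    name: ("binary" if is_binary(block) else "quantitative")
    for name, block in blocks.items()
}

for name, block in blocks.items():
    print(f"{name}: shape={block.shape}, scale={scales[name]}")
```

A check of this kind only flags which blocks are binary; how binary and quantitative blocks are then fused is a separate modelling decision.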
1.4.3 Systems Biology
Taking it one step further in terms of omics measurements, we enter the area of systems biology. The general idea of systems biology is to describe biological systems as a network of interacting biochemical compounds. Often, the interactions in such networks show emerging behaviour which cannot be understood from studying single biochemical compounds (Bruggeman and Westerhoff, 2007).
There are basically two approaches to systems biology: top-down and bottom-up (Shahzad and Loor, 2012). In bottom-up approaches, fundamental models are made of parts of biochemical systems and, subsequently, parameters in those models are fitted to data. In top-down systems biology, many types of omics data are collected and these are combined into one holistic analysis. The latter goes under different names: intra- and inter-omics analysis, cross-omics analysis, statistical integration and statistical data fusion, to name a few (Tayrac et al., 2009; Richards et al., 2010; Richards and Holmes, 2014). In all these top-down applications, multiblock data analysis is important. See also Elaboration 1.5 for more explanation.
ELABORATION 1.5
Terms in systems biology
Biological networks: In biological organisms, biochemical compounds act together in networks of activity. An example is a metabolic network describing all the conversions taking place in the metabolism of a cell.
Bottom-up: Approach in which detailed biochemical knowledge of a biological system is used to build mathematical models of that system (e.g., in terms of sets of differential equations). Such models are necessarily limited in size; they describe only a small part of the system.
Emerging property: Property of a system which cannot be understood from its single actors. Temperature is an example of an emerging property of a system containing a large number of molecules that interact.
Microbiome: The whole set of micro-organisms in and around a biological host. The gut-microbiome is the most famous example; essential for humans to metabolise food.
Top-down: Approach in which many measurements are performed on the same biological system and empirical modelling is subsequently used to model that system. These models usually contain many biochemical compounds but are much less detailed than the bottom-up models.
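As a toy illustration of the bottom-up idea of fitting parameters in a mechanistic model, consider a single hypothetical compound that is consumed by first-order kinetics, dS/dt = -kS, with solution S(t) = S0 exp(-kt). The sketch below simulates noisy measurements and recovers k and S0 by a straight-line fit to the log-concentrations. The model, the parameter values and the noise level are assumptions chosen for the example; real bottom-up models involve sets of coupled differential equations and more careful estimation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed "true" parameters of the toy model dS/dt = -k * S.
k_true, s0_true = 0.3, 10.0

# Simulated measurements at 21 time points with small multiplicative noise.
t = np.linspace(0.0, 10.0, 21)
noise = np.exp(rng.normal(0.0, 0.02, t.size))
s_obs = s0_true * np.exp(-k_true * t) * noise

# log S(t) = log S0 - k * t, so a degree-1 polynomial fit on the
# log-concentrations gives -k as slope and log S0 as intercept.
slope, intercept = np.polyfit(t, np.log(s_obs), deg=1)
k_hat, s0_hat = -slope, np.exp(intercept)

print(f"estimated k = {k_hat:.3f}, estimated S0 = {s0_hat:.2f}")
```

The contrast with the top-down route is that here the model structure comes from biochemical knowledge and only a few parameters are estimated, whereas top-down analyses model many measured compounds empirically.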
An intriguing new development in systems biology is the inclusion of microbiome measurements of the biological system (Franzosa et al., 2015). This has sparked many studies in different areas of medicine, such as inflammatory bowel disease (Huang et al., 2014) and cancer (Weir et al., 2013). It is also highly relevant for nutritional and food studies (Jacobs et al., 2009; Van Duynhoven et al., 2010; Moco et al., 2012). In all these cases, the microbiome data are combined with other omics data, generating multiblock data analysis problems.