Informatics and Machine Learning. Stephen Winters-Hilt
Figure 1.3 Chunking on a dynamic table. Works for a HMM using a simple join recovery.
1.5.1 HMMs for Analysis of Information Encoding Molecules
The main application areas for HMMs covered in this book are power signal analysis generally, and bioinformatics and cheminformatics specifically (the main reviews and applications discussed are from [128–134]). In bioinformatics, the information‐encoding molecules are polymers, which gives rise to sequential data, so HMMs are well suited to their analysis. To begin to understand bioinformatics, however, we need to know not only the biological encoding rules, largely rediscovered on the basis of their statistical anomalies in Chapters 1–4, but also the idiosyncratic structures seen (genomes and transcriptomes), which are full of evolutionary artifacts and similarities to evolutionary cousins. Understanding the statistical imprinting on the polymeric encodings also requires an understanding of the biochemical constraints that give rise to the statistical biases seen. Taken all together, bioinformatics offers a lot of clarity on why Nature has settled on the particular genomic “mess,” albeit with optimizations, that it has selectively arrived at. See [1, 3] for further discussion of bioinformatics.
1.5.2 HMMs for Cheminformatics and Generic Signal Analysis
The prospect of having HMM feature extraction in the streaming signal processing pipeline (at O(L) cost for data of size L) offers powerful real‐time feature extraction capabilities and specialized filtering (all of which is implemented in the Nanoscope, Chapter 14). One such processing method, described in Chapter 6, is HMM/Expectation Maximization (EM) EVA (Emission Variance Amplification) Projection, which has application in providing simplified, automated tFSA Kinetic Feature Extraction from channel current signal. What is needed is the equivalent of low‐pass filtering on blockade levels while retaining sharpness on the timing of the level changes. This is not possible with a standard low‐pass filter, because the edges get blurred in the local filtering process; the HMM‐based filter does not blur them, as the data in Figure 1.4 show.
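The contrast between local averaging and HMM‐based level decoding can be illustrated with a minimal sketch. It is not the HMM/EM EVA projection of Chapter 6: it swaps in plain Viterbi decoding over two blockade levels that are assumed known. The helper names (make_step_signal, moving_average, viterbi_levels), the toy signal, and all parameter values are hypothetical choices made for illustration only.

```python
import math

def make_step_signal(n_low=50, n_high=50, low=0.0, high=1.0, amp=0.3):
    """Toy two-level signal with a deterministic, bounded perturbation (a stand-in for noise)."""
    levels = [low] * n_low + [high] * n_high
    return [mu + amp * math.sin(12.9898 * i) for i, mu in enumerate(levels)]

def moving_average(signal, width):
    """Plain low-pass filter: mean over a centered window, clipped at the ends."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def viterbi_levels(signal, levels, sigma, stay_prob):
    """Most probable level sequence under a Gaussian-emission HMM (log domain).

    Assumes at least two levels, a uniform prior on the first state, and a
    single stay_prob shared by all states (illustrative assumptions).
    """
    n = len(levels)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    def log_emit(x, mu):
        return -0.5 * ((x - mu) / sigma) ** 2  # normalizing constant dropped

    score = [log_emit(signal[0], mu) for mu in levels]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for x in signal[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j) + log_emit(x, levels[j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [levels[s] for s in path]

signal = make_step_signal()
smoothed = moving_average(signal, width=11)
decoded = viterbi_levels(signal, levels=[0.0, 1.0], sigma=0.2, stay_prob=0.99)

edge = next(i for i, v in enumerate(decoded) if v == 1.0)
print("decoded edge index:", edge)
print("moving average around the edge:", [round(v, 2) for v in smoothed[47:54]])
```

In this toy setting the moving average ramps gradually across the level change, while the decoded sequence switches levels at a single sample. The decoding cost is linear in the signal length for a fixed number of levels, in line with the O(L) remark above.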
The HMM is a common intrinsic statistical sequence modeling method (implementations and applications are mainly drawn from [135–158] in what follows), so the question naturally arises: how can extrinsic “side‐information” be optimally incorporated into a HMM? This can be done by treating duration distribution information itself as side‐information, and a process is shown for incorporating side‐information into a HMM. It is thereby demonstrated how to bootstrap from a HMM to a HMMD (more generally, a hidden semi‐Markov model or HSMM, as it will be described in Chapter 7).
In many applications, the ability to incorporate the state duration into the HMM is very important, because the conventional HMM‐based Viterbi and Baum‐Welch algorithms are otherwise restricted to modeling geometric distributions on state intervals (this is shown in Chapter 7). This can lead to significant decoding failure in noisy environments when the state‐interval distributions are not geometric (or approximately geometric). The starkest contrast occurs for multimodal distributions and heavy‐tailed distributions, the latter occurring for exon and intron length distributions (and thus critical in gene finders). The hidden Markov model with binned duration (HMMBD) algorithm eliminates the HMM geometric distribution modeling constraint, as well as the HMMD maximum duration constraint, and significantly reduces computational time, so that all HMMBD‐based methods run in approximately the computational time of the HMM process alone.
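The geometric constraint follows from the self‐transition structure: if a state returns to itself with probability a at each step, the chance of staying exactly d steps is a^(d-1)(1 - a), which always peaks at d = 1. A minimal sketch follows, with hypothetical function names and a made‐up bimodal duration sample, showing that even the best‐fitting self‐transition probability cannot reproduce a distribution with modes away from d = 1.

```python
def geometric_duration_pmf(self_prob, max_duration):
    """P(duration = d) for d = 1..max_duration, implied by a self-transition probability."""
    return [self_prob ** (d - 1) * (1.0 - self_prob) for d in range(1, max_duration + 1)]

def fit_self_prob(durations):
    """Maximum-likelihood self-transition probability for a geometric fit: 1 - 1/mean."""
    mean = sum(durations) / len(durations)
    return 1.0 - 1.0 / mean

# Hypothetical bimodal sample: short runs near 2-3 and long runs near 20-22.
observed = [2, 2, 3, 3, 20, 21, 21, 22]

a = fit_self_prob(observed)
pmf = geometric_duration_pmf(a, 30)

print("fitted self-transition probability:", round(a, 3))
print("most probable duration under the fit:", pmf.index(max(pmf)) + 1)
```

The fitted model matches the sample mean but places its highest probability on a duration of one step, which no run in the sample has. This is the kind of mismatch that explicit duration modeling is meant to remove.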
Figure 1.4 Edge feature enhancement via HMM/EM EVA filter. The filter “projects” via a Gaussian parameterization on emissions with variance boosted by the factor indicated.
Source: Based on prior publications by the author, Winters‐Hilt [1–3].
In adopting any model with “more parameters,” such as a HMMBD over a HMM, there is potentially a problem with having sufficient data to support the additional modeling. This is generally not a problem for any HMM that already requires thousands of samples of non‐self transitions for sensor modeling, such as the gene finding described in what follows. Knowing the boundary positions allows the regions of self‐transitions (the durations) to be extracted in similar numbers, which is typically sufficient for effective modeling of the duration distributions in a HMMD.
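The point that boundary positions yield duration samples in similar numbers can be sketched as a run‐length extraction over a labeled state sequence. The label alphabet used here (e for exon, i for intron, j for junk) and the function names are hypothetical, and the short label string is made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import groupby

def durations_by_state(labels):
    """Collect the length of every maximal run of identical labels, grouped by state."""
    runs = defaultdict(list)
    for state, run in groupby(labels):
        runs[state].append(sum(1 for _ in run))
    return dict(runs)

def empirical_pmf(durations):
    """Relative frequency of each observed duration."""
    counts = Counter(durations)
    total = len(durations)
    return {d: c / total for d, c in sorted(counts.items())}

# Hypothetical annotation: e = exon, i = intron, j = junk.
labels = "jjjeeeeiijjeee"

runs = durations_by_state(labels)
print(runs)
print(empirical_pmf(runs["e"]))
```

Each boundary between regions closes one run and opens another, so a sequence with N non‐self transitions yields N + 1 runs. In practice the first and last runs may be cut short by the ends of the annotated sequence and may need separate handling.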
Improvement to overall HMM application rests not only with the aforementioned improvements to the HMM/HMMBD, but also with improvements to the hidden state model and emission model. This is because standard HMMs are at low Markov order in transitions (first) and in emissions (zeroth), and transitions are decoupled from emissions (which can miss critical structure in the model, such as state transition probabilities that are sequence dependent). This weakness is eliminated if we generalize to the largest state‐emission clique possible, fully interpolated on the data set, as is done with the generalized‐clique HMM, where gene finding is performed on the Caenorhabditis elegans genome. The clique generalization improves the modeling of the critical signal information at the transitions between exon regions and noncoding regions, e.g. intron and junk regions. In doing this we arrive at a HMM structure identification platform that is novel, and robustly performing, in a number of ways.
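For reference, the low Markov order of the standard HMM can be read directly from the textbook factorization of the joint probability of a state sequence and an observation sequence. The notation (states s_t, observations o_t, sequence length T) is an assumption introduced here and is not taken from the text.

```latex
P(o_{1:T}, s_{1:T}) = P(s_1)\, P(o_1 \mid s_1) \prod_{t=2}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)
```

Each transition factor conditions only on the previous state (first order), each emission factor conditions only on the current state and on no earlier emissions (zeroth order), and no factor lets a transition probability depend on the observed sequence. The clique generalization described above relaxes these restrictions.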
Prior HMM‐based systems for SSA had undesirable limitations and disadvantages. For example, the speed of operation made such systems difficult, if not impossible, to use for real‐time analysis of information. In the SSA Protocol described here, distributed generalized HMM processing together with the use of the SVM‐based Classification and Clustering Methods (described next) permit the general use of the SSA Protocol free of the usual limitations. After the HMM and SSA methods are described, their synergistic union is used to convey a new approach to signal analysis with HMM methods, including a new form of stochastic‐carrier wave (SCW) communication.
1.6 Theoretical Foundations for Learning
Before moving on to classification and clustering (Chapter 10), a brief description is given of some of the theoretical foundations for learning, starting with the foundation for the choice of information measures used in Chapters 2–4, which is shown in Chapter 8. In Chapter 9 we then describe the theory of NNs. The Chapter 9 background is not meant to be a complete exposition on NN learning (quite the opposite); it merely goes through a few specific analyses in the area of Loss Bounds analysis to give a sense of what makes a good classification method.
1.7 Classification and Clustering
SVMs