Informatics and Machine Learning. Stephen Winters-Hilt
The biophysics and “information flows” associated with the nanopore transduction detector (NTD) in Chapter 14 are analyzed using a generalized set of HMM‐ and SVM‐based tools, as well as ad hoc FSA‐based methods and a collection of distributed genetic algorithm methods for tuning and selection. Used with a nanopore detector, the channel current cheminformatics (CCC) for stationary‐signal channel blockades (with “stationary statistics”) enables highly sensitive single‐molecule biophysical analysis.
The SVM implementations described involve SVM algorithmic variants, kernel variants, and chunking variants, as well as SVM classification tuning metaheuristics and SVM clustering metaheuristics. The SVM tuning metaheuristics typically use the SVM’s confidence parameter to bootstrap from a strong classification engine to a strong clustering engine, via label changes and repeated SVM training with the new label information obtained.
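The bootstrap from classification to clustering can be sketched in miniature. The following is a minimal 1‐D illustration of the relabel‐and‐retrain loop, with a hypothetical mid‐margin threshold rule standing in for a real SVM (the function names and the alternating‐label initialization are illustrative assumptions, not the book’s implementation):

```python
def train_threshold(points, labels):
    """Stand-in for SVM training on 1-D data: if the current labels are
    linearly separable, place the threshold mid-margin (as a max-margin
    classifier would); otherwise fall back to the data mean."""
    pos = [x for x, y in zip(points, labels) if y == +1]
    neg = [x for x, y in zip(points, labels) if y == -1]
    if pos and neg and min(pos) > max(neg):
        return (min(pos) + max(neg)) / 2
    return sum(points) / len(points)

def svm_cluster(points, rounds=20):
    """Bootstrap clustering: seed arbitrary labels, train, relabel every
    point by the trained decision rule, and retrain until the labels
    stabilize -- the 'label changes + repeated training' loop."""
    labels = [+1 if i % 2 else -1 for i in range(len(points))]
    for _ in range(rounds):
        threshold = train_threshold(points, labels)
        new_labels = [+1 if x > threshold else -1 for x in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# two well-separated groups are recovered without any true labels
print(svm_cluster([0.1, 0.2, 0.15, 5.0, 5.2, 4.9]))
```

In a full implementation the stand‐in threshold rule would be an actual SVM, and low‐confidence labels (small margin values) would be the ones flipped between retraining passes.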
SVM Methods and Systems are given in Chapter 10 for classification, clustering, and SSA in general, with a broad range of applications:
sequential‐structure identification
pattern recognition
knowledge discovery
bioinformatics
nanopore detector cheminformatics
computational engineering with information flows
“SSA” Architectures favoring Deep Learning (see next section)
SVM binary discrimination outperforms other classification methods with or without dropping weak data (while many other methods cannot even identify weak data).
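“Dropping weak data” amounts to refusing to label samples whose decision value falls inside a rejection band around the decision boundary. A minimal sketch, assuming a linear decision function as a stand‐in for an SVM’s f(x) = w·x + b (the function name and margin value are illustrative):

```python
def classify_with_rejection(x, weight, bias, margin=1.0):
    """Binary decision with a rejection band: samples whose decision value
    lies inside the margin are flagged as 'weak' (returned as 0) rather
    than being forced into a class."""
    score = weight * x + bias
    if abs(score) < margin:
        return 0          # weak data: abstain / drop
    return 1 if score > 0 else -1

# the middle point scores near zero and is dropped as weak
print([classify_with_rejection(x, 1.0, -2.5) for x in [0.0, 2.4, 5.0]])
```

The same idea carries over directly to a trained SVM, where the magnitude of the decision value serves as the confidence measure.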
1.8 Search
All of the core methods described thus far (FSA, HMM, SVM) require some amount of parameter “tuning” for good performance. In essence, tuning is a search through the method’s parameter space for best performance (according to a variety of metrics). Tuning the acquisition parameters of an FSA, the choice of states in an HMM, or the SVM kernels and kernel parameters is often not terribly complicated, allowing a “brute‐force” search over a set of parameters and choosing the best from that set. On occasion, however, a more elaborate, fully automated search optimization is needed (or a search problem must be solved in general). For more complex search tasks it is good to know the modern search methodologies and what they are capable of, so these are described in Chapter 11.
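The brute‐force approach described above can be sketched as an exhaustive grid search. The scoring function below is a hypothetical stand‐in; in practice it would train the method (e.g. an SVM with the given kernel parameters) and return held‐out performance:

```python
from itertools import product

def grid_search(score_fn, grid):
    """Brute-force parameter search: evaluate every combination in the
    grid and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        s = score_fn(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

def toy_score(p):
    """Hypothetical stand-in for cross-validation accuracy, peaked at
    C = 10, gamma = 0.1."""
    return -(p["C"] - 10) ** 2 - (p["gamma"] - 0.1) ** 2

grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
print(grid_search(toy_score, grid))  # → ({'C': 10, 'gamma': 0.1}, 0.0)
```

When the grid grows too large to enumerate, the metaheuristic searches of Chapter 11 (simulated annealing, genetic algorithms) take over.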
1.9 Stochastic Sequential Analysis (SSA) Protocol (Deep Learning Without NNs)
The SSA protocol is shown in Figure 1.5 (from prior publications and patent work, see [1–3]): a general signal‐processing flow topology and database schema (Left Panel), with specialized variants for CCC (Center Panel) and for kinetic feature extraction based on blockade‐level duration observations (Right Panel). The SSA Protocol allows for the discovery, characterization, and classification of localizable, approximately stationary, statistical signal structures in channel current data, genomic data, or sequential data in general. The core signal‐processing stage in Figure 1.5 is usually the feature‐extraction stage, at the heart of which is a generalized HMM. The SSA Protocol also has a built‐in recovery protocol for weak signal handling, outlined next, where the HMM methods are complemented by the strengths of other ML methods.
Figure 1.5 (Left) The general stochastic sequential analysis flow topology. (Center) The general signal‐processing flow in performing channel current analysis is typically Input ➔ tFSA ➔ Meta‐HMMBD ➔ SVM ➔ Output. (Right) Notable differences occur in channel current cheminformatics during state discovery when EVA projection (emission variance amplification projection), or a similar method, is used to achieve a quantization on states, giving Input ➔ tFSA ➔ HMMBD/EVA (state discovery) ➔ meta‐HMMBD‐side ➔ SVM ➔ Output. In gene‐finding, by contrast, the flow is simply Input ➔ meta‐HMMBD‐side ➔ Output. In gene‐finding, however, the HMM’s internal “sensors” are sometimes replaced, locally, with profile‐HMMs [1, 3] (equivalent to position‐dependent Markov Models, or pMMs; see Chapter 7) or with SVM‐based profiling [1, 3], so the topology can differ not only in the connections between the boxes shown but in their ability to embed in other boxes as part of an internal refinement.
Source: Based on Winters‐Hilt [1, 3].
The sequence of algorithmic methods used in the SSA Protocol, for the information‐processing flow topology shown in Figure 1.5, comprises a weak signal handling protocol as follows: (i) the weakness of the (fast) finite state automaton (FSA) methods will be shown to be their difficulty with nonlocal structure identification, for which HMM methods (and tuning metaheuristics) are the solution; (ii) for the HMM, in turn, the main weakness is in local sensing “classification,” due to its conditional independence assumptions. Once in the setting of a classification problem, however, the problem can be solved via incorporation of generalized SVM methods [1, 3]. If facing only a classification task (data already preprocessed), the SVM will also be the method of choice in what follows. (iii) The weakness of the SVM, whether used for classification or clustering, but especially for the latter, is the need to optimize over algorithmic, model (kernel), chunking, and other process parameters during learning. This is solved (iv) via use of metaheuristics for optimization, such as simulated annealing and genetic algorithm optimization. The main weakness in the metaheuristic tuning effort is partly resolved via use of the “front‐end” methods, like the FSA, and partly resolved by a knowledge discovery process using the SVM clustering methods. The SSA Protocol’s weak signal acquisition and analysis methods thereby establish a robust signal‐processing platform.
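Of the metaheuristics named in step (iv), simulated annealing is the simplest to sketch. The following is a generic, illustrative implementation (the toy one‐parameter objective is a hypothetical stand‐in for a classifier’s validation score, not an example from the book):

```python
import math
import random

def simulated_annealing(score_fn, start, neighbor_fn, steps=2000,
                        t0=1.0, cooling=0.995, seed=1):
    """Generic simulated annealing (maximization): accept worse moves with
    probability exp(delta / T), cooling T geometrically, so the search can
    escape local optima early and settles into hill-climbing late."""
    rng = random.Random(seed)
    current, current_score = start, score_fn(start)
    best, best_score = current, current_score
    t = t0
    for _ in range(steps):
        cand = neighbor_fn(current, rng)
        cand_score = score_fn(cand)
        delta = cand_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling
    return best, best_score

def score(x):
    """Toy multimodal objective (global maximum of 2 at x = 0)."""
    return -(x ** 2) + 2 * math.cos(5 * x)

def step(x, rng):
    """Neighbor move: small random perturbation of the parameter."""
    return x + rng.uniform(-0.5, 0.5)

best_x, best_s = simulated_annealing(score, 3.0, step)
print(round(best_x, 2), round(best_s, 2))
```

Genetic algorithm optimization replaces the single annealed candidate with a population under selection, crossover, and mutation, but plays the same role of tuning parameters too entangled for brute‐force search.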
The HMM methods are the central methodology or stage in the SSA Protocol, particularly in the gene finders, and sometimes with the CCC protocol or implementation, in that the other stages can be dropped or merged with the HMM stage in many incarnations. For example, in some CCC analysis situations the tFSA methods could be totally eliminated in favor of the more accurate (but time consuming) HMM‐based approaches to the problem, with signal states defined or explored in more or less the same setting, but with the optimized Viterbi path solution taken as the basis for the signal acquisition.
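The “optimized Viterbi path solution” mentioned above is standard HMM decoding: the single most probable hidden‐state path given the observations. A self‐contained sketch, using a hypothetical two‐state blockade model (the state names and probabilities are illustrative, not measured values):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding: dynamic programming over path probabilities,
    with backpointers to recover the most probable state sequence."""
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (v[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            v[t][s] = prob
            back[t][s] = prev
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# hypothetical two-level blockade model: 'upper'/'lower' current states
states = ("upper", "lower")
start = {"upper": 0.5, "lower": 0.5}
trans = {"upper": {"upper": 0.9, "lower": 0.1},
         "lower": {"upper": 0.1, "lower": 0.9}}
emit = {"upper": {"hi": 0.8, "lo": 0.2},
        "lower": {"hi": 0.2, "lo": 0.8}}
print(viterbi(["hi", "hi", "lo", "lo"], states, start, trans, emit))
# → ['upper', 'upper', 'lower', 'lower']
```

Taking this path as the basis for signal acquisition trades the speed of the tFSA front end for the HMM’s more accurate, globally optimized state assignment.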
The HMM features, and other features (from NN, wavelet, or spike profiling,