Informatics and Machine Learning. Stephen Winters-Hilt
1.1 Data Science: Statistics, Probability, Calculus … Python (or Perl) and Linux
Knowledge construction using statistical and computational methods is at the heart of data science and informatics. Counts on data features (or events) are typically gathered as a starting point in many analyses [101, 102], and computer hardware is very well suited to such counting tasks. Basic operating system commands and a popular scripting language (Python) will be taught so that these tasks can be done easily. Computer software methods will also be shown that allow easy implementation and understanding of basic statistical methods, whereby the counts, for example, can be used to determine event frequencies, from which statistical anomalies can subsequently be identified. The computational implementation of basic statistical methods then provides the framework for more sophisticated knowledge construction and discovery using information theory and basic ML methods. ML can be thought of as a specialized branch of statistics in which there is minimal assumption of a statistical “model” based on prior human learning. This book shows how to use computational, statistical, and informatics/algorithmic methods to analyze any data captured in digital form, whether it be text, sequential data in general (such as experimental observations over time, or stock market/econometric histories), symbolic data (genomes), or image data. Along the way there will be a brief introduction to probability and statistics concepts (Chapter 2) and basic Python/Linux system programming methods (Chapter 2 and Appendix A).
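As a minimal sketch of the count-to-frequency-to-anomaly workflow described above (using a toy symbol sequence and an assumed uniform background, both illustrative and not from the book):

```python
from collections import Counter

def event_frequencies(sequence):
    """Count events (here, single symbols) and convert counts to frequencies."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}

# toy genomic-style sequence over a 4-symbol alphabet
freqs = event_frequencies("AACGTTAGACCT")

# flag symbols whose frequency deviates from the uniform background (0.25)
# by more than an (arbitrary, illustrative) tolerance of 0.05
background = 0.25
anomalies = {e: f for e, f in freqs.items() if abs(f - background) > 0.05}
```

The same pattern scales to higher-order events (dinucleotides, word pairs, etc.) by counting tuples instead of single symbols.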
1.2 Informatics and Data Analytics
It is common to need to acquire a signal where the signal properties are not known, or the signal is only suspected and not discovered yet, or the signal properties are known but too much trouble to fully enumerate. There is no common solution, however, to the acquisition task. For this reason the initial phases of acquisition methods unavoidably tend to be ad hoc. As with data dependency in non‐evolutionary search metaheuristics (where there is no optimal search method that is guaranteed to always work well), here there is no optimal signal acquisition method known in advance. In what follows, methods are described for bootstrap optimization in signal acquisition to enable the most general‐use, almost “common,” solution possible. The bootstrap algorithmic method involves repeated passes over the data sequence, with improved priors and trained filters, among other things, to obtain improved signal acquisition on subsequent passes. The signal acquisition is guided by statistical measures to recognize anomalies. Informatics methods and information theory measures are central to the design of a good finite state automaton (FSA) acquisition method, and will be reviewed in the signal acquisition context in Chapters 2–4. Code examples are given in Python and C (with introductory Python described in Chapter 2 and Appendix A). Bootstrap acquisition methods may not automatically provide a common solution, but appear to offer a process whereby a solution can be improved to some desirable level of general‐data applicability.
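The repeated-pass bootstrap idea can be sketched as a loop in which each pass re-estimates the background from the unflagged samples, improving the acquisition threshold for the next pass. All names and the thresholding rule here are illustrative assumptions, not the book's actual method:

```python
def bootstrap_acquire(data, n_passes=3):
    """Repeatedly scan the data, updating the 'prior' (here, a simple
    threshold) from each pass so that the next pass acquires signal
    regions with a better-informed background estimate."""
    # initial flat prior: threshold at the global mean
    threshold = sum(data) / len(data)
    regions = []
    for _ in range(n_passes):
        # acquisition pass: flag samples above the current threshold
        regions = [i for i, x in enumerate(data) if x > threshold]
        if not regions:
            break
        # bootstrap update: re-estimate the background from unflagged
        # samples, then set a new threshold above that background level
        flagged = set(regions)
        background = [x for i, x in enumerate(data) if i not in flagged]
        if background:
            threshold = 1.5 * sum(background) / len(background)
    return regions, threshold

regions, threshold = bootstrap_acquire([1, 1, 1, 10, 1, 1, 9, 1])
```

Note that the first pass uses a crude global-mean threshold; later passes operate with a tighter background estimate, which is the essential bootstrap behavior.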
The signal analysis and pattern recognition methods described in this book are mainly applied to problems involving stochastic sequential data: power signals and genomic sequences in particular. The information modeling, feature selection/extraction, and feature‐vector discrimination, however, were each developed separately in a general‐use context. Details on the theoretical underpinnings are given in Chapter 3, including a collection of ab initio information theory tools to help “find your way around in the dark.” One of the main ab initio approaches is to search for statistical anomalies using information measures, so various information measures will be described in detail [103–115].
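Two of the standard information measures used for such anomaly searches are Shannon entropy and relative entropy (Kullback–Leibler divergence); a minimal implementation, assuming probabilities are supplied as plain lists:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits: how 'surprising' an observed
    distribution p is relative to a background model q — a standard
    anomaly measure (zero iff p equals q wherever p has support)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, a uniform distribution over four symbols has entropy 2 bits, and a skewed observed distribution scored against that uniform background yields a positive divergence that grows with the anomaly.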
The background on information theory and variational/statistical modeling has significant roots in variational calculus. Chapter 3 describes information theory ideas and the information “calculus” description (and related anomaly detection methods). The involvement of variational calculus methods and the possible parallels with the nascent development of a new (modern) “calculus of information” motivates the detailed overview of the highly successful physics development/applications of the calculus of variations (Appendix B). Using variational calculus, for example, it is possible to establish a link between a choice of information measure and statistical formalism (maximum entropy, Section 3.1). Taking the maximum entropy on a distribution with moment constraints leads to the classic distributions seen in mathematics and nature (the Gaussian for fixed mean and variance, etc.). Not surprisingly, variational methods also help to establish and refine some of the main ML methods, including Neural Nets (NNs) (Chapters 9, 13) and Support Vector Machines (SVM) (Chapter 10). SVMs are the main tool presented for both classification (supervised learning) and clustering (unsupervised learning), and everything in between (such as bag learning).
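The maximum entropy link mentioned above can be sketched as a variational calculation. Maximizing the entropy functional subject to normalization and fixed first and second moments, via Lagrange multipliers:

```latex
\max_{p}\; H[p] = -\int p(x)\ln p(x)\,dx
\quad \text{subject to} \quad
\int p\,dx = 1,\;\; \int x\,p\,dx = \mu,\;\; \int x^{2}p\,dx = \mu^{2}+\sigma^{2}.
```

Setting the functional derivative of the Lagrangian to zero,

```latex
\frac{\delta}{\delta p}\Big[-\!\int p\ln p\,dx
  + \lambda_{0}\!\int p\,dx + \lambda_{1}\!\int x\,p\,dx + \lambda_{2}\!\int x^{2}p\,dx\Big]
  = -\ln p(x) - 1 + \lambda_{0} + \lambda_{1}x + \lambda_{2}x^{2} = 0,
```

so that $p(x) = \exp(\lambda_{0}-1+\lambda_{1}x+\lambda_{2}x^{2})$, a Gaussian form (with $\lambda_{2}<0$). Fixing the multipliers by the moment constraints yields the classic result:

```latex
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right).
```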
1.3 FSA‐Based Signal Acquisition and Bioinformatics
Many signal features of interest are time limited and not band limited in the observational context of interest, such as noise “clicks,” “spikes,” or impulses. To acquire these signal features a time‐domain finite state automaton (tFSA) is often most appropriate [116–124]. Human hearing, for example, is a nonlinear system that thereby circumvents the restrictions of the Gabor limit (allowing for musical geniuses, for example, who have “perfect pitch”), where time‐frequency acuity surpasses what would be possible by linear signal processing alone [116], such as with Nyquist sampled linear response recording devices that are bound by the limits imposed by the Fourier uncertainty principle (or Benedick’s theorem) [117]. Thus, even when the powerful Fourier Transform or Hidden Markov Model (HMM) feature extraction methods are utilized to full advantage, there is often a sector of the signal analysis that is only conveniently accessible by way of FSAs (without significant oversampling), such that parallel processing with both HMM and FSA methods is often needed (results demonstrating this in the context of channel current analysis [1–3] will be described in Chapter 14). Not all of the methods employed at the FSA processing stage derive from standard signal processing approaches, either; some are purely statistical, such as oversampling [118] (used in radar range oversampling [119, 120]) and dithering [121] (used in device stabilization and to reduce quantization error [122, 123]).
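A time-domain FSA for acquiring such time-limited features can be as simple as a two-state machine with hysteresis. The sketch below is an illustrative assumption of how such a tFSA might look, not the book's implementation; parameter names (`on_level`, `off_level`, `min_len`) are hypothetical:

```python
def tfsa_spike_detect(samples, on_level, off_level, min_len=1):
    """Two-state time-domain FSA: stay in BASELINE until the signal
    reaches on_level, then stay in SPIKE until it falls below off_level
    (hysteresis avoids chatter near a single threshold). Emits
    (start, end) index pairs for each acquired spike of length >= min_len."""
    state = "BASELINE"
    start = None
    spikes = []
    for i, x in enumerate(samples):
        if state == "BASELINE" and x >= on_level:
            state, start = "SPIKE", i
        elif state == "SPIKE" and x < off_level:
            if i - start >= min_len:
                spikes.append((start, i))
            state = "BASELINE"
    # close out a spike still open at end-of-data
    if state == "SPIKE" and len(samples) - start >= min_len:
        spikes.append((start, len(samples)))
    return spikes

spikes = tfsa_spike_detect([0, 0, 5, 6, 0, 0, 7, 0], on_level=4, off_level=2)
```

The single forward scan over `samples` is what makes this O(L), the property exploited in the bootstrap acquisition discussion that follows.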
All of the tFSA signal acquisition methods described in Chapters 2–4 are O(L), i.e. they scan the data with a computational complexity no greater than that of simply reading the data (via a “read” or “touch” command; O(L) is “order of,” or “big‐oh,” notation). Because the signal acquisition is only O(L), it is not significantly costly, computationally, to simply repeat the acquisition analysis multiple times with a more informed process on each iteration, arriving at a “bootstrap” signal acquisition process. In such a setting, signal acquisition is often done with bias to very high