the gHMM is used for feature extraction on stochastic sequential data, while classification and clustering analysis are implemented using a SVM. In addition, the ML‐based algorithms are designed to scale to large datasets via real‐time distributed processing, and are adaptable to analysis of any stochastic sequential dataset. The ML software has also been integrated into the NTD Nanoscope [2] for “real‐time” pattern‐recognition informed (PRI) feedback [1–3] (see Chapter 14 for results). The methods used to implement the PRI feedback include distributed HMM and SVM implementations, which provide the necessary processing speedup.
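
      A rough, purely illustrative sketch of this two‐stage pipeline shape is given below, using a hypothetical extract_hmm_features stand‐in for the gHMM stage, synthetic signals, and scikit‐learn's SVC for the SVM stage; this is not the NTD software itself:

import numpy as np
from sklearn.svm import SVC

def extract_hmm_features(signal):
    # Hypothetical stand-in for the gHMM feature-extraction stage: the real
    # system derives features from HMM state/emission statistics; here we
    # return simple summary statistics so the sketch is self-contained.
    return np.array([signal.mean(), signal.std(), np.median(signal)])

rng = np.random.default_rng(0)
signals = [rng.normal(loc=c, scale=1.0, size=500) for c in (0.0, 5.0) for _ in range(20)]
labels = [0] * 20 + [1] * 20

X = np.vstack([extract_hmm_features(s) for s in signals])
clf = SVC(kernel="rbf").fit(X, labels)   # SVM stage: classification on the extracted features
print(clf.score(X, labels))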

      1.9.2 Nanoscope Cheminformatics – A Case Study for Device “Smartening”

      ML provides a solution to the “Big Data” problem, whereby a vast amount of data is distilled down to its information essence. The ML solution sought is usually required to perform some task on the raw data, such as classification (of images) or translation of text from one language to another. In doing so, ML solutions that also clearly reveal the features used in the classification are strongly favored. This then opens up a more standard engineering design cycle, in which the stronger features thereby identified can be given a stronger role, or can guide the refinement of related strong features, to arrive at an improved classifier. This is what is accomplished with the previously mentioned SSA Protocol.

      So, given the flexibility of the SSA Protocol to “latch on” to signal that has a reasonable set of features, you might ask what is left? (Note that all communication protocols, both natural (genomic) and man‐made, have a “reasonable” set of features.) The answer is simply the situations where the number of features is “unreasonable” (with the enumeration typically not even known). So instead of 100 features, or maybe 1000, we now have a situation with 100 000 to hundreds of millions of features (such as with sentence translation or complex image classification). Obviously Big Data is necessary to learn with such a huge number of features present, so we are truly in the realm of Big Data to even begin with such problems, but we now also have the Big Features issue (i.e. Big Data with Big Features, or BDwBF). What must occur in such problems is a means to wrangle the almost intractably large feature set down to a much smaller one, i.e. an initial layer of processing is needed just to compress the feature data. In essence, we need a form of compressive feature extraction at the outset in order to not overwhelm the acquisition process. An example from biology is the human eye, where a layer of local neural processing at the retina occurs before the nerve impulses even travel on to the brain for further layers of neural processing.
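
      As a purely illustrative sketch of such a compressive first layer (not the SSA Protocol, and with the dimensions scaled down so that it runs quickly), a random projection maps a very large raw feature vector down to a much smaller one for downstream learning:

import numpy as np

rng = np.random.default_rng(0)
n_raw, n_compressed = 10_000, 100            # stand-ins for "Big Features" vs. a compressed feature set

x_raw = rng.random(n_raw)                    # one raw observation with a huge number of features
projection = rng.normal(size=(n_compressed, n_raw)) / np.sqrt(n_compressed)
x_compressed = projection @ x_raw            # much smaller feature vector passed on for learning
print(x_raw.shape, x_compressed.shape)       # (10000,) (100,)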

      Throughout the text an effort is made to provide mathematical specifics so that the theoretical underpinnings of the methods are clearly understood. This provides a strong exposition of the theory, but the motivation is not to do more theory; rather, it is to proceed to a clearly defined computational implementation. This is where mathematical elegance meets implementation/computational practicality (and the latter wins). In this text, the focus is almost entirely on elegant methods that also have highly efficient computational implementations.

      In this chapter, a review is given of statistics and probability concepts, with implementation of many of the concepts in Python. Python scripts are then used to do a preliminary examination of the randomness of genomic (virus) sequence data. A short review of Linux OS setup (with Python automatically installed) and Python syntax is given in Appendix A.

      Numerous prior book, journal, and patent publications by the author [1–68] are drawn upon extensively throughout the text. Almost all of the journal publications are open access. These publications can typically be found online at either the author's personal website (www.meta‐logos.com) or with one of the following online publishers: www.m‐hikari.com or bmcbioinformatics.biomedcentral.com.

      A “fair” die has equal probability of rolling a 1, 2, 3, 4, 5, or 6, i.e. a probability of 1/6 for each of the outcomes. Notice how the discrete probabilities for the different outcomes sum to 1; this is always the case for probabilities describing a complete set of outcomes.

      A “loaded” die has a non‐uniform distribution. For example, with probability 0.5 of rolling a “6” and the remaining probability uniform across the other outcomes, the loaded die_roll_probability = (1/10, 1/10, 1/10, 1/10, 1/10, 1/2).
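
      Both distributions can be checked quickly in Python (a minimal illustration; the fair‐die array here is introduced only for comparison):

import numpy as np

fair_die = np.array([1.0/6] * 6)                        # fair die: each outcome has probability 1/6
loaded_die = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])   # loaded die: "6" has probability 1/2

# Each describes a complete set of outcomes, so each must sum to 1.
print(fair_die.sum(), loaded_die.sum())                 # both ~1.0 (up to floating-point rounding)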

      The first program to be discussed is named prog1.py and introduces the notion of discrete probability distributions in the context of rolling the familiar six‐sided die. Comments in Python are the portion of a line to the right of any “#” symbol (except for the first line of code with “#!.....”, which is explained later).

 -------------------------- prog1.py -------------------------
#!/usr/bin/python
import numpy as np
import math
import re

arr = np.array([1.0/10,1.0/10,1.0/10,1.0/10,1.0/10,1.0/2])
# print(arr[0])
shannon_entropy
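
      The listing above breaks off at shannon_entropy. As a point of reference only, a minimal sketch of how the Shannon entropy of this loaded‐die distribution can be computed from the standard definition H = -Σ p·log2(p) (the continuation of prog1.py in the book may differ in its details):

import numpy as np

arr = np.array([1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/2])

# Shannon entropy in bits: H = -sum_i p_i * log2(p_i)
shannon_entropy = -np.sum(arr * np.log2(arr))
print(shannon_entropy)   # ~2.16 bits, versus log2(6) ~ 2.585 bits for a fair die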
