The maximum Shannon entropy on a system with six outcomes, uniformly distributed (a fair die), is log(6). In the prog1.py program above we evaluate the Shannon entropy for a loaded die: (1/10, 1/10, 1/10, 1/10, 1/10, 1/2). Notice in the code, however, that I use "1.0" not "1". This is because if an expression involves only integers, the arithmetic is done as integer operations (returning an integer, with truncation of any fractional part). An expression that mixes integer and floating point terms (those with a decimal point) is evaluated in floating point. So, to force the numbers to be treated as floating point, the "1" in each term is entered as "1.0". Further tests are left to the Exercises (Section 2.7).
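As a minimal sketch of this computation (a standalone fragment, not the prog1.py listing itself), the entropy of the loaded die can be evaluated as follows:

---------------- sketch: loaded-die entropy ------------------
import math

# loaded die: five faces at 1/10, one face at 1/2; the "1.0" guards
# against integer division (in Python 2, "1/10" evaluates to 0)
probs = [1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/2]
shannon_entropy = 0.0
for p in probs:
    shannon_entropy -= p * math.log(p)
print(shannon_entropy)   # ~1.50, below log(6) ~ 1.79 for the fair die
--------------------------------------------------------------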
A basic review of getting a Linux system running, with its standard Python installation, is given in the Appendix, along with a discussion of how to install additional Python modules (added code blocks with very useful, pre-built data structures and subroutines), particularly "numpy", which is imported (accessed) by the program with the first Python command: "import numpy as np". (We will see in the Appendix that the first line of the script is not a Python command but a shell directive indicating which program should process the commands that follow; this is the mechanism whereby the Python script can be invoked at the system level.)
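For reference, a minimal script header of the kind just described might look like the following sketch (the interpreter path is system dependent):

---------------- sketch: script header -----------------------
#!/usr/bin/env python
# the line above is a shell directive (a "shebang"), not a Python command:
# it tells the shell which interpreter to use when the script is invoked
# directly at the system level
import numpy as np   # make the numpy module available under the name "np"
--------------------------------------------------------------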
Let us now move on to some basic statistical concepts. How do we know the probabilities for the outcomes of the die roll? In practice, you would observe numerous die rolls and count how many times the various outcomes occurred. Once you have counts, you can divide by the total count to get the frequency of occurrence of each outcome. If you have enough observational data, the frequencies become better and better estimates of the true underlying probabilities of those outcomes for the system observed (a result due to the law of large numbers (LLN), which is rederived in Section 2.6.1). Let us proceed by adding more code to prog1.py, beginning with counts on the different die rolls:
------------------ prog1.py addendum 1 -----------------------
rolls = np.array([3435.0,3566,3245,3600,3544,3427])
numterms = len(rolls)
total_count = 0
for index in range(0,numterms):
    total_count += rolls[index]
print(total_count)
probs = np.array([0.0,0,0,0,0,0])
for index in range(0,numterms):
    probs[index] = rolls[index]/total_count
print(probs)
-------------------- end prog1.py addendum 1 -----------------
Some notes on syntax: "len" is a built-in Python function that returns the length of (number of items in) an array (here, a numpy array). Notice how the probs array initialization has one entry given as 0.0 and the others just 0. This is an instance where the data structure must have components all of the same type; if presented with mixed types, numpy promotes to a default type that (typically) represents the least loss of information. In this instance the "0.0" forces the array to be an array of floating point (decimal) numbers, giving floating point arithmetic for the division in the frequency evaluation (used as the estimate of the probability in the "for" loop).
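A quick way to see this type promotion in action (a minimal sketch, assuming numpy is installed):

---------------- sketch: numpy type promotion ----------------
import numpy as np

a = np.array([0, 0, 0])     # all integers: an integer array (e.g. int64)
b = np.array([0.0, 0, 0])   # one float promotes the whole array to float64
print(a.dtype, b.dtype)     # e.g. "int64 float64"
--------------------------------------------------------------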
At this point we can estimate a new probability distribution based on the rolls observed, for which we are interested in evaluating the Shannon entropy. To avoid repeatedly copying and pasting the earlier entropy code, let us create a subroutine, called "shannon", that performs this standard computation. This is a core software engineering practice: tasks that are done repeatedly become recognized as such, are rewritten as subroutines, and then need never be rewritten. Subroutines also avoid clashes in variable usage by compartmentalizing their variables (whose scope is limited to the subroutine), and they more clearly delineate what information is "fed in" and what is returned (i.e. the application programming interface, or API).
----------------------- prog1.py addendum 2 ------------------
def shannon( probs ):
    shannon_entropy = 0
    numterms = len(probs)
    print(numterms)
    for index in range(0, numterms):
        print(probs[index])
        shannon_entropy += probs[index]*math.log(probs[index])
    shannon_entropy = -shannon_entropy
    print(shannon_entropy)
    return shannon_entropy

shannon(probs)
value = shannon(probs)
print(value)
-------------------- end prog1.py addendum 2 -----------------
If we do another set of observations, getting counts on the different rolls, we then need to repeat the process of converting those counts to frequencies… so it is time to elevate the count‐to‐frequency computation to subroutine status as well, as is done next. The standard syntactical structure for defining a subroutine in Python is hopefully starting to become apparent (more detailed Python notes are in Appendix A).
------------------- prog1.py addendum 3 ----------------------
def count_to_freq( counts ):
    numterms = len(counts)
    total_count = 0
    for index in range(0,numterms):
        total_count += counts[index]
    probs = counts.copy() # allocate a separate array ("probs = counts" would only alias, overwriting counts below)
    for index in range(0,numterms):
        probs[index] = counts[index]/total_count
    return probs

probs = count_to_freq(rolls)
print(probs)
----------------- end prog1.py addendum 3 --------------------
Is genomic DNA random? Let us read through a DNA file, consisting of a sequence of a, c, g, and t's, and get their counts... then compute the Shannon entropy vs. random (a uniform distribution, e.g. p = 1/4 for each of the four possibilities). In order to do this we must learn file input/output (i/o) to "read" the data file:
------------------ prog1.py addendum 4 -----------------------
fo = open("Norwalk_Virus.txt", "r+")
str = fo.read()
# print(str)
fo.close()
---------------- end prog1.py addendum 4 ---------------------
Notes on syntax: the example above shows the standard template for reading a data file, where the datafile's name is Norwalk_Virus.txt. The built-in Python function "open" handles file i/o; as its name suggests, it "opens" a datafile (mode "r+" opens it for both reading and writing).
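As an aside, an equivalent and somewhat safer idiom uses Python's "with" statement, which closes the file automatically (a sketch; the listings in this chapter use the explicit open/close form above):

---------------- sketch: reading via "with" ------------------
# "r" suffices for read-only access; a name other than "str" avoids
# shadowing the built-in str type
with open("Norwalk_Virus.txt", "r") as fo:
    genome_text = fo.read()   # the file is closed on exiting the block
--------------------------------------------------------------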
Figure 2.1 The Norwalk virus genome (the “cruise ship virus”).
The Norwalk virus file has a nonstandard format and is shown in its entirety in Figure 2.1 (split into two columns). The Escherichia coli genome (Figure 2.2), on the other hand, has the standard FASTA format. (FASTA is the name of a sequence alignment program (c. 1985) for which a file format convention was adopted, allowing "flat-file" record access that has been used in similar form ever since.)
The E. coli genome file is 4.6 Mb, so only its first part is shown in Figure 2.2. The key feature of the FASTA format is apparent on line 1, where a ">" symbol should be present, indicating a label (or comment) line – information that will almost always be present.
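Schematically, the first lines of a FASTA file look like the following (the header text and sequence shown here are illustrative placeholders, not the actual contents of the E. coli file):

>Escherichia coli, complete genome
agcttttcattctgactgcaacgggcaatatgtctctgtgtgga
ttaaaaaaagagtgtctgatagcagcttctgaactggttacctg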
Python has a powerful regular expression module (named "re" and imported in the first code sample of prog1.py). Regular expression processing of strings of characters is a mini programming language in its own right, so a complete list of the re module's functionality will not be given here. We focus on the "findall" function, which does as its name suggests: it finds all substrings matching the search pattern and returns them in an array, in the order they are encountered in the string. We begin with the string comprising the data file read in the previous example, and traverse it, collecting the characters that match those specified in the pattern field. The resulting array of a, c, g, t's has conveniently been stripped of any numbers, spaces, or line returns in the process, and a straightforward count can be done:
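A minimal sketch of such a count (assuming the genome text is still held in the variable "str" from addendum 4, and reusing the count_to_freq and shannon subroutines defined above) might look like:

--------------- sketch: nucleotide count ---------------------
import re

seq = re.findall('[acgt]', str)    # all a/c/g/t characters, in order
counts = np.array([0.0, 0, 0, 0])  # counts for a, c, g, t ("0.0" for float)
bases = ['a', 'c', 'g', 't']
for base in seq:
    counts[bases.index(base)] += 1
print(counts)
probs = count_to_freq(counts)      # frequencies of a, c, g, t
print(shannon(probs))              # compare to log(4) ~ 1.386 for uniform
--------------------------------------------------------------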