    shannon_entropy += arr[index]*math.log(arr[index])
shannon_entropy = -shannon_entropy
print(shannon_entropy)
----------------------- end prog1.py ------------------------

      The maximum Shannon entropy for a system with six outcomes occurs when the outcomes are uniformly distributed (a fair die) and equals log(6). In the prog1.py program above we evaluate the Shannon entropy for a loaded die: (1/10, 1/10, 1/10, 1/10, 1/10, 1/2). Notice in the code, however, that I use “1.0” rather than “1”. This is because an expression involving only integers is evaluated with integer arithmetic (returning an integer, hence truncation of any fractional part). A mixed expression, with some integer terms and some floating point terms (written with a decimal point), is evaluated as floating point. So, to force the numbers to be treated as floating point, the “1” in each term is entered as “1.0”. Further tests are left to the Exercises (Section 2.7).
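      To see these two quantities side by side, here is a minimal stand-alone sketch (separate from prog1.py, with the loaded-die probabilities taken from the text) that computes the entropy of the fair die and of the loaded die; note the “1.0” written into each term, per the integer-versus-floating-point discussion above:

---------------- sketch: fair vs. loaded die -----------------
import math

fair   = [1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]
loaded = [1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/10, 1.0/2]

fair_entropy   = -sum(p*math.log(p) for p in fair)
loaded_entropy = -sum(p*math.log(p) for p in loaded)

print(fair_entropy, math.log(6))   # both approximately 1.7918
print(loaded_entropy)              # approximately 1.4979, i.e. less than log(6)
------------------------ end sketch --------------------------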

      Let us now move on to some basic statistical concepts. How do we know the probabilities for the outcomes of the die roll? In practice, you would observe numerous die rolls and count how many times each outcome was observed. Once you have counts, you can divide by the total count to obtain the frequency of occurrence of each outcome. With enough observational data, the frequencies become better and better estimates of the true underlying probabilities for those outcomes in the system observed (a result due to the law of large numbers (LLN), which is rederived in Section 2.6.1). Let us proceed by adding more code to prog1.py that begins with counts on the different die rolls:

------------------ prog1.py addendum 1 -----------------------
rolls = np.array([3435.0,3566,3245,3600,3544,3427])
numterms = len(rolls)
total_count = 0
for index in range(0,numterms):
    total_count += rolls[index]
print(total_count)
probs = np.array([0.0,0,0,0,0,0])
for index in range(0,numterms):
    probs[index] = rolls[index]/total_count
print(probs)
-------------------- end prog1.py addendum 1 -----------------
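      As a quick aside on the law of large numbers mentioned above, the following stand-alone sketch (the simulated sample sizes are arbitrary, and the loaded-die probabilities are reused purely for illustration) draws random rolls and shows the observed frequencies approaching the true probabilities as the number of rolls grows:

---------------- sketch: law of large numbers ----------------
import numpy as np

true_probs = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
for num_rolls in (100, 10000, 1000000):
    rolls_sim = np.random.choice(6, size=num_rolls, p=true_probs)
    counts = np.bincount(rolls_sim, minlength=6)
    print(num_rolls, counts/num_rolls)   # frequencies -> true_probs as num_rolls grows
------------------------ end sketch --------------------------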

      At this point we can estimate a new probability distribution based on the rolls observed, for which we are interested in evaluating the Shannon entropy. To avoid repeatedly copying and pasting the code for evaluating the Shannon entropy, let us create a subroutine, called “shannon”, that performs this standard computation. This is a core software engineering practice: tasks that are performed repeatedly are recognized as such, written once as subroutines, and then need never be rewritten. Subroutines also avoid clashes in variable usage by compartmentalizing their variables (whose scope is limited to the subroutine), and they more clearly delineate what information is “fed in” and what information is returned (i.e. the application programming interface, or API).

----------------------- prog1.py addendum 2 ------------------
def shannon( probs ):
    shannon_entropy = 0
    numterms = len(probs)
    print(numterms)
    for index in range(0, numterms):
        print(probs[index])
        shannon_entropy += probs[index]*math.log(probs[index])
    shannon_entropy = -shannon_entropy
    print(shannon_entropy)
    return shannon_entropy

shannon(probs)
value = shannon(probs)
print(value)
-------------------- end prog1.py addendum 2 -----------------
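      One caveat: the shannon subroutine as written assumes every probability is strictly positive, since math.log(0) raises an error. If a distribution may contain zero-probability outcomes, a guarded variant could be used instead (a sketch only, using the standard convention that 0·log 0 = 0 and assuming the same math import used by prog1.py):

------------- sketch: zero-safe shannon variant --------------
def shannon_safe( probs ):
    # Shannon entropy with the convention 0*log(0) = 0,
    # so zero-probability outcomes contribute nothing.
    shannon_entropy = 0.0
    for p in probs:
        if p > 0:
            shannon_entropy += p*math.log(p)
    return -shannon_entropy
------------------------ end sketch --------------------------

      Returning to the main development, the next addendum packages the counts-to-frequencies computation from addendum 1 as a subroutine of its own: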

------------------- prog1.py addendum 3 ----------------------
def count_to_freq( counts ):
    numterms = len(counts)
    total_count = 0
    for index in range(0,numterms):
        total_count += counts[index]
    probs = counts.copy()   # copy to get memory allocation (and leave counts unmodified)
    for index in range(0,numterms):
        probs[index] = counts[index]/total_count
    return probs

probs = count_to_freq(rolls)
print(probs)
----------------- end prog1.py addendum 3 --------------------
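      Putting the two subroutines together, the entropy of the observed rolls can now be compared directly against the fair-die maximum (a short usage sketch, assuming the math and numpy imports already made at the top of prog1.py):

------------------- sketch: usage example --------------------
# The observed rolls are nearly uniform, so their entropy should fall
# just below the fair-die maximum of log(6).
observed_entropy = shannon(count_to_freq(rolls))
print(observed_entropy, math.log(6))
------------------------ end sketch --------------------------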

      Is genomic DNA random? Let us read through a DNA file, consisting of a sequence of a, c, g, and t characters, get their counts, and then compute the Shannon entropy and compare it against random (the uniform distribution, i.e. p = 1/4 for each of the four possibilities). In order to do this we must learn file input/output (i/o) to “read” the data file:

------------------ prog1.py addendum 4 -----------------------
fo = open("Norwalk_Virus.txt", "r+")
str = fo.read()
# print(str)
fo.close()
---------------- end prog1.py addendum 4 ---------------------
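      With the sequence now held in str, one possible continuation is to count each nucleotide, convert the counts to frequencies with count_to_freq, and compare the resulting Shannon entropy against the uniform value log(4). This is a sketch only (the program developed in the text may proceed differently), and it assumes the sequence is stored as lowercase a, c, g, t:

-------------- sketch: nucleotide count entropy --------------
counts = np.array([float(str.count(base)) for base in "acgt"])
probs = count_to_freq(counts)
print(probs)
print(shannon(probs), math.log(4))   # near log(4) if the genome were "random"
------------------------ end sketch --------------------------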

Schematic illustration of the Norwalk virus genome.

      Only the first part of the E. coli genome file is shown in Figure 2.2 (the full file is 4.6 Mb). The key feature of the FASTA format is apparent on line 1 of the file, where a “>” symbol should be present, indicating a label (or comment) line – information that will almost always be present.
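      The header line itself is easy to handle in code: a minimal sketch (the filename here is hypothetical) that reads a FASTA file while skipping any “>” label lines and keeping only the sequence might look like the following:

--------------- sketch: reading a FASTA file -----------------
fo = open("E_coli_genome.fasta", "r")    # hypothetical filename
sequence = ""
for line in fo:
    if not line.startswith(">"):         # skip the ">" label/comment line(s)
        sequence += line.strip()
fo.close()
print(len(sequence))
------------------------ end sketch --------------------------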
