Informatics and Machine Learning. Stephen Winters-Hilt
Чтение книги онлайн.
Читать онлайн книгу Informatics and Machine Learning - Stephen Winters-Hilt страница 20
Figure 2.2 The start of the E. coli genome file, FASTA format.
---------------------- prog1.py addendum 5 ------------------- pattern = '[acgt]' result = re.findall(pattern, str) seqlen = len(result) # sequence = "" # sequence = sequence.join(result) # print(sequence) print("The sequence length of the Norwalk genome is: ") print(seqlen) a_count=0 c_count=0 g_count=0 t_count=0 for index in range(0,seqlen): if result[index] == 'a': a_count+=1.0 elif result[index] == 'c': c_count+=1.0 elif result[index] == 'g': g_count+=1.0 elif result[index] == 't': t_count+=1.0 else: print("bad char\n") norwalk_counts = np.array([a_count, c_count, g_count, t_count]) print(norwalk_counts) norwalk_probs = np.array([0.0,0,0,0]) norwalk_probs = count_to_freq(norwalk_counts) value = shannon(norwalk_probs) print(value) -------------------- end prog1.py addendum 5 -----------------
We now traverse the array of single acgt's extracted from the raw genome data file, and increment counters associated with the acgt's as appropriate. At the end we have gotten the needed counts, and can then use our subroutines to see what Shannon entropy occurs.
Note on the informatics result: notice how the Shannon entropy for the frequencies of {a,c,g,t} in the genomic data differs only slightly from the Shannon entropy that would be found for a perfectly random, uniform, probability distribution (e.g. log(4)). This shows that although the genomic data is not random, i.e. the genomic data holds “information” of some sort, but we are only weakly seeing it at the level of single base usage.
In order to see clearer signs of nonrandomness, let us try evaluating frequencies at the base‐pair, or dinucleotide, level. There are 16 (4 × 4) dinucleotides that we must now get counts on:
------------------ prog1.py addendum 6 ----------------------- di_uniform = [1.0/16]*16 stats = {} for i in result: if i in stats: stats[i]+=1 else: stats[i]=1 #for i in sorted(stats, key=stats.get): for i in sorted(stats): print("%dx'%s'" % (stats[i],i)) stats = {} for index in range(1,seqlen): dinucleotide = result[index-1] + result[index] if dinucleotide in stats: stats[dinucleotide]+=1 else: stats[dinucleotide]=1 for i in sorted(stats): print("%dx'%s'" % (stats[i],i)) -------------------- end prog1.py addendum 6 -----------------
In the above example we see our first use of hash variables (“stats”) in keeping tabs on counts of occurrences of various outcomes. This is a fundamental way to perform such counts without enumerating all of the outcomes beforehand (which results in what is known as the “enumeration problem,” which is not really a problem, just a poor algorithmic approach). Further discussion of the enumeration “problem” and how it can be circumvented with use of hash variables will be described in Section 2.2.
The sequence information is traversed in a manner such that each of the dinucleotides is counted in the order seen, where the dinucleotide is extracted as a “window” of width two bases is slid across the genomic sequence. Each dinucleotide is entered into the count hash variable as a “key” entry, with the associated “value” being an increment on the count already seen and held as the old “value.” These counts are then transferred to an array to make use of our prior subroutines count_to_freq and Shannon.
In the results for Shannon entropy on dinucleotides, we still do not see clear signs of nonrandomness. Similarly, let us try trinucleotide level. There are 64 (4 × 4 × 4) trinucleotides that we must now get counts on:
-------------------- prog1.py addendum 7 --------------------- stats = {} order = 3 for index in range(order-1,seqlen): xmer = "" for xmeri in range(0,order): xmer+=result[index-(order-1)+xmeri] if xmer in stats: stats[xmer]+=1 else: stats[xmer]=1 for i in sorted(stats): print("%dx'%s'" % (stats[i],i)) ---------------- end prog1.py addendum 7 ---------------------
Still do not see real clear signs of non‐random at tribase‐level! So let us try 6‐nucleotide level. There are 4096 6‐nucleotides that we must now get counts on:
----------------- prog1.py addendum 8 ------------------------ def shannon_order( seq, order ): stats = {} seqlen = len(seq) for index in range(order-1,seqlen): xmer = "" for xmeri in range(0,order): xmer+=result[index-(order-1)+xmeri] if xmer in stats: stats[xmer]+=1 else: stats[xmer]=1 nonzerocounts = len(stats) print("nonzerocounts=") print(nonzerocounts) counts = np.empty((0)) for i in sorted(stats): counts = np.append(counts,stats[i]+0.0) probs = count_to_freq(counts) value = shannon(probs) print "The shannon entropy at order", order, "is:", value, "." shannon_order(result,6) ------------------- end prog1.py addendum 8 ------------------
2.1.1 Sample Size Complications
The 6‐nucleotide statistics analyzed in prog1.py in the preceding is typically called a hexamer statistical analysis. Where the window‐size for extracting the substrings has “‐mer” appended, thus six‐mer or hexamer. The term “‐mer” comes from oligomer, a polymer containing a small number of monomers in its specification. In the case of the hexamers we saw that there were 4096 possible hexamers, or length six substrings, when the “alphabet” of monomer types consists of four elements: a,c,g, and t. In other words, there are 46 = 4096 such substrings. In the Norwalk virus analysis this large number of different things to count, versus sample size overall, raises sampling questions. The Norwalk virus has a genome that is only 7654 nucleotides long. As we sweep the six‐base window over that string to extract all of the hexamer counts we then obtain only 7654 − 5 = 7649 hexamer samples! Even with uniform distribution we will be getting barely two counts for most of the different hexamer types! Limitations due to sample size play a critical role in these types of analysis.
The Norwalk virus genome is actually smaller than the typical viral genome, which ranges between 10 000 and 100 000 bases in length. Prokaryotic genomes typically range between 1 and 10 million bases in length. While the human genome is approximately three billion bases in length (3.23 Gb per haploid genome, 6.46 Gb total diploid). To go forward with a “strong” statistical analysis in the current discussion, the key as with any statistical analysis, is sample size, which is obviously dictated in this analysis by genome size. So to have “good statistics” meaning to have sufficient samples that frequencies of outcomes provide a good estimation of the underlying probabilities for those outcomes, we will apply the methods developed thus far to a bacterial genome in Chapter 3 (the classic model organism, E. coli.). In this instance the genome size will be approximately four and a half million bases in length, so much better counts should result than with the 7654 base Norwalk virus genome.
2.2 Counting, the Enumeration Problem, and Statistics
In the example in the previous section we left off with counts on all 4096 hexamers seen in a given genome. If we go from counts on substrings of length 6 to substrings of length 30 we run into a problem – there are now a million million million (1018) substrings to get counts on. No genome is even remotely this large, so when getting counts on substrings in this situation most substring counts will necessarily be zero. Due to the large number of substrings, this is often referred to as “the enumeration problem,” but since counts need only be maintained that are nonzero, we are bounded by genome size, for which there is no enumeration problem. The main mechanism for capturing count information on substrings without dedicated (array) memory, is by use of associative memory constructs, such as the hash variable, and this technique is employed in the code examples.