
[Figure: illustration of the start of the E. coli genome file, FASTA format.]

       ---------------------- prog1.py addendum 5 -------------------
pattern = '[acgt]'
result = re.findall(pattern, str)      # 'str' holds the genome file text read in earlier
seqlen = len(result)
# sequence = ""
# sequence = sequence.join(result)
# print(sequence)
print("The sequence length of the Norwalk genome is: ")
print(seqlen)

# Tally the individual base counts.
a_count = 0
c_count = 0
g_count = 0
t_count = 0
for index in range(0, seqlen):
    if result[index] == 'a':
        a_count += 1.0
    elif result[index] == 'c':
        c_count += 1.0
    elif result[index] == 'g':
        g_count += 1.0
    elif result[index] == 't':
        t_count += 1.0
    else:
        print("bad char\n")

# Convert counts to frequencies and compute the Shannon entropy,
# using the count_to_freq and shannon subroutines defined earlier.
norwalk_counts = np.array([a_count, c_count, g_count, t_count])
print(norwalk_counts)
norwalk_probs = count_to_freq(norwalk_counts)
value = shannon(norwalk_probs)
print(value)
       -------------------- end prog1.py addendum 5 -----------------

      Note on the informatics result: notice how the Shannon entropy for the frequencies of {a,c,g,t} in the genomic data differs only slightly from the Shannon entropy that would be found for a perfectly random, uniform probability distribution (i.e. log(4)). This shows that although the genomic data is not random, i.e. it holds “information” of some sort, we see that information only weakly at the level of single‐base usage.
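      For a concrete check, here is a minimal sketch of that comparison (it reuses the shannon and count_to_freq subroutines and the norwalk_counts array from the addenda above; the uniform_probs name is introduced here for illustration):

       ------------------ sketch: uniform entropy baseline ----------
# Maximum entropy occurs for the uniform distribution over the four bases.
uniform_probs = np.array([0.25, 0.25, 0.25, 0.25])
print("uniform (max) entropy:", shannon(uniform_probs))   # = log(4)
print("Norwalk genome entropy:", shannon(count_to_freq(norwalk_counts)))
       -------------------- end sketch ------------------------------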

      In order to see clearer signs of nonrandomness, let us try evaluating frequencies at the base‐pair, or dinucleotide, level. There are 16 (4 × 4) dinucleotides that we must now get counts on:

       ------------------ prog1.py addendum 6 -----------------------
di_uniform = [1.0/16]*16    # uniform reference distribution for the 16 dinucleotides

# Single-base counts via a hash (dict): key = base, value = count.
stats = {}
for i in result:
    if i in stats:
        stats[i] += 1
    else:
        stats[i] = 1
# for i in sorted(stats, key=stats.get):
for i in sorted(stats):
    print("%dx'%s'" % (stats[i], i))

# Dinucleotide counts: slide a window of width two across the sequence.
stats = {}
for index in range(1, seqlen):
    dinucleotide = result[index-1] + result[index]
    if dinucleotide in stats:
        stats[dinucleotide] += 1
    else:
        stats[dinucleotide] = 1
for i in sorted(stats):
    print("%dx'%s'" % (stats[i], i))
       -------------------- end prog1.py addendum 6 -----------------

      The sequence is traversed such that each dinucleotide is counted in the order seen, with each dinucleotide extracted as a “window” of width two bases slides across the genomic sequence. Each dinucleotide is entered into the count hash variable as a “key,” with the associated “value” being an increment of the count already held. These counts are then transferred to an array to make use of our prior subroutines count_to_freq and shannon, as sketched below.
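      The transfer step itself is not shown in addendum 6; a minimal sketch of how it might look, reusing count_to_freq and shannon from the earlier addenda (the di_counts and di_probs names are introduced here for illustration):

       ------------------ sketch: dinucleotide entropy --------------
# Transfer the hash (dict) counts into an array, in sorted-key order,
# then reuse the earlier subroutines.
di_counts = np.array([stats[key] for key in sorted(stats)])
di_probs = count_to_freq(di_counts)
print("dinucleotide entropy:", shannon(di_probs))
print("uniform baseline:    ", shannon(np.array(di_uniform)))
       -------------------- end sketch ------------------------------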

      In the results for Shannon entropy on dinucleotides, we still do not see clear signs of nonrandomness. So let us try the trinucleotide level. There are 64 (4 × 4 × 4) trinucleotides that we must now get counts on:

       -------------------- prog1.py addendum 7 ---------------------
# Generalized x-mer counts: set 'order' to the substring length wanted.
stats = {}
order = 3                              # trinucleotides: 4**3 = 64 possibilities
for index in range(order-1, seqlen):
    xmer = ""
    for xmeri in range(0, order):
        xmer += result[index-(order-1)+xmeri]
    if xmer in stats:
        stats[xmer] += 1
    else:
        stats[xmer] = 1
for i in sorted(stats):
    print("%dx'%s'" % (stats[i], i))
       ---------------- end prog1.py addendum 7 ---------------------

      We still do not see clear signs of nonrandomness at the trinucleotide level! So let us try the 6‐nucleotide (hexamer) level. There are 4096 (4⁶) 6‐nucleotides that we must now get counts on:
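      The counting code is the addendum 7 loop rerun with order set to 6; a minimal sketch:

       ------------------ sketch: hexanucleotide counts -------------
stats = {}
order = 6                              # hexanucleotides: 4**6 = 4096 possibilities
for index in range(order-1, seqlen):
    xmer = ""
    for xmeri in range(0, order):
        xmer += result[index-(order-1)+xmeri]
    if xmer in stats:
        stats[xmer] += 1
    else:
        stats[xmer] = 1
for i in sorted(stats):
    print("%dx'%s'" % (stats[i], i))
       -------------------- end sketch ------------------------------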

      2.1.1 Sample Size Complications

      The Norwalk virus genome is actually smaller than the typical viral genome, which ranges between 10 000 and 100 000 bases in length. Prokaryotic genomes typically range between 1 and 10 million bases in length, while the human genome is approximately three billion bases in length (3.23 Gb per haploid genome, 6.46 Gb total diploid). To go forward with a “strong” statistical analysis in the current discussion, the key, as with any statistical analysis, is sample size, which here is dictated by genome size. So to have “good statistics,” meaning sufficient samples that the frequencies of outcomes provide a good estimation of the underlying probabilities for those outcomes, we will apply the methods developed thus far to a bacterial genome in Chapter 3 (the classic model organism, E. coli). In this instance the genome size will be approximately four and a half million bases in length, so much better counts should result than with the 7654 base Norwalk virus genome.

      In the example in the previous section we left off with counts on all 4096 hexamers seen in a given genome. If we go from counts on substrings of length 6 to substrings of length 30, we run into a problem: there are now a million million million (10¹⁸) possible substrings to get counts on. No genome is even remotely this large, so when getting counts on substrings in this situation most substring counts will necessarily be zero. Due to the large number of substrings this is often referred to as “the enumeration problem,” but since only the nonzero counts need be maintained, we are bounded by genome size, and there is no enumeration problem. The main mechanism for capturing count information on substrings without dedicated (array) memory is the use of associative memory constructs, such as the hash variable (the Python dictionary), and this technique is employed in the code examples.
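      For concreteness, here is a minimal sketch of such sparse, hash‐based counting at substring length 30 (same pattern as addendum 7; the seq join is introduced here for brevity):

       ------------------ sketch: sparse 30-mer counts --------------
# Only substrings actually present in the genome get a dict entry, so
# memory use is bounded by genome length, never by the 4**30 (~10**18)
# possible substrings.
order = 30
stats = {}
seq = "".join(result)                  # result from addendum 5
for index in range(len(seq) - order + 1):
    xmer = seq[index:index+order]      # slide a width-30 window
    if xmer in stats:
        stats[xmer] += 1
    else:
        stats[xmer] = 1
print("distinct 30-mers seen:", len(stats))
       -------------------- end sketch ------------------------------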
