Biological Language Model. Qiwen Dong
x_r = C / r^α,
where x_r is the item in the data set whose rank is r, and C and α are constants that characterize Zipf's law. Taking logarithms, this can be rewritten as
log x_r = log C − α log r.
This equation implies that a plot of x_r versus r on a log–log scale will be a straight line with slope −α.
In natural language, word frequencies and their ranks follow Zipf's law. In English in particular, Zipf's law applies to words, parts of speech, sentences and so on.
Zipf's law for n-grams has been analyzed using the results of the n-gram statistics. Figure 2-2 shows the log–log plot of n-gram frequency versus rank for A_thaliana (A) and Human (B). When n is larger than 4, the plot is close to a straight line and the value of α is close to 0.5. We can therefore claim that the n-grams of whole-genome protein sequences approximately follow Zipf's law when n is larger than 4.
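The chapter does not give the code behind this analysis; the following is a minimal Python sketch of the procedure as described, counting overlapping n-grams and estimating α as the (negated) least-squares slope of log frequency versus log rank. The function names and the toy sequence are illustrative, not from the book.

```python
from collections import Counter
import math

def ngram_counts(sequence, n):
    """Count all overlapping n-grams in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def zipf_alpha(counts):
    """Least-squares slope of log(frequency) vs. log(rank);
    under Zipf's law x_r = C / r^alpha, this slope is -alpha."""
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Toy example: tri-gram counts of a short (made-up) protein fragment.
counts = ngram_counts("MKVLAAGMKVLAMKV", 3)
```

For a whole genome one would pool the n-gram counts over all protein sequences before fitting, as the figure's per-organism curves suggest.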
A statistical measure giving partial information about the degree of complexity of a symbolic sequence can be obtained by calculating the n-gram entropy of the analyzed text. The Shannon n-gram entropy is defined as
H_n = − Σ_i P_i log2 P_i, the sum running over all λ^n possible n-grams,
Figure 2-2 Zipf’s Law analysis for A_thaliana (A) and Human (B).
where P_i is the frequency of the ith n-gram and λ is the number of letters in the alphabet.
From the n-gram entropy, one can obtain the redundancy R of any text. The redundancy is given as
R_n = 1 − H_n / (nK),
where K = log2 λ. The redundancy is a manifestation of the flexibility of the underlying language.
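The two definitions above can be sketched directly in Python. This is an assumed implementation of the standard Shannon formulas as given in the text, not the authors' code; the default alphabet size of 20 reflects the amino-acid alphabet used later in the chapter.

```python
import math

def shannon_ngram_entropy(counts):
    """H_n = -sum P_i log2 P_i, with P_i the relative frequency of each
    observed n-gram (n-grams with zero count contribute nothing)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def redundancy(h_n, n, alphabet_size=20):
    """R_n = 1 - H_n / (n * K), where K = log2(alphabet size).
    alphabet_size defaults to 20 for the amino-acid alphabet."""
    return 1.0 - h_n / (n * math.log2(alphabet_size))
```

For example, uniform counts over four distinct uni-grams give H_1 = 2 bits and, for a four-letter alphabet, zero redundancy.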
To test whether the n-gram Zipf law could be explained by chance sampling, random genome protein sequences were generated with the same sequence lengths and amino-acid frequencies as the natural genome. The process used to generate such random genome sequences is the same as the one used by Chatzidimitriou-Dreismann et al. [3]
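One simple way to produce such an artificial genome is to shuffle all residues of the natural genome and re-split them into proteins of the original lengths; this preserves both the total amino-acid frequencies and the individual sequence lengths. This is a plausible sketch of the randomization described, not necessarily the exact procedure of Ref. [3].

```python
import random

def shuffled_genome(proteins, seed=0):
    """Build an artificial genome: pool all residues, shuffle them, and
    re-split into proteins of the original lengths, so amino-acid
    frequencies and sequence lengths match the natural genome."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    residues = list("".join(proteins))
    rng.shuffle(residues)
    artificial, pos = [], 0
    for p in proteins:
        artificial.append("".join(residues[pos:pos + len(p)]))
        pos += len(p)
    return artificial
```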
The n-gram redundancy of the natural and artificial genome protein sequences has been calculated for different values of n (see Fig. 2-3); the n-gram redundancy can be approximately expressed as
R_n = 1 − H_n / (n log2 λ).
Here, the alphabet is the set of amino acids, so the value of λ is 20.
From Fig. 2-3, one can see that the n-gram redundancy of the natural genome is larger than that of the artificial genome. This means that the n-gram entropy of the natural genome is small and that a “language” may exist in the protein sequence.
2.4 Distinguishing the Organisms by Uni-Gram Model
Here, perplexity is used to distinguish the different organisms. Perplexity measures the predictive ability of a language model on a testing text. Let W = w[1], w[2], . . ., w[n] denote the sequence of words in the testing text, let c_k(i) be the context the language model chooses for the prediction of the word w[i], and let p(w[i] | c_k(i)) denote the probability the model assigns to the ith word.
Figure 2-3 The n-gram redundancy comparison of a natural and random genome for A_thaliana (A) and Human (B).
The total probability (TP) of the sequence is
TP = ∏_{i=1}^{n} p(w[i] | c_k(i)).
Then, the perplexity PP is
PP = TP^(−1/n),
where n is the total length of the testing sequences.
A simple uni-gram (context-independent amino acid) model was trained on 90 percent of the proteins from Borrelia_burgdorferi. The perplexities of the remaining 10 percent of proteins and of the proteins from the other 19 organisms were then calculated; Table 2-2 provides the detailed results. Different organisms have different perplexities, which indicates that different "dialects" may be embodied in the proteins of different organisms. Another important observation is that the perplexity is independent of the size of the testing set. To validate this, another experiment was carried out: the proteins of A_thaliana were used to train the uni-gram model, and the human proteins were split randomly into 10 shares, each scored separately. The resulting perplexities were 18.2049, 18.2091, 18.153, 18.1905, 18.2698, 18.2101, 18.1556, 18.1495, 18.3173 and 18.1925. These values vary only slightly. So the perplexity depends on the uni-gram model and the organism being tested, and has no relation to the size of the testing proteins.
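A uni-gram model of this kind reduces to a single probability per amino acid, and the perplexity formula above is most safely evaluated in log space. The sketch below assumes maximum-likelihood estimation with add-one smoothing (so residues unseen in training still get nonzero probability); the chapter does not specify its smoothing scheme.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def train_unigram(proteins, alphabet=AMINO_ACIDS):
    """Uni-gram (context-independent) model with add-one smoothing."""
    counts = Counter("".join(proteins))
    total = sum(counts.values()) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}

def perplexity(model, proteins):
    """PP = TP^(-1/n), computed via log2 probabilities to avoid underflow."""
    log_tp, n = 0.0, 0
    for p in proteins:
        for a in p:
            log_tp += math.log2(model[a])
            n += 1
    return 2.0 ** (-log_tp / n)
```

A sanity check: a model that assigns uniform probability 1/20 to every residue yields a perplexity of exactly 20 on any testing set, the upper bound an organism-specific model should improve on.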
Table 2-2 The perplexities of different organisms.
2.5 Conclusions
In this chapter, the n-gram and linguistic features of whole-genome protein sequences have been analyzed. The results show that (1) the n-grams of whole-genome protein sequences approximately follow Zipf's law when n is larger than 4, (2) the Shannon n-gram entropy of the natural genome proteins is lower than that of artificial proteins, (3) a simple uni-gram model can distinguish different organisms, and (4) there are organism-specific usages of "phrases" in protein sequences. Further work will aim at detailed identification of these "phrases" and the building of a "biological language" with its own words, phrases and syntax to map the relationships among protein sequence, structure and function.
References
[1] Anfinsen C.B. Principles that govern the folding of protein chains. Science, 1973, 181(4096): 223–230.
[2] Mantegna R.N., Buldyrev S.V., Goldberger A.L., Havlin S., Peng C.K., Simons M., Stanley H.E. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 1995, 52(3): 2939–2950.
[3] Chatzidimitriou-Dreismann C.A., Streffer R.M., Larhammar D. Lack of biological significance in the ‘linguistic features’ of noncoding DNA — A quantitative analysis. Nucleic Acids Res, 1996, 24(9): 1676–1681.
[4] Tsonis A.A., Elsner J.B., Tsonis P.A. Is DNA a language? J Theor Biol, 1997, 184(1): 25–29.
[5] Voss R.F. Comment on “Linguistic features of noncoding DNA sequences”. Phys Rev Lett, 1996, 76(11): 1978.
[6]Burge