Biological Language Model. Qiwen Dong
Chapter 3
Amino Acid Encoding for Protein Sequence
3.1 Motivation and Basic Idea
The digital representation of amino acids is variously called feature extraction, the amino acid encoding scheme, or the residue encoding scheme. Here, we use amino acid encoding as the term of choice. Note that amino acid encoding differs from protein sequence encoding. Protein sequence encoding represents the entire protein sequence by a single n-dimensional vector, for example the n-gram [1] or pseudo amino acid composition [2,3]. Since the amino acid-specific information is lost, protein sequence encoding can only be used to predict sequence-level properties (e.g. protein fold recognition). Amino acid encoding instead represents each amino acid of a protein sequence by its own n-dimensional vector; thus, a protein sequence of length L is represented by n × L values. Combined with different machine learning methods, amino acid encoding can be used for protein property prediction at both the residue level and the sequence level (e.g. secondary structure prediction and protein fold recognition). Over the past decades, various amino acid encoding methods have been proposed from different perspectives [4–6]. The most widely used encodings are one-hot encoding, position-specific scoring matrix (PSSM) encoding, and some physicochemical property encodings. Other encodings have also been proposed, such as the encoding estimated from inter-residue contact energies [7], the encoding learned from protein structure alignments [8] and the encoding learned from sequence context [9]. These methods approach amino acid encoding from new perspectives and can complement the encodings above. Kawashima et al. [10] have proposed a database of numerical indices of amino acids and amino acid pairs, which contains information on the physicochemical and biochemical properties of amino acids.
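The difference between the two kinds of encoding can be made concrete with a minimal Python sketch. It assumes a 1-gram (amino acid composition) vector as the sequence-level encoding and a hypothetical two-dimensional lookup table (`toy_table`) as the residue-level encoding; both choices, and all names in the block, are illustrative rather than taken from the cited methods.

```python
from collections import Counter

# Assumed alphabet order (alphabetical by one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_encoding(seq):
    """Sequence-level encoding: 1-gram frequencies, one vector of length 20."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def residue_encoding(seq, table):
    """Residue-level encoding: one n-dimensional vector per residue (L rows)."""
    return [table[aa] for aa in seq]


# Hypothetical 2-dimensional table, for illustration only:
# (residue is one of K/R/H, residue is one of D/E).
toy_table = {aa: [int(aa in "KRH"), int(aa in "DE")] for aa in AMINO_ACIDS}

seq = "ACDKA"
seq_vec = sequence_encoding(seq)            # length 20, independent of L
res_mat = residue_encoding(seq, toy_table)  # L rows, n = 2 columns
```

The sequence-level vector has the same length for every protein, so the position of each residue is lost, while the residue-level result keeps one row per position and grows with L.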
3.2 Related Work
3.2.1 Binary encoding
Binary encoding methods use multidimensional binary digits (0 and 1) to represent the amino acids in protein sequences. The most commonly used binary encoding is one-hot encoding, also called orthogonal encoding [5]. In one-hot encoding, each of the 20 amino acids is represented by a 20-dimensional binary vector. Specifically, the 20 standard amino acids are fixed in a specific order, and the ith amino acid type is represented by 20 binary bits with the ith bit set to “1” and all others set to “0”. Exactly one bit equals “1” in each vector; hence the name “one-hot”. For example, if the 20 standard amino acids are sorted as [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y], the one-hot code of “A” is 10000000000000000000, that of “C” is 01000000000000000000, and so on. Since protein sequences may contain unknown amino acids, one more bit is needed in some cases to represent the unknown amino acid type, and the dimension of the binary vector is then 21 [11].
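The scheme described above can be sketched in a few lines of Python. The alphabet order follows the example in the text; the function names and the choice to place the "unknown" bit in the last position are assumptions made for illustration.

```python
# Order of the 20 standard amino acids as given in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue, with_unknown=False):
    """Return the one-hot code of a residue as a list of 0/1 integers.

    With with_unknown=True the vector has 21 dimensions, and the last
    bit (an assumed position) flags any symbol outside the 20 standard
    amino acids.
    """
    size = len(AMINO_ACIDS) + (1 if with_unknown else 0)
    vec = [0] * size
    idx = INDEX.get(residue)
    if idx is not None:
        vec[idx] = 1
    elif with_unknown:
        vec[-1] = 1
    else:
        raise ValueError(f"unknown residue: {residue!r}")
    return vec


def encode_sequence(seq, with_unknown=False):
    """Encode a whole sequence: one one-hot row per residue."""
    return [one_hot(residue, with_unknown) for residue in seq]
```

For instance, `encode_sequence("ACX", with_unknown=True)` yields three rows of 21 bits each, with the third row marking the unknown symbol.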
Because one-hot encoding is a high-dimensional and sparse vector representation, a simplified binary encoding method based on conservative replacements through evolution has been proposed [12]. Derived from the point accepted mutation (PAM) matrices [13], it divides the 20 standard amino acids into six groups: [H, R, K], [D, E, N, Q], [C], [S, T, P, A, G], [M, I, L, V] and [F, Y, W]. Six-dimensional binary vectors are used to represent amino acids according to their groups. Another low-dimensional binary encoding scheme is the binary 5-bit encoding introduced by White and Seffens [14]. In theory, a binary 5-bit code can represent 32 (2^5 = 32) possible amino acid types. To represent the 20 standard amino acids, the code consisting of all 0s, the code consisting of all 1s, and the codes containing exactly one or exactly four 1s (5 + 5 = 10) are removed, leaving 20 codes (32 − 1 − 1 − 10 = 20), namely those with exactly two or three 1s. This binary 5-bit encoding replaces the 20-dimensional vector of one-hot encoding with a 5-dimensional binary vector, which may reduce model complexity [5].
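Both low-dimensional schemes can be illustrated with a short Python sketch. Two assumptions are made here: the six-group code is taken to be a one-hot vector over the groups in the order listed above, and the 5-bit part only enumerates the 20 surviving bit patterns, since the assignment of patterns to individual amino acids is not given in this section. Function names are hypothetical.

```python
from itertools import product

# The six groups, in the order listed in the text.
GROUPS = [set("HRK"), set("DENQ"), set("C"),
          set("STPAG"), set("MILV"), set("FYW")]


def six_group_code(residue):
    """One-hot vector over the six groups (assumed bit layout)."""
    for i, group in enumerate(GROUPS):
        if residue in group:
            return [int(j == i) for j in range(len(GROUPS))]
    raise ValueError(f"not a standard amino acid: {residue!r}")


def five_bit_codes():
    """All 5-bit patterns left after dropping those with 0, 1, 4 or 5 ones."""
    return [bits for bits in product((0, 1), repeat=5) if sum(bits) in (2, 3)]


codes = five_bit_codes()
# 32 patterns in total, minus 1 (all 0s), 1 (all 1s) and 10 (one or four 1s).
assert len(codes) == 32 - 1 - 1 - 10
```

Under the six-group scheme all residues in a group share one code, so the encoding trades residue identity for a shorter vector; the 5-bit scheme keeps a distinct code for each of the 20 residues.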
3.2.2 Physicochemical properties encoding
From the perspective of molecular composition, a typical amino acid contains a central carbon atom (C) bonded to an amino group (NH2), a hydrogen atom (H), a carboxyl group (COOH) and a side chain (R). The side chains (R) are usually carbon chains or rings (except for proline) to which various functional groups are attached [5]. The physicochemical properties of these components play critical roles in the formation of protein structures and functions; thus, these properties can also be used as features for protein structure and function prediction [15].
Among the various physicochemical properties, the hydrophobicity of amino acids is believed to play a fundamental role in organizing the self-assembly of a protein [16]. Based on the propensity of the amino acid side chain to be in contact with a polar solvent such as water, the 20 amino acids can be classified as either hydrophobic or hydrophilic. The free energy of transferring an amino acid side chain from cyclohexane to water can be used to quantify its hydrophobicity [6]. If the free energy is positive, the amino acid is hydrophobic, while a negative value indicates a hydrophilic amino acid. In protein three-dimensional structures, hydrophobic amino acids are usually buried inside the protein core, while hydrophilic amino acids preferentially cover the protein surface. The hydrophilic amino acids are also called polar amino acids. In a typical biological environment, some polar amino acids carry a charge: lysine (+), histidine (+), arginine (+), aspartate (−) and glutamate (−). The other polar amino acids, asparagine, glutamine, serine, threonine and tyrosine, are neutral [17]. A detailed classification of the hydrophobic properties of the 20 standard amino acids is shown in Table