Biological Language Model. Qiwen Dong

Biological Language Model - Qiwen Dong East China Normal University Scientific Reports

structure and function. With the growing number of proteins with known structures, structure-based encodings have considerable future prospects. Furthermore, encodings that reflect functional potential may be more useful than others for protein function prediction; thus, exploring function-based encoding methods is a worthwhile topic. Third, machine-learning encoding methods are a promising topic for future studies. As amino acid encoding is an open problem, most encoding methods rest on an artificially defined basis; for example, the physicochemical property encodings are constructed from protein fold-related properties observed by researchers, which inevitably introduces some unknown deviations. Machine-learning methods, however, can avoid such artificial deviations by learning the amino acid encoding automatically from biological data. Protein sequences and natural languages are similar to a certain extent; for instance, protein sequences are comparable to sentences, and amino acids or polypeptide chains are comparable to words. Considering that distributed word representations have improved performance across a broad range of natural language processing tasks, protein sequence analysis should also benefit from distributed representations of amino acids or n-gram amino acids. Some recent studies have demonstrated the potential of distributed amino acid representations in protein family classification, disordered protein identification and protein functional property prediction, but most of these methods concern n-gram distributed representations, which cannot be used directly to predict residue-level properties. Thus, residue-level distributed representations of amino acids are a topic that needs more attention.
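The contrast between n-gram "words" and residue-level representations can be illustrated with a small sketch. It is not the method of any study mentioned here: it swaps a learned embedding for simple normalized co-occurrence counts, and the function names (`ngram_words`, `residue_context_vectors`, `encode_residues`), the window size and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_words(sequence, n=3):
    """Split a protein sequence into overlapping n-gram 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def residue_context_vectors(sequences, window=2):
    """Toy residue-level representation (an assumption, not a learned
    embedding): for each residue type, the normalized counts of residue
    types observed within `window` positions of it."""
    counts = {aa: Counter() for aa in AMINO_ACIDS}
    for seq in sequences:
        for i, center in enumerate(seq):
            if center not in counts:
                continue  # skip non-standard symbols such as X
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i and seq[j] in counts:
                    counts[center][seq[j]] += 1
    vectors = {}
    for aa, ctx in counts.items():
        total = sum(ctx.values())
        vectors[aa] = [ctx[b] / total if total else 0.0 for b in AMINO_ACIDS]
    return vectors


def encode_residues(sequence, vectors):
    """Map each residue to its vector, giving one row per position."""
    zero = [0.0] * len(AMINO_ACIDS)
    return [vectors.get(aa, zero) for aa in sequence]


# Made-up sequences for illustration only, not real proteins.
toy_sequences = ["MKVLAAGIVG", "MKTLLAGAVG"]
words = ngram_words(toy_sequences[0], n=3)
vectors = residue_context_vectors(toy_sequences, window=2)
encoded = encode_residues(toy_sequences[0], vectors)
print(len(toy_sequences[0]), len(words), len(encoded))
```

A sequence of length L yields only L - n + 1 overlapping n-gram tokens, so an n-gram representation has no one-to-one mapping to positions, whereas the residue-level encoding returns exactly one vector per residue and can feed a per-residue predictor directly.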

