Biological Language Model. Qiwen Dong
3.2.5 Machine-learning encoding
Unlike the earlier manually defined encoding methods, machine-learning-based encoding methods learn amino acid encodings from protein sequence or structure data, typically by using artificial neural networks. To reduce the complexity of the model, the neural network for learning amino acid encodings uses weight sharing for the 20 amino acids. In general, the neural network contains three layers: the input layer, the hidden layer and the output layer. The input layer corresponds to the original encoding of the target amino acid, which can be one-hot encoding, physicochemical encoding, etc. The output layer also corresponds to the original encoding of the related amino acids. The hidden layer, which represents the new encoding of the target amino acid, usually has a reduced dimension compared with the original encoding.
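The role of the hidden layer and of weight sharing can be illustrated with a minimal forward-pass sketch. It assumes a one-hot input and a hidden layer of three units; the weights are random and untrained, no output layer or training loop is shown, and all function names are hypothetical rather than taken from any cited method.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Return the 20-dimensional one-hot vector of a residue."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


def make_shared_weights(hidden_dim, seed=0):
    """Random 20 x hidden_dim matrix (untrained; for illustration only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]
            for _ in AMINO_ACIDS]


def encode_residue(residue, weights):
    """Hidden-layer activation for one residue: one_hot(residue) x weights."""
    x = one_hot(residue)
    hidden_dim = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(hidden_dim)]


def encode_window(window, weights):
    """Apply the SAME weight matrix at every position (weight sharing)."""
    encoded = []
    for residue in window:
        encoded.extend(encode_residue(residue, weights))
    return encoded


weights = make_shared_weights(hidden_dim=3)  # hidden size 3 is an assumption
window = "MKTAY"
features = encode_window(window, weights)
print(len(features))  # 5 positions x 3 hidden units = 15
```

With a one-hot input, multiplying by the shared matrix simply selects one row of it, so after training each row can be read off as the learned low-dimensional encoding of the corresponding amino acid.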
To our knowledge, the earliest concept of learning-based amino acid encodings was proposed by Riis and Krogh.35 In order to reduce the redundancy of one-hot encoding, they used a 20 × 3 weight-sharing neural network to learn a 3-dimensional real-valued representation of the 20 amino acids from one-hot encoding. Later, Jagla and Schuchhardt36 also used a weight-sharing artificial neural network to learn a 2-dimensional encoding of amino acids for human signal peptide cleavage site recognition. Meiler et al.37 used a symmetric neural network to learn reduced representations of amino acids from amino acid physicochemical and statistical properties. The parameter representations were reduced from five and seven dimensions, respectively, to 1, 2, 3 or 4 dimensions, and these reduced representations were then used for ab initio prediction of protein secondary structure. Lin et al.8 used an artificial neural network to derive encoding schemes of amino acids from protein three-dimensional structure alignments, with each amino acid described by the values taken from the hidden units of the neural network.
In recent years, several new machine-learning-based encoding methods9,38,39 have been proposed with reference to distributed word representation in natural language processing. In natural language processing, the distributed representation of words has been proven to be an effective strategy in many tasks.40 The basic assumption is that words sharing similar contexts will have similar meanings; therefore, these methods train a neural network model either by using the target word to predict its context words or by predicting the target word from its context words. After training on unlabeled datasets, the weights of the hidden units for each word are used as its distributed representation. In protein-related studies, a similar strategy has been used by treating protein sequences as sentences and amino acids or sub-sequences as words. In previous studies, these distributed representations of amino acids or sub-sequences have shown potential in protein family classification and disordered protein identification,9 protein functional site prediction,38 protein functional property prediction,39 etc.
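The sentence-and-word analogy can be made concrete with a small sketch of how training pairs for the "target word predicts context words" setting might be generated. It assumes overlapping 3-residue sub-sequences as words and a context window of one word on each side; both choices, and the function names, are illustrative assumptions, not the settings of the cited methods.

```python
def split_into_kmers(sequence, k=3):
    """Treat every overlapping k-residue sub-sequence as one 'word'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(words, window=1):
    """Build (target, context) pairs: each word paired with its neighbours."""
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, words[j]))
    return pairs


# Toy protein "sentence"; real training would use large unlabeled datasets.
words = split_into_kmers("MKTAYIA", k=3)
pairs = skipgram_pairs(words, window=1)
print(words)       # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA']
print(len(pairs))  # 8
```

A neural network trained on such pairs would then supply, for each word, the hidden-unit weights that serve as its distributed representation.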
3.3 Discussion
In this section, we discuss amino acid encoding methods from a theoretical perspective. First, we investigate the classification criteria of amino acid encoding methods; second, we discuss the theoretical basis of these methods and analyze their advantages and limitations. Finally, we review and discuss the criteria for measuring an amino acid encoding method.
As introduced above, amino acid encoding methods have been divided into five categories according to their information sources and methodologies. However, it should be noted that the methods in one category are not completely different from those in others, and that there are some similarities between encoding methods belonging to different categories. For example, the 6-bit one-hot encoding method proposed by Wang et al.12 is a dimension-reduced representation of the common one-hot encoding, but it is based on the six amino acid exchange groups, which are derived from PAM matrices.13 Another classification criterion is based on position relevance. As mentioned in an earlier section, evolution-based encoding methods are divided into two categories: position-independent methods and position-dependent methods. All amino acid encoding methods can likewise be grouped into these two categories. The position-specific scoring matrix (PSSM) and similar encoding techniques that extract evolutionary features from multiple sequence alignments are position-dependent; all other amino acid encoding methods are position-independent. The position-dependent methods can capture homologous information, while the position-independent ones reflect the basic properties of amino acids. To some extent, these two types of methods are complementary. In practice, position-independent and position-dependent encodings are often combined, for example one-hot with PSSM,41 or physicochemical property encoding with PSSM,42 etc.
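Combining a position-independent encoding with a position-dependent one usually amounts to concatenating the two feature vectors at each sequence position. The sketch below assumes a PSSM supplied as one row of 20 scores per residue, in the same amino acid order as the one-hot vector; the PSSM values are placeholders, not real alignment-derived scores, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order assumed for both encodings


def one_hot(residue):
    """Position-independent part: 20-dimensional one-hot vector."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


def combine_features(sequence, pssm):
    """Concatenate one-hot and PSSM rows into one 40-dim vector per position."""
    if len(pssm) != len(sequence):
        raise ValueError("PSSM must have exactly one row per residue")
    return [one_hot(res) + list(row) for res, row in zip(sequence, pssm)]


sequence = "MKT"
# Placeholder PSSM: all zeros except one made-up score for the first position.
toy_pssm = [[0.0] * 20 for _ in sequence]
toy_pssm[0][AMINO_ACIDS.index("M")] = 5.0

features = combine_features(sequence, toy_pssm)
print(len(features), len(features[0]))  # 3 40
```

The first 20 values of each row depend only on the residue type, while the last 20 depend on its position in the alignment, which is what makes the two encodings complementary.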
Theoretically, the functions of a protein are closely related to its tertiary structure, and its tertiary structure is mostly determined by the physicochemical properties of its amino acid sequence.43 From this perspective, the evolution-based, structure-based and machine-learning encoding methods all extract information that is rooted in the physicochemical properties of amino acids, but they do so using different strategies. Specifically, different amino acids may have different mutation tendencies in the evolutionary process due to their hydrophobicity, polarity, volume and other properties. These mutation tendencies are reflected in sequence alignments and are detected by the evolution-based encoding methods. Similarly, the physicochemical properties of amino acids affect the inter-residue contact potentials in tertiary protein structures, which form the basis of the structure-based encoding methods. The machine-learning encoding methods also learn amino acid encodings from physicochemical representations or from evolutionary information (such as homologous protein structure alignments), which can be seen as another variant of physicochemical properties. Although these encoding methods share a similar theoretical basis, their performance differs because of restrictions in their implementation. One-hot encoding introduces no artificial correlation between amino acids, but it is highly sparse and redundant, which leads to a complex machine learning model. The physicochemical properties of amino acids play fundamental roles in the protein folding process, so in theory the physicochemical property encoding methods should be effective. However, as the physicochemical properties relevant to protein folding and their numerical metrics are unknown, developing an effective physicochemical property encoding method is still an unresolved problem.
The evolution-based encoding methods extract evolutionary information using only protein sequences and can thus benefit from large-scale protein sequence data. In particular, PSSM has shown strong performance in many studies.44 However, for proteins without homologous sequences, the performance of evolution-based methods is limited. The structure-based encoding methods encode amino acids based on inter-residue contact potentials, which denote a low-dimensional representation of protein structure. Because the number of known protein structures is limited, the performance of these methods is also limited. Early machine-learning encoding methods faced the same problem of insufficient data samples, but several recently developed methods have overcome it by taking advantage of unlabeled sequence data.9,38,39
As discussed, different amino acid encoding methods have specific advantages and limitations; so, what is the most effective encoding method? According to Wang et al.,12 the best encoding method should significantly reduce the uncertainty of the output of the prediction model, or it should capture both the global similarity and the local similarity of protein sequences; here, the global similarity refers to the overall similarity among multiple sequences, while the local similarity refers to motifs in the sequences. Riis and Krogh35 proposed that redundancy