Biological Language Model. Qiwen Dong
In order to use the information of neighboring residues, many previous protein secondary structure prediction methods apply a sliding window scheme and have demonstrated good results.48 Following those methods, we also used the sliding window scheme to evaluate the different amino acid encoding methods; the scheme is illustrated in Fig. 3-3. The evaluation is based on the Random Forests method from the Scikit-learn toolbox,58 with a window size of 13 and 100 trees in the forest. The comparison results are shown in Table 3-3.
Figure 3-3 Diagram of the sliding window scheme using the Random Forests classifier for protein secondary structure prediction. The two target residues shown are Leu (L) and Ile (I); the input for each target residue is constructed independently.
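The sliding window scheme described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the sequence and the secondary structure labels below are synthetic placeholders, and one-hot encoding stands in for the 16 encodings under comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(residue):
    """20-dimensional one-hot vector for a single residue."""
    v = np.zeros(len(AMINO_ACIDS))
    v[AA_INDEX[residue]] = 1.0
    return v

def window_features(sequence, window=13):
    """Encode each residue by concatenating the encodings of the residues
    in a centered window; positions beyond the termini are zero-padded,
    so every sample has window * 20 features."""
    half = window // 2
    feats = []
    for i in range(len(sequence)):
        row = []
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(sequence):
                row.append(one_hot(sequence[j]))
            else:
                row.append(np.zeros(len(AMINO_ACIDS)))
        feats.append(np.concatenate(row))
    return np.array(feats)

# Toy illustration: random three-state labels (H = helix, E = strand, C = coil).
rng = np.random.default_rng(0)
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
X = window_features(seq, window=13)
y = rng.choice(list("HEC"), size=len(seq))
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Each residue thus becomes an independent 13 × 20 = 260-dimensional sample, matching the per-residue inputs depicted in Fig. 3-3.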
First, we analyze and discuss the performance of the different methods within each category. Among the binary encoding methods, one-hot encoding is the most widely used; the one-hot (6-bit) encoding and the binary 5-bit encoding are two dimension-reduced representations of it. As can be seen from Table 3-3, the best performance is achieved by the one-hot encoding, which suggests that some effective information is lost in the artificial dimension reduction of the one-hot (6-bit) and binary 5-bit encodings. Among the physicochemical property encodings, the hydrophobicity matrix contains only hydrophobicity-related information and performs poorly, while the Meiler parameters and the Atchley factors are constructed from multiple physicochemical information sources and perform better. This shows that integrating multiple physicochemical properties is valuable. For the evolution-based encodings, the position-dependent encodings (PSSM and HMM) are clearly much more powerful than the position-independent encodings (PAM250 and BLOSUM62), which shows that homologous information is strongly associated with protein structure. The two structure-based encodings have comparable performance. Among the three machine-learning encodings, ANN4D performs better than AESNN3 and ProtVec, while the ProtVec-3mer encoding achieves performance similar to that of ProtVec.

Second, on the whole, the position-dependent evolution-based encodings (PSSM and HMM) achieve the best performance. This result suggests that the evolutionary information extracted from the MSAs is more conserved than the global information extracted from other sources.

Third, the performance of the different encoding methods shows a certain degree of correlation with the encoding dimension: the low-dimensional encodings, i.e. the one-hot (6-bit), binary 5-bit and two machine-learning encodings, perform worse than the high-dimensional encodings. This correlation could be due to the sliding window scheme and the Random Forests algorithm: a larger feature dimension is more conducive to recognizing the secondary structure states, but too large a dimension leads to poor performance (ProtVec and ProtVec-3mer).
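The information-loss argument can be made concrete. The sketch below contrasts one-hot encoding with a 5-bit index code; note that the exact bit assignment of the cited binary 5-bit encoding is an assumption here, used only to show why compressed codes can impose arbitrary structure.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_20(aa):
    """Standard one-hot: 20 dimensions, all residue pairs equidistant."""
    v = np.zeros(20)
    v[AMINO_ACIDS.index(aa)] = 1.0
    return v

def binary_5bit(aa):
    """Compressed 5-bit code: the residue's alphabet index written in
    binary (20 letters fit in 5 bits). Illustrative bit assignment."""
    idx = AMINO_ACIDS.index(aa)
    return np.array([(idx >> b) & 1 for b in range(5)], dtype=float)

# Under one-hot, every pair of distinct residues is equally distant,
# so the encoding imposes no spurious similarity.
d_onehot_AC = np.linalg.norm(one_hot_20("A") - one_hot_20("C"))
d_onehot_AW = np.linalg.norm(one_hot_20("A") - one_hot_20("W"))

# Under the 5-bit code, distances depend on the arbitrary bit patterns:
# some residue pairs look "closer" for no biological reason.
d_5bit_AC = np.linalg.norm(binary_5bit("A") - binary_5bit("C"))
d_5bit_AW = np.linalg.norm(binary_5bit("A") - binary_5bit("W"))
```

The unequal distances introduced by the compressed code are one plausible mechanism for the information loss observed in Table 3-3.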
Table 3-3 Protein secondary structure prediction accuracy of 16 amino acid encoding methods by using the Random Forests method.
3.4.4 Performance comparison by using the BRNN method
In recent years, deep learning-based methods for protein secondary structure prediction have achieved significant improvements.48 One of the most important advantages of deep learning methods is that they can capture both neighboring and long-range interactions, thereby avoiding the shortcomings of sliding window methods with a handcrafted window size. For example, Heffernan et al.42 achieved state-of-the-art performance by using long short-term memory (LSTM) bidirectional recurrent neural networks. Therefore, to exclude the potential influence of the handcrafted window size, we also performed an assessment using bidirectional recurrent neural networks (BRNN) with long short-term memory cells. The model used here is similar to the one in Heffernan's work,42 as shown in Fig. 3-4: it contains two BRNN layers with 256 LSTM cells each and two fully connected (dense) layers with 1024 and 512 nodes, and it is implemented with the open-source deep learning library TensorFlow.59
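The architecture just described can be sketched in Keras roughly as follows. This is an approximation of the model in the text, not the authors' released code; the input dimension (here 20, for one-hot input) and the three-state output are assumptions that would change with the encoding and labeling scheme used.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 20   # per-residue encoding dimension (e.g. one-hot)
NUM_CLASSES = 3     # helix / strand / coil

# Two stacked bidirectional LSTM layers (256 cells per direction),
# followed by two time-distributed dense layers of 1024 and 512 nodes,
# and a per-residue softmax over the secondary structure states.
inputs = layers.Input(shape=(None, NUM_FEATURES))   # variable-length sequences
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.TimeDistributed(layers.Dense(1024, activation="relu"))(x)
x = layers.TimeDistributed(layers.Dense(512, activation="relu"))(x)
outputs = layers.TimeDistributed(
    layers.Dense(NUM_CLASSES, activation="softmax"))(x)
model = tf.keras.Model(inputs, outputs)
```

Because `return_sequences=True` is set on both recurrent layers, the model emits one prediction per residue, so the whole sequence is labeled in a single pass with no window size to tune.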
The corresponding comparison results for the 16 selected encoding methods are shown in Table 3-4. Overall, the BRNN-based method performs better than the Random Forests-based method, but there are also some specific similarities and differences between them. For the binary encoding methods, one-hot encoding still shows the best performance, which once again confirms the information loss of the one-hot (6-bit) and binary 5-bit encodings. For the physicochemical property encodings, the Meiler parameters do not perform as well as the Atchley factors, suggesting that the Atchley factors are more efficient for deep learning methods. For the evolution-based encodings, the PSSM encoding achieves the best accuracy, while the HMM encoding achieves only about the same accuracy as the position-independent encodings (PAM250 and BLOSUM62). The difference could be due to the different levels of homologous sequence identity: the HMM encoding is extracted from the UniProt20 database at 20% sequence identity, while the PSSM encoding is extracted from the UniRef90 database at 90% sequence identity. Therefore, for a given protein sequence, its MSA from the UniProt20 database mainly contains remote homologous sequences, while its MSA from the UniRef90 database usually contains more close homologs. From the results in Table 3-4, the evolutionary information of homologous sequences is more powerful for distinguishing protein secondary structures than that of remote homologous sequences. For the structure-based encodings, the Micheletti potentials perform much better with the BRNN method than with the Random Forests method. For the machine-learning encodings, ProtVec and ProtVec-3mer achieve significantly better performance than the corresponding values in Table 3-3, which demonstrates the potential of machine-learning encodings.
It is worth noting that ProtVec-3mer performs better than ProtVec with the BRNN algorithm, consistent with the authors' recent work.49 Overall, for the deep learning algorithm BRNN, the position-dependent PSSM encoding still performs best among all encoding methods. Among the position-independent encoding methods, the Micheletti potentials achieve the best performance, which demonstrates that structure-related information has application potential in protein structure and function studies.
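To make the position-dependent PSSM encoding discussed above concrete, the sketch below builds a bare-bones PSSM from a toy MSA. Real pipelines (e.g. PSI-BLAST against UniRef90) use sequence weighting and empirical background frequencies; the pseudocount and the uniform background here are simplifying assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_from_msa(msa, pseudocount=1.0):
    """Position-specific scoring matrix from an aligned MSA: per-column
    amino acid frequencies (with a flat pseudocount) converted to
    log-odds scores against a uniform 1/20 background. Gap characters
    ('-') are simply skipped in this sketch."""
    length = len(msa[0])
    counts = np.full((length, len(AMINO_ACIDS)), pseudocount)
    for seq in msa:
        for i, aa in enumerate(seq):
            if aa in AMINO_ACIDS:
                counts[i, AMINO_ACIDS.index(aa)] += 1.0
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs * len(AMINO_ACIDS))

# Toy MSA: position 0 is fully conserved (M), position 2 varies (V/I).
msa = ["MKVL", "MKIL", "MRVL", "MKVI"]
pssm = pssm_from_msa(msa)   # shape (4, 20): one 20-dim row per position
```

Each residue is thus encoded by a 20-dimensional column of this matrix, which is exactly what makes the encoding position-dependent: the same amino acid gets different vectors at different positions of the alignment.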
Figure 3-4 The architecture of the long short-term memory (LSTM) bidirectional recurrent neural networks for protein secondary structure prediction.
Table 3-4 Protein secondary structure prediction accuracy of 16 amino acid encoding methods by using the BRNN method.
3.5 Assessments of Encoding Methods for Protein Fold Recognition
In addition to protein sequence labeling tasks, protein sequence classification tasks have also received a lot of attention, such as protein remote homology detection60 and protein fold recognition.61,62 Here, we perform another assessment of the 16 selected amino acid encoding methods based on the protein fold recognition task. Many machine learning methods have been developed to classify protein sequences into different fold categories for protein fold recognition.60 The deep learning methods can