Biological Language Model. Qiwen Dong
Figure 3-5 The architecture of the one-dimensional deep convolution neural network for protein fold recognition.
3.5.1 Benchmark datasets for protein fold recognition
The most commonly used dataset to evaluate protein fold recognition methods is the SCOP database63 and its extended version, the SCOPe database.64 SCOP is a manual structural classification of proteins whose three-dimensional structures have been determined. All of the proteins in SCOP are classified into four hierarchy levels: class, fold, superfamily and family. Folds represent the main characteristics of protein structures, and the protein fold could reveal the evolutionary process between the protein sequence and its corresponding tertiary structure.65 Here we use the F184 dataset, which was constructed by Xia et al.66 based on the SCOPe database. The F184 dataset contains 6451 sequences from 184 folds, with less than 25% sequence identity. Each fold contains at least 10 sequences, which ensures that there are enough sequences for both training and testing. We randomly selected 20% of the sequences from each fold as test data, leaving the remaining 80% as training data. In total, we obtained 5230 sequences for training and 1221 sequences for testing.
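The per-fold split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_fold`, the fixed seed, the rounding rule for the test count, and the toy fold labels are all assumptions, since the text only states that 20% of each fold was selected at random.

```python
import random
from collections import defaultdict


def split_by_fold(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of every fold for testing."""
    rng = random.Random(seed)

    # Group sequence indices by their fold label.
    by_fold = defaultdict(list)
    for idx, fold in enumerate(labels):
        by_fold[fold].append(idx)

    train_idx, test_idx = [], []
    for fold in sorted(by_fold):
        members = by_fold[fold][:]
        rng.shuffle(members)
        # Assumption: round to the nearest integer, but keep at least one test sequence.
        n_test = max(1, round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: three hypothetical folds with 10, 15 and 20 sequences.
labels = ["a.1"] * 10 + ["b.1"] * 15 + ["c.1"] * 20
train_idx, test_idx = split_by_fold(labels)
print(len(train_idx), len(test_idx))
```

Splitting inside each fold, rather than over the pooled dataset, guarantees that every fold is represented in both the training and the test set, which matters when the smallest folds hold only 10 sequences.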
3.5.2 Performances of different encodings on protein fold recognition task
The comparison results of the 16 selected encoding methods for protein fold recognition are listed in Table 3-5. It should be noted that the training process for each encoding method is repeated 10 times to reduce stochastic effects. Unlike in protein secondary structure prediction, the performances of most position-independent encoding methods are similar. All of the binary, physicochemical and machine-learning-based encoding methods (except ProtVec) achieve mean accuracies of about 30%, demonstrating that position-independent encodings offer only limited information for protein fold classification. The two structure-based encodings have better accuracies, near 33%, demonstrating that the structure potential is more closely related to the protein fold type. The two evolution-based methods PAM250 and BLOSUM62 perform best among the 12 position-independent encoding methods, which means that evolutionary information is more tightly coupled with the protein structure. The position-dependent encoding methods PSSM and HMM achieve better performances, especially PSSM. This again indicates that protein evolutionary information is tightly coupled with the protein structure, and that homologous information is more useful than remote homologous information. The machine-learning-based AESNN3 and ANN4D encodings achieve performances comparable to those of other position-independent encoding methods but have much lower dimensions (3 for AESNN3 and 4 for ANN4D), showing their potential for further application. The performance of the ProtVec encoding is poor, which could be caused by the overlapping strategy, as has also been mentioned by the author.9 The ProtVec-3mer encoding performs better, demonstrating the effectiveness of combining ProtVec with 3-mers.
Table 3-5 The performance differences between the various kinds of encodings.
Notes: Top 1: the accuracy calculated in the case that the first predicted fold type is the actual fold type. Top 5: the accuracy calculated in the case that the top 5 predicted fold types contain the actual fold type. Top 10: the accuracy calculated in the case that the top 10 predicted fold types contain the actual fold type. Mean: the mean value of the accuracies on Top 1, Top 5 and Top 10.
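The top-k accuracies defined in the notes can be computed along the following lines. This is a small sketch under stated assumptions: the function name `top_k_accuracy` and the toy score matrix are hypothetical, and the toy problem has only four classes, so it uses k = 1, 2 and 4 where the table uses k = 1, 5 and 10.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Rank class indices from highest to lowest predicted score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy predictions for three sequences over four hypothetical fold classes.
scores = [
    [0.6, 0.3, 0.1, 0.0],  # true class 0 is ranked 1st
    [0.1, 0.5, 0.4, 0.0],  # true class 2 is ranked 2nd
    [0.4, 0.3, 0.2, 0.1],  # true class 3 is ranked 4th
]
true_labels = [0, 2, 3]

top1 = top_k_accuracy(scores, true_labels, 1)
top2 = top_k_accuracy(scores, true_labels, 2)
top4 = top_k_accuracy(scores, true_labels, 4)
mean_accuracy = (top1 + top2 + top4) / 3
print(top1, top2, top4, mean_accuracy)
```

Because a larger k can only add hits, the accuracies are non-decreasing in k, and the mean reported in the table always lies between the Top 1 and Top 10 values.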
It should be noted that the benchmark presented here is based on the DCNN method, and these encodings may perform differently with other machine learning methods. The DCNN method was selected here mainly because it can handle variable-length sequences and has achieved significant success on fold recognition tasks.
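One common way for a one-dimensional convolutional network to accept sequences of any length is to follow the convolution with a global max pooling step, so that every kernel contributes exactly one value. The sketch below illustrates that idea only; it is an assumption for illustration and is not taken from the architecture in Figure 3-5. The function name `conv1d_global_max`, the two-dimensional toy encoding and the kernel weights are all hypothetical.

```python
def conv1d_global_max(encoded, kernels):
    """Slide each kernel over the sequence and keep its maximum response.

    encoded: list of per-residue feature vectors (length L, each of dimension d).
    kernels: list of kernels, each a list of w weight vectors of dimension d.
    Returns one value per kernel, whatever the sequence length L (L >= w).
    """
    features = []
    for kernel in kernels:
        width = len(kernel)
        responses = []
        for start in range(len(encoded) - width + 1):
            window = encoded[start:start + width]
            # Dot product between the kernel and the current window.
            responses.append(sum(
                w * x
                for weights, residue in zip(kernel, window)
                for w, x in zip(weights, residue)
            ))
        # Global max pooling: collapse all positions into a single value.
        features.append(max(responses))
    return features


# Two hypothetical kernels of width 2 over a 2-dimensional residue encoding.
kernels = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
short_seq = [[1, 0], [0, 1], [1, 1]]
long_seq = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 1]]

short_features = conv1d_global_max(short_seq, kernels)
long_features = conv1d_global_max(long_seq, kernels)
print(len(short_features), len(long_features))
```

Both sequences yield a feature vector with one entry per kernel, so a fixed-size classifier can be placed on top without padding or truncating the input.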
3.6 Conclusions
Amino acid encoding is the first step of protein structure and function prediction, and it is one of the foundations for final success in those studies. In this chapter, we proposed a systematic classification of the various amino acid encoding methods and reviewed the methods in each category. According to their information sources and information extraction methodologies, these methods are grouped into five categories: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding and machine-learning encoding. To benchmark and compare different amino acid encoding methods, we first selected 16 representative methods from those five categories. Then, based on two representative protein-related studies, protein secondary structure prediction and protein fold recognition, we constructed three machine learning models with reference to state-of-the-art studies. Finally, we encoded the protein sequences and carried out the same training and test phases on the benchmark datasets for each encoding method. The performance of each encoding method is regarded as an indicator of its potential in protein structure and function studies.
The assessment results show that the evolution-based position-dependent encoding method PSSM consistently achieves the best performance on both the protein secondary structure prediction and the protein fold recognition tasks, suggesting its important role in protein structure and function prediction. However, another evolution-based position-dependent encoding method, HMM, does not perform well, and the main reason for this could be that remote homologous sequences provide only limited evolutionary information for the target residue. The one-hot encoding method is highly sparse and leads to complex machine learning models, while its two compressed representations, one-hot (6-bit) encoding and binary 5-bit encoding, lose some valuable information and cannot be widely used in related research. More reasonable strategies for reducing the dimension of one-hot encoding need to be developed. For the physicochemical property encodings, the variety of properties and the extraction methodologies are two important factors in constructing a valuable encoding. Structure-based encodings and machine-learning encodings achieve comparable or even better performances when compared with other widely used encodings, suggesting that more attention needs to be paid to these two categories.
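The sparsity of one-hot encoding is easy to see in a small example. The sketch below is illustrative only: the alphabetical ordering of the 20 standard amino acids, the function name `one_hot_encode` and the toy sequence are assumptions, and residues outside the standard alphabet are not handled (they raise a `KeyError`).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is an assumption
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode each residue as a 20-dimensional vector with a single 1."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("MKVLA")  # hypothetical 5-residue peptide

# Fraction of zero entries in the encoded matrix.
zeros = sum(value == 0 for row in encoded for value in row)
sparsity = zeros / (len(encoded) * len(AMINO_ACIDS))
print(len(encoded), len(encoded[0]), sparsity)
```

Every residue occupies 20 dimensions of which 19 are zero, so 95% of the input is uninformative padding. By comparison, the text gives dimensions of 3 and 4 per residue for the AESNN3 and ANN4D encodings.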
At a time when the gains from data and algorithms have largely been realized, exploring more effective encoding schemes for amino acids should be a key factor in further improving the performance of protein structure and function prediction. In the following, we provide some perspectives for future related studies. First, updated position-independent encodings should be constructed based on new protein datasets. Except for one-hot encoding, all other position-independent encoding methods construct their encodings from information extracted from native protein sequences or structures. Random errors are unavoidable in those encodings, and larger datasets will help to reduce them. As sequencing and structure determination techniques continue to advance, the number of protein sequences and structures has grown rapidly in recent years. Considering that most of the position-independent encoding methods were proposed a decade ago, it would be valuable to reconstruct them using new datasets. Second, structure-based or function-based encoding methods require more attention. It has been demonstrated that structure-based encoding methods are effective in protein secondary structure prediction and protein fold recognition. These encodings reflect the structural potential