Biological Language Model. Qiwen Dong





model to be overfitting, and thus it needs to be simplified. Meiler et al.37 also tried to use reduced representations of amino acids' physicochemical and statistical properties for protein secondary structure prediction. Zamani and Kremer4 stated that an effective encoding must store the information associated with the problem at hand while discarding superfluous data. In summary, an effective amino acid encoding method should be information-rich and non-redundant. "Information-rich" means the encoding contains enough information highly relevant to protein structure and function, such as physicochemical properties, evolutionary information, contact potentials, and so on. "Non-redundant" means the encoding is compact and does not contain noise or other unrelated information. For example, in neural network-based protein structure and function prediction, a redundant encoding leads to complicated networks with a very large number of weights, which causes overfitting and restricts the generalization ability of the model. Therefore, under the premise of containing sufficient information, a more compact encoding will be more useful and produce better results.

      Over the past two decades, several studies have investigated effective amino acid encoding methods.5 David45 examined the effectiveness of various hydrophobicity scales by using a parallel cascade identification algorithm to assess the structural or functional classification of protein sequences. Zhong et al.46 compared orthogonal encoding, hydrophobicity encoding, BLOSUM62 encoding and PSSM encoding using the Denoeux belief neural network for protein secondary structure prediction. Hu et al.6 combined orthogonal encoding, hydrophobicity encoding and BLOSUM62 encoding to find the optimal encoding scheme, using an SVM with a sliding-window training scheme for protein secondary structure prediction. Their test results show that the combination of the orthogonal and BLOSUM62 matrices achieved the highest accuracy among all the encoding schemes compared. Zamani and Kremer4 investigated the efficiency of 15 amino acid encoding schemes, including orthogonal encoding, physicochemical encoding, and secondary structure- and BLOSUM62-related encodings, by training artificial neural networks to approximate the substitution matrices. Their experimental results indicate that the number (dimension) and the types (properties) of the encodings are the two key factors determining encoding performance. Dongardive and Abraham47 compared the orthogonal, hydrophobicity, BLOSUM62, PAM250 and hybrid encoding schemes for protein secondary structure prediction and found that the best performance was achieved using the BLOSUM62 matrix. These studies explored amino acid encoding methods from different perspectives, but each evaluated only a subset of the encoding methods, on small datasets.
      To present a comprehensive and systematic comparison, in this chapter we perform a large-scale comparative assessment of various amino acid encoding methods on two tasks, protein secondary structure prediction and protein fold recognition, described in the following sections. It should be noted that our aim is to assess how much effective information is contained in the different encoding methods, rather than to explore the optimal combination of encodings.

      In computational biology, protein sequence labeling tasks, such as protein secondary structure prediction, solvent accessibility prediction, disordered region prediction and torsion angle prediction, have received a great deal of attention from researchers. Among these sequence labeling tasks, protein secondary structure prediction is the most representative,48 and several previous amino acid encoding studies have also addressed it.6,35,46,47 Therefore, we first assess the various amino acid encoding methods on the protein secondary structure prediction task.

       3.4.1 Encoding methods selection and generation

      To perform a comprehensive assessment of different amino acid encoding methods, we select 16 representative encoding methods from the various categories for evaluation. A brief introduction of the 16 selected encoding methods is given in Table 3-2. Except for the PSSM and HMM encodings, most of these encodings are position-independent and can be used directly to encode amino acids. It should be noted that some protein sequences may contain unknown amino acid types; if the original encoding does not handle this situation, such amino acids are represented by the average value of the corresponding columns. For ProtVec,9 which is a 3-gram encoding, we encode each amino acid by adding its left and right adjacent amino acids to form the corresponding 3-gram word. Since the start and end amino acids do not have enough adjacent amino acids to form 3-grams, they are represented by the "<unk>" encoding in ProtVec. Recently, further work on ProtVec (ProtVecX49) demonstrated that the concatenation of ProtVec and k-mers can achieve better performance; here, we also evaluate the performance of ProtVec concatenated with 3-mers (named ProtVec-3mer). For the position-dependent encoding methods PSSM and HMM, we follow the common practice for generating them. Specifically, for the PSSM encoding of each protein sequence, we ran the PSI-BLAST26 tool with an e-value threshold of 0.001 and three iterations against the UniRef90 sequence database,50 which is filtered at 90% sequence identity. The HMM encoding is extracted from the HMM profile produced by running HHblits27 against the UniProt20 protein database50 with the parameters "-n 3 -diff inf -cov 60". Following the HHsuite user guide, we use the first 20 columns of the HMM profile and convert the integers in the profile to amino acid emission frequencies using the formula h_fre = 2^(−0.001·h), where h is the integer in the HMM profile and h_fre is the corresponding amino acid emission frequency; h_fre is set to 0 if the profile entry is an asterisk.
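      The 3-gram word formation used for the ProtVec encoding can be sketched as follows. This is a minimal illustration under the scheme described above; the helper name `residue_3grams` is hypothetical, and the actual ProtVec lookup table mapping 3-gram words to vectors is not reproduced here.

```python
def residue_3grams(sequence):
    """Return the 3-gram word used to encode each residue of `sequence`.

    Inner residues are encoded by the 3-gram formed from the left
    neighbor, the residue itself, and the right neighbor; terminal
    residues lack a neighbor and fall back to the "<unk>" word.
    """
    words = []
    for i in range(len(sequence)):
        if i == 0 or i == len(sequence) - 1:
            words.append("<unk>")            # no left/right neighbor available
        else:
            words.append(sequence[i - 1 : i + 2])
    return words

print(residue_3grams("MKVLA"))
# first and last residues map to "<unk>"; inner residues to their 3-grams
```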
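      The HMM profile conversion above amounts to a per-entry transformation; a minimal sketch, assuming profile entries arrive as strings or integers (the helper name `hmm_entry_to_frequency` is illustrative):

```python
def hmm_entry_to_frequency(h):
    """Convert one HMM profile entry to an amino acid emission frequency.

    Integer entries h map to 2^(-0.001 * h); an asterisk entry denotes
    an emission frequency of exactly 0.
    """
    if h == "*":
        return 0.0                       # '*' marks a zero emission frequency
    return 2.0 ** (-0.001 * int(h))

print(hmm_entry_to_frequency(0))         # 1.0  (integer 0 -> frequency 1)
print(hmm_entry_to_frequency(1000))      # 0.5  (2^-1)
print(hmm_entry_to_frequency("*"))       # 0.0
```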

[Table 3-2. The 16 selected amino acid encoding methods (table image not reproduced).]

       3.4.2 Benchmark datasets for protein secondary structure prediction

      Following several representative protein secondary structure prediction works11,42,51 published in recent years, we use the CullPDB dataset52 as training data and four widely used test sets, the CB513 dataset,53 the CASP10 dataset,54 the CASP11 dataset55 and the CASP12 dataset,56 to evaluate the performance of the different features. The CullPDB dataset is a large non-homologous sequence set produced with the PISCES server,52 which culls subsets of protein sequences from the Protein Data Bank based on sequence identity and structural quality criteria. Here, we retrieved a subset of sequences whose structures have better than 1.8 Å resolution and that share less than 25% sequence identity with each other. We also removed sequences sharing more than 25% identity with sequences in the test datasets to ensure there is no homology between the training and test data; the final CullPDB dataset contains 5748 protein sequences with lengths ranging from 18 to 1455. The CB513 dataset contains 513 proteins with less than 25% sequence similarity. The Critical Assessment of techniques for protein Structure Prediction (CASP) is a highly recognized community experiment to determine state-of-the-art methods in protein structure prediction from amino acid sequences56; the recently released CASP10, CASP11 and CASP12 datasets are adopted as test sets. It should be noted that the CASP protein targets used here are split at the protein domain level. Specifically, the CASP10 dataset contains 123 protein domains with sequence lengths ranging from 24 to 498, the CASP11 dataset contains 105 protein domains with lengths ranging from 34 to 520, and the CASP12 dataset contains 55 protein domains with lengths ranging from 55 to 463.

      Protein secondary structure labels are inferred using the DSSP program57 from the corresponding experimentally determined structures. DSSP assigns one of 8 secondary structure states to each residue; here, we adopt 3-state secondary structure prediction as the benchmark task by converting the 8 assigned states to 3: G, H, and I to H; B and E to E; and S, T, and C to C.
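      The 8-to-3 state reduction above is a simple lookup; as a sketch (the dictionary and function names are illustrative, and the coil state is written as "C" as in the text):

```python
# Reduce DSSP's 8 secondary structure states to 3, as described above:
# G/H/I -> H (helix), B/E -> E (strand), S/T/C -> C (coil).
DSSP8_TO_3 = {
    "G": "H", "H": "H", "I": "H",
    "B": "E", "E": "E",
    "S": "C", "T": "C", "C": "C",
}

def reduce_ss8_to_ss3(ss8):
    """Map an 8-state DSSP string to its 3-state equivalent."""
    return "".join(DSSP8_TO_3[s] for s in ss8)

print(reduce_ss8_to_ss3("GHIBESTC"))  # -> "HHHEECCC"
```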
