Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
Binary interactions between the proteins in a pulled-down protein complex are inferred using two models: matrix and spoke. In the matrix model, a binary interaction is inferred between every pair of proteins within the complex, whereas in the spoke model, interactions are inferred only between the bait and each of its preys. Since not all pairs of proteins within a complex necessarily interact, the matrix model usually overestimates the total number of binary interactions, whereas the spoke model underestimates it. Therefore, a balance is usually struck between the two models so that the inferred count is close to the estimated total number of interactions for the species or organism.
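The difference between the two models can be illustrated with a minimal sketch. The function names and the toy purification (bait "A" with preys "B", "C", and "D") are hypothetical and chosen only for illustration; they are not taken from any particular dataset or tool.

```python
from itertools import combinations


def spoke_interactions(bait, preys):
    """Spoke model: pair the bait with each of its preys only."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_interactions(bait, preys):
    """Matrix model: pair every two distinct proteins in the pulled-down complex."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait "A" pulls down preys "B", "C", and "D".
spoke = spoke_interactions("A", ["B", "C", "D"])
matrix = matrix_interactions("A", ["B", "C", "D"])

print(len(spoke), len(matrix))  # 3 6
```

For a pull-down of n proteins, the spoke model yields n - 1 interactions and the matrix model yields n(n - 1)/2, so the gap between the two estimates widens quickly as complexes grow. Every spoke interaction is also a matrix interaction; the matrix model adds the prey-prey pairs.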
Table 1.2 Numbers of mapped physical interactions between proteins across different model and higher-order organisms
Organism | No. of Interactions | No. of Proteins |
A. thaliana | 34,320 | 9,240 |
C. elegans | 5,783 | 3,269 |
D. rerio | 188 | 181 |
D. melanogaster | 36,741 | 8,071 |
E. coli | 99 | 104 |
H. sapiens | 230,843 | 20,006 |
M. musculus | 18,465 | 8,611 |
R. norvegicus | 4,537 | 3,328 |
S. cerevisiae | 82,327 | 6,278 |
S. pombe | 9,492 | 2,944 |
X. laevis | 532 | 471 |
Based on BioGrid version 3.4.130 (November 2015) [Stark et al. 2011, Chatr-Aryamontri et al. 2015].
Despite differences in procedures and technologies, the different experimental protocols can effectively complement one another in detecting interactions. While TAP can be more specific and detect mainly stable (co-complexed) protein interactions, Y2H can be more exhaustive and detect even transient and between-complex interactions. Based on BioGrid version 3.4.130 (November 2015) (http://thebiogrid.org/) [Stark et al. 2011, Chatr-Aryamontri et al. 2015], the numbers of mapped physical interactions range from 99 in E. coli to ~82,300 in S. cerevisiae and ~230,800 in H. sapiens (summarized in Table 1.2). It remains to be seen how many of these interactions actually occur in the physiological contexts of living cells or cell types, how many are subject to genetic and physiological variations, and how many still remain to be mapped.
The binary interactions inferred from the different experiments are assembled into a protein-protein interaction network, or simply, PPI network. The PPI network presents a global or “systems” view of the interactome, and provides a mathematical (topological) framework to analyze these interactions. Protein complexes are expected to be embedded as modular structures within the PPI network [Hartwell et al. 1999, Spirin and Mirny 2003]. Topologically, this modularity refers to densely connected subsets of proteins separated by less-dense regions in the network [Newman 2004, Newman 2010]. Biologically, this modularity represents division of labor among the complexes, and provides robustness against disruptions to the network from internal (e.g., mutations) and external (e.g., chemical attacks) agents. Computational methods developed to identify protein complexes therefore mine for modular subnetworks in the PPI network. While this strategy appears reasonable in general, limitations in PPI datasets, arising from the shortcomings in experimental protocols highlighted above, severely restrict the feasibility of accurately predicting complexes from the network. Specifically, the limitations in existing PPI datasets that directly impact protein complex prediction include:
1. presence of a large number of spurious (noisy) interactions;
2. relative paucity of interactions between “complexed” proteins; and
3. missing contextual—e.g., temporal and spatial—information about the interactions.
These limitations translate to the following three main challenges currently faced by computational methods for protein complex prediction:
1. difficulty in detecting sparse complexes;
2. difficulty in detecting small complexes (containing fewer than four proteins) and sub-complexes; and
3. difficulty in deconvoluting overlapping complexes (i.e., complexes that share many proteins), especially when these complexes occur under different cellular contexts.
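To make the notions of "densely connected" subsets and "sparse" complexes concrete, the following minimal sketch computes the density of a candidate protein set. It assumes the common graph-theoretic definition (the fraction of possible protein pairs that are actually linked), which the text does not spell out. The toy network and the `density` helper are hypothetical.

```python
from itertools import combinations


def density(proteins, interactions):
    """Fraction of possible pairs within `proteins` that are joined by an interaction."""
    members = sorted(set(proteins))
    n = len(members)
    if n < 2:
        return 0.0  # density is undefined for fewer than two proteins; report 0
    present = sum(
        1 for a, b in combinations(members, 2) if frozenset((a, b)) in interactions
    )
    return present / (n * (n - 1) / 2)


# Hypothetical toy PPI network: a triangle A-B-C joined to a chain D-E-F-G.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("F", "G")]
interactions = {frozenset(edge) for edge in edges}

dense_score = density({"A", "B", "C"}, interactions)        # 3 of 3 pairs linked
sparse_score = density({"D", "E", "F", "G"}, interactions)  # 3 of 6 pairs linked

print(dense_score, sparse_score)  # 1.0 0.5
```

If the chain D-E-F-G were a genuine complex whose remaining interactions had simply not been mapped, a method that searches only for high-density regions would likely overlook it, which is the first challenge listed above. Note also that any single interacting pair has density 1.0 under this definition, so density alone says little about very small candidate sets.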
While the interactome coverage can be improved by integrating multiple PPI datasets, the lack of agreement between the datasets from different experimental protocols [Von Mering et al. 2002, Bader et al. 2004], and the multifold increase in accompanying noise (spurious interactions), tend to cancel out the advantage gained from the increased coverage. Consequently, the confidence of each interaction has to be assessed (confidence scoring) and low-confidence interactions have to be first removed from the datasets (filtering) before performing any downstream analysis. To summarize, computational identification of protein complexes from interaction datasets follows these steps (Figure 1.1):
1. integrating interactions from multiple experiments and stringently assessing the confidence (reliability) of these interactions;
2. constructing a reliable PPI network using only the high-confidence interactions;
Figure 1.1 Identification of protein complexes from protein interaction data. (a) A high-confidence PPI network is assembled from physical interactions between proteins after