Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
Чтение книги онлайн.
Читать онлайн книгу Computational Prediction of Protein Complexes from Protein Interaction Networks - Sriganesh Srihari страница 16
These schemes use external or independence evidence to estimate confidence of interactions in the PPI dataset. For example, these evidence may include Gene Ontology (GO) annotations [Ashburner et al. 2000, Mi et al. 2013] for protein functions and localization (compartmentalization), and co-complex memberships in validated complexes. Some of the methods are learning-based; for example, given a (training) set of known interacting pairs and GO annotations, the methods learn (train on) the conditional probability distribution for interacting pairs with and without similar GO annotations (e.g., similar functions or localization), and using this learned distribution, the methods estimate the probability of interaction for the proteins in each pair. In Krogan et al.’s study [Krogan et al. 2006], a machine learning approach using Bayesian networks and C4.5-decision trees trained on validated physical interactions and functional evidence—co-occurrence in manually curated complexes from MIPS—was used to estimate confidence for protein pairs in a spoke-modeled experimental dataset. Collins et al. [2007] developed Purification Enrichment (PE) scoring to generate a “Consolidated network” using the matrix-modeled relationships from Gavin et al. and Krogan et al. datasets. The PE scoring is based on a Bayes classifier trained on manually curated co-complexed protein pairs, GO annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles. The Consolidated network was shown to be of high quality, comparable to that of PPIs derived from low-throughput experiments.
In other methods, explicit learning may not be involved, but instead the evidence is directly used to assess the interaction confidence of protein pairs. For example, Resnick’s measure [Resnick 1995] for computing the semantic similarity between annotation terms has been adopted to compute confidence based on GO annotations between the proteins [Xu et al. 2008, Pesquita et al. 2008, Jain and Bader 2010]. Specifically, the semantic similarity between two ontology terms (S, T) having a set A of common ancestors in the GO graph is given as the information content,
where p(A) is the fraction of proteins annotated to term A and all its descendants in the GO graph. Suppose that proteins u and v are annotated to sets of GO terms S and T, respectively. Then the semantic similarity between u and v is defined as the maximum information content (Resnick’s measure) of the set S × T,
The interaction confidence between u and v is then estimated as sim(u, v).
GO graphs tend to be unbalanced with some paths containing more details (depth) compared to others, which stems from the complex biological structure of the GO annotations. However, this creates a bias against terms that do not represent such complex structures, i.e., terms that do not have sufficient depth in the GO graph. To account for this topological imbalance of the GO graph, Jain and Bader [2010] developed Topological Clustering Semantic Similarity (TCSS) which collapses subgraphs that define similar concepts. Terms that are lower down the GO tree have higher information content (i.e., more specific) than the terms at higher levels (i.e., less specific). A cut-off is used to identify subgraphs—subgraph root terms and all their children—with high information content. Since GO terms often have multiple parents, it is likely that this results in overlapping subgraphs. Overlaps between subgraphs are removed in two steps: edge removal by transitive reduction and term duplication. In the edge-removal step, a reduction is performed on the subgraphs: If nodes u and v are connected both via a directed edge u → v as well as a directed path u → w1, …, wk → v, then a transitive reduction is performed to preserve only u → v. After this step, if a term still belongs to more than one subgraph, then the term and all its descendents are duplicated across the subgraphs. The similarity between two proteins is measured using this reduced GO graph using Resnick’s similarity between their GO terms, as described above.
Topology-Based Schemes
These schemes analyze the topology of the PPI network—usually in the immediate neighborhood of each protein pair—to estimate interaction confidence for the protein pairs. If the proteins in an interacting pair have many common neighbors in the network, the proteins and their shared neighbors have similar functions and/or are co-localized [Batada et al. 2004], and it is likely that the observed interaction between the proteins is true. This can be understood from the following simple example. Two proteins need to be localized to the same compartment to interact physically. Let us assume that a PPI screen with a false-positive rate of p reports that protein A interacts with C1, …, Cn and D1, …, Dm, and another protein B interacts with C1, …, Cn and E1, …, Em',. Suppose that each of these proteins can be localized to say h subcellular compartments with equal chance. The probability that A and B interact with a Ci in two different places is therefore x = (1 − p) · (1 − p) · (h − 1)/h. Thus, the probability that A and B interact with each C1, …, Cn in two different compartments is xn, and the probability that A and B interact with some Ci in the same compartment is 1 − xn, which monotonically increases with n. That is, the more common partners A and B have (as reported by the screen), the higher the chance they are in the same compartment. Thus, the more common partners the two proteins have in the PPI network, the higher the chance they satisfy the co-localization requirement for interaction. In other words, topology-based schemes based on counting common partners can be viewed as ways to guess whether A and B are likely to be in the same compartment (without needing to know explicitly which compartment); that is, these indices are in fact based on exploiting the biological fact that two proteins must be in the same compartment to physically interact (i.e., a piece of biological information). Therefore, Functional Similarity Weighting (FS Weight) [Chua et al. 2006], Iterative Czekanowski-Dice (CD) weight [Liu et al. 2008], and Dice coefficient [Zhang et al. 2008] can be classified as both topology-based as well as biological evidence-based schemes.
FS Weight, proposed by Chua et al. [2006], is inspired from the graph theory measure Czekanowski-Dice (CD) distance, which is given by
where Nu includes u and all the neighbors of u, similarly for Nv, and NuΔNv = (Nu − Nv) ∪ (Nv − Nu) is the symmetric difference between the sets of neighbors Nu and Nv. CD is a distance (dissimilarity) measure. If Nu = Nv, then CD(u, v) = 0; and if Nu ∩ Nv = ∅, then CD(u, v) = 1.
FS Weight takes inspiration from CD to use common neighbors, and estimates the confidence of the physical interaction between u and v based on their common neighbors. Chua et al. [2006] show that proteins share functions with their strictly indirect neighbors more than with their strictly direct neighbors, and proteins share functions with even higher likelihood with proteins that are simultaneously their direct and indirect neighbors than with proteins that are either strictly their direct or strictly their indirect neighbors. The weight FS(u, v) of the interaction between u and v is estimated as
where λu,v and λv,u are used to penalize protein pairs with very few neighbors, and are given by
where navg is the average number of level-1 neighbors that a protein has in the network. FS Weight thus assigns higher weight to protein pairs with larger