Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari

Computational Prediction of Protein Complexes from Protein Interaction Networks - Sriganesh Srihari ACM Books


that using the square root of the path length, $\sqrt{\mathit{path}_{ij}}$, where $\mathit{path}_{ij}$ denotes the path length between i and j, is a good function for the purpose.

      Conditional probabilities $p(d_{ij} \mid \text{interact})$ and $p(d_{ij} \mid \text{noninteract})$ are learned for the distances $d_{ij}$ obtained from the embedding. Here, $p(d_{ij} \mid \text{interact})$ is the probability density function of the distances between pairs of proteins that are known to interact, and $p(d_{ij} \mid \text{noninteract})$ is the probability density function of the distances between pairs of proteins that do not interact in the dataset (and are known not to interact). Given a distance threshold $\delta$, these densities are used to compute the posterior probabilities $p(\text{interact} \mid d_{ij} \le \delta)$ and $p(\text{noninteract} \mid d_{ij} \le \delta)$ that a protein pair interacts or does not interact. For each protein pair (i, j) within the $\delta$-threshold, the weight of the interaction between i and j is then estimated as

[Equation image not preserved in the source.]

      A threshold on this estimated weight is applied to remove false-positive interactions from the network.

       Combining Confidence Scores

      Interactions scored high by all or a majority of confidence-scoring schemes (consensus interactions) are likely to be true interactions. Therefore, a simple majority-voting scheme counts the number of times each interaction is assigned a high score (above the recommended cut-off) by each scheme, and considers only the interactions scored high by a majority of the schemes.
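
      The majority-voting scheme described above can be sketched as follows. This is a minimal illustration, not a published implementation: the function name `majority_vote`, the scheme names, the toy scores, and the 0.5 cut-offs are all hypothetical, and "majority" is assumed to mean strictly more than half of the schemes.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def majority_vote(
    scores: Dict[str, Dict[Edge, float]],
    cutoffs: Dict[str, float],
) -> Set[Edge]:
    """Return interactions scored above the cut-off by a strict majority of schemes."""
    votes: Dict[Edge, int] = {}
    for scheme, edge_scores in scores.items():
        for (u, v), score in edge_scores.items():
            # Treat interactions as undirected: (u, v) and (v, u) are the same edge.
            edge = (u, v) if u <= v else (v, u)
            votes.setdefault(edge, 0)
            if score > cutoffs[scheme]:
                votes[edge] += 1
    n_schemes = len(scores)
    return {edge for edge, count in votes.items() if count > n_schemes / 2}


# Hypothetical toy data: three schemes scoring two interactions.
scores = {
    "schemeA": {("a", "b"): 0.9, ("b", "c"): 0.6},
    "schemeB": {("b", "a"): 0.8, ("b", "c"): 0.1},
    "schemeC": {("a", "b"): 0.2, ("b", "c"): 0.3},
}
cutoffs = {"schemeA": 0.5, "schemeB": 0.5, "schemeC": 0.5}

consensus = majority_vote(scores, cutoffs)
print(consensus)
```

      In this toy input, the interaction between a and b is scored high by two of the three schemes and is retained, whereas the interaction between b and c is scored high by only one scheme and is discarded.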

      Chua et al. [2009] integrated multiple scoring schemes using a naïve Bayesian approach. For an interaction (u, v) that is assigned scores $p_i(u, v)$ by different schemes i (assuming all scores lie in the same range [0, 1]), the combined score can be computed as $1 - \prod_i (1 - p_i(u, v))$. However, even within the same range [0, 1], different scoring schemes usually assign scores with different distributions. Some schemes assign high scores (close to 1) to most or a sizeable fraction of the interactions, whereas other schemes are more conservative and assign low scores to many interactions. To account for this variability in score distributions, it is important to consider the relative ranking of interactions within each scoring scheme instead of their absolute scores.
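
      As a small illustration of the combined score 1 − Π(1 − p) and of why differing score distributions matter, consider the sketch below. The function name `combine_scores` and the example values are hypothetical.

```python
from typing import Iterable


def combine_scores(scheme_scores: Iterable[float]) -> float:
    """Combine per-scheme scores in [0, 1] as 1 - prod(1 - p_i)."""
    product = 1.0
    for p in scheme_scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score {p} is outside [0, 1]")
        product *= 1.0 - p
    return 1.0 - product


# Two moderately confident schemes reinforce each other.
both_moderate = combine_scores([0.5, 0.5])

# A liberal scheme that assigns 0.9 to most interactions dominates the result,
# even when a conservative scheme assigns a low score to the same interaction.
liberal_dominates = combine_scores([0.9, 0.1])

print(both_moderate, liberal_dominates)
```

      Because the combined score can never be lower than the largest individual score, a single scheme that scores generously pushes almost every interaction toward 1, which is the motivation for working with ranks instead of absolute scores.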

      Chua et al. [2009] therefore proposed a rank-based combination scheme, which works as follows. For each scheme i, the scored interactions are first binned in increasing order of their scores: the first 100 interactions are placed in the first bin, the second 100 interactions in the second bin, and so on. For each bin k in scheme i, a weight p(i, k) is assigned based on the number of interactions from the bin that match known interactions from an independent dataset:

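$$p(i, k) = \frac{\text{number of interactions in bin } k \text{ of scheme } i \text{ that match known interactions}}{\text{number of interactions in bin } k \text{ of scheme } i}$$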

      While combining the scores for interaction (u, v), the Bayesian weighting is modified to $1 - \prod_{(i,k) \in D(u,v)} (1 - p(i, k))$, where D(u, v) is the list of scheme-bin pairs (i, k) that contain (u, v) across all schemes. This ensures that, irrespective of the scoring distribution of the schemes, if an interaction belongs to reliable bins, it is assigned a high final score.
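
      The rank-based combination can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the bin weight p(i, k) is assumed to be the fraction of interactions in the bin that match the independent reference set, interactions are assumed to be given in a canonical (sorted) form, and the function names, toy scores, and bin size of 2 (instead of 100) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]
Binning = Tuple[Dict[Edge, int], Dict[int, float]]


def bin_reliabilities(
    scored: Dict[Edge, float], known: Set[Edge], bin_size: int = 100
) -> Binning:
    """Bin one scheme's interactions by increasing score and weight each bin.

    Returns (edge -> bin index, bin index -> assumed weight p(i, k)).
    """
    ranked = sorted(scored, key=lambda edge: scored[edge])
    edge_bin: Dict[Edge, int] = {}
    bin_weight: Dict[int, float] = {}
    for start in range(0, len(ranked), bin_size):
        k = start // bin_size
        members = ranked[start:start + bin_size]
        hits = sum(1 for edge in members if edge in known)
        bin_weight[k] = hits / len(members)  # assumed form of p(i, k)
        for edge in members:
            edge_bin[edge] = k
    return edge_bin, bin_weight


def rank_combined(edge: Edge, schemes: List[Binning]) -> float:
    """Compute 1 - prod(1 - p(i, k)) over the bins that contain the edge."""
    product = 1.0
    for edge_bin, bin_weight in schemes:
        if edge in edge_bin:
            product *= 1.0 - bin_weight[edge_bin[edge]]
    return 1.0 - product


# Hypothetical toy data: one conservative and one liberal scheme.
known = {("c", "d")}
conservative = {("a", "b"): 0.05, ("a", "c"): 0.10, ("c", "d"): 0.20, ("b", "d"): 0.30}
liberal = {("a", "b"): 0.90, ("a", "c"): 0.92, ("c", "d"): 0.97, ("b", "d"): 0.95}

schemes = [
    bin_reliabilities(conservative, known, bin_size=2),
    bin_reliabilities(liberal, known, bin_size=2),
]
print(rank_combined(("c", "d"), schemes), rank_combined(("a", "b"), schemes))
```

      Although the two toy schemes assign very different absolute scores, both place the interaction between c and d in their top bin, so it receives the same bin weight from each scheme and a high final score.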

      Yong et al. [2012] present a supervised maximum-likelihood weighting scheme (SWC) to combine PPI datasets and to infer co-complexed protein pairs. The method uses a naïve Bayes maximum-likelihood model to derive the posterior probability that an interaction (u, v) is a co-complexed interaction based on the scores assigned to (u, v) across multiple data sources. These data sources include PPI databases, namely BioGrid [Stark et al. 2011], IntAct [Hermjakob et al. 2004, Kerrien et al. 2012], MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007], and STRING [Von Mering et al. 2003, Szklarczyk et al. 2011], and evidence from co-occurrence of proteins in PubMed literature abstracts (http://www.ncbi.nlm.nih.gov/pubmed). The set of features is the set of these data sources, and a feature F has value f if proteins u and v are related by data source F with score f, and value 0 otherwise. The features are discretized using minimum description length supervised discretization [Fayyad and Irani 1993]. Using a reference set of protein complexes, each (u, v) in the training set is given the class label co-complex if both u and v are in the same complex; otherwise, it is given the class label non-co-complex. The maximum-likelihood parameters are learned for the two classes,

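$$P(F = f \mid \text{co-complex}) = \frac{N_{c,\,F=f}}{N_c}, \qquad P(F = f \mid \text{non-co-complex}) = \frac{N_{\neg c,\,F=f}}{N_{\neg c}},$$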

      where $N_c$ is the number of interactions with label co-complex, $N_{c,\,F=f}$ is the number of interactions with label co-complex and feature value F = f, and $N_{\neg c}$ and $N_{\neg c,\,F=f}$ are defined likewise for interactions with label non-co-complex. After learning the maximum-likelihood model, the score for each interaction is computed as the posterior probability of being a co-complex interaction under the naïve Bayes assumption that the features are independent:

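$$\mathrm{score}(u, v) = P(\text{co-complex} \mid F_1 = f_1, \ldots, F_n = f_n) = \frac{P(\text{co-complex}) \prod_{i=1}^{n} P(F_i = f_i \mid \text{co-complex})}{Z},$$

      where Z is the normalizing factor:

$$Z = P(\text{co-complex}) \prod_{i=1}^{n} P(F_i = f_i \mid \text{co-complex}) + P(\text{non-co-complex}) \prod_{i=1}^{n} P(F_i = f_i \mid \text{non-co-complex}).$$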
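
      The naïve Bayes scoring described above can be sketched as follows. This is a minimal sketch and not the SWC implementation: it assumes the features have already been discretized, it estimates the class priors from the class proportions in the training set (an assumption not stated in the text), it applies no smoothing, and the function names, feature names, and discretized values ("high", "low") are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Features = Dict[str, str]          # data source -> discretized value
Example = Tuple[Features, bool]    # (features, True if co-complex)


def fit(examples: List[Example]):
    """Count class sizes (N_c) and per-class feature values (N_{c,F=f})."""
    class_counts: Counter = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    for features, label in examples:
        class_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return class_counts, value_counts


def posterior_cocomplex(features: Features, class_counts, value_counts) -> float:
    """Posterior probability of the co-complex label under naive Bayes."""
    total = sum(class_counts.values())
    joint = {}
    for label in (True, False):
        n_label = class_counts[label]
        if n_label == 0:
            joint[label] = 0.0
            continue
        prob = n_label / total  # assumed prior: class proportion in training set
        for name, value in features.items():
            # Maximum-likelihood estimate N_{label,F=f} / N_label (no smoothing).
            prob *= value_counts[(label, name)][value] / n_label
        joint[label] = prob
    z = joint[True] + joint[False]  # normalizing factor Z
    return joint[True] / z if z > 0 else 0.0


# Hypothetical training set with two discretized data-source features.
training = [
    ({"biogrid": "high", "string": "high"}, True),
    ({"biogrid": "high", "string": "low"}, True),
    ({"biogrid": "low", "string": "low"}, False),
    ({"biogrid": "high", "string": "low"}, False),
]
class_counts, value_counts = fit(training)

score_mixed = posterior_cocomplex({"biogrid": "high", "string": "low"}, class_counts, value_counts)
score_strong = posterior_cocomplex({"biogrid": "high", "string": "high"}, class_counts, value_counts)
print(score_mixed, score_strong)
```

      Without smoothing, a feature value that never occurs with one class in the training set drives that class's product to zero, as happens for the second query here; a practical implementation would likely add pseudo-counts to avoid this.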

      Table 2.5 displays some of the publicly available databases that integrate PPI datasets from multiple sources. However, a common problem with integrating multiple scored datasets is the low agreement between the schemes or experiments used to produce these data. Moreover, most scoring methods favor high-abundance proteins and are not effective enough at filtering out common contaminants [Pu et al. 2015]. Therefore, better schemes for scoring and integrating PPI datasets are still needed.

      In addition to the presence of spurious interactions, another limitation in existing PPI datasets is the lack of coverage for true interactions among certain kinds of proteins (the “sparse zone” [Rolland et al. 2014]). This is in part due to limitations in experimental protocols (e.g., washing away of weakly connected proteins during purification of pull-down complexes in TAP experiments), and in part due to the under-representation of certain groups of proteins in these experiments (e.g., membrane proteins). The paucity of true interactions can considerably affect downstream analysis including protein complex prediction. For example, in an analysis by Srihari and Leong [2012a] using protein complexes from MIPS and CYC2008, it was found that many true complexes are embedded in sparse and disconnected regions of the PPI network, thereby altering their dense connectivity and modularity. As we shall see in a subsequent chapter, many computational methods find it difficult to identify these sparse complexes.

      Computational prediction of protein interactions can be a good alternative to experimental protocols for enriching the PPI network with true interactions, and to “densify” regions of the network that are sparsely connected. However, accurate prediction of physical interactions between proteins is a difficult problem in itself, and as several studies have noted [Von Mering et al. 2003, Szklarczyk et al. 2011, Srihari and Leong 2012a] most predicted interactions tend to be “functional associations”—that is, relationships connecting functionally similar pairs of proteins—instead of actual physical interactions between the proteins. Nevertheless, if these functional interactions are successful in “topologically enhancing” the PPI network, these can still aid downstream analysis including protein complex prediction.
