Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari

ACM Books


miss many true interactions or produce too many spurious interactions. Therefore, a combination of the spoke and matrix models is used, where a balance is sought between the two models using weighting of interactions (C). Interactions with low weights are discarded to give the final set of high-confidence inferred interactions (D).

       Spurious Interactions

      Spurious or false-positive interactions in high-throughput screens may arise from technical limitations in the underlying experimental protocols, or from limitations in the (computational) inference of interactions from the screen. For example, the Y2H system, despite being in vivo, does not consider localization (compartmentalization), time, or cellular context while testing for binding partners. Since all proteins are tested within one compartment (the nucleus), there is a high chance that two proteins that belong to different compartments, and are therefore unlikely to meet during their lifetimes in live cells, test positive for interaction. Similarly, in vitro TAP pull-downs are carried out using cell lysates in an environment where every protein is present in the same “uncompartmentalized soup” [Mackay et al. 2007, Welch 2009]. Therefore, even though two proteins interact under these laboratory conditions, it is not certain that they will ever meet or interact during their lifetimes in live cells. There is also ample opportunity for “sticky” molecules to act as bridges between proteins, causing these proteins to interact promiscuously with partners that they never interact with in live cells [Mackay et al. 2007]. Once these complexes are pulled down, the model used to infer binary interactions (between bait and prey, or between preys) can also result in the inference of spurious interactions (further discussed below). Recent analyses showed that only 30–50% of interactions inferred from high-throughput screens actually occur within cells [Shoemaker and Panchenko 2007, Welch 2009], while the remaining interactions are false positives.

       Missing Interactions and Lack of Concordance Between Datasets

      Comparisons between datasets from different techniques have shown a striking lack of concordance, with each technique producing a unique distribution of interactions [Shoemaker and Panchenko 2007, Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Moreover, certain interactions depend on post-translational modifications such as disulfide-bridge formation, glycosylation, and phosphorylation, which may not be supported in the adopted system. Many of these techniques also show a bias toward abundant proteins (e.g., soluble proteins) and a bias against certain kinds of proteins (e.g., membrane proteins). For example, AP/MS screens predict relatively few interactions for proteins involved in transport and sensing (trans-membrane proteins), while Y2H screens, being targeted in the nucleus, fail to cover extracellular proteins [Shoemaker and Panchenko 2007]. These limitations effectively result in a considerable number of missed interactions in interactome datasets.

      Based on the above limitations, Welch [2009] summed up the status of interactome maps as “fuzzy,” i.e., error-prone, yet filled with promise.

       Estimating Reliabilities of Interactions

      The coverage of true interactions can be increased by integrating datasets from multiple experiments. This integration ensures that all or most regions of the interactome are sufficiently represented in the PPI network. However, overcoming spurious interactions remains a challenge, and one that is magnified when datasets are integrated. It therefore becomes necessary to estimate the reliabilities of interactions, so that only the highly reliable interactions are kept while the spurious or less reliable ones are discarded.

      Confidence or reliability scoring schemes assign a score (weight) to each interaction in the PPI network. For an interaction (u, v) ∈ E in the scored (weighted) PPI network G = 〈V, E, w〉, the score w(u, v) encodes the confidence in the physical interaction between the two proteins u and v. The scoring function w: V × V → R accounts for the biological variability and technical limitations of the experiments used to infer the interactions. The scoring schemes can be classified into three broad categories (Table 2.4): (i) sampling or counting-based, (ii) biological evidence-based, and (iii) topology-based schemes.
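      The weighted network G = 〈V, E, w〉 described above, together with the step of discarding low-confidence interactions, can be sketched as follows. This is a minimal illustration only: the protein names, the scores, the 0.5 cut-off, and the helper names `edge` and `filter_by_confidence` are all hypothetical and are not taken from any of the scoring schemes discussed in the text.

```python
from typing import Dict, FrozenSet

# A weighted PPI network: each undirected interaction maps to its score w(u, v).
WeightedPPI = Dict[FrozenSet[str], float]


def edge(u: str, v: str) -> FrozenSet[str]:
    """Order-independent key, so (u, v) and (v, u) denote the same interaction."""
    return frozenset((u, v))


def filter_by_confidence(network: WeightedPPI, threshold: float) -> WeightedPPI:
    """Keep only the interactions whose score is at least `threshold`."""
    return {pair: w for pair, w in network.items() if w >= threshold}


# Hypothetical toy network; names and scores are made up for illustration.
toy_network: WeightedPPI = {
    edge("A", "B"): 0.92,
    edge("B", "C"): 0.15,
    edge("A", "C"): 0.60,
}

# Assumed cut-off of 0.5; in practice the threshold depends on the scoring scheme.
high_confidence = filter_by_confidence(toy_network, threshold=0.5)
```

      Here `high_confidence` retains the A-B and A-C interactions and drops B-C. Keying interactions by `frozenset` is one simple way to treat the network as undirected.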

Classification               Scoring Scheme                                      Reference
Sampling or counting based   Bootstrap sampling                                  [Friedel et al. 2009]
                             Comparative Proteomic Analysis (ComPASS)            [Sowa et al. 2009]
                             Dice coefficient^a                                  [Zhang et al. 2008]
                             Hypergeometric sampling                             [Hart et al. 2007]
                             Significance Analysis of INTeractions (SAINT)       [Choi et al. 2011, Teo et al. 2014]
                             Socio-affinity scoring                              [Gavin et al. 2006]
Independent evidence-based   Bayesian networks and C4.5 decision trees           [Krogan et al. 2006]
                             Topological Clustering Semantic Similarity (TCSS)   [Jain and Bader 2010]
                             Purification Enrichment (PE)                        [Collins et al. 2007]
Topology-based               Collaborative Filtering (CF)                        [Luo et al. 2015]
                             Functional Similarity (FS) Weight^a                 [Chua et al. 2006]
                             Geometric embedding                                 [Pržulj et al. 2004, Higham et al. 2008]
                             Iterative Czekanowski-Dice (ICD) distance^a         [Liu et al. 2008]
                             PageRank affinity                                   [Voevodski et al. 2009]

      a. The Dice coefficient, FS Weight, and Iterative CD scoring schemes can also be considered independent evidence-based schemes: if a pair of proteins has several common partners, then these proteins most likely perform the same or similar functions and/or are present in the same cellular compartment (a form of biological evidence).
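      The "common partners" idea in the footnote can be made concrete with the standard Dice coefficient on neighbor sets, 2|N(u) ∩ N(v)| / (|N(u)| + |N(v)|). The sketch below assumes this plain set-based form; the cited schemes may differ in detail (for example, some variants include each protein in its own neighborhood). The adjacency data and the function name `dice_coefficient` are hypothetical.

```python
from typing import Dict, Set


def dice_coefficient(adj: Dict[str, Set[str]], u: str, v: str) -> float:
    """Dice coefficient of the neighbor sets of u and v.

    Assumes the plain form 2|N(u) & N(v)| / (|N(u)| + |N(v)|), where
    N(x) does not include x itself.
    """
    nu = adj.get(u, set())
    nv = adj.get(v, set())
    total = len(nu) + len(nv)
    if total == 0:
        return 0.0  # neither protein has any partner
    return 2.0 * len(nu & nv) / total


# Hypothetical toy adjacency lists (undirected, so listed in both directions).
toy_adj: Dict[str, Set[str]] = {
    "A": {"C", "D", "E"},
    "B": {"C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}

score_ab = dice_coefficient(toy_adj, "A", "B")
score_ce = dice_coefficient(toy_adj, "C", "E")
```

      In this toy network A and B share two of their five listed partners, so `score_ab` is 2·2/(3+2) = 0.8, even though A and B are not themselves directly connected.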

       Sampling or Counting-Based Schemes

      These schemes estimate the confidence of protein pairs by comparing the number of times each protein pair is observed to interact across multiple trials against what would be expected by chance given the abundance of each protein in the library. If the protein pairs come from the same experiment, the counting is performed across multiple purifications of the experiment. Given multiple PPI datasets, this idea can be extended to score interacting pairs by comparing the number of times each pair is observed across the different datasets against what would be expected at random given the number of times these proteins appear across the datasets. However, if the PPI datasets come from different experiments (e.g., Y2H and TAP/MS-based), which is usually the case, then it is useful to incorporate the relative reliability of each experimental technique or dataset source into this computation. For example, if Y2H is believed to be less reliable than TAP/MS-based techniques, then protein pairs can be assigned lower weights when observed in Y2H datasets and higher weights when observed in TAP/MS datasets.
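      The observed-versus-expected idea, with per-technique reliability weights, can be sketched as below. This is an illustrative toy and not one of the published schemes: the chance model (the product of the two proteins' degrees divided by twice the number of interactions in each dataset), the log-ratio score, the pseudocount, the reliability values, and the function name `counting_score` are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, Set

Pair = FrozenSet[str]


def counting_score(
    u: str,
    v: str,
    datasets: Dict[str, Set[Pair]],
    reliability: Dict[str, float],
    pseudocount: float = 0.5,
) -> float:
    """Log-ratio of reliability-weighted observed vs. expected co-occurrence."""
    pair = frozenset((u, v))
    observed = 0.0
    expected = 0.0
    for name, pairs in datasets.items():
        r = reliability[name]  # assumed relative reliability of this source
        if pair in pairs:
            observed += r
        m = len(pairs)
        if m > 0:
            # How often each protein appears in this dataset.
            degree = Counter(p for interaction in pairs for p in interaction)
            # Assumed chance model: probability grows with both degrees.
            expected += r * min(1.0, degree[u] * degree[v] / (2.0 * m))
    return math.log((observed + pseudocount) / (expected + pseudocount))


def pair_of(u: str, v: str) -> Pair:
    return frozenset((u, v))


# Hypothetical datasets and reliabilities (Y2H assumed less reliable here).
toy_datasets: Dict[str, Set[Pair]] = {
    "y2h": {pair_of("A", "B"), pair_of("A", "C"), pair_of("C", "D")},
    "tap_ms": {pair_of("A", "B"), pair_of("B", "C"), pair_of("D", "E")},
}
toy_reliability = {"y2h": 0.5, "tap_ms": 1.0}

score_ab = counting_score("A", "B", toy_datasets, toy_reliability)  # seen in both
score_ac = counting_score("A", "C", toy_datasets, toy_reliability)  # seen in Y2H only
score_ae = counting_score("A", "E", toy_datasets, toy_reliability)  # never seen
```

      With these toy inputs the pair seen in both datasets scores highest, the pair seen only in the lower-weighted Y2H dataset scores lower, and the pair never observed scores below zero.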

      In the study by Gavin et al. [2006], a “socio-affinity” scheme based on this counting idea was used to estimate confidence for interactions inferred from pulled-down complexes detected from TAP purifications. The interactions within the pulled-down complexes are inferred as a combination of spoke and matrix-modeled relationships. A socio-affinity index SA(u, v) then quantifies the tendency for two proteins u and v to identify each other when tagged (spoke model, S) and to co-purify when other proteins are tagged (matrix model, M):

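$$SA(u, v) = S_{u,v \mid u=\text{bait}} + S_{u,v \mid v=\text{bait}} + M_{u,v},$$

$$S_{u,v \mid u=\text{bait}} = \log \frac{n_{u,v \mid u=\text{bait}}}{f_u^{\text{bait}} \, n_{\text{bait}} \, f_v^{\text{prey}} \, n^{\text{prey}}_{u=\text{bait}}}, \qquad M_{u,v} = \log \frac{n^{\text{prey}}_{u,v}}{f_u^{\text{prey}} \, f_v^{\text{prey}} \sum_{\text{all baits}} n_{\text{prey}} (n_{\text{prey}} - 1)/2},$$

      where, for the spoke model (S), $n_{u,v \mid u=\text{bait}}$ is the number of times that u retrieves v when u is tagged; $f_u^{\text{bait}}$ is the fraction of purifications in which u was the bait; $f_v^{\text{prey}}$ is the fraction of all retrieved preys that were v; $n_{\text{bait}}$ is the total number of purifications (i.e., using baits); and $n^{\text{prey}}_{u=\text{bait}}$ is the number of preys retrieved with u as bait. These terms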
