Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari

Computational Prediction of Protein Complexes from Protein Interaction Networks - Sriganesh Srihari ACM Books


ancestor species, and that gene fusion has occurred in another species to optimize the transcription and to produce a single multidomain protein [Marcotte et al. 1999]. These fused proteins are referred to as chimeric or Rosetta Stone proteins [Marcotte et al. 1999]. The Rosetta Stone approach [Enright and Ouzounis 2001, Suhre 2007] infers protein interactions by detecting fusion events between protein sequences across species. In E. coli, this approach identified 6,809 putative interacting pairs of proteins, wherein both proteins from each pair had significant sequence similarity to a single (fused) protein from at least one other species (genome). The analysis of these interacting pairs revealed that, for more than half of these pairs, both the proteins were functionally related [Marcotte et al. 1999].
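The fusion-detection idea can be illustrated with a minimal Python sketch. It assumes that sequence-similarity hits have already been computed by some alignment tool and are supplied as tuples; the tuple layout, the function name `rosetta_stone_pairs`, and the `max_overlap` parameter are hypothetical choices made for illustration, not part of the published method. Two query proteins are paired when both align to the same protein from another genome over essentially non-overlapping regions, which is the signature of a fused (Rosetta Stone) protein.

```python
from collections import defaultdict
from itertools import combinations

def rosetta_stone_pairs(hits, max_overlap=0):
    """Pair query proteins that align to distinct regions of one candidate fusion protein.

    hits: iterable of (query, candidate, start, end), where start/end give the
    aligned region (inclusive) on the candidate protein from another genome.
    Returns {frozenset({q1, q2}): set of candidate fusion proteins}.
    """
    by_candidate = defaultdict(list)
    for query, candidate, start, end in hits:
        by_candidate[candidate].append((query, start, end))

    pairs = defaultdict(set)
    for candidate, regions in by_candidate.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Length of the shared aligned region; zero or negative means disjoint.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                pairs[frozenset((q1, q2))].add(candidate)
    return dict(pairs)

# Toy hits (hypothetical identifiers): A and B cover different parts of F.
example_hits = [
    ("A", "F", 1, 100),
    ("B", "F", 120, 250),
    ("C", "F", 50, 150),
    ("A", "G", 1, 90),
    ("D", "G", 10, 80),
]
predicted = rosetta_stone_pairs(example_hits)
```

In this toy input only A and B are paired, because C overlaps both of them on F, and A and D align to the same region of G, which suggests homology rather than fusion.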

      PPI Network Topology. The pattern of interactions between proteins in a PPI network reflects how the proteins are functionally organized, and provides a way to predict new interactions. For example, if a pair of proteins has many common neighbors in the PPI network, then the two proteins and their common neighbors are most likely involved in the same or similar function(s). Therefore, one may infer a direct physical interaction between the two proteins based on the number of neighbors and/or functions they share. Chua et al. [2006] used the FS Weight interaction-scoring approach in this manner to predict interactions between level-2 neighbors (proteins connected via one other protein) in the PPI network. This is based on the observation that level-2 neighbors in the PPI network often show the same or similar annotations for function and/or cellular compartment, and are therefore more likely to interact than random pairs of proteins in the network. The FS-weighted predicted interactions between level-2 neighbors are added to the PPI network after low-weighted interactions have been removed. Using the same rationale, one can predict new interactions using other topology-based (common-neighbor counting) schemes, including the Dice coefficient [Zhang et al. 2008] and Iterative CD [Liu et al. 2008]. Likewise, the geometric embedding model [Pržulj et al. 2004, Higham et al. 2008] can be used to predict new interactions: proteins that are ϵ-close in the geometric embedding of the PPI network are more likely to interact than random pairs of proteins or proteins that are farther than ϵ apart in the embedding.
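As a rough illustration of common-neighbor scoring, the sketch below computes a plain set-based Dice coefficient, 2|N(u) ∩ N(v)| / (|N(u)| + |N(v)|), for level-2 neighbors in an undirected network. This is a simplified stand-in and not the FS Weight or Iterative CD formulation; the function names and the `threshold` parameter are assumptions made for the example.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map {protein: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def dice(adj, u, v):
    """Dice coefficient of the neighbor sets of u and v."""
    total = len(adj[u]) + len(adj[v])
    return 2 * len(adj[u] & adj[v]) / total if total else 0.0

def predict_level2(edges, threshold=0.5):
    """Score non-adjacent pairs that share at least one neighbor (level-2 neighbors)."""
    adj = build_adjacency(edges)
    predictions = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u] or not (adj[u] & adj[v]):
            continue  # skip existing interactions and pairs with no common neighbor
        score = dice(adj, u, v)
        if score >= threshold:
            predictions.append((u, v, score))
    return sorted(predictions, key=lambda item: -item[2])

# Toy network with hypothetical protein names.
toy_edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("B", "E")]
toy_predictions = predict_level2(toy_edges)
```

Here the pair (C, D) scores highest because the two proteins have identical neighbor sets, while (A, B) scores 0.8 because B has one neighbor that A lacks.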

      Functional Features. Interacting proteins are often involved in the same or similar functions. Therefore, if a pair of proteins is annotated with the same or similar functions, one could, with some degree of accuracy, infer a physical interaction between the two proteins. This is often referred to as “guilt by association,” the principle that genes or proteins with related functions tend to share properties such as genetic or physical interactions [Oliver 2000]. This inference can be strengthened by combining other evidence that supports their functional similarity: for example, if the genes coding for the two proteins are located close together on the genome or are transcribed as an operonic unit (for prokaryotes) [Dandekar et al. 1998, Kumar et al. 2002], if the coding genes are co-transcribed or co-expressed [Huynen et al. 2000, Bowers et al. 2004, Jansen et al. 2002], or if they show similar phylogenetic profiles [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. Proteins within the same protein complex (co-complexed proteins) show a strong tendency to share functions and cellular localization, and are therefore likely to physically interact. On the other hand, proteins from different cellular compartments most likely do not meet and therefore do not interact in vivo during their lifetimes. Jansen et al. [2003] used interactions between co-complexed proteins from the MIPS protein complex catalog [Mewes et al. 2006] as the positive training set, and non-interacting pairs of proteins as the negative training set, in a Bayesian framework, to predict new interactions in yeast. Blohm et al. [2014] present a dataset, the “Negatome,” of protein pairs that are highly unlikely to interact, which can be used as a negative training set.
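The way positive and negative training sets feed a Bayesian prediction can be sketched as follows. This is a generic naive Bayes likelihood-ratio calculation under the assumption that features are categorical and conditionally independent; it is not a reproduction of the Jansen et al. implementation, and the function names, the `pseudocount` smoothing, and the toy feature values are all hypothetical.

```python
from collections import Counter

def likelihood_ratios(pos_values, neg_values, pseudocount=1.0):
    """Estimate P(value | interacting) / P(value | non-interacting) per feature value.

    pos_values / neg_values: one observed feature value per training pair.
    A pseudocount avoids division by zero for values unseen in one set.
    """
    pos_counts, neg_counts = Counter(pos_values), Counter(neg_values)
    values = set(pos_counts) | set(neg_counts)
    k = len(values)
    ratios = {}
    for value in values:
        p_pos = (pos_counts[value] + pseudocount) / (len(pos_values) + pseudocount * k)
        p_neg = (neg_counts[value] + pseudocount) / (len(neg_values) + pseudocount * k)
        ratios[value] = p_pos / p_neg
    return ratios

def posterior_odds(prior_odds, observed, ratio_tables):
    """Multiply the prior odds by the likelihood ratio of each observed feature."""
    odds = prior_odds
    for feature, value in observed.items():
        odds *= ratio_tables[feature][value]
    return odds

# Toy training data: does the protein pair share a cellular compartment?
positives = ["same"] * 8 + ["different"] * 2   # e.g., co-complexed pairs
negatives = ["same"] * 2 + ["different"] * 8   # e.g., pairs assumed not to interact
tables = {"localization": likelihood_ratios(positives, negatives)}
odds = posterior_odds(0.01, {"localization": "same"}, tables)
```

A pair is predicted to interact when its posterior odds exceed a chosen cutoff; with several independent features, each one contributes its own multiplicative factor.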

      The Gene Ontology (GO) graph [Ashburner et al. 2000] integrates information on the functional and localization properties of proteins, and therefore provides a way to predict new interactions. For example, the TCSS approach by Jain and Bader [2010] can be used to compute the similarity between pairs of proteins using the GO graph, and protein pairs showing high GO-semantic similarity can be predicted to physically interact. Likewise, multiple pieces of experimental and functional information can be combined to predict new interactions. For example, GeneMANIA (http://www.genemania.org/) [Warde-Farley et al. 2010] combines experimentally detected interactions from BioGRID [Stark et al. 2011, Chatr-Aryamontri et al. 2015], pathway annotations from Pathway Commons (http://www.pathwaycommons.org/) [Cerami et al. 2011], and information on evolutionary conservation of interactions from the Interologous Interaction Database (I2D) [Brown and Jurisica 2005], along with GO-based similarity, to predict new interactions (GeneMANIA and I2D are also listed in Table 2.5). HumanNet [Lee et al. 2011] is a human functional interaction network that includes predicted interactions based on guilt by association for genes involved in human diseases.
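To make the notion of GO-based similarity concrete, the following sketch scores two proteins by the overlap of the ancestor sets of their annotated terms in a small term hierarchy. It is deliberately much simpler than TCSS and is not that method: it uses a Jaccard index over ancestor sets, and the toy hierarchy, term labels, and function names are invented for the example.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors in the hierarchy."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def term_similarity(t1, t2, parents):
    """Jaccard index of the ancestor sets of two terms (1.0 for identical terms)."""
    a1, a2 = ancestors(t1, parents), ancestors(t2, parents)
    return len(a1 & a2) / len(a1 | a2)

def protein_similarity(terms1, terms2, parents):
    """Best term-to-term similarity between two proteins' annotation sets."""
    return max(
        (term_similarity(a, b, parents) for a in terms1 for b in terms2),
        default=0.0,
    )

# Toy hierarchy (child -> list of parents); labels are illustrative, not GO IDs.
toy_parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
sim_close = protein_similarity({"dna_binding"}, {"rna_binding"}, toy_parents)
sim_far = protein_similarity({"dna_binding"}, {"catalysis"}, toy_parents)
```

Terms that share a specific parent score higher than terms that share only the root, so pairs above a chosen similarity cutoff could be put forward as candidate interactions.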

      Structural Information on Proteins. 3D structures of proteins provide first-hand evidence for protein interaction sites and binding surfaces. Therefore, by assessing the compatibility of the binding surfaces of two proteins, one can predict whether the two proteins interact. For example, Zhang et al. [2012, 2013] analyzed 3D structures of proteins from the Protein Data Bank (PDB) (http://www.rcsb.org/pdb/home/home.do) [Berman et al. 2000], a database which stores 3D structures for over 600 of the ∼6,000 characterized yeast proteins (∼10%), to predict new interactions between proteins in yeast. However, since 3D structures are available for only a small fraction of proteins, using this approach to predict interactions on a larger scale is not feasible. Zhang et al. proposed to overcome this limitation to some extent by deriving homology models for proteins without available 3D structures. Homology models were derived for an additional ∼3,600 yeast proteins using the ModBase (http://modbase.compbio.ucsf.edu/) [Pieper et al. 2006] and Skybase (http://skybase.c2b2.columbia.edu/pdb60_new/struct_show.php) [Mirkovic et al. 2007] databases. Given a query protein, these databases predict the most likely 3D structure for the protein based on its sequence similarity with templates built from proteins with available 3D structures. The final set of structurally predicted interactions from the Zhang et al. study is available in the PrePPI database (http://bhapp.c2b2.columbia.edu/PrePPI/). Struct2Net (http://groups.csail.mit.edu/cb/struct2net/webserver/) [Singh et al. 2010] uses a structure-threading approach to predict interactions between proteins. Given two protein sequences, Struct2Net “threads” the sequences onto known 3D structures from PDB and then, based on the best-matching structures, estimates the likelihood of interaction between the two proteins. PredictProtein (http://www.predictprotein.org/) [Yachdav et al. 2014] combines structure and GO-based methods to predict new interactions. Wang et al. [2012] curated a 3D-structure-resolved dataset of 4,222 high-quality human PPIs enriched for human disease genes by examining relationships between 3,949 genes, 62,663 mutations, and 3,453 associated human disorders.

      Literature Mining. Interactions that are missing from PPI datasets but are referenced, directly or indirectly, in scientific publications can be identified by mining the literature. For example, these references may appear in abstracts or full texts of publications maintained in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) by the National Center for Biotechnology Information (NCBI). Text-mining tools based on natural language processing (NLP) and other machine-learning techniques mine these literature sources for co-occurrences of protein names, and proteins that are frequently referenced together can be predicted to interact. For example, the PubGene tool, which is a part of COREMINE (http://www.coremine.com/medical/), mines for information on genes and proteins including their co-occurrences in abstracts of publications, their sequence homology,
