Computational Prediction of Protein Complexes from Protein Interaction Networks. Sriganesh Srihari
Friedel et al. [2009] combined the bait–prey relationships detected in the Gavin et al. [2006] and Krogan et al. [2006] experiments, and used a random sampling-based scheme to estimate the confidence of interactions. In this approach, a list Φ = (ϕ1, …, ϕn) of purifications was generated, where each purification ϕi consisted of one bait bi and the preys pi,1, …, pi,m identified for this bait in the purification: ϕi = 〈bi, [pi,1, …, pi,m]〉. From Φ, l = 1000 bootstrap samples were created by drawing n purifications with replacement. This means that each bootstrap sample Sj(Φ) contains the same number of purifications as Φ, and each purification ϕi can be contained in Sj(Φ) once, multiple times, or not at all, with multiple copies being treated as separate purifications. Interaction scores for protein pairs are then calculated from these l bootstrap samples using socio-affinity scoring as above: each protein pair is scored by comparing the number of times the pair co-occurs across the sampled purifications against what would be expected for the pair at random, given the abundance of each protein in the two datasets.
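The resampling step can be made concrete with a short sketch in Python. The fragment below is illustrative only: the function names (bootstrap_samples, cooccurrence_counts) and the toy purification list are hypothetical, and the socio-affinity correction for protein abundance is not implemented here, only the bootstrap co-occurrence counting that feeds into it.

```python
import random
from collections import Counter
from itertools import combinations

def bootstrap_samples(purifications, l=1000, seed=0):
    """Yield l bootstrap samples, each drawing n purifications
    with replacement from the original list (Friedel et al. [2009])."""
    rng = random.Random(seed)
    n = len(purifications)
    for _ in range(l):
        # Multiple copies of the same purification are kept as
        # separate purifications, as in the scheme described above.
        yield [purifications[rng.randrange(n)] for _ in range(n)]

def cooccurrence_counts(sample):
    """Count how often each protein pair co-occurs (bait-prey, or
    prey-prey within the same purification) in one bootstrap sample."""
    counts = Counter()
    for bait, preys in sample:
        for prey in preys:
            counts[frozenset((bait, prey))] += 1
        for p1, p2 in combinations(preys, 2):
            counts[frozenset((p1, p2))] += 1
    return counts

# Hypothetical toy data: each purification is (bait, [preys ...]).
purifications = [("A", ["B", "C"]), ("B", ["A"]), ("C", ["A", "B"])]
totals = Counter()
for sample in bootstrap_samples(purifications, l=100):
    totals.update(cooccurrence_counts(sample))
# `totals` would then feed the socio-affinity scoring step described above.
```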
Zhang et al. [2008] modeled each purification as a bit vector that lists the proteins pulled down as preys against a bait across different experiments. The authors then used the Sørensen-Dice similarity index [Sørensen 1948, Dice 1945] between these vectors to estimate the co-purification of preys across experiments, and thus the interaction reliability between proteins. Specifically, the pull-down data is transformed into a binary protein pull-down matrix in which a cell [u, i] is 1 if u is pulled down as a prey in the experiment or purification i, and 0 otherwise. For two protein vectors in this matrix, the Sørensen-Dice similarity index, or simply the Dice coefficient, is computed as Dice(u, v) = 2q/(2q + r + s),
where q is the number of matrix elements (experiments or purifications) that have ones for both proteins u and v; r is the number of elements where u has ones but v has zeroes; and s is the number of elements where v has ones but u has zeroes. If u and v indeed interact (directly or as part of a complex), then the two proteins will most likely be co-purified frequently across different experiments. The Dice coefficient therefore measures the fraction of times u and v are co-purified, which serves as an estimate of the reliability of the interaction between u and v.
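As a minimal sketch (the function name and toy vectors are hypothetical), the Dice coefficient can be computed directly from two rows of the binary pull-down matrix:

```python
import numpy as np

def dice_coefficient(u_vec, v_vec):
    """Sorensen-Dice similarity between two binary pull-down vectors:
    Dice(u, v) = 2q / (2q + r + s)."""
    u = np.asarray(u_vec, dtype=bool)
    v = np.asarray(v_vec, dtype=bool)
    q = np.sum(u & v)    # purifications where both proteins are pulled down
    r = np.sum(u & ~v)   # purifications with u only
    s = np.sum(~u & v)   # purifications with v only
    if q + r + s == 0:
        return 0.0       # neither protein observed in any purification
    return 2.0 * q / (2.0 * q + r + s)

# Toy rows of a pull-down matrix (columns = experiments/purifications).
u_row = [1, 1, 0, 1, 0]
v_row = [1, 0, 0, 1, 1]
print(dice_coefficient(u_row, v_row))  # 2*2 / (2*2 + 1 + 1) = 0.667
```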
Hart et al. [2007] generated a Probabilistic Integrated Co-complex (PICO) network by integrating matrix-modeled relationships from the Gavin et al. [2002], Gavin et al. [2006], Krogan et al. [2006], and Ho et al. [2002] datasets using hypergeometric sampling. Specifically, the significance (p-value) of observing an interaction between the proteins u and v at least k times in the dataset is estimated using the hypergeometric distribution as

p(k; n, m, N) = Σi≥k C(n, i) · C(N − n, m − i) / C(N, m),
where k is the number of times the interaction between u and v is observed, n and m are the total numbers of interactions for u and v, respectively, N is the total number of interactions in the entire dataset, and C(a, b) denotes the binomial coefficient. The lower the p-value, the smaller the chance that the observed interaction between u and v arose at random, and therefore the higher the chance that the interaction is true.
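This upper tail can be evaluated directly with SciPy's hypergeometric distribution. The sketch below is illustrative (the function name and numbers are hypothetical) and assumes the parameterization described above:

```python
from scipy.stats import hypergeom

def cocomplex_pvalue(k, n, m, N):
    """p-value for seeing the u-v interaction at least k times, given
    n interactions for u, m for v, and N in the whole dataset:
        p = sum_{i >= k} C(n, i) * C(N - n, m - i) / C(N, m).
    hypergeom.sf(k - 1, N, n, m) evaluates exactly this upper tail."""
    return hypergeom.sf(k - 1, N, n, m)

# Hypothetical numbers: pair observed 4 times; u has 20 interactions,
# v has 30, out of 10,000 interactions in the entire dataset.
print(cocomplex_pvalue(k=4, n=20, m=30, N=10_000))  # small p => unlikely at random
```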
Methods such as Significance Analysis of INTeractome (SAINT) are based on quantitative analysis of mass spectrometry data. SAINT, developed by Choi et al. [2011] and Teo et al. [2014], assigns confidence scores to interactions based on the spectral counts of proteins pulled down in AP/MS experiments. The aim is to convert the spectral count Xij for a prey protein i identified in a purification of bait j into the probability of true interaction between the two proteins, P(True|Xij). For this, the true and false distributions, P(Xij|True) and P(Xij|False), and the prior probability πT of true interactions in the dataset are inferred from the spectral counts of all interactions involving prey i and bait j. Essentially, SAINT assumes that, if proteins i and j interact, then their “interaction abundance” is proportional to the product XiXj of their spectral counts Xi and Xj. To compute P(Xij|True), the spectral counts Xi and Xj are learned not only from the interaction between i and j, but also from all bona fide interactions that involve i and j. The same principle is applied to compute P(Xij|False) for false interactions. These probability distributions are then used to calculate the posterior probability of true interaction P(True|Xij). The interactions are then ranked in decreasing order of their probabilities, and a threshold is used to select the most likely true interactions.
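The final step in SAINT reduces to Bayes' rule. The sketch below is not SAINT itself: it assumes, purely for illustration, that the true and false spectral-count distributions have already been inferred and can be summarized as Poisson distributions with hypothetical rates lam_true and lam_false.

```python
from scipy.stats import poisson

def saint_posterior(x_ij, lam_true, lam_false, pi_t):
    """Posterior probability of a true interaction given spectral count x_ij:
        P(True | X) = pi_T * P(X | True)
                      / (pi_T * P(X | True) + (1 - pi_T) * P(X | False)).
    The Poisson rates lam_true and lam_false stand in for the true and
    false count distributions that SAINT infers from the data."""
    p_true = poisson.pmf(x_ij, lam_true)
    p_false = poisson.pmf(x_ij, lam_false)
    denom = pi_t * p_true + (1.0 - pi_t) * p_false
    return pi_t * p_true / denom if denom > 0 else 0.0

# Hypothetical example: count of 12, true rate 10, false rate 2, 10% prior.
print(saint_posterior(12, lam_true=10.0, lam_false=2.0, pi_t=0.1))
```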
Comparative Proteomic Analysis (ComPASS) [Sowa et al. 2009] employs a comparative methodology to assign scores to proteins identified within parallel proteomic datasets. It constructs a stats table X[k × m] in which each cell X[i, j] = Xi,j is the total spectral count (TSC) for an interactor j (arranged as m columns) in an experiment i (arranged as k rows). ComPASS uses a D-score to normalize the TSCs across proteins such that the highest scores are given to proteins in each experiment that are found rarely, found in duplicate runs, and have high TSCs—all characteristics that qualify proteins as candidate high-confidence interactors. The D-score is a modification of the Z-score, which weights all interactors equally regardless of the number of replicates or their TSCs. Let X̄j = (1/k) Σi Xi,j be the average TSC for interactor j across all k experiments.
The Z-score is then computed as

Zi,j = (Xi,j − X̄j)/σj,
where σj is the standard deviation of the spectral counts of interactor j across the experiments. The D-score improves on the Z-score by incorporating the uniqueness, the TSC, and the reproducibility of the interaction to assign a score to each protein within each experiment. The D-score first rescales Xi,j as

Di,j = √(Xi,j · (k/fj)^p),

where fj is the number of experiments in which interactor j is present, and p is the number of replicate runs in which the interactor is present. A D-score distribution is generated using a simulated random dataset, and a D-score threshold DT is determined below which 95% of this randomized data falls. A normalized D-score is then computed using this threshold as NDi,j = Di,j/DT, so that interactions with normalized D-scores of at least 1 exceed the random threshold and are considered high confidence.
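The scoring pipeline can be sketched end to end in a few lines. The fragment below is an illustrative reconstruction, not the published implementation: the function name, the toy table, and the use of the data's own 95th percentile in place of a simulated random dataset are all assumptions made for brevity.

```python
import numpy as np

def compass_scores(X, p_reps):
    """Z- and D-scores for a stats table X (k experiments x m interactors),
    where X[i, j] is the TSC of interactor j in experiment i and
    p_reps[i, j] is the number of replicate runs in which j appears.
    D = sqrt(X * (k / f)^p), with f the number of experiments in which
    the interactor is detected (the reading of the D-score given above)."""
    X = np.asarray(X, dtype=float)
    k, m = X.shape
    mean = X.mean(axis=0)                    # average TSC per interactor
    std = X.std(axis=0, ddof=1)              # per-interactor standard deviation
    z = (X - mean) / np.where(std > 0, std, 1.0)
    f = np.count_nonzero(X, axis=0)          # experiments detecting interactor j
    uniqueness = k / np.where(f > 0, f, 1)   # rare interactors score higher
    d = np.sqrt(X * uniqueness ** np.asarray(p_reps, dtype=float))
    return z, d

# Hypothetical toy table: 3 experiments x 2 interactors.
X = [[10, 3], [0, 4], [0, 5]]
p_reps = [[1, 1], [0, 1], [0, 1]]
z, d = compass_scores(X, p_reps)
d_t = np.percentile(d, 95)  # stand-in for the threshold from simulated random data
nd = d / d_t                # normalized D-score; values >= 1 are high confidence
print(nd)
```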