In Iterative CD, proposed by Liu et al. [2008], the weight for each interaction is computed from the number of neighbors shared between the two interacting proteins, in a manner similar to FS Weight. However, Iterative CD then iteratively corrects these interaction weights, so that the weights computed in each iteration use the weights computed in the previous iteration. In doing so, Iterative CD progressively reinforces the weights of true interactions and dampens the weights of false-positive interactions with each iteration. Iterative CD begins with an unweighted network (all weights set to 1) or a network whose weights come from some prior evidence (e.g., the reliabilities of the experiments from which the protein pairs were inferred). The weight w_k(u, v) for each protein pair (u, v) in the k-th (k > 1) iteration is estimated as
where w_1(u, v) = 1 if the interaction (u, v) exists in the original network and w_1(u, v) = 0 if it does not; alternatively, w_1(u, v) = r_e(u, v), the reliability of the experiment e used to infer (u, v). The parameters λ_u and λ_v penalize protein pairs with very few common neighbors. Liu et al. show that the iterative procedure converges within two iterations for typical PPI networks, but 10–30 iterations may be necessary if the network has high levels of noise. The procedure produces a weight between 0 and 1 for each interaction, and a cut-off (0.2 is recommended) is used to filter out low-scoring interactions.
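As an illustration, the sketch below implements one common-neighbor-based iterative re-weighting of this kind. It assumes an AdjustCD-style update (shared-neighbor weight sums normalized by per-protein weight sums plus the penalties λ_u and λ_v, taken here as the shortfall from the average weighted degree); the function name, data structures, and the exact form of the update are illustrative assumptions rather than the authors' exact formulation.

```python
def iterative_cd(neighbors, weights, n_iter=2, cutoff=0.2):
    """Sketch of an Iterative CD-style re-weighting (assumed AdjustCD-like form).

    neighbors: dict protein -> set of interacting proteins.
    weights:   dict frozenset({u, v}) -> initial weight (1.0, or experiment reliability).
    """
    for _ in range(n_iter):
        # Weighted degree of every protein under the current weights.
        deg = {u: sum(weights[frozenset((u, x))] for x in neighbors[u]) for u in neighbors}
        avg_deg = sum(deg.values()) / len(deg)
        new_weights = {}
        for edge in weights:
            u, v = tuple(edge)
            common = neighbors[u] & neighbors[v]
            # Weights of edges from u and from v to their shared neighbors.
            shared = sum(weights[frozenset((u, x))] + weights[frozenset((v, x))] for x in common)
            # lambda_u, lambda_v penalize proteins whose neighborhoods are unusually small.
            lam_u = max(0.0, avg_deg - deg[u])
            lam_v = max(0.0, avg_deg - deg[v])
            new_weights[edge] = shared / (deg[u] + lam_u + deg[v] + lam_v)
        weights = new_weights
    # Discard low-scoring (likely spurious) interactions; 0.2 is the recommended cut-off.
    return {e: w for e, w in weights.items() if w >= cutoff}
```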
Luo et al. [2015] scored interactions using collaborative filtering (CF). This approach is inspired by personalized recommendation in e-commerce, where the problem is to identify, from usage histories, patterns that reflect the connections between users and items, and to make reliable predictions for possible user-item links based on these patterns [Herlocker et al. 2002]. The CF scheme first computes the similarity sim(u, v) between a pair of interacting proteins u and v in the PPI network using the cosine distance of their adjacency vectors,
where 〈…〉 denotes the inner product between the vectors, and ||.|| denotes the Euclidean norm of the vectors. Therefore, the closer sim(u, v) is to 1, the more common neighbors u and v share. This cosine distance is then rescaled by incorporating a parameter Cγ which, similar to λ in FS Weight and Iterative CD, is used to account for spurious neighbors in the network,
Just as in e-commerce, where people judge an item based on multiple reviews, preferably from known contacts, the score sim(u, v) from Equation (2.21) is rescaled as:
where (ru,v)d (d being a tunable parameter) is the mean of the scores of interactions with common neighbors, {(u, x) : x ∈ N(u) ∩ N(v)} ∪ {(v, x) : x ∈ N(u) ∩ N(v)}, in the PPI network.
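The cosine-similarity step of the CF scheme can be sketched as below on the adjacency matrix of the PPI network; the function name and matrix representation are illustrative, and the Cγ rescaling and the neighbor-based rescaling of Equation (2.21) are deliberately left out, since their exact forms are not reproduced here.

```python
import numpy as np

def cf_cosine_scores(adj):
    """adj: (n x n) 0/1 adjacency matrix of the PPI network (numpy array).

    Returns an (n x n) matrix of cosine similarities between adjacency vectors,
    i.e., the sim(u, v) step of the CF scheme before any rescaling."""
    adj = adj.astype(float)
    norms = np.linalg.norm(adj, axis=1)      # Euclidean norm of each adjacency vector
    norms[norms == 0] = 1.0                  # avoid division by zero for isolated proteins
    inner = adj @ adj.T                      # inner products of adjacency vectors
    return inner / np.outer(norms, norms)    # cosine similarity sim(u, v)
```

A value of sim(u, v) close to 1 indicates that u and v share most of their neighbors, which is exactly the signal the subsequent rescaling steps refine.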
Voevodski et al. [2009] developed PageRank Affinity, a scheme inspired by Google’s PageRank algorithm [Haveliwala 2003], to score interactions in the PPI network. PageRank Affinity uses random walks to estimate the interconnectedness of protein interactions in the network. A random walk is a Markov process in which, at each step, the walk moves from the current protein to a neighboring protein with a certain preset probability. PageRank Affinity begins with an unweighted network G (with all weights set to 1) given by the adjacency matrix
The transition probability matrix for the random walks (often called the random walk matrix) is the normalized adjacency matrix where each row sums to one,
where D is the degree matrix, which is a diagonal matrix containing the degrees of the proteins,
Therefore, the transition probabilities are given by the matrix W_G, and the probability distribution p_{t+1} over proteins at step t + 1 is simply p_{t+1} = p_t W_G. PageRank Affinity repeatedly simulates random walks beginning from each protein in the network. Let pr(s) be the steady-state probability distribution of a random walk with restart probability α, where the starting vector s gives the probability distribution upon restarting. Then, pr(s) is the unique solution of the linear system given by
The starting vector here is set as follows:
Therefore, pr(s_u) is the steady-state probability distribution of a random walk that always restarts at u. Let pr(s_u)[v] represent the steady-state probability value that a protein v has in the distribution vector pr(s_u). This can be thought of as the probability contribution that u makes to v, and is denoted as pr(u → v). This provides the affinity between u and v when the walks always restart at u. The final affinity between u and v is computed as the minimum of pr(u → v) and pr(v → u).
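A minimal sketch of this computation is given below. It approximates each restart-at-u random walk by power iteration on pr = α s_u + (1 − α) pr W_G and then takes the element-wise minimum of the two contribution directions; the restart probability, iteration count, and function name are illustrative assumptions.

```python
import numpy as np

def pagerank_affinity(adj, alpha=0.15, n_steps=100):
    """adj: (n x n) 0/1 adjacency matrix of the PPI network.

    Returns the symmetric affinity matrix aff[u, v] = min(pr(u -> v), pr(v -> u)),
    where pr(s_u) is approximated by power iteration with restart probability alpha."""
    n = adj.shape[0]
    deg = adj.sum(axis=1).astype(float)
    deg[deg == 0] = 1.0                        # guard against isolated proteins
    walk = adj / deg[:, None]                  # W_G = D^-1 A, each row sums to one
    contrib = np.zeros((n, n))
    for u in range(n):
        s = np.zeros(n)
        s[u] = 1.0                             # the walk always restarts at u
        pr = s.copy()
        for _ in range(n_steps):
            pr = alpha * s + (1 - alpha) * (pr @ walk)
        contrib[u] = pr                        # contrib[u, v] approximates pr(u -> v)
    return np.minimum(contrib, contrib.T)      # affinity(u, v) = min of both directions
```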
Pržulj et al. [2004] and Higham et al. [2008] argue that PPI networks are best modeled by geometric random graphs rather than the scale-free or small-world graphs suggested by earlier works [Watts and Strogatz 1998, Barabási 1999]. A geometric random graph G = 〈V, ϵ〉 with radius ϵ is a graph with node set V of points in a metric space and edge set E = {(u, v) : u, v ∈ V ; 0 < ||u − v|| ≤ ϵ}, where ||.|| is an arbitrary distance norm in this space. The authors show that a PPI network can be represented as a geometric random graph by embedding the proteins of the PPI network in the metric spaces R2, R3, or R4, and finding an ϵ such that any two proteins are connected in the PPI network if and only if the proteins are ϵ-close in the chosen metric space. This embedding can be used not only to weight all interactions, but also to predict new interactions between protein pairs and to remove spurious interactions from the PPI network: interactions between proteins that are farther than ϵ apart in the metric embedding are pruned off as noise, whereas non-interacting protein pairs that are ϵ-close are predicted to be interacting.
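Assuming coordinates for the proteins in such a metric embedding are already available (the embedding itself is outlined next), the ϵ-based denoising and prediction step can be sketched as follows; the function name and data structures are illustrative.

```python
import numpy as np
from itertools import combinations

def denoise_with_embedding(coords, edges, eps):
    """coords: dict protein -> numpy coordinate vector in the embedding space.
    edges:  set of frozenset({u, v}) interactions in the PPI network.

    Returns (kept, predicted): interactions retained because the two proteins are
    eps-close, and eps-close non-interacting pairs predicted as new interactions."""
    kept, predicted = set(), set()
    for u, v in combinations(coords, 2):
        pair = frozenset((u, v))
        close = 0.0 < np.linalg.norm(coords[u] - coords[v]) <= eps
        if pair in edges and close:
            kept.add(pair)          # interaction consistent with the geometric model
        elif pair not in edges and close:
            predicted.add(pair)     # eps-close but unobserved: predicted interaction
    return kept, predicted
```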
The embedding of the PPI network in a chosen metric space is described briefly as follows. Given a PPI network of N proteins, the interaction weights for protein pairs (i, j) are interpreted as pairwise distances d_ij, and the task is to find locations in m-dimensional Euclidean space (vectors