Diffusion Distance

Introduction

We want to find novel proteins that are ‘central’ in the sense of being closely connected to the three proteins of the telomerase complex (EST1, EST2, EST3). The shortest path is not suitable for this task, since the network is highly connected – there are hundreds of proteins with shortest path length 2 from EST1, for example. In other words, we need a more sophisticated distance metric to capture fine-grained structures of the network. The diffusion distance, which uses random walks and is employed in the implementation of Walktrap, is therefore suitable for our purpose, since it is known from (1) to return a continuous range of values that allows much finer distinctions of similarity between nodes.

We implement diffusion distance with the R package diffudist, the only package of the sort available online. In particular, we use the function get_distance_matrix which, for a given time parameter $t$ , returns a matrix whose entries give the diffusion distance between pairs of nodes based on random walks of time length $t$ .

Setup

import networkx as nx
import numpy as np
import igraph as ig; 
G0 = nx.read_weighted_edgelist("4932.protein.links.v12.0.txt",comments="#",nodetype=str)
threshold_score = 700 #HIGH CERTAINTY
for edge in G0.edges: 
    weight = list(G0.get_edge_data(edge[0],edge[1]).values())
    if(weight[0] <= threshold_score):
        G0.remove_edge(edge[0],edge[1])

largest_cc = max(nx.connected_components(G0),key=len)
G = G0.subgraph(largest_cc)

g = ig.Graph.from_networkx(G)

h = g.community_walktrap().as_clustering()
mem = h.membership
walktrapG = [[] for i in range(len(h))]
i = 0 
for node in nx.nodes(G):
    walktrapG[mem[i]].append(node)
    i += 1
walktrapG.sort(key=len, reverse=True)
telomerase_community = G.subgraph(walktrapG[5])

Diffusion Distance Methods

The package calculates the diffusion distance using the Poissonian node-centric continuous time random walk model, as discussed in (2). Let $G$ be our network, $A$ its adjacency matrix, and $D$ the matrix with the degrees of vertices on the diagonal and zero elsewhere. Further, assume that the time a random walker stays at each node obeys an exponential distribution $f(x;\lambda)=\lambda e^{-\lambda x},$ where $\lambda$ may be normalised to 1 since it merely sets the time scale. The probability vector of the continuous random walk is then characterised by the master equation $\frac{d\mathbf{p}(t)}{dt}=-\mathbf{p}(t)\tilde{L}$ where $\tilde L = I-D^{-1}A$ is called the normalised Laplacian matrix. Solving the differential equation, we find that given an initial position $i$ , the probability vector at time $t$ is simply the $i$ -th row of $\mathbf{P}= e^{-t\tilde{L}}$ , which we denote by $\mathbf{P}(t|i)$ . Having computed $\mathbf{P}$ , we can then define the family of diffusion distances between nodes $i$ and $j$ as $D_t(i,j):=|\mathbf{p}(t|i)-\mathbf{p}(t|j)|_2,$ where the norm is the usual Euclidean 2-norm. The diffusion distance thus measures the similarity between two nodes by considering the probability vectors of random walkers starting at these nodes, after $t$ amount of time has elapsed. We will discuss the importance of the parameter $t$ below.

A Note on the Jaccard Ceofficient

Calculations of the matrix exponential can be computationally costly, making it unrealistic to evaluate diffusion distances for the entire network. Our method is thus to find a subgraph containing the telomerase community and calculating diffusion distances over this subgraph. To this end, we use the Walktrap community finding algorithm to generate a list of communities. Having established that the telomerase complex resides in the 5th community, we want to find the communities closest to it. To quantify the closeness between communities, we used the link-based Jaccard coefficient proposed in (3), described in the section on Telomerase in the Community Network. As mentioned, we also tested the ‘average restricted distance’ discussed in (4). The code for both is given below:

Jaccard Distance Measure

def num_edges(A,B):
    num = 0
    intraedge = 0
    for node in A + B:
        num += G.degree[node]
        for node1 in A + B:
            if node == node1:
                continue
            else:
                intraedge += G.number_of_edges(node,node1)
    return num - intraedge/2 

def interedges(A,B):
    num = 0
    cut = [*nx.edge_boundary(G,A,B)]
    return len(cut)

def jdistance(A,B):
    return interedges(A,B)/(num_edges(A,B))
    
jdists = []
for i in range(0,len(walktrapG)):
    jdists.append(jdistance(walktrapG[5],walktrapG[i]))
for i in range(10):
    print(f'Community {np.argsort(jdists)[::-1][i]}: Jaccard index = {jdists[np.argsort(jdists)[::-1][i]]}')

Restricted Distance Measure

def distance(A,B):
    distances = []
    nodes = A + B
    AB = G.subgraph(nodes)
    if not nx.is_connected(AB):
        return 0
    for nodeA in A:
        for nodeB in B:
            distances.append(nx.shortest_path_length(AB,nodeA,nodeB))
    return np.average(distances)
rdists = []
for i in range(0,len(walktrapG)):
    rdists.append(distance(walktrapG[5],walktrapG[i]))
for i in range(10):
    print(f'Community {np.argsort(rdists)[i]} has distance {rdists[np.argsort(rdists)[i]]} from community 5')

Implementation

By considering the nodes of 1. the community containing the telomerase proteins and 2. the nine communities closest to it, we obtained an induced subgraph of $G$ with 2786 nodes that contains all of $G$ ’s largest communities. We then identified proteins most closely connected to the telomerase proteins (EST1, EST2, and EST3) by evaluating the distance matrix of this network and selecting the proteins whose sum of distances to the three telomerase proteins is the smallest.

Creating subgraph with closest communities

# converting the subgraph induced by the top communities into a graph with integer coefficients, and exporting the edgelist
acom = G.subgraph(walktrapG[5]+walktrapG[0]+walktrapG[1]+walktrapG[2]+walktrapG[3]+walktrapG[4]+walktrapG[15]+walktrapG[7]+walktrapG[9]+walktrapG[17])
acom = nx.convert_node_labels_to_integers(acom,ordering="sorted",label_attribute='name')
nx.write_edgelist(acom,'ten_edgelist.txt',data=False)
for i in acom.nodes:
    if acom.nodes[i]['name'] == 'YLR233C':
        print(f'Est1: {i}')
    if acom.nodes[i]['name'] == 'YLR318W':
        print(f'Est2: {i}')
    if acom.nodes[i]['name'] == 'YIL009C-A':
        print(f'Est3: {i}')
print(acom.number_of_nodes())
# EST1: 1783
# EST2: 1821
# EST3: 1252

Calculating findings

library(igraph)
library(diffudist)
library(stringr)
library(strex)

# I slightly modifed the function definition; this is basically just the function get_distance_matrix
distance <- function (g, tau, type = "Normalized Laplacian", weights = NULL, 
                      as_dist = FALSE, verbose = TRUE) 
{
  expL <- get_diffusion_probability_matrix(g, tau, type, weights, 
                                           verbose)
  node_labels <- as.character(igraph::V(g)$name)
  if (is.null(node_labels)) {
    node_labels <- as.character(igraph::V(g)$label)
  }
  if (is.null(node_labels)) {
    node_labels <- as.character(igraph::V(g))
  }
  colnames(expL) <- rownames(expL) <- node_labels
  if (verbose) {
    cat(paste("Building distance matrix...\n"))
  }
  if (requireNamespace("parallelDist", quietly = TRUE)) {
    DM <- parallelDist::parDist(expL)
  }
  else {
    DM <- stats::dist(expL)
  }
  return(as.matrix(DM))
}

H = read_graph('ten_edgelist.txt',directed=FALSE)
gd <- distance(H, 300, type = "Normalized Laplacian", weights = NULL, 
                          as_dist = TRUE, verbose = TRUE)
ordered <- order((gd2[1821,]+gd2[1783,]+gd2[1252,]), decreasing =FALSE)

Choosing the Parameter $t$

Since the diffusion distance is actually a family of metrics, we need to choose the parameter $t$ that best reflects the network structure. (5) explained that we can interpret it as follows: for small values of $t$ the distances capture information about local structures, and as $t$ increases more macroscopic scale information is captured; when $t$ is large, only the most macro scale is retained.

As we experimented with different values of $t$ , we found that for small values of $t$ (roughly less than 600), the results were stable in the sense that varying $t$ slightly returns highly similar, if not identical results. For example, for $t$ in the range $[300,400]$ , the thirty nodes with the smallest distance to the telomerase proteins are completely identical. For larger values of $t$ , however, the results became unstable: there is no overlap in the top 30 nodes obtained with $t=600$ and $t=601$ . We think a possible explanation for this is that for our densely connected network (with 2786 nodes) a random walk of time length $t\approx 600$ is about the longest possible to still retain useful information about a protein’s connection to others. Longer paths give too much weight to remotely connected proteins: they ‘wash out’ useful information and give rise to chaotic behaviour. We therefore sample different resolutions, at 50, 150, 250, 350, 450 and 550, and take the most similar proteins found using these values: these are the most similar proteins as we vary how much weight we give to the ‘local’ structure.

Findings

t = 50: RIM15, RPS8A, NSA2, SCH9, MRPS17, RIM, CCR4, SWE1, TOR1, UBC5
t = 150: NSA2, ZDS2, YHC3, NAT4, BUR2, PUP3, RRP17, EAP1, PRP5, MRPS17
t = 250: PRP5, RFA3, RPL34A, ROK1, ZDS2, BUR2, NAP1, EDC1, RPN2,REI1
t = 350: ROK1, RPL34A, RPN2, PRP5, RFA3, NAP1, REI1, BUR2, EDC1, ZDS2
t = 450: RPN2, PRP5, RFA3, ROK1, RPL34A,NAP1,REI1, EDC1, BUR2, ZDS2
t = 550: IKI1, RFC2, NUG1, ATX1, RPS5, RPL34A, ECM32, ISW2, YLH47, YFH1

References

Cao HAP Mengfei AND Zhang. Going the distance for protein function prediction: A new distance metric for protein interaction networks. PLOS ONE [Internet]. 2013 Oct;8(10):1–12. Available from: 10.1371/journal.pone.0076339

Masuda N, Porter MA, Lambiotte R. Random walks and diffusion on networks. Physics Reports [Internet]. 2017;716-717:1–58. Available from: https://www.sciencedirect.com/science/article/pii/S0370157317302946

Shibata N, Kajikawa Y, Sakata I. Measuring relatedness between communities in a citation network. Journal of the American Society for Information Science and Technology. 2011;62(7):1360–9.

Adali S, Lu X, Magdon-Ismail M. Deconstructing centrality: Thinking locally and ranking globally in networks. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining [Internet]. New York, NY, USA: Association for Computing Machinery; 2013. p. 418–25. (ASONAM ’13). Available from: 10.1145/2492517.2492531

De Domenico M. Diffusion geometry unravels the emergence of functional clusters in collective phenomena. Phys Rev Lett [Internet]. 2017 Apr;118:168301. Available from: https://link.aps.org/doi/10.1103/PhysRevLett.118.168301