The WalkTrap Method is a hierarchical clustering algorithm proposed by Pons & Latapy in 2005 in the paper Computing Communities in Large Networks (1). In essence, this method leverages the tendency for short random walks to stay within their local neighbourhoods for community detection.
Random Walks
A walker starting at a random node moves to a neighbouring node, with the probability of choosing each neighbour equal. That is, the probability a walker moves from node i to j is given by Pij=kiAij. Note that P is known as the transition matrix. The probability of going from i to j during a random walk of length t steps is (Pt)ij.
Pons & Latapy developed the following measure of distance based on random walks: rij=l=1∑nkl(Pilt−Pjlt)2
Steps of the WalkTrap Algorithm
Assign each node to its own community
Compute the distance between all adjacent communities, that is communities which are linked by an edge.
Choose two communities which minimizes σ=n1C∈P∑i∈C∑riC2 where P is the current partition of the network into communities, and riC is a modified measure of distance mentioned above using PCjt=∣C∣1∑i∈CPijt.
Merge the two communities
Update the distances between adjacent communities of the newly formed community.
Repeat until there is one community
Find at which partition modularity was maximised, take this partition.
Implementation
WalkTrap Implementation
# run walktrap algorithm and save as VectorClustering objecth = g.community_walktrap().as_clustering()# membership vector (a vector of each node containing which community it is a part of)mem = h.membership# list of lists of community nodeswalktrapG = [[] for i inrange(len(h))]i =0for node in nx.nodes(G): walktrapG[mem[i]].append(node) i +=1walktrapG.sort(key=len, reverse=True)# Find communities containing target proteinsprint('~ Proteins of Interest WalkTrap Communities ~')for p in poi: i =0while i <len(walktrapG):if p.p_id in walktrapG[i]:print(p.p_name +" is in community "+str(i)) i +=1# community in networkXtelomerase_community = G.subgraph(walktrapG[5])# basic communtiy analysisprint('\n'+'Walktrap partitions the network into '+str(len(walktrapG)) +' communities with the following sizes:')walktrapG_len = []for comm in walktrapG: walktrapG_len.append(len(comm))print(walktrapG_len)
We used the community_walktrap() function included in the igraph package using random walks of length 4. The documentation for the same algorithm in C noted that a walk length between 3-8 yielded the best results. From testing the modularity of walk lengths between 3-8, lengths 6 and 8 resulted in a slightly higher modularity than length 4, but led to community sizes that were significantly larger (with multiple 800+ size groups). We aimed for smaller communities in hopes of detecting finer-grain functional groups. On the other hand, we would expect shorter walk lengths would give communities that are too small (mainly complexes). We therefore decided on a walk length of 4, however this is an area where future experimentation would be useful, especially since these communities are vital for our target validation.
With this, we were able to establish communities containing our proteins of interest (EST1, EST2, EST3, TEL1 and POP1). We labelled the community containing EST1, EST2 and EST3 proteins the telomerase community.
Results
WalkTrap partitioned the PPI network into 580 communities ranging in size from 811 nodes to 1 node.
The telomerase complex and the protein TEL1 are contained within one community of size 204, we denote this community as the ‘telomerase community’.
The protein POP1 is contained within the largest community of size 811.s
Community Validation
Annotation based validation
We took proteins with the 5 highest and 5 lowest degrees in the telomerase community, by looking at their annotations from the STRING database and in conjunction with the BCMB students we propose that the telomerase community is a functional module within the cell focused on cell replication and DNA repair, including telomerase and its cell functions.
While this category is not highly specific it is coherent, implying WalkTrap has successfully partitioned the PPI network into biologically relevant communities.
We also further looked at the 5 highest degree proteins in each of the 6 largest communities and found that they all broadly correspond to functional groups of proteins serving a similar role within the cell.
Annotation-based Validation
# highest degree nodes of walktrap telomerase commmunity telomerase_community = G.subgraph(walktrapG[5])telomerase_community_high_degree_nodes =sorted(list(telomerase_community.nodes()), key=telomerase_community.degree(), reverse=True)print("~~ HIGHEST DEGREE NODES OF THE WALKTRAP TELOMERASE COMMUNITY ~~")# print annotations of the highest degree nodes in the community to scan if they are related to TELOMERASEi=0while i <5: pro = pro_info(P, telomerase_community_high_degree_nodes[i])print(pro.p_name +"\n"+ pro.annot +"\n") i +=1i=0# lowest degree nodes of walktrap telomerase communityprint("~~ LOWEST DEGREE NODES OF THE WALKTRAP TELOMERASE COMMUNITY ~~")i =1while i <6: pro = pro_info(P, telomerase_community_high_degree_nodes[-i])print(pro.p_name +"\n"+ pro.annot +"\n") i +=1
Keyword based validation
Using the annotations we analysed what proportion of nodes within the telomerase community reference ‘telomere’ or ‘telomerase’ as compared with the whole network. Ideally a network will be relatively small but contain a relatively large portion of these nodes.
The WalkTrap telomerase community is 3% of the network but captures 15% of the proteins which reference ‘telomere’ or ‘telomerase’ in their annotations. 17% of the proteins within the telomerase community reference ‘telomere’ or ‘telomerase’ compared with 4% of the whole network.
Keyword-based Validation
# please see full code submission for auxillary functions# telomerase community ProteinInfo object listP_t = []for t in telomerase_community: P_t.append(pro_info(P, t))print('Nodes in network: '+str(len(P)))print("Nodes in network refering to 'telomerase' or 'telomere' in annotation: "+str(len(find_p(P, 'telomer'))))print('Nodes in telomerase community: '+str(len(P)))print("Nodes in telomerase community refering to 'telomerase' or 'telomere' in annotation: "+str(len(find_p(P, 'telomer'))))
Comparing to other community finding algorithms
In this project, we explored a few different methods for community detection–including the Louvain method. While the Louvain method often achieved community partitioning that had slightly greater modularity than the WalkTrap method, in fact Louvain tested worse in both the annotation based validation and the keyword based validation of the telomerase community. It also most often produced significantly larger communities than WalkTrap, which was at odds with our goal of finding fine grained functional groups. In the end, the group preferred WalkTrap since the deterministic algorithm was convenient, making it easy to replicate the community partitioning and therefore share code between group members.
Pons P, Latapy M. Computing communities in large networks using random walks. In: Computer and information sciences-ISCIS 2005: 20th international symposium, istanbul, turkey, october 26-28, 2005 Proceedings 20. Springer; 2005. p. 284–93.