The WalkTrap Method

The WalkTrap Method is a hierarchical clustering algorithm proposed by Pons & Latapy in 2005 in the paper Computing Communities in Large Networks (1). In essence, this method leverages the tendency for short random walks to stay within their local neighbourhoods for community detection.

Random Walks

A walker starting at a random node moves to a neighbouring node, with the probability of choosing each neighbour equal. That is, the probability a walker moves from node $i$ to $j$ is given by $P_{ij} = \frac{A_{ij}}{k_i}$ . Note that $P$ is known as the transition matrix. The probability of going from $i$ to $j$ during a random walk of length $t$ steps is $(P^t)_{ij}$ .

Pons & Latapy developed the following measure of distance based on random walks: $r_{ij} = \sqrt{\sum_{l = 1}^{n}\frac{(P^t_{il}-P^t_{jl})^2}{k_l}}$

Steps of the WalkTrap Algorithm

Assign each node to its own community
Compute the distance between all adjacent communities, that is communities which are linked by an edge.
Choose two communities which minimizes $\sigma = \frac{1}{n}\sum_{C \in \mathcal{P}}\sum_{i \in C}r^2_{iC}$ where $\mathcal{P}$ is the current partition of the network into communities, and $r_{iC}$ is a modified measure of distance mentioned above using $P^t_{Cj} = \frac{1}{|C|}\sum_{i \in C}P^t_{ij}$ .
Merge the two communities
Update the distances between adjacent communities of the newly formed community.
Repeat until there is one community
Find at which partition modularity was maximised, take this partition.

Implementation

WalkTrap Implementation

# run walktrap algorithm and save as VectorClustering object
h = g.community_walktrap().as_clustering()

# membership vector (a vector of each node containing which community it is a part of)
mem = h.membership

# list of lists of community nodes
walktrapG = [[] for i in range(len(h))]

i = 0 
for node in nx.nodes(G):
    walktrapG[mem[i]].append(node)
    i += 1
walktrapG.sort(key=len, reverse=True)

# Find communities containing target proteins
print('~ Proteins of Interest WalkTrap Communities ~')
for p in poi:
    i = 0
    while i < len(walktrapG):
        if p.p_id in walktrapG[i]:
            print(p.p_name + " is in community " + str(i))
        i += 1

# community in networkX
telomerase_community = G.subgraph(walktrapG[5])

# basic communtiy analysis
print('\n' + 'Walktrap partitions the network into ' + str(len(walktrapG)) + ' communities with the following sizes:')
walktrapG_len = []
for comm in walktrapG:
    walktrapG_len.append(len(comm))
print(walktrapG_len)

We used the community_walktrap() function included in the igraph package using random walks of length 4. The documentation for the same algorithm in C noted that a walk length between 3-8 yielded the best results. From testing the modularity of walk lengths between 3-8, lengths 6 and 8 resulted in a slightly higher modularity than length 4, but led to community sizes that were significantly larger (with multiple 800+ size groups). We aimed for smaller communities in hopes of detecting finer-grain functional groups. On the other hand, we would expect shorter walk lengths would give communities that are too small (mainly complexes). We therefore decided on a walk length of 4, however this is an area where future experimentation would be useful, especially since these communities are vital for our target validation.

With this, we were able to establish communities containing our proteins of interest (EST1, EST2, EST3, TEL1 and POP1). We labelled the community containing EST1, EST2 and EST3 proteins the telomerase community.

Results

WalkTrap partitioned the PPI network into 580 communities ranging in size from 811 nodes to 1 node.

The telomerase complex and the protein TEL1 are contained within one community of size 204, we denote this community as the ‘telomerase community’.

The protein POP1 is contained within the largest community of size 811.s

Community Validation

Annotation based validation

We took proteins with the 5 highest and 5 lowest degrees in the telomerase community, by looking at their annotations from the STRING database and in conjunction with the BCMB students we propose that the telomerase community is a functional module within the cell focused on cell replication and DNA repair, including telomerase and its cell functions.

While this category is not highly specific it is coherent, implying WalkTrap has successfully partitioned the PPI network into biologically relevant communities.

We also further looked at the 5 highest degree proteins in each of the 6 largest communities and found that they all broadly correspond to functional groups of proteins serving a similar role within the cell.

Annotation-based Validation

# highest degree nodes of walktrap telomerase commmunity 
telomerase_community = G.subgraph(walktrapG[5])
telomerase_community_high_degree_nodes = sorted(list(telomerase_community.nodes()), key=telomerase_community.degree(), reverse=True)

print("~~  HIGHEST DEGREE NODES OF THE WALKTRAP TELOMERASE COMMUNITY  ~~")
# print annotations of the highest degree nodes in the community to scan if they are related to TELOMERASE
i=0
while i < 5:
    pro = pro_info(P, telomerase_community_high_degree_nodes[i])
    print(pro.p_name + "\n" + pro.annot + "\n")
    i += 1
i=0

# lowest degree nodes of walktrap telomerase community
print("~~  LOWEST DEGREE NODES OF THE WALKTRAP TELOMERASE COMMUNITY  ~~")
i = 1
while i < 6:
    pro = pro_info(P, telomerase_community_high_degree_nodes[-i])
    print(pro.p_name + "\n" + pro.annot + "\n")
    i += 1

Keyword based validation

Using the annotations we analysed what proportion of nodes within the telomerase community reference ‘telomere’ or ‘telomerase’ as compared with the whole network. Ideally a network will be relatively small but contain a relatively large portion of these nodes.

The WalkTrap telomerase community is 3% of the network but captures 15% of the proteins which reference ‘telomere’ or ‘telomerase’ in their annotations. 17% of the proteins within the telomerase community reference ‘telomere’ or ‘telomerase’ compared with 4% of the whole network.

Keyword-based Validation

# please see full code submission for auxillary functions

# telomerase community ProteinInfo object list
P_t = []
for t in telomerase_community:
    P_t.append(pro_info(P, t))

print('Nodes in network: ' + str(len(P)))
print("Nodes in network refering to 'telomerase' or 'telomere' in annotation: " + str(len(find_p(P, 'telomer'))))
print('Nodes in telomerase community: ' + str(len(P)))
print("Nodes in telomerase community refering to 'telomerase' or 'telomere' in annotation: " + str(len(find_p(P, 'telomer'))))

Comparing to other community finding algorithms

In this project, we explored a few different methods for community detection–including the Louvain method. While the Louvain method often achieved community partitioning that had slightly greater modularity than the WalkTrap method, in fact Louvain tested worse in both the annotation based validation and the keyword based validation of the telomerase community. It also most often produced significantly larger communities than WalkTrap, which was at odds with our goal of finding fine grained functional groups. In the end, the group preferred WalkTrap since the deterministic algorithm was convenient, making it easy to replicate the community partitioning and therefore share code between group members.

References

Pons P, Latapy M. Computing communities in large networks using random walks. In: Computer and information sciences-ISCIS 2005: 20th international symposium, istanbul, turkey, october 26-28, 2005 Proceedings 20. Springer; 2005. p. 284–93.