CS 3824: Assignment 2

In this assignment, you will implement the Louvain and Leiden algorithms and compare the results on two protein interaction networks. Get started early. This assignment is significantly more difficult than Assignment 1!

1 Implement helper functions and data structures

  • You will need a class or a data structure to store a set of communities in a graph. You can give each community an integer-valued id, if that helps. This data structure has two components. (i) Given a node \(v\), you will need to determine quickly the id of the community containing \(v\). (ii) Given a community id \(C\), you will need to determine quickly the set of nodes in \(C\). "Quickly" here means that you should not be iterating over all nodes or all clusters to answer either of these queries. (A minimal sketch of such a data structure, together with from-scratch modularity and CPM functions, appears after this list.)
  • You will need to implement functions to compute the modularity of a set of communities and to efficiently compute the change in the modularity when we move a node from one community to another. For the second function, you can implement the formulae given in the slides on the Louvain and Leiden algorithms (pages 31-34).
  • Also implement functions to compute the Constant Potts Model (CPM) value for a set of communities and to efficiently compute the change in the CPM value when we move a node from one community to another. We did not derive any formulae for this change in class, but it should not be difficult to figure them out, implement them, and test them yourself.
  • These functions should take \(\gamma\) as a parameter.
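
As a concrete starting point, here is a minimal sketch of such a data structure and of from-scratch quality functions in Python. The class and function names (Partition, community_of, members, move, modularity, cpm) are my own and purely illustrative; the graph is assumed to be an undirected, unweighted networkx Graph, and the modularity and CPM formulae below are the standard ones with a resolution parameter \(\gamma\), so double-check them against pages 31-34 of the slides.

```python
class Partition:
    """Stores a set of communities: node -> community id and community id -> set of nodes."""

    def __init__(self, nodes):
        # Start with the singleton partition: every node in its own community.
        self.node_to_comm = {v: i for i, v in enumerate(nodes)}
        self.comm_to_nodes = {i: {v} for i, v in enumerate(nodes)}

    def community_of(self, v):
        # (i) Given a node v, return the id of the community containing v in O(1).
        return self.node_to_comm[v]

    def members(self, c):
        # (ii) Given a community id c, return the set of nodes in c in O(1).
        return self.comm_to_nodes[c]

    def move(self, v, new_comm):
        # Move node v to community new_comm, discarding its old community if it becomes empty.
        old_comm = self.node_to_comm[v]
        if old_comm == new_comm:
            return
        self.comm_to_nodes[old_comm].discard(v)
        if not self.comm_to_nodes[old_comm]:
            del self.comm_to_nodes[old_comm]
        self.comm_to_nodes.setdefault(new_comm, set()).add(v)
        self.node_to_comm[v] = new_comm


def modularity(G, partition, gamma=1.0):
    # Sum over communities C of [ e_C / m - gamma * (d_C / (2m))^2 ], where e_C is the
    # number of edges inside C, d_C is the total degree of C, and m = |E|.
    m = G.number_of_edges()
    return sum(G.subgraph(nodes).number_of_edges() / m
               - gamma * (sum(G.degree(v) for v in nodes) / (2.0 * m)) ** 2
               for nodes in partition.comm_to_nodes.values())


def cpm(G, partition, gamma=1.0):
    # Constant Potts Model: sum over communities C of [ e_C - gamma * n_C * (n_C - 1) / 2 ].
    return sum(G.subgraph(nodes).number_of_edges()
               - gamma * len(nodes) * (len(nodes) - 1) / 2.0
               for nodes in partition.comm_to_nodes.values())
```

These from-scratch functions are intentionally slow; they are meant for small graphs and for the sanity checks described in the next sections, not for the main loop.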

2 Implement the MoveNodes function, i.e., one iteration of Phase 1 of the Louvain algorithm

  • MoveNodes refers to the function in the pseudocode on page 28 of the slides.
  • The \(\cal H\) function here could be either the modularity or the CPM function, so you will have to pass this quality function (by pointer or by name) as an argument. There may be other clever ways to implement this as well.
  • Note that line 16 is somewhat different from the illustrations of Phase 1 earlier in the slides. Specifically, for each node \(v\), you must examine the possibility of moving it to its own cluster (with just that node) or to any other cluster.
  • The rest of the pseudocode should be clear, especially if you paid attention in class.
  • What sort of sanity checks can you try out to get confidence in your implementation?
    • In the guts of line 16, every time you move \(v\) to a different community, check whether the change in the quality function that you have computed efficiently equals the difference in the values of the quality function computed using the full formula (inefficiently), as shown in the sketch after this list.
    • You can also explicitly compute the quality value using the full formula (from scratch) and print it at line 19, or check that it has increased.
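
For example, a check along these lines inside line 16 would catch most bugs in the efficient computation of the change. It uses the hypothetical Partition class and modularity function sketched in part 1; the variables G, gamma, v, c, old_comm, and delta_h are assumed to be the ones in your MoveNodes loop (substitute cpm when optimizing the CPM value).

```python
# After computing delta_h, the efficient estimate of the change in quality when v moves
# from old_comm to community c, verify it against a full (slow) recomputation.
h_old = modularity(G, partition, gamma)
partition.move(v, c)                 # trial move
h_new = modularity(G, partition, gamma)
partition.move(v, old_comm)          # undo the trial move
assert abs((h_new - h_old) - delta_h) < 1e-6, (
    f"efficient delta {delta_h} != brute-force delta {h_new - h_old}")
```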

3 Implement the Louvain function

  • I am referring to the top-level function on page 27.
  • Do not implement lines 5-8.
  • Instead, simply pass the result of MoveNodes back to the function repeatedly until the value of the quality function does not improve.
  • To decide when to stop these iterations, you can use the following criterion: the relative change in quality between two consecutive iterations falls below some threshold, e.g., \(10^{-3}\).
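
A minimal sketch of this outer loop, assuming a hypothetical move_nodes function (your implementation from the previous section) that returns the updated partition, and a quality function such as the modularity or cpm sketches from part 1:

```python
def louvain(G, partition, quality, gamma=1.0, tol=1e-3):
    # Repeatedly apply MoveNodes until the relative improvement in the quality function
    # between two consecutive iterations falls below tol.
    h_old = quality(G, partition, gamma)
    while True:
        partition = move_nodes(G, partition, quality, gamma)
        h_new = quality(G, partition, gamma)
        if abs(h_new - h_old) <= tol * max(abs(h_old), 1e-12):   # guard against h_old == 0
            break
        h_old = h_new
    return partition
```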

4 Implement the MoveNodesFast function, i.e., one iteration of the Leiden algorithm

  • The pseudocode for this function is at the bottom of page 44 of the slides.
  • \(Q\) is supposed to be a queue, but in line 21, you should add to \(Q\) only those nodes in \(N\) that are not already in \(Q\). Clearly, iterating over \(Q\) to check membership is inefficient. Therefore, you may not want to implement \(Q\) with an actual queue alone (see the sketch after this list).
  • Otherwise, this function should also be easy to implement.
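
One simple way to get constant-time membership tests is to pair a queue with a set that mirrors its contents; a sketch (the variable names and the nodes_to_revisit placeholder are illustrative):

```python
from collections import deque

# FIFO queue of nodes to process, plus a set recording which nodes are currently queued,
# so that the "is u already in Q?" check needed at line 21 takes constant time.
queue = deque(G.nodes())
in_queue = set(G.nodes())

while queue:
    v = queue.popleft()
    in_queue.discard(v)
    # ... try to move v to the best community, as in the MoveNodesFast pseudocode ...
    # If v was moved, re-queue the nodes specified in line 21 that are not already queued.
    for u in nodes_to_revisit:
        if u not in in_queue:
            queue.append(u)
            in_queue.add(u)
```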

5 Implement the RefinePartition and MergeNodeSubsets functions

  • Yes, I said in class that I would not ask you to implement these functions, but I realised later that they are very important for the success of the Leiden algorithm. The pseudocode is on page 50 of the slides. Use \(\theta = 0.1\).
  • RefinePartition should be easy to implement. Hint: For the sake of convenience, when you are invoking MergeNodeSubsets, do not pass the entire partition \(P_{\mathrm{refined}}\). Instead pass only the nodes in the community \(C\) but with each node in an individual community.
  • Since MergeNodeSubsets looks very complex, you can break up the implementation into pieces. Let us name its arguments as \(G\), \(P\), and \(S\).
    1. Write a function CheckGammaConnected that takes two arguments \(T\) and \(S\). Both are sets of nodes, where \(T\) is a subset of \(S\). This function simply checks if the number of edges that connect the nodes in \(T\) to the nodes in \(S-T\) is at least as large as the product of \(\gamma\), the number of nodes in \(T\), and the number of nodes in \(S-T\). (A sketch of this function appears after this list.)
    2. To implement step 34, you can invoke CheckGammaConnected repeatedly, with the singleton set \(\{v\}\) (for each node \(v\) in \(S\)) and \(S\) as the two arguments.
    3. To implement step 37, you can invoke CheckGammaConnected repeatedly with each community in \(P\) and \(S\) as the two arguments. Here, be sure that each community in \(P\) that you consider is a subset of \(S\). This condition should hold true if you follow the hint above.
    4. In line 38, you can compute the value on the right-hand side of the big curly brace for every community in \(\cal T\) (all these communities have passed the test in line 37). To select one of these communities, make a random choice with probabilities proportional to these values. In Python, you can use numpy.random.choice, passing these (normalized) probabilities as the p argument.
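
A sketch of CheckGammaConnected and of the weighted random selection in line 38, assuming the graph G is a networkx Graph; candidates (a list holding the communities in \(\cal T\)) and values (the corresponding right-hand-side values) are illustrative names.

```python
import numpy as np

def check_gamma_connected(G, T, S, gamma):
    # Is the number of edges between T and S - T at least gamma * |T| * |S - T|?
    T = set(T)
    rest = set(S) - T
    cut = sum(1 for u in T for w in G.neighbors(u) if w in rest)
    return cut >= gamma * len(T) * len(rest)

# Line 38: choose one community from candidates with probability proportional to values.
probs = np.asarray(values, dtype=float)
probs = probs / probs.sum()          # numpy.random.choice expects probabilities that sum to 1
chosen = candidates[np.random.choice(len(candidates), p=probs)]
```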

6 Implement the Leiden function

  • The pseudocode for this function is on page 44 of the slides.
  • Do not implement lines 7-8.
  • Instead, simply pass the result of RefinePartition back to the Leiden function repeatedly until the value of the quality function does not improve.
  • You can use the same criterion for stopping as for Louvain.

7 Compute disconnected and badly connected communities

  • Given a partition, i.e., a set of communities computed by any algorithm, determine for each community whether it is disconnected, i.e., whether the subgraph induced by its nodes has more than one connected component.
  • Further, for each community \(c\), run the Leiden algorithm on just the subgraph induced by the nodes in \(c\). If one iteration of Leiden breaks \(c\) up into smaller communities, then deem \(c\) to be badly connected.
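
A sketch of both checks, assuming networkx; one_leiden_iteration is a placeholder for a single iteration of your own Leiden implementation, and comm_to_nodes refers to the hypothetical Partition class from part 1.

```python
import networkx as nx

def is_disconnected(G, community_nodes):
    # A community is disconnected if the subgraph induced by its nodes has more than
    # one connected component.
    return not nx.is_connected(G.subgraph(community_nodes))

def is_badly_connected(G, community_nodes):
    # Run one iteration of the Leiden algorithm on the induced subgraph; if it breaks
    # the community into smaller communities, deem the community badly connected.
    sub = G.subgraph(community_nodes).copy()
    refined = one_leiden_iteration(sub)
    return len(refined.comm_to_nodes) > 1
```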

8 Read input graphs

  • Now turn your attention to running your implementations of the Louvain and Leiden algorithms on protein interaction networks.
  • Read the following two networks:
    • The weighted human protein interaction network. Each line represents a directed edge from the node in the first column to the node in the second column with a weight equal to the value in the third column. For this assignment, you should ignore the weight column. You should also ignore the fourth column. Since the Louvain and Leiden algorithms operate on undirected networks, store this directed graph as an undirected graph. In other words, if the graph contains both the directed edges \((u, v)\) and \((v, u)\), create one undirected edge \((u,v)\). If the graph just contains the directed edge \((u,v)\) (and none in the other direction), create one undirected edge \((u,v)\).
    • The unweighted STRING network. Each line represents an undirected edge with the two node identifiers separated by a comma.
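
A sketch of the parsing, assuming the human network uses whitespace-separated columns and the STRING network uses comma-separated columns, with no header lines; adjust the delimiters, paths, and any header handling to match the actual files.

```python
import networkx as nx

def read_human_ppi(path):
    # Columns: tail node, head node, weight, <ignored>.  Ignore the third and fourth
    # columns; adding both (u, v) and (v, u) to an undirected Graph yields one edge.
    G = nx.Graph()
    with open(path) as f:
        for line in f:
            cols = line.split()
            if len(cols) >= 2:
                G.add_edge(cols[0], cols[1])
    return G

def read_string_network(path):
    # Each line: two node identifiers separated by a comma.
    G = nx.Graph()
    with open(path) as f:
        for line in f:
            if line.strip():
                u, v = line.strip().split(",")[:2]
                G.add_edge(u, v)
    return G
```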

9 Execute and compare algorithms

  • Run your implementations of the Louvain and Leiden algorithms on each of these networks to optimize the (a) modularity and (b) CPM value. For each, try three values of \(\gamma\): 0.1, 1, and 10. Therefore, there are 24 combinations (2 graphs \(\times\) 2 algorithms \(\times\) 2 quality functions \(\times\) 3 values of \(\gamma\)).
  • Record the running time and the modularity or CPM value after each iteration. Here, one iteration means one call to MoveNodes or MoveNodesFast. It is important to record the time taken by the process itself rather than the wall-clock time. Find out how to do this accurately in the language you use.
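
In Python, for example, time.process_time() returns the CPU time consumed by the current process rather than the wall-clock time. A sketch of the per-iteration bookkeeping, where move_nodes, quality, partition, gamma, and max_iterations are the hypothetical names used in the earlier sketches:

```python
import time

records = []                                   # (iteration, cumulative CPU time, quality value)
start = time.process_time()                    # process CPU time, not wall-clock time
for it in range(1, max_iterations + 1):
    partition = move_nodes(G, partition, quality, gamma)   # or move_nodes_fast for Leiden
    records.append((it, time.process_time() - start, quality(G, partition, gamma)))
```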

10 What to submit

A PDF file on Canvas that contains the following items:

  1. A pointer to your code that is accessible to both tmmurali@gmail.com and mrumi@vt.edu.
  2. (20 points for Louvain results, 30 points for Leiden results) Change in the modularity values and CPM values with increasing number of iterations. The plots are like Figure 9 in the paper. The \(x\)-axis is the running time. The \(y\)-axis is the modularity or CPM value. Each point on a curve is the quality achieved at a particular iteration, so reading along a curve from left to right, the \(i\)th point shows the running time and quality value after iteration \(i\). Show all values of \(\gamma\) for one quality function in the same plot. Show both algorithms on the same plot. Show the results for each graph and for each quality function on a different plot. Therefore, there will be four plots, each with six curves. Make sure you use colours and thicknesses to differentiate between the curves clearly, e.g., one colour for the Louvain algorithm and one colour for the Leiden algorithm, and different line styles or thicknesses for the different values of \(\gamma\). (A minimal plotting sketch appears after this list.)
  3. (30 points) Change in the fraction of disconnected and badly connected communities with increasing number of iterations. These plots are like Figure 4 in the paper. The \(x\)-axis shows the iteration number. The \(y\)-axis shows the percentage of communities that are disconnected or badly connected. Create these plots only for \(\gamma = 1\). Each plot should be for a (graph, quality function) pair, so there will be four plots. In each plot, display the fraction of disconnected communities computed by the Louvain algorithm, the fraction of badly connected communities computed by the Louvain algorithm, and the fraction of badly connected communities computed by the Leiden algorithm, each for the first four iterations.
  4. (10 points) Any observations you have on the results.
  5. (10 points) Any observations you have on the assignment and the challenges you face.
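
For the plots in item 2, one way to keep the six curves distinguishable is to fix the colour by algorithm and the line style by \(\gamma\). This is only a sketch, assuming matplotlib and a hypothetical results dictionary that maps (algorithm, \(\gamma\)) to a list of per-iteration (running time, quality) pairs for a fixed graph and quality function.

```python
import matplotlib.pyplot as plt

colours = {"Louvain": "tab:blue", "Leiden": "tab:orange"}
styles = {0.1: ":", 1: "-", 10: "--"}

for (algo, gamma), points in results.items():
    times, qualities = zip(*points)
    plt.plot(times, qualities, color=colours[algo], linestyle=styles[gamma],
             linewidth=2, label=f"{algo}, gamma={gamma}")
plt.xlabel("Running time (seconds)")
plt.ylabel("Modularity")            # or "CPM value", depending on the plot
plt.legend()
plt.savefig("human-ppi-modularity.pdf")
```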

Grading for items 4 and 5 will be somewhat subjective.

11 Tips

This assignment is challenging. Here are some ideas to help you succeed.

  1. Your highest priority must be debugging the functions you implement to efficiently compute the change in the modularity or CPM when you move a node from one community to another. Debug these functions first! In line 16 of the pseudocode for MoveNodes, you will test moving the node \(v\) to every community \(C\), including the empty community. Here, you should create temporary variables that store the new partition in which \(v\) is in \(C\) and compute the modularity/CPM value of this new partition explicitly using the quadratic-time algorithm. Compare the difference between this value and \(h_{old}\) to the value you compute for the change in modularity/CPM using the efficient function.
  2. Test your functions on small graphs first where you know or can figure out the best modularity. For example, a path, a clique or a barbell graph. There are also special graphs such as the karate club graph or the Les Miserables graph on which other scientists have computed the optimal modularity.
  3. When you get to the PPI networks in this assignment, your code will be slow. It may take hours to complete even one call to MoveNodes. To speed it up, change the implementation of line 16 of the pseudocode for MoveNodes and line 17 of MoveNodesFast. Instead of trying to move node \(v\) to every cluster in the partition, consider only those clusters that contain a node \(u\) that is a neighbour of \(v\). Thus, the number of clusters to which you try to move \(v\) will be at most the number of neighbours of \(v\). Consider moving \(v\) to an empty cluster as well.
  4. For further speed, consider the formulae for the change in modularity that I posted on Piazza. One of them is \(\Delta{\cal H}_{\cal P}(v \rightarrow \emptyset) = - \frac{1}{m}\sum_{u \in D} a(u,v) + \frac{\gamma d(v)}{2m^2}\sum_{u \in D} d(u)\). Precompute each of the two sums and cache them so that you can look them up in the functions that compute the change in modularity (see the sketch at the end of these tips). If you implement these ideas, be sure to write new functions for MoveNodes and MoveNodesFast so that you still retain the old, slower, but bug-free implementations.

    • Let us call the first sum \(d_{in}(v, C)\), the number of edges that connect \(v\) to the nodes in community \(C\) (this community may be the one that contains \(v\)). When does the value of \(d_{in}(v, C)\) change? Only when you actually move \(v\) out of \(C\) or into \(C\).
      • When you move \(v\) out of \(C\), \(d_{in}(v, C)\) will become zero. Further, for every neighbour \(u\) of \(v\) in the graph that is also in \(C\), \(d_{in}(u, C)\) will decrease by 1.
      • When you move \(v\) into \(C\), \(d_{in}(v, C)\) will become equal to the number of neighbours of \(v\) in the graph that are also in \(C\). For every such neighbour \(u\) of \(v\), \(d_{in}(u, C)\) will increase by 1.
      • So you can store the \(d_{in}\) in a two-level hashmap (dictionary) and update the appropriate entries whenever you move a node from one cluster to another. The keys of the first level of this hashmap can be the node id and keys of the second level hashmap can be the cluster id.
      • Therefore, inside MoveNodes or MoveNodesFast, you will only need to do a lookup in the hash table.
      • Of course, you will need to compute these values before you invoke MoveNodes or MoveNodesFast the first time, but computing these initial values is simple and I am sure you can work them out.
    • Consider the second sum \(\sum_{u \in D} d(u)\). This value is different for different communities but stays fixed unless you move a node from one community to another.
      • Store these values (call them \(d_{sum}\)) in a hashmap keyed by cluster id.
      • Initially, when every node is in its own community, the \(d_{sum}\) values are just the node degrees.
      • When you move a node \(v\) out of a community \(C\), \(d_{sum}(C)\) decreases by \(d(v)\).
      • When you move a node \(v\) into a community \(C\), \(d_{sum}(C)\) increases by \(d(v)\).
      • Therefore, inside MoveNodes or MoveNodesFast, you will only need to do a lookup in the hash table.
  5. If you find implementing the ideas in #4 difficult, here is another approach to get results faster. Run your code on a smaller graph. Just taking the first \(k\) lines from the file containing the interaction network is not appropriate since these edges may not form a connected graph, let alone a community. I suggest that you use the idea of \(k\) cores starting at page 29 of http://bioinformatics.cs.vt.edu/~murali/teaching/2022-fall-cs3824/lectures/lecture-08-modules.pdf. networkx has functions to compute a \(k\) core. You can try a value of \(k\) such as 5 or 10 to see if your code runs efficiently. It will help you to print the number of nodes and edges in the cores you compute so that you get a sense of how the code scales as you decrease \(k\); keep in mind that the \(k\) core size will increase as you decrease \(k\). I want you to submit in your report the smallest value of \(k\) for which your code runs in a reasonable amount of time, e.g., one hour.
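
Here is a sketch of the bookkeeping in tip 4, following the update rules in the bullets above. The names (d_in, d_sum, record_move) and the Partition class are the illustrative ones from the earlier sketches. Note that, per the rules above, the entry d_in[u][C] is kept up to date for the community C that currently contains u; the entry for a community that a node has just left is simply cleared.

```python
from collections import defaultdict

d_in = defaultdict(lambda: defaultdict(int))   # d_in[v][C]: edges between v and community C
d_sum = {}                                     # d_sum[C]: total degree of community C

# Initial values for the singleton partition (every node in its own community).
for v in G.nodes():
    c_v = partition.community_of(v)
    d_sum[c_v] = G.degree(v)
    d_in[v][c_v] = 0                           # no edges from v into its own singleton community

def record_move(G, partition, v, old_comm, new_comm):
    # Call this whenever you move v from old_comm to new_comm.  The memberships of v's
    # neighbours are unchanged by the move, so the loop below is correct whether you
    # update the partition before or after calling this function.
    d_sum[old_comm] -= G.degree(v)
    d_sum[new_comm] = d_sum.get(new_comm, 0) + G.degree(v)
    d_in[v][old_comm] = 0                      # clear the entry for the community v has left
    d_in[v][new_comm] = 0
    for u in G.neighbors(v):
        c_u = partition.community_of(u)
        if c_u == old_comm:
            d_in[u][old_comm] -= 1             # u loses the edge (u, v) from its own community
        if c_u == new_comm:
            d_in[u][new_comm] += 1             # u gains the edge (u, v) in its own community
            d_in[v][new_comm] += 1
    # d_in[v][new_comm] now equals the number of neighbours of v inside new_comm.
```

With these caches in place, the functions that compute the change in modularity inside MoveNodes or MoveNodesFast need only dictionary lookups, as the tips above say.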
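
For tip 5, a sketch of extracting \(k\)-cores with networkx; if your graph contains self-loops (PPI networks may include self-interactions), remove them first, since the networkx core routines do not allow self-loops.

```python
import networkx as nx

G.remove_edges_from(list(nx.selfloop_edges(G)))     # k_core requires a graph without self-loops
for k in (10, 5):
    core = nx.k_core(G, k=k)
    print(f"{k}-core: {core.number_of_nodes()} nodes, {core.number_of_edges()} edges")
```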