CS 3824: Assignment 1

1 Create the input datasets

Write code to parse four NetPath pathways: EGFR, TGF-beta, TNF-alpha, and Wnt. Each pathway has two files, e.g., Wnt-nodes.txt and Wnt-edges.txt. Your code should parse both files for each pathway. As an example, I explain the format of the files for the Wnt pathway. In Wnt-nodes.txt, the first column contains node ids and the second column contains the node types; ignore the third column. The second column is very useful since it tells you if a node is a receptor (source), a transcription factor (tf, target), or an internal node (none). For each pathway, your code should record its receptors and its transcription factors in an appropriate data structure. In Wnt-edges.txt, the first three columns and the column with the header edge_type contain the information you need to construct a directed graph from each NetPath pathway.

  • If edge_type is physical, add two edges to the graph, one directed from tail to head and the other from head to tail.
  • If the edge_type is anything else, add only one edge, directed from tail to head.
  • (Additional rule based on a question on Piazza) If a pair of nodes, for example "P49674" and "O14640" in the Wnt-edges.txt file, has both a physical edge and a Phosphorylation edge, you should store only one copy. If at least one of the edge types is directed, store only the directed version. If all edge types are undirected, then store one edge in each direction.
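The three rules above can be sketched as follows. This is only one possible reading, not a reference solution. It assumes you have already reduced each line of the edges file to a (tail, head, edge_type) tuple; the function name build_pathway_edges is hypothetical, and the comparison with "physical" is made case-insensitive as a precaution.

```python
from collections import defaultdict


def build_pathway_edges(records):
    """Return a set of directed (tail, head) edges.

    `records` is an iterable of (tail, head, edge_type) tuples
    (an assumed intermediate form, not the raw file format).
    """
    directed = defaultdict(set)  # unordered pair -> directed edges seen for it
    undirected = set()           # unordered pairs that have a physical record
    for tail, head, edge_type in records:
        pair = frozenset((tail, head))
        if edge_type.lower() == "physical":
            undirected.add(pair)
        else:
            directed[pair].add((tail, head))

    edges = set()
    # Any pair with at least one directed type keeps only its directed edges.
    for dir_edges in directed.values():
        edges.update(dir_edges)
    # Pairs with only physical records get one edge in each direction.
    for pair in undirected:
        if pair in directed:
            continue
        nodes = tuple(pair)
        u, v = nodes[0], nodes[-1]  # nodes has one element for a self-loop
        edges.add((u, v))
        edges.add((v, u))
    return edges
```

Using a set of tuples means duplicate records collapse automatically; you can load the result into whatever graph structure you prefer.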

You will need to store each NetPath pathway in its own graph data structure.
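As a starting point for the per-pathway bookkeeping, here is a minimal sketch that records receptors and transcription factors from the lines of a nodes file. Several details are assumptions you should check against the actual files: that columns are tab-separated, that header lines begin with "#", and that the type labels are spelled exactly "receptor" and "tf". The function name parse_node_lines is hypothetical.

```python
def parse_node_lines(lines):
    """Return (receptors, tfs) as two sets of node ids."""
    receptors, tfs = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: header/comment lines start with '#'
        cols = line.split("\t")  # assumption: tab-separated columns
        node_id, node_type = cols[0], cols[1]  # third column is ignored
        if node_type == "receptor":
            receptors.add(node_id)
        elif node_type == "tf":
            tfs.add(node_id)
        # any other type (e.g. "none") is an internal node
    return receptors, tfs
```

You could call it as parse_node_lines(open("Wnt-nodes.txt")) and keep the two sets next to that pathway's graph.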

Now read the weighted human protein interaction network. This file contains all the "anonymous" interactions. Each line represents a directed edge from the node in the first column to the node in the second column with a weight equal to the value in the third column. You may ignore the fourth column.

Perform a sanity check: verify that every NetPath pathway is a subgraph of the human protein interaction network.
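A sketch of reading the network and running the sanity check is given below. It assumes whitespace-separated columns and header lines that begin with "#"; both assumptions, and the function names, are mine rather than part of the file specification.

```python
def parse_interactome(lines):
    """Map each directed edge (tail, head) to its weight."""
    weights = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: header/comment lines start with '#'
        cols = line.split()  # assumption: whitespace-separated columns
        # columns: tail, head, weight; the fourth column is ignored
        weights[(cols[0], cols[1])] = float(cols[2])
    return weights


def check_subgraph(pathway_edges, interactome_edges):
    """Return the pathway edges that are absent from the interactome."""
    return sorted(set(pathway_edges) - set(interactome_edges))
```

An empty list from check_subgraph means every pathway edge, and hence every node that has an edge, appears in the network.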

2 Parse the output datasets for PathLinker

I have provided you the files output by running PathLinker on every NetPath pathway. A file with a name like "Wnt-k_20000-ranked-edges.txt" contains the edges in the top 20,000 paths computed by PathLinker for the Wnt pathway. Each line contains a directed edge (which should be an edge in the human protein interaction network) and a rank. The rank is the index of the first path in which that edge appears when we run PathLinker. Your code for parsing this file should be simple since you only need to record the edges in the order they appear in the file. Note that there may be multiple edges with the same rank. I will leave it to you to decide how to store them.
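One way to store edges that share a rank is a list of (rank, edges) groups in file order, sketched below. The column order (tail, head, rank), whitespace separation, "#" header lines, and the assumption that the file is already sorted by rank are all guesses to verify against the actual files; parse_ranked_edges is a hypothetical name.

```python
def parse_ranked_edges(lines):
    """Return a list of (rank, [edges]) groups in the order they appear."""
    groups = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: header/comment lines start with '#'
        cols = line.split()
        tail, head, rank = cols[0], cols[1], int(cols[2])
        if groups and groups[-1][0] == rank:
            # same rank as the previous line: extend the current group
            groups[-1][1].append((tail, head))
        else:
            groups.append((rank, [(tail, head)]))
    return groups
```

Grouping at parse time keeps ties together, which is convenient when you later need one measurement per unique rank.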

3 Compute precision-recall curves

For each pathway, your code should now compare the ranked list of edges from the previous step to the edges in the corresponding pathway that you created when you parsed the pathway. At each unique rank, determine if each edge in that rank is present in the pathway. Compute the precision and recall at this rank. You have to be careful about the denominator for the precision: is it the current rank or the total number of edges you have processed up to this rank? Now plot the precision against the recall.
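The per-rank computation can be sketched as follows. The input is assumed to be a list of (rank, edges) groups, which is a hypothetical layout, and the sketch takes the usual definition of precision as true positives divided by the number of predictions made so far.

```python
def precision_recall(ranked_groups, pathway_edges):
    """Return a list of (rank, precision, recall), one entry per unique rank."""
    positives = set(pathway_edges)
    if not positives:
        raise ValueError("pathway has no edges")
    seen = set()  # every edge processed so far
    hits = 0      # processed edges that are in the pathway
    curve = []
    for rank, edges in ranked_groups:
        for edge in edges:
            if edge in seen:
                continue  # count each edge only once
            seen.add(edge)
            if edge in positives:
                hits += 1
        if not seen:
            continue  # nothing processed yet, precision is undefined
        curve.append((rank, hits / len(seen), hits / len(positives)))
    return curve
```

The recall values then go on the x-axis and the precision values on the y-axis of your plot.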

You may display the precision-recall for all four pathways on a single plot.

4 Implement the shortest path algorithm

This algorithm is fairly simple. For each pathway, you should compute the shortest path from each source to each target. Output the union of these paths. There are a few factors to consider.

  • You can use an implementation of Dijkstra's algorithm in a package or library. Look very carefully at the documentation of the method. By default, some implementations will only return the length of the shortest path. Here, we are also interested in outputting the edges in the shortest paths.
  • If there are \(k\) sources and \(l\) targets, you can manage with just \(k\) invocations of Dijkstra's algorithm. In other words, you do not need \(kl\) invocations of the algorithm.
  • How will you deal with the possibility that the shortest path between two nodes is not unique? In other words, there may be more than one shortest path between two nodes. If you output just one, your precision and recall may be much smaller than if you output all the shortest paths.
  • Can you order the edges in the output in some meaningful way, taking inspiration from PathLinker, for example?
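To illustrate the first three points, here is a sketch that runs Dijkstra's algorithm once per source, remembers every tied predecessor, and then walks back from the targets to collect the edges on all shortest paths. It uses only the standard library rather than a graph package. The graph layout (a dict mapping each node to a dict of neighbor costs) and the function names are assumptions, as is the choice of non-negative edge costs; how you derive costs from the network weights is left to you. Ties are detected with exact equality, which can miss ties between floating-point sums.

```python
import heapq


def dijkstra_all_preds(graph, source):
    """Return (dist, preds); preds[v] holds every predecessor of v
    on some shortest path from source. Costs must be non-negative."""
    dist = {source: 0.0}
    preds = {source: set()}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                preds[v] = {u}
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                preds[v].add(u)  # another equally short way to reach v
    return dist, preds


def shortest_path_edges(graph, sources, targets):
    """Union of the edges on all shortest source-to-target paths."""
    edges = set()
    for s in sources:  # one Dijkstra invocation per source
        dist, preds = dijkstra_all_preds(graph, s)
        stack = [t for t in targets if t in dist]  # reachable targets only
        visited = set()
        while stack:
            v = stack.pop()
            if v in visited:
                continue
            visited.add(v)
            for u in preds[v]:
                edges.add((u, v))
                stack.append(u)
    return edges
```

The distances returned by dijkstra_all_preds are one possible basis for ordering the output edges.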

5 Implement the RWR algorithm

This algorithm may be trickier to implement. Remember that we defined \(A\) to be the adjacency matrix of the graph, where \(A_{uv} = 1\) if and only if there is a directed edge from node \(u\) to node \(v\).

  • You will have to convert the data structure in which you store the human protein interaction network into its adjacency matrix representation. The library you use may already have such a function.
  • Beware of the large size of the adjacency matrix, e.g., \(12,000 \times 12,000\). If you run into memory errors, you may have to consider a sparse matrix representation.
  • Compute the diagonal matrix \(D\), where the entry \(D_{uu}\) is the sum of all the non-diagonal entries in row \(u\) of \(A\).
  • Compute \((D^{-1}A)^T\), i.e., multiply the inverse of \(D\) by \(A\) and take the transpose of the product.
  • Next form the vector \(s\) we defined in class, where \(s(u) = 1/|S|\) if and only if \(u\) is one of the receptors in the current pathway you are processing and \(S\) is the set of all receptors in the current pathway. Otherwise \(s(u) = 0\).
  • Compute the vector \(p\) of stationary probabilities using \(p = (I - (1-q)(D^{-1}A)^T)^{-1}q s\), where \(q\) is the restart probability of the walk.
  • Next for each directed edge \((u, v)\) in the graph, compute the flux \(f(u, v) = p(u) w(u,v)/d_{out}(u)\). Here \(p(u)\) is the stationary probability of \(u\) that you computed in the previous step, \(w(u,v)\) is the weight of the edge from \(u\) to \(v\), and \(d_{out}(u)\) is the out-degree of \(u\), i.e., the total weight of the edges leaving \(u\).
  • Output the edges in decreasing order of edge flux. Write one edge per line (tail, head, and edge flux) to a tab-delimited file.
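The steps above can be sketched without building a dense matrix. Instead of inverting \(I - (1-q)(D^{-1}A)^T\), the sketch below iterates the fixed-point form \(p = q s + (1-q)(D^{-1}A)^T p\), which has the same solution for \(0 < q \le 1\); this is a swapped-in technique, not the matrix formula itself. Further assumptions: the graph is a dict mapping each node to a dict of positive edge weights, there are no self-loops, the weights are used in both the walk and the flux (which reduces to the 0/1 definition of \(A\) when all weights are 1), and nodes with no outgoing edges simply pass nothing on. The function names and the tolerance are hypothetical, and \(q\) must be supplied by you.

```python
def rwr(graph, receptors, q, tol=1e-12, max_iter=10000):
    """Return (p, d_out) for a random walk with restarts to `receptors`."""
    nodes = set(graph) | set(receptors)
    for nbrs in graph.values():
        nodes.update(nbrs)
    # d_out(u): total weight of the edges leaving u
    d_out = {u: sum(graph.get(u, {}).values()) for u in nodes}
    # s(u) = 1/|S| for receptors, 0 otherwise
    s = {u: 0.0 for u in nodes}
    for r in set(receptors):
        s[r] = 1.0 / len(set(receptors))

    p = dict(s)
    for _ in range(max_iter):
        new_p = {u: q * s[u] for u in nodes}
        for u, nbrs in graph.items():
            if d_out[u] == 0:
                continue
            share = (1 - q) * p[u] / d_out[u]
            for v, w in nbrs.items():
                new_p[v] += share * w  # applies (D^{-1}A)^T without a matrix
        delta = sum(abs(new_p[u] - p[u]) for u in nodes)
        p = new_p
        if delta < tol:
            break
    return p, d_out


def edge_fluxes(graph, p, d_out):
    """Return (tail, head, flux) triples in decreasing order of flux."""
    flux = [
        (u, v, p[u] * w / d_out[u])
        for u, nbrs in graph.items()
        for v, w in nbrs.items()
    ]
    flux.sort(key=lambda item: item[2], reverse=True)
    return flux
```

Each triple can then be written to the output file as "\t".join((tail, head, str(flux))). If you prefer the matrix formula, a sparse linear solver from scipy is an alternative to explicit inversion.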

6 Tips

  • You are welcome to use external libraries and packages unless I explicitly say that you should implement something yourself. If in doubt, ask on Piazza!
    • For storing graphs, networkx is ideal if you are using Python. Other languages have excellent graph libraries as well.
    • For plotting in Python, matplotlib and seaborn are excellent.
    • For linear algebra and matrix routines, numpy and scipy are fantastic resources.

7 What should you submit?

  • A PDF file on Canvas that contains
    1. (20 points) One plot showing the precision-recall curves for PathLinker on the TGF-beta, TNF-alpha, and Wnt pathways.
    2. (60 points; 40 points for RWR and 20 points for shortest path) One plot for each of the TGF-beta, TNF-alpha, and Wnt pathways showing the precision-recall curves for PathLinker, shortest path, and RWR algorithms.
    3. (10 points) Any observations you have on the results.
    4. (10 points) Any observations you have on the assignment and the challenges you faced.
    5. A link to your code on GitHub

Grading for items 3 and 4 will be somewhat subjective.