Write code to parse four NetPath pathways: EGFR, TGF-beta, TNF-alpha, and Wnt. Each pathway has two files, e.g., Wnt-nodes.txt and Wnt-edges.txt. Your code should parse both files for each pathway. As an example, I explain the format of the files for the Wnt pathway. In Wnt-nodes.txt, the first column contains node ids and the second column contains the node types; ignore the third column. The second column is very useful since it tells you if a node is a receptor (source), a transcription factor (tf, target), or an internal node (none). For each pathway, your code should record its receptors and its transcription factors in an appropriate data structure. In Wnt-edges.txt, the first three columns and the column with the header edge_type contain the information you need to construct a directed graph from each NetPath pathway.
If edge_type is physical, add two edges to the graph, one directed from tail to head and the other from head to tail.
If edge_type is anything else, add only one edge, directed from tail to head.
You will need to store each NetPath pathway in its own graph data structure.
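The parsing rules above can be sketched roughly as follows. This is only one possible layout, not a required design. It assumes the files are tab-separated, that header lines start with "#", that the first two columns of the edges file are the tail and the head, and that node types are spelled exactly "receptor" and "tf"; check these against the actual files. The function names parse_nodes and parse_edges are made up for illustration.

```python
def parse_nodes(lines):
    """Return (receptors, tfs) as two sets of node ids."""
    receptors, tfs = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed header/comment marker
            continue
        cols = line.split("\t")               # assumed tab-separated
        node, node_type = cols[0], cols[1]    # third column is ignored
        if node_type == "receptor":
            receptors.add(node)
        elif node_type == "tf":
            tfs.add(node)
    return receptors, tfs


def parse_edges(lines):
    """Return a directed graph as a dict: node -> set of successor nodes."""
    graph = {}
    type_col = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Locate the edge_type column by its header name.
            header = line.lstrip("#").strip().split("\t")
            if "edge_type" in header:
                type_col = header.index("edge_type")
            continue
        if type_col is None:
            raise ValueError("no header line naming an edge_type column")
        cols = line.split("\t")
        tail, head = cols[0], cols[1]         # assumed column order
        graph.setdefault(tail, set()).add(head)
        graph.setdefault(head, set())
        if cols[type_col] == "physical":
            # Physical interactions are undirected: add the reverse edge too.
            graph[head].add(tail)
    return graph
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, e.g. parse_nodes(open("Wnt-nodes.txt")).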
Now read the weighted human protein interaction network. This file contains all the "anonymous" interactions. Each line represents a directed edge from the node in the first column to the node in the second column with a weight equal to the value in the third column. You may ignore the fourth column.
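A minimal sketch of reading the interaction network, assuming the file is tab-separated and that header or comment lines start with "#" (both are assumptions about the file, not stated above). The name parse_interactome is hypothetical. Edges are stored in a dictionary keyed by (tail, head), which makes the later membership tests cheap.

```python
def parse_interactome(lines):
    """Return a dict mapping each directed edge (u, v) to its weight."""
    weights = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed header/comment marker
            continue
        cols = line.split("\t")               # assumed tab-separated
        # Columns: tail, head, weight; the fourth column is ignored.
        weights[(cols[0], cols[1])] = float(cols[2])
    return weights
```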
Perform a sanity check: verify that every NetPath pathway is a subgraph of the human protein interaction network.
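One way to run this check is to list the pathway edges that are absent from the interaction network; an empty list means the pathway passes. The sketch below assumes the pathway is stored as a dict from node to a set of successors and the network as a collection of (tail, head) pairs. Both representations, and the name missing_edges, are illustrative choices. It checks edges only, so a pathway node with no edges is not examined.

```python
def missing_edges(pathway_graph, interactome_edges):
    """Return the sorted list of pathway edges not found in the interactome."""
    return sorted(
        (u, v)
        for u, successors in pathway_graph.items()
        for v in successors
        if (u, v) not in interactome_edges
    )
```

Printing the missing edges, rather than only a yes/no answer, makes it easier to see whether a failure comes from the data or from a parsing mistake.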
I have provided you the files output by running PathLinker on every NetPath pathway. A file with a name like "Wnt-k\20000-ranked-edges.txt" contains the edges in the top 20,000 paths computed by PathLinker for the Wnt pathway. Each line contains a directed edge (which should be an edge in the human protein interaction network) and a rank. The rank is the index of the first path in which that edge appears when we run PathLinker. Your code for parsing this file should be simple since you only need to record the edges in the order they appear in the file. Note that there may be multiple edges with the same rank. I will leave it to you to decide how to store them.
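As one option for storing edges that share a rank, the sketch below groups them into a list per rank. It assumes tab-separated columns in the order tail, head, rank, with header lines starting with "#"; verify this against the provided files. The name parse_ranked_edges is hypothetical.

```python
def parse_ranked_edges(lines):
    """Return a list of (rank, [edges]) pairs in increasing order of rank."""
    by_rank = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumed header/comment marker
            continue
        cols = line.split("\t")               # assumed tab-separated
        tail, head, rank = cols[0], cols[1], int(cols[2])
        # Edges with the same rank are kept in file order within that rank.
        by_rank.setdefault(rank, []).append((tail, head))
    return sorted(by_rank.items())
```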
For each pathway, your code should now compare the ranked list of edges from the previous step to the edges in the corresponding pathway graph that you built when you parsed the pathway files. At each unique rank, determine if each edge with that rank is present in the pathway. Compute the precision and recall at this rank. You have to be careful about the denominator for the precision: is it the current rank or the total number of edges you have processed up to and including this rank? Now plot the precision against the recall.
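A rough sketch of the per-rank computation follows. It uses the standard definitions: precision is the number of correct edges divided by the number of edges processed so far, and recall is the number of correct edges divided by the number of edges in the pathway. The input shapes (a list of (rank, edges) pairs and a set of pathway edges), the assumption that no edge is listed twice, and the name precision_recall are illustrative.

```python
def precision_recall(ranked, pathway_edges):
    """Return one (precision, recall) pair per unique rank.

    ranked: list of (rank, [edges]) pairs in increasing order of rank.
    pathway_edges: non-empty set of (tail, head) pairs in the pathway.
    """
    if not pathway_edges:
        raise ValueError("pathway has no edges; recall is undefined")
    points = []
    processed = 0  # edges seen so far, across all ranks up to this one
    correct = 0    # how many of those edges are in the pathway
    for _rank, edges in ranked:
        for edge in edges:
            processed += 1
            if edge in pathway_edges:
                correct += 1
        # One point per unique rank, after all edges at that rank are counted.
        points.append((correct / processed, correct / len(pathway_edges)))
    return points
```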
You may display the precision-recall curves for all four pathways on a single plot.
This algorithm is fairly simple. For each pathway, you should compute the shortest path from each source to each target. Output the union of these paths. There are a few factors to consider.
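A minimal sketch of the union-of-shortest-paths idea, assuming an unweighted graph so that breadth-first search suffices; if edge weights are to be used, Dijkstra's algorithm with a suitable edge cost would take its place. The graph is assumed to be a dict from node to a set of successors, and the name shortest_path_edges is made up for this example. When several shortest paths tie, this sketch keeps an arbitrary one of them.

```python
from collections import deque


def shortest_path_edges(graph, sources, targets):
    """Return the union of edges on one shortest path per (source, target)."""
    union = set()
    for s in sources:
        # Breadth-first search from s, remembering each node's predecessor.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        # Walk back from every reachable target to recover its path.
        for t in targets:
            if t not in parent:
                continue  # t is unreachable from s
            node = t
            while parent[node] is not None:
                union.add((parent[node], node))
                node = parent[node]
    return union
```

Running one search per source, rather than one per source-target pair, keeps the number of searches equal to the number of receptors.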
This algorithm may be trickier to implement. Remember that we defined \(A\) to be the adjacency matrix of the graph, where \(A_{uv} = 1\) if and only if there is a directed edge from node \(u\) to node \(v\).
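As a small illustration of this definition only, the sketch below builds \(A\) as a list of rows from a graph stored as a dict from node to a set of successors. The node ordering (sorted ids) and the name adjacency_matrix are arbitrary choices for the example; for a graph the size of the interaction network, a sparse representation would likely be needed instead.

```python
def adjacency_matrix(graph):
    """Return (nodes, A) where A[i][j] == 1 iff nodes[i] -> nodes[j] is an edge."""
    # Collect every node, including those that only appear as successors.
    nodes = sorted(set(graph) | {v for succ in graph.values() for v in succ})
    index = {node: i for i, node in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u, successors in graph.items():
        for v in successors:
            A[index[u]][index[v]] = 1
    return nodes, A
```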
Grading for items 3 and 4 will be somewhat subjective.