Biorithm
1.1
NetworkLego is a general computational framework for
This system consolidates datasets that contain information on known molecular interactions into a universal network. When queried with a molecular profile (typically, a gene expression data set) representing a cell state, it searches the universal network using an algorithm for finding heavily-weighted subgraphs to retrieve an "active network," which is a sub-network whose constituent molecules act in concert (as determined by the combined similarity of their measured profiles) to determine the state of the cell. NetworkLego also works when there is no network of molecular interactions available. In this situation, it finds dense subgraphs in a network consisting of edges between pairs of genes with statistically significant correlations in the molecular profile data.
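The fallback described above (building a network from significantly correlated gene pairs) can be illustrated with a small sketch. It is not NetworkLego's implementation: it assumes Pearson correlation, a null distribution pooled over independently shuffled profiles, and an empirical p-value cutoff. The function names and the `profiles` structure are hypothetical.

```python
import math
import random
from bisect import bisect_left
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def coexpression_edges(profiles, n_random=100, p_cutoff=0.01, seed=0):
    """Return {(gene_a, gene_b): correlation} for pairs with empirical p <= p_cutoff.

    profiles maps a gene name to its list of expression values (one per sample).
    Assumption: significance is judged on the absolute correlation.
    """
    rng = random.Random(seed)
    genes = sorted(profiles)
    pairs = list(combinations(genes, 2))

    # Null distribution: shuffle each gene's values independently, then pool
    # the absolute correlations of all pairs over n_random rounds.
    null = []
    for _ in range(n_random):
        shuffled = {g: rng.sample(profiles[g], len(profiles[g])) for g in genes}
        null.extend(abs(pearson(shuffled[a], shuffled[b])) for a, b in pairs)
    null.sort()

    edges = {}
    for a, b in pairs:
        r = pearson(profiles[a], profiles[b])
        # Empirical p-value: share of null values at least as extreme as |r|.
        at_least = len(null) - bisect_left(null, abs(r))
        p_value = (at_least + 1) / (len(null) + 1)
        if p_value <= p_cutoff:
            edges[(a, b)] = r
    return edges
```

The resulting edge dictionary plays the role of the correlation network in which dense subgraphs would then be sought. No multiple-hypothesis correction is applied in this sketch.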
Download the Biorithm package and follow the installation instructions for Biorithm. The executable will be available as active-networks/active-networks.
To run NetworkLego, you usually need several input files:
The current NetworkLego pipeline runs as follows:
pathway -d . -C 100
Compute NetworkLego using this threshold. The standard incantation is:
active-networks -d <directory> -n <network-file> -f <functions-file> -B <OBO-file> -p <pvalue-threshold> -C 100 -m <multiple-hypothesis-correction>
The options mean the following:
Flag | Description |
---|---|
B | the file containing the structure of the Gene Ontology (in OBO format). This file is optional. |
c | the name of the file containing “class” information. If you do not use the -d option, you must use this option in conjunction with the -g option. |
C | the number of times NetworkLego should randomise the gene expression data to compute the null distribution of gene expression correlation values in "random" gene expression data. |
d | the directory containing dataset.txt and class.txt |
f | the file containing functional annotations. |
g | the name of the file containing gene expression data. If you do not use the -d option, you must use this option in conjunction with the -c option. |
m | a string describing the type of correction you want to perform when testing multiple hypotheses. Legal values of this string are "Bonferroni", "Holms", and "FDR". If you do not provide this option, NetworkLego will not perform any correction. Currently, this option only controls the correction applied when selecting statistically significant co-expressed pairs of genes, which happens when you do not provide a network of molecular interactions. |
n | the file containing the interactions. |
p | the p-value cutoff that determines if a correlation value between two genes is statistically significant. The default is 0.01. |
r | the number of times you want to randomise the network to compute p-values for NetworkLego. |
t | the correlation threshold you computed in the previous step. |
z | just use this option blindly. It may go away in a future version of NetworkLego. |
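When scripting repeated runs, it can help to assemble the argument list programmatically. The sketch below is a hypothetical Python helper (not part of Biorithm) that uses only the flags documented in the table. It assumes that `-n`, `-f`, `-B`, and `-m` may be omitted and that the executable is reachable as `active-networks`.

```python
LEGAL_CORRECTIONS = ("Bonferroni", "Holms", "FDR")  # values listed for -m


def build_command(directory, network_file=None, functions_file=None,
                  obo_file=None, p_value=0.01, n_random=100,
                  correction=None, executable="active-networks"):
    """Return the argument list for one NetworkLego run.

    Hypothetical helper: the parameter names are illustrative, and treating
    -n, -f, -B and -m as optional is an assumption.
    """
    args = [executable, "-d", directory]
    if network_file is not None:
        args += ["-n", network_file]
    if functions_file is not None:
        args += ["-f", functions_file]
    if obo_file is not None:
        args += ["-B", obo_file]
    args += ["-p", str(p_value), "-C", str(n_random)]
    if correction is not None:
        if correction not in LEGAL_CORRECTIONS:
            raise ValueError(f"unknown correction: {correction!r}")
        args += ["-m", correction]
    return args
```

The returned list can be passed directly to `subprocess.run(args, check=True)`, which avoids shell quoting problems with file names that contain spaces.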
The software typically produces numerous output files containing different types of information about the networks analysed. All files are tab-delimited, unless otherwise noted, and hence can be opened in a spreadsheet or slurped into a database. All files have a common prefix (e.g., what you supply through the --experiment-name option or a string the software guesses from the argument to the --edges-file option). For the purpose of this documentation, let us assume that this common prefix is "network-lego-". The files produced by the software are:
This file has two parts. The first part details various statistics about each active network or network lego in a tab-delimited (columnar) format. The second part lists information about the connected components in each network.
Each line in the first part of the file corresponds to a single active network or network lego. The columns in the first part contain the following information (Note: only some columns may be relevant or useful for a particular application of NetworkLego):
Column Header | Description |
---|---|
ActiveNetworkName | An identifier for the active network. |
#Nodes | The number of nodes (genes, proteins, molecules, etc.) in the active network. |
#Edges | The number of edges (physical, regulatory, functional, or other type of interactions) in the active network. |
Total Edge Weight | The total weight of the edges in the network. |
Weighted density | The total weight of the edges in the network divided by the number of nodes in the network. |
Average Edge Weight | The total weight of the edges in the network divided by the number of edges in the network. |
Unweighted Density | The number of edges in the network divided by the number of nodes in the network. |
Unweighted Completeness | The number of edges in the network divided by the maximum number of edges possible (i.e., the number of pairs of nodes) in the network. |
Stouffer's z-score | When applicable, the Liptak-Stouffer z-score of the network. |
Gaussian p-value of Stouffer's z-score | The Gaussian p-value corresponding to the Liptak-Stouffer z-score. |
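The column definitions above translate directly into arithmetic on an edge list. The following Python sketch recomputes them for one network. It is illustrative only: the function names are hypothetical, nodes are inferred from the edges (so isolated nodes are ignored), and the Stouffer combination shown is the unweighted form with a one-sided upper-tail p-value, which may differ from what NetworkLego computes.

```python
import math
from statistics import NormalDist


def network_statistics(edges):
    """Summary statistics for one network given (node_a, node_b, weight) edges."""
    if not edges:
        raise ValueError("network has no edges")
    nodes = {node for a, b, _ in edges for node in (a, b)}
    n_nodes, n_edges = len(nodes), len(edges)
    total_weight = sum(weight for _, _, weight in edges)
    max_edges = n_nodes * (n_nodes - 1) / 2  # number of node pairs
    return {
        "#Nodes": n_nodes,
        "#Edges": n_edges,
        "Total Edge Weight": total_weight,
        "Weighted density": total_weight / n_nodes,
        "Average Edge Weight": total_weight / n_edges,
        "Unweighted Density": n_edges / n_nodes,
        "Unweighted Completeness": n_edges / max_edges if max_edges else 0.0,
    }


def stouffer_z(z_scores):
    """Unweighted Liptak-Stouffer combination: sum of z-scores over sqrt(count)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def gaussian_p_value(z):
    """One-sided (upper-tail) p-value of z under the standard normal."""
    return 1.0 - NormalDist().cdf(z)
```

For a triangle with edge weights 1, 2, and 3, for example, both the weighted density and the average edge weight are 2.0, and the unweighted completeness is 1.0 because every pair of nodes is connected.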
The second part of the file also contains one line per active network. The first three columns are identical to the first part of the file (i.e., "ActiveNetworkName", "#Nodes", and "#Edges"). The other two columns are:
Column Header | Description |
---|---|
#Components | The number of connected components in the active network. |
Component sizes (#nodes, #edges) | For each component, the number of nodes and edges in the component, separated by a comma and placed within parentheses. |
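The component columns can be reproduced with a breadth-first search over the edge list. This is a minimal sketch under stated assumptions: the function name is hypothetical, each edge appears once in the input, and components are reported largest first, an ordering the documentation does not specify.

```python
from collections import defaultdict, deque


def component_sizes(edges):
    """Return [(n_nodes, n_edges), ...] per connected component, largest first.

    edges is a list of (node_a, node_b) pairs, each edge listed once.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        seen.add(start)
        members = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    members.add(neighbour)
                    queue.append(neighbour)
        # Both endpoints of an edge lie in the same component, so testing
        # one endpoint is enough.
        n_edges = sum(1 for a, _ in edges if a in members)
        sizes.append((len(members), n_edges))
    return sorted(sizes, reverse=True)
```

Calling `len(component_sizes(edges))` gives the value for "#Components", and each tuple corresponds to one "(#nodes, #edges)" entry.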
The edges in each network. Each line of the file contains information on one network and one edge in that network. The columns of the file are the name of the network, the identifiers of the two nodes incident on the edge, the type of the edge (e.g., PPI), and the weight of the edge.
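Because the edges file is tab-delimited with a fixed column order, it can be grouped by network with a few lines of Python. The sketch below assumes the five columns described above, in that order. Whether the file starts with a header line is not documented, so rows whose weight does not parse as a number are skipped. The function name is hypothetical.

```python
import csv
from collections import defaultdict


def read_edges(handle):
    """Group rows of an edges file by network name.

    Returns {network: [(node_a, node_b, edge_type, weight), ...]}.
    """
    networks = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # blank or malformed line
        name, node_a, node_b, edge_type, weight = row[:5]
        try:
            weight = float(weight)
        except ValueError:
            continue  # probably a header line
        networks[name].append((node_a, node_b, edge_type, weight))
    return dict(networks)
```

A typical call opens the file with `open(path, newline="")` and passes the handle to `read_edges`. The per-network edge lists can then be loaded into a graph library or a database.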
The nodes in each network. Each line of the file contains information on one network and one node in that network. The columns of the file are the name of the network and the identifier of the node.