Biorithm
1.1

This program implements several algorithms for performing functional enrichment. For this type of analysis, the user typically possesses a set of "interesting" genes, e.g., a collection of genes that are significantly differentially expressed between treatment and control samples. Now, given a collection of curated gene sets (e.g., annotations from the Gene Ontology), we can evaluate the functional coherence of the "interesting" genes by analyzing their relationship or overlap to each of the curated gene sets (or collections of curated gene sets) and by measuring the statistical significance of the overlap. Ultimately, each of these functional enrichment algorithms returns a list of the curated gene sets that relate closely to the interesting set of genes. Such analysis provides a highlevel summary of the interesting genes.
Download the Biorithm package and follow the installation instructions for Biorithm. The b>enrich
executable will be available in the enrich
subdirectory of Biorithm.
enrich includes several published functional enrichment algorithms: Fisher's exact test, GenGO [6], Parametric Analysis of Gene Set Enrichment (PAGE) [4], FuncAssociate [1], and two variations of NetworkBased Enrichment [14], which we developed. Below, we describe the general command for invoking the available enrichment algorithms through enrich. Since Fisher's Exact Test is perhaps the most wellknown enrichment method, we describe this algorithm in detail, with examples of how it can be executed using the enrich tool. Additionally, we elaborate on the two flavors of NetworkBased Enrichment (NBE), as these algorithms were developed by our research group. For details regarding the other enrichment algorithms, please visit their respective publications linked above.
enrich accepts four different categories of command line options. See the table below for details about all available command line options. We briefly describe each category of options:
Exactly one of these options is required to specify which enrichment algorithm to run. enrich will print an error to standard output if the user provides an incorrect number of algorithms.
The user must use exactly one of these options to specify the universal collection of genes (e.g., all genes tested in a treatmentcontrol microarray contrast). The universal collection of genes can be provided as simply a set of genes (S
), a network (N
), or a ranked list of genes (R
). Note that some algorithms require providing the universe through a specific option. For example, networkbased enrichment clearly requires a network; if the universe is not provided using the required option, enrich will issue an error message to the standard output. Many algorithms only require the set of universal genes (e.g., GenGO). In this case, if the universe is provided as a network or ranked list, enrich internally converts these to the appropriate set of genes defined by the nodes in the network or the ranked list, respectively.
Exactly one of these options is required to specify the collection of "interesting" genes on which the user wants to perform functional enrichment (e.g., the most differentially expressed genes from a treatmentcontrol contrast). This collection of genes should be a subset of the "Universe" set of genes. The "interesting" cluster of genes can be provided as simply a set of genes (s
), a network (n
), or a ranked list of genes (r
). Note that some algorithms require providing the cluster in a specific format. For example, networkbased enrichment clearly requires a network; if the cluster is not provided using the required option, enrich will issue an error message to the standard output. Many algorithms only require the set of genes in the cluster (e.g., GenGO). In this case, if the cluster is provided as a network or ranked list, enrich internally converts these to the appropriate set of genes defined by the nodes in the network or the ranked list, respectively. There is a fourth option (d
) used for algorithms that do not require any cluster. Currently, PAGE is the only algorithm to which this option applies.
These options control various parametric aspects of running the enrichment algorithms. Of these, the annotationsfile
(a
) is required to specify the curated gene sets (functional annotations). Many of the other options pertain only to certain enrichment algorithms, and these have reasonable default values. Please refer to the table below for detailed descriptions of these options.
In summary, a typical invocation of any enrichment algorithm will look like the following command (here, a set of flags inside square brackets indicates that the user should provide only one of the flags on the command line):
enrich <algorithm> [NSR] <universe> [nsrd] <cluster> a <annotations> o <output>
The detailed list of options supported by enrich
are given in the following table.
Flag  Long  Description  

ENRICHMENT ALGORITHMS    funcassociate  Run the FuncAssociate algorithm. 
  gengo  Run the GenGO algorithm. The user should provide the 'gengop' and 'gengoq' options along with this option since we have not implemented the extension to GenGO that can learn these parameters. By default, p and q take the values suggested in the GenGO publication.  
  hypergeometric  Perform the hypergeometric enrichment, i.e., Fisher's exact, test.  
  page  Run the Parametric Analysis of Gene Set Enrichment (PAGE) algorithm Note that no cluster is provided with this algorithm, so you must use the clusterdummy option.  
  randomfunction  Perform networkbased enrichment, using the random function algorithm.
 
  randomuniverse  Perform networkbased enrichment, using the random universe algorithm.
 
UNIVERSE  N  universenetwork  The name of the tabdelimited file containing the universal molecular interaction network. Each line represents a single edge in the network and may have 24 columns. The first two columns of each line must contain the identifiers of a pair of nodes connected by an edge. An optional third column can specify a type for the edge. An optional fourth column contains the weight of the edge. 
S  universeset  The name of the file containing the universal set of genes. Each line contains one gene name.  
R  universerankedlist  The name of a file containing a ranked universe of genes. The first line should contain header items; each subsequent line contains two tabdelimited items: identifier, value.  
CLUSTER  n  clusternetwork  The name of the tabdelimited file containing the cluster as an interaction network. Each line represents a single edge in the network and may have 24 columns. The first two columns of each line must contain the identifiers of a pair of nodes connected by an edge. An optional third column can specify a type for the edge. An optional fourth column contains the weight of the edge. 
s  clusterset  The name of the file containing the cluster set of genes. Each line contains one identifier.  
r  clusterrankedlist  The name of a file containing a ranked cluster of genes. The first line should contain header items; each subsequent line contains two tabdelimited items: identifier, value.  
d  clusterdummy  Use this option with no argument for algorithms that do not require a cluster (e.g., PAGE).  
ENRICHMENT OPTIONS  o  output  The name of the file in which to write the results. 
  randseed  An integer used to seed the C++ rand() function.  
  annotationscategory  The category of annotations on which to perform functional enrichment (e.g., 'p', 'f', or 'c' for biological process, molecular function, or cellular component when working with GO annotations). When left blank, all categories will be used.  
a  annotationsfile  Name of file containing a list of genes and their functional annotations or other attributes. The first line of the file contains the column names. Each tabdelimited line contains the gene id in column 'orf', the function id in column 'goid', the function category in column 'hierarchy', the evidence code in column 'evidencecode', and the annotation status 1, 0, or 1, in column 'annotation type'. If these are GO annotations, the program will transfer functional annotations up the GO DAG in accordance with the true path rule.  
  functionsrestrictionfile  Name of a file containing a list of gene functions. Enrichment will be restricted to only functions listed in this file that also appear in the annotations given with the a option.  
  gengop  The probability value p for the GenGO algorithm.  
  gengoq  The probability value q for the GenGO algorithm.  
  gengopenalty  The penalty value for the inclusion of a function in the GenGO algorithm.  
  numedgeswapiterations  For enrichment algorithms that require random graphs using the edgeswap graph randomization technique, this number specifies how many iterations to run the graph randomization algorithm. Note that the given integer k specifies that k*E iterations will be run.  
  numrandomgraphs  For enrichment algorithms that require graph randomization, this number specifies how many random graphs to generate.  
g  goobofile  The name of a file containing the definition of the Gene Ontology in OBO format.  
u  universetype  Describes how the universe of genes should be determined for enrichment methods. 'all' denotes that all genes in the given universe network (or gene set) will define the true universe. 'intersection' denotes that the intersection of annotated genes and genes in the universe network (or gene set) will define the true universe. 
A standard formulation of the problem of functional enrichment is the following. We are given:
We are interested in assessing the functional coherence of based on the annotation of its genes, given by . Fisher's exact test addresses this problem for each function by answering the following question: "If we select a set of genes uniformly at random (without replacement) from the set of all genes , what is the probability that the will contain or more genes?", where . It is wellknown that this probability can be determined by computing the tail of the hypergeometric distribution as follows:
enrich can be used to calculate this score for each function using the following command:
enrich hypergeometric S <universeset> s <clusterset> a <annotations> o hypresults.tsv
The resulting output file will contain one row for each function tested for enrichment, and several columns of information are printed for each row:
Column Header  Description 

algorithm  The enrichment algorithm used; in this case, 'hypergeometric' 
func_id  An identifier for the function provided in the annotations file 
func_category  The category of the function 
universe_size  The number of genes in the universal set of genes 
cluster_size  The number of genes in the interesting collection of genes 
V_f  The number of genes in the universe that are annotated with the function 
C_f  The number of genes in the cluster that are annotated with the function 
hyp_pvalue  The hypergeometric pvalue calculated using the four values from the previous four columns 
We developed two networkbased functional enrichment algorithms that take into account an underlying network of interactions among the genes being analyzed. These two methods can be accessed through the enrich tool using the randomfunction
or randomuniverse
options as the algorithm when invoking enrich. Since the two NBE algorithms operate on networks, they require that the universe and cluster are provided using the N
and n
options, respectively. Details and applications of the Random Function and Random Universe NBE methods to biological datasets are available in the publication. We briefly describe the differences between these algorithms below.
The random function NBE approach for computing the statistical significance of each function strongly parallels the samplingbased alternative to the analytical solution for the onesided version of Fisher's Exact Test. In the samplingbased solution for Fisher's exact test we repeatedly select a set of size uniformly at random from the universe of genes . We then calculate an empirical pvalue for the function , which represents the statistical siginificance of the size of , as the fraction of samples for which the size of the intersection between the randomly selected set and the interesting collection of genes is at least as large as . If we repeat this process many times, the empirical value converges to the analytical solution ( ). For the random function NBE method, rather than analyzing the statistical significance of the size of , we assess the statstical significance of the connectivity among the genes in .
N
option). We repeatedly generate a randomized universal network, , with the same degree sequence as , i.e., the degree of each node is the same in and in each randomized network . The randomization procedure preserves topological properties of molecular interaction networks, such as their scalefree degree distribution. The numedgeswapiterations
controls how long the underlying graph randomization algorithm should run before using the randomized network. We assess the significance of each function based on the fraction of random networks ( ) for which the connectivity among the genes in improved, compared to their original connectivity in the universal network ( ). Both algorithms utilize samplingbased procedures to estimate a pvalue representing the statistical significance of each function. Thus, these algorithms may be quite timeconsuming depending on the number of functions tested and the number of sampling iterations used. The number of sampling iterations can be controlled for either method using the numrandomgraphs
option.
The following sample command demonstrates how to execute the random universe NBE algorithm for 1000 random networks (i.e., minimum pvalue of 0.001):
enrich randomuniverse N <universalnetwork> n <clusternetwork> a <annotations> numrandomgraphs 1000 o nberesults.tsv
The resulting output file will contain one row for each function tested for enrichment, and several columns of information in each row. The same columns as those reported for the Fisher's Exact Test are included in the NBE output files. Additionally, several network statistics about are provided. Perhaps the most relevent of the network statistics columns is the one with the header "lex_components_by_nodes_pvalue". This column contains the pvalues that are discussed in the publication of these NBE methods.