Concise Functional Enrichment of Ranked Gene Lists

Introduction

This program implements the functional enrichment method CRFE, which provides a high-level interpretation of high-throughput gene expression data. For this type of analysis, the user has a ranked list of genes, ranked, for example, by differential expression between treatment and control samples. The genes above a certain threshold are considered differentially expressed, or perturbed; all other genes form the background of unperturbed genes. Given categories that describe the biological functions of different sets of genes (e.g., biological process terms and annotations from the Gene Ontology), we can ask which functional categories best explain the observed gene list. A good collection of functional categories explains as many perturbed genes as possible and as few unperturbed genes as possible. In particular, it should explain a large share of the highly perturbed genes, and it should consist of only a small number of specific categories. CRFE returns a list of functional categories that satisfies all these requirements. Investigating this list is a good first step toward interpreting the results of any high-throughput gene expression experiment.

Download / Installation

A working Python distribution and the program CRFE.py are required to use CRFE. The program has been developed and tested extensively under Python 2.7 and Python 3.4 but should work with Python 2.6+. It requires only standard-library modules: sys, math, random, os, datetime, csv, cPickle (Python 2.x) / pickle (Python 3), and optparse. The correct set-up can be tested by executing the example below and comparing the output to the provided output.

An additional program, ontology_parser.py, can be used to create the annotation file needed as input for CRFE. This program transforms ontology and association files, downloaded directly from the Gene Ontology, into an annotation file in which each line contains one functional category and all the genes (as Gene Symbols) annotated to this category.

Using CRFE

CRFE accepts different types of command line options. See the table below for details about all available command line options.

A typical invocation of CRFE will look like:

 python crfe.py -a <annotations file> -g <gene file> -k <number of categories> -b <belief parameter>

The following table contains a detailed list of options supported by CRFE.

	Flag	Long	Allowed Values	Default	Description
DATA	a	annotations_file	.txt or .csv		The name of the file that contains functional categories and the respective gene annotations, txt (preferred) or csv. Each line contains the name of one functional category, plus all genes that are annotated to this category. See sample annotation file.
	g	gene_file	.txt or .csv		The name of the file that contains the ranked list of genes to be analyzed, txt (preferred) or csv. Each line contains one gene, plus one expression level (tab-separated). The expression level is optional. See sample gene file.
	G	gene_file_use	$\{0,1,2\}$	0	Describes how the gene file should be used, 0: use as is, 1: invert the gene list, 2: order by absolute value of expression level.
	t	threshold	(0,1) if threshold_type = 'proportion'	$0.3$	If threshold_type='proportion', threshold describes what proportion of all genes is considered perturbed. This option should be chosen if the user does not know which exact threshold to use. If threshold_type='value', all genes with an expression level above threshold are considered perturbed. This option should be chosen if the user wants to use a particular threshold, for example, greater than 2-fold change.
	T	threshold_type	'proportion' or 'value'	'proportion'
	c	lower_cutoff	$\{1,2,\ldots,\infty\}$	$20$	Only gene sets that explain at least this many genes in the gene file are considered. Choose 1 if no lower bound is desired.
	C	upper_cutoff	$\{0,1,\ldots,\infty\}$	$200$	Only gene sets that explain at most this many genes in the gene file are considered. Choose 0 if no upper bound is desired.
DATA PARAMETERS	k	nr_categories	$\{0,1,\ldots,\infty\}$	$0$	The number of categories in which the list of perturbed genes is divided. Choose k=0 to put each perturbed gene into its own category.
	b	belief	$[1,\infty)$	$5$	Belief parameter that allows the user to tune the focus of the algorithm on highly perturbed genes. If $b=1$, all perturbed genes are treated the same. The larger the belief parameter, the more the algorithm focuses on explaining highly perturbed genes.
ALGORITHM PARAMETERS	n	repeats	$\{1,2,\ldots,\infty\}$	$1$	Number of times the MCMC process is repeated.
	s	burnin	$\{0,1,\ldots,\infty\}$	$10^5$	Number of MCMC steps that are performed before results are recorded. Allows the Markov chain to settle.
	S	MCMC_steps	$\{1,2,\ldots,\infty\}$	$10^6$	Number of MCMC steps that are recorded after an initial burnin period.
	x	alpha_beta_max	$(0,0.5]$	$0.5$	The maximal value that can be learned for the false positive rate, alpha, and the false negative rate, beta. Warning: This maximal value is also automatically restricted by the algorithm. Therefore, this option only has an effect if the chosen value is smaller than the automatic restriction.
	p	probability_parameter_change	$(0,1)$	$0.2$	The proportion of MCMC steps where the parameters are changed instead of the set of gene sets.
	r	random_initial_set	$[0,1) \cup \{1,2,\ldots,\infty\}$	$0$	Determines the initial set for the MCMC algorithm. If $r=0$, the initial set is empty. If $r\in (0,1)$, each process is in the initial set with probability r. If $r\geq 1$, a sample of r processes is randomly chosen as initial set.
ADDITIONAL PARAMETERS	o	output_folder	local folder	'output/'	All results are written to this local folder. If the folder does not exist, it is created.
	i	identifier	string	'' [empty]	A string that is attached to all files created by the algorithm. If running multiple data sets, use this option to distinguish the created files.
	v	verbose			Use -v or --verbose to print some output during the algorithm, mainly to allow estimation of run time.
	R	seed	$\{-\infty,\ldots,\infty\}$	$-1$	If $R\neq -1$, this sets the seed for the random number generator. If $R=-1$, a random seed is used.
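
The interplay of gene_file_use, threshold and threshold_type can be illustrated with a small Python sketch. This is not the code used inside CRFE; the function names order_genes and split_perturbed are hypothetical, and two details are assumptions: a proportion is rounded down to a whole number of genes, and gene_file_use=2 sorts the largest absolute expression level first.

```python
def order_genes(genes, gene_file_use=0):
    """Reorder a list of (gene, expression_level) tuples given in file order."""
    if gene_file_use == 0:
        return list(genes)                      # use as is
    if gene_file_use == 1:
        return list(reversed(genes))            # invert the gene list
    if gene_file_use == 2:
        # Assumption: largest absolute expression level comes first.
        return sorted(genes, key=lambda g: abs(g[1]), reverse=True)
    raise ValueError("gene_file_use must be 0, 1 or 2")


def split_perturbed(genes, threshold=0.3, threshold_type='proportion'):
    """Split a ranked gene list into (perturbed, unperturbed) genes."""
    if threshold_type == 'proportion':
        # Assumption: the proportion is rounded down to a whole gene count.
        n_perturbed = int(threshold * len(genes))
        return genes[:n_perturbed], genes[n_perturbed:]
    if threshold_type == 'value':
        perturbed = [g for g in genes if g[1] > threshold]
        unperturbed = [g for g in genes if not g[1] > threshold]
        return perturbed, unperturbed
    raise ValueError("threshold_type must be 'proportion' or 'value'")


# Made-up example data, for illustration only.
genes = [("A", 3.0), ("B", 1.5), ("C", -2.5), ("D", 0.1)]
top_half, rest = split_perturbed(genes, 0.5, 'proportion')
above_two, not_above_two = split_perturbed(genes, 2.0, 'value')
```

With threshold_type='value', a gene file without expression levels cannot be split this way, which is why a proportion is the safer choice in that case.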

Example

Sample Annotation File

The annotation file contains one row for each functional category. Each row starts with the name of the functional category, followed by a tab and a list of all genes that are annotated to this functional category (space-separated).
Sample: sample_annotation_file.txt.
Sample: annotation_file_human_biological_process.txt (This annotation file has been used to create the results in the CRFE paper.)
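
The row layout described above can be read with a few lines of Python. This is a minimal sketch for checking one's own annotation files, not the parser used by CRFE; the function name parse_annotation_lines and the example category and gene names are made up for illustration.

```python
def parse_annotation_lines(lines):
    """Map each functional category name to its list of annotated genes.

    Each line is expected to hold the category name, a tab, and then
    the space-separated gene names.
    """
    categories = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue                      # skip empty lines
        name, _, gene_field = line.partition("\t")
        categories[name] = gene_field.split()
    return categories


# Made-up example rows, for illustration only.
example_rows = [
    "cell cycle\tGENE1 GENE2 GENE3\n",
    "\n",
    "apoptosis\tGENE2 GENE4\n",
]
example_categories = parse_annotation_lines(example_rows)
```

To read an actual file, the open file handle can be passed in directly, e.g. parse_annotation_lines(open('sample_annotation_file.txt')).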

The program ontology_parser.py can be used to create this annotation file. This program transforms ontology and association files, downloaded directly from the Gene Ontology, into an annotation file (Gene Symbols), which can then be used by CRFE. The two downloaded files need to be saved as txt files (for example, by changing the extension to txt, or by opening them in Excel and saving them as tab-delimited txt files). A typical invocation of the Ontology Parser looks as follows:

 python ontology_parser.py -o <ontology file> -a <association file> -n <namespace> -i <identifier> -s

The following table contains a detailed list of options supported by the Ontology Parser.

	Flag	Long	Allowed Values	Description
INPUTS	o	ontology_file	.txt	See sample ontology file.
	a	association_file	.txt	See sample association file.
	n	namespace	string	Describes the namespace that should be used: P for biological processes (default), F for molecular function, or C for cellular component.
	c	combine		By default, functional categories that annotate exactly the same genes are combined into one. Use -c or --combine to keep them separate. Warning: Not combining equal functional categories into one can strongly deteriorate the performance of CRFE and any set-based enrichment method.
ADDITIONAL PARAMETERS	i	identifier	string	A string that is attached to all files created by the Ontology Parser. If running multiple data sets, use this option to distinguish the created files.
	v	verbose		Use -v or --verbose to suppress the output of some descriptive text while Ontology Parser runs.
	s	save		Use -s or --save to create three additional output files, which can be directly used by CRFE. The files are saved into the local folder 'saved_data/', which is created if it does not exist.

Sample Gene File

The gene file contains one row for each gene. Each row starts with the name of the gene, followed by a tab and the expression level of the gene. The expression level is optional. If no expression level is given, the given threshold should be a proportion.
Sample: sample_gene_file.txt.
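
A gene file in this layout, with or without expression levels, can be read as sketched below. This is an illustrative sketch rather than the reader used by CRFE; the function name parse_gene_lines and the example rows are hypothetical, and a missing expression level is represented as None.

```python
def parse_gene_lines(lines):
    """Return a list of (gene, expression_level) tuples in file order.

    The expression level is None when a row contains only the gene name.
    """
    genes = []
    for line in lines:
        fields = line.strip().split("\t")
        if not fields[0]:
            continue                      # skip empty lines
        has_level = len(fields) > 1 and fields[1] != ""
        level = float(fields[1]) if has_level else None
        genes.append((fields[0], level))
    return genes


# Made-up example rows, for illustration only.
with_levels = parse_gene_lines(["GENE1\t2.5\n", "GENE2\t-0.7\n"])
without_levels = parse_gene_lines(["GENE1\n", "GENE2\n"])

# Without expression levels only a proportion threshold is meaningful.
needs_proportion = any(level is None for _, level in without_levels)
```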

Testing correct set-up

After saving the sample annotation file and the sample gene file into a local folder, the correct set-up can be tested by running CRFE on this example:

 python crfe.py -asample_annotation_file.txt -gsample_gene_file.txt -c4 -C0 -t0.4 -n5 -s1000 -S10000 -iSAMPLE -R0

In this invocation, we consider only categories that annotate at least four genes (-c4), impose no upper bound on the number of genes a category annotates (-C0), consider 40% of all genes as perturbed (-t0.4), and run the MCMC process 5 times (-n5) with 1000 unrecorded burn-in steps (-s1000) before recording the next 10000 steps (-S10000). The two output files are written to the local folder 'output/' (default) and include 'SAMPLE' in their file names (-iSAMPLE). For reproducibility, we set the seed of the random number generator to 0 (-R0).
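
The test invocation attaches each value directly to its short flag (e.g. -c4 instead of -c 4), a form that Python's optparse module accepts. The sketch below shows how such an argument list is parsed. It covers only the options used in the test invocation, takes their long names and defaults from the table above, and is an assumption about how the options could be declared, not a copy of the declarations in the CRFE program.

```python
from optparse import OptionParser

# Hypothetical re-declaration of the options used in the test invocation.
parser = OptionParser()
parser.add_option("-a", "--annotations_file", type="string")
parser.add_option("-g", "--gene_file", type="string")
parser.add_option("-c", "--lower_cutoff", type="int", default=20)
parser.add_option("-C", "--upper_cutoff", type="int", default=200)
parser.add_option("-t", "--threshold", type="float", default=0.3)
parser.add_option("-n", "--repeats", type="int", default=1)
parser.add_option("-s", "--burnin", type="int", default=10**5)
parser.add_option("-S", "--MCMC_steps", type="int", default=10**6)
parser.add_option("-i", "--identifier", type="string", default="")
parser.add_option("-R", "--seed", type="int", default=-1)

# The arguments of the test invocation, split at whitespace.
argv = ("-asample_annotation_file.txt -gsample_gene_file.txt "
        "-c4 -C0 -t0.4 -n5 -s1000 -S10000 -iSAMPLE -R0").split()
options, leftover = parser.parse_args(argv)

print(options.annotations_file, options.lower_cutoff, options.threshold)
```

Writing the same options with a space between flag and value (e.g. -c 4 -C 0) is parsed identically.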

This invocation produces two output files, which should look like this:

The first file provides information about which functional categories explain the data. The second file provides information about the parameter learning process.

Contact

Please contact Claus Kadelka, Madison Brandon or T. M. Murali with any questions about this program.