Biorithm
1.1
xMotif is a software package for computing biclusters in (real-valued) gene expression data. Optionally, given the class that each sample in the data belongs to, xMotif can construct classifiers to predict the class of a new sample. Each bicluster computed by xMotif consists of a subset of samples and a subset of genes. Each gene in the bicluster is expressed to (approximately) the same extent in all the samples in the bicluster. Different genes in a bicluster may be expressed at different levels. Thus, a bicluster may contain both up-regulated and down-regulated genes. The name xMotif comes from the fact that we think of such biclusters as conserved gene eXpression MOTIFs.
When the samples in the gene expression data belong to different classes, xMotif attempts to compute homogeneous biclusters, i.e., those that contain samples from only one class. For such data, xMotif can construct nearest neighbour classifiers to predict which class a new sample belongs to. Bicluster-based classifiers are highly interpretable, since the genes involved in a bicluster can be read directly from the description of the bicluster.
The xMotif software was designed and implemented by Gregory Grothaus and T. M. Murali, based on earlier work by Murali and Kasif [8]. Grothaus's M.S. thesis introduces and discusses xMotif-based classifiers.
Download the Biorithm package and follow the installation instructions for Biorithm. The xmotif executable will be available in the xmotif subdirectory of Biorithm.
To run xMotif, you will need at least the following input files:
A file containing expression values of genes. The first row of the file contains the column names. Every other row contains a gene identifier followed by the expression values for that gene. The empty string denotes a missing gene expression value. This file should be tab-delimited. This file is required to run xMotif. A sample gene-expression file is attached here.
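The file format described above can be read with a few lines of Python. This is a minimal sketch, not part of xMotif: the function name `read_expression` and the sample data are hypothetical, and it assumes that the first cell of the header row labels the gene-identifier column.

```python
import csv
import io
import math


def read_expression(handle):
    """Parse a tab-delimited expression table.

    Returns (sample_names, genes), where genes maps each gene
    identifier to its list of expression values.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumption: first header cell labels the gene column
    genes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # The empty string denotes a missing value; store it as NaN.
        genes[row[0]] = [float(v) if v != "" else math.nan for v in row[1:]]
    return samples, genes


# Hypothetical three-sample file with one missing value for gene g1.
example = "gene\ts1\ts2\ts3\ng1\t1.5\t\t2.0\ng2\t0.1\t0.2\t0.3\n"
samples, genes = read_expression(io.StringIO(example))
```

Storing missing values as NaN keeps every gene's list aligned with the list of sample names.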
There are four other important choices to make when running xMotif: selecting a discriminant, computing per-gene p-values, deciding which genes to include in the bicluster, and expanding the set of samples included in the bicluster. These choices are controlled by the -d, -P, -G, and -s command-line options, respectively. Therefore, the generic command to execute xMotif is the following:
xmotif -g <gene-expression-file> -c <class-file> -o <output-file> -d <discriminant-method> -P <p-value-method> -G <gene-selection-method> -s <expansion-method>
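When xMotif is driven from a script, the generic command can be assembled as an argument list. The sketch below only builds and prints the command; it does not run xMotif. The helper name `build_xmotif_command` and the file names are hypothetical, and the default option values are taken from the option descriptions in this document.

```python
import shlex


def build_xmotif_command(expression_file, class_file, output_file,
                         discriminant="class", p_value_method="rank",
                         gene_selection="max",
                         sample_expansion="gene-range-class"):
    """Return the xmotif command line as a list of arguments."""
    return [
        "xmotif",
        "-g", expression_file,   # gene expression data
        "-c", class_file,        # class of each sample
        "-o", output_file,       # where to print the xMotifs
        "-d", discriminant,      # how to select the discriminant
        "-P", p_value_method,    # how to compute per-gene p-values
        "-G", gene_selection,    # how to select genes
        "-s", sample_expansion,  # how to expand the set of samples
    ]


# Hypothetical file names, used only for illustration.
command = build_xmotif_command("expression.tsv", "classes.tsv", "xmotifs.txt")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once the xmotif executable is on the search path.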
To understand the role played by these options, it is useful to discuss how the software detects biclusters. First, the software selects a random number of samples (the discriminant). Second, for every gene, the software computes a p-value that estimates how close the expression values of that gene are in the samples in the discriminant. Third, the software uses these p-values to determine which genes to include in the bicluster. Finally, the software adds more samples to the bicluster as long as they do not violate the constraints imposed by the expression ranges across the genes selected to be in the bicluster.
There are three methods for choosing a random discriminant to seed a bicluster.
Choose a discriminant at random from all samples: This method selects a discriminant uniformly at random from all the samples in the data. Since this option allows discriminants to span multiple classes, the resulting xMotifs may not be homogeneous, i.e., class-specific.
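Choose a discriminant at random from a random class: This method selects a random discriminant from all the samples in a single class. This choice enforces the property that samples in computed xMotifs are from a single class. The class selected is itself chosen uniformly at random from all the classes in the data.
Choose a discriminant at random from a random class, weighted by sample usage: This method is like the previous one, but it down-weights samples based on how many xMotifs they have already appeared in.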
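The difference between drawing from all samples and drawing from a single random class can be sketched as follows. This is an illustration under stated assumptions, not xMotif's implementation: the function `choose_discriminant` and its `size` parameter are hypothetical, and the mode names mirror the values of the -d option.

```python
import random


def choose_discriminant(sample_classes, size, mode="all", rng=None):
    """Pick `size` samples to seed a bicluster.

    sample_classes maps each sample name to its class name.
    mode "all" draws from every sample; mode "class" first picks
    one class uniformly at random and draws only from that class.
    """
    rng = rng or random.Random()
    if mode == "all":
        pool = sorted(sample_classes)
    elif mode == "class":
        chosen = rng.choice(sorted(set(sample_classes.values())))
        pool = sorted(s for s, c in sample_classes.items() if c == chosen)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Never ask for more samples than the pool holds.
    return rng.sample(pool, min(size, len(pool)))


# Hypothetical data: six samples in two classes.
classes = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "B"}
mixed = choose_discriminant(classes, 2, mode="all", rng=random.Random(0))
pure = choose_discriminant(classes, 2, mode="class", rng=random.Random(0))
```

Only the "class" mode guarantees that every sample in the discriminant carries the same class label.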
The p-value calculation asks the following question: given the discriminant and a gene g, what is the probability of finding a gene whose expression values within the samples in the discriminant are as close to each other as those of gene g? Lower probabilities are better in the sense that they indicate the gene's expression values are very close to each other and unlikely to be seen at random. The software applies a Bonferroni correction to each of these p-values. The user has two options:
ttest: Perform a t-test to determine if the distribution of the expression values of the given gene in the discriminant is distinct from the distribution in all the other samples.
rank: This approach computes the interval of the gene's expression values in samples in the discriminant. Note that samples not in the discriminant may have expression values that fall within this interval. Suppose the discriminant has k samples and the total number of samples within the interval is a. The method computes the probability that, if we selected k samples uniformly at random from all samples and computed the interval spanned by the corresponding expression values, the interval would contain at most a values. Please see Chapter 3.3 of Grothaus's M.S. thesis for details.
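One way to compute the rank probability is by direct counting. The sketch below is an illustration, not necessarily the formula used in the thesis or in the software. It assumes that all n expression values of the gene are distinct and that k is at least 2. The function names are hypothetical.

```python
from math import comb


def rank_p_value(n, k, a):
    """Probability that k samples drawn uniformly at random from n
    span an interval containing at most a of the n values.

    Assumes distinct values, so an interval covers w consecutive
    ranks; there are (n - w + 1) positions for such an interval and
    comb(w - 2, k - 2) ways to place the k - 2 interior samples.
    """
    if not 2 <= k <= a <= n:
        raise ValueError("need 2 <= k <= a <= n")
    favourable = sum((n - w + 1) * comb(w - 2, k - 2)
                     for w in range(k, a + 1))
    return favourable / comb(n, k)


def bonferroni(p, num_tests):
    """Bonferroni correction: scale p by the number of tests, cap at 1."""
    return min(1.0, p * num_tests)


p_tight = rank_p_value(n=5, k=3, a=3)   # the 3 samples are adjacent in rank
p_loose = rank_p_value(n=5, k=3, a=5)   # the interval may cover everything
```

A tight interval (small a) yields a small probability, which matches the intuition that closely grouped expression values are unlikely to arise by chance.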
This part of the xMotif algorithm decides which genes should be included in the bicluster. There are two options.
max: Include all genes whose p-values are less than the threshold specified by the p-value-threshold argument.
Stouffers or stouffers: Include precisely the set of genes that maximises the Stouffer's Z score obtained by combining their p-values.
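The two gene-selection rules can be sketched in Python. This is a minimal illustration under stated assumptions, not xMotif's code. It uses the standard Stouffer combination Z = (z_1 + ... + z_m) / sqrt(m), where z_i is the standard normal quantile of 1 - p_i, and it finds the maximising set by scanning prefixes of the genes sorted by z. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def select_max(p_values, threshold):
    """'max' rule: keep every gene whose p-value is below the threshold."""
    return [g for g, p in p_values.items() if p < threshold]


def select_stouffer(p_values):
    """Return (genes, Z) for the gene set with the largest Stouffer Z.

    For a fixed set size m, the best set consists of the m genes
    with the largest z-scores, so scanning sorted prefixes suffices.
    """
    norm = NormalDist()
    eps = 1e-12  # keep p strictly inside (0, 1) for inv_cdf
    z = {g: norm.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps))
         for g, p in p_values.items()}
    ranked = sorted(z, key=z.get, reverse=True)
    best_genes, best_z, running = [], float("-inf"), 0.0
    for m, gene in enumerate(ranked, start=1):
        running += z[gene]
        score = running / sqrt(m)
        if score > best_z:
            best_z, best_genes = score, ranked[:m]
    return best_genes, best_z


# Hypothetical p-values for three genes.
p_values = {"g1": 0.001, "g2": 0.002, "g3": 0.9}
by_threshold = select_max(p_values, threshold=0.05)
by_stouffer, z_score = select_stouffer(p_values)
```

In this example both rules keep g1 and g2: adding g3, whose z-score is negative, would lower the combined Z.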
The purpose of this part of the xMotif algorithm is to add samples outside the discriminant that may legitimately belong to the bicluster. The user has six choices.
Expand samples within the class by matching gene ranges: This is a simple algorithm that adds a sample to the xMotif if it fits within the range of gene expression values for each gene in the xMotif.
Expand samples by matching gene ranges: This approach is identical to the previous one, except that it does not require that experiments be of the same class.
Expand/reduce samples within class by gradient descent: This approach is a more complex and slower one that expands samples by recomputing the xMotif for each one-sample change from the current discriminant. A one-sample change is the removal or addition of one sample. After each modification, the algorithm takes the discriminant with the lowest p-value. Eventually this sequence of steps will converge to a local minimum.
Expand/reduce samples by gradient descent: This algorithm is the same as the previous one, except that it does not require that samples be of the same class.
Expand samples within class by gradient descent: This algorithm is the same as "Expand/reduce samples within class by gradient descent", with the addition that samples in the original discriminant cannot be removed by the algorithm.
Expand samples by gradient descent: This algorithm is the same as the previous one, except that it does not require that samples be of the same class.
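The range-matching expansions described at the start of this list can be sketched as follows. This is an illustration, not xMotif's implementation: the function `expand_by_gene_range` and the data layout are hypothetical, and the sketch assumes there are no missing expression values. The gradient descent variants are not shown because they depend on a scoring function that this document does not spell out.

```python
def expand_by_gene_range(expr, genes, discriminant, classes=None):
    """Add samples whose values fall inside the discriminant's ranges.

    expr maps gene -> {sample: value}. If classes (sample -> class)
    is given, only samples in the discriminant's class are added.
    """
    # Range of each selected gene over the discriminant samples.
    ranges = {
        g: (min(expr[g][s] for s in discriminant),
            max(expr[g][s] for s in discriminant))
        for g in genes
    }
    target = classes[discriminant[0]] if classes is not None else None
    members = list(discriminant)
    for sample in sorted(next(iter(expr.values()))):
        if sample in discriminant:
            continue
        if classes is not None and classes[sample] != target:
            continue
        if all(lo <= expr[g][sample] <= hi for g, (lo, hi) in ranges.items()):
            members.append(sample)
    return members


# Hypothetical data: s3 fits both gene ranges, s4 violates the range of g1.
expr = {
    "g1": {"s1": 1.0, "s2": 2.0, "s3": 1.5, "s4": 5.0},
    "g2": {"s1": 0.1, "s2": 0.3, "s3": 0.2, "s4": 0.2},
}
classes = {"s1": "A", "s2": "A", "s3": "B", "s4": "A"}
any_class = expand_by_gene_range(expr, ["g1", "g2"], ["s1", "s2"])
same_class = expand_by_gene_range(expr, ["g1", "g2"], ["s1", "s2"], classes)
```

Without the class restriction, s3 joins the bicluster. With it, s3 is skipped because it belongs to a different class than the discriminant.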
xMotif has several command-line options, which are listed below.
Flag | Long | Description |
---|---|---|
a | annotations-file | Path of a file containing a list of genes and their functional annotations or other attributes. |
B | obo-file | The path of a file containing the definition of the Gene ontology in OBO format. The standard location for this file is http://www.geneontology.org/ontology/gene_ontology.obo |
D | enrichment-directory | The name of a directory to print web pages detailing the functional enrichment information in xMotifs. |
c | class-file | The name of the file containing class names for samples in gene expression data. The file is tab-delimited. Each line contains the name of a sample and the name of the class the sample belongs to. |
C | build-classifier | Construct the classifier. Use this option when you want to construct the classifier without performing any self- or cross-validation. |
d | discriminant-selection | Specify how to select discriminants from the set of samples. There are three possibilities: (i) all: choose a discriminant at random from all samples. (ii) class: choose a discriminant at random from a random class of samples. (iii) class-weighted: like 'class' but down-weights samples based on how many xMotifs they have already appeared in. See details in the Selecting Discriminants section. |
A | gene-alias-file | Path of a file containing aliases for the probes/genes in the gene expression file. Each line contains one probe/gene name and one alias. This file is useful if the annotations file does not use the same nomenclature for genes as the gene expression file. |
I | gene-info-file | Path of a file containing information on each row/probe/gene in the gene expression file. Each line contains one probe/gene name, a name (e.g., the gene symbol) to use for display purposes, and a description (e.g., the annotation) for the gene. This file is useful for converting probe ids to more common gene symbols in the HTML pages generated by xMotif. |
g | gene-expression-file | Path of a file containing the gene expression data. This file is a tab-delimited text file. The first row of the file contains the column names. Every other row contains a gene identifier followed by the expression values for that gene. The empty string denotes a missing gene expression value. |
G | gene-selection | Specify how to include genes in an xMotif. There are two possibilities: (i) max: include all genes whose p-values are less than the p-value-threshold argument. (ii) Stouffer: include the set of genes that maximises the Stouffer's Z score obtained by combining their p-values. |
u | gene-url | A generic URL for a row/probe/gene, e.g. 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Gene&term='. Appending the gene symbol should result in a correct URL for the node. If you use the --gene-info-file, xMotif appends the gene symbol from this file. Otherwise, it appends the row/probe id. There is no support currently for different URLs for different subsets of nodes. It is best to quote the URL when passing it on the command line. |
l | loocv-file | Perform leave-one-out cross validation and print results to this file. |
n | number-iterations | The number of iterations to perform during classifier training. Each iteration will compute at most one xMotif. |
o | output-file | Path of a file to print xMotifs to. |
p | p-value-threshold | The maximum p-value at which to consider a gene to be co-expressed in a set of samples. |
s | sample-expansion | Specify how to expand an xMotif to include other samples. Your options are (i) gene-range: adds a sample to the xMotif if it fits within the range of gene expression values for each gene selected to match the discriminant. (ii) gene-range-class: same as 'gene-range' but only adds samples within the same class. (iii) search: perform a gradient descent search by adding/deleting samples till the score of the xMotif reaches a local optimum. (iv) search-class: like 'search' except it only adds or deletes samples in the current class. (v) search-only-add: like 'search' but it only adds samples to the current xMotif. (vi) search-only-add-class: like 'search-only-add' but it only adds samples in the current class. (vii) none: do no expansion. See details in the Expanding Samples section. |
S | selfv-file | Perform self validation (for each training sample, check if the classifier correctly predicts the sample's class) and print results to this file. |
x | xmotif-file | Path of a file containing xMotifs computed in an earlier run. If you provide this file, xmotif will not compute new xMotifs from the gene expression data; instead, it will use these xMotifs to build classifiers, perform cross validation, compute functional enrichments, etc. You will still have to provide the dataset using the --gene-expression-file option and the class file using the --class-file option to construct the classifier. |
 | random-seed | A seed for the random number generator. Use this option to ensure repetition of results in multiple runs. |
P | p-value-method | Method of calculating the p-value of a gene. See details in the Computing the p-value for a Gene section. |
After running the xMotif software, the user obtains an output file containing all the motifs found. The location of the file is determined by the value of the -o (--output-file) option. Each line of this file contains information about one xMotif. Each line is tab-delimited and has the following strings: a unique identifier of the xMotif, a string that can be ignored (usually "black"), the list of genes in the xMotif, and the list of samples in the xMotif. The identifier of the xMotif takes the format itemset<motif number>_<number of columns in this motif>_<number of rows in this motif>.
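A line of the output file can be split into its four fields and the identifier decoded with a short Python function. This is a minimal sketch with a hypothetical function name and a made-up example line. The gene and sample lists are returned as raw strings because the separator used inside those lists is not described here.

```python
import re

# Identifier format: itemset<motif number>_<columns>_<rows>
ID_PATTERN = re.compile(r"^itemset(\d+)_(\d+)_(\d+)$")


def parse_xmotif_line(line):
    """Split one tab-delimited output line and decode its identifier."""
    ident, _ignored, genes_field, samples_field = line.rstrip("\n").split("\t")
    match = ID_PATTERN.match(ident)
    if match is None:
        raise ValueError(f"unexpected identifier: {ident!r}")
    number, n_columns, n_rows = (int(x) for x in match.groups())
    return {
        "motif_number": number,
        "columns": n_columns,
        "rows": n_rows,
        "genes": genes_field,      # raw text; list separator not documented
        "samples": samples_field,  # raw text; list separator not documented
    }


# Made-up line, for illustration only.
record = parse_xmotif_line("itemset3_5_12\tblack\tGENES\tSAMPLES\n")
```

The second field is read and discarded, in line with the note that it can be ignored.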