Biorithm: Bicluster Miner

Introduction

This program mines biclusters in a given binary matrix. The algorithm is similar to the well-known Apriori algorithm, except that our implementation incrementally adds rows (rather than columns) and, whenever it finds a new bicluster, immediately adds all possible rows that make the bicluster closed. The software outputs both the rows and the columns of the biclusters it finds. If you are interested in computing biclusters in real-valued matrices, you may find the xMotif package useful.
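To make the notion of a closed bicluster concrete, here is a minimal Python sketch. It is not the Biorithm implementation, and the function names are hypothetical. It assumes that a bicluster is a set of rows together with the columns that are 1 in every one of those rows, and that closing it means adding every other row that is also 1 in all of those columns.

```python
def common_columns(matrix, rows):
    """Return the columns that are 1 in every one of the given rows."""
    n_cols = len(matrix[0])
    return {c for c in range(n_cols) if all(matrix[r][c] == 1 for r in rows)}


def close_rows(matrix, cols):
    """Return every row that is 1 in all of the given columns."""
    return {r for r, row in enumerate(matrix) if all(row[c] == 1 for c in cols)}


def close_bicluster(matrix, rows):
    """Illustrative closure step (assumed definition, not Biorithm's code)."""
    cols = common_columns(matrix, rows)
    return close_rows(matrix, cols), cols


matrix = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
rows, cols = close_bicluster(matrix, {0, 1})
print(sorted(rows), sorted(cols))  # prints: [0, 1, 2] [0, 1]
```

Starting from rows 0 and 1, the shared columns are 0 and 1. Row 2 also has 1s in both of those columns, so the closure step adds it.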

Installing the Bicluster Miner

Download the Biorithm package and follow the installation instructions for Biorithm. The executable file will be available as apriori/apriori.

Invoking the Bicluster Miner

To run the bicluster miner, you will need a single input file containing a binary matrix. The first line of the file should be a header naming the columns. Each line of the file must be tab-delimited, and the matrix entries must be only 0s or 1s. Please see the description of the input-file option in the table below for the precise format of the input file. You can generate a sample binary matrix file using the genmatrix script in the biorithm-1.0/truth-tables/synth directory. For example, to generate a 100*10 binary matrix of sparsity 0.5 with this script and store it in a file called synth.mat, run the command

 genmatrix 10 100 0.5 > synth.mat
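If the genmatrix script is not at hand, the following Python sketch produces text in the layout described in the options table below. It is not a replacement for genmatrix. The function name, the column and row labels, and the density parameter are hypothetical, and the sketch assumes that the header lists only the column names and that each later line starts with a row identifier.

```python
import random


def matrix_text(n_rows, n_cols, density, seed=0):
    """Build a tab-delimited binary matrix as a string.

    Assumed layout: a header line of column names, then one line per row
    holding a row identifier followed by 0/1 entries.
    `density` is the probability that an entry is 1 (an assumption; the
    sparsity argument of genmatrix may be defined differently).
    """
    rng = random.Random(seed)
    lines = ["\t".join(f"col{j}" for j in range(n_cols))]
    for i in range(n_rows):
        bits = ["1" if rng.random() < density else "0" for _ in range(n_cols)]
        lines.append("\t".join([f"row{i}"] + bits))
    return "\n".join(lines) + "\n"


# Show a small 3-row, 4-column example on standard output.
print(matrix_text(3, 4, 0.5), end="")
```

Redirecting the printed text to a file gives an input you can compare against the format that apriori expects before running it on real data.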

If you have a binary matrix in a file called data.mat, you can mine biclusters with at least two rows and at least three columns by running the following command in a shell (recall that the executable is in the directory biorithm-1.0/apriori):

 apriori -R 2 -C 3 data.mat -o biclusters.dat

The output file biclusters.dat will contain all the mined biclusters. Please see the table below for the precise format of the output file.

Command Line Options

The following options are available for apriori:

Flag	Long	Description
C	min-number-columns	The minimum number of columns that should be in an itemset.
i	input-file	Name of file containing the binary data. This file is a tab-delimited text file. The first row of the file contains the column names. Every other row contains a row identifier followed by the binary vector for that row. If you specify this option multiple times, the order of columns must match across the different files, and apriori will only compute closed biclusters whose rows straddle multiple datasets (files). If you want all biclusters rather than only those that straddle datasets, simply concatenate all the files into one file and pass this merged file as the argument to this option.
R	min-number-rows	The minimum number of rows that should be in an itemset.
o	output-file	Name of file to print itemsets to. The format is one itemset per line containing the name of the itemset, the list of rows, and the list of columns, all separated by tabs. The format of the itemset name is itemset_<index>_<number of rows>_<number of columns>
-	reduce-memory-usage	Reduce the RAM used by the Apriori algorithm, at the cost of taking more time. This option may become the default in a future version. This option is lightly tested!
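
The output-file format described above can be read back with a short Python sketch. The function name is hypothetical, and the sketch rests on one assumption that the description does not spell out: every row name and every column name is its own tab-separated field, so the row and column counts embedded in the itemset name tell you where the row list ends.

```python
def parse_itemset_line(line):
    """Split one output line into its name, rows, and columns.

    Assumes the layout: itemset name, then each row name, then each
    column name, all as separate tab-delimited fields.
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    # The name has the form itemset_<index>_<number of rows>_<number of columns>.
    _, index, n_rows, n_cols = name.rsplit("_", 3)
    n_rows, n_cols = int(n_rows), int(n_cols)
    items = fields[1:]
    if len(items) != n_rows + n_cols:
        raise ValueError(f"{name}: expected {n_rows + n_cols} fields, got {len(items)}")
    return {
        "name": name,
        "index": int(index),
        "rows": items[:n_rows],
        "columns": items[n_rows:],
    }


example = "itemset_0_2_3\tr1\tr2\tc1\tc2\tc3\n"
print(parse_itemset_line(example))
```

If the lists turn out to use a different separator inside a single field, the ValueError will flag the mismatch rather than returning a wrong split.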