One of the most effective ways to learn what MCMR can do is to recreate the figures from the paper. Unfortunately, MCMR can be quite a chore to set up because it is put together from many other software projects. The purpose of this page is to ease the process. In particular, it should be possible to recreate any subset of the analysis pipeline because intermediate files from each stage are provided, allowing you to skip ahead to the end of that stage. Alternatively, you can try and do everything from the start, and use the intermediate files as checks.
In addition to the various motif finders, each of which has different availability and licensing, there are the dependencies of the motif finders and the dependencies of MCMR itself. I have put together a separate guide to deal with these issues, accessible as part of the general documentation for the MCMR package.
In general, assembling a sequence dataset involves reading each paper, deciding what to consider regulatory sequence, working out exactly which sequence was used at each step of the paper, and obtaining that sequence from the genome project. To skip this step, just download the "seqtar" files from the Data section of the Explore Datasets page. The descriptions of the datasets in the following sections recapitulate the descriptions from the paper.
100 bp windows centered upon the known motif instances described by Wenick and Hobert 2004 were taken as regulatory regions for the AIY dataset analysis. A complication arises on chromosome III, where two motifs occur within 100 bp of each other (from 7833700..7833715 and 7833742..7833757 in WS176 release). This was resolved by taking a 100 bp window around the point halfway between these instances.
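The window arithmetic above can be sketched in a few lines of Python. This is only an illustration, not code from the MCMR package: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the rounding of the half-base midpoint is an assumption.

```python
def centered_window(start, end, width=100):
    """Return a (start, end) window of `width` bp centered on start..end.

    Coordinates are 1-based and inclusive. When the true center falls
    between two bases, it is rounded down (an assumed convention).
    """
    center = (start + end) // 2
    win_start = center - width // 2 + 1
    win_end = win_start + width - 1
    return win_start, win_end


def merged_window(first, second, width=100):
    """Window around the point halfway between two nearby motif instances."""
    span_start = min(first[0], second[0])
    span_end = max(first[1], second[1])
    return centered_window(span_start, span_end, width)


# The two chromosome III instances quoted above (WS176 coordinates).
window = merged_window((7833700, 7833715), (7833742, 7833757))
print(window)
```

Under these assumptions the single window still contains both motif instances, which is the point of merging them.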
Zhang et al. 2002 employed a microarray profiling strategy. We took the upstream 2 kb of non-repeat, non-exon sequence from the top 20 genes and their orthologs as candidate regulatory sequence.
Table 1 of GuhaThakurta et al. 2002 lists 28 genes, of which two (F30F8.4 and M01B12.1) are not annotated in the WS160 release of the C. elegans genome, and two (F44E5.4 and F44E5.5) share an upstream region. The remaining 25 non-redundant upstream 2 kb regions, truncated at the upstream gene, were used as inputs for further analysis.
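As a rough illustration of what "upstream 2 kb, truncated at the upstream gene" means in coordinates, here is a small Python sketch. It is not taken from MCMR; the function name, the `neighbor_boundary` parameter, and the example coordinates are all hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def upstream_region(gene_start, gene_end, strand, neighbor_boundary=None, length=2000):
    """Return the (start, end) of the upstream region, or None if it is empty.

    neighbor_boundary is the nearest coordinate occupied by the neighboring
    upstream gene: its end for a '+' strand gene, its start for a '-' strand
    gene. The region is cut so that it does not overlap that gene.
    """
    if strand == "+":
        end = gene_start - 1
        start = max(1, end - length + 1)
        if neighbor_boundary is not None:
            start = max(start, neighbor_boundary + 1)
    elif strand == "-":
        start = gene_end + 1
        end = start + length - 1
        if neighbor_boundary is not None:
            end = min(end, neighbor_boundary - 1)
    else:
        raise ValueError("strand must be '+' or '-'")
    return (start, end) if start <= end else None


# Hypothetical gene at 10001..12000 on the '+' strand, neighbor ending at 9000.
print(upstream_region(10001, 12000, "+", neighbor_boundary=9000))
```

The sketch ignores repeat and exon masking, which some of the datasets described here also apply.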
The complete parameters for every run of every motif finder are contained in the "run protocol" files on the Explore Datasets page. To re-run the motif finders, you must download this file, place it in the same directory as the seqtar, and then run the MCMR package with the run protocol and fasta file as inputs, as follows:
generation/motif_master.pl dataset.protocol
Here, dataset stands for the name of the dataset that you are running on. This process should create a set of .tar.gz files in the output directory (configured in perllib/GenerationConfig.pm). The "runtar" files on the download page are tarballs of these .tar.gz files. Each of these .tar.gz files contains the raw output of a particular motif finder run, in the native format of that motif finder. In preparation for motif reduction, these files must be converted to a standard format, which is the role of the EMF file format. To convert the runtar to an EMF file, you must run
generation/motif_parser.pl -r dataset.runtar dataset.protocol > dataset.emf
The end result of this will be an EMF file, containing all of the matches for all of the motifs that were found by all the motif finders. You can skip this step by downloading the EMF files from the Explore Datasets page.
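If you prefer to drive the two generation commands from a script, a minimal Python sketch might look like the following. The function names and the `mcmr_root` parameter are hypothetical. The sketch assumes that the commands are run from the root of the MCMR package, that the protocol file sits there too, and that dataset.runtar is already in place (downloaded, or assembled from the .tar.gz files of the first step).

```python
import subprocess
from pathlib import Path


def generation_commands(dataset):
    """Build the argument lists for the two generation commands."""
    run_finders = ["generation/motif_master.pl", f"{dataset}.protocol"]
    parse_output = [
        "generation/motif_parser.pl",
        "-r", f"{dataset}.runtar",
        f"{dataset}.protocol",
    ]
    return run_finders, parse_output, f"{dataset}.emf"


def run_generation(dataset, mcmr_root="."):
    """Run the motif finders, then write the parser output to dataset.emf."""
    run_finders, parse_output, emf_name = generation_commands(dataset)
    subprocess.run(run_finders, cwd=mcmr_root, check=True)
    # Equivalent of the shell redirection "> dataset.emf".
    with open(Path(mcmr_root) / emf_name, "w") as emf:
        subprocess.run(parse_output, cwd=mcmr_root, stdout=emf, check=True)


run_finders, parse_output, emf_name = generation_commands("dataset")
print(" ".join(run_finders))
print(" ".join(parse_output), ">", emf_name)
```

Calling `run_generation("dataset")` would then execute both steps and stop at the first failure, since `check=True` raises an error on a non-zero exit status.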
The next step is reducing the motifs to find a smaller set of representative motifs, with associated redundancy information. The results of this will be stored in a single compact sqlite database. This is performed by the following command:
reduction/reduce.py -g dataset.gff -s dataset.seqtar dataset.emf
Images can be generated using scripts in the "export" directory of the MCMR package. I will describe how to create the output for the HSP dataset, as the other datasets are handled with similar commands.
To generate the gbrowse annotation files for the core motifs from the HSP dataset, type
export/gba.py -r ca -f hsp_guhaExons_sqlite.WS176.db
Documentation for creating other gbrowse annotation files can be found by passing the "-h" flag.
The following command will create the input files for Cytoscape:
export/cytoscape.py hsp_guhaExons_sqlite.WS176.db
In particular, it will create the following files:
hsp_guhaExons_sqlite.WS176.sif | Network Topology File |
hsp_guhaExons_sqlite.WS176_allr.eda | Edge file for ALLR scores |
hsp_guhaExons_sqlite.WS176_clique.noa | Node file for Clique Identity |
hsp_guhaExons_sqlite.WS176_consensus.noa | Node file for Consensus Sequence |
hsp_guhaExons_sqlite.WS176_pover.eda | Edge file for Percentage Overlap |
hsp_guhaExons_sqlite.WS176_program.noa | Node file for source program |
These files must be loaded individually into a new cytoscape session, which can then be saved to produce a single, portable Cytoscape session file.
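Before loading the files into Cytoscape, it can be handy to confirm that all six of them were written. The Python sketch below derives the expected names from the database name using the suffixes in the table above. The function names are hypothetical, and the sketch assumes the files are written next to the database file, which may not match your setup.

```python
from pathlib import Path

# Suffixes taken from the file table above.
CYTOSCAPE_SUFFIXES = [
    ".sif",
    "_allr.eda",
    "_clique.noa",
    "_consensus.noa",
    "_pover.eda",
    "_program.noa",
]


def expected_cytoscape_files(db_path):
    """Return the six expected output paths for a given database file."""
    db = Path(db_path)
    prefix = db.stem  # drops only the trailing ".db"
    return [db.with_name(prefix + suffix) for suffix in CYTOSCAPE_SUFFIXES]


def missing_cytoscape_files(db_path):
    """Return the expected output files that do not exist on disk."""
    return [p for p in expected_cytoscape_files(db_path) if not p.exists()]


names = [p.name for p in expected_cytoscape_files("hsp_guhaExons_sqlite.WS176.db")]
print("\n".join(names))
```

An empty list from `missing_cytoscape_files` would mean every expected file is present.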
MCMR includes a script, misc/FindRegionsInGenome.pl, that allows you to map regions to their positions in a new release of the genome. It can be run as follows:
misc/FindRegionsInGenome.pl -g <genome fasta> -r <region fasta>
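To make the idea of re-mapping a region concrete, here is a toy Python sketch that locates a region by exact string match on either strand. It is not how the MCMR script is implemented (its matching strategy is not described here), the function names are hypothetical, and the sequences are made up.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_region(genome, region):
    """Return (start, end, strand) of the first exact match, or None.

    Coordinates are 1-based and inclusive, reported on the forward strand.
    """
    genome = genome.upper()
    region = region.upper()
    pos = genome.find(region)
    if pos != -1:
        return pos + 1, pos + len(region), "+"
    pos = genome.find(reverse_complement(region))
    if pos != -1:
        return pos + 1, pos + len(region), "-"
    return None


toy_genome = "TTTTACGGATTTT"  # made-up sequence for illustration
print(find_region(toy_genome, "ACGGA"))
print(find_region(toy_genome, "TCCGT"))
```

An exact-match search like this fails as soon as the region itself changes between releases, so it is only a starting point.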