Recreating Results from the Paper

Overview

One of the most effective ways to learn what MCMR can do is to recreate the figures from the paper. Unfortunately, MCMR can be quite a chore to set up because it is put together from many other software projects. The purpose of this page is to ease that process. Intermediate files from each stage of the analysis pipeline are provided, so you can recreate any subset of the pipeline by downloading the output of a stage instead of running it. Alternatively, you can try to do everything from the start and use the intermediate files as checks.

Installation Issues

In addition to the various motif finders, each of which has different availability and licensing, there are the dependencies of the motif finders and the dependencies of MCMR itself. I have put together a separate guide to deal with these issues, accessible as part of the general documentation for the MCMR package.

Getting from published papers to regulatory regions

In general this involves reading each paper, deciding what one wants to consider regulatory sequence, working through the haphazard task of determining exactly which sequence was used in the various steps of the paper, and obtaining that sequence from the genome project. To skip this step, just download the "seqtar" files from the Data section of the Explore Datasets page. The descriptions of the datasets in the following sections recapitulate the descriptions from the paper.

AIY

100 bp windows centered upon the known motif instances described by Wenick and Hobert 2004 were taken as regulatory regions for the AIY dataset analysis. A complication arises on chromosome III, where two motifs occur within 100 bp of each other (from 7833700..7833715 and 7833742..7833757 in the WS176 release). This was resolved by taking a single 100 bp window around the point halfway between these instances.
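
The windowing rule can be sketched in Python as follows. This is an illustration only, not the script used for the paper: the function names, the inclusive coordinates, the rounding of midpoints down to whole bases, and the choice to measure closeness between motif centers are all assumptions.

```python
def center(start, end):
    """Integer midpoint of an inclusive interval (rounds down)."""
    return (start + end) // 2


def window(mid, width=100):
    """Inclusive window of `width` bp placed around `mid`."""
    start = mid - width // 2
    return (start, start + width - 1)


def regulatory_windows(instances, width=100):
    """Map motif instances on one chromosome to regulatory windows.

    Assumption: two neighbouring instances whose centers are closer than
    `width` share one window around the point halfway between them.
    Only pairs are merged; longer runs of close instances are not handled.
    """
    instances = sorted(instances)
    windows = []
    i = 0
    while i < len(instances):
        mid = center(*instances[i])
        if i + 1 < len(instances):
            next_mid = center(*instances[i + 1])
            if next_mid - mid < width:
                # Close pair: one shared window around the halfway point.
                windows.append(window((mid + next_mid) // 2, width))
                i += 2
                continue
        windows.append(window(mid, width))
        i += 1
    return windows


# The chromosome III pair from the text (WS176 coordinates).
chr3_pair = [(7833700, 7833715), (7833742, 7833757)]
print(regulatory_windows(chr3_pair))
```

Under these assumptions the two chromosome III instances yield one window that covers both motifs, rather than two overlapping windows.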

MEC

Zhang et al. 2002 employed a microarray profiling strategy. We took the upstream 2 kb of non-repeat, non-exon sequence from the top 20 genes and their orthologs as candidate regulatory sequence.

HSP

Table 1 of GuhaThakurta et al. 2002 lists 28 genes, of which two (F30F8.4 and M01B12.1) are not annotated in the WS160 release of the C. elegans genome, and two (F44E5.4 and F44E5.5) share an upstream region. The remaining 25 non-redundant upstream 2 kb regions, truncated at the upstream gene, were used as inputs for further analysis.
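
The idea of an upstream region "truncated at the upstream gene" can be sketched as below. This is a hedged illustration, not the code used to build the dataset: the function name, the inclusive coordinate convention, and the example coordinates in the comments are hypothetical.

```python
def upstream_region(gene_start, gene_end, strand, neighbor=None, length=2000):
    """Return the inclusive (start, end) of a gene's upstream region.

    `neighbor` is the (start, end) of the nearest upstream gene, or None.
    The region is cut short so that it never overlaps the neighbor.
    Returns None if no upstream sequence is left after truncation.
    """
    if strand == "+":
        start, end = gene_start - length, gene_start - 1
        if neighbor is not None:
            start = max(start, neighbor[1] + 1)
    elif strand == "-":
        start, end = gene_end + 1, gene_end + length
        if neighbor is not None:
            end = min(end, neighbor[0] - 1)
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)  # do not run off the start of the chromosome
    if start > end:
        return None
    return (start, end)


# Hypothetical coordinates: a plus-strand gene at 10000..12000 whose
# upstream neighbor ends at 9200 keeps only 9201..9999.
print(upstream_region(10000, 12000, "+", neighbor=(8500, 9200)))
```

Two genes that share an upstream region, as F44E5.4 and F44E5.5 do, would produce the same interval and can be collapsed into one input region.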

Generating motifs from regulatory regions

The complete parameters for every run of every motif finder are contained in the "run protocol" files on the Explore Datasets page. To re-run the motif finders, download the run protocol file for your dataset, place it in the same directory as the seqtar, and then run the MCMR package with the run protocol and fasta file as inputs, as follows:

generation/motif_master.pl dataset.protocol

Here, dataset is the name of the particular dataset you are running on. This process should create a set of .tar.gz files in the output directory (configured in perllib/GenerationConfig.pm). The "runtar" files on the download page are tarballs of these .tar.gz files. Each .tar.gz file contains the raw output of one motif finder run, in the native format of that motif finder. In preparation for motif reduction, these files must be converted to a standard format, which is the role of the EMF file format. To convert the runtar to an EMF file, you must run

generation/motif_parser.pl -r dataset.runtar dataset.protocol > dataset.emf

The end result of this will be an EMF file, containing all of the matches for all of the motifs that were found by all the motif finders. You can skip this step by downloading the EMF files from the Explore Datasets page.
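
If you run the generation stage for several datasets, the two commands above can be wrapped in a small Python driver. The sketch below is a convenience wrapper I am assuming, not part of MCMR: the function names are hypothetical, it must be run from the top of the MCMR package so the relative script paths resolve, and it assumes that a dataset.runtar file is already available in the working directory when the parser is called.

```python
import subprocess


def generation_commands(dataset):
    """Build the two generation-stage commands for one dataset name."""
    protocol = f"{dataset}.protocol"
    run = ["generation/motif_master.pl", protocol]
    parse = ["generation/motif_parser.pl", "-r", f"{dataset}.runtar", protocol]
    return run, parse


def run_generation(dataset):
    """Run the motif finders, then write the parsed matches to <dataset>.emf."""
    run, parse = generation_commands(dataset)
    subprocess.run(run, check=True)
    # motif_parser.pl writes the EMF to standard output, so redirect it.
    with open(f"{dataset}.emf", "w") as emf:
        subprocess.run(parse, stdout=emf, check=True)


# Show the commands for a hypothetical dataset name without running them.
for command in generation_commands("hsp"):
    print(" ".join(command))
```

Calling run_generation("hsp") would execute both steps and stop with an exception if either script exits with a non-zero status.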

Maximal Clique Motif Reduction

The next step is reducing the motifs to find a smaller set of representative motifs, with associated redundancy information. The results of this will be stored in a single compact sqlite database. This is performed by the following command:

reduction/reduce.py -g dataset.gff -s dataset.seqtar dataset.emf

This will produce a file named dataset_sqlite.db that contains all the information about the runs that is needed for export to the various displays.
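
A quick way to confirm that the reduction step wrote a usable database is to list its tables with Python's standard sqlite3 module. The sketch below makes no assumption about the MCMR schema; it only reads SQLite's built-in sqlite_master catalog. The function name is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For example, list_tables("dataset_sqlite.db") should return a non-empty list after a successful reduction. Note that sqlite3.connect creates an empty database file if the path does not exist, so an empty list may simply mean the file name was mistyped.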

Generating Figures from MCMR Sqlite Database

Images can be generated using scripts in the "export" directory of the MCMR package. I will describe how to create the output for the HSP dataset, as the other datasets are handled with similar commands.

Motif Summary Pages

To generate a static web page with motif logos and associated information for the HSP dataset, type

export/ResultPage.py hsp_guhaExons_sqlite.WS176.db

GBrowse Annotation Files

To generate the GBrowse annotation files for the core motifs from the HSP dataset, type

export/gba.py -r ca -f hsp_guhaExons_sqlite.WS176.db

Documentation for creating other GBrowse annotation files can be found by passing the "-h" flag.

Cytoscape Files

The following command will create the input files for Cytoscape:

export/cytoscape.py hsp_guhaExons_sqlite.WS176.db

In particular, it will create the following files:

hsp_guhaExons_sqlite.WS176.sif            Network topology file
hsp_guhaExons_sqlite.WS176_allr.eda       Edge file for ALLR scores
hsp_guhaExons_sqlite.WS176_clique.noa     Node file for clique identity
hsp_guhaExons_sqlite.WS176_consensus.noa  Node file for consensus sequence
hsp_guhaExons_sqlite.WS176_pover.eda      Edge file for percentage overlap
hsp_guhaExons_sqlite.WS176_program.noa    Node file for source program

These files must be loaded individually into a new Cytoscape session, which can then be saved to produce a single, portable Cytoscape session file.
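
Before loading the files into Cytoscape, it can help to check that all six were written. The sketch below derives the expected names from the database name, following the naming pattern of the HSP file list above. That the same pattern holds for other datasets is an assumption, and the function names are hypothetical.

```python
import os

# Suffixes taken from the HSP file list; assumed to apply to any dataset.
CYTOSCAPE_SUFFIXES = [
    ".sif",
    "_allr.eda",
    "_clique.noa",
    "_consensus.noa",
    "_pover.eda",
    "_program.noa",
]


def expected_cytoscape_files(db_path):
    """Return the Cytoscape file names expected for one sqlite database."""
    stem = db_path[: -len(".db")] if db_path.endswith(".db") else db_path
    return [stem + suffix for suffix in CYTOSCAPE_SUFFIXES]


def missing_cytoscape_files(db_path):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_cytoscape_files(db_path) if not os.path.exists(p)]


for name in expected_cytoscape_files("hsp_guhaExons_sqlite.WS176.db"):
    print(name)
```

An empty list from missing_cytoscape_files means every expected file is present and ready to be loaded.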

Other Useful Things

MCMR includes a script, misc/FindRegionsInGenome.pl, that maps regions to their positions in a new release of the genome. It can be run as follows:

misc/FindRegionsInGenome.pl -g <genome fasta> -r <region fasta> 

and will produce a GFF file describing the positions of the regions within the genome that can be used in the reduction step described above, or can be used to update an existing sqlite.db file as follows:

reduction/addRegionGFF.py -g <region_gff> <sqlite file>
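
To sanity-check the region GFF before passing it to either command, you can read it with a few lines of Python. This sketch assumes only the standard GFF layout of tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes); the function name and the example line are hypothetical and do not reflect actual output of the MCMR scripts.

```python
def parse_gff_line(line):
    """Parse one GFF feature line into a dict; return None for comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated GFF columns")
    return {
        "seqid": fields[0],
        "source": fields[1],
        "type": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": fields[8] if len(fields) > 8 else "",
    }


# Hypothetical example line, for illustration only.
example = "III\texample\tregion\t7833678\t7833777\t.\t+\t.\tName=example_region"
feature = parse_gff_line(example)
print(feature["seqid"], feature["start"], feature["end"])
```

Checking that every region appears once and that start is not greater than end catches most mapping problems before they reach the database.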