The latest version of CEGMA contains several bug fixes; please see the download section for more details.
We have created a mailing list which we will use to send news about updates and for any general discussion of CEGMA. If you are interested in subscribing, please follow these instructions (subscriptions are moderated):
To unsubscribe:
Please email us with any other bugs or suggestions that you have.
In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of 458 core proteins that are present in a wide range of taxa. Since these proteins are highly conserved, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of the genome or annotations.
The CEGMA core genes dataset was built using the NCBI euKaryotic clusters of Orthologous Groups (KOGs) database. More specifically, the starting point was the subset of 928 KOGs that are conserved among Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. These were then filtered by various criteria to arrive at the final set of 458 Core Eukaryotic Genes (CEGs); see the paper for details.
All links in the following tables point to the underlying data files. The name of each link is the file size in kilobytes or megabytes.
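As a rough illustration of the starting point, the sketch below selects clusters that have members in all six species. It assumes that "conserved" means at least one member per species, which is a simplification; the function name, the input mapping, and the toy cluster IDs are all hypothetical and not part of CEGMA or the KOGs database.

```python
from typing import Dict, Iterable, List

# The six species named in the text above.
REQUIRED_SPECIES = frozenset({
    "Arabidopsis thaliana",
    "Caenorhabditis elegans",
    "Drosophila melanogaster",
    "Homo sapiens",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
})


def conserved_kogs(kog_to_species: Dict[str, Iterable[str]]) -> List[str]:
    """Return the IDs of clusters with a member in every required species.

    Assumption: presence in all six species is used as a stand-in for
    "conserved"; the real selection may use stricter criteria.
    """
    return sorted(
        kog_id
        for kog_id, species in kog_to_species.items()
        if REQUIRED_SPECIES.issubset(species)
    )


# Toy input with made-up cluster IDs, for illustration only.
toy_clusters = {
    "KOG_A": set(REQUIRED_SPECIES),
    "KOG_B": {"Homo sapiens", "Drosophila melanogaster"},
}
print(conserved_kogs(toy_clusters))
```

The further filtering from 928 KOGs down to 458 CEGs is described in the paper and is not reproduced here.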
| | Proteins (fasta) | Alignment (clustal) | Profiles (hmmer) |
|---|---|---|---|
| Core eukaryotic genes (458) | 1M | 2M | 19M |
The following files correspond to the genomic data of the proteins selected from the KOGs database. These correspond to the annotations (from 2007) of the core genes in these six genomes. A single tarball containing all of this data can also be downloaded.
| Species | Genomic (fasta) | Coordinates (gff) | Transcript (fasta) | Proteins (fasta) |
|---|---|---|---|---|
| H. sapiens | 13M | 173K | 559K | 193K |
| D. melanogaster | 1M | 62K | 563K | 194K |
| C. elegans | 2M | 85K | 557K | 192K |
| A. thaliana | 1M | 196K | 568K | 196K |
| S. cerevisiae | 1M | 25K | 569K | 196K |
| S. pombe | 1M | 47K | 557K | 193K |
Genes were selected so that the encoded proteins were longer than 100 amino acids, contained no in-frame stop codons, and had less than 50% low-complexity sequence (as determined by the seg program distributed with WU-BLAST, run with default parameters). The genes were also required to have introns in the range of 40 bp to 10 Kbp and to use canonical splice sites. Each set contains 500 random genes. A single tarball containing all of this data can also be downloaded.
| Species | Genomic (fasta) | Coordinates (gff) | Transcript (fasta) | Proteins (fasta) |
|---|---|---|---|---|
| H. sapiens | 26M | 128K | 781K | 265K |
| D. melanogaster | 3M | 57K | 796K | 270K |
| C. elegans | 2M | 72K | 547K | 195K |
| A. thaliana | 1M | 78K | 578K | 197K |
| S. cerevisiae | 1M | 15K | 750K | 253K |
| S. pombe | 1M | 27K | 721K | 244K |
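The selection criteria for these random gene sets can be sketched as a simple filter. This is a minimal illustration, not the code used to build the datasets: the `Gene` record and `passes_filters` function are hypothetical, the low-complexity fraction is assumed to have been computed separately (for example from seg output), "canonical splice sites" is assumed to mean GT...AG introns, and 10 Kbp is taken as 10,000 bp.

```python
from dataclasses import dataclass
from typing import List

MIN_PROTEIN_LENGTH = 100      # proteins must be longer than this (amino acids)
MAX_LOW_COMPLEXITY = 0.50     # fraction of low-complexity sequence must be below this
MIN_INTRON_LENGTH = 40        # bp
MAX_INTRON_LENGTH = 10_000    # bp (assumes 10 Kbp = 10,000 bp)


@dataclass
class Gene:
    """Hypothetical record holding what the filter needs."""
    protein: str                    # translated sequence; '*' marks a stop codon
    low_complexity_fraction: float  # assumed precomputed, e.g. from seg output
    introns: List[str]              # intron sequences on the coding strand


def passes_filters(gene: Gene) -> bool:
    """Apply the selection criteria described in the text."""
    # Ignore a single terminal stop codon, if the translation includes one.
    protein = gene.protein[:-1] if gene.protein.endswith("*") else gene.protein
    if len(protein) <= MIN_PROTEIN_LENGTH:
        return False
    if "*" in protein:              # in-frame stop codon
        return False
    if gene.low_complexity_fraction >= MAX_LOW_COMPLEXITY:
        return False
    for intron in gene.introns:
        if not MIN_INTRON_LENGTH <= len(intron) <= MAX_INTRON_LENGTH:
            return False
        seq = intron.upper()
        # Assumption: canonical means a GT donor and an AG acceptor.
        if not (seq.startswith("GT") and seq.endswith("AG")):
            return False
    return True
```

A gene with a 121-residue protein, 10% low-complexity sequence, and one 54 bp GT...AG intron would pass; shortening the intron below 40 bp or introducing an internal `*` would reject it.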
The following data correspond to the genes mapped by CEGMA in the recently sequenced genomes of Anopheles gambiae, Chlamydomonas reinhardtii, Ciona intestinalis and Toxoplasma gondii. The last column corresponds to the genes that are not mapped in the annotations of the current pipelines. A single tarball containing all of this data can also be downloaded.
| Species | Genomic (fasta) | Coordinates (gff) | Proteins (fasta) | Not annotated (fasta) |
|---|---|---|---|---|
| Anopheles gambiae | 15M | 211M | 8.5M | 8.5M |
| Chlamydomonas reinhardtii | 4.0M | 70M | 2.6M | 2.6M |
| Ciona intestinalis | 1.2M | 13M | 772K | 772K |
| Toxoplasma gondii | 10M | 141M | 6.4M | 6.4M |
CEGMA distributions contain several directories and files compressed in a tar.gz file. Source code and documentation files are included in the distribution, as well as several parameter files and other information.
Download the latest version: cegma_v2.4.010312.tar.gz. Note that older versions of CEGMA are still available. Fixes in the new version include:
Please see the full CEGMA README file for installation instructions, but also note that CEGMA requires version 2.2.3-rc7 or 2.4.1 of GeneWise. The former version is not available on the EBI web or ftp site, so Ewan Birney has kindly agreed that we can host a copy on our website:
To install GeneWise, you will also need to have glib installed (on a Mac, this can be installed via MacPorts). On an Ubuntu Linux system, you should be able to install GeneWise by running:
$ sudo apt-get install wise
Also see these two guides to installing CEGMA on Ubuntu Linux:
To test that everything is running correctly, you can try the following command that uses the test files provided by CEGMA:
cegma -g /path/to/CEGMA/sample/sample.dna -p /path/to/CEGMA/sample/sample.prot
where /path/to/CEGMA should be the correct path to your local CEGMA installation. This should run relatively quickly and produce the following screen and file output.
CEGMA should produce 7 output files for each run. Assuming you use 'output' as the output file name prefix:
'Complete' refers to those predicted proteins in the set of 248 CEGs that, when aligned to the HMM for the KOG for that protein family, give an alignment length that is at least 70% of the protein length. For example, if CEGMA produces a 100 amino acid protein, and the alignment length to the HMM to which that protein should belong is 110, then we would say that the protein is 'complete' (91% aligned).
If a protein is not complete but still exceeds a pre-computed minimum alignment score, then we call the protein 'partial'. Note that these pre-computed scores were made before we switched CEGMA from HMMER v2 to HMMER v3, which may have slightly changed which proteins are classed as partial. The pre-computed scores are all in the file CEGMA/data/completeness_cutoff.tbl. Note that the set of 'Partial' matches necessarily includes the set of 'Complete' matches.
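The two categories can be summarised with a small sketch. This is an illustration of the rules as described above, not CEGMA's own code: the function name and arguments are hypothetical, the aligned fraction is assumed to be computed beforehand (as in the 100/110 example), and the 70% threshold is assumed to be inclusive.

```python
COMPLETE_FRACTION = 0.70  # assumed inclusive threshold, per the description above


def classify_ceg(aligned_fraction: float, score: float, min_score: float) -> dict:
    """Classify one predicted protein as complete and/or partial.

    aligned_fraction: fraction of the protein aligned to its KOG HMM,
                      computed elsewhere (e.g. 100 / 110 in the text's example).
    score:            alignment score of the protein against the HMM.
    min_score:        pre-computed cutoff for this KOG, e.g. read from
                      completeness_cutoff.tbl (parsing not shown here).
    """
    complete = aligned_fraction >= COMPLETE_FRACTION
    # 'Partial' includes every 'complete' protein, plus incomplete ones
    # whose score still exceeds the pre-computed minimum.
    partial = complete or score > min_score
    return {"complete": complete, "partial": partial}


# The worked example from the text: roughly 91% aligned, so complete.
print(classify_ceg(aligned_fraction=100 / 110, score=0.0, min_score=0.0))
```

Counting the proteins flagged `complete` and `partial` across the 248 CEGs gives the two figures that a completeness summary would report.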