Due to lack of resources we can no longer offer any support or feedback regarding CEGMA. We suggest that you try a new tool called BUSCO. Read this blog post for more details.
In addition to other options (see news item below), you can now run CEGMA preinstalled as part of an Ubuntu VM. Thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core for doing this. Here are the instructions for using our CEGMA VM.
Thanks to Shaun Jackman, there is a Homebrew package for CEGMA available. This is part of the Homebrew-Science initiative, which makes command-line tools more easily installable on Macs (and on Linux if you use Linuxbrew). If you have Homebrew installed, you can add CEGMA as follows:
brew install homebrew/science/cegma
This release fixes some minor bugs, tidies up the code in several ways, and moves the code to GitHub. See the full release notes on GitHub for more details, and download the code there or from this website. Note that this will be the last 2.x release of CEGMA.
I've started a FAQ to collect common questions about CEGMA.
Users of iPlant can now run CEGMA thanks to Michael Crusoe who has added CEGMA as an iPlant application.
Unfortunately, due to insufficient resources it will not be possible to address any queries or issues regarding version 2 of CEGMA. However, there are ongoing plans to completely redevelop CEGMA to make it more powerful and flexible, and to create a code base which will be easier to maintain in future.
We are happy to continue running the latest version of CEGMA (v2.4) on your behalf if you can make the genome or genome assembly available to download.
In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of 458 core proteins that are present in a wide range of taxa. Since these proteins are highly conserved, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of the genome or annotations.
The CEGMA core genes dataset was built using the NCBI euKaryotic clusters of Orthologous Groups (KOGs) database. More specifically, the starting point was the subset of 928 KOGs that are conserved between Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. These were then filtered by various criteria to arrive at the final set of 458 Core Eukaryotic Genes (CEGs); see the paper for details.
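The first step described above, keeping only KOGs that are represented in all six species, can be illustrated with a small sketch. This is not CEGMA's own code: the function name, the input mapping, and the placeholder KOG identifiers are all hypothetical, and the later filtering steps that reduce 928 KOGs to 458 CEGs are not modelled here.

```python
# Hypothetical illustration only: the identifiers below are placeholders,
# not real KOG ids, and the input format is an assumption.
SIX_SPECIES = frozenset({
    "A. thaliana", "C. elegans", "D. melanogaster",
    "H. sapiens", "S. cerevisiae", "S. pombe",
})


def conserved_kogs(kog_to_species):
    """Return, sorted, the KOG ids that have members in all six species."""
    return sorted(
        kog_id
        for kog_id, species in kog_to_species.items()
        if SIX_SPECIES <= set(species)  # subset test: all six must be present
    )


example = {
    "KOG_A": set(SIX_SPECIES),                # present in all six species
    "KOG_B": {"H. sapiens", "S. cerevisiae"}, # missing four species
}
print(conserved_kogs(example))
```

Only `KOG_A` survives in this toy input, because `KOG_B` lacks members in four of the six species.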
CEGMA distributions contain several directories and files compressed in a tar.gz file. Source code and documentation files are included in the distribution, as well as several parameter files and other information. This final 2.x release of CEGMA is also available on GitHub.
Download the latest version: CEGMA_v2.5.tar.gz. Note that older versions of CEGMA are still available. See the included release notes for full details of changes. Please note that, apart from fixing bugs, improving the display of CEGMA output, and tidying up some of the underlying code, no new functionality is provided by this version.
Please see the full CEGMA README file and the CEGMA FAQ for installation instructions. Note that CEGMA requires version 2.2.3-rc7 or 2.4.1 of GeneWise. The former version is not available on the EBI web or FTP site, so Ewan Birney has kindly agreed that we can host a copy on our website:
To install GeneWise, you will also need to have glib installed (which can be installed on a Mac via MacPorts). On an Ubuntu Linux system, you should be able to install GeneWise by running:
$ sudo apt-get install wise
Also see these two guides to installing CEGMA on Ubuntu Linux:
To test that everything is running correctly, you can try the following command that uses the test files provided by CEGMA:
cegma -g /path/to/CEGMA/sample/sample.dna -p /path/to/CEGMA/sample/sample.prot
where /path/to/CEGMA should be the correct path to your local CEGMA installation. This should run relatively quickly and produce the following screen and file output.
All links in the following tables point to the underlying data files. The name of each link is the file size in kilobytes or megabytes.
Dataset | Proteins (fasta) | Alignment (clustal) | Profiles (hmmer)
---|---|---|---
Core eukaryotic genes (458) | 1M | 2M | 19M
The following files correspond to the genomic data of the proteins selected from the KOGs database. These correspond to the annotations (from 2007) of the core genes in these six genomes. A single tarball containing all of this data can also be downloaded.
Species | Genomic (fasta) | Coordinates (gff) | Transcript (fasta) | Proteins (fasta)
---|---|---|---|---
H. sapiens | 13M | 173K | 559K | 193K
D. melanogaster | 1M | 62K | 563K | 194K
C. elegans | 2M | 85K | 557K | 192K
A. thaliana | 1M | 196K | 568K | 196K
S. cerevisiae | 1M | 25K | 569K | 196K
S. pombe | 1M | 47K | 557K | 193K
Genes were selected so that the encoded proteins were longer than 100 amino acids, contained no in-frame stop codons, and had less than 50% low-complexity sequence (as determined by the seg program distributed with WU-BLAST, run with default parameters). The genes are also required to have introns in the range of 40 bp to 10 kbp and to use canonical splice sites. Each set contains 500 random genes. A single tarball containing all of this data can also be downloaded.
Species | Genomic (fasta) | Coordinates (gff) | Transcript (fasta) | Proteins (fasta)
---|---|---|---|---
H. sapiens | 26M | 128K | 781K | 265K
D. melanogaster | 3M | 57K | 796K | 270K
C. elegans | 2M | 72K | 547K | 195K
A. thaliana | 1M | 78K | 578K | 197K
S. cerevisiae | 1M | 15K | 750K | 253K
S. pombe | 1M | 27K | 721K | 244K
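The selection criteria given for these random gene sets can be written down as a simple filter. The sketch below is only an illustration, not the code used to build the datasets. It makes several assumptions: the `Gene` record and its field names are hypothetical, the low-complexity fraction is taken as already computed (for example from seg output), an in-frame stop is represented as `*` in a protein string that excludes the terminal stop, the intron length bounds are treated as inclusive, and "canonical splice sites" is read as the standard GT...AG intron boundaries.

```python
from dataclasses import dataclass, field
from typing import List

MIN_PROTEIN_LENGTH = 100    # proteins must be longer than this (amino acids)
MAX_LOW_COMPLEXITY = 0.50   # low-complexity fraction must stay below this
MIN_INTRON_LENGTH = 40      # bp (bounds assumed inclusive)
MAX_INTRON_LENGTH = 10_000  # bp, i.e. 10 kbp


@dataclass
class Gene:
    """Hypothetical record holding just the fields the filter needs."""
    protein: str                    # translation, without the terminal stop
    low_complexity_fraction: float  # assumed precomputed, e.g. from seg
    introns: List[str] = field(default_factory=list)  # intron sequences, 5' to 3'


def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    seq = intron.upper()
    return seq.startswith("GT") and seq.endswith("AG")


def passes_selection(gene: Gene) -> bool:
    """Apply the protein, low-complexity, and intron criteria in turn."""
    if len(gene.protein) <= MIN_PROTEIN_LENGTH:
        return False
    if "*" in gene.protein:  # in-frame stop codon
        return False
    if gene.low_complexity_fraction >= MAX_LOW_COMPLEXITY:
        return False
    for intron in gene.introns:
        if not MIN_INTRON_LENGTH <= len(intron) <= MAX_INTRON_LENGTH:
            return False
        if not has_canonical_splice_sites(intron):
            return False
    return True


# Toy gene: 125-residue protein and one 54 bp GT...AG intron.
good = Gene("MKTLV" * 25, 0.10, ["GT" + "A" * 50 + "AG"])
print(passes_selection(good))
```

Whether a gene with no introns at all should pass is not stated in the text; this sketch lets it through, which is a further assumption.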
The following data correspond to the genes mapped by CEGMA in the recently sequenced genomes of Anopheles gambiae, Chlamydomonas reinhardtii, Ciona intestinalis and Toxoplasma gondii. The last column corresponds to the genes that are not mapped in the annotations of the current pipelines. A single tarball containing all of this data can also be downloaded.
Species | Genomic (fasta) | Coordinates (gff) | Proteins (fasta) | Not annot. (fasta)
---|---|---|---|---
Anopheles gambiae | 15M | 211M | 8.5M | 8.5M
Chlamydomonas reinhardtii | 4.0M | 70M | 2.6M | 2.6M
Ciona intestinalis | 1.2M | 13M | 772K | 772K
Toxoplasma gondii | 10M | 141M | 6.4M | 6.4M
Also see the CEGMA FAQ. CEGMA should produce 7 output files for each run. Assuming you use 'output' as the output file name prefix:
'Complete' refers to those predicted proteins in the set of 248 CEGs that, when aligned to the HMM for the KOG of that protein family, give an alignment length that is at least 70% of the protein length. For example, if CEGMA produces a 100 amino acid protein, and the alignment length to the HMM to which that protein should belong is 110, then we would say that the protein is 'complete' (91% aligned).
If a protein is not complete but still exceeds a pre-computed minimum alignment score, then we call the protein 'partial'. Note that these pre-computed scores were calculated before we switched CEGMA from HMMER v2 to HMMER v3. This might have made some slight differences to what is classed as partial. The pre-computed scores are all in the file CEGMA/data/completeness_cutoff.tbl. Note that a protein that is deemed to be 'complete' will also be included in the set of 'partial' matches.
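The complete/partial rules above can be summarised in a short sketch. This is not CEGMA's implementation. The function names and the `min_score` argument are hypothetical (in a real run the cutoff would come from completeness_cutoff.tbl, whose format is not described here), and the aligned fraction is computed as protein length divided by alignment length purely because that reproduces the 100 versus 110 worked example; treat that direction as an assumption.

```python
COMPLETE_FRACTION = 0.70  # the 70% threshold quoted in the text


def aligned_fraction(protein_length: int, alignment_length: int) -> float:
    """Ratio that reproduces the worked example: 100 / 110 is about 0.91.

    The direction of this ratio is inferred from the example in the text
    and is an assumption.
    """
    return protein_length / alignment_length


def classify(protein_length: int, alignment_length: int,
             score: float, min_score: float) -> set:
    """Return the categories a predicted protein falls into.

    A 'complete' protein is also counted as 'partial', as stated above.
    """
    categories = set()
    fraction = aligned_fraction(protein_length, alignment_length)
    is_complete = fraction >= COMPLETE_FRACTION
    if is_complete:
        categories.add("complete")
    if is_complete or score > min_score:
        categories.add("partial")
    return categories


# Worked example from the text; the score values are made up for illustration.
print(round(aligned_fraction(100, 110) * 100))  # percentage aligned
print(sorted(classify(100, 110, score=50.0, min_score=30.0)))
```

Returning a set rather than a single label keeps the "complete is also partial" rule explicit, so counts of partial matches include the complete ones.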
We have created a mailing list which we will use to send news about updates and for any general discussion of CEGMA. If you are interested in subscribing, please follow these instructions (subscriptions are moderated):
To unsubscribe:
Please email us with any other bugs or suggestions that you have.