CEGMA HAS BEEN DISCONTINUED

We suggest you consider using alternative tools

CEGMA

News
Abstract
Download CEGMA
Installation instructions
Core eukaryotic genes dataset
Core eukaryotic genes in new species
Understanding CEGMA output
References
Mailing List

News

May 18th 2015 - CEGMA is no longer being supported

Due to lack of resources we can no longer offer any support or feedback regarding CEGMA. We suggest that you try a new tool called BUSCO. Read this blog post for more details.

July 30th 2014 - New CEGMA VM available

In addition to other options (see news item below), you can now run CEGMA preinstalled as part of an Ubuntu VM. Thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core for doing this. Here are the instructions for using our CEGMA VM.

July 21st 2014 - Homebrew packages for CEGMA available

Thanks to Shaun Jackman, there is a Homebrew package for CEGMA available. This is part of the Homebrew-Science initiative, which makes command-line tools easier to install on Macs (and on Linux if you use Linuxbrew). If you have Homebrew installed, you can add CEGMA as follows:

brew install homebrew/science/cegma

May 19th 2014 - version 2.5 of CEGMA is available

This release fixes some minor bugs, tidies up the code in several ways, and moves the code to GitHub. See the full release notes on GitHub for more details, and download the code there or from this website. Note that this will be the last 2.x release of CEGMA.

April 14th 2014 - New CEGMA FAQ

I've started a FAQ to collect common questions about CEGMA.

November 6th 2013 - Run CEGMA via iPlant

Users of iPlant can now run CEGMA thanks to Michael Crusoe who has added CEGMA as an iPlant application.

August 2013 - CEGMA v2 is being discontinued

Unfortunately, due to insufficient resources it will not be possible to address any queries or issues regarding version 2 of CEGMA. However, there are ongoing plans to completely redevelop CEGMA to make it more powerful and flexible, and to create a code base which will be easier to maintain in future.

We are happy to continue running the latest version of CEGMA (v2.4) on your behalf if you can make the genome or genome assembly available to download.

Abstract

In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of 458 core proteins that are present in a wide range of taxa. Since these proteins are highly conserved, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of the genome or annotations.

The CEGMA core genes dataset was built using the NCBI euKaryotic clusters of Orthologous Groups (KOGs) database. More specifically, the starting point was the subset of 928 KOGs that are conserved between Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. These were then filtered by various criteria to arrive at the final set of 458 Core Eukaryotic Genes (CEGs). See the paper for details.


Download CEGMA

CEGMA distributions contain several directories and files compressed in a tar.gz file. Source code and documentation files are included in the distribution, as well as several parameter files and other information. This final 2.x release of CEGMA is also available on GitHub.

CEGMA v2.5

Download the latest version: CEGMA_v2.5.tar.gz. Note that older versions of CEGMA are still available. See the included release notes for full details of changes. Please note that, apart from fixing bugs, improving the display of CEGMA output, and tidying up some of the underlying code, this version provides no new functionality.


Installing CEGMA

Please see the full CEGMA README file and the CEGMA FAQ for installation instructions. Note that CEGMA requires version 2.2.3-rc7 or 2.4.1 of GeneWise. The former version is not available on the EBI web or FTP site, so Ewan Birney has kindly agreed that we can host a copy on our website:

wise2.2.3-rc7.tar.gz

To install GeneWise, you will also need to have glib installed (which can be installed on a Mac via MacPorts). On an Ubuntu Linux system, you should be able to install GeneWise by running:

$ sudo apt-get install wise

Also see these two guides to installing CEGMA on Ubuntu Linux:

installing CEGMA on Ubuntu 1 (kindly provided by Markus Grohme).
installing CEGMA on Ubuntu 2 (kindly provided by Christoph Hahn).

To test that everything is running correctly, you can try the following command that uses the test files provided by CEGMA:

cegma -g /path/to/CEGMA/sample/sample.dna -p /path/to/CEGMA/sample/sample.prot

where /path/to/CEGMA should be the correct path to your local CEGMA installation. This should run relatively quickly and produce the following screen and file output.


Core eukaryotic genes dataset

All links in the following tables point to the underlying data files. The name of each link is the file size in kilobytes (K) or megabytes (M).

Data based on the original KOG entries:

	Proteins	Alignment	Profiles
	(fasta)	(clustal)	(hmmer)

Core eukaryotic genes (458)	1M	2M	19M

Genomic data

The following files contain the genomic data for the proteins selected from the KOGs database. They correspond to the annotations (from 2007) of the core genes in these six genomes. A single tarball containing all of this data can also be downloaded.

	Genomic	Coordinates	Transcript	Proteins
	(fasta)	(gff)	(fasta)	(fasta)

H. sapiens	13M	173K	559K	193K
D. melanogaster	1M	62K	563K	194K
C. elegans	2M	85K	557K	192K
A. thaliana	1M	196K	568K	196K
S. cerevisiae	1M	25K	569K	196K
S. pombe	1M	47K	557K	193K

Random sets

Genes were selected to make sure that the encoded proteins were longer than 100 amino acids, contained no in-frame stop codons, and had less than 50% low-complexity sequence (as determined by the seg program distributed with WU-BLAST, run with default parameters). The genes are required to have introns in the range of 40 bp to 10 Kbp and to use canonical splice sites. Each set contains 500 random genes. A single tarball containing all of this data can also be downloaded.

	Genomic	Coordinates	Transcript	Proteins
	(fasta)	(gff)	(fasta)	(fasta)

H. sapiens	26M	128K	781K	265K
D. melanogaster	3M	57K	796K	270K
C. elegans	2M	72K	547K	195K
A. thaliana	1M	78K	578K	197K
S. cerevisiae	1M	15K	750K	253K
S. pombe	1M	27K	721K	244K
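
The selection criteria described above for the random sets can be sketched as a simple filter. This is only an illustration, not the code used to build the sets: the Gene and Intron classes are hypothetical, "canonical splice sites" is assumed to mean GT donor and AG acceptor dinucleotides, 10 Kbp is taken as 10,000 bp, the intron limits are applied only to the introns a gene actually has, and the low-complexity fraction is assumed to have been computed beforehand (for example with seg).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Intron:
    length: int    # intron length in bp
    donor: str     # first two bases of the intron
    acceptor: str  # last two bases of the intron


@dataclass
class Gene:
    protein: str                    # translation without the terminal stop; '*' marks a stop codon
    low_complexity_fraction: float  # fraction of the protein flagged as low complexity, 0.0 to 1.0
    introns: List[Intron] = field(default_factory=list)


def passes_filters(gene: Gene) -> bool:
    """Apply the random-set criteria to one gene (hypothetical data model)."""
    if len(gene.protein) <= 100:             # protein must be longer than 100 amino acids
        return False
    if "*" in gene.protein:                  # no in-frame stop codons
        return False
    if gene.low_complexity_fraction >= 0.5:  # less than 50% low-complexity sequence
        return False
    for intron in gene.introns:
        if not 40 <= intron.length <= 10_000:  # intron size between 40 bp and 10 Kbp
            return False
        # Assumption: canonical splice sites are GT (donor) and AG (acceptor).
        if (intron.donor.upper(), intron.acceptor.upper()) != ("GT", "AG"):
            return False
    return True
```

A list of candidate genes could then be reduced with `[g for g in genes if passes_filters(g)]` before drawing 500 at random.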


Core genes in new species

The following data correspond to the genes mapped by CEGMA in the recently sequenced genomes of Anopheles gambiae, Chlamydomonas reinhardtii, Ciona intestinalis and Toxoplasma gondii. The last column corresponds to the genes that are not mapped in the annotations of the current pipelines. A single tarball containing all of this data can also be downloaded.

	Genomic	Coordinates	Proteins	Not annot.
	(fasta)	(gff)	(fasta)	(fasta)

Anopheles gambiae	15M	211M	8.5M	8.5M
Chlamydomonas reinhardtii	4.0M	70M	2.6M	2.6M
Ciona intestinalis	1.2M	13M	772K	772K
Toxoplasma gondii	10M	141M	6.4M	6.4M


Understanding CEGMA output

Also see the CEGMA FAQ. CEGMA should produce 7 output files for each run. Assuming you use 'output' as the output file name prefix:

output.cegma.dna - contains DNA sequence of each CEGMA prediction along with flanking DNA (defaults to ± 2000 bp)
output.cegma.errors - contains any error messages produced by all of the CEGMA scripts
output.cegma.fa - contains protein sequences of the predicted CEGs. One protein for each of the 458 core genes that are present in your genome
output.cegma.gff - contains exon details of all of the CEGMA predicted genes
output.cegma.id - contains the KOG IDs for the selected proteins
output.cegma.local.gff - contains the GFF information of the CEGs using local coordinates (relative to the dna file)
output.completeness_report - contains a summary of which of the subset of the 248 most highly-conserved CEGs are present (either partially or completely, see below for more details)
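
As a quick sanity check after a run, the file names listed above can be tested for existence, and the number of predicted proteins can be counted. The following is a minimal sketch, not part of CEGMA: the function names are made up for illustration, and the protein count simply assumes that output.cegma.fa is a standard FASTA file with one '>' header line per protein.

```python
from pathlib import Path

# Suffixes of the seven output files listed above.
CEGMA_SUFFIXES = [
    "cegma.dna",
    "cegma.errors",
    "cegma.fa",
    "cegma.gff",
    "cegma.id",
    "cegma.local.gff",
    "completeness_report",
]


def expected_outputs(prefix="output", directory="."):
    """Return the paths a CEGMA run is expected to write for this prefix."""
    return [Path(directory) / f"{prefix}.{suffix}" for suffix in CEGMA_SUFFIXES]


def missing_outputs(prefix="output", directory="."):
    """Return the expected output files that are not present on disk."""
    return [path for path in expected_outputs(prefix, directory) if not path.is_file()]


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `missing_outputs("output")` returns an empty list when all seven files exist in the current directory, and `count_fasta_records("output.cegma.fa")` gives the number of predicted proteins, which should be at most 458.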

Partial vs Complete

'Complete' refers to those predicted proteins in the set of 248 CEGs that, when aligned to the HMM for the KOG for that protein family, give an alignment length that is at least 70% of the protein length. For example, if CEGMA produces a 100 amino acid protein, and the alignment length to the HMM to which that protein should belong is 110, then we would say that the protein is 'complete' (91% aligned).

If a protein is not complete but still exceeds a pre-computed minimum alignment score, then we call the protein 'partial'. Note that these pre-computed scores were generated before CEGMA switched from HMMER v2 to HMMER v3, which might have made some slight differences to what is classed as partial. The pre-computed scores are all in the file CEGMA/data/completeness_cutoff.tbl. Note that a protein that is deemed to be 'Complete' will also be included in the set of Partial matches.
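
The two categories can be summarised in a few lines of code. This is a sketch of the rules as described here, not CEGMA's own implementation: the function and parameter names are hypothetical, the 70% figure is treated as a minimum, and the aligned fraction and the per-KOG minimum score (from completeness_cutoff.tbl) are assumed to have been obtained elsewhere.

```python
def classify_ceg(aligned_fraction, alignment_score, min_partial_score,
                 complete_fraction=0.70):
    """Return the categories ('complete', 'partial') that one predicted CEG falls into.

    aligned_fraction   -- fraction aligned to the KOG's HMM (0.91 in the worked example)
    alignment_score    -- alignment score of the predicted protein
    min_partial_score  -- pre-computed minimum score for this KOG (assumed input)
    """
    categories = set()
    if aligned_fraction >= complete_fraction:
        # A complete protein is also counted among the partial matches.
        categories.update({"complete", "partial"})
    elif alignment_score > min_partial_score:
        # Not complete, but the score still exceeds the pre-computed minimum.
        categories.add("partial")
    return categories
```

With the worked example above, `classify_ceg(0.91, 50.0, 40.0)` places the protein in both sets (the two score values here are invented), while a protein with half of its length aligned and a score above the cutoff would be counted as partial only.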


References

Genis Parra, Keith Bradnam, and Ian Korf. "CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes."
Bioinformatics, 23: 1061-1067 (2007)
Genis Parra, Keith Bradnam, Zemin Ning, Thomas Keane, and Ian Korf. "Assessing the gene space in draft genomes."
Nucleic Acids Research, 37(1): 289-297 (2009)


CEGMA Mailing List

We have created a mailing list which we will use to send news about updates and for any general discussion of CEGMA. If you are interested in subscribing, please follow these instructions (subscriptions are moderated):

Email sympa@ucdavis.edu using the address that you want to use to subscribe to the list
In the subject line of your message, type: subscribe cegma First_name Last_name
Leave the message body blank.

To unsubscribe:

Email sympa@ucdavis.edu using the address that you used to subscribe to the list
In the subject line of your message, type: unsubscribe cegma
Leave the message body blank.

Please email us with any other bugs or suggestions that you have.
