About Korf Lab




July 30th 2014 - New CEGMA VM available

In addition to other options (see news item below), you can now run CEGMA preinstalled as part of an Ubuntu VM. Thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core for doing this. Here are the instructions for using our CEGMA VM.

July 21st 2014 - Homebrew packages for CEGMA available

Thanks to Shaun Jackman, there is a Homebrew package for CEGMA available. This is part of the Homebrew-Science initiative, which makes command-line tools easier to install on Macs (and on Linux if you use Linuxbrew). If you have Homebrew installed, you can add CEGMA as follows:

brew install homebrew/science/cegma

May 19th 2014 - version 2.5 of CEGMA is available

This release fixes some minor bugs, tidies up the code in several ways, and moves the code to GitHub. See the full release notes on GitHub for more details, and download the code either there or from this website. Note that this will be the last 2.x release of CEGMA.

April 14th 2014 - New CEGMA FAQ

I've started a FAQ to collect common questions about CEGMA.

November 6th 2013 - Run CEGMA via iPlant

Users of iPlant can now run CEGMA thanks to Michael Crusoe, who has added CEGMA as an iPlant application.

August 2013 - CEGMA v2 is being discontinued

Unfortunately, due to insufficient resources it will not be possible to address any queries or issues regarding version 2 of CEGMA. However, there are ongoing plans to completely redevelop CEGMA to make it more powerful and flexible, and to create a code base which will be easier to maintain in future.

We are happy to continue running the latest version of CEGMA (v2.4) on your behalf if you can make the genome or genome assembly available to download.


In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of 458 core proteins that are present in a wide range of taxa. Since these proteins are highly conserved, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of the genome or annotations.

The CEGMA core genes dataset was built using the NCBI euKaryotic clusters of Orthologous Groups (KOGs) database. More specifically, the starting point was the subset of 928 KOGs that are conserved between Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. These were then filtered by various criteria to arrive at the final set of 458 Core Eukaryotic Genes (CEGs); see the paper for details.


Download CEGMA

CEGMA distributions contain several directories and files compressed in a tar.gz file. Source code and documentation files are included in the distribution, as well as several parameter files and other information. This final 2.x release of CEGMA is also available on GitHub.

CEGMA v2.5

Download the latest version: CEGMA_v2.5.tar.gz. Note that older versions of CEGMA are still available. See the included release notes for full details of changes. Please note that, apart from fixing bugs, improving the display of CEGMA output, and tidying up some of the underlying code, this version provides no new functionality.


Installing CEGMA

Please see the full CEGMA README file and the CEGMA FAQ for installation instructions. Note that CEGMA requires version 2.2.3-rc7 or 2.4.1 of GeneWise. The former version is not available on the EBI web or ftp site, so Ewan Birney has kindly agreed that we can host a copy on our website:


To install GeneWise, you will also need to have glib installed (which can be installed on a Mac via MacPorts). On an Ubuntu Linux system, you should be able to install GeneWise by running:

$ sudo apt-get install wise

Also see these two guides to installing CEGMA on Ubuntu Linux:

To test that everything is running correctly, you can try the following command that uses the test files provided by CEGMA:

cegma -g /path/to/CEGMA/sample/sample.dna -p /path/to/CEGMA/sample/sample.prot

where /path/to/CEGMA should be the correct path to your local CEGMA installation. This should run relatively quickly and produce the following screen and file output.


Core eukaryotic genes dataset

Each link in the following tables points to the underlying data file. The link text gives the file size in kilobytes (K) or megabytes (M).

Data based on the original KOG entries:

                             Proteins  Alignment  Profiles
                             (fasta)   (clustal)  (hmmer)
Core eukaryotic genes (458)  1M        2M         19M

Genomic data

The following files contain the genomic data for the proteins selected from the KOGs database. They correspond to the annotations (from 2007) of the core genes in these six genomes. A single tarball containing all of this data can also be downloaded.

                 Genomic  Coordinates  Transcript  Proteins
                 (fasta)  (gff)        (fasta)     (fasta)
H. sapiens       13M      173K         559K        193K
D. melanogaster  1M       62K          563K        194K
C. elegans       2M       85K          557K        192K
A. thaliana      1M       196K         568K        196K
S. cerevisiae    1M       25K          569K        196K
S. pombe         1M       47K          557K        193K

Random sets

Genes were selected to make sure that the encoded proteins were longer than 100 amino acids, contained no in-frame stop codons, and had less than 50% low-complexity sequence (as determined by the seg program distributed with WU-BLAST, run with default parameters). The genes are required to have introns in the range of 40 bp to 10 kbp and to use canonical splice sites. Each set contains 500 random genes. A single tarball containing all of this data can also be downloaded.

                 Genomic  Coordinates  Transcript  Proteins
                 (fasta)  (gff)        (fasta)     (fasta)
H. sapiens       26M      128K         781K        265K
D. melanogaster  3M       57K          796K        270K
C. elegans       2M       72K          547K        195K
A. thaliana      1M       78K          578K        197K
S. cerevisiae    1M       15K          750K        253K
S. pombe         1M       27K          721K        244K
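
The selection criteria for the random sets can be sketched in a few lines of Python. This is only an illustration, not the code that was used to build the sets: the GeneCandidate container and its field names are hypothetical, the low-complexity fraction is assumed to have been computed beforehand from seg output, "canonical splice sites" is assumed to mean GT...AG introns, and the intron range is assumed to apply to every intron a gene has (intronless genes pass).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneCandidate:
    """Hypothetical container; the field names are illustrative only."""
    protein: str                    # translated CDS, '*' marks a stop codon
    introns: List[str]              # intron sequences on the coding strand
    low_complexity_fraction: float  # assumed precomputed from seg output (0.0 to 1.0)


def passes_random_set_filters(gene: GeneCandidate) -> bool:
    """Return True if the gene meets the criteria listed for the random sets."""
    # Assumption: one trailing '*' is the normal terminator, not an in-frame stop.
    protein = gene.protein[:-1] if gene.protein.endswith("*") else gene.protein
    if len(protein) <= 100:                  # must be longer than 100 amino acids
        return False
    if "*" in protein:                       # no in-frame stop codons
        return False
    if gene.low_complexity_fraction >= 0.5:  # less than 50% low complexity
        return False
    for intron in gene.introns:
        if not 40 <= len(intron) <= 10_000:  # 40 bp to 10 kbp
            return False
        seq = intron.upper()
        if not (seq.startswith("GT") and seq.endswith("AG")):
            return False                     # assumed canonical GT...AG
    return True


candidate = GeneCandidate("M" + "A" * 150 + "*", ["GT" + "C" * 60 + "AG"], 0.12)
print(passes_random_set_filters(candidate))
```

Taking the low-complexity fraction as an input keeps the sketch independent of how seg is run or parsed.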


Core genes in new species

The following data correspond to the genes mapped by CEGMA in the recently sequenced genomes of Anopheles gambiae, Chlamydomonas reinhardtii, Ciona intestinalis and Toxoplasma gondii. The last column corresponds to the genes that are not mapped in the annotations of the current pipelines. A single tarball containing all of this data can also be downloaded.

                           Genomic  Coordinates  Proteins  Not annotated
                           (fasta)  (gff)        (fasta)   (fasta)
Anopheles gambiae          15M      211M         8.5M      8.5M
Chlamydomonas reinhardtii  4.0M     70M          2.6M      2.6M
Ciona intestinalis         1.2M     13M          772K      772K
Toxoplasma gondii          10M      141M         6.4M      6.4M


Understanding CEGMA output

Also see the CEGMA FAQ. CEGMA should produce 7 output files for each run. Assuming you use 'output' as the output file name prefix:

Partial vs Complete

'Complete' refers to those predicted proteins in the set of 248 CEGs that, when aligned to the HMM for the KOG of that protein family, give an alignment length that is at least 70% of the protein length. For example, if CEGMA produces a 100 amino acid protein, and the alignment length to the HMM to which that protein should belong is 110, then we would say that the protein is 'complete' (91% aligned).

If a protein is not complete but still exceeds a pre-computed minimum alignment score, then we call the protein 'partial'. Note that these pre-computed scores were generated before we changed CEGMA from HMMER v2 to HMMER v3, which might have made some slight differences to what is classed as partial. The pre-computed scores are all in the file CEGMA/data/completeness_cutoff.tbl. Note that a protein that is deemed to be 'Complete' will also be included in the set of Partial matches.
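
The complete/partial logic described above can be sketched roughly as follows. This is not CEGMA's own code: the function names, the 'missing' label and the argument names are hypothetical, the 70% figure is assumed to be a minimum, and the aligned fraction and alignment score are taken as inputs rather than computed from HMMER output.

```python
from typing import Dict, Iterable


def classify_ceg(aligned_fraction: float, score: float, min_partial_score: float,
                 complete_threshold: float = 0.70) -> str:
    """Label one predicted protein (hypothetical helper)."""
    if aligned_fraction >= complete_threshold:  # assumed: at least 70% aligned
        return "complete"
    if score > min_partial_score:               # exceeds the pre-computed cutoff
        return "partial"
    return "missing"                            # hypothetical label for everything else


def tally(labels: Iterable[str]) -> Dict[str, int]:
    """Count matches; complete proteins are also counted as partial matches."""
    labels = list(labels)
    complete = labels.count("complete")
    partial = complete + labels.count("partial")
    return {"complete": complete, "partial": partial}


labels = [
    classify_ceg(aligned_fraction=100 / 110, score=10.0, min_partial_score=50.0),
    classify_ceg(aligned_fraction=0.50, score=60.0, min_partial_score=50.0),
    classify_ceg(aligned_fraction=0.50, score=40.0, min_partial_score=50.0),
]
print(labels)
print(tally(labels))
```

The first call uses the 91% figure from the example above; the scores and cutoffs are made-up numbers, and real cutoffs would come from completeness_cutoff.tbl.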


CEGMA Mailing List

We have created a mailing list which we will use to send news about updates and for any general discussion of CEGMA. If you are interested in subscribing, please follow these instructions (subscriptions are moderated):

To unsubscribe:

Please email us with any other bugs or suggestions that you have.
