Σ -- cegma
Ian Korf Lab. Genome Center. UCDavis


   


Abstract

 
In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of 458 core proteins that are present in a wide range of taxa. Since these proteins are highly conserved, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of a genome or its annotations.


Core eukaryotic genes dataset

Data based on the original KOG entries:

                              Proteins   Alignment   Profiles
                              (fasta)    (clustal)   (hmmer)

Core eukaryotic genes (458)   1M         2M          19M
 
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.

Genomic data: The following files contain the genomic data for the proteins selected from the KOGs database. They correspond to the latest annotations of the core genes in the six genomes listed below.

                  Genomic   Coordinates   Transcript   Proteins
                  (fasta)   (gff)         (fasta)      (fasta)

H.sapiens         13M       173K          559K         193K
D.melanogaster    1M        62K           563K         194K
C.elegans         2M        85K           557K         192K
A.thaliana        1M        196K          568K         196K
S.cerevisiae      1M        25K           569K         196K
S.pombe           1M        47K           557K         193K
 
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.

Random sets: Genes were selected so that the encoded proteins were longer than 100 amino acids, contained no in-frame stop codons, and had less than 50% low-complexity sequence (as determined by the seg program distributed with WU-BLAST, run with default parameters). Genes were also required to have introns between 40 bp and 10 kbp in length and to use canonical splice sites. Each set contains 500 random genes.
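The selection criteria above can be sketched as a simple filter followed by random sampling. This is only an illustration, not the code used to build the sets: the Gene record, its field names, and the helper functions are hypothetical, the low-complexity fraction is assumed to have been computed beforehand (for example from seg output), "canonical splice sites" is assumed to mean GT...AG introns, and genes without introns are assumed to pass the intron checks.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

MIN_PROTEIN_LEN = 100        # proteins must be longer than this (amino acids)
MAX_LOW_COMPLEXITY = 0.5     # low-complexity fraction must be below this
MIN_INTRON, MAX_INTRON = 40, 10_000  # allowed intron length range (bp)


@dataclass
class Gene:
    """Hypothetical record; the field names are illustrative only."""
    protein: str                    # translated sequence, '*' marks a stop codon
    low_complexity_fraction: float  # assumed precomputed, e.g. from seg output
    introns: List[str] = field(default_factory=list)  # intron sequences


def passes_filters(gene: Gene) -> bool:
    """Return True if the gene satisfies every selection criterion."""
    protein = gene.protein.rstrip("*")  # ignore the terminal stop codon
    if len(protein) <= MIN_PROTEIN_LEN:
        return False
    if "*" in protein:  # an in-frame stop inside the coding sequence
        return False
    if gene.low_complexity_fraction >= MAX_LOW_COMPLEXITY:
        return False
    for intron in gene.introns:
        seq = intron.upper()
        if not MIN_INTRON <= len(seq) <= MAX_INTRON:
            return False
        # Assumption: canonical splice sites means GT donor and AG acceptor.
        if not (seq.startswith("GT") and seq.endswith("AG")):
            return False
    return True


def random_set(genes: List[Gene], size: int = 500,
               seed: Optional[int] = None) -> List[Gene]:
    """Draw up to `size` genes at random from those passing the filters."""
    candidates = [g for g in genes if passes_filters(g)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(size, len(candidates)))
```

Passing a fixed seed makes the draw reproducible, which is convenient when the same random set needs to be regenerated later.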

                  Genomic   Coordinates   Transcript   Proteins
                  (fasta)   (gff)         (fasta)      (fasta)

H.sapiens         26M       128K          781K         265K
D.melanogaster    3M        57K           796K         270K
C.elegans         2M        72K           547K         195K
A.thaliana        1M        78K           578K         197K
S.cerevisiae      1M        15K           750K         253K
S.pombe           1M        27K           721K         244K
 
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.

Core genes in new species

The following data correspond to the genes mapped by CEGMA in the recently sequenced genomes of Anopheles gambiae, Chlamydomonas reinhardtii, Ciona intestinalis and Toxoplasma gondii. The last column contains the genes that are missing from the annotations produced by the current pipelines.

                            Genomic   Coordinates   Proteins   Not annot.
                            (fasta)   (gff)         (fasta)    (fasta)

Anopheles gambiae           15M       211M          8.5M       8.5M
Chlamydomonas reinhardtii   4.0M      70M           2.6M       2.6M
Ciona intestinalis          1.2M      13M           772K       772K
Toxoplasma gondii           10M       141M          6.4M       6.4M
 
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.


Source code distribution

 
The CEGMA distribution is a tar.gz archive containing several directories and files. It includes source code and documentation, as well as several parameter files and other supporting information. If you prefer not to install the software, you can contact us at genparra(at)ucdavis.edu and we will run the pipeline on your genome of interest.

cegma v 2.0:   NEW !!

  • cegma v 2.0   Latest full distribution: source code and documentation
         [DOWNLOAD]

  • New features included:
    • Genome sequence completeness estimation based on the number of genes found (see Genome completeness).
    • New parameter --intron_max sets a maximum intron length.
    • Output file including genomic coordinates.
    • Compatible with the new release of geneid (v 3.1.7). WARNING: the latest geneid version does not appear to be very stable. We strongly recommend using geneid v1.2 with CEGMA.
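
As a rough illustration of the completeness estimate listed among the features above, the sketch below assumes completeness is simply the percentage of the 458 core genes that were found. This is an assumption made for the example; the calculation CEGMA actually performs may differ. The function name and the KOG-style identifiers in the usage example are hypothetical.

```python
from typing import Iterable

TOTAL_CORE_GENES = 458  # size of the core gene set described on this page


def completeness_percent(found_ids: Iterable[str],
                         total: int = TOTAL_CORE_GENES) -> float:
    """Percentage of core genes found (assumed definition, see lead-in)."""
    if total <= 0:
        raise ValueError("total must be positive")
    unique = set(found_ids)  # count each core gene only once
    if len(unique) > total:
        raise ValueError("more genes found than exist in the core set")
    return 100.0 * len(unique) / total


if __name__ == "__main__":
    # Hypothetical identifiers; the duplicate is counted once.
    found = ["KOG0001", "KOG0002", "KOG0001"]
    print(f"{completeness_percent(found):.2f}% of core genes found")
```

Duplicates are collapsed before counting so that a core gene mapped more than once does not inflate the estimate.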

cegma v 1.0:

  • cegma v 1.0   Full distribution: source code and documentation
         [DOWNLOAD]

Instructions to install cegma on your computer.

References
  • G. Parra, K. Bradnam and I. Korf.
    "CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes."
    Bioinformatics, 23: 1061-1067 (2007)   [Abstract]   [Full Text]