|
This web page contains
In this study, we report a computational method, CEGMA (Core
Eukaryotic Genes Mapping Approach), for building a highly reliable set
of gene annotations in the absence of experimental data. We define a
set of 458 core proteins that are present in a wide range of
taxa. Since these proteins are highly conserved, sequence alignment
methods can reliably identify their exon-intron structures in genomic
sequences. The resulting dataset can be used to train a gene finder or
to assess the completeness of the genome or annotations.
Data based on the original KOG entries:
| | Proteins (fasta) | Alignment (clustal) | Profiles (hmmer) |
|---|---|---|---|
| Core eukaryotic genes (458) | 1M | 2M | 19M |
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Genomic data: The following files contain the genomic data for
the proteins selected from the KOGs database. They correspond to the
latest annotations of the core genes in these six genomes.
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
Random sets: Genes were selected so that the encoded
proteins were longer than 100 amino acids, contained no in-frame stop
codons, and had less than 50% low-complexity sequence (as determined by
the seg program distributed with WU-BLAST, run with default
parameters). The genes were also required to have introns in the range of
40 bp to 10 kbp and to use canonical splice sites. Each set contains 500
random genes.
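The selection criteria above can be sketched as a simple filter. This is an illustrative sketch only, not the code used to build the sets: the `Gene` record, its field names, and the function names are hypothetical, the low-complexity fraction is assumed to have been computed beforehand (for example from seg output), and "canonical splice sites" is taken to mean GT...AG introns.

```python
from dataclasses import dataclass
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON, MAX_INTRON = 40, 10_000  # intron length bounds in bp


@dataclass
class Gene:
    """Hypothetical record for one candidate gene."""
    cds: str                        # spliced coding sequence, 5'->3'
    introns: List[str]              # intron sequences, 5'->3'
    low_complexity_fraction: float  # assumed precomputed, range 0.0-1.0


def split_codons(cds: str) -> List[str]:
    """Split a coding sequence into complete, upper-case codons."""
    return [cds[i:i + 3].upper() for i in range(0, len(cds) - 2, 3)]


def passes_filters(gene: Gene) -> bool:
    codons = split_codons(gene.cds)
    # A stop codon in the final position is the normal terminator.
    has_terminal_stop = bool(codons) and codons[-1] in STOP_CODONS
    protein_length = len(codons) - (1 if has_terminal_stop else 0)
    if protein_length <= 100:
        return False
    # Any stop codon before the final codon is an in-frame stop.
    if any(c in STOP_CODONS for c in codons[:-1]):
        return False
    if gene.low_complexity_fraction >= 0.5:
        return False
    for intron in gene.introns:
        seq = intron.upper()
        if not MIN_INTRON <= len(seq) <= MAX_INTRON:
            return False
        # Canonical splice sites: GT donor, AG acceptor (assumption).
        if not (seq.startswith("GT") and seq.endswith("AG")):
            return False
    return True
```

A random set would then be drawn from the genes that pass, for instance with `random.sample(passing_genes, 500)`.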
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
The following data correspond to the genes mapped by CEGMA in
the recently sequenced genomes of Anopheles gambiae,
Chlamydomonas reinhardtii, Ciona intestinalis and
Toxoplasma gondii. The last column lists the genes mapped by CEGMA
that are absent from the annotations produced by the current pipelines.
This table shows the file sizes of the files in each category.
Click on file size numbers to retrieve the corresponding file.
Click here to get the tar.gz file with all the data.
The CEGMA distribution contains several directories and files
compressed in a tar.gz file. Source code and documentation
files are included in the distribution, as well as several parameter
files and other extra information. If you prefer not to install the
software, you can contact us at genparra(at)ucdavis.edu, and we will
run the pipeline on your favorite genome.
cegma v 2.0: NEW !!
- cegma v 2.0 Latest full distribution: source code and documentation
[DOWNLOAD]
New features included:
- Genome sequence completeness estimation based on the number of genes found
(see Genome completeness).
- New parameter --intron_max sets a maximum intron length.
- Output file including genomic coordinates.
- Compatible with the new release of geneid (v 3.1.7). WARNING: the latest geneid version does not seem to be very stable. We strongly recommend using geneid v1.2 with CEGMA.
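
As a rough illustration of the completeness estimate mentioned in the feature list, the simplest reading is the percentage of core genes recovered in an assembly. This page does not describe how CEGMA actually computes its estimate, so the function below is an assumption for illustration, and its name and the gene identifiers are hypothetical.

```python
from typing import Iterable


def completeness_percent(found_ids: Iterable[str], core_ids: Iterable[str]) -> float:
    """Percentage of core genes that were found (naive estimate)."""
    core = set(core_ids)
    if not core:
        raise ValueError("core gene set is empty")
    # Count only identifiers that belong to the core set.
    found = core & set(found_ids)
    return 100.0 * len(found) / len(core)


# Hypothetical identifiers: 2 of 4 core genes were recovered.
print(completeness_percent(["geneA", "geneB", "other"],
                           ["geneA", "geneB", "geneC", "geneD"]))
```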
cegma v 1.0:
- cegma v 1.0 Full distribution: source code and documentation
[DOWNLOAD]
Instructions to install cegma on your computer.
- G. Parra, K. Bradnam and I. Korf.
"CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes."
Bioinformatics, 23: 1061-1067 (2007) [Abstract] [Full Text]
|