Last updated by Keith Bradnam on 2014–05–02
I'm grateful to Philipp Bayer
for his many contributions to this FAQ.
First just try running the cegma
command (a Perl script) with no
other options. If your CEGMA path is correctly set up then you you should see the
default CEGMA help page, which starts off like this:
cegma
PROGRAM:
cegma - 2.5
Core Eukaryotic Genes Mapping Approach
USAGE:
cegma [options] <-g genomic_fasta_sequence>
DESCRIPTION:
CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for
building a set of high reliable set of gene annotations in
virtually any eukaryotic genome. It combines tblastn, genewise,
hmmer, with geneid, an "ab initio" gene prediction program.
Then, try running CEGMA with the supplied ‘sample’ files. E.g. if
you installed CEGMA in /software/cegma
, then run:
cegma -g /software/cegma/sample/sample.dna -p /software/cegma/sample/sample.prot
This runs CEGMA using a small DNA file with a subset of the 458 CEGMA proteins. This should produce output as it runs. This output starts:
********************************************************************************
** MAPPING PROTEINS TO GENOME (TBLASTN) **
********************************************************************************
RUNNING: genome_map -n genome -p 6 -o 5000 -c 2000 -t 1 /software/cegma/sample/sample.prot
/software/cegma/sample/sample.dna 2>output.cegma.errors
Building a new DB, current time: 04/17/2014 20:01:19
New DB name: /tmp/genome90742.blastdb
New DB title: /software/cegma/sample/sample.dna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 1 sequences in 0.0374589 seconds.
You should also check that your $WISECONFIGDIR environment variable has been set:
ls $WISECONFIGDIR/gene.stat
If set correctly, this shouldn't produce an error.
The first step of CEGMA runs the genome_map
script. In turn this
relies on using NCBI BLAST+, and in particular the makeblastdb
command.
First try repeating the genome_map
command that is running. E.g.
genome_map -n genome -p 6 -o 5000 -c 2000 -t 1 /software/cegma/sample/sample.prot
/software/cegma/sample/sample.dna
Observe any error messages. Most problems at this stage are due to problems with
the format of FASTA definition lines not being accepted by makeblastdb
.
You could always test this step separately as well, e.g. run:
makeblastdb -in <your_genome_FASTA_file> -dbtype nucl -parse_seqids -out test_BLASTDB
Common problems with the FASTA files include:
|
) or commas in definition line.makeblastdb
likes to see some text in the identifier.A simple, but workable naming scheme is simply:
>scaffold1
ACGT...
>scaffold2
ACGT....
etc.
One other issue that we know about is that changes NCBI made to version 2.2.30+ of NCBI BLAST breaks CEGMA because they no longer support a word size of 5 (which is used by CEGMA when it runs TBLASTN). This may be a bug by the NCBI (more info here). You could change the relevant line of the genome_map script to allow a word size of 6 (which is currently permitted), but this may change downstream CEGMA results.
geneid-train did not work properly
. What happened?CEGMA assumes that there is at least one core gene present in your genome assembly. It’s possible that this isn’t always the case. The CEGMA output tells you how many genes are being kept at each stage. E.g.
********************************************************************************
** MAPPING PROTEINS TO GENOME (TBLASTN) **
********************************************************************************
Found 2209 candidate regions in seqs.fa
********************************************************************************
** MAKING INITIAL GENE PREDICTIONS FOR CORE GENES (GENEWISE + GENEID) **
********************************************************************************
NOTE: created 14 geneid predictions
********************************************************************************
** FILTERING INITIAL PROTEINS PRODUCED BY GENEID (HMMER) **
********************************************************************************
NOTE: Found 1 geneid predictions with scores above threshold value
If you have very low numbers of candidate genes in the early stages, it’s possible that all will be removed by the filtering steps. And if you end up with zero core genes, you may see a strange error message (possibly a 'divide by zero' error).
Well, it’s better than having 234 CEGs, but not as good as having 236. The number of core genes in any assembly is just one metric by which to assess the quality of an assembly. There are many others. Please also see my blog post Which genome assembler gives you the best genome assembly? for more discussion of this topic.
The original CEGMA developer set up things so that each KOG (which contains 6
proteins, one from each of 6 model organism species) is allowed to have up to 5
BLAST matches each in the target genome assembly. This is the first part of the
CEGMA pipeline (run by the genome_map
script).
Theoretically, this initial BLAST step could therefore find up to 30 non-overlapping candidate regions across a genome. In reality, most of these regions will be overlapping from different input proteins.
So when you see a KOG ID in the CEGMA output files such as ‘KOG0002.9’, this is just referring to the 9th region considered by BLAST for KOG 0002. These IDs are used throughout the downstream steps by CEGMA but are essentially just internal identifiers.
To convert these CEGMA-formatted KOG IDs, back to regular KOG IDs, you can run something like the following in the terminal:
cat output.cegma.id | sed 's/\.[0-9]*//' > output.kog.id
The latter is a subset of the former, though due to different filtering rules for the more conserved subset, it is possible that a core gene is present in the 458 CEGs but not present in the set of 248 CEGs.
A KOG is a euKaryotic Orthologous Group, and the KOGs database is the source of the orthologous groups used by CEGMA. Although KOGs (published over a decade ago) was based on 7 species, we only use 6 of these 7 within CEGMA (the microsporidian parasite Encephalitozoon cuniculi was excluded).
So CEGMA uses 458 KOGs (each of which has 6/7 proteins). Essentially, we refer to these as Core Eukaryotic Genes (CEGs). The underlying protein sequences are taken from the KOGs database, but the HMM models we build from each KOG are part of CEGMA. So a 'CEG' really describes a 6-protein KOG plus the HMM that we build from that KOG.
You could just try running the genewise
command with your own data
but to test it with the built-in CEGMA data, try the following steps (which assume
that your CEGMA installation is at /software/cegma):
# run genome map with the supplied sample data
genome_map -n genome -p 6 -o 5000 -c 2000 -t 1 /software/cegma/sample/sample.prot
/software/cegma/sample/sample.dna
# this should produce a genome.chunks.fa file (amongst others). Now we want to grab
# one sequence from this which should contain a KOG sequence (KOG0631) that genewise can work with.
head -n 1134 genome.chunks.fa | tail -n 116 > tmp.fa
# check that tmp.fa contains one sequence resulting from BLAST hit to KOG0631!
# now run genewise using the suitable HMM file for this KOG
genewise -splice_gtag -quiet -gff -pretty -alb
-hmmer /software/cegma/data/hmm_profiles/KOG0631.hmm tmp.fa > tmp.genewise
If this works, you should have a genewise file which describes the protein structure of a potential core gene. If this doesn't work, you probably have a problem with your genewise installation
FATAL ERROR when running genome_map 32512:
sh: 1: ./genome_map: not found
Fix: This happens when $CEGMA isn't set properly. set $CEGMA as described in manual.
Can't locate FAlite.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2
/usr/local/share/perl/5.14.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.14
/usr/share/perl/5.14 /usr/local/lib/site_perl .) at ./genome_map line 14.
BEGIN failed--compilation aborted at ./genome_map line 14.
Fix: CEGMA depends on the Perl library FAlite.pm, which is in CEGMA's lib folder, or download from the above link. You can do 'export PERL5LIB:$CEGMA/lib:$PERL5LIB'. Or just copy it to any of the folders in that list or to the directory where you're running CEGMA.
Warning Error
Unable to open gene.stat as gene stats file
Fix: This happens when $WISECONFIGDIR is set incorrectly, or not at all. As a workaround, you can download your own copy of genewise, you don't have to install your own copy, just set your environment variable to the wisecfg folder in the extracted genewise-folder.
export WISECONFIGDIR=/home/your_username/wise2.2.3-rc7/wisecfg
Error: NCBI C++ Exception:
ncbi::CObject::ThrowNullPointerException() - Attempt to access NULL pointer.
Fix: Update your blast+ version, older versions like 2.25 have this problem with some datasets and several threads, 2.29 doesn't seem to have it. If it still persists, a workaround is to use -T 1.
In sample.cegma.errors
Warning Error
In GeneLoop6 matrix special read off - out of bounds on matrix [i,j
is 193,-1 state 0 in standard matrix]
Fix: Two things cause this. The first is an older version of CEGMA, so just update it. The second is re-starting CEGMA from an incomplete, crashed run (saved by using the -ext flag). The solution is to delete the results from the incomplete run and start again.
Thanks to Shaun Jackman, there is a Homebrew package for CEGMA available. This is part of the Homebrew-Science initiative, making command-line tools more easily installable on Macs (also on Linux if you use Linuxbrew). If you have homebrew installed you can add CEGMA as follows:
brew install homebrew/science/cegma
An alternative package manager for Linux is Local Package Manager (LPM) and they also have a CEGMA package. Thanks to Masahiro Kasahara for bringing this to my attention.
Users of iPlant can also run CEGMA thanks to Michael Crusoe who has added CEGMA as an iPlant application.
I am also grateful to Matt MacManes for letting me know that there is a public Amazon Machine Instance called CEGMA on the CLOUD, which means you can also run CEGMA on Amazon's Elastic Cloud infrastructure.
If you use Docker, then you can use the CEGMA docker container that was made by Rob Syme.
Finally, we now have a CEGMA VM that you can download which will contain a pre-installed version of CEGMA that runs on Ubuntu Linux. See our CEGMA VM page for more info. Thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core for setting this up.
Jun 26, 2015: Keith Bradnam is interviewed by Frontline Genomics Magazine about his life in Bioinformatics.
Apr 8, 2015: Ian Korf is quoted in a Nature commentary article about Bioinformatics Service cores and the need for beter career paths for bioinformaticians.
Mar 16, 2015: Danielle Lemay is interviewed by the UC Davis News team about the new publication by herself, Kristen Beck (lead author), Ian Korf and others that describes new milk proteomes for human and macaque.
Apr 22, 2013: The Assemblathon 2 paper has won the 2013 BioMed Central Open Data award
Dec 10, 2013: A short piece in the UC Davis Alumni Magazine that discusses the new Genomics undergraduate major that Ian Korf co-developed.
Nov 26, 2013: Ian Korf writes a News & Views piece for Nature Methods about two new comparisons of programs that work with RNA-seq data
Nov 1, 2013: Keith B. and Kristen are both featured in a piece on Inquiring Minds as part of the new One UC Davis campaign.
Our free 175-page primer that teaches the basics of Unix & Perl
Our book that greatly expands on our free primer.
A regular discussion forum at the Genome Center for all things relating to sequence analysis
Where we work
Part of the Genome Center
For questions or comments about the website, please e-mail:
korflab AT ucdavis DOT edu
Contact information for specific members of our lab can be found on their personal pages.