CEGMA FAQ

Last updated by Keith Bradnam on 2014–05–02
I'm grateful to Philipp Bayer for his many contributions to this FAQ.

1) How can I test whether CEGMA installed correctly?
2) CEGMA fails when it tries to run the first step (TBLASTN). What should I do?
3) I see an error that geneid-train did not work properly. What happened?
4) CEGMA said that my assembly contained 235 CEGs (out of 248). Is this good?
5) What do the numerical suffix on KOG IDs represent?
6) What’s the difference between the 458 and 248 sets of core eukaryotic genes (CEGs)?
7) What is the difference between a KOG and a CEG?
8) How can I test my genewise installation?
9) Why do I get a FATAL error when running genome_map?
10) Why can't CEGMA locate FAlite.pm?
11) Why can't gene.stat file be opened?
12) What is the NCBI C++ Exception?
13) What does the GeneLoop out of bounds error mean?
14) Are there other - less painful - ways that I can install CEGMA?

1. How can I test whether CEGMA installed correctly?

First just try running the cegma command (a Perl script) with no other options. If your CEGMA path is correctly set up then you you should see the default CEGMA help page, which starts off like this:

                    
                        cegma  

PROGRAM:
                     cegma - 2.5

         Core Eukaryotic Genes Mapping Approach  

USAGE:

	cegma [options] <-g genomic_fasta_sequence>   

DESCRIPTION:  

	CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for  
	building a set of high reliable set of gene annotations in  
	virtually any eukaryotic genome. It combines tblastn, genewise,  
	hmmer, with geneid, an "ab initio" gene prediction program.

Then, try running CEGMA with the supplied ‘sample’ files. E.g. if you installed CEGMA in /software/cegma, then run:


cegma -g /software/cegma/sample/sample.dna -p /software/cegma/sample/sample.prot

This runs CEGMA using a small DNA file with a subset of the 458 CEGMA proteins. This should produce output as it runs. This output starts:


********************************************************************************  
**                    MAPPING PROTEINS TO GENOME (TBLASTN)                    **  
********************************************************************************  

RUNNING: genome_map  -n genome -p 6 -o 5000 -c 2000 -t 1  /software/cegma/sample/sample.prot 
/software/cegma/sample/sample.dna 2>output.cegma.errors  


Building a new DB, current time: 04/17/2014 20:01:19  
New DB name:   /tmp/genome90742.blastdb  
New DB title:  /software/cegma/sample/sample.dna  
Sequence type: Nucleotide  
Keep Linkouts: T  
Keep MBits: T  
Maximum file size: 1073741824B  
Adding sequences from FASTA; added 1 sequences in 0.0374589 seconds.

You should also check that your $WISECONFIGDIR environment variable has been set:

ls $WISECONFIGDIR/gene.stat

If set correctly, this shouldn't produce an error.

2. CEGMA fails when it tries to run the first step (TBLASTN). What should I do?

The first step of CEGMA runs the genome_map script. In turn this relies on using NCBI BLAST+, and in particular the makeblastdb command. First try repeating the genome_map command that is running. E.g.


genome_map  -n genome -p 6 -o 5000 -c 2000 -t 1  /software/cegma/sample/sample.prot 
/software/cegma/sample/sample.dna

Observe any error messages. Most problems at this stage are due to problems with the format of FASTA definition lines not being accepted by makeblastdb. You could always test this step separately as well, e.g. run:


makeblastdb -in <your_genome_FASTA_file> -dbtype nucl -parse_seqids -out test_BLASTDB

Common problems with the FASTA files include:

Duplicate FASTA identifiers (they must be unique)
Presence of certain characters like pipes (|) or commas in definition line.
Parentheses or square brackets may also cause problems.
Using only numbers as identifiers. makeblastdb likes to see some text in the identifier.
Missing identifiers
Missing sequences

A simple, but workable naming scheme is simply:

>scaffold1  
ACGT...  
>scaffold2  
ACGT....

etc.

One other issue that we know about is that changes NCBI made to version 2.2.30+ of NCBI BLAST breaks CEGMA because they no longer support a word size of 5 (which is used by CEGMA when it runs TBLASTN). This may be a bug by the NCBI (more info here). You could change the relevant line of the genome_map script to allow a word size of 6 (which is currently permitted), but this may change downstream CEGMA results.

3. I see an error that `geneid-train did not work properly`. What happened?

CEGMA assumes that there is at least one core gene present in your genome assembly. It’s possible that this isn’t always the case. The CEGMA output tells you how many genes are being kept at each stage. E.g.


********************************************************************************  
**                    MAPPING PROTEINS TO GENOME (TBLASTN)                    **  
********************************************************************************  

Found 2209 candidate regions in seqs.fa  


********************************************************************************  
**     MAKING INITIAL GENE PREDICTIONS FOR CORE GENES (GENEWISE + GENEID)     **  
********************************************************************************  

NOTE: created 14 geneid predictions  


********************************************************************************  
**           FILTERING INITIAL PROTEINS PRODUCED BY GENEID (HMMER)            **  
********************************************************************************  

NOTE: Found 1 geneid predictions with scores above threshold value

If you have very low numbers of candidate genes in the early stages, it’s possible that all will be removed by the filtering steps. And if you end up with zero core genes, you may see a strange error message (possibly a 'divide by zero' error).

4. CEGMA said that my assembly contained 235 CEGs (out of 248). Is this good?

Well, it’s better than having 234 CEGs, but not as good as having 236. The number of core genes in any assembly is just one metric by which to assess the quality of an assembly. There are many others. Please also see my blog post Which genome assembler gives you the best genome assembly? for more discussion of this topic.

5. What do the numerical suffix on KOG IDs represent?

The original CEGMA developer set up things so that each KOG (which contains 6 proteins, one from each of 6 model organism species) is allowed to have up to 5 BLAST matches each in the target genome assembly. This is the first part of the CEGMA pipeline (run by the genome_map script).

Theoretically, this initial BLAST step could therefore find up to 30 non-overlapping candidate regions across a genome. In reality, most of these regions will be overlapping from different input proteins.

So when you see a KOG ID in the CEGMA output files such as ‘KOG0002.9’, this is just referring to the 9th region considered by BLAST for KOG 0002. These IDs are used throughout the downstream steps by CEGMA but are essentially just internal identifiers.

To convert these CEGMA-formatted KOG IDs, back to regular KOG IDs, you can run something like the following in the terminal:


cat output.cegma.id | sed 's/\.[0-9]*//' > output.kog.id

6. What’s the difference between the 458 and 248 sets of core eukaryotic genes (CEGs)?

The latter is a subset of the former, though due to different filtering rules for the more conserved subset, it is possible that a core gene is present in the 458 CEGs but not present in the set of 248 CEGs.

7. What is the difference between a KOG and a CEG?

A KOG is a euKaryotic Orthologous Group, and the KOGs database is the source of the orthologous groups used by CEGMA. Although KOGs (published over a decade ago) was based on 7 species, we only use 6 of these 7 within CEGMA (the microsporidian parasite Encephalitozoon cuniculi was excluded).

So CEGMA uses 458 KOGs (each of which has 6/7 proteins). Essentially, we refer to these as Core Eukaryotic Genes (CEGs). The underlying protein sequences are taken from the KOGs database, but the HMM models we build from each KOG are part of CEGMA. So a 'CEG' really describes a 6-protein KOG plus the HMM that we build from that KOG.

8. How can I test my genewise installation?

You could just try running the genewise command with your own data but to test it with the built-in CEGMA data, try the following steps (which assume that your CEGMA installation is at /software/cegma):


# run genome map with the supplied sample data
genome_map  -n genome -p 6 -o 5000 -c 2000 -t 1  /software/cegma/sample/sample.prot 
/software/cegma/sample/sample.dna

# this should produce a genome.chunks.fa file (amongst others). Now we want to grab
# one sequence from this which should contain a KOG sequence (KOG0631) that genewise can work with.		
head -n 1134 genome.chunks.fa | tail -n 116 > tmp.fa

# check that tmp.fa contains one sequence resulting from BLAST hit to KOG0631!

# now run genewise using the suitable HMM file for this KOG
genewise -splice_gtag -quiet -gff -pretty -alb 
-hmmer /software/cegma/data/hmm_profiles/KOG0631.hmm tmp.fa > tmp.genewise

If this works, you should have a genewise file which describes the protein structure of a potential core gene. If this doesn't work, you probably have a problem with your genewise installation

9) Why do I get a FATAL error when running genome_map?

FATAL ERROR when running genome_map 32512:
sh: 1: ./genome_map: not found

Fix: This happens when $CEGMA isn't set properly. set $CEGMA as described in manual.

10) Why can't CEGMA locate FAlite.pm?

Can't locate FAlite.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 
/usr/local/share/perl/5.14.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.14 
/usr/share/perl/5.14 /usr/local/lib/site_perl .) at ./genome_map line 14.
BEGIN failed--compilation aborted at ./genome_map line 14.

Fix: CEGMA depends on the Perl library FAlite.pm, which is in CEGMA's lib folder, or download from the above link. You can do 'export PERL5LIB:$CEGMA/lib:$PERL5LIB'. Or just copy it to any of the folders in that list or to the directory where you're running CEGMA.

11) Why can't gene.stat file be opened?

Warning Error
Unable to open gene.stat  as gene stats file

Fix: This happens when $WISECONFIGDIR is set incorrectly, or not at all. As a workaround, you can download your own copy of genewise, you don't have to install your own copy, just set your environment variable to the wisecfg folder in the extracted genewise-folder.

export WISECONFIGDIR=/home/your_username/wise2.2.3-rc7/wisecfg

12) What is the NCBI C++ Exception?

Error: NCBI C++ Exception:
ncbi::CObject::ThrowNullPointerException() - Attempt to access NULL pointer.

Fix: Update your blast+ version, older versions like 2.25 have this problem with some datasets and several threads, 2.29 doesn't seem to have it. If it still persists, a workaround is to use -T 1.

13) What does the GeneLoop out of bounds error mean?

In sample.cegma.errors
Warning Error
In GeneLoop6 matrix special read off - out of bounds on matrix [i,j
is 193,-1 state 0 in standard matrix]

Fix: Two things cause this. The first is an older version of CEGMA, so just update it. The second is re-starting CEGMA from an incomplete, crashed run (saved by using the -ext flag). The solution is to delete the results from the incomplete run and start again.

14) Are there other - less painful - ways that I can install CEGMA?

Thanks to Shaun Jackman, there is a Homebrew package for CEGMA available. This is part of the Homebrew-Science initiative, making command-line tools more easily installable on Macs (also on Linux if you use Linuxbrew). If you have homebrew installed you can add CEGMA as follows:

brew install homebrew/science/cegma

An alternative package manager for Linux is Local Package Manager (LPM) and they also have a CEGMA package. Thanks to Masahiro Kasahara for bringing this to my attention.

Users of iPlant can also run CEGMA thanks to Michael Crusoe who has added CEGMA as an iPlant application.

I am also grateful to Matt MacManes for letting me know that there is a public Amazon Machine Instance called CEGMA on the CLOUD, which means you can also run CEGMA on Amazon's Elastic Cloud infrastructure.

If you use Docker, then you can use the CEGMA docker container that was made by Rob Syme.

Finally, we now have a CEGMA VM that you can download which will contain a pre-installed version of CEGMA that runs on Ubuntu Linux. See our CEGMA VM page for more info. Thanks to Richard Feltstykket at the UC Davis Genome Center's Bioinformatics Core for setting this up.

CEGMA FAQ

1. How can I test whether CEGMA installed correctly?

2. CEGMA fails when it tries to run the first step (TBLASTN). What should I do?

3. I see an error that `geneid-train did not work properly`. What happened?

4. CEGMA said that my assembly contained 235 CEGs (out of 248). Is this good?

5. What do the numerical suffix on KOG IDs represent?

6. What’s the difference between the 458 and 248 sets of core eukaryotic genes (CEGs)?

7. What is the difference between a KOG and a CEG?

8. How can I test my genewise installation?

9) Why do I get a FATAL error when running genome_map?

10) Why can't CEGMA locate FAlite.pm?

11) Why can't gene.stat file be opened?

12) What is the NCBI C++ Exception?

13) What does the GeneLoop out of bounds error mean?

14) Are there other - less painful - ways that I can install CEGMA?

Korf Lab in the news

Quick Links

Unix & Perl for Biologists

Unix & Perl to the RESCUE!

Bits & Bites

UC Davis Genome Center

UC Davis Bioinformatics Core

Contact Us

Search Korf Lab Website

CEGMA FAQ

1. How can I test whether CEGMA installed correctly?

2. CEGMA fails when it tries to run the first step (TBLASTN). What should I do?

3. I see an error that geneid-train did not work properly. What happened?

4. CEGMA said that my assembly contained 235 CEGs (out of 248). Is this good?

5. What do the numerical suffix on KOG IDs represent?

6. What’s the difference between the 458 and 248 sets of core eukaryotic genes (CEGs)?

7. What is the difference between a KOG and a CEG?

8. How can I test my genewise installation?

9) Why do I get a FATAL error when running genome_map?

10) Why can't CEGMA locate FAlite.pm?

11) Why can't gene.stat file be opened?

12) What is the NCBI C++ Exception?

13) What does the GeneLoop out of bounds error mean?

14) Are there other - less painful - ways that I can install CEGMA?

Korf Lab in the news

Quick Links

Unix & Perl for Biologists

Unix & Perl to the RESCUE!

Bits & Bites

UC Davis Genome Center

UC Davis Bioinformatics Core

Contact Us

Search Korf Lab Website

3. I see an error that `geneid-train did not work properly`. What happened?