******************** CEGMA v.2.3. README File ********************

$Id: README,v 1.10 2011/07/19 18:20:47 keith Exp $

Summary:

  A. What's CEGMA ?
  B. Installing CEGMA
  C. File Listing
  D. Compiling CEGMA
  E. To run CEGMA
  F. Authors and help

***************************************

A. What's CEGMA ?
------------------

CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for building
a highly reliable set of gene annotations in virtually any eukaryotic
genome. The strategy relies on a simple fact: some highly conserved
proteins are encoded in essentially all eukaryotic genomes. We use the
KOGs database to build a set of these highly conserved ubiquitous proteins.
We define a set of 458 core proteins, and the protocol, CEGMA, to find
orthologs of the core proteins in new genomes and to determine their
exon-intron structures.

A local version of CEGMA can be installed on UNIX platforms. It requires
pre-installation of PERL, NCBI BLAST+, HMMER, GeneWise and geneid.

The procedure uses information from the core genes of six model organisms.
It first uses TBLASTN to identify candidate regions in a new genome. It
then proposes and refines gene structures using a combination of GeneWise,
HMMER and geneid. The system uses a profile for each core protein to
ensure the reliability of the gene structure.

Installation, setup and usage of CEGMA are straightforward, and there is a
range of options to configure output predictions and program behavior.

CEGMA source code, compiled binaries and documentation are available under
the GNU GENERAL PUBLIC LICENSE. Comments and questions are welcome.

***************************************

B. Installing CEGMA
--------------------

The CEGMA distribution contains several directories and files. Source code
and documentation files are included in the distribution. The distribution
is archived and compressed in a single file using the command tar -zcvf.
The compressed file name is CEGMA.tar.gz (or something similar, depending
on the compiled binaries included). The CEGMA files can be extracted by
following these instructions:

Type:

     tar -zxvf CEGMA.tar.gz

After executing this command, the directory cegma will be created in
your working directory.

CEGMA needs the pre-installation of the following software:

- geneid (geneid v 1.4)
  http://genome.imim.es/software/geneid/

- genewise (wise2.2.3-rc7)
  http://www.ebi.ac.uk/Wise2/
  or
  http://korflab.ucdavis.edu/Datasets/cegma/wise2.2.3-rc7.tar.gz

  Note that genewise requires glib to be present for correct installation.

- hmmer (HMMER 3.0)
  http://hmmer.janelia.org/

- NCBI BLAST+ (2.2.25)
  ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Check that you have the versions listed above. If you are using a
different version and experience any problems, please let us know.

***************************************

C. File Listing
---------------

The CEGMA distribution contains the following files and directories:

** bin/            The executable scripts
** data/           Core proteins, core profiles, cutoffs and the generic
                   parameter file for geneid
** sample/         A test sequence.
** sample_output/  The results generated by CEGMA
** src/            Source code of CEGMA.
** GNULicense      This software is registered under GNU license.
** Makefile        This file is required to build CEGMA executable files.
** README          This file.

The CEGMA distribution contains a set of independent programs that are
used by CEGMA.pl:

* parsewise                       - a parser for the genewise outputs.
* geneid-train and make_paramfile - build a parameter file for geneid

***************************************

D. Compiling CEGMA
-------------------

Move into the CEGMA directory.

Type:

     make

to compile CEGMA. This will generate the CEGMA executable files within
the bin/ subdirectory.

Type:

     cegma -h

to test that the executable file has been correctly created.

***************************************

E. To run CEGMA
-----------------

There are three environment variables that users set according to their
installation:

+ You must specify the path where CEGMA can find the default files with
  the shell variable "CEGMA".

+ CEGMA needs to write a few temporary files in a directory that the
  current user can read and write. The default temporary directory is
  "/tmp/", but you can assign a different one using the variable
  "CEGMATMP".

+ CEGMA uses some homemade PERL modules. You must set the PERL5LIB
  environment variable to the distribution /lib path or copy the modules
  to your PERL module directory.

Setting these variables in Bourne-shell and C-shell:

 o Using a Bourne-Shell (e.g. bash):

     export CEGMA="path"
     export CEGMATMP="path"
     export PERL5LIB="$PERL5LIB:$CEGMA/lib"

 o Using a C-Shell:

     setenv CEGMA "path"
     setenv CEGMATMP "path"
     setenv PERL5LIB "$PERL5LIB:$CEGMA/lib"

Genewise will also require that you set the $WISECONFIGDIR environment
variable.

To run CEGMA using the 458 default proteins type:

     cegma --genome <genome_fasta_file>

If you have multiple cores on your computer, you can speed things up by
using the -threads option, which passes the specified number of threads
to the TBLASTN and hmmsearch programs.

TESTING CEGMA:

     cegma --genome sample.dna --protein sample.prot -o sample

-- The output to compare with the sample files is in:
   sample_output/

-- CEGMA generates some intermediate files in the process.
   The files that contain the final predictions, in GFF, and the FASTA
   files of the corresponding genome and protein sequences are:

     output.cegma.fa             - predicted CEGs proteins
     output.cegma.gff            - coordinates in the genomic sequences
     output.cegma.id             - KOG ids for the selected proteins
     output.cegma.local.dna      - local fragments of DNA containing the genes
     output.cegma.local.gff      - coordinates in the local fragments
     output.completeness_report  - statistics of the percentages of 248
                                   highly conserved CEGs
     output.cegma.errors         - may contain error messages produced by
                                   some programs

TROUBLESHOOTING CEGMA:

First try inspecting the output.cegma.errors file to see if there are any
obvious problems.

If you run cegma with the -v option you will see 'verbose' output, which
includes progress information (i.e. which KOG is currently being
processed). This can show you how far CEGMA has run.

Running CEGMA with the -ext option will preserve all intermediate output
files. This can speed up troubleshooting: subsequent runs first check
which output files already exist, and CEGMA skips any step for which the
required output is already present.

Please note that NCBI BLAST+ requires FASTA headers to adhere to certain
requirements. FASTA headers which consist only of digits, or which consist
of digits followed by whitespace followed by any other text, will cause
problems. Specifically, the blastdbcmd program will not be able to extract
sequences from a BLAST database if the sequence index consists only of
numbers.

RUNNING OTHER SETS OF PROTEINS WITH CEGMA

If you have a set of proteins that you want to use instead of the KOGs
provided by CEGMA, you can do that easily. You have to create an HMM
profile for each protein family with HMMER, choose a cutoff for each
profile and use the following options when running CEGMA:

     -p, --protein        FASTA file of the protein sequences.

     --prot_num           Number of proteins per family/profile. They have
                          to be in consecutive order in the FASTA file.
                          (default: 6)

     --cutoff_file        File with the cutoff for the HMMER alignments.
                          (default: $CEGMA/data/profiles_cutoff.tbl)

     --hmm_prefix         Each protein ID must have "___" followed by the
                          hmm_prefix and a number
                          (ex. At3g02190___KOG1762).
                          (default: KOG)

     --hmm_directory      Directory that contains the hmm files. The files
                          must be named hmm_prefix(number).hmm
                          (ex. KOG1762.hmm).
                          (default: $CEGMA/data/hmm_profiles)

Example:

     cegma --genome sample.dna --prot_num 4 --protein ORTH.fa --hmm_prefix ORTH \
           --hmm_directory hmm_profiles/ --cutoff_file profiles_cutoff.tbl

For the previous command-line example, you must have 4 proteins per family
and each protein must be named protid___ORTH followed by a number
(ex. At3g02190___ORTH0001). You must also have a directory containing the
hmm profile for each family, named after the family (ex. ORTH0001.hmm).

***************************************

F. Authors and help
-------------------

CEGMA was written by Genis Parra (formerly at UC Davis Genome Center) and
subsequently updated by Keith Bradnam (krbradnam@ucdavis.edu).

The CEGMA home page is at "http://korflab.ucdavis.edu/Datasets/cegma/"
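
CHECKING FASTA HEADERS (supplementary sketch):

The TROUBLESHOOTING notes in section E warn that FASTA headers made up only
of digits, or of digits followed by whitespace and other text, cause
problems with NCBI BLAST+. The following is a minimal Python sketch for
spotting such headers before running CEGMA. It is not part of the CEGMA
distribution; the function name find_numeric_headers and the script name
check_headers.py are hypothetical, and the rule it applies is only the one
described in this README.

```python
import re
import sys

# Assumption: a header is a problem when its identifier (the text between
# '>' and the first whitespace) consists only of digits, e.g. ">123" or
# ">123 scaffold". This mirrors the README's description, nothing more.
NUMERIC_ID = re.compile(r"^>(\d+)(\s|$)")


def find_numeric_headers(lines):
    """Return (line_number, header) pairs for digit-only sequence IDs."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        header = line.rstrip("\n")
        if NUMERIC_ID.match(header):
            flagged.append((number, header))
    return flagged


if __name__ == "__main__":
    # Hypothetical usage: python check_headers.py genome.fa
    if len(sys.argv) != 2:
        sys.exit("usage: python check_headers.py <fasta file>")
    with open(sys.argv[1]) as handle:
        for number, header in find_numeric_headers(handle):
            print(f"line {number}: {header}")
```

Any header the script reports would need to be renamed (for example by
adding a text prefix to the identifier) before the genome file is passed
to cegma.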