******************** CEGMA v.2.0. README File ********************

$Id: README,v 1.6 2007/09/28 03:18:52 genis Exp $

Summary:

     A. What's CEGMA ?
     B. Installing CEGMA
     C. File Listing
     D. Compiling CEGMA
     E. To run CEGMA
     F. Authors and help

***************************************

A. What's CEGMA ?
------------------

CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for
building a highly reliable set of gene annotations in virtually any
eukaryotic genome. The strategy relies on a simple fact: some highly
conserved proteins are encoded in essentially all eukaryotic genomes.
We use the KOGs database to build a set of these highly conserved
ubiquitous proteins. We define a set of 458 core proteins, and a
protocol, CEGMA, to find orthologs of the core proteins in new genomes
and to determine their exon-intron structures.

A local version of CEGMA can be installed on UNIX platforms. It
requires Perl, WU-BLAST, HMMER, GeneWise and geneid to be installed
beforehand.

The procedure uses information from the core genes of six model
organisms by first using TBLASTN to identify candidate regions in a
new genome. It then proposes and redefines gene structures using a
combination of GeneWise, HMMER and geneid. The system uses a profile
for each core protein to ensure the reliability of the gene structure.

Installation, setup and usage of CEGMA are simple, and a range of
options is available to configure output predictions and program
behaviour.

CEGMA source code, compiled binaries and documentation are available
under the GNU GENERAL PUBLIC LICENSE. Comments and questions are
welcome.

***************************************

B. Installing CEGMA
--------------------

The CEGMA distribution contains several directories and files,
including the source code and documentation. The distribution is
archived and compressed in a single file using the command tar -zcvf.
The compressed file name is CEGMA.tar.gz (or something similar,
depending on the compiled binaries included). To extract the CEGMA
files:

Type:

     tar -zxvf CEGMA.tar.gz

After executing this command, the directory cegma will be created in
your working directory.

CEGMA requires the following software to be installed beforehand:

     - geneid (geneid v 1.3.7)
       http://genome.imim.es/software/geneid/

     - genewise (wise2.2.3-rc7)
       http://www.ebi.ac.uk/Wise2/

     - hmmer (HMMER 2.3.2 [Oct 2003])
       http://hmmer.janelia.org/

     - wu-blast (TBLASTN 2.0MP-WashU [10-May-2005])
       http://blast.wustl.edu/

Check that you have the versions listed above. If you are using a
different version and experience any problem, please let us know.

***************************************

C. File Listing
---------------

The CEGMA distribution contains the following files and directories:

     ** bin/            The executable scripts.
     ** data/           Core proteins, core profiles and cutoffs, and
                        a generic parameter file for geneid.
     ** sample/         A test sequence.
     ** sample_output/  The results generated by CEGMA.
     ** src/            Source code of CEGMA.
     ** GNULicense      This software is registered under GNU license.
     ** Makefile        This file is required to build the CEGMA
                        executable files.
     ** README          This file.

The CEGMA distribution contains a set of independent programs that are
used by CEGMA.pl:

     * parseblast   - a complete parser for any of the blast versions
                      and outputs.
     * parsewise    - a parser for the genewise outputs.
     * get_chunks   - a program to build the candidate regions from
                      the blast results.
     * geneid-train and make_paramfile
                    - build a parameter file for geneid.

***************************************

D. Compiling CEGMA
-------------------

Move into the CEGMA directory.

Type:

     make

to compile CEGMA. This will generate the CEGMA executable files
within the bin/ subdirectory.

Type:

     cegma -h

to test that the executable file has been created correctly.

***************************************

E. To run CEGMA
-----------------

There are three environment variables that users can set:

     + You must specify the path where CEGMA can find the default
       files with the shell variable "CEGMA".

     + CEGMA needs to write a few temporary files in a directory that
       the current user can read and write. The default temporary
       directory is "/tmp/", but you can assign a different one using
       the variable "CEGMATMP".

     + CEGMA uses some homemade Perl modules. You must set the
       PERL5LIB environment variable to the distribution /lib path,
       or copy the modules to your Perl module directory.

Setting these variables in Bourne shell and C shell:

     o Using a Bourne shell (e.g. bash):

          export CEGMA="path"
          export CEGMATMP="path"
          export PERL5LIB="$PERL5LIB:$CEGMA/lib"

     o Using a C shell:

          setenv CEGMA "path"
          setenv CEGMATMP "path"
          setenv PERL5LIB "${PERL5LIB}:${CEGMA}/lib"

To run CEGMA using the 458 default proteins, type:

     cegma --genome

TESTING CEGMA:

     cegma --genome sample.dna --protein sample.prot -o sample

     -- The output to compare with the sample files is in:
        sample_output/ --

CEGMA generates some intermediate files in the process. The files
that contain the final predictions in GFF, and the fasta files of the
corresponding genome and protein sequences, are:

     sample.cegma.fa             - predicted CEG proteins
     sample.cegma.gff            - coordinates in the genomic sequences
     sample.cegma.id             - KOG ids for the selected proteins
     sample.cegma.local.dna      - local fragments of DNA containing
                                   the genes
     sample.cegma.local.gff      - coordinates in the local fragments
     sample.completeness_report  - statistics of the percentages of
                                   CEGs

***************************************

F. Authors and help
-------------------

CEGMA was written by Genis Parra (Genome Center, UC Davis).

CEGMA home page is at "http://korflab.ucdavis.edu/Datasets/cegma/"
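
The prerequisites from section B and the environment setup from
section E can be checked before a first run. The following is a
minimal sketch, not part of the CEGMA distribution: the function names
missing_tools and dir_is_usable are hypothetical, and the executable
names in the usage comment (geneid, genewise, hmmsearch, tblastn) are
assumptions about how the required packages install their programs.

```shell
#!/bin/sh
# Preflight sketch for a CEGMA setup. Not part of the CEGMA distribution.

# Print every name from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only if the given directory exists and is readable and writable
# by the current user (the requirement stated for the temporary directory).
dir_is_usable() {
    [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ]
}

# Example use (executable names are assumptions, adjust to your install):
#   missing_tools perl geneid genewise hmmsearch tblastn
#   dir_is_usable "${CEGMATMP:-/tmp}" || echo "temporary directory not usable"
```

If missing_tools prints nothing and dir_is_usable succeeds, the basic
requirements described above are in place. This does not check program
versions.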