******************** CEGMA v.2.3. README File ********************

	$Id: README,v 1.10 2011/07/19 18:20:47 keith Exp $	


Summary:
A. What's CEGMA ?
B. Installing CEGMA
C. File Listing
D. Compiling CEGMA
E. To run CEGMA
F. Authors and help

***************************************

A. What's CEGMA ?
------------------

CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for
building a highly reliable set of gene annotations in virtually
any eukaryotic genome. The strategy relies on a simple fact: some
highly conserved proteins are encoded in essentially all eukaryotic
genomes. We use the KOGs database to build a set of these highly
conserved ubiquitous proteins. We define a set of 458 core proteins,
and the protocol, CEGMA, to find orthologs of the core proteins in new
genomes and to determine their exon-intron structures.

A local version of CEGMA can be installed on UNIX platforms and it
requires pre-installation of PERL, NCBI BLAST+, HMMER, GeneWise and
geneid. The procedure uses information from the core genes of six
model organisms by first using TBLASTN to identify candidate regions
in a new genome. It then proposes and refines gene structures using
a combination of GeneWise, HMMER and geneid. The system includes the
use of a profile for each core protein to ensure the reliability of
the gene structure.

Installation, setup and usage of CEGMA are straightforward, and there
is a range of options to configure output predictions and program
behavior.

CEGMA source code, compiled binaries and documentation are available 
under the GNU GENERAL PUBLIC LICENSE.

Comments and questions are welcome.      

***************************************

B. Installing CEGMA
--------------------

The CEGMA distribution contains several directories and files. Source 
code and documentation files are included in the distribution.

The distribution is archived and compressed into a single file with
the command tar -zcvf. The compressed file is named CEGMA.tar.gz (or
something similar, depending on which compiled binaries are included).
The CEGMA files can be extracted as follows:

Type: 
tar -zxvf CEGMA.tar.gz

After executing this command, the directory cegma will be created
in your working directory.

CEGMA requires the following software to be installed first:

- geneid (geneid v 1.4)
http://genome.imim.es/software/geneid/

- genewise (wise2.2.3-rc7)
http://www.ebi.ac.uk/Wise2/ or
http://korflab.ucdavis.edu/Datasets/cegma/wise2.2.3-rc7.tar.gz

Note that genewise requires glib to be present for correct installation.


- hmmer (HMMER 3.0) 
http://hmmer.janelia.org/

- NCBI BLAST+ (2.2.25)
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Check that you have the correct versions of the software listed above.
If you are using a different version and experience any problems,
please let us know.
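
Before going further, it can help to confirm that the external programs
are on your PATH. The following is a minimal sketch, not part of the
CEGMA distribution: check_tools is a hypothetical helper, and the list
of program names you pass to it is an assumption that may need
adjusting for your installation.

```shell
# Hypothetical helper (not shipped with CEGMA): report which of the
# given programs can be found on PATH. Returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, to look for the programs mentioned in this README:

check_tools perl tblastn blastdbcmd hmmsearch genewise geneid

This only tests that each program can be found. It does not check
version numbers, which you should still verify by hand.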


***************************************

C. File Listing
---------------

The CEGMA distribution contains the following files and directories:

** bin/
The executable scripts

** data/
Core proteins, core profiles, profile cutoffs and a generic parameter
file for geneid

** sample/
A test sequence. 

** sample_output/
The results generated by CEGMA

** src/ 
Source code of CEGMA.

** GNULicense
This software is distributed under the GNU General Public License.

** Makefile
This file is required to build CEGMA executable files.

** README
This file.

The CEGMA distribution contains a set of independent programs that are
used by CEGMA.pl:

* parsewise - a parser for the genewise outputs.  
* geneid-train and make_paramfile - build a parameter file for geneid

***************************************

D. Compiling CEGMA
-------------------

Move into the CEGMA directory.

Type:
make 
to compile CEGMA.

This will generate the CEGMA executable files within the bin/ subdirectory. 

Type:
cegma -h 

to test that the executable file has been created correctly.


***************************************

E. To run CEGMA
-----------------

There are three environment variables that users may need to set:

   + You must specify the path where CEGMA can find the default files
   with the shell variable "CEGMA". 

   + CEGMA needs to write a few temporary files to a directory that
   the current user can read and write. The default temporary
   directory is "/tmp/", but you can assign a different one using
   the variable "CEGMATMP".

   + CEGMA uses some homemade PERL modules. You must set the PERL5LIB
   environment variable to the distribution /lib path or copy the
   modules to your PERL module directory.

   
Setting these variables in a Bourne shell or a C shell:

     o Using a Bourne-Shell (e.g. bash):
           export CEGMA="path"
           export CEGMATMP="path"
           export PERL5LIB="$PERL5LIB:$CEGMA/lib"

     o Using a C-Shell:
           setenv CEGMA "path"
           setenv CEGMATMP "path"
           setenv PERL5LIB "$PERL5LIB:$CEGMA/lib"

Genewise also requires that you set the $WISECONFIGDIR environment variable.
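
To confirm that these variables are in place before a run, a small
check such as the one below may be useful. This is a sketch under
stated assumptions: check_cegma_env is a hypothetical helper that is
not part of CEGMA, and it only tests that each variable is non-empty,
not that the paths are valid.

```shell
# Hypothetical helper (not shipped with CEGMA): report any of the
# variables described above that are unset or empty.
check_cegma_env() {
    rc=0
    for var in CEGMA PERL5LIB WISECONFIGDIR; do
        # Indirect lookup of the variable named in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "not set: $var"
            rc=1
        fi
    done
    # CEGMATMP is optional; /tmp/ is the default stated in this README.
    echo "temporary directory: ${CEGMATMP:-/tmp/}"
    return "$rc"
}
```

Running check_cegma_env prints one "not set" line per missing variable
and returns a non-zero status if any are missing.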


To run CEGMA using the 458 default proteins type:

cegma --genome <genomic_fasta_sequence>

If you have multiple cores on your computer, you can speed things up by using the -threads <n> option 
which passes the number of specified threads to the TBLASTN and hmmsearch programs.

TESTING CEGMA:

cegma --genome sample.dna --protein sample.prot -o sample

-- The reference output to compare your results against is in:
  sample_output/

-- CEGMA generates some intermediate files in the process. The files
that contain the final predictions in GFF format, together with the
FASTA files of the corresponding genomic and protein sequences, are:

  output.cegma.fa               - predicted CEG proteins
  output.cegma.gff              - coordinates in the genomic sequences
  output.cegma.id               - KOG ids for the selected proteins
  output.cegma.local.dna        - local fragments of DNA containing the genes
  output.cegma.local.gff        - coordinates in the local fragments
  output.completeness_report    - statistics of the percentages of 248 highly conserved CEGs
  output.cegma.errors           - may contain error messages produced by some programs

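To compare the files from a test run against the reference files, a
loop like the following can be used. It is a rough sketch:
compare_outputs is a hypothetical helper, and it assumes that the
files in sample_output/ have the same names as the files your run
produced. The comparison is byte-for-byte, so it may also flag
differences that are not meaningful.

```shell
# Hypothetical helper (not shipped with CEGMA): for every file in the
# reference directory, report whether a file of the same name in the
# results directory is identical, different or missing.
compare_outputs() {
    expected_dir=$1
    actual_dir=$2
    rc=0
    for expected in "$expected_dir"/*; do
        [ -f "$expected" ] || continue
        name=$(basename "$expected")
        if [ ! -f "$actual_dir/$name" ]; then
            echo "missing: $name"
            rc=1
        elif cmp -s "$expected" "$actual_dir/$name"; then
            echo "same:    $name"
        else
            echo "differs: $name"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, if the test command was run in the current directory:

compare_outputs sample_output .
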

TROUBLESHOOTING CEGMA:

First try inspecting the output.cegma.errors file to see if there are any obvious problems. 
If you run cegma with the -v option you will see 'verbose' output which will include progress
information (i.e. which KOG is currently being processed). This can help you see how far
CEGMA has progressed.

Running CEGMA with the -ext option will preserve all intermediate output files. This can speed
up troubleshooting as subsequent runs will first check to see which output files already exist
and CEGMA will skip any step for which a required output is already present. 

Please note that NCBI BLAST+ requires FASTA headers to adhere to certain requirements. FASTA
headers which consist only of digits or which consist of digits followed by whitespace followed
by any other text will cause problems. Specifically, the blastdbcmd program will not be able
to extract sequences from a BLAST database if the sequence index only consists of numbers.
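
To find such headers before running CEGMA, a check along these lines
may help. It is a minimal sketch: both function names are hypothetical,
and the "seq_" prefix used for renaming is an arbitrary choice.

```shell
# Hypothetical helpers (not shipped with CEGMA).

# List FASTA headers (with line numbers) whose identifier is digits
# only, either alone or followed by whitespace and other text.
find_numeric_headers() {
    grep -n -E '^>[0-9]+([[:space:]].*)?$' "$1"
}

# Write a copy of the FASTA file to standard output in which those
# headers are given a "seq_" prefix. Other lines are left unchanged.
prefix_numeric_headers() {
    sed -E 's/^>([0-9]+([[:space:]].*)?)$/>seq_\1/' "$1"
}
```

For example:

find_numeric_headers genome.fa
prefix_numeric_headers genome.fa > genome.renamed.fa

find_numeric_headers returns a non-zero status when no problem headers
are found, because grep found no matches.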


RUNNING OTHER SETS OF PROTEINS WITH CEGMA

If you have a set of proteins that you want to use instead of the KOGs
provided by CEGMA, you can do that easily. You have to create an HMM
profile with HMMER, choose a cutoff for each profile and use the
following options when running CEGMA:

     -p, --protein     fasta file of the protein sequences.

     --prot_num        Number of proteins per family/profile. 
                       They have to be in consecutive order in the fasta file.
                          (default: 6)
     --cutoff_file     File with the cutoff for the HMMER alignments.
                          (default: $CEGMA/data/profiles_cutoff.tbl)
     --hmm_prefix      Each protein ID must have "___" followed by the hmm_prefix
                       and a number (ex. At3g02190___KOG1762).
                          (default: KOG)
     --hmm_directory   Directory that contains the hmm files. The files must be
                       named hmm_prefix(number).hmm  ex. KOG1762.hmm.
                          (default: $CEGMA/data/hmm_profiles)

Example:

   cegma  --genome sample.dna --prot_num 4 --protein ORTH.fa --hmm_prefix ORTH \
          --hmm_directory hmm_profiles/  --cutoff_file profiles_cutoff.tbl
 
For the previous command-line example, you must have 4 proteins per family and
the proteins must be named protid___ORTH followed by a number
(ex: At3g02190___ORTH0001).

You must also have a directory containing the HMM profile for each family,
with each file named after its family (ex. ORTH0001.hmm).
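
These naming rules can be checked before a run. The following is a
sketch only: check_protein_set is a hypothetical helper that is not
part of CEGMA, and it tests just the three conventions described
above (ID format, number of consecutive proteins per family, and the
presence of one .hmm file per family). It treats the prefix as a plain
string, so prefixes containing regular expression characters are not
supported.

```shell
# Hypothetical helper (not shipped with CEGMA).
# Usage: check_protein_set <fasta> <hmm_prefix> <prot_num> <hmm_directory>
check_protein_set() {
    awk -v prefix="$2" -v n="$3" -v dir="$4" '
        /^>/ {
            id = substr($1, 2)
            pos = index(id, "___")
            fam = (pos > 0) ? substr(id, pos + 3) : ""
            # The family name must be the prefix followed by a number.
            if (fam !~ ("^" prefix "[0-9]+$")) {
                print "bad ID: " id; bad = 1; next
            }
            if (fam != current) {
                # A new family starts: check the size of the previous one.
                if (current != "" && count != n) {
                    print current ": " count " proteins, expected " n; bad = 1
                }
                if (fam in seen) {
                    print fam ": proteins are not consecutive"; bad = 1
                }
                seen[fam] = 1; current = fam; count = 0
            }
            count++
        }
        END {
            if (current != "" && count != n) {
                print current ": " count " proteins, expected " n; bad = 1
            }
            # Each family needs a readable profile named <family>.hmm.
            for (fam in seen) {
                file = dir "/" fam ".hmm"
                if ((getline line < file) < 0) {
                    print "missing profile: " file; bad = 1
                }
                close(file)
            }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

For the example above:

check_protein_set ORTH.fa ORTH 4 hmm_profiles

The function prints one line per problem found and returns a non-zero
status if there were any.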


***************************************

F. Authors and help
-------------------

CEGMA was written by Genis Parra (formerly at UC Davis Genome Center) and subsequently updated
by Keith Bradnam (krbradnam@ucdavis.edu).

The CEGMA home page is at "http://korflab.ucdavis.edu/Datasets/cegma/"