Sequence Data for Assemblathon - A Genome Assembly Challenge
------------------------------------------------------------

The Assemblathon is a collaborative project to improve the computational tools used in genome assembly, and to
produce new metrics by which to assess the quality of assembled genomes. For more information about the 
Assemblathon, visit http://assemblathon.org or email assemblathon-help@ucdavis.edu. The information below
relates to the 2010/2011 Assemblathon.


Species 'A'
-----------

Simulated Illumina data for species 'A' are provided as four libraries covering a range of insert sizes:
two are based on the paired-end assay and two on the mate-pair assay:

Paired-end:
100 bp (x 2) reads from 200 +/- 20 bp insert library at 40x coverage
100 bp (x 2) reads from 300 +/- 30 bp insert library at 40x coverage 

Mate Pair:
100 bp (x 2) reads from 3,000 +/- 300 bp insert library at 20x coverage
100 bp (x 2) reads from 10,000 +/- 1,000 bp insert library at 20x coverage

Mate Pair loop fragmentation size: insert = 500 +/- 50 bp
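
As a rough guide to library size, the coverage figures above can be turned into an approximate number of
read pairs. The sketch below assumes that "coverage" means total sequenced bases divided by genome length.
The genome size is not stated in this document, so the 1 Mbp value is purely hypothetical, as is the
function name.

```python
def expected_read_pairs(genome_size_bp, coverage, read_length_bp=100):
    """Approximate number of read pairs needed to reach a given read coverage."""
    bases_needed = coverage * genome_size_bp   # total bases to sequence
    bases_per_pair = 2 * read_length_bp        # each pair contributes two reads
    return round(bases_needed / bases_per_pair)

# Hypothetical 1 Mbp genome, for illustration only.
print(expected_read_pairs(1_000_000, 40))  # paired-end libraries (40x)
print(expected_read_pairs(1_000_000, 20))  # mate-pair libraries (20x)
```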

Each library is represented by two files (one for each read of the pair). Filenames include the insert size,
coverage and read pair member (1 or 2). The files are:

speciesA_200i_40x.1.fastq.bz2
speciesA_200i_40x.2.fastq.bz2
speciesA_300i_40x.1.fastq.bz2
speciesA_300i_40x.2.fastq.bz2
speciesA_3000i_20x.1.fastq.bz2
speciesA_3000i_20x.2.fastq.bz2
speciesA_10000i_20x.1.fastq.bz2
speciesA_10000i_20x.2.fastq.bz2
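
The naming scheme above can be unpacked programmatically. The following is a minimal sketch, assuming all
files follow exactly the pattern shown in the list. The regular expression, function name and dictionary
keys are illustrative and not part of the data release.

```python
import re

# Pattern inferred from the filenames listed above, e.g. speciesA_200i_40x.1.fastq.bz2
FILENAME_RE = re.compile(
    r"^species(?P<species>[A-Za-z])_(?P<insert>\d+)i_(?P<coverage>\d+)x"
    r"\.(?P<member>[12])\.fastq\.bz2$"
)

def parse_library_filename(name):
    """Split a library filename into species, insert size, coverage and pair member."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised filename: {name!r}")
    return {
        "species": match.group("species"),
        "insert_size_bp": int(match.group("insert")),
        "coverage": int(match.group("coverage")),
        "pair_member": int(match.group("member")),
    }

print(parse_library_filename("speciesA_10000i_20x.1.fastq.bz2"))
```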

The sequence identifier for each read includes details of the library used and a unique read identifier
which also denotes which member of the pair the read is. E.g.

@assemblathon_20x10000i_16/1

This indicates the first member (/1) of read pair '16', derived from the 20x coverage library that used
a 10,000 bp insert size.
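
A read identifier of this form can be split into its parts as sketched below. Note that the identifier
lists coverage before insert size, the reverse of the filename order. The pattern is inferred from the
single example above, and the function name and dictionary keys are hypothetical.

```python
import re

# Pattern inferred from the example identifier @assemblathon_20x10000i_16/1
READ_ID_RE = re.compile(
    r"^@?assemblathon_(?P<coverage>\d+)x(?P<insert>\d+)i_(?P<read>\d+)/(?P<member>[12])$"
)

def parse_read_id(header):
    """Extract library details and pair membership from a FASTQ header line."""
    match = READ_ID_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised read identifier: {header!r}")
    return {
        "coverage": int(match.group("coverage")),
        "insert_size_bp": int(match.group("insert")),
        "read_number": int(match.group("read")),
        "pair_member": int(match.group("member")),
    }

print(parse_read_id("@assemblathon_20x10000i_16/1"))
```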


Species 'B'
-----------

The file 'speciesB.diploid.fa.bz2' contains the diploid reference sequence for an ancestor (~100 million
years diverged) of the sequenced genome from species A. Additionally, 'speciesB.reference.fa.bz2' contains
what can be thought of as the reference version of the B genome.

Associated with the reference sequence is a set of annotations in GFF format ('speciesB.annotations.gff.bz2').
The GFF file provides coordinates of CDSs, UTRs, non-gene constrained elements (NGEs), and non-exonic
constrained elements (NXEs).
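
One quick way to inspect the annotations is to tally the feature types. The sketch below relies only on the
general GFF layout (nine tab-separated columns, with the feature type in the third column and comment lines
starting with '#'). The exact type labels used in the file are not given here, so none are assumed, and the
function names are illustrative.

```python
import bz2
from collections import Counter

def count_feature_types(lines):
    """Count GFF records by the feature type in column 3."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        counts[fields[2]] += 1
    return counts

def count_feature_types_in_bz2(path):
    """Convenience wrapper that reads a bzip2-compressed GFF file as text."""
    with bz2.open(path, "rt") as handle:
        return count_feature_types(handle)
```

For example, count_feature_types_in_bz2("speciesB.annotations.gff.bz2") would return a Counter keyed by
whatever feature type labels the file actually uses.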


Additional Notes
----------------

The genome of species 'A' is diploid, just like species 'B'. Synthetic Illumina reads are generated at
random from the two haplotypes of species 'A'.

The mate-pair libraries contain a fraction of reads that result from the wrong loop fragment
being pulled down. Additionally, all libraries contain some duplicate reads (each with unique errors)
and some bacterial contamination. Finally, the mate-pair libraries sometimes contain chimeric reads;
their frequency depends on the mate-pair loop fragmentation size (500 +/- 50 bp) and
the read length. It is assumed that the joining of the ends of the mate-pair loop may occur uniformly
throughout the fragment, provided the fragment containing biotin is pulled down at all.
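
To see how loop fragment size and read length interact, here is a back-of-the-envelope sketch. It is a
simplified reading of the uniform-junction assumption above, not the simulator's actual procedure: the
junction is placed uniformly along the loop fragment, one read is taken from each end, and a pair is
counted as chimeric if either read overlaps the junction. The function name is hypothetical.

```python
def chimeric_pair_fraction(loop_fragment_bp, read_length_bp):
    """Fraction of pairs in which a read overlaps the junction (simplified model)."""
    if loop_fragment_bp <= 0 or read_length_bp <= 0:
        raise ValueError("lengths must be positive")
    # The two reads together cover 2 * read_length_bp of the fragment; a uniformly
    # placed junction falls inside that covered region with this probability.
    return min(1.0, 2 * read_length_bp / loop_fragment_bp)

# Loop fragments one standard deviation either side of the 500 bp mean, 100 bp reads.
for fragment_bp in (450, 500, 550):
    print(fragment_bp, round(chimeric_pair_fraction(fragment_bp, 100), 3))
```

Under this simplification, shorter loop fragments or longer reads both raise the chimeric fraction.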

The quality scores on all reads are in Phred+64 ASCII format following the Illumina 1.5+ specification,
where the 'B' character has special meaning and Phred scores of 0 and 1 are not used.
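
Many tools expect Phred+33 qualities, so a conversion step may be needed. The sketch below decodes a
Phred+64 quality string and re-encodes it as Phred+33. Returning None for 'B' is an illustrative choice to
mark it as a flag rather than a calibrated score; the function names are hypothetical.

```python
PHRED64_OFFSET = 64
PHRED33_OFFSET = 33

def decode_phred64(quality_string):
    """Return one score per base; 'B' is reported as None because it acts as a flag."""
    scores = []
    for char in quality_string:
        if ord(char) < PHRED64_OFFSET:
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        scores.append(None if char == "B" else ord(char) - PHRED64_OFFSET)
    return scores

def to_phred33(quality_string):
    """Re-encode a Phred+64 quality string as Phred+33 (shift each character by 31)."""
    shift = PHRED64_OFFSET - PHRED33_OFFSET
    for char in quality_string:
        if ord(char) < PHRED64_OFFSET:
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
    return "".join(chr(ord(char) - shift) for char in quality_string)

print(decode_phred64("hgfB"))
print(to_phred33("hgfB"))
```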

In the simulation we conditioned our error rate on the reference base, position in the read, and the
assigned quality score. We do not assume that the assigned quality score is a direct representation
of the underlying quality, and instead chose to model error empirically.

The programs used to model this error and simulate reads may be found here: https://github.com/jstjohn/SimSeq

The error in these reads may not be as easy to trim as error in a real Illumina sequencing run. In 
observing a typical real dataset you will probably notice that error tends to cluster towards the ends 
of reads. In our dataset this clustering of error can also be seen, because we model error based on the 
position of the base within the read. However, since we do not condition our error on the error of the 
previous base, each position's error is completely independent of the error of the previous position in 
any given read.
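
The independence property described above can be illustrated with a toy example. This is not the SimSeq
model: it conditions on read position only (the real simulation also uses the reference base and assigned
quality score), and the error profile is invented for illustration. Each position is drawn independently,
so an error at one base does not change the chance of an error at the next.

```python
import random

def simulate_error_mask(position_error_rates, rng):
    """Draw an independent error/no-error outcome for every position in a read."""
    return [rng.random() < rate for rate in position_error_rates]

# Hypothetical profile: error rate rises linearly from 0.1% to 5% along a 100 bp read.
READ_LENGTH = 100
rates = [0.001 + (0.05 - 0.001) * i / (READ_LENGTH - 1) for i in range(READ_LENGTH)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
mask = simulate_error_mask(rates, rng)
print(sum(mask), "simulated errors in one read")
```

Errors still concentrate towards the end of the read on average, but they do not arrive in correlated runs,
which is why quality trimming may behave differently here than on real data.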