Index of /Datasets/Assemblathon/Assemblathon1

 Name                    Last modified      Size  Description
 Entries/                17-Mar-2011 18:22    -   
 checksums.md5           05-Jan-2011 18:22  1.5K  
 assemblathon_stats.pl   18-Apr-2011 09:29   14K  
 basic_assembly_metri..> 24-Mar-2011 14:22  241K  
 assemblathon_ucdavis..> 28-Apr-2011 16:41  1.4M  
 assemblathon_talk.pdf   15-Mar-2011 08:28  1.8M  
 biology_of_genomes_a..> 12-May-2011 05:42  2.5M  
 speciesB.annotations..> 07-Dec-2010 13:39   11M  
 genome_informatics_a..> 09-Nov-2011 21:30   16M  
 ucsc_assemblathon_an..> 15-Mar-2011 10:28   16M  
 speciesB.reference.f..> 07-Dec-2010 13:39   31M  
 ucsc_assemblathon_ta..> 15-Mar-2011 10:26   33M  
 speciesB.diploid.fa.bz2 03-Dec-2010 16:53   62M  
 speciesA.diploid.fa.bz2 25-Feb-2011 10:49   62M  
 speciesA_10000i_20x...> 03-Dec-2010 13:23  930M  
 speciesA_10000i_20x_..> 05-Jan-2011 17:58  930M  
 speciesA_10000i_20x_..> 05-Jan-2011 17:22  930M  
 speciesA_3000i_20x_r..> 05-Jan-2011 17:52  931M  
 speciesA_3000i_20x_r..> 05-Jan-2011 17:24  931M  
 speciesA_3000i_20x.1..> 03-Dec-2010 13:21  931M  
 speciesA_10000i_20x...> 03-Dec-2010 13:23  945M  
 speciesA_10000i_20x_..> 05-Jan-2011 17:23  945M  
 speciesA_10000i_20x_..> 05-Jan-2011 17:59  945M  
 speciesA_3000i_20x_r..> 05-Jan-2011 17:26  946M  
 speciesA_3000i_20x_r..> 05-Jan-2011 17:54  946M  
 speciesA_3000i_20x.2..> 03-Dec-2010 13:21  946M  
 speciesA_300i_40x.1...> 03-Dec-2010 13:54  1.8G  
 speciesA_200i_40x.1...> 03-Dec-2010 13:51  1.8G  
 speciesA_300i_40x.2...> 03-Dec-2010 13:54  1.8G  
 speciesA_200i_40x.2...> 03-Dec-2010 13:51  1.8G

Sequence Data for Assemblathon - A Genome Assembly Challenge
------------------------------------------------------------

The Assemblathon is a collaborative project to improve the computational tools used in genome assembly, and to
produce new metrics by which to assess the quality of assembled genomes. For more information about the
Assemblathon, visit http://assemblathon.org or email assemblathon-help@ucdavis.edu. The information below
relates to the 2010/2011 Assemblathon.

Species 'A'
-----------

Simulated Illumina data for species 'A' consists of four libraries covering a range of insert sizes.
Two are based on the Paired-end assay and two are based on the Mate Pair assay:

Paired-end:
100 bp (x 2) reads from 200 +/- 20 bp insert library at 40x coverage
100 bp (x 2) reads from 300 +/- 30 bp insert library at 40x coverage

Mate Pair:
100 bp (x 2) reads from 3,000 +/- 300 bp insert library at 20x coverage
100 bp (x 2) reads from 10,000 +/- 1,000 bp insert library at 20x coverage

Mate Pair loop fragmentation size: insert = 500 +/- 50 bp
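
As a rough guide to how many read pairs each library holds, the coverage figures above can be turned into
a pair count. This is only a sketch: it assumes coverage means total sequenced bases divided by genome
length, and the genome length is not stated in this document, so 'genome_length' is a hypothetical
parameter that you must supply.

```python
def expected_read_pairs(coverage, genome_length, read_length=100):
    """Estimate the number of read pairs needed to reach a given coverage.

    Assumes coverage = (pairs * 2 * read_length) / genome_length.
    genome_length is a hypothetical input; this document does not state it.
    """
    if coverage <= 0 or genome_length <= 0 or read_length <= 0:
        raise ValueError("coverage, genome_length and read_length must be positive")
    return round(coverage * genome_length / (2 * read_length))
```

For an illustrative (made-up) genome of 100,000,000 bp, a 40x library of 100 bp (x 2) reads would need
about 20,000,000 pairs under this definition.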

Each library is represented by two files (one for each read of the paired reads). Filenames include the insert size,
the coverage and the read-pair member (1 or 2):

speciesA_200i_40x.1.fastq.bz2
speciesA_200i_40x.2.fastq.bz2
speciesA_300i_40x.1.fastq.bz2
speciesA_300i_40x.2.fastq.bz2
speciesA_3000i_20x.1.fastq.bz2
speciesA_3000i_20x.2.fastq.bz2
speciesA_10000i_20x.1.fastq.bz2
speciesA_10000i_20x.2.fastq.bz2
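
The naming scheme above can be parsed mechanically. The following is a minimal sketch; the regular
expression is inferred from the eight filenames listed here and is not an official specification, and
the function name is hypothetical.

```python
import re

# Pattern inferred from the filenames listed above (an assumption, not a spec).
FILENAME_RE = re.compile(
    r"^species(?P<species>[A-Z])_(?P<insert>\d+)i_(?P<coverage>\d+)x"
    r"\.(?P<read>[12])\.fastq\.bz2$"
)


def parse_library_filename(name):
    """Return species, insert size (bp), coverage and pair member for a filename."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised library filename: {name!r}")
    return {
        "species": match.group("species"),
        "insert_size": int(match.group("insert")),
        "coverage": int(match.group("coverage")),
        "read": int(match.group("read")),
    }
```

Filenames that do not follow this exact pattern are rejected with a ValueError rather than guessed at.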

The sequence identifier for each read includes details of the library used and a unique read identifier,
followed by a suffix denoting which member of the pair the read is. For example:

@assemblathon_20x10000i_16/1

This identifies read '16', the first (/1) member of its read pair, from the 20x coverage library with a
10,000 bp insert size.
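
A read identifier of this form can be split into its parts as sketched below. The pattern is inferred
from the single example above, so treat it as an assumption; the function name is hypothetical.

```python
import re

# Pattern inferred from the example identifier above (an assumption).
READ_ID_RE = re.compile(
    r"^@?assemblathon_(?P<coverage>\d+)x(?P<insert>\d+)i_(?P<number>\d+)/(?P<mate>[12])$"
)


def parse_read_id(header):
    """Split a FASTQ header line into coverage, insert size, read number and mate."""
    match = READ_ID_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised read identifier: {header!r}")
    return {
        "coverage": int(match.group("coverage")),
        "insert_size": int(match.group("insert")),
        "read_number": int(match.group("number")),
        "mate": int(match.group("mate")),
    }
```

The leading '@' is optional in the pattern so that the function accepts both raw FASTQ header lines and
identifiers that have already had it stripped.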

Species 'B'
-----------

The file 'speciesB.diploid.fa.bz2' contains the diploid reference sequence for an ancestor (~100 million years
diverged) of the sequenced genome from species A. Additionally, 'speciesB.reference.fa.bz2' contains what can be
thought of as the reference version of the B genome.

Associated with the reference sequence is a set of annotations in GFF format ('speciesB.annotations.gff.bz2').
The GFF file provides coordinates of CDSs, UTRs, non-gene constrained elements (NGEs), and non-exonic constrained
elements (NXEs).
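
To see which feature types the annotation file contains, the third column of each record can be tallied.
This sketch assumes the conventional GFF layout of nine tab-separated columns with the feature type in
column 3; the exact labels used for each feature class in this file are not stated here, so none are
assumed. The function names are hypothetical.

```python
import bz2
from collections import Counter


def count_feature_types(lines):
    """Count GFF feature types (column 3) from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a standard nine-column GFF record
        counts[fields[2]] += 1
    return counts


def count_feature_types_in_file(path):
    """Same as count_feature_types, reading directly from a .bz2 file."""
    with bz2.open(path, "rt") as handle:
        return count_feature_types(handle)
```

Calling count_feature_types_in_file('speciesB.annotations.gff.bz2') would stream the compressed file
without decompressing it to disk first.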

Additional Notes
----------------

The genome of species 'A' is diploid, just like that of species 'B'. Synthetic Illumina reads were generated at
random from the two haplotypes of species 'A'.

The mate-pair libraries contain a fraction of reads that are the result of the wrong loop fragment
being pulled down. Additionally, all libraries contain some duplicate reads (each with unique error)
and some bacterial contamination. Finally, the mate-pair libraries sometimes contain chimeric reads.
Their frequency depends on the mate-pair loop fragmentation size (500 +/- 50 bp) and
the read length. It is assumed that the joining of the ends of the mate-pair loop may occur uniformly
throughout the fragment if the fragment containing biotin is pulled down at all.

The quality scores on all reads are in Phred+64 ASCII format, following the Illumina 1.5+ specification,
in which the 'B' character has special meaning and Phred scores of 0 and 1 are not used.
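
Decoding these quality strings amounts to subtracting 64 from each character's ASCII code. The sketch
below does that and rejects characters that would decode to a score below 2, since scores of 0 and 1 are
not used. Raising an error in that case is a choice made for this example, and the function names are
hypothetical. Note that 'B' decodes to 2, which under this specification is a flag rather than an
ordinary calibrated score.

```python
PHRED_OFFSET = 64  # Phred+64 encoding: ASCII code = score + 64


def decode_phred64(quality_string):
    """Convert a Phred+64 quality string into a list of integer scores."""
    scores = []
    for ch in quality_string:
        score = ord(ch) - PHRED_OFFSET
        if score < 2:
            # Scores 0 and 1 are left out of this encoding, so anything
            # below 'B' suggests the wrong offset or a corrupt record.
            raise ValueError(f"unexpected quality character: {ch!r}")
        scores.append(score)
    return scores


def error_probability(score):
    """Nominal Phred error probability, 10^(-Q/10)."""
    return 10 ** (-score / 10)
```

Many tools assume the Phred+33 offset by default, so these files may need an explicit option or a
conversion step before use.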

In the simulation we conditioned our error rate on the reference base, the position in the read, and the
assigned quality score. We do not assume that the assigned quality score is a direct representation
of the underlying quality, and instead chose to model error empirically.

The programs used to model this error and simulate reads may be found here: https://github.com/jstjohn/SimSeq

The error in these reads may not be as easy to trim as error in a real Illumina sequencing run. In
a typical real dataset you will probably notice that error tends to cluster towards the ends
of reads. In our dataset this clustering of error can also be seen, because we model error based on the
position of the base within the read. However, since we do not condition our error on the error of the
previous base, each position's error is completely independent of the error of the previous position in
any given read.
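
The independence property described above can be illustrated with a toy model. This is not the SimSeq
implementation: it conditions only on position (ignoring the reference base and quality score), and the
error rates in the usage note are made up. Each position is drawn on its own, so an error at one
position has no influence on the next.

```python
import random


def simulate_error_mask(position_error_rates, rng):
    """Return one True/False flag per read position; True means an error.

    position_error_rates: per-position error probabilities (hypothetical values).
    rng: a random.Random instance, passed in so that runs are reproducible.
    Every position is sampled independently of its neighbours.
    """
    return [rng.random() < rate for rate in position_error_rates]
```

For example, simulate_error_mask([0.001] * 80 + [0.05] * 20, random.Random(1)) produces errors that are
more frequent near the end of a 100 bp read, yet are not correlated from one base to the next.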