Sequence Data for Assemblathon - A Genome Assembly Challenge
------------------------------------------------------------

The Assemblathon is a collaborative project to improve the computational tools
used in genome assembly, and to produce new metrics by which to assess the
quality of assembled genomes.

For more information about the Assemblathon, visit http://assemblathon.org or
email assemblathon-help@ucdavis.edu.

The information below relates to the 2010/2011 Assemblathon.


Species 'A'
-----------

Simulated Illumina data for species 'A' is represented by four libraries using
a range of insert sizes: two are based on the Paired-end assay and two are
based on the Mate Pair assay. I.e.

Paired-end:
    100 bp (x 2) reads from 200 +/- 20 bp insert library at 40x coverage
    100 bp (x 2) reads from 300 +/- 30 bp insert library at 40x coverage

Mate Pair:
    100 bp (x 2) reads from 3,000 +/- 300 bp insert library at 20x coverage
    100 bp (x 2) reads from 10,000 +/- 1,000 bp insert library at 20x coverage

Mate Pair loop fragmentation size:
    insert = 500 +/- 50 bp

Each library is represented by two files (one for each read of the paired
reads). Filenames include the insert size, coverage and pair read identifier
(1 or 2). I.e.

    speciesA_200i_40x.1.fastq.bz2
    speciesA_200i_40x.2.fastq.bz2
    speciesA_300i_40x.1.fastq.bz2
    speciesA_300i_40x.2.fastq.bz2
    speciesA_3000i_20x.1.fastq.bz2
    speciesA_3000i_20x.2.fastq.bz2
    speciesA_10000i_20x.1.fastq.bz2
    speciesA_10000i_20x.2.fastq.bz2

The sequence identifier for each read includes details of the library used and
a unique read identifier which also denotes which member of the pair the read
is. E.g.

    @assemblathon_20x10000i_16/1

This would indicate that this is read '16' from the first (/1) member of a read
pair that was derived from the 20x coverage library that used a 10,000 bp
insert size.

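The identifier layout described above can be split into its fields with a short
script. This is a minimal sketch, not part of the Assemblathon tooling: the
regular expression is inferred from the single example identifier shown in this
README, and the names ReadId, parse_read_id and library_filename are
hypothetical. Note that the read identifier lists coverage before insert size,
while the filenames list insert size before coverage.

```python
import re
from typing import NamedTuple


class ReadId(NamedTuple):
    coverage: int      # library coverage, e.g. 20 for "20x"
    insert_size: int   # nominal insert size in bp, e.g. 10000 for "10000i"
    read_number: int   # read identifier within the library
    mate: int          # 1 or 2: which member of the pair this read is


# Assumption: every identifier follows the one example given in the README,
# i.e. "@assemblathon_<coverage>x<insert>i_<read>/<mate>".
_READ_ID = re.compile(r"^@?assemblathon_(\d+)x(\d+)i_(\d+)/([12])$")


def parse_read_id(header: str) -> ReadId:
    """Split a FASTQ header line into its library and read fields."""
    match = _READ_ID.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised read identifier: {header!r}")
    coverage, insert_size, read_number, mate = map(int, match.groups())
    return ReadId(coverage, insert_size, read_number, mate)


def library_filename(read: ReadId) -> str:
    """Build the species 'A' filename that corresponds to this read's library."""
    return f"speciesA_{read.insert_size}i_{read.coverage}x.{read.mate}.fastq.bz2"


if __name__ == "__main__":
    read = parse_read_id("@assemblathon_20x10000i_16/1")
    print(read)
    print(library_filename(read))
```

A check of this kind may be useful for confirming that the two files of a
library stay in step, i.e. that read N in the ".1" file pairs with read N in
the ".2" file.
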

Species 'B'
-----------

The file 'speciesB.diploid.fa.bz2' contains the diploid reference sequence for
an ancestor (~100 my diverged) of the sequenced genome from species A.
Additionally, 'speciesB.reference.fa.bz2' contains what can be thought of as
the reference version of the B genome. Associated with the reference sequence
is a set of annotations in GFF format ('speciesB.annotations.gff.bz2'). The GFF
file provides coordinates of CDSs, UTRs, non-gene constrained elements (NGEs),
and non-exonic constrained elements (NXEs).


Additional Notes
----------------

The genome of species 'A' is diploid, just like species 'B'. Synthetic Illumina
reads are generated at random from the two haplotypes of species 'A'.

The mate pair libraries contain a fraction of reads that are the result of the
wrong loop fragment being pulled down. Additionally, all libraries will contain
some duplicate reads (with unique errors) and some bacterial contamination.
Finally, the mate-pair libraries will sometimes have chimeric reads; their
frequency depends on the mate-pair loop fragmentation size (500 +/- 50 bp) and
the read length. It is assumed that the joining of the ends of the mate-pair
loop may occur uniformly throughout the fragment if the fragment containing
biotin is pulled down at all.

The quality scores on all reads are in Phred+64 ASCII format following the
Illumina 1.5+ specification, where the 'B' character has special meaning and
Phred scores of 0 and 1 are left out.

In the simulation we conditioned our error rate on the reference base, the
position in the read, and the assigned quality score. We do not assume that the
assigned quality score is a direct representation of the underlying quality,
and instead chose to model this error empirically. The programs used to model
this error and simulate reads may be found here:

    https://github.com/jstjohn/SimSeq

The error in these reads may not be as easy to trim as error in a real Illumina
sequencing run.
In observing a typical real dataset you will probably notice that error tends to cluster towards the ends of reads. In our dataset this clustering of error can also be seen, because we model error based on the position of the base within the read. However, since we do not condition our error on the error of the previous base, each position's error is completely independent of the error of the previous position in any given read.
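
The Phred+64 quality encoding mentioned in these notes can be decoded with a
few lines of code. This is a minimal sketch under stated assumptions: the
function names decode_quality and trim_b_tail are hypothetical, and removing a
trailing run of 'B' characters is a common convention for Illumina 1.5+ data
(where 'B', Phred 2, flags a low-quality read segment rather than a calibrated
score), not a step this README prescribes. Because the simulated errors are
independent between positions, such trimming may remove fewer errors here than
it would on real data.

```python
from typing import List, Tuple

PHRED_OFFSET = 64    # Phred+64: quality = ord(character) - 64
QC_INDICATOR = "B"   # Phred 2; special meaning in Illumina 1.5+
MIN_SCORE = 2        # scores 0 and 1 are left out of this encoding


def decode_quality(qual: str) -> List[int]:
    """Convert a Phred+64 quality string into a list of integer scores."""
    scores = [ord(char) - PHRED_OFFSET for char in qual]
    if any(score < MIN_SCORE for score in scores):
        raise ValueError("quality string is not valid Illumina 1.5+ Phred+64")
    return scores


def trim_b_tail(seq: str, qual: str) -> Tuple[str, str]:
    """Drop the trailing run of 'B'-flagged bases (assumed trimming policy)."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    keep = len(qual.rstrip(QC_INDICATOR))
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    print(decode_quality("hgB"))
    print(trim_b_tail("ACGTAC", "hhggBB"))
```

Tools that expect Phred+33 (Sanger) qualities would need the scores converted,
for example by subtracting 31 from each character code.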