Index of /Datasets/Assemblathon/Assemblathon1
Name Last modified Size Description
Entries/ 17-Mar-2011 18:22 -
assemblathon_stats.pl 18-Apr-2011 09:29 14K
assemblathon_talk.pdf 15-Mar-2011 08:28 1.8M
assemblathon_ucdavis..> 28-Apr-2011 16:41 1.4M
basic_assembly_metri..> 24-Mar-2011 14:22 241K
biology_of_genomes_a..> 12-May-2011 05:42 2.5M
checksums.md5 05-Jan-2011 18:22 1.5K
genome_informatics_a..> 09-Nov-2011 21:30 16M
speciesA.diploid.fa.bz2 25-Feb-2011 10:49 62M
speciesA_10000i_20x...> 03-Dec-2010 13:23 930M
speciesA_10000i_20x...> 03-Dec-2010 13:23 945M
speciesA_10000i_20x_..> 05-Jan-2011 17:58 930M
speciesA_10000i_20x_..> 05-Jan-2011 17:59 945M
speciesA_10000i_20x_..> 05-Jan-2011 17:22 930M
speciesA_10000i_20x_..> 05-Jan-2011 17:23 945M
speciesA_200i_40x.1...> 03-Dec-2010 13:51 1.8G
speciesA_200i_40x.2...> 03-Dec-2010 13:51 1.8G
speciesA_3000i_20x.1..> 03-Dec-2010 13:21 931M
speciesA_3000i_20x.2..> 03-Dec-2010 13:21 946M
speciesA_3000i_20x_r..> 05-Jan-2011 17:52 931M
speciesA_3000i_20x_r..> 05-Jan-2011 17:54 946M
speciesA_3000i_20x_r..> 05-Jan-2011 17:24 931M
speciesA_3000i_20x_r..> 05-Jan-2011 17:26 946M
speciesA_300i_40x.1...> 03-Dec-2010 13:54 1.8G
speciesA_300i_40x.2...> 03-Dec-2010 13:54 1.8G
speciesB.annotations..> 07-Dec-2010 13:39 11M
speciesB.diploid.fa.bz2 03-Dec-2010 16:53 62M
speciesB.reference.f..> 07-Dec-2010 13:39 31M
ucsc_assemblathon_an..> 15-Mar-2011 10:28 16M
ucsc_assemblathon_ta..> 15-Mar-2011 10:26 33M
Sequence Data for Assemblathon - A Genome Assembly Challenge
The Assemblathon is a collaborative project to improve the computational tools used in genome assembly, and to
produce new metrics by which to assess the quality of assembled genomes. For more information about the
Assemblathon, visit http://assemblathon.org or email firstname.lastname@example.org. The information below
relates to the 2010/2011 Assemblathon.
Simulated Illumina data for species 'A' are represented by four libraries spanning a range of insert sizes:
two are based on the paired-end assay and two are based on the mate-pair assay. The four libraries are:
100 bp (x 2) reads from 200 +/- 20 bp insert library at 40x coverage
100 bp (x 2) reads from 300 +/- 30 bp insert library at 40x coverage
100 bp (x 2) reads from 3,000 +/- 300 bp insert library at 20x coverage
100 bp (x 2) reads from 10,000 +/- 1,000 bp insert library at 20x coverage
Mate Pair loop fragmentation size: insert = 500 +/- 50 bp
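The library parameters above can be turned into simple arithmetic. The following is a minimal sketch, not part of the Assemblathon tooling: the names LIBRARIES, total_coverage and read_pairs_needed are hypothetical, and the genome size is a free parameter because this README does not state it.

```python
# Library parameters copied from the list above:
# insert size (bp) -> (insert standard deviation in bp, coverage).
LIBRARIES = {
    200: (20, 40),
    300: (30, 40),
    3000: (300, 20),
    10000: (1000, 20),
}
READ_LENGTH = 100  # each pair is 100 bp x 2


def total_coverage(libraries):
    """Sum the nominal coverage over all libraries."""
    return sum(coverage for _sd, coverage in libraries.values())


def read_pairs_needed(coverage, genome_size, read_length=READ_LENGTH):
    """Number of read pairs that yields `coverage` over `genome_size` bases.

    genome_size is an assumption supplied by the caller; each pair
    contributes 2 * read_length sequenced bases.
    """
    return coverage * genome_size // (2 * read_length)


if __name__ == "__main__":
    print(total_coverage(LIBRARIES))         # combined nominal coverage
    print(read_pairs_needed(40, 1_000_000))  # hypothetical 1 Mb genome
```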
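Each library is represented by two files, one for each read of the pair. Filenames include the insert size,
coverage and read-pair identifier (1 or 2). For example, the directory listing shows names beginning
'speciesA_200i_40x.1' and 'speciesA_200i_40x.2' for the 200 bp insert, 40x coverage library.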
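The naming convention can be read mechanically. The sketch below assumes only the prefix pattern visible in the directory listing (species, insert size followed by 'i', coverage followed by 'x', and optionally '.1' or '.2'). The full filenames are truncated in the listing, so anything after that prefix is ignored, and the function name is hypothetical.

```python
import re

# Prefix pattern inferred from names such as "speciesA_200i_40x.1" in the
# directory listing; the remainder of each filename is not interpreted.
PREFIX = re.compile(
    r"^species(?P<species>[AB])_(?P<insert>\d+)i_(?P<coverage>\d+)x"
    r"(?:\.(?P<pair>[12]))?"
)


def parse_library_prefix(name):
    """Return library details parsed from a filename prefix, or None."""
    match = PREFIX.match(name)
    if match is None:
        return None
    pair = match.group("pair")
    return {
        "species": match.group("species"),
        "insert_bp": int(match.group("insert")),
        "coverage": int(match.group("coverage")),
        "pair": int(pair) if pair else None,  # None when not visible
    }


if __name__ == "__main__":
    print(parse_library_prefix("speciesA_200i_40x.1"))
```

Names that do not follow the pattern, such as 'checksums.md5', return None rather than raising.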
The sequence identifier for each read includes details of the library used and a unique read identifier
which also denotes which member of the pair the read is. E.g.
This would indicate that this is read '16' from the first (/1) member of a read pair that was derived from
the 20x coverage library that used a 10,000 bp insert size.
The file 'speciesB.diploid.fa.bz2' contains the diploid reference sequence for an ancestor (~100 million years
diverged) of the sequenced genome from species A. Additionally, 'speciesB.reference.fa.bz2' contains what can be
thought of as the reference version of the B genome.
Associated with the reference sequence is a set of annotations in GFF format ('speciesB.annotations.gff.bz2').
The GFF file provides coordinates of CDSs, UTRs, non-gene constrained elements (NGEs), and non-exonic constrained
elements.
The genome of species 'A' is diploid, just like that of species 'B'. Synthetic Illumina reads are generated at
random from the two haplotypes of species 'A'.
The mate-pair libraries contain a fraction of reads that are the result of the wrong loop fragment
being pulled down. Additionally, all libraries will contain some duplicate reads (with unique errors)
and some bacterial contamination. Finally, the mate-pair libraries will sometimes have chimeric reads;
their frequency depends on the mate-pair loop fragmentation size (500 +/- 50 bp) and
the read length. It is assumed that the joining of the ends of the mate-pair loop may occur uniformly
throughout the fragment, provided the fragment containing biotin is pulled down at all.
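To see why the chimeric frequency depends on the loop fragmentation size and the read length, consider a rough estimate. This sketch is an illustration under stated assumptions, not the calculation used by the simulator: the junction position is taken to be uniform along the fragment, one read is sequenced inward from each end, and the 50 bp spread in fragment size is ignored. The function names are hypothetical.

```python
def chimeric_read_fraction(fragment_len, read_len):
    """Approximate chance that one read crosses the loop junction.

    Assumes the junction is uniformly placed along the fragment and the
    read covers read_len bases from one end of it.
    """
    if fragment_len <= 0 or read_len < 0:
        raise ValueError("fragment_len must be positive, read_len non-negative")
    return min(1.0, read_len / fragment_len)


def chimeric_pair_fraction(fragment_len, read_len):
    """Approximate chance that either read of a pair crosses the junction.

    With a single junction per fragment and non-overlapping reads, the two
    events cannot both happen, so their chances simply add (capped at 1).
    """
    if fragment_len <= 0 or read_len < 0:
        raise ValueError("fragment_len must be positive, read_len non-negative")
    return min(1.0, 2 * read_len / fragment_len)


if __name__ == "__main__":
    # Nominal values from this README: 500 bp fragments, 100 bp reads.
    print(chimeric_read_fraction(500, 100))
    print(chimeric_pair_fraction(500, 100))
```

Under these assumptions, longer reads or shorter loop fragments both raise the expected share of chimeric reads.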
The quality scores on all reads are in Phred+64 ASCII format following the Illumina 1.5+ specification,
in which the 'B' character has special meaning and Phred scores of 0 and 1 are left out.
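Decoding this encoding is a matter of subtracting 64 from each character code. The sketch below is a minimal illustration with hypothetical function names. It treats 'B' (score 2) as a flag rather than a calibrated error rate, which is how the Illumina 1.5+ convention uses it, and it rejects scores below 2 because this dataset leaves them out.

```python
PHRED_OFFSET = 64  # Phred+64: character code = score + 64


def decode_phred64(quality):
    """Convert a Phred+64 quality string into a list of integer scores."""
    scores = []
    for ch in quality:
        score = ord(ch) - PHRED_OFFSET
        if score < 2:
            # Scores 0 and 1 are not used in this encoding.
            raise ValueError(f"unexpected quality character {ch!r}")
        scores.append(score)
    return scores


def error_probability(score):
    """Nominal error probability implied by a Phred score."""
    return 10 ** (-score / 10)


def length_without_trailing_b(quality):
    """Read length after dropping a trailing run of flagged 'B' bases."""
    return len(quality.rstrip("B"))


if __name__ == "__main__":
    print(decode_phred64("hhaB"))
    print(length_without_trailing_b("hhaBBB"))
```

Whether to trim the flagged tail is a choice for the assembler; the helper only reports where that tail begins.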
In the simulation we conditioned our error rate on the reference base, position in the read, and the
assigned quality score. We do not assume that the assigned quality score is a direct representation
of the underlying quality; instead, we chose to model error empirically.
The programs used to model this error and simulate reads may be found here: https://github.com/jstjohn/SimSeq
The error in these reads may not be as easy to trim as error in a real Illumina sequencing run. In
observing a typical real dataset you will probably notice that error tends to cluster towards the ends
of reads. In our dataset this clustering of error can also be seen, because we model error based on the
position of the base within the read. However, since we do not condition our error on the error of the
previous base, each position's error is completely independent of the error of the previous position in
any given read.
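The difference between position-dependent and neighbour-dependent error can be illustrated with a toy model. This is only a sketch and not the SimSeq implementation: the error rates are invented for illustration, the function names are hypothetical, and the real model also conditions on the reference base and the assigned quality score.

```python
import random


def rising_error_rate(position, read_len=100, start=0.001, end=0.05):
    """Hypothetical error rate that rises linearly along the read."""
    if read_len <= 1:
        return start
    return start + (end - start) * position / (read_len - 1)


def simulate_error_flags(read_len, rate_at, rng):
    """Draw one independent error flag per position.

    Each draw depends only on the position, never on whether the
    previous base was an error.
    """
    return [rng.random() < rate_at(pos) for pos in range(read_len)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    flags = simulate_error_flags(100, rising_error_rate, rng)
    print(sum(flags), "errors in a 100 bp read")
```

Errors still pile up near the 3' end on average because the rate rises with position, but long consecutive runs of errors are rarer than in a model where one error makes the next more likely.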