Assembly Strategy & Data Production Progress

Data production • General outline of assembly strategy

Original plan • 454 • SOLiD • WGP (Keygene): new sequence-based physical map • Due date: July 15

Developments • US to join 454 data production • Spain to prepare and run a 4-5 kb mate-pair library • Titanium throughput lower than specs (500 Mb/run): • 350-400 Mb/run • Effect of clonality/redundancy apparent in 454 data: • ~11% in shotgun library • ~13% in 3 kb library • ~30% in 20 kb library • Roche/454 offered to prepare additional paired-end libraries • New recommendations for coverage given by Roche/454:

New recommendations from Roche/454 • Libraries per genome size • 3 kb: 1 library for every 250 Mb of your genome • 8 kb: 1 library for every 100 Mb (or 250 Mb) of your genome • 20 kb: 1 library for every 100 Mb of your genome • Sequencing per library • 3 kb: 2 Titanium runs per library, 3X coverage • 8 kb: 1 Titanium run per library, 2X coverage • 20 kb: 0.5 Titanium runs per library, 1.5-2X coverage • 15X shotgun reads
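A minimal sketch of how the recommendations above translate into library and run counts. The 950 Mb genome size is an assumption (roughly what the "5.9 Gb = 6.2X" figure on the Calculations (2) slide implies); the function name and the choice of the 100 Mb option for 8 kb libraries are illustrative only.

```python
import math

# Roche/454 recommendations from the slide:
# insert size -> (Mb of genome per library, Titanium runs per library)
RECOMMENDATIONS = {
    "3kb": (250, 2.0),
    "8kb": (100, 1.0),   # the slide also allows 1 library per 250 Mb
    "20kb": (100, 0.5),
}

def libraries_and_runs(genome_mb):
    """Return {insert size: (number of libraries, total runs)} for a genome size in Mb."""
    plan = {}
    for size, (mb_per_library, runs_per_library) in RECOMMENDATIONS.items():
        n_libraries = math.ceil(genome_mb / mb_per_library)
        plan[size] = (n_libraries, n_libraries * runs_per_library)
    return plan

# ASSUMPTION: 950 Mb genome; this number is not stated on the slide.
plan = libraries_and_runs(950)
for size, (n_libraries, n_runs) in plan.items():
    print(f"{size}: {n_libraries} libraries, {n_runs} runs")
```

Under that assumed genome size the sketch yields 4 libraries of 3 kb and 10 each of 8 kb and 20 kb.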

Paired-end library production by Roche/454 • Q2 2009 • 3 kb libraries: 4 • 20 kb libraries: 4 • Q3 2009 (currently being produced; ready around the beginning of August) • 8 kb libraries: 10 (US) • 20 kb libraries: 6 (Italy & France) • 40 kb libraries: 4 (US)

NL sequencing of Q2 2009 libraries • shotgun libraries (home-made): • total: 19 runs • 3 kb: • lib1: 4.0 runs; lib2: 1.625 runs; lib3: 0.25 runs; lib4: 0.25 runs • total: 6.125 runs • libraries also shipped to Italy and France • 20 kb: • lib1: 5.75 runs; lib2: 1 run; lib3: 0.25 runs; lib4: 0.25 runs • total: 7.25 runs • libraries also shipped to Italy and France

Typical output of a 454 Ti run • Basecalling software bug: a new version of the basecaller may yield an additional ~25 bases/read

Calculations (1) • Low-end specs • Corrections for clonality/redundancy
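The correction named on this slide can be sketched as follows, using the low-end throughput (350 Mb/run) and the redundancy percentages from the Developments slide. The formula itself is an assumption about how the deck's numbers were derived; it reproduces the 5.9 Gb shotgun total reported on the next slide.

```python
LOW_END_MB_PER_RUN = 350  # low end of the observed 350-400 Mb/run

# Clonality/redundancy fractions from the Developments slide.
REDUNDANCY = {"shotgun": 0.11, "3kb": 0.13, "20kb": 0.30}

def nonredundant_yield_gb(runs, library_type, mb_per_run=LOW_END_MB_PER_RUN):
    """Estimated non-redundant yield in Gb (ASSUMED formula: runs x yield x (1 - redundancy))."""
    return runs * mb_per_run * (1.0 - REDUNDANCY[library_type]) / 1000.0

shotgun_gb = nonredundant_yield_gb(19, "shotgun")
print(f"shotgun: {shotgun_gb:.1f} Gb non-redundant")
```

Applying the same formula to the paired-end runs gives somewhat higher totals than the deck reports, so the paired-end figures presumably include corrections that this transcript does not show.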

Calculations (2) NL sequencing of Q2 2009 libraries • shotgun libraries (home made): • total: 19 runs = 5.9 Gb = 6.2X • recommended = 15X • 3 kb: • total: 6.125 runs = 1.7 Gb (nonredundant) * 50% = 0.9X paired ends • recommended = 3X • 20 kb: • total: 7.25 runs = 1.5 Gb (nonredundant) * 50% = 0.8X paired ends • recommended = 1.5-2X
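The coverage figures above can be reproduced with a short calculation. The genome size is not stated on the slide; the sketch below assumes the value implied by "5.9 Gb = 6.2X" (about 0.95 Gb), and the function and variable names are illustrative.

```python
# ASSUMPTION: genome size back-calculated from the slide's "5.9 Gb = 6.2X".
GENOME_GB = 5.9 / 6.2

def coverage(nonredundant_gb, usable_fraction=1.0, genome_gb=GENOME_GB):
    """Fold coverage from non-redundant yield; usable_fraction is the share counted (e.g. paired ends)."""
    return nonredundant_gb * usable_fraction / genome_gb

shotgun_x = coverage(5.9)        # all shotgun bases count
pe_3kb_x = coverage(1.7, 0.5)    # 50% counted as paired ends, per the slide
pe_20kb_x = coverage(1.5, 0.5)

for label, value in [("shotgun", shotgun_x), ("3 kb", pe_3kb_x), ("20 kb", pe_20kb_x)]:
    print(f"{label}: {value:.1f}X")
```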

To be calculated today! • Who has to do how much additional sequencing from which libraries?
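As a starting point for that calculation, the sketch below estimates how many additional runs would close the gap between achieved and recommended coverage. It assumes that future runs deliver the same coverage per run as the NL runs so far (which ignores any increase in redundancy when a library is sequenced more deeply) and uses the lower bound of 1.5X for 20 kb libraries. It does not address how the runs would be divided among countries.

```python
# (runs done, coverage achieved, coverage recommended), taken from the Calculations (2) slide.
STATUS = {
    "shotgun": (19.0, 6.2, 15.0),
    "3kb": (6.125, 0.9, 3.0),
    "20kb": (7.25, 0.8, 1.5),  # lower bound of the 1.5-2X recommendation
}

def extra_runs(runs_done, achieved_x, target_x):
    """Additional runs needed, ASSUMING constant coverage per run."""
    per_run_x = achieved_x / runs_done
    return max(0.0, (target_x - achieved_x) / per_run_x)

extra = {name: extra_runs(*values) for name, values in STATUS.items()}
for name, runs in extra.items():
    print(f"{name}: ~{runs:.1f} additional runs")
```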

SOLiD data production • NL / Applied BioSystems

SOLiD data production • Applied BioSystems offered to prepare an additional 10 kb mate-pair library • currently running in Italy • Spain produces a 4-5 kb mate-pair library • Discussion: • do we need an additional 7 kb mate-pair library, to be prepared by UK?

Additional data • ~4 million shotgun Sanger reads from Selected BAC Mixture (SBM data, Kazusa) • currently being put on a hard disk, which will be shipped to the Netherlands this week • 400,000 BAC ends (200,000 pairs) • 200,000 fosmid ends (100,000 pairs) • an additional 200K reads will be produced (?) • ~36% euchromatic sequence (70 Mb) • WGP: sequence-based physical map

Data production • General outline of assembly strategy

Strategy overview • Create assembly-validation set • Filter raw data • De novo assembly of 454 & SBM data • Consolidate 454/SBM assemblies • Integrate SOLiD data into 454/SBM assembly • Scaffold using BAC and fosmid ends • Map scaffolds to physical map

Strategy overview • Release of assembly to SOL Sequencing Consortium: November • Annotation by iTAG • Public release of data (under ENCODE guidelines): December 2009

Strategy in detail • 1: Create assembly-validation set • Input: Sanger BAC contigs from SGN • Output: selected high-quality subset of large Sanger BAC contigs • Discussion: • We might be able to use the same pipeline for BAC selection as is being developed for potato (by Erwin Datema) • Coordinator/specific tasks/division of labor: single location, single person: NL • Deadline: August 1

2: Filter raw data (1) • Input: raw sequence data • Output: clean reads, ready for assembly • Discussion: • Should the input data be filtered in advance? If so, what criteria should be used? Should all countries use the same filtering, or can everyone experiment with different settings and filters and contribute their best data set? • Possible filter criteria: repeats, contamination (human, vectors, local sources of contamination, mitochondrion/chloroplast), duplicates (redundancy & clonality) • How exactly will the high repeat content influence the assembly? Can we include repeats in the assembly from the start, or should we remove them to reduce complexity (and will this influence the final assembly quality)?

2: Filter raw data (2) • Coordinator/specific tasks/division of labor: single location (filtering for local sources of contamination probably has to be done locally, because not everyone may be willing or allowed to share 'local' sequences) • Deadline: September 5
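One of the filter criteria listed for this step, duplicate removal, can be illustrated with a deliberately simple sketch: reads sharing the same leading bases are treated as clonal copies and only the longest is kept. The prefix length of 20 and the function name are assumptions chosen for illustration; the slides do not specify a duplicate-detection method, and production filtering would likely need to tolerate sequencing errors.

```python
def remove_duplicates(reads, prefix_len=20):
    """Keep the longest read among those sharing an identical prefix.

    reads: iterable of (name, sequence) tuples.
    prefix_len: ASSUMED number of leading bases used as the duplicate key.
    """
    best = {}
    for name, seq in reads:
        key = seq[:prefix_len].upper()
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (name, seq)
    return list(best.values())

reads = [
    ("r1", "ACGTACGTACGTACGTACGTAA"),
    ("r2", "ACGTACGTACGTACGTACGTAATT"),  # same start as r1, but longer
    ("r3", "TTTTACGTACGTACGTACGTAA"),
]
kept = remove_duplicates(reads)
print([name for name, _ in kept])
```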

3: De novo assembly of 454 & SBM data (1) • Input: (filtered) 454 and SBM reads • Output: 5-10 different assemblies • Discussion: • Explore different assembly methods, parameter settings, etc. • Newbler, CABOG, other? • Should these assemblies already be validated against the validation set, or will this happen during the next step? • What criteria should an assembly meet, and how do we assess the quality of the assemblies? Should we define these? Statistics like the number of contigs/scaffolds, N50 size, etc.?

3: De novo assembly of 454 & SBM data (2) • Discussion: • How should unassembled reads be treated? These would include repetitive reads, singleton reads (and very small contigs?), erroneous reads, etc. • Should all data (assembled or not) be available in the end for possible downstream use? • Do we want to do a de novo assembly of the SOLiD data? If so, should we assemble it standalone or in a hybrid fashion with 454 & SBM? • Coordinator/specific tasks/division of labor: assembly in one location or distributed over countries? In the latter case, how to divide the labor? In our opinion multiple people could contribute to this step. • Deadline:
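The assembly statistics raised in the discussion of this step (number of contigs, N50 size) can be computed as sketched below. N50 is taken here in its usual sense: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length. The function names and example lengths are illustrative.

```python
def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

def assembly_stats(lengths):
    """Basic summary statistics for comparing candidate assemblies."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "largest": max(lengths, default=0),
        "n50": n50(lengths),
    }

stats = assembly_stats([100, 200, 300, 400])
print(stats)
```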

4: Consolidate multiple 454/SBM assemblies into a single best product (1) • Input: 5-10 assembled data sets • Output: single best, validated assembly of 454 and SBM data • Discussion: • Reconcile and merge various assemblies (from step 3) into a single best assembly • The assembly must be validated against the validation set (from step 1): all BAC contigs must be present in the assembly • Compare and validate assemblies (e.g. amosvalidate) and assess error rates among different assemblies

4: Consolidate multiple 454/SBM assemblies into a single best product (2) • Discussion: • What are the quality criteria? Which data make it into the best assembly? How should conflicts between the assemblies be resolved? • Can we already use the physical map for some quality assessment? • Coordinator/specific tasks/division of labor: consolidation should happen in a single location • Deadline:
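The requirement in this step that all validation BAC contigs be present in the assembly needs an operational definition. One possible sketch: align each BAC contig to the assembly with any aligner, then compute the fraction of the contig covered by alignments. The interval format (0-based, half-open, in BAC-contig coordinates) and the 0.95 threshold are assumptions for illustration, not criteria agreed in the slides.

```python
def covered_fraction(contig_length, intervals):
    """Fraction of a BAC contig covered by aligned intervals (overlaps counted once)."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, contig_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / contig_length

def is_present(contig_length, intervals, min_fraction=0.95):
    """ASSUMED criterion: a contig counts as present if enough of it is covered."""
    return covered_fraction(contig_length, intervals) >= min_fraction

fraction = covered_fraction(100, [(0, 50), (40, 80), (90, 100)])
print(fraction, is_present(100, [(0, 50), (40, 80), (90, 100)]))
```

Coverage alone does not detect misassemblies or rearrangements, so a check like this would complement rather than replace tools such as amosvalidate.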

5: Add SOLiD data to 454/SBM assembly (1) • Input: SOLiD reads and single best 454/SBM assembly (from step 4) • Output: single best 454/SBM/SOLiD assembly • Discussion: • De novo assembly of SOLiD data? • Use SOLiD reads to fix possible base errors and homopolymer tracts in the 454/SBM assembly • Gap filling and extension using unassembled SOLiD/454/SBM reads and read pairs

5: Add SOLiD data to 454/SBM assembly (2) • Discussion: • Coordinator/specific tasks/division of labor: de novo assembly can possibly be done by multiple people • Consolidation and/or mapping (incl. gap filling) on the 454/SBM assembly should happen at a single location • Deadline:

6: Scaffold using BAC and fosmid ends • Input: clone ends and single best 454/SBM/SOLiD assembly • Output: single best 454/SBM/SOLiD/clone-end assembly • Discussion: • Strict selection on clone ends to select non-duplicated reads that have a paired-end read • Newbler can handle paired fosmid ends but not BAC ends (limit on spacing of paired ends) • Coordinator/specific tasks/division of labor: single location? • Deadline:
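The strict clone-end selection described for this step could look roughly like the sketch below: keep only clones with exactly one forward and one reverse read, and drop clones whose pair of end sequences duplicates one already seen. The input format (clone ID, direction "F"/"R", sequence) and the exact-match duplicate test are assumptions; the slides do not define either.

```python
from collections import defaultdict

def select_clone_ends(reads):
    """Return {clone_id: (forward_seq, reverse_seq)} for paired, non-duplicated clone ends.

    reads: iterable of (clone_id, direction, sequence), direction being "F" or "R".
    """
    by_clone = defaultdict(lambda: defaultdict(list))
    for clone_id, direction, seq in reads:
        by_clone[clone_id][direction].append(seq)

    seen_pairs = set()
    selected = {}
    for clone_id, ends in by_clone.items():
        # Require exactly one read from each end of the clone.
        if len(ends["F"]) != 1 or len(ends["R"]) != 1:
            continue
        pair = (ends["F"][0], ends["R"][0])
        # Skip clones whose end sequences exactly duplicate an earlier clone.
        if pair in seen_pairs:
            continue
        seen_pairs.add(pair)
        selected[clone_id] = pair
    return selected

reads = [
    ("c1", "F", "AAAA"), ("c1", "R", "CCCC"),
    ("c2", "F", "GGGG"),                       # no mate: rejected
    ("c3", "F", "AAAA"), ("c3", "R", "CCCC"),  # duplicate of c1: rejected
    ("c4", "F", "TTTT"), ("c4", "R", "ACGT"), ("c4", "R", "ACGA"),  # ambiguous: rejected
]
selected = select_clone_ends(reads)
print(list(selected))
```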

7: Map scaffolds to physical map • Input: physical map and single best 454/SBM/SOLiD/CE assembly • Output: draft of tomato genome • Discussion: • Should this be done incrementally with the mapping of the clone ends? • How to handle contradictions between steps 6 and 7? • Coordinator/specific tasks/division of labor: coordinated by and executed in NL (Wageningen) • Deadline:

To be settled today • Time frame • July - October 2009 • Timing of deliverables • Practical issues: • Division of labor • Share all 454 data with assembly team from 454 Life Sciences (Jim Knight)?
