PE-Assembler: De novo assembler using short paired-end reads

Pramila Nuwantha Ariyaratne

Outline • Method • Read screening • Seed building • Contig extension • Scaffolding • Gap filling • Result

Data sets used • Single-end reads • Paired-end reads • ReadLength (from 25 bp to 100 bp) • Insert size varies from MinSpan to MaxSpan • The information comes mainly from these data sets.

Overview • The read screening step selects a set of reads as starting points. • The seed building step extends these reads using single-end reads until they are longer than MaxSpan. Successfully extended regions are called seeds. • The contig extension step tries to extend all seeds using paired-end reads; the resulting sequences are called contigs.

Read screening • Get all k-mers from all the reads. • A k-mer that is expected to occur in the actual genome is called a ‘solid’ k-mer. • A k-mer that is expected to occur within a repeat region is called a ‘repeat’ k-mer. • Example of a repeat region: ACTTTGACACACACACAC……ACACACACGTTGAG
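The first bullet above (collecting all k-mers from all reads) can be sketched as follows. This is a minimal illustration, not the PE-Assembler implementation: the function name `count_kmers` and the two sample reads are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads.

    Assumption: only the forward strand is counted; a real assembler
    would usually also account for reverse complements.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical toy reads, chosen only to show the counting.
reads = ["ACCGTATA", "CCGTATAG"]
kmer_counts = count_kmers(reads, 5)
for kmer, n in sorted(kmer_counts.items()):
    print(kmer, n)
```

A histogram of these counts is what the frequency cut-offs for ‘solid’ and ‘repeat’ k-mers would be read from.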

Read screening

Read screening • A read is a solid read if all its k-mers have counts within the two threshold cut-offs. • Example: • Two cut-offs [42, 120] from the previous graph. • K=5 • Read: ACCGTATA • k-mers: ACCGT, CCGTA, CGTAT, GTATA • Counts: 100, 70, 90, 140 • Not a solid read, because the count of GTATA (140) is above the upper cut-off of 120.

Read screening • Example: • Two cut-offs [42, 120] from the previous graph. • K=5 • Read: ACCGTATG • k-mers: ACCGT, CCGTA, CGTAT, GTATG • Counts: 100, 70, 90, 70 • A solid read, because every count lies within [42, 120].
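The two examples above can be checked with a short sketch. The function names `kmers` and `is_solid_read` are hypothetical, the k-mer counts are the ones given on the slides, and treating the cut-offs as inclusive bounds is an assumption.

```python
def kmers(read, k):
    """Return the k-mers of a read in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def is_solid_read(read, kmer_counts, k, low, high):
    """A read is solid if every one of its k-mers has a count in [low, high].

    Assumption: the cut-offs are inclusive; unseen k-mers count as 0.
    """
    return all(low <= kmer_counts.get(kmer, 0) <= high for kmer in kmers(read, k))

# k-mer counts taken from the two slide examples.
slide_counts = {"ACCGT": 100, "CCGTA": 70, "CGTAT": 90, "GTATA": 140, "GTATG": 70}

print(is_solid_read("ACCGTATA", slide_counts, 5, 42, 120))  # False: GTATA has count 140
print(is_solid_read("ACCGTATG", slide_counts, 5, 42, 120))  # True
```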

Seed Building • Try to extend the solid read using all overlapping reads.

Seed Building • Because of sequencing errors or small repeats, there may be multiple feasible candidates.

Seed Building • To resolve ambiguities due to sequencing errors, we extend every candidate base up to ReadLength. • If only one candidate path reaches the full distance ReadLength, that path is assumed to be the correct extension. • If no path or more than one path is found, try the other side.
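The look-ahead idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PE-Assembler code: the names `next_bases`, `extend_paths` and `extend_once`, the `min_overlap` parameter, exact prefix/suffix matching, and the toy reads are all hypothetical. The slides look ahead by ReadLength; the example uses a shorter look-ahead of 3 to stay small.

```python
def next_bases(seq, reads, min_overlap):
    """Bases proposed by reads whose prefix exactly matches a suffix of seq."""
    bases = set()
    for read in reads:
        longest = min(len(read) - 1, len(seq))
        for olen in range(min_overlap, longest + 1):
            if seq.endswith(read[:olen]):
                bases.add(read[olen])  # the first base of the read past the overlap
    return bases

def extend_paths(seq, reads, min_overlap, depth):
    """All extensions of exactly `depth` bases that are supported by reads."""
    if depth == 0:
        return [""]
    paths = []
    for base in sorted(next_bases(seq, reads, min_overlap)):
        for tail in extend_paths(seq + base, reads, min_overlap, depth - 1):
            paths.append(base + tail)
    return paths

def extend_once(seq, reads, min_overlap, lookahead):
    """Append one base if exactly one candidate base survives the look-ahead."""
    surviving = {path[0] for path in extend_paths(seq, reads, min_overlap, lookahead)}
    if len(surviving) == 1:
        return seq + surviving.pop()
    return None  # no path or several paths: the caller would try the other side

# Toy reads: all 6-mers of ACGTTGCAAGGCT plus one read with an erroneous last base.
genome = "ACGTTGCAAGGCT"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)] + ["CGTTGT"]

print(sorted(next_bases("ACGTTG", reads, 4)))  # two candidates: C and the erroneous T
print(extend_once("ACGTTG", reads, 4, 3))      # only the C path reaches depth 3
```

The erroneous base T is proposed by a single read and has no further support, so its path dies out before the look-ahead distance is reached.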

Seed Building • Finally, when the sequence reaches MaxSpan (it is then called a seed), do a verification. • At least one paired-end read must overlap with this seed within the expected length [MinSpan, MaxSpan].
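One way to picture the verification step is the sketch below. It rests on several assumptions that are not in the slides: the function name `seed_is_verified` is hypothetical, mates are matched exactly and only at their first occurrence, both mates are given in the seed's orientation, and the span is measured from the start of the left mate to the end of the right mate.

```python
def seed_is_verified(seed, read_pairs, min_span, max_span):
    """Return True if at least one read pair maps onto the seed with a
    span inside [min_span, max_span].

    Assumptions: exact matching, first occurrence only, and both mates
    already oriented on the same strand as the seed.
    """
    for left, right in read_pairs:
        i = seed.find(left)
        j = seed.find(right)
        if i == -1 or j == -1:
            continue  # this pair does not map onto the seed
        span = (j + len(right)) - i
        if min_span <= span <= max_span:
            return True
    return False

# Toy seed and pairs, hypothetical values for illustration only.
seed = "ACGTTGCAAGGCTTAGC"
print(seed_is_verified(seed, [("ACGT", "TAGC")], 15, 20))  # span 17, inside the range
print(seed_is_verified(seed, [("ACGT", "TGCA")], 15, 20))  # span 8, outside the range
```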

Contig Extension • This step aims to extend each verified seed to form a longer contig using paired-end reads. • Multiple feasible candidates may arise for 3 reasons. • First, sequencing errors. • Second, short tandem repeats, which are handled in the gap filling step. • Third, long repeats, which are longer than MaxSpan.

Scaffolding • Find the correct ordering of the resulting set of contigs. • Gao Song is currently working on it.

Gap filling • The gap filling step assembles the gap region between two adjacent contigs to form a longer contig.

Gap filling

Simulated data results. • Results are compared using: • Average length of all contigs. • N50 and N90 of contigs (bigger is better). • Coverage. • Large misassemblies: accuracy is much more important than the other metrics.
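For reference, the contig-length metrics listed above can be computed as in this sketch. It uses the usual definition of Nx (the length of the contig at which the running total, taken from longest to shortest, first reaches x% of the total assembled length); the function name `nx` and the contig lengths are hypothetical.

```python
def nx(contig_lengths, x):
    """Length L such that contigs of length >= L cover at least x% of the
    total assembled length (x=50 gives N50, x=90 gives N90)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0  # only reached for an empty list

# Hypothetical contig lengths, for illustration only.
lengths = [80, 70, 50, 40, 30, 20]
average = sum(lengths) / len(lengths)
print(round(average, 2), nx(lengths, 50), nx(lengths, 90))
```

With these lengths the total is 290, so N50 is reached at the second-longest contig (70) and N90 at the fifth (30).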

Simulated data results.

Thank you for your attention.
