RNA Sequence Assembly WEI Xueliang
Overview • Sequence Assembly • Current Method • My Method • RNA Assembly • To Do
Sequence Assembly • Goal: obtain the DNA/RNA sequence. • A sequencing machine cannot read a whole genome in one go; it reads small pieces of between 20 and 1000 bases. • Definition: Read = Tag = Fragment
De novo sequence assembly • Calculating the overlaps between Tags takes a huge amount of time.
DE BRUIJN GRAPH • K-mer: a length-k substring of a Tag. • Each node has at most 4 outgoing edges. • Hash each node as a base-4 number (A=0, C=1, G=2, T=3). • "CTG" => (132)_4 = (30)_10 • Moving from "CTG" to "TGG": • Shift (132)_4 left by one digit: (1320)_4 • (1320)_4 modulo (1000)_4 = (320)_4 • (320)_4 + (2)_4 for 'G' • = (322)_4 = (58)_10
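The hashing steps on this slide can be sketched in a few lines of Python. This is only an illustration: the digit mapping A=0, C=1, G=2, T=3 is inferred from the "CTG" => (132)_4 example, and the function names `kmer_hash` and `next_hash` are made up for this sketch.

```python
# Base-4 digit for each nucleotide. The mapping is an assumption inferred
# from the slide's example "CTG" => (132)_4.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Read the k-mer as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


def next_hash(value, k, base):
    """Hash of the successor k-mer: shift left one base-4 digit,
    drop the leading digit with a modulo, then append the new base."""
    return (value * 4) % (4 ** k) + CODE[base]


h = kmer_hash("CTG")
print(h)                      # 30
print(next_hash(h, 3, "G"))   # 58
print(kmer_hash("TGG"))       # 58
```

Because the successor hash is computed from the current one in constant time, each of the (at most) four outgoing edges of a node can be looked up without rehashing the whole k-mer.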
DE BRUIJN GRAPH (CONT’) • If there are repeats, like "GACT": • A 3-mer De Bruijn graph cannot tell which path is the correct one. A 6-mer graph recovers the correct sequence. • A larger K gives a better result.
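The effect of K on repeats can be checked on a toy example. The sketch below builds a De Bruijn graph from k-mers and lists the nodes with more than one successor. The sequence is hypothetical (chosen so that "GACT" occurs three times with different continuations), and `de_bruijn` and `branching_nodes` are illustrative names, not part of the original method.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous extensions."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Hypothetical sequence containing the repeat "GACT" three times.
seq = "AAGACTCCGACTGGGACTTT"

print(branching_nodes(de_bruijn([seq], 3)))  # ['ACT']
print(branching_nodes(de_bruijn([seq], 6)))  # []
```

With K = 3 the node "ACT" has three possible successors, so the path through the repeat is ambiguous. With K = 6 every k-mer carries enough context around the repeat that no node branches.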
De novo sequence assembly • Suppose K = length of Tag (20-mer). • TGACGTAGCTATGTATTTTG • GACGTAGCTATGTATTTTGT (no such 20-mer) • Coverage is not enough to support a large K.
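A small sketch of why K equal to the Tag length breaks down at low coverage. Everything here beyond the two 20-mers on the slide is an assumption for illustration: the four extra bases appended to the sequence, the two Tag positions, and the helper name `missing_kmers`.

```python
def missing_kmers(genome, tags, k):
    """k-mers of the genome that do not occur in any Tag."""
    seen = set()
    for tag in tags:
        for i in range(len(tag) - k + 1):
            seen.add(tag[i:i + k])
    return [genome[i:i + k]
            for i in range(len(genome) - k + 1)
            if genome[i:i + k] not in seen]


# The slide's two 20-mers merged, plus four hypothetical bases.
genome = "TGACGTAGCTATGTATTTTGT" + "CAGC"

# Two hypothetical 20-base Tags, starting at offsets 0 and 5.
tags = [genome[0:20], genome[5:25]]

print(len(missing_kmers(genome, tags, 20)))  # 4
print(len(missing_kmers(genome, tags, 15)))  # 0
```

With K = 20 each Tag contributes a single k-mer, so the 20-mers at offsets 1 to 4 (including GACGTAGCTATGTATTTTGT) are never observed and the path is broken. With K = 15 the same two Tags cover every k-mer of the sequence.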
MY METHOD • Tag length = 6, K = 3 • When we have • AAGACT? • Try every possible extension: • AAGACTC • AAGACTT • AAGACTG • Check against the Tags: • AGACTC • The correct extension should be AAGACTC
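The check on this slide can be sketched as follows. This is a minimal illustration of the idea, not the author's implementation: the function name `supported_extensions` and the Tag set are hypothetical, with only the Tag AGACTC taken from the slide.

```python
def supported_extensions(contig, candidates, tags, tag_len):
    """Keep the candidate bases whose extension ends in an observed Tag.

    The small-K graph proposes the candidates; the full-length Tags
    decide which of them is actually supported by the data.
    """
    supported = []
    for base in candidates:
        window = (contig + base)[-tag_len:]  # last tag_len bases
        if window in tags:
            supported.append(base)
    return supported


# Hypothetical Tag set; only "AGACTC" appears on the slide.
tags = {"AAGACT", "AGACTC", "CGACTG", "GGACTT"}

print(supported_extensions("AAGACT", "CTG", tags, 6))  # ['C']
```

The K = 3 graph alone offers C, T and G after "ACT", but only AAGACTC ends in a window (AGACTC) that was actually sequenced, so the small K keeps the graph connected while the longer Tag resolves the repeat.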
ALTERNATIVE SPLICING • The graph • All cDNA sequences.
RNA ASSEMBLY’S PROBLEM • Merge? • Index the sequence.
RNA ASSEMBLY’S PROBLEM(CONT’) • Solution?
RNA ASSEMBLY’S PROBLEM(CONT’) • Index Tags
RNA ASSEMBLY’S PROBLEM(CONT’) • Solution? • Speed?
SINGLE TAG’S LIMITATION • |Yellow Sequence| >= Length of Tag • Tag length is 25-100 bp. • A single Tag is not enough!
DATASET - PAIRED END TAGS • Fragment length is usually > 1k. • Some RNA sequences are shorter than 1k.
TO DO • Handle large data sets (10G). • Improve accuracy. • Use PET data.