1 / 25

CS 5263 & CS 4593 Bioinformatics

CS 5263 & CS 4593 Bioinformatics. Next-generation sequencing - Mapping short reads. Short read mapping. Input: A reference genome A collection of many 25-100bp tags (reads) User-specified parameters Output: One or more genomic coordinates for each tag

gochs
Télécharger la présentation

CS 5263 & CS 4593 Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 5263 & CS 4593Bioinformatics Next-generation sequencing - Mapping short reads

  2. Short read mapping • Input: • A reference genome • A collection of many 25-100bp tags (reads) • User-specified parameters • Output: • One or more genomic coordinates for each tag • In practice, only 70-75% of tags successfully map to the reference genome. Why?

  3. Multiple mapping • A single tag may occur more than once in the reference genome. • The user may choose to ignore tags that appear more than n times. • As n gets large, you get more data, but also more noise in the data.

  4. Inexact matching ? • An observed tag may not exactly match any position in the reference genome. • Sometimes, the tag almost matches one or more positions. • Such mismatches may represent a SNP (single-nucleotide polymorphism, see wikipedia) or a bad read-out. • The user can specify the maximum number of mismatches, or a phred-style quality score threshold. • As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.

  5. Read Length is Not As Important For Resequencing Jay Shendure

  6. Mapping Reads Back • Hash Table (Lookup table) • FAST, but requires perfect matches. [O(m n + N)] • Array Scanning • Can handle mismatches, but not gaps. [O(m N)] • Dynamic Programming (Smith Waterman) • Indels • Mathematically optimal solution • Slow (most programs use Hash Mapping as a prefilter) [O(mnN)] • Burrows-Wheeler Transform (BW Transform) • FAST. [O(m + N)] (without mismatch/gap) • Memory efficient. • But for gaps/mismatches, it lacks sensitivity

  7. Spaced seed alignment • Tags and tag-sized pieces of reference are cut into small “seeds.” • Pairs of spaced seeds are stored in an index. • Look up spaced seeds for each tag. • For each “hit,” confirm the remaining positions. • Report results to the user.

  8. Burrows-Wheeler • Store entire reference genome. • Align tag base by base from the end. • When tag is traversed, all active locations are reported. • If no match is found, then back up and try a substitution.

  9. Why Burrows-Wheeler? • BWT very compact: • Approximately ½ byte per base • As large as the original text, plus a few “extras” • Can fit onto a standard computer with 2GB of memory • Linear-time search algorithm • proportional to length of query for exact matches

  10. Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac gc$aaac acaacg$ Burrows-Wheeler Matrix (BWM)

  11. Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

  12. Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac 3 1 4 2 5 6 See the suffix array?

  13. Key observation 1$acaacg1 2aacg$ac1 1acaacg$1 3acg$aca2 1caacg$a1 2cg$acaa3 1g$acaac2 a1c1a2a3c2g1$1 “last first (LF) mapping” The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column.

  14. 4 5 6 5 6 3 4 5 6 3 4 2 5 6 3 1 4 2 5 6 Recover text 6

  15. Exact match 3 1 4 2 5 6

  16. Exact match (another example) BWT(agcagcagact) = tgcc$ggaaaac Search for pattern: gca gca gca gca gca Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/

  17. Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. BWT SA FM indices

  18. Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. BWT gca SA FM indices Next block: From 1 + 0 = 1 to 1 + (4-1) = 4

  19. Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. BWT gca SA FM indices Next block: From 5 + 0 = 5 to 5 + (2-1) = 6

  20. Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. BWT gca SA FM indices Next block: From 8 + 1 = 9 to 8 + (3-1) = 10

  21. Inexact match

  22. Main advantage of BWT against suffix array • BWT needs less memory than suffix array • For human genome m = 3 * 109 : • Suffix array: mlog2(m) bits = 4m bytes = 12GB • BWT: m/4 bytes plus extras = 1 - 2 GB • m/4 bytes to store BWT (2 bits per char) • Suffix array and occurrence counts array take 5 m log2 m bits = 20 n bytes • In practice, SA and OCC only partially stored, most elements are computed on demand (takes time!) • Tradeoff between time and space

  23. Burrows-Wheeler Requires <2Gb of memory. Runs 30-fold faster. Is much more complicated to program. Bowtie Spaced seeds Requires ~50Gb of memory. Runs 30-fold slower. Is much simpler to program. MAQ Comparison

  24. Short-read mapping software http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

  25. References • (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25  • SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics (2008) 24: 713-4 • (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760 • SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li, (2009) 25: 1966–1967 • (MAQ) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18:1851-8. • Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009) • http://www.allisons.org/ll/AlgDS/Strings/BWT/

More Related