Jen Taylor Bioinformatics Team CSIRO Plant Industry

Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?

Presentation Transcript


  1. Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor Bioinformatics Team CSIRO Plant Industry

  2. Assumptions
     • Every k-mer has an equal chance of being sequenced
     CSIRO. Newton Meeting July 2010 - Sequence coverage

  3. Read density

  4. Deviations from Assumptions?
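
One way to look for deviations from the equal-chance assumption on slide 2 is to compare observed read-start counts with what uniform sampling would give. The sketch below is a minimal illustration, not the method used in the talk: under uniform sampling, per-position read-start counts are approximately Poisson, so their variance-to-mean ratio should be close to 1. The function names and the use of a dispersion index as the diagnostic are assumptions introduced here.

```python
import random
from collections import Counter


def simulate_uniform_starts(genome_length, k, n_reads, seed=0):
    """Draw read start positions uniformly over all k-mer positions.

    Returns a list with the number of reads starting at each position.
    """
    rng = random.Random(seed)
    n_positions = genome_length - k + 1
    counts = Counter(rng.randrange(n_positions) for _ in range(n_reads))
    return [counts.get(i, 0) for i in range(n_positions)]


def dispersion_index(counts):
    """Variance-to-mean ratio of per-position counts (about 1 for Poisson)."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return float("nan")
    variance = sum((c - mean) ** 2 for c in counts) / n
    return variance / mean


if __name__ == "__main__":
    uniform = simulate_uniform_starts(10_000, 36, 50_000, seed=1)
    print("uniform sampling:", round(dispersion_index(uniform), 3))
    # A toy clustered profile: all reads pile up at one position.
    print("clustered toy profile:", dispersion_index([0, 0, 0, 20]))
```

A ratio well above 1 in real data would suggest that read starts are more clustered than the assumption predicts.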

  5. Impacts on read coverage - Outline
     • Sample preparation
     • MNase Digestion
     • Alignment
     • Parameter choices
     • Mismatches
     • Multiple read mappings
     • Hamming edit distances and k-mer space

  6. Assumptions: Digestion (Illumina, SOLiD). Source: http://seq.molbiol.ru/sch_lib_fr.html

  7. ChIPSeq (diagram labels: MNase Linker Digest; Remove Nucleosomes; Sequence & Align)

  8. ChIPSeq - Nucleosome
     Sample: MNase digested, size fractionated
     Control: MNase digested, random sizes

  9. araTha9 Aligned Reads: 36-mer Monomer Composition

  10. araTha9 Aligned Reads: 5’ +/- 16 bp Monomer Composition
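
The monomer composition plots on slides 9 and 10 are images and do not survive in the transcript. As a rough illustration of what such a profile contains, the sketch below tallies the fraction of A, C, G and T at each position across a set of equal-length sequences. The function name and the toy reads are hypothetical; how the original plots were computed is not stated.

```python
def monomer_composition(seqs):
    """Per-position base fractions for a list of equal-length sequences.

    Returns one dict per position mapping each of A, C, G, T to its fraction.
    Characters other than A, C, G, T (for example N) are ignored.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    counts = [dict.fromkeys("ACGT", 0) for _ in range(length)]
    for seq in seqs:
        for i, base in enumerate(seq.upper()):
            if base in counts[i]:
                counts[i][base] += 1

    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append(
            {b: (column[b] / total if total else 0.0) for b in "ACGT"}
        )
    return profile


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGA", "TCGA", "ACGA"]  # hypothetical data
    for pos, freqs in enumerate(monomer_composition(toy_reads)):
        print(pos, freqs)
```

For the 5’ +/- 16 bp view, the same function could be applied to genome windows centred on each read's 5’ end rather than to the reads themselves.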

  11. MNase Site Preferencing (Flick et al., J. Mol. Biology 1986)

  12. araTha9 Control MNase Site Preferencing

  13. ChIPSeq (diagram labels: Sequence & Align; MNase Digest; Remove Nucleosomes)

  14. araTha9 Control MNase Site Preferencing

  15. Nucleosome potentials – Read Density (plot labels: Normalised Read Density; Base Coordinate; 1 Kb)

  16. Nucleosome potentials (plot labels: MNase Potential; Normalised Read Density)

  17. Nucleosome potentials (plot labels: MNase Potential; Normalised Read Density)

  18. Nucleosome potential

  19. MNase biases aiding interpretation?
     • Can aid identification in a local sequence?
     • Dependent upon local sequence context
     • Cautionary tale about analysing sequence contexts of ChIPSeq data
     • Nucleotide composition analyses must take into account digestion preferencing
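
The last bullet can be made concrete by comparing sample composition against the MNase-digested control rather than against a flat background. This is a minimal sketch, assuming per-position base frequencies have already been computed for both libraries. The log2 ratio and the pseudocount are illustrative choices made here, not a correction described in the slides.

```python
import math


def log2_enrichment(sample_freqs, control_freqs, pseudocount=1e-6):
    """Per-position log2(sample / control) for each base.

    Both inputs are lists of dicts mapping A, C, G, T to a frequency.
    The pseudocount (an arbitrary choice) avoids division by zero.
    """
    if len(sample_freqs) != len(control_freqs):
        raise ValueError("profiles must cover the same positions")
    return [
        {
            base: math.log2((s[base] + pseudocount) / (c[base] + pseudocount))
            for base in "ACGT"
        }
        for s, c in zip(sample_freqs, control_freqs)
    ]


if __name__ == "__main__":
    # Hypothetical single-position profiles.
    sample = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}]
    control = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
    print(log2_enrichment(sample, control))
```

Values near zero indicate that a base is no more frequent in the sample than the digestion preference in the control already accounts for.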

  20. Impacts on read coverage - Outline
     • Sample preparation
     • MNase Digestion
     • Alignment
     • Parameter choices
     • Mismatches
     • Multiple read mappings
     • Hamming edit distances and k-mer space

  21. Hamming Edit Distances
     • Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k
     • For all possible k-mers (k = 36, 65) in the Arabidopsis genome
     • All vs. all, both strands
     • Minimum HE distance
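
The definition above can be sketched directly. The code below is a naive quadratic version suitable only for toy sequences; how the all-vs-all computation was carried out on the full Arabidopsis genome is not described in the slides. Treating the same position on either strand as "self" and excluding it is an assumption made here, as are the function names.

```python
def hamming(a, b):
    """Number of substitutions needed to turn sequence a into sequence b."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_hamming_distances(genome, k):
    """Minimum HE distance from each k-mer to any k-mer at another position.

    Both strands are searched. If the genome holds a single k-mer there is
    nothing to compare against and k is returned for it.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    rc_kmers = [revcomp(km) for km in kmers]
    result = []
    for i, query in enumerate(kmers):
        best = k
        for j in range(len(kmers)):
            if j == i:
                continue  # skip the k-mer's own locus (assumption)
            best = min(best, hamming(query, kmers[j]), hamming(query, rc_kmers[j]))
        result.append(best)
    return result


if __name__ == "__main__":
    print(min_hamming_distances("AAAAC", 4))
```

A minimum distance of 0 marks a k-mer that occurs more than once, so a read from it cannot be placed uniquely even with no mismatches allowed.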

  22. Arabidopsis Minimum Hamming Edit Distances: 36-mer

  23. Alignment issues (plot labels: hg18, dm3, araTha9, ce6, sacCer6; axis 0 to 14)

  24. Alignment artefacts: aligner properties

  25. Breakdown of sequencing run

  26. Hamming edits and Ubsalign: HE difference
          AGATTAGCCTGGTACTGCTA          H
     …..AGCTTAGCCTGGTACTGGTA….          2
          AGATTAGCCTGGTACTGCTA
     …..AGCTTAGCCGGGTACTGGTA….          3
          AGATTAGCCTGGTACTGCTA          No Alignment

  27. Hamming edits and Ubsalign: HE difference
          AGATTAGCCTGGTACTGCTA          H
     …..AGATTAGCCTGGTACTGCTA….          0
          AGATTAGCCTGGTACTGCTA
     …..AGCTTAGCCGGGTACTGCTA….          2
          AGATTAGCCTGGTACTGCTA          No Alignment

  28. Hamming edits and Ubsalign: HE difference
          AGATTAGCCTGGTACTGCTA          H
     …..AGCTTAGCCTGGTACTGCTA….          1
          AGATTAGCCTGGTACTGCTA
     …..AGCTTAGCCGGGTTCTGGTA….          4
          AGATTAGCCTGGTACTGCTA          Alignment !
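
The three examples on slides 26 to 28 are consistent with a rule that accepts the best hit only when the second-best hit is at least 3 substitutions worse. That threshold is inferred here from the examples alone; the slides do not state Ubsalign's actual rule, so the sketch below should be read as a hypothetical decision function, not as Ubsalign's behaviour.

```python
def hamming(a, b):
    """Number of substitutions needed to turn sequence a into sequence b."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def call_alignment(read, candidates, min_differential=3):
    """Decide whether the best candidate is separated enough to report.

    Returns (accepted, best_distance, second_best_distance).
    min_differential=3 is an assumption inferred from the slide examples.
    A read with a single candidate is accepted (also an assumption).
    """
    distances = sorted(hamming(read, c) for c in candidates)
    if not distances:
        return (False, None, None)
    if len(distances) == 1:
        return (True, distances[0], None)
    best, second = distances[0], distances[1]
    return (second - best >= min_differential, best, second)


if __name__ == "__main__":
    read = "AGATTAGCCTGGTACTGCTA"
    examples = {
        "slide 26": ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],
        "slide 27": ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],
        "slide 28": ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],
    }
    for name, hits in examples.items():
        print(name, call_alignment(read, hits))
```

Note that on slide 27 a perfect match is still rejected because a second site lies only 2 substitutions away, which is the point of using the differential rather than a fixed mismatch cap.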

  29. Testing Aligner Accuracy
     • Simulated reads
     • Known correct location
     • 25 million, 50 million
     • Perfect match, up to 5 mismatches, up to 10 mismatches
     • Error 3’ bias
     • Numbers of: correctly aligned reads, incorrectly aligned reads, unalignable reads
     • Speed
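
A test set of this kind can be sketched as follows. This is a small-scale illustration under stated assumptions, not the simulator used in the talk: the linear 3’ weighting of error positions, the uniform choice of mismatch count, and both function names are choices made here.

```python
import random

BASES = "ACGT"


def simulate_reads(genome, read_len, n_reads, max_mismatches, seed=0):
    """Sample reads with known start positions and 3'-biased substitutions.

    Returns a list of (true_start, read_sequence) tuples.
    Assumes max_mismatches <= read_len.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        n_mismatches = rng.randint(0, max_mismatches)
        available = list(range(read_len))
        for _mm in range(n_mismatches):
            # Weight grows linearly towards the 3' end (illustrative choice).
            weights = [p + 1 for p in available]
            pos = rng.choices(available, weights=weights, k=1)[0]
            available.remove(pos)  # each position is mutated at most once
            bases[pos] = rng.choice([b for b in BASES if b != bases[pos]])
        reads.append((start, "".join(bases)))
    return reads


def score_alignments(true_starts, reported_starts):
    """Tally correct, incorrect and unalignable reads.

    reported_starts uses None for reads the aligner could not place.
    """
    tally = {"correct": 0, "incorrect": 0, "unalignable": 0}
    for truth, reported in zip(true_starts, reported_starts):
        if reported is None:
            tally["unalignable"] += 1
        elif reported == truth:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally


if __name__ == "__main__":
    rng = random.Random(42)
    toy_genome = "".join(rng.choice(BASES) for _ in range(500))
    for start, read in simulate_reads(toy_genome, 36, 3, 5, seed=3):
        print(start, read)
    print(score_alignments([10, 20, 30], [10, 25, None]))
```

The reads would be passed to the aligner under test, and its reported positions fed back into `score_alignments` alongside the known starts.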

  30. Alignment artefacts: Managing mismatch thresholds

  31. Alignment artefacts: Managing mismatch thresholds

  32. How does this affect interpretation?
     • Incorporation of edit differentials
     • Leads to gains in the number of alignable reads
     • Increased information
     • Determination of the alignment
     • Gains of 5-10% in mappable sites
     • Hamming edit distributions provide useful information
     Impact of MNase digestion on short read sequence coverage

  33. Hamming distance variability

  34. Read Deserts

  35. Read Deserts

  36. Sequence deserts
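
The desert slides are figures, and the transcript gives no definition. Assuming a read desert means a run of consecutive positions with no mapped reads, a minimal sketch for locating such runs in a per-base coverage vector could look like this. The function name and the minimum-length parameter are hypothetical.

```python
def find_read_deserts(coverage, min_length):
    """Return half-open (start, end) intervals of zero coverage.

    Only runs of at least min_length consecutive zero-coverage positions
    are reported.
    """
    deserts = []
    run_start = None
    for i, depth in enumerate(coverage):
        if depth == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                deserts.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the vector.
    if run_start is not None and len(coverage) - run_start >= min_length:
        deserts.append((run_start, len(coverage)))
    return deserts


if __name__ == "__main__":
    toy_coverage = [1, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 0]  # hypothetical
    print(find_read_deserts(toy_coverage, 3))
```

Comparing such intervals with the minimum Hamming edit distance of the underlying k-mers is one way to check whether a desert reflects non-unique sequence rather than a true absence of reads.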

  37. Impacts on read coverage - Conclusions
     • Sample preparation
     • MNase Digestion
     • Local biases present
     • Alignment
     • Parameter choices
     • Mismatches – generally too low relative to uniqueness of k-mers in the genome
     • Multiple read mappings – can drive ‘absence’ of mapped reads
     • Hamming edit distances and k-mer space
     • K-mers have unique and genome-specific properties
     • Can be used to inform results of alignment

  38. Acknowledgements
     CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James
     CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison
     CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis
     CMIS / TBCP: Paul Greenfield

  39. Paired end data – sample preparation (diagram labels: C, insert, G, insert, A, T)

  40. Control and sample read density (panels: Control, Sample)