
Sequencing Data Quality

Sequencing Data Quality. Saulo Aflitos. Assembly - Concepts. Read (≈100bp). Contig (≈2Kbp). Paired-End Mate-Pair. Scaffold (≈ 2Mbp). Pseudo Molecule (Super Scaffold). Low Complexity Region. Scaffolding.


Presentation Transcript


  1. Sequencing Data Quality Saulo Aflitos

  2. Assembly - Concepts Read (≈100bp) Contig (≈2Kbp) Paired-End Mate-Pair Scaffold (≈ 2Mbp) Pseudo Molecule (Super Scaffold) Low Complexity Region

  3. Scaffolding Paired-End Mate-Pair Scaffold (≈ 2Mbp) Pseudo Molecule (Super Scaffold) Low Complexity Region

  4. Assembly

  5. Scaffolding Repeats?!

  6. Reality Consensus 3x 2x 1x 1x 3x Contig Reads Depth of Coverage Goldberg SMD et al. 2006
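The depth-of-coverage labels on this slide (3x 2x 1x 1x 3x) can be reproduced with a short sketch. The read coordinates below are invented purely to match those labels, and `depth_of_coverage` is a hypothetical helper, not part of any tool named in the presentation.

```python
def depth_of_coverage(reads, contig_length):
    """Count how many reads cover each position of a contig.

    reads: list of (start, end) pairs, 0-based and half-open.
    """
    depth = [0] * contig_length
    for start, end in reads:
        # Clip each read to the contig boundaries before counting.
        for pos in range(max(start, 0), min(end, contig_length)):
            depth[pos] += 1
    return depth


# Hypothetical read placements chosen to reproduce the slide's labels.
reads = [(0, 1), (0, 2), (0, 4), (4, 5), (4, 5), (4, 5)]
depth = depth_of_coverage(reads, contig_length=5)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

Positions covered by a single read are the ones where a sequencing error cannot be outvoted when the consensus is built.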

  7. Heterozygosity A/C [Figure residue: columns of aligned read bases, mostly A and C, with one N] 95% ±5 50% ±10

  8. Consequences of Data Cleaning 265.89 Raw Filtered 41.61 48.65 50.37 57.60

  9. Sequencing Shotgun RNAseq

  10. Sequencing Paired End Mate Pair

  11. Sample Preparation Genome Ultrasound Physical RE Shred Gel Beads Size Selection ID Binding to Surface Circularization Adapter Illumina 454 PacBio Sequencing

  12. Shredding

  13. Size Selection

  14. Sequencing Illumina PE Insert Size 150bp-2Kbp 100bp 100bp Read Length

  15. Sequencing 454 MP Insert Size 2K-20Kbp Read Length 500bp 150bp 150bp 150bp

  16. Data

  17. FastQ Machine Name Read ID (unique) Encoded Quality 0-40 Chance of being wrong

  18. FastQ Format
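As a minimal sketch of the FastQ layout, the reader below assumes the common four-lines-per-record form (header starting with `@`, sequence, `+` separator, quality string) and Phred+33 quality encoding. The encoding offset is an assumption, since the slides do not state it, and the read name in the example is made up.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        # Assumed Phred+33 encoding: quality = ASCII code minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical record; the header carries the machine name and a unique read ID.
example = "@M00001:1:000-A1B2C:1:1101:15000:1350 1:N:0:1\nACGTN\n+\nII5.#\n"
records = list(read_fastq(StringIO(example)))
for read_id, seq, quals in records:
    print(read_id, seq, quals)
```

Real files are often gzip-compressed or may use a different offset, so check the encoding that FastQC reports before decoding qualities this way.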

  19. FastQ Statistics 13 0.05 5%
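The numbers on this slide (13, 0.05, 5%) follow from the standard Phred definition, Q = -10 · log10(p), where p is the chance that a base call is wrong. The function names below are illustrative only.

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (13, 20, 30, 40):
    print(f"Q{q}: {phred_to_error(q):.4%} chance of being wrong")
```

A quality of 13 corresponds to roughly a 5% chance of error, which is why 13 and 0.05 appear together as equivalent cutoffs.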

  20. Cleaning

  21. FastQC Quality Checking Tool Contamination screen fastq screen Per base sequence quality Per base sequence content Per sequence quality Sequence duplication Sequence length distribution Per base GC content Per sequence GC content Per base N-content

  22. SolexaQA Cleaning Tool

  23. SolexaQA Cleaning Tool

  24. Exercise
      • Create a “cleaning” folder: mkdir cleaning; cd cleaning
      • Inside it, run: wget -O saulo.bash http://goo.gl/Tx8g6
      • Run it with: bash saulo.bash
      • This will download FastQC and SolexaQA
      • FastQC help: http://goo.gl/EE8M7
      • FastQC tutorial: http://goo.gl/rihyA
      • FastQC manual: http://goo.gl/9yihC
      • SolexaQA help: http://solexaqa.sourceforge.net/
      • Run FastQC: ./FastQC/fastqc &
      • File > Open [Files of Type = FastQ files]

  25. Exercise
      • Verify the two .fq files (you can use less):
        • bad_MiSeq_dataset.fq
        • good_MiSeq_dataset.fq
      • Clean the bad dataset with SolexaQA’s DynamicTrim.pl script:
        • perl SolexaQA_v.2.1/DynamicTrim.pl bad_MiSeq_dataset.fq -h 25
      • Verify the improvement (or not) by opening
        • bad_MiSeq_dataset.fq.trimmed
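To make the trimming step less of a black box, here is a simplified illustration of quality trimming: keep the longest contiguous stretch of a read whose base qualities meet a Phred cutoff (25, as in the exercise). This is a sketch of the general idea, not SolexaQA's code. The function names and the choice to treat a quality equal to the cutoff as passing are assumptions.

```python
def longest_good_segment(quals, cutoff=25):
    """Return (start, end) of the longest run with every quality >= cutoff.

    The interval is half-open; (0, 0) means no base passed.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= cutoff:
            if run_start is None:
                run_start = i  # a new passing run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    return best_start, best_start + best_len


def trim_read(seq, quals, cutoff=25):
    """Trim a read and its qualities to the longest passing segment."""
    start, end = longest_good_segment(quals, cutoff)
    return seq[start:end], quals[start:end]


# Hypothetical read: two low-quality bases split it into short runs.
seq, quals = "ACGTACGT", [10, 30, 32, 12, 35, 36, 38, 20]
trimmed_seq, trimmed_quals = trim_read(seq, quals)
print(trimmed_seq, trimmed_quals)
```

Comparing read lengths before and after this kind of trimming is one way to judge whether the cleaned file opened in FastQC has actually improved.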

  26. ?
