Introduction To Next Generation Sequencing (NGS) Data Analysis

Presentation Transcript


  1. Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility

  2. Outline
     • Goals: practical guide to NGS data processing
     • Bioinformatics in NGS data analysis
     • Basics: terminology, data formats, general workflow, etc.
     • Data analysis pipeline
     • Sequence QC and preprocessing
     • Obtaining and preparing reference
     • Sequence mapping
     • Downstream analysis workflow and software
     • RNA-Seq data analysis
     • Spliced alignment, normalization, coverage, differential expression
     • Tuxedo suite: Tophat, Cufflinks and cummeRbund
     • Data visualization: IGV
     • RNA-Seq pipeline software
     • ChIP-Seq data analysis workflow and software
     • Open source pipeline software with Graphical User Interface
     • Summary

  3. Why Next Generation Sequencing • One can generate hundreds of millions of short sequences (35 bp to 150 bp) in a single run, in a short period of time and at a low per-base cost. • Illumina/Solexa GA II / HiSeq 2000, 2500 • Roche/454 FLX, Titanium • Life Technologies/Applied Biosystems SOLiD Reviews: Michael Metzker (2010) Nature Reviews Genetics 11:31; Quail et al. (2012) BMC Genomics 13:341.

  4. Why Bioinformatics Informatics (wall.hms.harvard.edu)

  5. Bioinformatics Challenges in NGS Data Analysis • VERY large text files (thousands of millions of lines long) • Can't do 'business as usual' with familiar tools • Impossible memory usage and execution time • Manage, analyze, store, transfer and archive huge files • Need for powerful computers and expertise • Informatics groups must manage compute clusters • New algorithms and software are required; they are often open source and Unix/Linux based. • Collaboration of IT, bioinformaticians and biologists

  6. Basic NGS Workflow Olson et al.

  7. NGS Data Analysis Overview Olson et al.

  9. Terminology
     • Coverage (depth): The number of nucleotides from reads that are mapped to a given position.
     • Quality score: Each called base comes with a quality score which measures the probability of a base call error.
     • Paired-end sequencing: Both ends of the DNA fragment are sequenced, allowing highly precise alignment.
     • Multiplex sequencing: "Barcode" sequences are added to each sample so the samples can be distinguished, which allows a large number of samples to be sequenced on one lane.
     • Mapping: Aligning reads to a reference to identify their origin.
     • Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
     • Duplicate reads: Reads that are identical.
     • Multi-reads: Reads that can be mapped to multiple locations equally well.
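To make the coverage definition concrete, here is a minimal Python sketch that counts how many reads overlap each reference position. The `(start, length)` read representation and the three example reads are assumptions made for illustration; real pipelines compute depth from SAM/BAM alignments.

```python
from collections import Counter

def coverage_depth(reads):
    """Return {position: depth} for reads given as (start, length) pairs.

    Positions are 0-based; a read covers start .. start + length - 1.
    """
    depth = Counter()
    for start, length in reads:
        for pos in range(start, start + length):
            depth[pos] += 1
    return dict(depth)

# Three hypothetical 5 bp reads mapped at offsets 0, 2 and 4.
reads = [(0, 5), (2, 5), (4, 5)]
depth = coverage_depth(reads)
print(depth[4])  # position 4 is overlapped by all three reads
```

Positions that no read overlaps are simply absent from the result, which corresponds to zero coverage.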

  10. What does the data look like? Common NGS Data Formats. For a full list, go to http://genome.ucsc.edu/FAQ/FAQformat.html

  11. File Formats
      • Reference sequences, reads: FASTQ, FASTA
      • Alignments: SAM, BAM
      • Features, annotation, scores: GFF/GTF, BED/BigBed, WIG/BigWig
      http://genome.ucsc.edu/FAQ/FAQformat.html

  12. FASTA Format (Reference Seq)

  13. FASTQ Format (reads)

  14. FASTQ Format (Illumina Example)
      Fields labeled in the read record header: flow cell ID, lane, tile, tile coordinates, barcode.
      Each record has four lines: read record header, read bases, separator (with optional repeated header), read quality scores.

      @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
      CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
      +
      BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
      @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
      AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
      +
      @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
      @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
      CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
      +
      CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
      @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
      GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
      +
      CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

      NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.
      (Passarelli, 2012)
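The four-line record layout can be read with a few lines of Python. The sketch below parses the first record from the slide, splits its header into fields, and converts quality characters to scores. It assumes Phred+33 encoding and the Illumina CASAVA 1.8 header layout; the field names `instrument`, `run`, `read`, `filtered` and `control` are not labeled on the slide and come from that convention.

```python
import io

# First record from the slide, copied verbatim.
FASTQ_TEXT = """@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
"""

def parse_fastq(handle):
    """Yield (header, bases, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        bases = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quals = handle.readline().rstrip("\n")
        yield header[1:], bases, quals  # drop the leading '@'

def split_header(header):
    """Split an Illumina-style header into named fields (assumed layout)."""
    left, right = header.split(" ")
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, barcode = right.split(":")
    return {"instrument": instrument, "run": run, "flowcell": flowcell,
            "lane": lane, "tile": tile, "x": x, "y": y, "read": read,
            "filtered": filtered, "control": control, "barcode": barcode}

def phred33(quals):
    """Convert quality characters to integer scores, assuming Phred+33."""
    return [ord(char) - 33 for char in quals]

def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)

records = list(parse_fastq(io.StringIO(FASTQ_TEXT)))
header, bases, quals = records[0]
fields = split_header(header)
scores = phred33(quals)
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
print(scores[0], error_probability(scores[0]))
```

A Phred score of 30 corresponds to an error probability of 1 in 1000, so the first base here (score 33) is called with better than 99.9% confidence.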

  15. General Data Pipeline

  17. Why QC? Sequencing runs cost money • Consequences of not assessing the Data • Sequencing a poor library on multiple runs – throwing money away! Data analysis costs money and time • Cost of analyzing data, CPU time $$ • Cost of storing raw sequence data $$$ • Hours of analysis could be wasted $$$$ • Downstream analysis can be incorrect.

  18. How to QC?
      $ module load fastqc
      $ fastqc s_1_1.fastq
      FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)
      Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
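One of the statistics a QC report summarizes is base quality along the read. The Python sketch below computes a mean quality score per read position from a list of quality strings. It is a simplified illustration and not FastQC's implementation; the two quality strings are made up, and Phred+33 encoding is assumed.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Reads may differ in length; each position averages over the
    reads that are long enough to reach it.
    """
    totals, counts = [], []
    for quals in quality_strings:
        for i, char in enumerate(quals):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# Two hypothetical 4 bp reads: quality drops toward the 3' end of the second.
means = mean_quality_by_position(["IIII", "II5+"])
print(means)
```

A steady drop in the per-position mean toward the end of the reads is a common reason to trim before mapping.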

  19. FastQC: Example

  21. The UCSC Genome Browser Homepage General information Get genome annotation here! Get reference sequences here! Specific information— new features, current status, etc.

  22. Getting reference sequences

  23. Getting Reference Annotation

  25. Sequence Mapping Challenges • Alignment (mapping) is the first step once analysis-ready reads are obtained. • The task: to align sequencing reads against a known reference. • Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.
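The ambiguity caused by repeats can be shown with a naive exact-match search. The Python sketch below is only an illustration: the reference and reads are invented, and real aligners use indexed data structures and tolerate mismatches rather than scanning the reference with exact string search.

```python
def find_all(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

# Hypothetical reference in which "ACGT" occurs twice.
reference = "ACGTTTACGTAGGC"
unique_hits = find_all(reference, "TAGG")  # maps to a single location
repeat_hits = find_all(reference, "ACGT")  # a multi-read: two equally good locations
print(unique_hits, repeat_hits)
```

The second read is a multi-read in the sense of the terminology slide: both locations match equally well, so the aligner has to report all of them, pick one, or discard the read.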

  26. Short Read Alignment Olson et al.

  27. Short Read Alignment Software

  28. Short Reads Mapping Software

  29. How to choose an aligner? • There are many aligners and they vary a lot in performance (accuracy, memory usage, speed, flexibility, etc.). • Factors to consider: application, platform, read length, downstream analysis, etc. • Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie) • Guaranteed high accuracy will take longer.

  31. NGS Applications and Analysis Strategy (Hunicke-Smith et al, 2010)

  32. Application Specific Software

  34. RNA-Seq Pipeline (Wilhelm, B.T., et al, 2009)

  35. RNA-Seq: Spliced Alignment • Some reads will span two different exons • Need long enough reads to be able to reliably map both sides http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
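To see why exon-spanning reads need special handling, the Python sketch below builds a toy genome with one intron and shows that a read taken from the spliced transcript has no contiguous match but does match once it is split at the junction. This is a toy search written for illustration, not how Tophat works internally; the sequences, the `min_anchor` parameter, and the requirement that the intron start with GT and end with AG are assumptions of the example.

```python
def spliced_match(genome, read, min_anchor=2):
    """Return (left_start, right_start, split) for the first split of the
    read whose two halves flank a GT...AG gap in the genome, else None.

    Only the first occurrence of the left half is tried, which is enough
    for this toy example.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        if left_start == -1:
            continue
        intron_start = left_start + split
        right_start = genome.find(right, intron_start + 1)
        if right_start == -1:
            continue
        intron = genome[intron_start:right_start]
        if intron.startswith("GT") and intron.endswith("AG"):
            return left_start, right_start, split
    return None

exon1, intron, exon2 = "ACGTAC", "GTTTTTAG", "GGATCC"
genome = exon1 + intron + exon2
read = exon1[-3:] + exon2[:3]  # "TACGGA", spans the exon-exon junction

contiguous = genome.find(read)      # -1: no unspliced match exists
spliced = spliced_match(genome, read)
print(contiguous, spliced)
```

With only three bases on each side of the junction, the halves are short enough to match by chance in a real genome, which is the slide's point about needing reads long enough to map both sides reliably.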

  36. RNA-Seq: Coverage • Coverage in RNA-Seq is highly non-uniform. • Within a single exon, there are regions with high coverage and regions with zero coverage. • They change when the library preparation protocol is changed. • The binding preferences of random hexamer primers explain them only partially. We simply hope that this averages out over the whole transcript!

  37. RNA-Seq: Normalization
      • Gene-length bias
        • Differential expression of longer genes is more significant because long genes yield more reads.
        • Ratio-based filtering yields more false positives for short genes.
      • RNA-Seq normalization methods:
        • Scaling-factor based: total count, upper quartile, median, DESeq, TMM in edgeR.
        • Other methods: quantile, RPKM (normalize by gene length and by number of reads mapped).

  38. Definition of expression levels
      RPKM: Reads Per Kilobase per Million mapped reads
      FPKM: Fragments Per Kilobase per Million mapped reads (for paired-end reads)
      Mortazavi et al. 2008
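Written out, the RPKM definition is the read count divided by the gene length in kilobases and by the number of mapped reads in millions. The Python sketch below applies that arithmetic; the function name and the example counts are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical library with 10 million mapped reads.
short_gene = rpkm(500, 2000, 10_000_000)   # 500 reads on a 2 kb gene
long_gene = rpkm(1000, 4000, 10_000_000)   # twice the reads on a gene twice as long
print(short_gene, long_gene)
```

The two genes receive different raw counts but the same RPKM, which is the gene-length correction the normalization slide describes. FPKM uses the same arithmetic with fragments (read pairs) counted instead of individual reads.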

  39. RNA-Seq: Differential Expression. Discrete vs. continuous data: Microarray fluorescence intensity data: continuous • Modeled using normal distribution. RNA-Seq read count data: discrete • Modeled using negative binomial distribution. Microarray software canNOT be directly used to analyze RNA-Seq data!

  40. RNA-Seq data analysis software http://www.ncbi.nlm.nih.gov/pubmed/21623353

  42. RNA-Seq (Tuxedo Protocol)
      1. Spliced read mapping (SAM/BAM)
      2. Transcript assembly and quantification (GTF/GFF)
      3. Merge assembled transcripts from multiple samples
      4. Differential expression analysis
      http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

  43. 1. Spliced Alignment: Tophat
      Tophat: a spliced short read aligner for RNA-Seq.
      $ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
      $ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
      $ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
      $ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
      http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

  44. The TopHat2 Pipeline

  45. Tophat Parameters http://tophat.cbcb.umd.edu/manual.html

  46. 2. Transcript assembly and abundance quantification: Cufflinks
      • Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
      $ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
      $ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
      $ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
      $ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
      http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

  47. 3. Final Transcriptome assembly: Cuffmerge
      $ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
      $ more assemblies.txt
      ./C1_R1_clout/transcripts.gtf
      ./C1_R2_clout/transcripts.gtf
      ./C2_R1_clout/transcripts.gtf
      ./C2_R2_clout/transcripts.gtf
      http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

  48. Cufflinks Parameters http://cufflinks.cbcb.umd.edu/manual.html

  49. Cufflinks and related resources
      • Pachter L. Models for transcript quantification from RNA-Seq. arXiv preprint arXiv:1104.3889 (2011).
      • Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.1621
      • Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. doi:10.1186/gb-2011-12-3-r22
      • Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btr355

  50. 4. Differential Expression: Cuffdiff
      • CuffDiff: a program that compares transcript abundance between samples.
      $ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
          ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
          ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
      Replicates of one condition are separated by commas; the two conditions are separated by a space.
      http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
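The four protocol steps repeat the same command pattern for each sample, so the command lines can be generated from a list of sample names instead of being typed by hand. The Python sketch below only builds the command strings and does not run anything. It uses only the flags and file names shown on the slides; the helper function names and the assumption that the condition is the part of the sample name before the underscore are made up for this example.

```python
SAMPLES = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]

def tophat_cmd(sample, threads=8):
    """Step 1: spliced alignment of one paired-end sample."""
    return (f"tophat -p {threads} -G genes.gtf -o {sample}_thout genome "
            f"{sample}_1.fq {sample}_2.fq")

def cufflinks_cmd(sample, threads=8):
    """Step 2: transcript assembly and quantification for one sample."""
    return (f"cufflinks -p {threads} -o {sample}_clout "
            f"{sample}_thout/accepted_hits.bam")

def assemblies_manifest(samples):
    """Contents of assemblies.txt, the input list for cuffmerge (step 3)."""
    return "".join(f"./{s}_clout/transcripts.gtf\n" for s in samples)

def cuffmerge_cmd(threads=8):
    """Step 3: merge the per-sample assemblies."""
    return f"cuffmerge -g genes.gtf -s genome.fa -p {threads} assemblies.txt"

def cuffdiff_cmd(samples, threads=8):
    """Step 4: commas join replicates, spaces separate conditions."""
    groups = {}
    for sample in samples:
        condition = sample.split("_")[0]  # assumed naming scheme: <condition>_<replicate>
        groups.setdefault(condition, []).append(
            f"./{sample}_thout/accepted_hits.bam")
    labels = ",".join(groups)
    bams = " ".join(",".join(paths) for paths in groups.values())
    return (f"cuffdiff -o diff_out -b genome.fa -p {threads} -L {labels} "
            f"-u merged_asm/merged.gtf {bams}")

for sample in SAMPLES:
    print(tophat_cmd(sample))
    print(cufflinks_cmd(sample))
print(cuffmerge_cmd())
print(cuffdiff_cmd(SAMPLES))
```

Printing the commands first makes it easy to review them, or to paste them into a cluster job script, before committing compute time.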
