Read Processing and Mapping: From Raw to Analysis-ready Reads

Read Processing and Mapping:From Raw to Analysis-ready Reads Ben Passarelli Stem Cell Institute Genome Center NGS Workshop 12 September 2012

Samples to Information Variant calling Gene expression Chromatin structure Methylome Immunorepertoires De novo assembly …

Many Analysis Pipelines Start with Read Mapping Genotyping (GATK) RNA-seq (Tuxedo) http://www.broadinstitute.org/gsa/wiki/images/7/7a/Overall_flow.jpg http://www.broadinstitute.org/gatk/guide/topic?name=intro http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

From Raw to Analysis-ready Reads Raw reads • Session Topics • Understand read data formats and quality scores • Identify and fix some common read data problems • Find and prepare a genomic reference for mapping • Map reads to a genome reference • Understand alignment output • Sort, merge, index alignment for further analysis • Locally realign at indels to reduce alignment artifacts • Mark/eliminate duplicate reads • Recalibrate base quality scores • An easy way to get started • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads

Instrument Output Illumina MiSeq Illumina HiSeq IonTorrent PGM Roche 454 Pacific Biosciences RS Images (.tiff) Cluster intensity file (.cif) Base call file (.bcl) Movie Trace (.trc.h5) Pulse (.pls.h5) Base (.bas.h5) Standard flowgram file (.sff) Sequence Data (FASTQ Format)

FASTQ Format (Illumina Example) Lane Tile Barcode Raw reads Read Record Header Flow Cell ID Tile Coordinates @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT + BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG + @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2 @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC + CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG + CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ Read Bases • Read assessment and prep Separator (with optional repeated header) Read Quality Scores • Duplicate marking • Local realignment • Mapping • Base quality recalibration Analysis-ready reads NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads

Base Call Quality: Phred Quality Scores Phred* quality score Q with base-calling error probability P Q = -10 log10P * Name of first program to assign accurate base quality scores. From the Human Genome Project. SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 S - Sanger Phred+33 range: 0 to 40 I - Illumina 1.3+ Phred+64 range: 0 to 40 L - Illumina 1.8+ Phred+33 range: 0 to 41

[benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz gzip compressed [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz Read [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz Barcode [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz gzip compressed [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz gzip compressed File Organization Raw reads [benpass@solexalign]$ ls [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 • Read assessment and prep [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz Format • Base quality recalibration • Local realignment • Mapping • Duplicate marking Analysis-ready reads

Initial Read Assessment Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads Common problems that can affect analysis • Low confidence base calls • typically toward ends of reads • criteria vary by application • Presence of adapter sequence in reads • poor fragment size selection • protocol execution or artifacts • Over-abundant sequence duplicates • Library contamination

Initial Read Assessment: FastQC Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads • Free Download Download:http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y • Samples reads (200K default): fast, low resource use

Read Assessment Examples ~71.48% of sequences are duplicates Sanger Quality Score by Cycle Median, Inner Quartile Range, 10-90 percentile range, Mean Read Duplication ~6% of sequences occur more than 10x ~8% of sampled sequences occur twice http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc Note: Duplication based on read identity, not alignment at this point

Read Assessment Example (Cont’d) Per base sequence content should resemble this…

Read Assessment Example (Cont’d)

Read Assessment Example (Cont’d) TruSeqAdapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

Read Assessment Example (Cont’d) Trim leading bases (library artifact) Trim for base quality or adapters (run or library issue)

Selected Tools to Process Reads • Fastxtoolkit* http://hannonlab.cshl.edu/fastx_toolkit/ • (partial list) • FASTQ Information: Chart Quality Statistics and Nucleotide Distribution • FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise). • FASTQ Clipper: Removing sequencing adapters • FASTQ Quality Filter: Filters sequences based on quality • FASTQ Quality Trimmer: Trims (cuts) sequences based on quality • FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality • *defaults to old Illumina fastq (ASCII offset 64). Use –Q33 option. • SepPrephttps://github.com/jstjohn/SeqPrep • Adapter trimming • Merge overlapping paired-end read • Biopythonhttp://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html • (for python programmers) • Especially useful for implementing custom/complex sequence analysis/manipulation • Galaxy http://galaxy.psu.edu • Great for beginners: upload data, point and click • Just about everything you’ll see in today’s presentations

Read Mapping Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads http://www.broadinstitute.org/igv/

Read Mapping: Aligning to a Reference Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads

Read Mapping: BWA Raw reads • BWA Features • Uses Burrows Wheeler Transform • fast • modest memory footprint (<4GB) • Accurate • Tolerates base mismatches • increased sensitivity • reduces allele bias • Gapped alignment for both single- and paired-ended reads • Automatically adjusts parameters based on read lengths and error rates • Native BAM/SAM output (the de facto standard) • Large installed base, well-supported • Open-source (no charge) • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads

Sequence References and Annotations http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome Comprehensive reference information http://hgdownload.cse.ucsc.edu/downloads.html Comprehensive reference, annotation, and translation information ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle References and SNP information data by GATK Human only http://cufflinks.cbcb.umd.edu/igenomes.html Pre-indexed references and gene annotations for Tuxedo suite Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast http://www.repeatmasker.org/

Fasta Sequence Format • One or more sequences per file • “>” denotes beginning of sequence or contig • Subsequent lines up to the next “>” define sequence • Lowercase base denotes repeat masked base • Contig ID may have comments delimited by “|” >chr1 … TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa taaccatatttgggatatactggattttaaaaatatatcactaatttcat … >chr2 … >chr3 …

Running BWA Input files: reference.fasta, read1.fastq.gz, read2.fastq.gz Step 1: Index the genome (~3 CPU hours for a human genome reference): bwa index -a bwtswreference.fasta Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates: bwaalnreference.fasta read1.fastq.gz > read1.sai bwaalnreference.fasta read2.fastq.gz > read2.sai Apply option –q<quality threshold> to trim poor quality bases at 3'-ends of reads Step 3: Generate alignments in the SAM format (paired-end): bwasampereference.fasta read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz > alignment_ouput.sam http://bio-bwa.sourceforge.net/bwa.shtml

Running BWA (Cont’d) Simple Form: bwasampereference.fasta read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz > alignment.sam Output to BAM: bwasampereference.fasta read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz | samtoolsview -Sbh - > alignment.bam With Read Group Information: bwasampe-r "@RG\tID:readgroupID\tLB:libraryname\tSM:samplename\tPL:ILLUMINA“ \ reference.fasta \ read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam

SAM (BAM) Format Sequence Alignment/Map format • Universal standard • Human-readable (SAM) and compact (BAM) forms Structure • Header version, sort order, reference sequences, read groups, program/processing history • Alignment records

SAM/BAM Format: Header samtools to view bam header sort order reference sequence names with lengths read groups with platform, library and sample information program (analysis) history [benpassalign_genotype]$ samtools view -H allY.recalibrated.merge.bam @HD VN:1.0 GO:noneSO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 … @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrYLN:59373566 … @RG ID:86-191 PL:ILLUMINA LB:IL500SM:86-191-1 @RG ID:BsK010 PL:ILLUMINA LB:IL501SM:BsK010-1 @RG ID:Bsk136 PL:ILLUMINA LB:IL502SM:Bsk136-1 @RG ID:MAK001 PL:ILLUMINA LB:IL503SM:MAK001-1 @RG ID:NG87 PL:ILLUMINA LB:IL504SM:NG87-1 … @RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023 @PG ID:GATK IndelRealigner VN:2.0-39-gd091f72CL:knownAlleles=[] targetIntervals=tmp.intervals.listLODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:bwaPN:bwa VN:0.6.2-r126

SAM/BAM Format: Alignment Records 8 3 4 5 6 9 1 11 10 http://samtools.sourceforge.net/SAM1.pdf [benpassalign_genotype]$ samtools view allY.recalibrated.merge.bam HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191 HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 RG:Z:SDH023 * Many fields after column 12 deleted (e.g., recalibrated base scores) have been deleted for improved readability

Preparing for Next Steps Raw reads • Subsequent steps require sorted and indexed bams • Sort orders: karyotypic, lexicographical • Indexing improves analysis performance • Picard tools: fast, portable, free http://picard.sourceforge.net/command-line-overview.shtml Sort: SortSam.jar Merge: MergeSamFiles.jar Index: BuildBamIndex.jar • Order: sort, merge (optional), index • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads

Local Realignment Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads • 1Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238 • BWT-based alignment is fast for matching reads to reference • Individual base alignments often sub-optimal at indels • Approach • Fast read mapping with BWT-based aligner • Realign reads at indel sites using gold standard (but much slower) Smith-Waterman1 algorithm • Benefits • Refines location of indels • Reduces erroneous SNP calls • Very high alignment accuracy in significantly less time, with fewer resources

Local Realignment Raw BWA alignment Post re-alignment at indels DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Duplicate Marking Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads Covered in genotyping presentation Note that this is done after alignment

Base Quality Recalibration • STEP 1: Find covariates at non-dbSNP sites using: • Reported quality score • The position within the read • The preceding and current nucleotide (sequencer properties) • java -Xmx4g -jar GenomeAnalysisTK.jar \ • -T BaseRecalibrator \ • -I alignment.bam\ • -R hg19/ucsc.hg19.fasta \ • -knownSiteshg19/dbsnp_135.hg19.vcf\ • -o alignment.recal_data.grp • STEP 2: Generate BAM with recalibrated base scores: • java -Xmx4g -jar GenomeAnalysisTK.jar \ • -T PrintReads \ • -R hg19/ucsc.hg19.fasta \ • -I alignment.bam\ • -BQSR alignment.recal_data.grp\ • -o alignment.recalibrated.bam Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads

Base Quality Recalibration (Cont’d)

Getting Started Is there an easier way to get started?!!

Getting Started http://galaxy.psu.edu/ Click “Use Galaxy”

Q&A

Read Processing and Mapping: From Raw to Analysis-ready Reads

Read Processing and Mapping: From Raw to Analysis-ready Reads

Presentation Transcript

TEM EDS X-ray analysis

Process Mapping

Hyperspectral image processing and analysis

Curriculum Mapping

LED Wafer Processing

Introduction to Intervention Mapping

13. Positional Cloning and Gene Mapping

Advanced Spatial Analysis

Understanding the Patient Journey Process Mapping and Value Stream / Flow Analysis

GIS and Geologic Mapping Day 2

Deep Learning from Speech Analysis/Recognition to Language/Multimodal Processing

Image Processing and Analysis

Getting the manuscript ready 1

Grammatical processing with LFG and XLE

Image analysis and computer vision

Chapter 3 Genetic analysis of l inkage and chromosome mapping

SAP FI Accounts Receivable

2013 CAD Contest Technology Mapping for Macro Blocks

Texture Maps

ARCH Medical Center

Are you ready for a new challenge? Here is another set of words. How many can you read?