Groups

Groups Anschutz Bunsa Mishra Verdear Machado Johnson Rodriguez Kamenkovic Patel Zhang Nwosu Reneau Voutsinas

BLAST XML from Bio.Blast.Applications import NcbiblastpCommandline blastp_cline = NcbiblastpCommandline(cmd ="~/ncbi-blast-2.7.1+/bin/blastp", query="alphaProtein.faa", db="HS", evalue=0.01, out="out.xml", outfmt=5) blastp_cline() xml format

BLAST XML from Bio.Blast import NCBIXML result_handle = open("out.xml") blast_records = NCBIXML.parse(result_handle) for br in blast_records: for alignments in br.alignments: for hsp in alignments.hsps: if hsp.expect < 1e-10: print alignments.title running through blast records running through alignments

NGS pipeline Sequence Alignment/Map Binary Alignment/Map Variant Call Format from Dolled-Filhart et al, 2013

NGS File formats • Raw data from various vendors => various formats (FASTQ) • Different quality metrics (some more stringent than others) • As data analysis proceeds, end up with even more formats: • GenBank formats (SRA) • Alignments are in SAM/BAM • Genome Browser formats (wig, bed, gff, etc) • Variants in vcf files (SNPs, indels, etc)

FASTQ format: sequence + quality sequence identifier header 1 2 raw sequence sign 3 optional description 4 quality string c.f. S. Brown NYU @SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152 NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC +SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152 +50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII

Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same c.f. S. Brown NYU MCL-SRR350952.1 99 chr13 28330526 29 76M = 28330636 183 NAAAGACACAGTTACATGAAGAACATACTCCTCTCTCAGACTGCCCAGGTTCAGTGATTCATTCAACAAACTTTAT #,())*3--.@@@@@@@@@@<<<::87999<<<<<@@@@@@@:@@<:<<<<<8<<::7::::::::@@@22::::@ XT:A:U NM:i:1 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0T75 XA:Z:chr13,+28330526,76M,1; MCL-SRR350952.1 147 chr13 28330636 29 73M3S = 28330526 -183 TATAGGATTCAACTGTGAGAAAGACATATTAATCTCTTCCATTGTGCAGACTACATTCTTTTTTTTTTTTTTTGAG #####################B7:?2,?8;;@+=+A@3DEBB?2?5B7=A?=?4<8;;AIGIDIHFAIIGHIIBII XT:A:M NM:i:2 SM:i:29 AM:i:29 XM:i:2 XO:i:0 XG:i:0 MD:Z:18A18G35 MCL-SRR350952.2 99 chr9 16437227 60 76M = 16437304 153 NGAAATGCAAGGCTGTTTGGGATGTTTTCGAAGTGATGAATGCTGGAAGGATTGCTGTTCTCTAAGTGAGCAAGGA ############################################################################ XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0A75 XA:Z:chr9,+16437227,76M,1; MCL-SRR350952.2 147 chr9 16437304 60 76M = 16437227 -153 GCTGGGACTCCTGGTGCGATTATTGCTCTCAATGAAAGTCCTTATATCTGAGTCTGTCTTTGAAGATGGTACAGCC DBAEA<>G>GGDGG<DD3E<CDCC>E?D?E@DGDDDG8BGGGEGEG@E@@BCCFEE,IHIIGEGGEFCCF<FFFFD XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XA:Z:chr9,-16437304,76M,0;

Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same Coor12345678901234 5678901234567890123456789012345 Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT header seq info @HD VN:1.5 SO: coordinate @SQ SN:ref LN:45 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 c.f. S. Brown NYU

Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same Coor12345678901234 5678901234567890123456789012345 Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT @HD VN:1.5 SO: coordinate @SQ SN:ref LN:45 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 Query template NAME POSition QUERY Sequence bitwise FLAG CIGAR MateReference NaMe query QUAL MAPing quality Inferred Insert SIZE Reference seq. NAME c.f. S. Brown NYU

Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same @HD VN:1.5 SO: coordinate @SQ SN:ref LN:45 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 bitwise FLAG CIGAR c.f. S. Brown NYU

Sequence Alignment Map standard file format for sequences aligned to reference genomeCIGAR example RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Reference: C C A T A C T G A A C T G A C T A A C Read: ACTAGAATGGCT RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Reference: C C A T A C T G A A C T G A C T A A C Read: A C T A G A A T G G C T POS: 5 CIGAR: 3M1I3M1D5M c.f. S. Brown NYU

VCF format: variants c.f. S. Brown NYU chr1 901559 . G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15 chr1 905974 . C T 187 . DP=22;VDB=0.0249;AF1=0.5;AC1=1;DP4=5,4,9,3;MQ=59;FQ=170;PV4=0.4,0.2,0.2,0.49 GT:PL :GQ 0/1:217,0,197:99 chr1 907622 . G C 82.3 . DP=8;VDB=0.0474;AF1=1;AC1=2;DP4=0,0,2,3;MQ=54;FQ=-42 GT:PL:GQ 1/1:115,15,0: 27 chr1 909073 . C T 39 . DP=16;VDB=0.0504;AF1=0.5;AC1=1;DP4=0,9,0,7;MQ=59;FQ=42;PV4=1,7.5e-09,0.14,0.49 GT:PL :GQ 0/1:69,0,156:72 chr1 909238 . G C 115 . DP=10;VDB=0.0503;AF1=1;AC1=2;DP4=0,0,4,4;MQ=60;FQ=-51 GT:PL:GQ 1/1:148,24,0: 45 chr1 909309 . T C 3.01 . DP=6;VDB=0.0018;AF1=0.4998;AC1=1;DP4=0,2,0,3;MQ=60;FQ=4.72;PV4=1,0.02,1,0.0071 GT:PL :GQ 0/1:30,0,47:28 chr1 909768 . A G 137 . DP=21;VDB=0.0530;AF1=1;AC1=2;DP4=0,0,11,9;MQ=60;FQ=-87 GT:PL:GQ 1/1:170,60,0: 99 chr1 935222 . C A 4.61 . DP=4;VDB=0.0216;AF1=1;AC1=2;DP4=0,0,0,2;MQ=60;FQ=-33 GT:PL:GQ 1/1:34,6,0:5

VCF format: variants ALT QUAL #CHROM POS ID REF QUAL INFO AA ancestral allele AC allele count in genotypes, for each ALT allele, in the same order as listed AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes AN total number of alleles in called genotypes BQ RMS base quality at this position CIGAR cigar string describing how to align an alternate allele to the reference allele DB dbSNP membership DP combined depth across samples, e.g. DP=154 END end position of the variant described in this record (esp. for CNVs) H2 membership in hapmap2 MQ RMS mapping quality, e.g. MQ=52 MQ0 Number of MAPQ == 0 reads covering this record NS Number of samples with data SB strand bias at this position SOMATIC indicates that the record is a somatic mutation, for cancer genomics VALIDATED validated by follow-up experiment c.f. S. Brown NYU chr1 901559 . G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15

Alignment or NGS What are the challenges? c.f. S. Brown NYU

Aligning millions of short reads Length of each read = 50-300 Number of reads = 107 – 108 Genome length = 3.109 or double if diploid • Computationally intensive • Aligning to reference genome => mapping the reads • Aligning de novo => genome assembly • Either way, using something like S-W or BLAST would take too long, so you modify them to take shortcuts (heuristics). • Heuristics include using indexing methods

A short word about indexing Just like the word and page number of the index of a book, this creates an index of short sequences, with their location on the genome. The index file contains index entries made up of a search key value and a pointer to the block in the data file. Hashing (or keys) is when you add line numbers and describe the contents by a minimum of characters, called seeds. Techniques are very well defined in computer science End result is that the files takes up a lot less space, take a lot less time to search

Aligning millions of short reads • Computationally intensive • Aligning to reference genome => mapping the reads • Aligning de novo => genome assembly • Heuristics include using indexing methods, some use ungapped alignment with short words • BWT (Burrows-Wheeler Transformation) greatly reduced computational time • Approaches include using hash tables, spaced seeds, contiguous seeds etc

Mapping The step of aligning the reads to the reference genome. - index the reads then scan them against the reference - find the reference match that has the lowest mismatch - calculate the p-value, and assign each read its location Problems: accuracy, splice junctions, variants Challenges: false positives, repeats, parental alleles, HapMap

NGS alignment algorithms Enter BWT Smith Waterman BLAST BLAT is precomputed BLAST

Groups

Groups

Presentation Transcript

Groups

GROUPS

Groups

Groups

Groups

GROUPS!!!

Groups

GROUPS!!!

Groups

Groups :

Groups

GROUPS

Groups

Groups

GROUPS

GROUPS

GROUPS

In-Groups/Out-Groups

Groups

Groups

Groups

Groups