Vanderbilt Center for Quantitative Sciences Summer Institute: Sequencing Analysis (DNA). Yan Guo.
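Alignment ATCGGGAATGCCGTTAACGGTTGGCGT Reference genome: the human genome is about 3 billion base pairs (3,000,000,000) in length. If a read is 100 bp long, what is the probability of a unique alignment? Assuming the four bases are equally likely, a given 100 bp sequence matches a random position with probability 1/(4 x 4 x 4 ... x 4) = 1/4^100 = 1/1.60694E+60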
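The arithmetic on this slide can be checked with a short Python sketch. It assumes independent, equally likely bases, which real genomes do not satisfy (repeats make many reads ambiguous in practice), so treat it as a back-of-the-envelope estimate. The function names are illustrative and not from the slides.

```python
from fractions import Fraction

GENOME_LENGTH = 3_000_000_000  # approximate human genome size from the slide
READ_LENGTH = 100              # read length from the slide


def chance_match_probability(read_length: int) -> Fraction:
    """Probability that a fixed read matches one random position base by base,
    assuming four independent, equally likely bases (a simplification)."""
    return Fraction(1, 4 ** read_length)


def expected_chance_matches(read_length: int, genome_length: int) -> float:
    """Expected number of genome positions matching the read purely by chance."""
    positions = genome_length - read_length + 1
    return float(positions * chance_match_probability(read_length))


print(f"4^{READ_LENGTH} = {float(4 ** READ_LENGTH):.5e}")
print(f"expected chance matches = "
      f"{expected_chance_matches(READ_LENGTH, GENOME_LENGTH):.3e}")
```

Under these assumptions the expected number of chance matches across the whole genome is far below one, which is the intuition behind calling a 100 bp alignment effectively unique.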
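Alignment Tools • BWA http://bio-bwa.sourceforge.net/ • Bowtie http://bowtie-bio.sourceforge.net/index.shtml • A naive alignment that compares each of 30 million reads against every position of the reference would take about 30 million x 3 billion time units. • Both tools are based on the Burrows-Wheeler transform, which indexes the reference so that a read can be searched without scanning the whole genome.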
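To make the Burrows-Wheeler idea concrete, here is a minimal Python sketch of the transform and its inverse using sorted rotations. This is a teaching version only: BWA and Bowtie build the transform through suffix-array based indexes and add an FM-index for searching, none of which is shown here. The `$` sentinel and the function names are assumptions of this sketch.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic memory,
    fine for short strings, unusable for a real genome)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


# Cost of the naive approach mentioned on the slide, in abstract time units.
NAIVE_COST = 30_000_000 * 3_000_000_000

read = "ATCGGGAATGCCGTTAACGGTTGGCGT"
print(bwt(read))
print(inverse_bwt(bwt(read)) == read)
print(f"naive cost: {NAIVE_COST:.1e} time units")
```

The transform is reversible, which is what lets an aligner store only the transformed reference plus small auxiliary tables.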
Alignment Results – BAM files • SAM – uncompressed text • BAM – compressed binary • http://samtools.github.io/hts-specs/SAMv1.pdf • Sort and index before performing analysis • Don’t forget to perform QC on the alignment
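As a rough illustration of what a SAM record contains, the sketch below parses the 11 mandatory tab-separated fields and decodes two FLAG bits (0x4 for unmapped, 0x10 for reverse strand), as defined in the SAM specification linked above. The example record is made up for illustration. In practice you would read BAM files with a dedicated library rather than hand-parsing text.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line.
    Optional tag fields after column 11 are ignored in this sketch."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


# Hypothetical record: read name, position and qualities are invented.
example = "\t".join(["read1", "16", "chr1", "1000", "60", "27M", "*", "0", "0",
                     "ATCGGGAATGCCGTTAACGGTTGGCGT", "I" * 27])
rec = parse_sam_line(example)
print(rec.rname, rec.pos, is_unmapped(rec), is_reverse_strand(rec))
```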
How to call SNPs http://www.broadinstitute.org/igv/
Recalibration Why do we need realignment and recalibration for DNA but not RNA?
SNP calling • GATK https://www.broadinstitute.org/gatk/ • VarScan http://varscan.sourceforge.net/
Annotation using ANNOVAR http://www.openbioinformatics.org/annovar/
Somatic Mutation • Different from SNP (not germline) • Both tumor and normal samples are needed to accurately define a somatic mutation • Tumor sample is almost never 100% tumor
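The last point matters because normal cells in the sample dilute the mutant allele. A minimal sketch of the usual purity adjustment is shown below. It assumes a clonal mutation and known copy numbers, and the default values (one mutated copy, diploid tumor and normal) are assumptions of this example rather than anything stated on the slide.

```python
def expected_vaf(purity: float,
                 mutated_copies: int = 1,
                 tumor_copies: int = 2,
                 normal_copies: int = 2) -> float:
    """Expected variant allele fraction of a clonal somatic mutation.

    purity         fraction of cells in the sample that are tumor (0 to 1)
    mutated_copies copies of the locus carrying the mutation in tumor cells
    tumor_copies   total copies of the locus in tumor cells
    normal_copies  total copies of the locus in normal cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    mutant = purity * mutated_copies
    total = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant / total


for purity in (1.0, 0.6, 0.2):
    print(f"purity {purity:.0%}: expected VAF {expected_vaf(purity):.2f}")
```

A heterozygous somatic mutation in a low-purity sample can therefore appear at an allele fraction well below 50%, which is one reason a matched normal sample helps.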
Somatic mutation callers • MuTect http://www.broadinstitute.org/cancer/cga/mutect • VarScan http://varscan.sourceforge.net/
Quality Control on SNPs • Number of Novel Non-synonymous SNP ~ 100 – 200 • Transition / transversion ratio • Heterozygous / non reference homozygous ratio • Heterozygous consistency • Strand Bias • Cycle Bias
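Two of these metrics are simple counts, sketched below in Python. The input representation, a list of (ref, alt) pairs and a list of genotype strings such as "0/1", is an assumption of this example. Real pipelines would read these from a VCF file.

```python
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transition / transversion ratio over (ref, alt) single-base pairs."""
    transitions = transversions = 0
    for ref, alt in snps:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    """Heterozygous / non-reference homozygous ratio for biallelic genotypes."""
    het = hom_alt = 0
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        if alleles == ["1", "1"]:
            hom_alt += 1
        elif sorted(alleles) == ["0", "1"]:
            het += 1
    if hom_alt == 0:
        raise ValueError("no non-reference homozygous calls; ratio is undefined")
    return het / hom_alt


# Toy input, invented for illustration only.
print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
print(het_hom_ratio(["0/1", "1|0", "1/1", "0/0"]))
```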
Heterozygous / non reference homozygous ratio by race and regions
Pooled Analysis • Pool samples together without barcode • Save money • Can only be used to evaluate allele frequency
Known – Things we have always known that sequencing data can do • SNV, mutation • CNV (Xie et al. BMC Bioinformatics, 2009) • Structural variants (Alkan et al. Nature Reviews Genetics, 2011)
Known Unknown – Other information we found that sequencing data contain • SNVs and mutations in non-targeted regions • Mitochondria • Virus and microbe
How is additional data mining possible? • Data mining is possible because capture techniques are not perfect.
Potential Functions of Intron and Intergenic Regions • ENCODE suggested that over 80% of the human genome may be functional. • The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic)
Coverage of the Unintended Regions • The coverage doesn’t just drop off suddenly after the capture region ends. • Capture region example: chr1 1000 1500
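One way to look at this is to label each aligned position by its distance to the nearest capture region. The sketch below uses the example region from the slide and assumes BED-style 0-based, half-open coordinates. The 200 bp flank width is a hypothetical value chosen for illustration, not a number from the slides.

```python
from typing import List, Optional, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)

CAPTURE_REGIONS: List[Region] = [("chr1", 1000, 1500)]  # example from the slide
FLANK = 200  # hypothetical flank width in bp


def distance_to_nearest_region(chrom: str, pos: int,
                               regions: List[Region]) -> Optional[int]:
    """Distance in bp from a 0-based position to the closest capture region
    on the same chromosome; 0 if inside a region, None if there is none."""
    distances = []
    for region_chrom, start, end in regions:
        if region_chrom != chrom:
            continue
        if pos < start:
            distances.append(start - pos)
        elif pos >= end:
            distances.append(pos - end + 1)
        else:
            distances.append(0)
    return min(distances) if distances else None


def classify(chrom: str, pos: int, regions: List[Region] = CAPTURE_REGIONS,
             flank: int = FLANK) -> str:
    """Label a position as on-target, flanking, or off-target."""
    distance = distance_to_nearest_region(chrom, pos, regions)
    if distance is None:
        return "off-target"
    if distance == 0:
        return "on-target"
    return "flanking" if distance <= flank else "off-target"


for position in (1200, 1600, 5000):
    print(position, classify("chr1", position))
```

Counting reads per label across a sample gives a quick picture of how much usable coverage falls outside the intended targets.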
Reads Aligned to Non-Target Regions Can Be Used to Detect SNPs • Tibetan exome study: through exome sequencing of 50 Tibetan subjects, 2 intronic SNPs were found to be associated with high altitude. (Yi, et al. Science, 2010) • Non-capture region study: reads from non-capture regions were shown to support reliable SNP inference. (Guo, et al. BMC Genomics)
Known unknown - Mitochondria • However, the mitochondrial genome is only 16,569 bp long. • Assumptions: 40 million reads, 100 bp read length
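The slide's assumptions are enough for a rough depth calculation once a fraction of reads mapping to the mitochondrial genome is chosen. That fraction is not given in the slides, so the 0.1% used below is purely a hypothetical placeholder. The sketch only shows how the arithmetic works.

```python
TOTAL_READS = 40_000_000   # from the slide
READ_LENGTH = 100          # bp, from the slide
MITO_LENGTH = 16_569       # bp, from the slide
GENOME_LENGTH = 3_000_000_000  # approximate human genome size

MITO_FRACTION = 0.001  # HYPOTHETICAL: fraction of reads that map to mtDNA


def mean_coverage(n_reads: int, read_length: int, target_length: int) -> float:
    """Average depth if n_reads of read_length bp are spread over the target."""
    return n_reads * read_length / target_length


mito_reads = round(TOTAL_READS * MITO_FRACTION)
mito_depth = mean_coverage(mito_reads, READ_LENGTH, MITO_LENGTH)
genome_depth = mean_coverage(TOTAL_READS, READ_LENGTH, GENOME_LENGTH)

print(f"mitochondrial depth at {MITO_FRACTION:.1%} of reads: {mito_depth:.0f}x")
print(f"whole-genome depth if all reads were spread evenly: {genome_depth:.2f}x")
```

Because the mitochondrial genome is so short, even a small share of reads translates into deep coverage. The actual share depends on tissue and capture kit.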
Extract mitochondria from exome sequencing Tools: • Picardi et al. Nature Methods, 2012 • Guo et al. Bioinformatics, 2013 (MitoSeek) Diagnosis: • Dinwiddie et al. Genomics, 2013 • Nemeth et al. Brain, 2013
Virus • Virus sequences can be captured through high throughput sequencing of human samples • HBV in liver cancer samples (Sung, et al. Nature Genetics, 2012) (Jiang, et al. Genome Research, 2012) • HPV in head and neck cancer (Chen, et al. Bioinformatics, 2012)
Tools for Detecting Virus from Sequencing Data • PathSeq (Kostic, et al. Nature Biotechnology, 2011) • VirusSeq (Chen, et al. Bioinformatics, 2012) • ViralFusionSeq (Li, et al. Bioinformatics, 2012) • VirusFinder (Wang, et al. PLOS ONE, 2013)
The Data Mining Ideas Applied to RNA • RNAseq has been used as a replacement for microarrays. • Other applications of RNAseq include detection of alternative splicing and fusion genes. • Additional data mining opportunities are also available for RNAseq data
SNV and Indel • Difficult due to a high false positive rate • RNAMapper (Miller, et al. Genome Research, 2013) • SNVQ (Duitama, et al. BMC Genomics, 2013) • FX (Hong, et al. Bioinformatics, 2012) • OSA (Hu, et al. Bioinformatics, 2012)
Microsatellite instability Examples: • Yoon, et al. Genome Research 2013 • Zheng, et al. BMC Genomics, 2013
RNA Editing and Allele-specific Expression • RNA editing tools and databases: DARNED, REDIdb, dbRES, RADAR • Allele-specific expression: asSeq (Sun, et al. Biometrics, 2012), AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)
Exogenous RNA • Virus (Same as DNA) • Food RNA (you are what you eat) Wang, et al. PLOS ONE, 2012
Unknown Unknown • Contamination • Unknown treasures • Reference is not perfect
Exome Samuels, et al. Trends in Genetics, 2013