
Analysis of Next-Generation Sequencing (NGS) Data

Analysis of Next-Generation Sequencing (NGS) Data. Yun Li, Department of Genetics, Department of Biostatistics, University of North Carolina. Notes up-front: the focus of my part today is next-generation DNA sequencing (vs RNA-seq, ChIP-seq, etc.) and one particular type of genetic variant: SNPs.


Presentation Transcript


  1. Analysis of Next-Generation Sequencing (NGS) Data Yun Li Department of Genetics Department of Biostatistics University of North Carolina

  2. Notes up-front • Focus of my part today • Next-Generation DNA Sequencing • vs RNA-seq, ChIP-seq, etc. • One particular type of genetic variant: SNPs • vs indels, CNVs, SVs • Diploid humans • vs other model organisms • Complex disease genetics • vs Mendelian diseases

  3. Outline This slide is updated. • Introduction to Basic Biology and Genetics • Introduction to NGS Technology (Illumina Solexa technology as an Example) • A Typical Workflow for NGS Analysis • Raw NGS Data • Read Alignment and Basic Quality Control • SNP Detection and Genotype Calling • Design of NGS-based Studies

  4. Introduction part 1: Biology & genetics primer

  5. The Human Genome • Genome: an individual's or species' genetic constitution; made up of chromosomes • Chromosome: threadlike body found in the nucleus of the cell and containing the genes; made up of double-stranded DNA and protein • Human Genome: • composed of 46 = 2 × 23 chromosomes (diploid) • 22 autosomes, numbered 1 to 22, mostly from longest to shortest • Present in two copies • One paternal, one maternal • Sex chromosomes X, Y: males XY; females XX • Total ~3 billion base pairs (bp) of DNA

  6. The Human Genome (cont’d) • Gene: segment of DNA with a detectable function (e.g., coding for a protein); ~20,000 genes in the human genome • Locus: specific gene or DNA segment or region on a chromosome • Allele: a particular form of a gene or DNA segment • Polymorphic • Monomorphic • Figure: karyotype of a male

  7. DNA • DNA (deoxyribonucleic acid): heteropolymer molecule constructed of sugars, phosphates, and bases that carries the genetic information • DNA is the information store: it encodes the information that cells and organisms need to reproduce • DNA variation is responsible for many individual differences

  8. DNA (cont’d) • Base pair (bp): DNA is double-stranded; each strand is a series of bases • Nucleotide: one nucleobase (nitrogenous base), a five-carbon sugar, and one phosphate group

  9. Transition vs Transversion Mutation • Two groups of nitrogenous bases • Purines: A and G • Pyrimidines: C and T • Types of mutations • Transitions: within-group (A<->G; C<->T) • Transversions: between-group (A<->C; A<->T; G<->C; G<->T) • The transition-to-transversion ratio is ~2
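The purine/pyrimidine grouping on this slide can be turned into a small classifier. The following is a minimal sketch, not part of the original slides; the function names `classify_substitution` and `ti_tv_ratio` are made up for illustration.

```python
# Illustrative sketch: classify a single-base substitution using the
# purine/pyrimidine grouping from the slide. Names are hypothetical.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' (within-group) or 'transversion' (between-group)."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical; not a substitution")
    same_group = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_group else "transversion"


def ti_tv_ratio(substitutions: list) -> float:
    """Transition-to-transversion ratio for a list of (ref, alt) pairs."""
    ti = sum(1 for r, a in substitutions
             if classify_substitution(r, a) == "transition")
    tv = len(substitutions) - ti
    return ti / tv if tv else float("inf")
```

For example, `classify_substitution("A", "G")` is a transition and `classify_substitution("A", "C")` is a transversion. A call set whose ratio is far from the ~2 quoted on the slide may deserve a closer look, although the expected value depends on the region sequenced.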

  10. Central Dogma • DNA -> RNA -> Protein (genome, transcriptome, proteome) • Transcription and translation

  11. Genetic Markers • SNP/SNV: Single Nucleotide Polymorphism/Variant • Haplotype: set of alleles together on the same chromosome • Haplotype1: AAGGGATCCAC • Haplotype2: AAGGAATCCAC
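To make the haplotype example concrete, the two sequences on this slide can be compared base by base. This is a small sketch with a hypothetical helper name, `haplotype_differences`; positions are reported 1-based.

```python
# Illustrative sketch: list the positions where two equal-length
# haplotypes differ. The function name is hypothetical.
def haplotype_differences(hap1: str, hap2: str) -> list:
    """Return (1-based position, allele in hap1, allele in hap2) tuples."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same length")
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(hap1, hap2))
            if a != b]


# The two haplotypes from the slide differ at a single site (a SNP).
print(haplotype_differences("AAGGGATCCAC", "AAGGAATCCAC"))
```

The two haplotypes differ only at position 5 (G vs A), which is the SNP the slide illustrates.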

  12. SNPs • Single nucleotide substitution • The most abundant type of genetic variant in the human genome • >30,000,000 cataloged in the human genome • Easy to score cheaply and accurately • Vast majority have two alleles (di-allelic or bi-allelic) • Nomenclature: rs number, e.g., rs10885409 • Basis for genome-wide association studies (GWAS) • Microarrays with 100,000s-1,000,000s of SNPs • 1,000s of disease and trait associations identified • http://www.genome.gov/gwastudies/: 6499 (as of 6/22/2012)

  13. Genetic Markers: Length Polymorphism • Microsatellite • Simple repeat sequence • Often a 2-4 bp repeat; e.g., ---CACACACA--- • Common in the genome, often with many different alleles; ~15,000 mapped to specific locations • Nomenclature: D number, e.g., D22S1 • Primary marker for linkage studies • Variable number of tandem repeats (VNTR) • Typical repeat of 10-100 bp • CNV/CNP: Copy Number Variant/Polymorphism • Typically > 200 bp • Indel: insertion or deletion variant

  14. Genetic Markers: Structural Variant • Structural variant • Generally defined as a region of DNA approximately 1 Kb or larger in size; can include inversions and balanced translocations or genomic imbalances (e.g., indels)

  15. Genotype and Phenotype • Diploid: two copies of each chromosome per cell, as in most human cells; pairs called homologues • Genotype: genetic constitution of the individual, usually referring to the locus or loci under study • Phenotype: • observed characters of individuals; expression of genes and other relevant factors • = trait: what we observe • Complex phenotype: determined by combinations of genes, environment, behavior. E.g., diabetes, hypertension, almost everything • Mendelian = simple phenotype: completely determined by one or a few genes. E.g., cystic fibrosis (CF).

  16. ABO Blood Group • Based on antigenic substances A and B present on the surface of red blood cells • Coded by a gene on chromosome 9 • Alleles: A, B, O • Genotypes • Homozygote: two copies of the same allele. E.g., AA, BB, OO • Heterozygote: two different alleles. E.g., AO, BO, AB • n alleles => n(n+1)/2 possible genotypes
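The n(n+1)/2 count on this slide can be checked by enumerating unordered allele pairs. A minimal sketch follows; the helper name `unordered_genotypes` is an assumption for illustration.

```python
# Illustrative sketch: enumerate unordered diploid genotypes for a set of
# alleles and check the n(n+1)/2 count from the slide.
from itertools import combinations_with_replacement


def unordered_genotypes(alleles: list) -> list:
    """All unordered pairs of alleles, including homozygotes."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]


abo = unordered_genotypes(["A", "B", "O"])
n = 3
print(abo)                           # the six ABO genotypes
print(len(abo) == n * (n + 1) // 2)  # count matches n(n+1)/2
```

With three alleles this yields the three homozygotes and three heterozygotes listed on the slide, six genotypes in total.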

  17. Genetic code • Universal translation from DNA and RNA to protein; 3 bases code for one amino acid (codon) • Synonymous vs non-synonymous variant
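As a rough illustration of synonymous vs non-synonymous variants, the sketch below looks up a reference and an alternate codon in a deliberately partial codon table. Only eight of the 64 codons of the standard genetic code are included, and the names `PARTIAL_CODON_TABLE` and `classify_codon_change` are hypothetical.

```python
# Illustrative sketch: a codon change is synonymous if both codons encode
# the same amino acid. The table is a small excerpt of the standard
# genetic code (DNA alphabet), not the full 64-codon table.
PARTIAL_CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GAT": "Asp", "GAC": "Asp",
    "AAA": "Lys", "AAG": "Lys",
    "TAA": "Stop", "TAG": "Stop",
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous' or 'non-synonymous' for two codons in the table."""
    ref_aa = PARTIAL_CODON_TABLE[ref_codon.upper()]
    alt_aa = PARTIAL_CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
print(classify_codon_change("GAA", "GAT"))  # Glu -> Asp
```

A real annotation tool would use the full table and the gene's reading frame; this excerpt only shows the comparison itself.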

  18. Introduction part 2: Intro to NGS technology

  19. History of DNA Sequencing

  20. A Road to Discovering the Human Genome • Human Genome Project: 1990-2003 • HapMap (hapmap.org): 2002- • 1000 Genomes (www.1000genomes.org): 2008-

  21. Different Approaches • Deep whole genome sequencing • Expensive; can currently be applied only to a limited number of samples • Most complete ascertainment of all variation • Low coverage whole genome sequencing • Modest cost, typically 100s-1,000s of individuals sequenced • Complete ascertainment of common variants • Less complete ascertainment of rare variants • Exome capture and targeted region sequencing • Modest cost, high coverage • Most interesting part of the genome

  22. With Complete Sequence Data • What is the contribution of each identified locus to a trait? • Multiple variants, common and rare • Effect size • What is the mechanism? What happens if we knock out a gene? • Most often, the causal variant is not examined directly by GWAS • Rare coding variants will provide important insights into mechanisms • What is the contribution of structural variation to disease? • These are hard to interrogate using current genotyping arrays • Are there additional susceptibility loci to be found? • Only a subset of functional elements include common variants • Rare variants are more numerous and thus will point to additional loci

  23. Mutation Allele Frequency Spectrum (n=100 chromosomes)

  24. Site Frequency Spectrum • Number of variant alleles at each site • (n = 10,422 European Americans) • Total < 200 variant sites discovered in gene HHEX (7.9 Kb) • Sanger sequencing, variants validated by 454 pyrosequencing • Black line: expected from the Wright-Fisher constant population size model, with mutation rate estimated by Watterson’s method • Ref: Coventry et al (2010) Nat Commun 1(8):131, Figure 3a.
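For readers unfamiliar with the "expected" line, the sketch below shows the textbook neutral expectation under a constant-size Wright-Fisher model: Watterson's estimator is θ_W = S / a_n with a_n = Σ_{i=1}^{n-1} 1/i, and the expected number of sites carrying i copies of the variant allele is θ/i. This is a generic illustration, not the computation from the cited paper; the function names and the input numbers are made up.

```python
# Illustrative sketch of the standard neutral expectation. S and n below
# are invented example values, not data from the cited study.
def harmonic_number(n_chromosomes: int) -> float:
    """a_n = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(segregating_sites: int, n_chromosomes: int) -> float:
    """Watterson's estimator: theta_W = S / a_n."""
    return segregating_sites / harmonic_number(n_chromosomes)


def expected_sfs(theta: float, n_chromosomes: int) -> list:
    """Expected number of sites with i variant copies, i = 1..n-1 (theta / i)."""
    return [theta / i for i in range(1, n_chromosomes)]


theta = watterson_theta(segregating_sites=50, n_chromosomes=100)
sfs = expected_sfs(theta, n_chromosomes=100)
print(round(sfs[0], 2), round(sfs[1], 2))  # singletons, doubletons
```

By construction the expected counts sum back to S. An excess of rare variants over this expectation is commonly read as a signature of recent population growth.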

  25. Sequencing Technologies • Sanger Capillary Sequencing • Essentially the single viable DNA sequencing technology for almost three decades since 1977 • Cost: ~$0.5 per Kb (~$1.5 million for a whole genome) • Time: ~100 min per Kb (>570 years for one genome) • The Human Genome Project took ~13 years at 5 major sites + >30 sites across the globe • This cost and throughput prohibited its application to large-scale sequencing-based studies.
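The whole-genome figures on this slide follow from the per-Kb rates and the ~3 billion bp genome size given earlier. A quick arithmetic sketch, assuming a single instrument running continuously (the function names are illustrative):

```python
# Illustrative arithmetic: scale per-Kb Sanger cost and time to a
# ~3 billion bp genome at 1X. Assumes one machine running nonstop.
GENOME_SIZE_BP = 3_000_000_000
GENOME_SIZE_KB = GENOME_SIZE_BP / 1000


def whole_genome_cost_usd(cost_per_kb: float) -> float:
    return cost_per_kb * GENOME_SIZE_KB


def whole_genome_time_years(minutes_per_kb: float) -> float:
    minutes_per_year = 60 * 24 * 365
    return minutes_per_kb * GENOME_SIZE_KB / minutes_per_year


print(whole_genome_cost_usd(0.5))             # 1.5 million USD
print(int(whole_genome_time_years(100)))      # just over 570 years
```

Both results match the slide: about $1.5 million and a little over 570 years for a single pass over the genome.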

  26. NGS • Next-generation sequencing (NGS) • AKA massively parallel sequencing (MPS), high-throughput sequencing (HTS) • Debut ~2004-2005 • Cost: <$0.00005 per Kb (~$150 for 1X coverage) • The drop in costs is more dramatic than Moore’s Law.

  27. Sequencing Cost Drop Beats Moore’s Law

  28. NGS • Next-generation sequencing (NGS) • AKA massively parallel sequencing (MPS), high-throughput sequencing (HTS) • Debut ~2004-2005 • Cost: <$0.00005 per Kb (~$150 for 1X coverage) • The drop in costs is more dramatic than Moore’s Law. • Time: • <0.002 min per Kb (~4 days for a whole genome) • Illumina HiSeq 2000: 100-300 Gb per 8 days!
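The NGS numbers can be checked the same way, and the HiSeq 2000 throughput can be restated as fold coverage (total bases sequenced divided by genome size). This is a back-of-the-envelope sketch assuming a 3 Gb genome; the helper name `fold_coverage` is made up.

```python
# Illustrative arithmetic for the NGS figures on the slide, assuming a
# 3 Gb genome. Coverage here is simply bases sequenced / genome size.
GENOME_SIZE_BP = 3_000_000_000


def fold_coverage(bases_sequenced: float) -> float:
    return bases_sequenced / GENOME_SIZE_BP


cost_1x_usd = 0.00005 * GENOME_SIZE_BP / 1000           # about 150 USD
days_1x = 0.002 * GENOME_SIZE_BP / 1000 / (60 * 24)     # about 4.2 days
low, high = fold_coverage(100e9), fold_coverage(300e9)  # one 8-day run

print(round(cost_1x_usd), round(days_1x, 1), round(low), round(high))
```

Under these assumptions, one 8-day run yields roughly 33X to 100X of a single human genome, which is why a run is typically shared across samples or used for deep coverage.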

  29. At a Cost, Though • Shorter reads • Sanger sequencing: up to ~1 Kb • NGS technologies: typically 30-400 bp • Implication: many tasks (e.g., assembly, read alignment, haplotyping) become more challenging • Higher per-base sequencing error rate • Sanger sequencing: < 0.001% • NGS: 0.5-1% • Implication: need redundant sequencing of each base to distinguish sequencing errors from true polymorphisms
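Why redundancy helps can be seen with a simple binomial calculation: if each read has an independent chance of a miscall at a given base, several reads showing an error at the same site is much less likely than one. The sketch below is a simplified model (independent errors, no distinction between the three possible wrong bases), and `prob_at_least_k_errors` is a hypothetical helper.

```python
# Illustrative sketch: probability that at least k of `depth` reads carry
# a sequencing error at one site, assuming independent errors at a fixed
# per-base rate. A simplification for intuition only.
from math import comb


def prob_at_least_k_errors(depth: int, error_rate: float, k: int) -> float:
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


# With a 1% error rate, one erroneous read among 30 is fairly likely,
# but three or more erroneous reads at the same site is rare.
print(prob_at_least_k_errors(30, 0.01, 1))
print(prob_at_least_k_errors(30, 0.01, 3))
```

This is the intuition behind requiring multiple supporting reads before calling a variant; real callers also use base and mapping qualities.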

  30. Commonly Used Technologies • Illumina Solexa sequencing-by-synthesis • Roche 454 pyrosequencing • Applied Biosystems SOLiD • Helicos Biosciences • Pacific Biosciences • Ion Torrent • Complete Genomics • Oxford Nanopore • …

  31. Illumina Solexa Technology Mardis (2008), Annual Review of Genomics and Human Genetics 9: 387-402

  32. Illumina Solexa Technology (cont’d) Reversible terminators: F (fluorescent labels) Metzker (2010), Nat Rev Genet 11: 31-46

  33. Washed out un-incorporated nucleotides, take picture Metzker (2010), Nat Rev Genet 11: 31-46

  34. Metzker (2010), Nat Rev Genet 11: 31-46

  35. Metzker (2010) Nat Rev Genet 11: 31-46

  36. Metzker (2010) Nat Rev Genet 11: 31-46

  37. Metzker (2010) Nat Rev Genet 11: 31-46

  38. Metzker (2010) Nat Rev Genet 11: 31-46

  39. Paired-ends

  40. Mate Pairs/Paired Ends Medvedev et al (2009) Nat Methods 6: S13-20

  41. Paired-End

  42. Paired-End and Indel • Read pairs are generated by shearing DNA into fragments of approximately the same length (300±80 bases) and then sequencing ~35 bases at each end • Manske & Kwiatkowski (2009) Genome Res 19: 2125-2132
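The idea behind using paired ends to spot indels can be sketched as a comparison between the distance at which the two reads map on the reference and the expected fragment length. The sketch below treats the ±80 bases as a hard tolerance, which is an assumption made for illustration (in practice a threshold would be derived from the empirical insert-size distribution); `classify_read_pair` is a hypothetical name.

```python
# Illustrative sketch: flag read pairs whose mapped span on the reference
# departs from the expected fragment length. Treating +/-80 as a hard
# cutoff is an assumption, not the cited paper's method.
def classify_read_pair(mapped_span: int,
                       expected: int = 300,
                       tolerance: int = 80) -> str:
    if mapped_span > expected + tolerance:
        # Reads map farther apart than the fragment is long: the sample
        # is probably missing reference sequence between them.
        return "possible deletion"
    if mapped_span < expected - tolerance:
        # Reads map closer together than expected: the sample probably
        # carries extra sequence absent from the reference.
        return "possible insertion"
    return "concordant"


for span in (310, 520, 150):
    print(span, classify_read_pair(span))
```

A single discordant pair is weak evidence; callers usually require a cluster of pairs pointing to the same event.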

  43. Deletion? Insertion? Deletion? Insertion?

  44. Deletion? Insertion? Deletion? Insertion?

  45. Deletion Insertion

  46. What do NGS data look like?

  47. Break: 9:10-9:25

  48. Real Data

  49. Now what do our data look like? • What do you want them to look at?
