
Single Nucleotide Polymorphism, Copy Number Variations, and SNP Array Xiaole Shirley Liu and Jun Liu
Outline • Definition and motivation • SNP distribution and characteristics • Allele frequency, LD, population stratification • SNP discovery (unknown SNPs) and genotyping (known SNPs) • CNV detection
Polymorphism • Polymorphism: sites/genes with “common” variation, i.e., the less common (minor) allele has frequency ≥ 1%; otherwise the variant is called rare and the site is not considered polymorphic • First discovered (early 1980s): restriction fragment length polymorphism (RFLP) • Some definitions: • Locus: position on a chromosome where a sequence or gene is located • Allele: alternative form of the DNA at a locus
Polymorphism • Single Nucleotide Polymorphism (SNP) • Occasionally short (1-3 bp) indels are considered SNPs too • Arise from a DNA replication mistake in an individual germ-line cell, which is then transmitted to offspring • ~90% of human genetic variation • Copy number variations (CNV) • May or may not be genetic
Why Should We Care • Disease gene discovery • Association studies: certain SNPs are associated with susceptibility to diabetes • Chromosome aberrations: duplications / deletions might cause cancer • Personalized medicine • A drug may be effective only if you carry a particular allele
SNP Distribution • Most common type of variation, about 1 SNP per 100-300 bp • Balance between the rate at which mutations are introduced and the rate at which polymorphisms are lost • Most mutations are lost within a few generations • 2/3 are C/T differences • In non-coding regions, often fewer SNPs in more conserved regions • In coding regions, often more synonymous than non-synonymous SNPs
SNP Characteristics: Allele Frequency Distribution • Most alleles are rare (minor allele frequency < 10%)
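SNP Characteristics: Hardy-Weinberg Equilibrium (HWE) • In a population with genotypes BB, Bb, and bb, if p = freq(B) and q = freq(b), then at equilibrium the frequencies of BB, Bb, and bb are $p^2$, $2pq$, and $q^2$, respectively, and do not change between generations • Assumptions for HWE: no mutation, no migration (immigration or emigration), infinite population size, no selective pressure, random mating; genotype frequencies can deviate from HWE if any assumption is violated • It provides a baseline against which to measure change, e.g., the inbreeding index $F = (2pq - H) / 2pq$, where $H$ is the observed heterozygote frequency • More than 2 alleles: with allele frequencies $p_1, \dots, p_k$, homozygotes have frequency $p_i^2$ and heterozygotes $2 p_i p_j$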
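The HWE baseline can be checked numerically. The following is a minimal sketch, assuming genotype counts for one biallelic SNP are available; the function name `hwe_summary` and the example counts are hypothetical. It estimates p and q from the counts, derives the expected genotype counts, and reports a chi-square goodness-of-fit statistic together with the inbreeding index F = 1 - H_obs / (2pq).

```python
def hwe_summary(n_BB, n_Bb, n_bb):
    """Compare observed genotype counts with Hardy-Weinberg expectations.

    Assumes the SNP is polymorphic in the sample (0 < p < 1).
    """
    n = n_BB + n_Bb + n_bb
    p = (2 * n_BB + n_Bb) / (2 * n)  # freq(B), estimated by allele counting
    q = 1 - p                        # freq(b)
    observed = {"BB": n_BB, "Bb": n_Bb, "bb": n_bb}
    expected = {"BB": p * p * n, "Bb": 2 * p * q * n, "bb": q * q * n}
    # Goodness-of-fit statistic over the three genotype classes.
    chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in expected)
    # Inbreeding index: relative deficit of heterozygotes versus 2pq.
    F = 1 - (n_Bb / n) / (2 * p * q)
    return {"p": p, "q": q, "expected": expected, "chi2": chi2, "F": F}


# Hypothetical counts: the first sample sits at HWE, the second lacks heterozygotes.
at_equilibrium = hwe_summary(n_BB=360, n_Bb=480, n_bb=160)
het_deficit = hwe_summary(n_BB=400, n_Bb=400, n_bb=200)
```

Both toy samples have p = 0.6, so 2pq = 0.48; the second has an observed heterozygote frequency of 0.40, which yields a positive F.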
SNP Characteristics: Linkage Disequilibrium (LD) • [Figure: linkage equilibrium vs. linkage disequilibrium] • LD: alleles at two loci occur together more often than can be accounted for by chance, which usually indicates that the two loci are physically close on the DNA • In mammals, LD is often lost at ~100 kb • In fly, LD often decays within a few hundred bases
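SNP Characteristics: Linkage Disequilibrium • Statistical significance of LD • Chi-square test with 1 df on the 2×2 table of haplotype counts $n_{ij}$ • Expected counts: $e_{ij} = n_{i\cdot}\, n_{\cdot j} / n_T$, where $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals and $n_T$ is the grand total • Test statistic: $\chi^2 = \sum_{i,j} (n_{ij} - e_{ij})^2 / e_{ij}$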
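The chi-square test for LD can be sketched in a few lines. This is an illustrative example, assuming a 2×2 table of haplotype counts (rows: alleles at locus 1, columns: alleles at locus 2); the function name `ld_chi_square` and the counts are hypothetical.

```python
def ld_chi_square(table):
    """Chi-square statistic for a 2x2 haplotype count table.

    table[i][j] = number of haplotypes carrying allele i at locus 1
    and allele j at locus 2. Assumes all row and column totals are > 0.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total  # e_ij
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts for 100 haplotypes.
counts = [[40, 10],
          [10, 40]]
chi2 = ld_chi_square(counts)
# 3.841 is the 5% critical value of the chi-square distribution with 1 df.
significant = chi2 > 3.841
```

Here every expected count is 25, so each cell contributes 9 and the statistic is 36, well above the 1-df critical value.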
SNP Characteristics: Linkage Disequilibrium • Three ways to calculate LD • [Table: observed vs. expected haplotype frequencies]
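The slide's formulas did not survive extraction. Three commonly used LD measures are D, D' (Lewontin's normalized D), and r²; treating these as the three intended here is an assumption. A minimal sketch, with the hypothetical function name `ld_measures` and toy frequencies:

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, D', r^2) from haplotype and allele frequencies.

    p_AB: observed frequency of the A-B haplotype
    p_A, p_B: frequencies of allele A (locus 1) and allele B (locus 2)
    Assumes both loci are polymorphic (0 < p_A < 1 and 0 < p_B < 1).
    """
    # D: observed haplotype frequency minus the frequency expected
    # if the two loci were independent.
    D = p_AB - p_A * p_B
    # D': D scaled by its maximum possible magnitude given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = D / d_max if d_max > 0 else 0.0
    # r^2: squared correlation between the alleles at the two loci.
    r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, d_prime, r2


# Hypothetical frequencies: expected p_AB under independence is 0.25.
D, d_prime, r2 = ld_measures(p_AB=0.4, p_A=0.5, p_B=0.5)
```

In this sketch D' keeps the sign of D; many reports use its absolute value instead.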
SNP Characteristics: Linkage Disequilibrium • Haplotype block: a cluster of linked SNPs • Haplotype boundary: LD is strong within blocks and absent between blocks; the boundaries reflect recombination hotspots • Haplotype size distribution
SNP Characteristics: Linkage Disequilibrium • Haplotype blocks can be seen: each is a cluster of linked SNPs
SNP Characteristics: Linkage Disequilibrium • [C/T] [A/G] T X C [A/C] [T/A] • Possible haplotypes: $2^4 = 16$ • In reality, a few common haplotypes explain 90% of the variation • Tagging SNPs: • SNPs that capture most of the variation in haplotypes • Remove redundancy
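The redundancy that tagging SNPs remove can be illustrated with a toy example. This is a sketch only, assuming haplotypes are given as strings over the four variable sites above; the four haplotypes and the function name `pick_tag_snps` are hypothetical, and only perfectly redundant SNPs are grouped (practical tag selection usually relies on an r² threshold instead).

```python
def pick_tag_snps(haplotypes):
    """Group SNP sites that split the haplotypes identically; keep one tag per group.

    haplotypes: list of equal-length strings, one character per variable site.
    Returns (tag_sites, groups), both using 0-based site indices.
    """
    n_sites = len(haplotypes[0])
    groups = {}
    for site in range(n_sites):
        column = [h[site] for h in haplotypes]
        # Encode each site by which haplotypes share the first haplotype's allele,
        # so sites with different allele letters but the same split get the same key.
        key = tuple(allele == column[0] for allele in column)
        groups.setdefault(key, []).append(site)
    grouped_sites = list(groups.values())
    tag_sites = [sites[0] for sites in grouped_sites]
    return tag_sites, grouped_sites


n_possible = 2 ** 4  # four biallelic sites allow 16 haplotypes in principle

# Hypothetical common haplotypes over the sites [C/T] [A/G] [A/C] [T/A].
common_haplotypes = ["CAAT", "TGAT", "CACA", "TGCA"]
tags, groups = pick_tag_snps(common_haplotypes)
```

In this toy data, sites 0 and 1 always change together, as do sites 2 and 3, so genotyping sites 0 and 2 is enough to distinguish all four haplotypes.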
SNP Characteristics: Population Stratification • Population stratification: individuals are sampled from two (or more) genetically different populations; the differences between strata may be environmental, cultural, or genetic • Can give spurious results in case-control association studies, e.g., the “chopstick genes” example
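How stratification creates a spurious association can be shown with made-up numbers. In the sketch below, all counts are hypothetical: within each population the allele is unrelated to case status (odds ratio 1), but the populations differ in both allele frequency and the proportion of cases, so pooling them produces an apparent association.

```python
def odds_ratio(case_carrier, case_non, ctrl_carrier, ctrl_non):
    """Odds ratio of carrying the allele in cases versus controls."""
    return (case_carrier * ctrl_non) / (case_non * ctrl_carrier)


# Hypothetical counts: (case carriers, case non-carriers,
#                       control carriers, control non-carriers)
population_1 = (80, 20, 40, 10)  # carrier frequency 0.8 in cases and controls
population_2 = (4, 16, 20, 80)   # carrier frequency 0.2 in cases and controls

# Pool the two populations cell by cell, ignoring population membership.
pooled = tuple(a + b for a, b in zip(population_1, population_2))

or_pop1 = odds_ratio(*population_1)
or_pop2 = odds_ratio(*population_2)
or_pooled = odds_ratio(*pooled)
```

Both stratum-specific odds ratios are 1, while the pooled odds ratio is 3.5, which is why association studies typically analyze within strata or adjust for ancestry.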
SNP Discovery Methods • Sequencing individuals to find differences: too costly • First check whether big regions have SNPs • Basic idea: denature and re-anneal two samples, then detect heteroduplexes • Can pool samples (e.g., 10 Africans with 10 Caucasians) to speed up screening • Resequence to verify • dbSNP: 12M RefSNPs, 6M validated
SNP Genotyping • For a known SNP locus TT[C/A]AG, does this individual have CC, AA, or AC? • Many methods: • Hybridization-based methods: dynamic allele-specific hybridization, molecular beacons, SNP-array chip (simultaneously genotypes thousands of SNPs) • Enzyme-based methods: RFLP, PCR-based methods, flap endonuclease, primer extension, oligonucleotide ligation assay • Other methods (based on physical properties of DNA)
SNP Array • One SNP at a time or genome-wide (SNP array)
40 Probes Used Per SNP • Allele call: AA, BB, or AB • Signal: theoretically 1A+1B, 2A, or 2B • But could be, e.g., 1A+3B if one allele is amplified
SNP Chip for LOH • Loss of heterozygosity (LOH): tumor suppressor gene inactivation by allelic loss in cancers • [Figure: normal, first genetic hit, cancer; a heterozygous A/B SNP in normal tissue becomes A/A in the tumor, indicating LOH]
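The LOH idea can be sketched as a comparison of paired genotype calls. This is an illustration under stated assumptions, not the method used by any particular array software: it assumes matched normal and tumor calls (AA, AB, BB) for the same SNPs from one patient, and the function name `call_loh` and the calls are hypothetical.

```python
def call_loh(normal_calls, tumor_calls):
    """Label each SNP by comparing matched normal and tumor genotype calls.

    Only SNPs that are heterozygous (AB) in the normal sample are informative:
    a homozygous normal genotype cannot reveal the loss of one allele.
    """
    status = []
    for normal, tumor in zip(normal_calls, tumor_calls):
        if normal == "AB" and tumor in ("AA", "BB"):
            status.append("LOH")
        elif normal == "AB" and tumor == "AB":
            status.append("retained")
        else:
            status.append("uninformative")
    return status


# Hypothetical calls for five consecutive SNPs.
normal = ["AB", "AA", "AB", "AB", "BB"]
tumor = ["AB", "AA", "AA", "BB", "BB"]
loh_status = call_loh(normal, tumor)
```

A run of consecutive "LOH" calls along a chromosome is what would suggest a region of allelic loss.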
SNP Array for CNV • Collect data from normal / diseased samples on SNP arrays • Probe normalization, background subtraction • Use a hidden Markov model (HMM) to infer CNV
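The HMM step can be sketched with a small Viterbi decoder. This is a minimal sketch under several assumptions not stated on the slide: the input is a normalized log2 ratio (sample vs. normal) per probe in genomic order, there are three hidden states (loss, normal, gain) with Gaussian emissions, and the means, standard deviation, and transition probability below are hypothetical values chosen for illustration.

```python
import math

STATES = ["loss", "normal", "gain"]
# Hypothetical emission means: roughly log2(1/2), log2(2/2), log2(3/2).
MEANS = {"loss": -1.0, "normal": 0.0, "gain": 0.58}
SIGMA = 0.3  # hypothetical noise level, shared by all states
STAY = 0.9   # hypothetical probability of staying in the same state


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def log_trans(prev, cur):
    """Log transition probability; switching mass is split evenly."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(log_ratios):
    """Most likely copy-number state path for a sequence of log2 ratios."""
    score = {s: math.log(1 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SIGMA)
             for s in STATES}
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: score[r] + log_trans(r, s))
            new_score[s] = (score[best_prev] + log_trans(best_prev, s)
                            + log_gauss(x, MEANS[s], SIGMA))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical probe-level log2 ratios along one chromosome.
ratios = [0.05, -0.1, 0.0, 0.6, 0.55, 0.7, 0.5,
          0.02, -0.05, 0.03, -0.9, -1.1, -1.0, 0.1]
cnv_path = viterbi(ratios)
```

The transition penalty is what lets the model call contiguous segments rather than reacting to each noisy probe individually; in practice the parameters would be estimated from data rather than fixed by hand.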
Integrate CNV with Expression to Identify Oncogene MITF in Melanoma
Summary • SNP and CNV • SNP distribution and characteristics • Allele frequency (minor allele ≥ 1%) • LD: linkage ~ physical proximity • Population stratification • SNP discovery: heteroduplex • SNP genotyping • SNP array • CNV detection: HMM
Acknowledgement • Stefano Monti • Tim Niu • Kenneth Kidd, Judith Kidd and Glenys Thomson • Joel Hirschhorn • Greg Gibson & Spencer Muse