INFERRING HAPLOTYPES OF COPY NUMBER VARIATIONS

Mamoru Kato, Cold Spring Harbor Laboratory, USA

Background
• 2003: The complete sequence of the human genome was released by the International Human Genome Sequencing Consortium.
• Coverage ~99%; accuracy >99.99%
• However, this complete sequence is an "average" sequence (derived from the DNA samples of multiple individuals).
• It gives little information on variation/polymorphism in the sequence among individuals.

Background
• The International HapMap Project started after the Human Genome Project to address variation/polymorphism in human sequences.
• Focused on single nucleotide polymorphisms (SNPs), the simplest type of polymorphism
• Catalogued SNP genotypes for 270 individuals in three ethnic populations (Asian, African, European) at 3 million SNP loci
• Medical applications as well as biological investigation
• 2005: Phase I was published.
• 2007: Phase II was published.

Background
• An achievement of the HapMap Project: linkage disequilibrium (LD)
• LD: statistical association between different loci
• Promotes genome-wide disease association studies
• Genome-wide association studies
• Find SNPs associated with a disease on the genomic scale
• Many diseases: diabetes, rheumatoid arthritis, myocardial infarction, Crohn's disease, ...
[Figure: SNP loci along a chromosome, with LD between them]
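For reference, LD between two bi-allelic loci is commonly quantified as below. The symbols are notation assumed here, not taken from the slides: p_A and p_B are allele frequencies at the two loci, and p_AB is the frequency of the haplotype carrying both alleles.

```latex
D = p_{AB} - p_A\,p_B,
\qquad
r^2 = \frac{D^2}{p_A(1-p_A)\,p_B(1-p_B)}
```

D = 0 when the two loci are statistically independent. Both quantities are defined on haplotype frequencies, which is why haplotypes are needed later in the talk.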

Background
• While the HapMap Project was ongoing, more complex types of genetic variation than SNPs were recognized.
• 2004: Copy number variation (Sebat et al; Iafrate et al)
• Detected by microarrays
• Until then, this variation was believed to be rare in normal individuals.
• 2005: Inversion (Stefansson et al)
• These variations are collectively called structural variations.

Structural Variations
• Structural variation: variation involving long sequence segments
• Variation in which sequence segments >1 kb in size are involved
[Figure: pairs of homologous chromosomes (one from the father, one from the mother) for Individuals 1 to 3, illustrating copy number variation, inversion, and translocation of segments >1 kb]

Copy Number Variation
• Copy number variation (CNV)
• The simplest type of structural variation
• Difference in the number of copies
[Figure: a CNV represented on a reference genome, showing a CNV segment (>1 kb) and a CNV region]

CNV
• Since the studies in 2004, many genome-scale studies have been performed on many individuals.
• CNV regions cover 4-6% of the human genome
• 3,000-6,000 regions
• cf. common SNPs: 0.3% in coverage, 10 million in number
• CNV regions often include entire genes and their regulatory regions.
• e.g., CCL3L1 gene (HIV)
• CNVs are likely to influence human phenotypes such as disease susceptibility.
• Autoimmunity, autism, psoriasis, schizophrenia, ...

CNV
• The principle of CNV detection
• By microarrays and quantitative PCR (excl. paired-end mapping)

Problem
• These techniques cannot discern the configurations of genotypes (pairs of alleles) of CNV
[Figure: two genotypes with the same total copy number of 3: alleles with copy numbers 1 and 2, and alleles with copy numbers 0 and 3]

Problem
• These techniques cannot discern the configurations of genotypes (pairs of alleles) of CNV
[Figure: two genotypes with the same observed counts (# A, # G) = (2, 1): alleles "G" and "A, A", and alleles "A" and "G, A"]

Problem
• This is problematic: most theories in population genetics are built on alleles (or haplotypes), not on total numbers.
• Population differentiation (frequencies of alleles/haplotypes in a population)
• Linkage disequilibrium

Problem
• Some method is required to obtain information on alleles/haplotypes from observed data
• Haplotype inference for CNVs
• It handles >50 individuals, from one locus up to the genomic level

Haplotype Inference for CNVs
• Deterministic approach (Redon et al, 2006; McCarroll et al, 2006; Hinds et al, 2006)
• One state from observed data
• Statistical approach (Kato et al, 2008; Kato et al, 2008; Shindo et al, 2009)
• Multiple states from observed data

Deterministic Approach
• A simple approach to infer alleles: clustering the signal intensities of individuals (Redon et al, 2006)
• If signal intensities are grouped into three clusters, they correspond to two homozygotes and one heterozygote
• Assumption: two different alleles
[Figure (Redon et al, 2006): signal intensities of Ind. 1 to Ind. 4 at a CNV, grouped into homozygote, heterozygote, and homozygote clusters for Allele 1 and Allele 2]
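The clustering idea above can be sketched as follows. This is only an illustration: the slides do not say which clustering method Redon et al used, so a plain 1-D k-means is assumed here, and the function name `cluster_1d` and the intensity values are hypothetical.

```python
def cluster_1d(values, k=3, iterations=50):
    """Plain 1-D k-means (an assumed stand-in for the clustering step).

    Returns (labels, centers). Labels are ordered by cluster center,
    so label 0 is the lowest-intensity cluster.
    """
    ordered = sorted(values)
    # Initialize centers at evenly spaced positions of the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical signal intensities for seven individuals at one CNV.
intensities = [0.10, 0.12, 0.50, 0.52, 0.48, 1.00, 0.95]
labels, centers = cluster_1d(intensities)
# Under the two-allele assumption: 0 and 2 are the homozygotes, 1 the heterozygote.
names = {0: "homo (allele 1)", 1: "hetero", 2: "homo (allele 2)"}
for value, label in zip(intensities, labels):
    print(value, names[label])
```

Each individual gets exactly one genotype, which is what "one state from observed data" means; the approach breaks down when more than two alleles segregate.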

Statistical Approach
• Statistical approach (Kato et al, 2008; Kato et al, 2008; Shindo et al, 2009)
• Consider multiple possible states consistent with the total numbers (which I call diploid numbers)
• Condition: diploid numbers are observed.
• Unrelated individuals (as opposed to pedigrees)
• This approach is implemented with the expectation-maximization (EM) algorithm

Statistical Approach
• The EM algorithm is a statistical method for estimating parameters, often used for data with unobserved variables.
• Unobserved here: diplotypes (pairs of haplotypes), since their configurations are not experimentally determined
• First, it treats the unobserved data as if they were observed, by utilizing the observed data
• Observed here: diploid numbers (the total numbers over a genotype)
• Second, it iterates the E and M steps to increase estimation accuracy.

Principle of the Algorithm
• List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)
• Quantitative PCR, HMM in microarray (Kato et al, 2008)
• "/": separator symbol between haplotypes
[Figure: candidate diplotypes for Ind. 1 to Ind. 4 at a CNV]
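The listing step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `diplotypes_for` and the optional cap `max_copies` on the allelic copy number are assumptions introduced here.

```python
def diplotypes_for(diploid_number, max_copies=None):
    """Unordered haplotype pairs (a, b), a <= b, with a + b == diploid_number.

    max_copies (hypothetical parameter) caps the copy number of one haplotype.
    """
    pairs = []
    for a in range(diploid_number // 2 + 1):
        b = diploid_number - a
        if max_copies is None or b <= max_copies:
            pairs.append((a, b))
    return pairs


# Print candidates using "/" as the separator between haplotypes.
for total in (2, 3):
    print(total, ["{}/{}".format(a, b) for a, b in diplotypes_for(total)])
```

For a diploid number of 3 this yields 0/3 and 1/2, the two configurations that the measurement alone cannot tell apart.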

Principle of the Algorithm
• List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)
• Haplotype composed of CNV and SNP (Kato et al, 2008)
• "_": separator symbol between loci
[Figure: candidate diplotypes for Ind. 1 to Ind. 4 over a SNP locus (alleles a and t) and a CNV locus]

Principle of the Algorithm
• List all diplotypes (pairs of haplotypes) that are consistent with diploid numbers (the total numbers)
• Single nucleotide variations in CNVs (SNVCs)
• RETINA technique, HMM in microarray (Kato et al, 2008)
• ",": separator symbol between copies
• "-": deletion
[Figure: candidate diplotypes for Ind. 1 to Ind. 3 at two sites, SNVC1 and SNVC2, with copies such as ..A..G.., ..C..G.., and ..A..T..]
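For SNVCs the same listing step splits observed base counts rather than a single total. The sketch below handles one SNVC site only and treats the copies within a haplotype as unordered; both simplifications, and the function name `snvc_diplotypes`, are assumptions made for illustration.

```python
from itertools import product


def snvc_diplotypes(base_counts):
    """Unordered splits of observed base counts into two haplotypes.

    base_counts: dict such as {"A": 2, "G": 1}.
    A haplotype is written as its copies joined by ",", or "-" for a deletion.
    """
    bases = sorted(base_counts)
    ranges = [range(base_counts[b] + 1) for b in bases]
    seen = set()
    for first in product(*ranges):
        # The second haplotype takes whatever the first one leaves over.
        second = tuple(base_counts[b] - n for b, n in zip(bases, first))
        hap1 = ",".join(b for b, n in zip(bases, first) for _ in range(n)) or "-"
        hap2 = ",".join(b for b, n in zip(bases, second) for _ in range(n)) or "-"
        seen.add(tuple(sorted((hap1, hap2))))
    return sorted(seen)


for hap1, hap2 in snvc_diplotypes({"A": 2, "G": 1}):
    print(hap1, "/", hap2)
```

With (# A, # G) = (2, 1) there are three candidates, including the "G with A,A" and "A with A,G" configurations from the Problem slide.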

Principle of the Algorithm
• Repeat E- and M-steps to estimate haplotype frequencies, using the possible diplotypes obtained at the previous step
• F(x): frequency of x
• Initialization: give arbitrary values to haplotype frequencies
• E step: haplotype frequencies → diplotype frequencies → update the weights
• M step: number of haplotypes, considering the weights → haplotype frequencies
• Iterate the E and M steps
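The E and M steps above can be sketched for a single CNV locus as follows. This is a minimal sketch under stated assumptions, not the published implementation: haplotypes are plain allelic copy numbers 0..max_copies, diplotype frequencies follow Hardy-Weinberg proportions, the starting frequencies are uniform, and the name `em_haplotype_frequencies` and its parameters are hypothetical.

```python
def em_haplotype_frequencies(diploid_numbers, max_copies, iterations=200, tol=1e-10):
    """Estimate frequencies of allelic copy numbers 0..max_copies by EM."""
    n_alleles = max_copies + 1
    # Arbitrary starting values: uniform haplotype frequencies.
    freq = [1.0 / n_alleles] * n_alleles
    n_haplotypes = 2 * len(diploid_numbers)
    for _ in range(iterations):
        counts = [0.0] * n_alleles
        for total in diploid_numbers:
            # Candidate diplotypes a/b (a <= b) consistent with the observed total.
            candidates = [(a, total - a) for a in range(n_alleles)
                          if a <= total - a <= max_copies]
            if not candidates:
                raise ValueError("no diplotype fits diploid number %r" % (total,))
            # E step: diplotype frequencies under Hardy-Weinberg proportions,
            # normalized into weights over this individual's candidates.
            probs = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in candidates]
            norm = sum(probs)
            for (a, b), p in zip(candidates, probs):
                weight = p / norm
                counts[a] += weight
                counts[b] += weight
        # M step: weighted haplotype counts -> haplotype frequencies.
        new_freq = [c / n_haplotypes for c in counts]
        converged = max(abs(x - y) for x, y in zip(new_freq, freq)) < tol
        freq = new_freq
        if converged:
            break
    return freq


# Hypothetical diploid copy numbers for eight unrelated individuals.
print(em_haplotype_frequencies([2, 2, 2, 3, 1, 2, 4, 2], max_copies=2))
```

The result can depend on the starting values, since EM only climbs to a nearby stationary point of the likelihood; trying several starts is a common precaution.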

Application
• Population differentiation, Fst
• CNV regions with high Fst (indicating natural selection)
• Microarray data for CEU and YRI populations (90 individuals each)
• Frequencies of allelic copy numbers were estimated. (Kato et al, 2009)
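Once allelic copy number frequencies are estimated per population, Fst follows directly from them. The slides do not state which Fst estimator was used, so the sketch below assumes the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T with two equally weighted populations; the function name and the example frequencies are hypothetical.

```python
def fst(freqs_pop1, freqs_pop2):
    """Fst = (H_T - H_S) / H_T for two equally weighted populations.

    Each argument lists allele frequencies in the same allele order.
    """
    def heterozygosity(freqs):
        # Expected heterozygosity: 1 minus the sum of squared frequencies.
        return 1.0 - sum(f * f for f in freqs)

    pooled = [(p + q) / 2.0 for p, q in zip(freqs_pop1, freqs_pop2)]
    h_total = heterozygosity(pooled)
    h_within = (heterozygosity(freqs_pop1) + heterozygosity(freqs_pop2)) / 2.0
    return 0.0 if h_total == 0 else (h_total - h_within) / h_total


# Hypothetical frequencies of allelic copy numbers 0, 1, 2 in two populations.
print(fst([0.1, 0.8, 0.1], [0.6, 0.3, 0.1]))
```

The inputs are allele frequencies, which is why diploid numbers alone are not enough for this analysis.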

Application
• LD (association) between CNVs and SNPs (Kato et al, 2009)
[Figure: LD between CNVs and SNPs in CEU and YRI, shown for bi-allelic CNVs and for tri-allelic CNVs (del, dup)]

Application
• RETINA data for CEU and YRI populations (90 individuals each)
• Estimation using only copy numbers
• Estimation using both bases and copy numbers (SNVCs) (Kato et al, 2008)
• SNVCs carry more information than copy numbers alone

Future Issues
• CNV association studies
• Find CNV regions associated with a disease
• Currently, they are based on diploid numbers of copies (or categorized numbers like "2 copies").
• It would not be necessary to infer haplotypes as long as only copy numbers are examined.
• However, it would be necessary to infer haplotypes if SNVCs are associated with a disease.
• More complex, hard to analyze without haplotype inference
• SNVCs = SNPs + copy number changes
• Even SNPs alone carry a significant risk for diseases

Future Issues
• Issues in the methodology
• Use of pedigree information
• Methods based on other algorithms
• EM has limitations.
• Gibbs sampling, coalescent-based sampling, ...
• Assumption of Hardy-Weinberg equilibrium
• Errors in microarray data

Conclusions
• Human genome to human variations/polymorphisms
• SNP
• CNV
• Experimental technologies for CNV → diploid numbers
• CNV haplotype inference
• Deterministic approach
• Statistical approach
• Applications to population genetics
• Future issues
• Applications to CNV disease association studies: SNVC
• Overcoming limitations of the current algorithms

Acknowledgments
• CSHL: Michael Q. Zhang, Anthony Leotta
• Univ. of Tokyo: Hiroyuki Aburatani, Shumpei Ishikawa
• RIKEN: Tatsuhiko Tsunoda, Naoya Hosono, Takahisa Kawaguchi, Reiichiro Nakamichi, Michiaki Kubo, Naoyuki Kamatani, Yusuke Nakamura
• Affymetrix: Keith Jones, Michael Shapero
• Funding: National Cancer Institute; Japan Society for the Promotion of Science

END
