Population Structure, Association Studies, and QTLs Stat 115/215
Structure Algorithm • One of the most widely used programs in population genetics (original paper cited >9,000 times since 2000) • Pritchard, Stephens and Donnelly (2000). Inference of Population Structure Using Multilocus Genotype Data. Genetics 155:945-959. • A very flexible model that can determine: • The most likely number of uniform groups (populations, K) • The genomic composition of each individual (admixture coefficients) • Possible population of origin
A simple model of population structure • Individuals in our sample represent a mixture of K (unknown) ancestral populations. • Each population is characterized by (unknown) allele frequencies at each locus. • Within populations, markers are in Hardy-Weinberg and linkage equilibrium.
The model • Let A1, A2, …, AK represent the (unknown) allele frequencies in each subpopulation • Let Z1, Z2, …, Zm be indicators of the (unknown) subpopulation of origin of the sampled individuals • Assuming HWE and LE within subpopulations, the likelihood of an individual's genotypes at various loci in subpopulation k is given by the product of the relevant allele frequencies: $\Pr(X_i \mid Z_i = k, A_k) = \prod_l P_l$, where $X_i$ denotes the genotypes of individual i and $P_l$ is the probability of its genotype at locus l under the allele frequencies $A_k$
More details • The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies: • $P_l = p_i^2$ for homozygous loci • $P_l = 2 p_i p_j$ for heterozygous loci • Assuming no linkage among the markers, we have the product form shown on the previous slide.
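The per-locus probabilities and their product over loci can be sketched in a few lines of Python. This is only an illustration: the function names, the allele labels, and the allele frequencies below are hypothetical and are not taken from the Structure software.

```python
import math

def genotype_prob(genotype, freqs):
    """Hardy-Weinberg probability of one diploid genotype.

    genotype: pair of allele labels, e.g. ("A", "a")
    freqs: dict mapping allele label -> frequency in one population
    """
    a, b = genotype
    if a == b:
        return freqs[a] ** 2           # homozygote: p_i^2
    return 2 * freqs[a] * freqs[b]     # heterozygote: 2 p_i p_j

def log_likelihood(genotypes, pop_freqs):
    """Sum of per-locus log probabilities (loci assumed unlinked)."""
    return sum(math.log(genotype_prob(g, f))
               for g, f in zip(genotypes, pop_freqs))

# Hypothetical allele frequencies for one population at two loci
pop_k = [{"A": 0.8, "a": 0.2}, {"B": 0.5, "b": 0.5}]
# One individual: homozygous at locus 1, heterozygous at locus 2
individual = [("A", "A"), ("B", "b")]
likelihood = math.exp(log_likelihood(individual, pop_k))
```

Working on the log scale avoids numerical underflow when the product runs over many loci.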
Heuristics • If we knew the population allele frequencies in advance, then it would be easy to assign individuals (using Bayes rule). • If we knew the individual assignments, it would be easy to estimate frequencies. • In practice, we don’t know either of these, but we have the Gibbs sampler!
MCMC algorithm (for fixed K) • Start with random assignment of individuals to populations • Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it. • Step 2: Individuals are assigned to populations based on gene frequencies in each population. • And this is repeated... • Estimation of K performed separately
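The two alternating steps can be sketched as a toy Gibbs sampler. This is a simplified illustration, not the Structure implementation: it assumes the no-admixture model, biallelic loci coded as allele counts (0, 1, 2), a Beta(1, 1) prior on each allele frequency, and a uniform prior over populations. The function name `gibbs_structure` and its parameters are hypothetical.

```python
import math
import random

def gibbs_structure(genos, K, n_iter=200, seed=0):
    """Toy no-admixture sampler for biallelic loci (illustrative only).

    genos: list of individuals, each a list of allele counts (0, 1 or 2)
    Returns the population label of each individual after the last sweep.
    """
    rng = random.Random(seed)
    n, L = len(genos), len(genos[0])
    z = [rng.randrange(K) for _ in range(n)]        # random initial assignment
    for _ in range(n_iter):
        # Step 1: sample allele frequencies given the current assignments
        p = []
        for k in range(K):
            members = [i for i in range(n) if z[i] == k]
            row = []
            for l in range(L):
                alt = sum(genos[i][l] for i in members)
                tot = 2 * len(members)
                q = rng.betavariate(1 + alt, 1 + tot - alt)
                row.append(min(max(q, 1e-9), 1 - 1e-9))   # keep logs finite
            p.append(row)
        # Step 2: sample assignments given the frequencies (Bayes' rule)
        for i in range(n):
            logw = []
            for k in range(K):
                lw = 0.0
                for l in range(L):
                    g, q = genos[i][l], p[k][l]
                    # HWE genotype probability = Binomial(2, q) at count g
                    lw += (math.log(math.comb(2, g))
                           + g * math.log(q) + (2 - g) * math.log(1 - q))
                logw.append(lw)
            m = max(logw)
            weights = [math.exp(x - m) for x in logw]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return z
```

A real analysis would discard burn-in sweeps and summarize assignments across many sweeps rather than report only the final state; population labels are also arbitrary and can swap between runs.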
Alternative approach • Structure is very computationally intensive • Often no clear best-supported K-value • Alternative is to use traditional multivariate statistics to find uniform groups • Principal Components Analysis is the most commonly used algorithm • EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190)
Principal Component Analysis • Efficient way to summarize multivariate data like genotypes • Each axis passes through maximum variation in data, explains a component of the variation
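As a rough sketch of the idea, the first principal component of a genotype matrix can be found by power iteration using only the standard library. The assumptions here are ours: genotypes are coded as allele counts, each SNP is only mean-centered (EIGENSOFT additionally scales each SNP), and the function name `first_pc_scores` and the toy data are hypothetical.

```python
import math

def first_pc_scores(genos, n_iter=100):
    """Scores of each individual on the first principal component.

    genos: list of individuals, each a list of allele counts (0, 1 or 2).
    Runs power iteration on the individual-by-individual Gram matrix.
    """
    n, m = len(genos), len(genos[0])
    means = [sum(row[j] for row in genos) / n for j in range(m)]
    X = [[row[j] - means[j] for j in range(m)] for row in genos]   # center SNPs
    G = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]        # arbitrary non-zero start
    for _ in range(n_iter):
        w = [sum(G[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: two groups with opposite allele-count patterns
toy = [[2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
       [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1]]
scores = first_pc_scores(toy)
```

On this toy data the first three individuals receive scores of one sign and the last three the opposite sign, which is the kind of separation the ordination plots rely on. The overall sign of a principal component is arbitrary.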
Human population assignment with SNPs • Assayed 500,000 SNP genotypes for 3,192 Europeans • Used Principal Components Analysis to ordinate samples in space • High correspondence between sample ordination and geographic origin of samples • Individuals assigned to populations of origin with high accuracy
Genetic Association Tests • Review of typical approach: chi-square test • 2x3 table (or 2x2 table) • Alternatively, we can do a logistic regression
Genetic Models and Underlying Hypotheses • Genotypic value is the expected phenotypic value of a particular genotype • Genotypic Model • Hypothesis: all 3 genotypes have different effects • AA vs. Aa vs. aa
Genetic Models and Underlying Hypotheses • Dominant Model • Hypothesis: the genetic effects of AA and Aa are the same (assuming A is the minor allele) • AA and Aa vs. aa
Genetic Models and Underlying Hypotheses • Recessive Model • Hypothesis: the genetic effects of Aa and aa are the same (A is the minor allele) • AA vs. Aa and aa
Genetic Models and Underlying Hypotheses • Allelic Model • Hypothesis: the genetic effects of allele A and allele a are different • A vs. a
Pearson’s Chi-squared Test • Genotypic Model: • Null Hypothesis: Independence • df = 2
Pearson’s Chi-squared Test • Dominant Model: • Null Hypothesis: Independence • df = 1
Pearson’s Chi-squared Test • Recessive Model: • Null Hypothesis: Independence • df = 1
Pearson’s Chi-squared Test • Allelic Model: • Null Hypothesis: Independence • df = 1
Test Statistic • Chi-squared test statistic: $\chi^2 = \sum (O - E)^2 / E$, summed over all cells of the table • O is the observed cell count • E is the expected cell count under the null hypothesis of independence
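To make the four genetic models concrete, the sketch below collapses one set of case/control genotype counts into the table for each model and computes Pearson's statistic. The counts are invented for illustration, and the helper names are hypothetical. The p-values use the closed-form upper tails of the chi-squared distribution for 1 and 2 degrees of freedom.

```python
import math

def chi2_stat(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i in range(len(row)):
        for j in range(len(col)):
            expected = row[i] * col[j] / n      # E under independence
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def chi2_pvalue(x, df):
    """Upper-tail p-value; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only df = 1 or 2 handled in this sketch")

# Hypothetical genotype counts (AA, Aa, aa), with A as the minor allele
cases, controls = [30, 90, 80], [15, 75, 110]

def collapse(c):
    """Return the rows used by the dominant, recessive and allelic models."""
    return {"dominant": [c[0] + c[1], c[2]],
            "recessive": [c[0], c[1] + c[2]],
            "allelic": [2 * c[0] + c[1], c[1] + 2 * c[2]]}

tables = {"genotypic": ([cases, controls], 2)}
for name in ("dominant", "recessive", "allelic"):
    tables[name] = ([collapse(cases)[name], collapse(controls)[name]], 1)

results = {name: (chi2_stat(t), chi2_pvalue(chi2_stat(t), df))
           for name, (t, df) in tables.items()}
```

Note that the allelic table counts chromosomes rather than individuals, so its total is twice the number of subjects.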
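Other Options • Fisher’s Exact Test: when the sample size is small, the asymptotic approximation of the null distribution is no longer valid. Fisher’s exact test gives the exact significance of the deviation from the null hypothesis. • For a 2 by 2 table with cell counts a, b (first row) and c, d (second row) and total n, the probability of the observed table given fixed margins is $p = \binom{a+b}{a}\binom{c+d}{c} / \binom{n}{a+c}$; the exact p-value sums this probability over all tables at least as extreme as the observed one.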
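A minimal sketch of the 2 by 2 case follows. It uses the hypergeometric probability of each table with the observed margins and, as one common convention for a two-sided test, sums the probabilities of all tables that are no more likely than the observed one. The function name is hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)       # feasible top-left values
    # Small relative tolerance so ties with the observed table are included
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

p_value = fisher_exact_2x2(3, 1, 1, 3)
```

For the table [[3, 1], [1, 3]] there are 70 equally weighted ways to choose the first column, and the tables with top-left count 0, 1, 3 or 4 contribute 1 + 16 + 16 + 1 of them, giving a p-value of 34/70.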
Association Tool • PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/ • Case-control, TDT, quantitative traits.
Mapping Quantitative Traits • Examples: weight, height, blood pressure, BMI, mRNA expression of a gene, etc. • Example: F2 intercross mice
Quantitative traits (phenotypes) • 133 females from our earlier (NOD × B6) × (NOD × B6) cross • Trait 4 is the log count of a particular white blood cell type.
Another representation of a trait distribution Note the equivalent of dominance in our trait distributions.
A second example Note the approximate additivity in our trait distributions here.
Trait distributions: a classical view In general we seek a difference in the phenotype distributions of the parental strains before we think seeking genes associated with a trait is worthwhile. But even if there is little difference, there may be many such genes. Our trait 4 is a case like this.
Data and goals • Data • Phenotypes: $y_i$ = trait value for mouse i • Genotypes: $x_{ij}$ = 1/0 if mouse i is A/H at marker j (backcross); two dummy variables are needed for an intercross • Genetic map: locations of markers • Goals • Identify the (or at least one) genomic region, called a quantitative trait locus (QTL), that contributes to variation in the trait • Form confidence intervals for the QTL location • Estimate QTL effects
Models: Genotype → Phenotype • Let y = phenotype, g = whole-genome genotype • Imagine a small number of QTLs with genotypes $g_1, \dots, g_p$ ($2^p$ or $3^p$ distinct genotypes for BC, IC resp.) • We assume $E(y \mid g) = \mu(g_1, \dots, g_p)$, $\mathrm{var}(y \mid g) = \sigma^2(g_1, \dots, g_p)$
Models: Genotype → Phenotype, ctd • Homoscedasticity (constant variance): $\sigma^2(g_1, \dots, g_p) = \sigma^2$ (constant) • Normality of residual variation: $y \mid g \sim N(\mu_g, \sigma^2)$ • Additivity: $\mu(g_1, \dots, g_p) = \mu + \sum_j \Delta_j g_j$ ($g_j$ = 0/1 for BC) • Epistasis: any deviation from additivity.
The simplest method: ANOVA • Split mice into groups according to genotype at a marker • Do a t-test/ANOVA • Repeat for each marker • Adjust for multiplicity • LOD score = $\log_{10}$ likelihood ratio, comparing the single-QTL model to the “no QTL anywhere” model.
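Under the normal model with a common variance, the marker-by-marker LOD score reduces to (n/2) log10(RSS0/RSS1), where RSS0 is the residual sum of squares around the overall mean and RSS1 is the residual sum of squares around the genotype-group means. A minimal sketch, with a hypothetical function name and invented data:

```python
import math

def lod_score(phenotypes, genotypes):
    """LOD score for one marker from a one-way ANOVA decomposition.

    phenotypes: list of trait values
    genotypes: list of genotype labels at the marker (same length)
    Assumes the within-group residual sum of squares is non-zero.
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    rss0 = sum((y - mean) ** 2 for y in phenotypes)      # no-QTL model
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    rss1 = 0.0                                            # single-QTL model
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss1 += sum((y - group_mean) ** 2 for y in ys)
    return (n / 2) * math.log10(rss0 / rss1)

# Hypothetical backcross data: four mice, two markers
y = [1.0, 2.0, 3.0, 4.0]
markers = [["A", "A", "H", "H"], ["A", "H", "A", "H"]]
lods = [lod_score(y, m) for m in markers]
```

Mice with a missing genotype at a marker would have to be dropped for that marker, which is one motivation for interval mapping.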
Interval mapping (IM) • Lander & Botstein (1989) • Take account of missing genotype data (uses the HMM) • Interpolates between markers • Maximum likelihood under a mixture model
Interval mapping, cont. • Imagine that there is a single QTL, at position z between two (flanking) markers • Let $q_i$ = genotype of mouse i at the QTL, and assume $y_i \mid q_i \sim \text{Normal}(\mu_{q_i}, \sigma^2)$ • We won’t know $q_i$, but we can calculate $p_{ig} = \Pr(q_i = g \mid \text{marker data})$ • Then $y_i$, given the marker data, follows a mixture of normal distributions with known mixing proportions (the $p_{ig}$) • Use an EM algorithm to get MLEs of $\theta = (\mu_A, \mu_H, \mu_B, \sigma)$ • Measure the evidence for a QTL via the LOD score, which is the $\log_{10}$ likelihood ratio comparing the hypothesis of a single QTL at position z to the hypothesis of no QTL anywhere.
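The EM step at a single putative QTL position can be sketched as follows. This is a simplified illustration rather than any package's implementation: it takes the genotype probabilities given the marker data as an input matrix (computing them from flanking markers is not shown), and the function and variable names are hypothetical.

```python
import math

def normal_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def interval_mapping_lod(y, p, n_iter=100):
    """EM for a normal mixture with known mixing proportions.

    y: phenotypes; p[i][g] = Pr(q_i = g | marker data), each row sums to 1.
    Assumes every genotype has positive total probability and the fitted
    variance stays positive. Returns (means, variance, LOD).
    """
    n, G = len(y), len(p[0])
    ybar = sum(y) / n
    sigma2_0 = sum((yi - ybar) ** 2 for yi in y) / n       # no-QTL MLE
    mu = [sum(p[i][g] * y[i] for i in range(n)) / sum(p[i][g] for i in range(n))
          for g in range(G)]
    sigma2 = sigma2_0
    for _ in range(n_iter):
        # E-step: posterior genotype weights given phenotype and markers
        w = []
        for i in range(n):
            dens = [p[i][g] * normal_pdf(y[i], mu[g], sigma2) for g in range(G)]
            total = sum(dens)
            w.append([d / total for d in dens])
        # M-step: weighted means and pooled variance
        mu = [sum(w[i][g] * y[i] for i in range(n)) / sum(w[i][g] for i in range(n))
              for g in range(G)]
        sigma2 = sum(w[i][g] * (y[i] - mu[g]) ** 2
                     for i in range(n) for g in range(G)) / n
    loglik1 = sum(math.log(sum(p[i][g] * normal_pdf(y[i], mu[g], sigma2)
                               for g in range(G))) for i in range(n))
    loglik0 = sum(math.log(normal_pdf(yi, ybar, sigma2_0)) for yi in y)
    return mu, sigma2, (loglik1 - loglik0) / math.log(10)

# Hypothetical backcross example where QTL genotypes are known with certainty
mu_hat, sigma2_hat, lod = interval_mapping_lod(
    [1.0, 2.0, 3.0, 4.0], [[1, 0], [1, 0], [0, 1], [0, 1]])
```

When the genotype probabilities are all 0 or 1, as in this example, the mixture collapses and the LOD score matches the ANOVA-based value at a fully typed marker. In a genome scan this calculation is repeated over a grid of positions z.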
Epistasis, interactions, etc • How to find interactions? • Stepwise regression • BEAM (Zhang and Liu 2007)
Naïve Bayes model • [Figure: graph of Y and the markers X1, X2, X3, …, Xm]
Augmented Naïve Bayes • [Figure: graph of Y and markers partitioned into Group 0, Group 1, Group 21 and Group 22]
Acknowledgment • Terry Speed (some of the slides) • Karl Broman (U of Wisconsin) • Steven P. DiFazio (West Virginia U)