Controlling for population stratification and admixture in the MESA

Jasmin Divers, PhD
Section on Statistical Genetics and Bioinformatics, Department of Biostatistical Sciences, Division of Public Health Sciences, Wake Forest University Health Sciences

Creation of admixed population
[Figure: chromosomes from Population 1 (alleles A, B, C) and Population 2 (alleles a, b, c) combine through recombination into the mosaic chromosomes of the admixed population.]

Confounding issues due to admixture
• The admixture process can, under some circumstances, create disequilibrium between pairs of unlinked loci and thus create confounding (spurious associations between trait and marker) in genetic association studies.
• A classic example is found in Knowler et al. (1988), who reported an association between a Gm haplotype and diabetes in Pima Indians. When the analysis was repeated with subjects stratified by amount of European ancestry, the association between the haplotype and diabetes was no longer present.
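The stratified reanalysis described above can be illustrated with a small worked example. This is a minimal sketch with made-up counts (they are not data from Knowler et al.); the stratum names and the `odds_ratio` helper are assumptions introduced only for illustration. Within each ancestry stratum the marker is unrelated to disease, yet pooling the strata produces an apparent association.

```python
def odds_ratio(case_carrier, control_carrier, case_noncarrier, control_noncarrier):
    """Odds ratio for disease in marker carriers versus non-carriers."""
    return (case_carrier * control_noncarrier) / (control_carrier * case_noncarrier)

# Hypothetical counts per ancestry stratum:
# (case carriers, control carriers, case non-carriers, control non-carriers)
strata = {
    # marker rare, disease common
    "low_european_ancestry": (10, 10, 90, 90),
    # marker common, disease rare
    "high_european_ancestry": (16, 64, 4, 16),
}

# Within each stratum the odds ratio is exactly 1 (no association).
stratum_or = {name: odds_ratio(*counts) for name, counts in strata.items()}

# Pooling ignores ancestry and mixes the two strata together.
pooled_counts = tuple(sum(c[i] for c in strata.values()) for i in range(4))
pooled_or = odds_ratio(*pooled_counts)  # about 0.40: a spurious "protective" effect

print(stratum_or, round(pooled_or, 3))
```

The pooled odds ratio departs from 1 only because marker frequency and disease prevalence both vary with ancestry, which is the confounding pattern the slide describes.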

The problems created by admixture
[Diagram: nodes labeled G_i, Y_i and A_i.]

The solution
• In response to this problem, many authors have proposed a collection of methods.

Methods to control for population stratification (PS) in genetic association studies
• TDT-type approaches
• Genomic control methods
• Structured association tests
• Use IAP for control
• Use PCA to obtain a GBC

Structured association test accounting for ME
[Diagram labels: Ancestry, Admixture, Admixture estimates, Error, QTL, Phenotype, "?".]
How can one control for ancestry and decide, using the admixture estimates, whether the phenotype is actually associated with the QTL?

Why are the admixture estimates measured with error?
• Only a fixed subset of markers is considered, so some variation between the statistic (admixture) and the parameter (ancestry) should be expected.
• AIMs are not perfectly informative; that is, the allele frequency difference at each marker between the two parental populations is not equal to one.
• It is usually very difficult to identify the parental populations that gave rise to the admixed individuals sampled in a particular study, and to determine the allele frequency of each marker in those populations.
• There can be evidence of population stratification even within the populations that are considered founders of the admixed population under study.

Confounding effect of individual admixture in SAT
• The higher the delta value of a marker, the more likely the marker is to be falsely associated with the phenotypic variable when one does not control for IAP.

Issues in estimating ancestry proportions
Several factors can affect the quality of ancestry proportion estimates, including:
• Degree of marker informativeness
  • Very few markers have a specific allele whose frequency is equal to one in one parental population and zero in all others.
  • The number of such markers decreases as the number of parental populations considered increases.
• Reliance on a less clear measure of marker informativeness
  • The absolute value of the allele frequency difference is used as a measure of marker informativeness for admixed populations resulting from intermating between exactly 2 parental populations.

Issues in estimating ancestry proportions
  • The definition of marker informativeness for estimating ancestry proportions becomes less clear as the number of parental populations increases.
  • Some investigators choose to work with AIMs defined by considering only 2 of the parental populations at a time.
  • This can lead to identifiability issues.
• Difficulty in identifying the parental populations
  • Poor knowledge of the allele frequencies in the ancestral populations that intermated to create the admixed population.
  • Ignoring the effect of migration in the populations that are considered the parental populations.

Information for ancestry proportion estimates when K=3
• Choosing AIMs based on delta values between the parental populations taken 2 by 2 may not be sufficient.
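A small example makes the limitation of pairwise delta values concrete. The sketch below uses hypothetical allele frequencies for one marker in three parental populations (`pop1`, `pop2` and `pop3` are placeholder names); delta is taken, as on the earlier slide, to be the absolute allele frequency difference between two populations.

```python
from itertools import combinations

def pairwise_deltas(freqs):
    """Absolute allele frequency difference for every pair of populations."""
    return {
        (a, b): abs(freqs[a] - freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }

# Hypothetical frequencies of one allele in three parental populations.
marker = {"pop1": 0.9, "pop2": 0.1, "pop3": 0.9}

deltas = pairwise_deltas(marker)
for pair, delta in deltas.items():
    print(pair, round(delta, 2))
```

A marker like this would be selected as highly informative for the pair (pop1, pop2), yet it carries no information for separating pop1 from pop3, so a panel built only from such pairwise choices can leave some ancestry components poorly identified.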

Obtaining the individual admixture estimates
• Obtain a set of ancestry informative markers (AIMs)
  • Using delta values (keeping in mind Pfaff et al., 2004)
  • Using information-theoretic principles (Rosenberg et al., 2003)
  • Using Shannon information criteria (Smith et al., 2004)
• Maximum likelihood based methods
  • Hanis (1986)
  • Chikhi et al. (2001)
  • Wang J (2003)
  • Tang et al. (2005)*
• Bayesian methods
  • Pritchard et al. (Structure)
  • Paterson et al. (Ancestrymap)
  • Hoggart (Admixmap)
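To give a feel for the maximum likelihood idea, here is a deliberately simplified sketch. It is not the algorithm of any of the methods cited above. It assumes exactly two parental populations with known allele frequencies, independent markers, Hardy-Weinberg proportions, and a plain grid search over the admixture proportion; the function name `ml_admixture` and all numbers are hypothetical.

```python
import math

def ml_admixture(genotypes, freq_pop1, freq_pop2, grid_size=100):
    """Grid-search ML estimate of the proportion of ancestry from population 1.

    genotypes: allele counts (0, 1 or 2) of the reference allele per marker.
    freq_pop1, freq_pop2: reference allele frequencies in each parental population.
    """
    best_q, best_loglik = 0.0, -math.inf
    for step in range(grid_size + 1):
        q = step / grid_size
        loglik = 0.0
        for g, p1, p2 in zip(genotypes, freq_pop1, freq_pop2):
            # Expected allele frequency for an individual with admixture q.
            p = q * p1 + (1.0 - q) * p2
            p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
            # Binomial(2, p) log-likelihood of the genotype.
            loglik += (math.log(math.comb(2, g))
                       + g * math.log(p) + (2 - g) * math.log(1.0 - p))
        if loglik > best_loglik:
            best_q, best_loglik = q, loglik
    return best_q

# Hypothetical AIM panel: four markers, each with delta = 0.8.
freq_pop1 = [0.9, 0.9, 0.1, 0.1]
freq_pop2 = [0.1, 0.1, 0.9, 0.9]

q_like_pop1 = ml_admixture([2, 2, 0, 0], freq_pop1, freq_pop2)
q_mixed = ml_admixture([1, 1, 1, 1], freq_pop1, freq_pop2)
print(q_like_pop1, q_mixed)
```

Because the markers are not perfectly informative and the panel is small, estimates from a scheme like this are noisy, which is one reason the admixture estimates are treated as measured with error.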

Ancestry estimates for the MESA assuming K=4
• One cannot tell whether the deviations observed in self-reported EA and AA correspond to the presence of admixed individuals or are due to the lack of identifiability discussed in the previous slide.
• More variability in the ancestry estimates is observed among the self-reported Hispanic Americans.
• More specific self-reported ethnicity information needs to be collected from this group.

Using PCA to detect population stratification
• Instead of estimating ancestry proportions, the PCA approach tries to identify axes of variation that explain a large fraction of the variance observed in the data.
• This approach is less dependent on marker informativeness than estimating ancestry proportions.
• Consequently, it works well when the overall sample can be seen as a stratified sample (as in the MESA) or when there is no clear indication of the possible parental populations.
• However, the effect of the principal components on the variable of interest may not be linear. In that case the statistical analysis may become slightly more involved than a simple linear model.
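The idea of an axis of variation can be sketched on a toy genotype matrix. The example below is an assumption-laden illustration, not the procedure used in the MESA analysis: genotypes are only mean-centered per marker (published methods typically also scale by allele frequency), the data are invented, and the leading eigenvector of the sample-by-sample cross-product matrix is found by plain power iteration.

```python
import math

def first_pc_scores(genotypes, n_iter=200):
    """Leading eigenvector of the centered sample-by-sample cross-product matrix.

    The entries are the samples' scores on the first principal component,
    up to scale and sign. Assumes the samples are not all identical.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    # Deterministic start; assumed not orthogonal to the leading eigenvector.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical genotypes (rows: individuals, columns: markers).
# The first three individuals come from one subpopulation, the last three from another.
toy_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [1, 0, 2, 1],
]

scores = first_pc_scores(toy_genotypes)
print([round(s, 3) for s in scores])
```

On this toy input the two groups fall on opposite sides of zero along the first component, without any parental allele frequencies having been specified.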

Prior use of PCA in population genetics
• Two PCs are enough to recover the variation between continents (Zhang et al., 2003).

Variation within population
• Clear separation between northerners and southerners on the second principal component (Wen et al., 2004).

Does it provide adequate Type-I error control?
• In both cases, the GBC provides better control than genomic control (Zhang et al., 2003; Price et al., 2006).
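For context, the genomic control baseline in this comparison is commonly computed by rescaling association statistics with an inflation factor. The sketch below shows that usual recipe under stated assumptions: 1-df chi-square statistics, an inflation factor estimated as the observed median divided by the theoretical median of a 1-df chi-square (about 0.4549), and a floor of 1. The statistics are made up and the function name is hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Return the inflation factor and the rescaled 1-df chi-square statistics."""
    inflation = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)
    adjusted = [stat / inflation for stat in chi2_stats]
    return inflation, adjusted

# Hypothetical association statistics from a stratified sample.
observed_stats = [0.2, 0.5, 0.9098728, 1.5, 4.0]

inflation, adjusted_stats = genomic_control(observed_stats)
print(round(inflation, 3), [round(s, 3) for s in adjusted_stats])
```

Every marker is deflated by the same factor, regardless of how strongly its allele frequency differs between ancestral populations, which is one way to see why a marker-specific adjustment can do better.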

Principal component analysis
• The Hispanic Americans are the most heterogeneous group represented in the sample.

PCA in 3D of the MESA
[Figure: 3D PCA plot with two regions labeled 1 and 2.]
• We conjecture that the 2 areas identified in the graph by 1 and 2 correspond to 2 specific subsets of Hispanic Americans.

Agreement between self-reported ethnicity and the 4 observed clusters
• Cohen's kappa = 0.83 with the SR Hispanic Americans and 0.98 without them.
• The hypothesis of marginal homogeneity is rejected when the SR Hispanic Americans are included in the analysis (P value = 0.001). This hypothesis cannot be rejected after removing the SR Hispanic Americans (P value = 0.06).
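As a reminder of how the agreement statistic is defined, Cohen's kappa compares the observed proportion of agreement with the agreement expected by chance from the marginal totals. The sketch below computes it from a cross-tabulation; the 2x2 counts are invented for illustration and are unrelated to the MESA table.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square table (rows: rating 1, columns: rating 2)."""
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Chance agreement implied by the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical cross-tabulation of self-reported group versus assigned cluster.
toy_table = [
    [20, 5],
    [10, 15],
]

kappa = cohens_kappa(toy_table)
print(round(kappa, 2))
```

Here the observed agreement is 0.7 and the chance agreement is 0.5, giving a kappa of 0.4.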

Possible reasons to explain the variability observed with the SR Hispanics
[Figure legend: African American, European American, Hispanic American, Chinese American.]
• The number of ancestry informative markers used in the analysis.
• The self-reported Hispanic Americans are one of the most ethnically diverse groups.

Control for population stratification
• More markers seem to be significant when controlling for self-reported ethnicity (SRE) than when controlling for a measure computed from the AIMs.
