Gene-Environment Case-Control Studies

Gene-Environment Case-Control Studies Raymond J. Carroll Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University http://stat.tamu.edu/~carroll

Note the Maroon color scheme! And the green MSU flag.

Apologies to Dr. Seuss

Michigan State Grads at TAMU • Mohsen Pourahmadi • Soumen Lahiri

Other Michigan State Contacts • David Ruppert • Anton Schick

Outline • Problem: Case-Control Studies with Gene-Environment relationships • Theme I: Logistic regression is lousy for understanding interactions. We make assumptions that can double or triple the effective sample size

Outline • Problem: Case-Control Studies with Gene-Environment relationships • Theme II: There is a lousy estimator, and a good one that makes more assumptions. How do you protect yourself if the assumptions fail, and you want to analyze 500,000 SNPs?

Outline • Problem: Case-Control Studies with Gene-Environment relationships • Theme III: How does all this work with actual data, as opposed to simulated data?

Software • SAS and Matlab Programs Available at my web site under the software button • R programs available from the NCI • New Statistical Science paper 2009, volume 24, 489-502 http://stat.tamu.edu/~carroll

Basic Problem Formalized • Gene and Environment • Question: For women who carry the BRCA1/2 mutation, does oral contraceptive use provide any protection against ovarian cancer?

Basic Problem Formalized • Gene and Environment • Question: For people carrying a particular haplotype in the VDR pathway, do higher levels of serum Vitamin D protect against prostate cancer?

Basic Problem Formalized • Gene and Environment • Question: If you are a current smoker, are you protected against colorectal adenoma if you carry a particular haplotype in the NAT2 smoking metabolism region?

Retrospective Studies • D = disease status (binary) • X = environmental variables • Smoking status • Vitamin D • Oral contraceptive use • G = gene status • Mutation or not • Multiple or single SNP • Haplotypes

Prospective and Retrospective Studies • Retrospective Studies: Usually called case-control studies • Find a population of cases, i.e., people with a disease, and sample from it. • Find a population of controls, i.e., people without the disease, and sample from it.

Prospective and Retrospective Studies • Retrospective Studies: So called because the gene G and the environment X are sampled after disease status is ascertained

Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction as they affect development of disease

Logistic Regression • Logistic Function: • The approximation works for rare diseases
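The formula on this slide did not survive extraction. As a sketch of what the bullets presumably refer to, here is the standard logistic function together with its rare-disease approximation. The symbol H is an assumption chosen for illustration, not taken from the slide.

```latex
H(x) = \frac{\exp(x)}{1 + \exp(x)},
\qquad
H(x) \approx \exp(x) \quad \text{when } \exp(x) \text{ is small, i.e., when the disease is rare.}
```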

Prospective Models • Simplest logistic model without an interaction • The effect of having a mutation (G=1) versus not (G=0) is

Prospective Models • Simplest logistic model with an interaction • The effect of having a mutation (G=1) versus not (G=0) is
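The model formulas on the two Prospective Models slides were lost in extraction. A hedged reconstruction in standard notation follows; the coefficient names \beta_0, \beta_G, \beta_X, \beta_{GX} and the function H (the logistic function) are assumptions introduced here for illustration.

```latex
% Without an interaction: the log odds ratio for G=1 versus G=0 is the same at every X.
\mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X),
\qquad
\log \mathrm{OR}(G=1 \text{ vs } G=0 \mid X) = \beta_G .

% With an interaction: the log odds ratio for G=1 versus G=0 changes with X.
\mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX} G X),
\qquad
\log \mathrm{OR}(G=1 \text{ vs } G=0 \mid X) = \beta_G + \beta_{GX} X .
```

In the second model the question "does X protect carriers?" is a question about \beta_{GX}, which is why the efficiency of interaction estimates matters.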

Empirical Observations • Statistical Theory: There is a lovely statistical theory available • It says: ignore the fact that you have a case-control sample, and pretend you have a prospective study
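The "pretend it is prospective" result can be checked on simulated data. The sketch below is an illustration only: every parameter value is hypothetical, the environment is taken to be binary for simplicity, and the Newton-Raphson fitter is a plain textbook routine rather than the software mentioned in the talk. It fits ordinary logistic regression to a case-control sample and compares the estimates with the population values; the slopes should be recovered, while the intercept is shifted by the sampling design.

```python
import numpy as np

def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(design, y, n_iter=25):
    """Newton-Raphson maximum likelihood for ordinary logistic regression."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta)
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

rng = np.random.default_rng(0)
n_pop = 400_000
# Hypothetical population values: intercept, G, X, G*X.
true_beta = np.array([-4.0, 0.5, 0.7, 0.4])

g = rng.binomial(1, 0.3, n_pop)   # gene status, independent of x
x = rng.binomial(1, 0.5, n_pop)   # binary environment
design = np.column_stack([np.ones(n_pop), g, x, g * x])
d = rng.binomial(1, expit(design @ true_beta))

# Case-control sampling: keep every case and an equal number of controls.
cases = np.flatnonzero(d == 1)
controls = rng.choice(np.flatnonzero(d == 0), size=cases.size, replace=False)
sample = np.concatenate([cases, controls])

beta_hat = fit_logistic(design[sample], d[sample])
print("population values:", true_beta)
print("case-control fit: ", np.round(beta_hat, 2))
```

Only the slope estimates are comparable with the population values; the fitted intercept reflects the over-representation of cases.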

When G is observed • Logistic regression is robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions • Much larger sample sizes are required for interactions than for just gene effects

Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies

G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

Gene-Environment Independence • If you are willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • This is NOT TRUE for prospective studies, only true for retrospective studies.

Gene-Environment Independence • The reason is that you are putting a constraint on the retrospective likelihood
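One well-known way to see how independence constrains retrospective data is the case-only calculation: for binary G and X that are independent in the population, the G-X odds ratio among cases is close to exp(interaction) when the disease is rare. The sketch below illustrates that fact only; it is not the profile-likelihood method of the talk, and the coefficient names and values are hypothetical.

```python
import math

def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def case_only_odds_ratio(b0, bg, bx, bgx):
    """G-X odds ratio among cases, for binary G and X independent in the population.

    Under independence the marginal frequencies of G and X cancel out of the
    odds ratio, so only the four disease probabilities remain.
    """
    p00 = expit(b0)
    p10 = expit(b0 + bg)
    p01 = expit(b0 + bx)
    p11 = expit(b0 + bg + bx + bgx)
    return (p11 * p00) / (p10 * p01)

# The rarer the disease (more negative intercept), the closer the
# case-only odds ratio gets to exp(bgx).
for b0 in (-2.0, -4.0, -6.0):
    ratio = case_only_odds_ratio(b0, bg=0.5, bx=0.7, bgx=0.4)
    print(f"intercept {b0}: case-only OR = {ratio:.4f}, exp(bgx) = {math.exp(0.4):.4f}")
```

No such shortcut exists in a prospective sample, where the covariate distribution carries no information about the disease model.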

Gene-Environment Independence • Our methodology is far more general than assuming that genetic status and environment are independent • We have developed capacity for modeling the distribution of genetic status given strata and environmental factors • I will skip this and just pretend G-E independence here

More Efficiency, G Observed • Our model: G-E independence and a genetic model, e.g., Hardy-Weinberg Equilibrium
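As a small illustration of what a genetic model such as Hardy-Weinberg Equilibrium pins down, the sketch below computes genotype probabilities from a single allele frequency. The function name and the example frequency are hypothetical.

```python
def hwe_genotype_probs(q):
    """Probabilities of carrying 0, 1, or 2 copies of the variant allele under HWE.

    q is the population frequency of the variant allele.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return ((1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2)

probs = hwe_genotype_probs(0.2)   # hypothetical allele frequency
print(probs)
```

Under HWE the whole genotype distribution depends on one parameter, which is part of where the extra efficiency comes from.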

The Formulation • Any logistic model works • Question: What methods do we have to construct estimators?

Methodology • I won’t give you the full methodology, but it works as follows. • Case-control studies are very close to a prospective (random sampling) study, with the exception that sometimes you do not observe people

Methodology (diagram) • Total population: N • Cases in the population: Np1 • Controls in the population: Np0 • Cases in the sample: n1 • Controls in the sample: n0 • Missing cases: Np1 - n1 • Missing controls: Np0 - n0 • Diagram labels: % of cases observed, % of controls observed
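The bookkeeping in this diagram can be sketched with a few lines of arithmetic. The population size, prevalence, and sample sizes below are hypothetical and chosen only to show how lopsided the two observed fractions are.

```python
def sampling_fractions(n_population, prevalence, n_cases, n_controls):
    """Fractions of population cases and population controls that are observed."""
    cases_in_population = n_population * prevalence              # N * p1
    controls_in_population = n_population * (1.0 - prevalence)   # N * p0
    return n_cases / cases_in_population, n_controls / controls_in_population

# Hypothetical numbers: N = 1,000,000, p1 = 0.01, n1 = n0 = 1000.
frac_cases, frac_controls = sampling_fractions(1_000_000, 0.01, 1000, 1000)
print(f"cases observed:    {frac_cases:.4%}")
print(f"controls observed: {frac_controls:.4%}")
```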

Pretend Missing Data Formulation • This means that there is a missing data problem. • The selection into the case control study is biased: cases are vastly over-represented • Ordinary logistic regression computes the probability of disease given the environment, given the gene, and given that the person was selected into the case control study
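A short derivation shows what conditioning on selection does to ordinary logistic regression. The notation is an assumption introduced for illustration: S = 1 denotes selection into the study, \pi_d = \mathrm{pr}(S=1 \mid D=d) is the selection probability (assumed to depend on disease status only), H is the logistic function, and \eta(G,X) is the linear predictor with intercept \beta_0.

```latex
\mathrm{pr}(D=1 \mid G, X, S=1)
  = \frac{\pi_1 H\{\eta(G,X)\}}{\pi_1 H\{\eta(G,X)\} + \pi_0 [1 - H\{\eta(G,X)\}]}
  = H\{\eta(G,X) + \log(\pi_1/\pi_0)\},
\qquad
\pi_1 = \frac{n_1}{N p_1}, \quad \pi_0 = \frac{n_0}{N p_0}.
```

Selection therefore changes only the intercept, from \beta_0 to \beta_0 + \log(\pi_1/\pi_0); the coefficients of G, X, and their interaction are unchanged.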

Pretend Missing Data Formulation • This means that there is a missing data problem. • Our method computes the probability of disease and the probability of gene given the environment and given that the person was selected into the case control study • The selection into the case control study is biased: cases are vastly over-represented

Methodology • Our method has an explicit form, i.e., no integrals or anything nasty • It is easy to program the method to estimate the logistic model • It is likelihood based. Technically, a semiparametric profile likelihood

Methodology • We can handle missing gene data • We can handle error in genotyping • We can handle measurement errors in environmental variables, e.g., diet

Methodology • Our method results in much more efficient statistical inference

More Data • What does more efficient statistical inference mean? • It means, effectively, that you have more data • In cases where G is a simple mutation, our method is typically equivalent to having 3 times more data
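One common way to turn "more efficient" into "more data" is the variance ratio: because the variance of an estimator shrinks roughly like 1/n, a method with one third of the variance behaves like the standard method on three times the sample. The sketch below assumes that reading; the function name and the variances are hypothetical, not results from the talk.

```python
def effective_sample_size(n, var_standard, var_efficient):
    """Sample size the standard method would need to match the efficient method.

    Assumes both variances were computed at the same n and scale like 1/n.
    """
    return n * var_standard / var_efficient

# Hypothetical variances of an interaction estimate with n = 1000 subjects.
print(effective_sample_size(1000, var_standard=0.09, var_efficient=0.03))
```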

How much more data: Typical Simulation Example • The increase in effective sample size when using our methodology

Real Data Complexities • The Israeli Ovarian Cancer Study • G = BRCA1/2 mutation (very deadly) • X includes age, ethnic status (see below), parity, oral contraceptive use, family history, smoking, etc.

Real Data Complexities • In the Israeli Study, G is missing in 50% of the controls, and 10% of the cases • Also, among Jewish citizens, Israel has two dominant ethnic types • Ashkenazi (European) • Sephardic (North African)

Real Data Complexities • The gene mutation BRCA1/2 is frequent among the Ashkenazi, but rare among the Sephardic • Thus, if one component of X is ethnic status, then pr(G=1 | X) depends on X • Gene-Environment independence fails here • What can be done? Model pr(G=1 | X) as binary with different probabilities!

Israeli Ovarian Cancer Study • Question: Can carriers of the BRCA1/2 mutation be protected via OC-use?

Typical Empirical Example

Israeli Ovarian Cancer Study • Main Effect of BRCA1/2:

Israeli Ovarian Cancer Study

Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}

Haplotypes • Unfortunately, we cannot presently observe the two haplotypes • We can only observe genotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b)

Missing Haplotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b) • However, this is also consistent with a different diplotype, namely Hdip = {(am,Bm), (Af,bf)} • Note that the number of copies of the (a,b) haplotype differs in these two cases • The true diplotype (haplotype pair) is missing
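The ambiguity described here can be enumerated directly. The sketch below lists every unordered haplotype pair consistent with a set of unphased genotypes; the function name and the data layout (one pair of alleles per site) are assumptions made for illustration.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """List the unordered haplotype pairs consistent with unphased genotypes.

    genotype has one entry per site; each entry is the pair of alleles
    observed at that site, e.g. ("A", "a").
    """
    diplotypes = set()
    # At each site, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(site[c] for site, c in zip(genotype, choice))
        hap2 = tuple(site[1 - c] for site, c in zip(genotype, choice))
        diplotypes.add(tuple(sorted((hap1, hap2))))   # unordered pair
    return sorted(diplotypes)

# Heterozygous at both sites: two diplotypes fit the same observed data.
for diplotype in compatible_diplotypes([("A", "a"), ("B", "b")]):
    print(diplotype)
```

With at most one heterozygous site the list has a single entry, so phase is only missing for subjects who are heterozygous at two or more sites.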

Missing Haplotypes • Our methods handle unphased diplotypes (missing haplotypes) with no problem. • Standard EM-algorithm calculations can be used • We assume that the haplotypes are in HWE, and have extended the methods to cases of non-HWE
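To make "standard EM-algorithm calculations" concrete, here is a textbook sketch of EM for haplotype frequencies at two biallelic sites under HWE. It ignores disease status and covariates entirely, so it is not the estimator from the talk; the genotype coding (copies of the variant allele at each site) and the toy data are assumptions for illustration.

```python
import numpy as np
from itertools import product

# The four haplotypes at two sites; 1 marks the variant allele.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def em_haplotype_frequencies(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes, assuming HWE.

    Each genotype is (copies of variant at site 1, copies of variant at site 2).
    """
    freq = np.full(4, 0.25)
    for _ in range(n_iter):
        expected = np.zeros(4)
        for geno in genotypes:
            # E-step: ordered haplotype pairs whose allele counts match the genotype,
            # weighted by their HWE probability under the current frequencies.
            pairs = [(i, j) for i, j in product(range(4), repeat=2)
                     if tuple(a + b for a, b in zip(HAPLOTYPES[i], HAPLOTYPES[j])) == tuple(geno)]
            weights = np.array([freq[i] * freq[j] for i, j in pairs])
            weights = weights / weights.sum()
            for (i, j), w in zip(pairs, weights):
                expected[i] += w
                expected[j] += w
        # M-step: frequencies are proportional to expected haplotype counts.
        freq = expected / expected.sum()
    return freq

# Toy data: two unambiguous subjects and one double heterozygote.
toy_genotypes = [(0, 0), (2, 2), (1, 1)]
print(np.round(em_haplotype_frequencies(toy_genotypes), 3))
```

Only the double heterozygote (1, 1) is ambiguous; EM resolves it in proportion to how common each candidate haplotype pair is among the other subjects.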

Robustness • Robustness: We are making assumptions to gain efficiency = “get more data” • What happens if the assumptions are wrong? • Biases, incorrect conclusions, etc. • How can we gain efficiency when it is warranted, and yet have valid inferences?

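Two Likelihoods • The two likelihoods, ordinary logistic regression and the retrospective likelihood that assumes gene-environment independence, lead to two estimators • The former is robust but not efficient • The latter is efficient but not robust • What to do?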
