POLYMORPHISMS & ASSOCIATION TESTS

Saurabh Sinha, Mayo-Illinois Computational Genomics Workshop, June 14, 2019. Acknowledgment for some slides to Arián Avalos.

OUTLINE • Molecular Markers • Genome Wide Association Studies (GWAS) • Functional Effects

MOLECULAR MARKERS • What is a SNP and a SNV? • Single Nucleotide Polymorphism • Single Nucleotide Variant
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT

MOLECULAR MARKERS • A SNV is any single-nucleotide change (e.g. a somatic mutation, even an artifact). • A SNP has defining criteria • It is a polymorphic SNV, with “major” and “minor” alleles • Sometimes defined by a frequency threshold (e.g. minor allele frequency of at least 5%) • For reference, the 1000 Genomes Project identified ~41 Million SNPs across ~1000 individuals.
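
The frequency criterion can be made concrete with the eight example sequences I1 to I8 shown earlier. The following is a minimal sketch, not a production variant caller: it assumes each individual contributes one haploid sequence, that the sequences are already aligned, and it uses the 5% cutoff purely as an illustrative threshold. The names `variable_sites`, `minor_allele_frequency` and `MAF_THRESHOLD` are made up for this example.

```python
from collections import Counter

# The eight aligned example sequences (I1..I8) from the slides.
SEQS = [
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I1
    "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT",  # I2
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I3
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I4
    "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT",  # I5
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I6
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I7
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I8
]

MAF_THRESHOLD = 0.05  # illustrative cutoff (assumption, taken from the 5% example)


def variable_sites(seqs):
    """Return (0-based position, allele counts) for every column with >1 allele."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:
            sites.append((pos, counts))
    return sites


def minor_allele_frequency(counts):
    """Frequency of the second most common allele in one alignment column."""
    total = sum(counts.values())
    return counts.most_common()[1][1] / total


sites = variable_sites(SEQS)
for pos, counts in sites:
    (major, _), (minor, _) = counts.most_common(2)
    maf = minor_allele_frequency(counts)
    print(f"position {pos}: major={major} minor={minor} MAF={maf:.2f} "
          f"passes threshold: {maf >= MAF_THRESHOLD}")
```

In this toy sample the single variable site has T as the major allele and A as the minor allele, carried by I2 and I5.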

MOLECULAR MARKERS • Both types of variants are relevant depending on the field • Population geneticists conducting association tests will focus on SNPs • Cancer geneticists will instead be interested in SNVs • The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)

GENETIC LINKAGE ANALYSIS • Example: • Cystic Fibrosis and the CFTR gene mutations • Approach: Genetic Linkage Analysis • Genotype family members (some individuals carrying the disease) • Find a marker that correlates with the disease • Disease gene lies close to this marker

GENETIC LINKAGE ANALYSIS • Limitations of Genetic Linkage Analysis • Requires data from entire families, preferably large ones, where the trait is segregating • Linkage analysis is less successful with common diseases, e.g., heart disease or cancers. • Requires a single, large-effect locus

GENOME-WIDE ASSOCIATION STUDIES (GWAS)

GWAS • Hypothesize that common diseases are influenced by common genetic variation in the population • Implications: • Any individual variation (SNP) will have relatively small correlation with the disease • Multiple common alleles together influence the disease phenotype • This argues for population- rather than family-based studies.

GWAS Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822

GWAS: Resources • Zhang X. et al. (2012). PLoS Comput Biol 8(12): e1002828. • Bush W. S. & Moore J. H. (2012). PLoS Comput Biol 8(12): e1002822.

GWAS: Genotyping • Microarray – can assay 0.5 – 1.0 Million or more SNPs • Whole-genome sequencing (WGS) – assays (near) complete SNP profile • In non-human genetics, reduced-representation methods provide a middle-ground.

GWAS: Phenotyping • Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease) • Quantitative – continuous measure, usually for complex phenotypes (e.g. blood pressure, LDL levels) • Possible to look at more than one phenotype?

GWAS • Case / Control
     Sequence                              Disease?
I1:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I3:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -

GWAS • Before analysis and interpretation, a few considerations: • Correlation is not causation • Linkage disequilibrium (see later) • Population structure (see later) • Phenotyping

GWAS • Further consider that even if the analysis is successful, findings can be hard to interpret • Example: • SNP correlates well with heart disease • Biochemical link? Behavioral link (you particularly like bacon…)?

GWAS: Statistics • Case vs. Control
      Sequence                              Disease?
I1:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I3:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5:   AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8:   AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I9:   AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I10:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  +
I11:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I12:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I13:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  -
I14:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +

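GWAS: Statistics • Case vs. Control • The Fisher’s Exact Test • Of all 14 individuals, 4 are cases and 4 carry allele A; 3 individuals are both
          Case   Control   Total
A            3         1       4
T            1         9      10
Total        4        10      14
p-value < 0.05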
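
The numbers on this slide can be reproduced from the 14 individuals on the Case vs. Control slide. The sketch below tallies the allele at the polymorphic site against disease status and computes a one-sided Fisher's exact p-value directly from the hypergeometric distribution. It is an illustrative stdlib implementation (the function name `fisher_exact_one_sided` is made up here); in practice a statistics library would be used.

```python
from math import comb

# Allele at the polymorphic site and disease status for I1..I14,
# read off the Case vs. Control slide.
ALLELES = "TTTTATTATTTTAA"
STATUS = "----+--+-+---+"

# 2x2 table: rows = allele (A, T), columns = (case, control).
a = sum(al == "A" and st == "+" for al, st in zip(ALLELES, STATUS))
b = sum(al == "A" and st == "-" for al, st in zip(ALLELES, STATUS))
c = sum(al == "T" and st == "+" for al, st in zip(ALLELES, STATUS))
d = sum(al == "T" and st == "-" for al, st in zip(ALLELES, STATUS))


def fisher_exact_one_sided(a, b, c, d):
    """P(top-left cell >= a) with all row and column totals held fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # k of the col1 cases carry A, the remaining cases carry T.
        p += comb(row1, k) * comb(n - row1, col1 - k) / denom
    return p


p_value = fisher_exact_one_sided(a, b, c, d)
print(f"table: A=({a}, {b}) T=({c}, {d}); one-sided p = {p_value:.4f}")
```

For this table the two-sided p-value (summing all tables no more probable than the observed one) is the same, because only the observed table and the more extreme one qualify.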

GWAS: Statistics • A favored alternative to the Fisher’s Exact Test is the Chi-Squared Test. • We conduct this test on EACH SNP separately, and get a corresponding p-value. • The smallest p-values point to the SNPs most associated with the disease.
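
A minimal sketch of the per-SNP procedure: compute a Pearson Chi-Squared statistic for each SNP's 2x2 table (allele by case/control), convert it to a p-value, and rank SNPs by p-value. The first table uses the counts from the 14-individual example (3, 1, 1, 9); the second SNP and both SNP names are hypothetical. No continuity correction is applied, and with expected counts this small the Chi-Squared approximation is rough, which is one reason Fisher's test is used for small samples.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson Chi-Squared statistic and p-value (1 degree of freedom)."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the Chi-Squared distribution with 1 df.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# One table per SNP: (A & case, A & control, T & case, T & control).
tables = {
    "snp_from_slide": (3, 1, 1, 9),   # counts from the 14-individual example
    "snp_unrelated": (2, 2, 5, 5),    # hypothetical SNP with no association
}

results = {name: chi_squared_2x2(*t) for name, t in tables.items()}
ranked = sorted(results.items(), key=lambda item: item[1][1])
for name, (stat, p_value) in ranked:
    print(f"{name}: chi2 = {stat:.3f}, p = {p_value:.4f}")
```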

GWAS: Statistics • Both Fisher’s Exact Test and the Chi-Squared Test, as used so far, are allelic association tests, i.e. we test whether having A instead of T at the polymorphic site correlates with the disease. • In a genotypic association test, each position is a combination of two alleles, e.g. AA, TT, AT • We therefore correlate the genotype with the phenotype of the individual

GWAS: Statistics • There are various options for a Case vs. Control genotypic association test • Example: • Dominant Model • Recessive Model • 2x3 Table Chi-Squared Test
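
The three options differ only in how the genotype counts are grouped before testing. A minimal sketch, using hypothetical genotype counts and treating A as the allele of interest: the dominant model compares carriers of at least one A (AA + AT) against TT, the recessive model compares AA against the rest, and the 2x3 test keeps all three genotypes separate. The closed-form p-values below cover only 1 and 2 degrees of freedom, which is all these tables need.

```python
from math import erfc, exp, sqrt

# Hypothetical genotype counts (AA, AT, TT); not taken from the slides.
cases = (30, 50, 20)
controls = (10, 40, 50)


def chi_squared(table):
    """Pearson Chi-Squared statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


def p_value(stat, df):
    """Chi-Squared survival function, closed form for df = 1 or 2 only."""
    if df == 1:
        return erfc(sqrt(stat / 2))
    if df == 2:
        return exp(-stat / 2)
    raise ValueError("only df = 1 or 2 are supported in this sketch")


def dominant(counts):
    aa, at, tt = counts
    return (aa + at, tt)      # at least one A vs. no A


def recessive(counts):
    aa, at, tt = counts
    return (aa, at + tt)      # two copies of A vs. fewer


models = {
    "dominant": [dominant(cases), dominant(controls)],
    "recessive": [recessive(cases), recessive(controls)],
    "2x3": [cases, controls],
}

for name, table in models.items():
    stat, df = chi_squared(table)
    print(f"{name}: chi2 = {stat:.3f}, df = {df}, p = {p_value(stat, df):.2e}")
```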

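GWAS: Statistics • Quantitative Phenotypes • Model the phenotype $Y$ as a linear function of the genotype $X$: $Y = a + bX$ • If no association, $b \approx 0$ • The more $b$ differs from 0, the stronger the association • This is called linear regression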
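
A minimal sketch of linear regression for one SNP, assuming the genotype is coded as the number of copies of one allele (0, 1 or 2) and the phenotype is a continuous measurement. The line fitted is phenotype = a + b × genotype, where a slope b near zero suggests no association. The data values are hypothetical and the function name `fit_line` is made up for this example; a real analysis would also report a p-value for the slope.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: copies of allele A per individual, and a
# quantitative phenotype (e.g. an LDL-like measurement).
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [100, 104, 110, 114, 120, 124]

a, b = fit_line(genotype, phenotype)
print(f"intercept a = {a:.1f}, slope b = {b:.1f}")
```

Here each extra copy of the allele raises the phenotype by about 10 units, so the slope is far from zero.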

GWAS: Statistics • Quantitative Phenotypes • Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA) • Statistical models for GWAS can get quite involved (can give references on request)

GWAS: Statistics Lambert et al., 2013: Nature Genetics 45, 1452

GWAS: Statistics • Multiple Hypothesis Correction • What does p-value = 0.01 mean? • It means that, if there were no real association, a Genotype x Phenotype correlation at least this strong would have only a 1% probability of arising just by chance. • What if we repeat the test for 1 Million SNPs? Of those tests, about 1% (10,000 SNPs) will show this level of correlation just by chance (and by definition), even if no SNP is truly associated

GWAS: Statistics • Multiple Hypothesis Correction • Bonferroni (seen in Statistics lecture) • Multiply the p-value by the number of tests • So if the original SNP had p-value $p$, the new p-value is defined as $p' = p \times N$, where $N$ is the number of tests • With $N = 10^6$, a p-value of $p$ is downgraded to $10^6 p$; a SNP that is still significant after this correction is quite good! • False Discovery Rate (seen in Statistics lecture)
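
Both corrections can be sketched in a few lines. Bonferroni multiplies each p-value by the number of tests (capped at 1). For the False Discovery Rate, the sketch assumes the Benjamini-Hochberg procedure, which is one common FDR method; the slides do not say which one the Statistics lecture covered. The p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * N, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five SNPs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
for p, pb, pq in zip(p_values, bonf, bh):
    print(f"raw = {p:.3f}  Bonferroni = {pb:.3f}  BH = {pq:.5f}")
```

Bonferroni controls the chance of even one false positive and is therefore stricter; the FDR approach tolerates a controlled fraction of false positives among the reported SNPs.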

BEYOND SINGLE LOCUS • So far we have tested each SNP separately; however, recall our hypothesis that common diseases are influenced by multiple common variants together • Maybe considering two SNPs together will identify a stronger correlation with phenotype • Main problem: Number of pairs ~ $N^2/2$ for $N$ SNPs (about $5 \times 10^{11}$ pairs for 1 Million SNPs)
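
The scale of the problem is easy to check with a short calculation. This sketch assumes a panel of 1 Million SNPs (the microarray size mentioned earlier) and a 0.05 significance level; it counts the unordered SNP pairs and the per-test threshold a Bonferroni correction would then require.

```python
from math import comb

n_snps = 1_000_000          # assumed panel size
alpha = 0.05                # assumed significance level

n_pairs = comb(n_snps, 2)   # number of unordered SNP pairs
single_threshold = alpha / n_snps
pair_threshold = alpha / n_pairs

print(f"pairs to test: {n_pairs:,}")
print(f"Bonferroni threshold, single SNPs: {single_threshold:.1e}")
print(f"Bonferroni threshold, SNP pairs:   {pair_threshold:.1e}")
```

Besides the computational cost, the much smaller per-test threshold means a pairwise effect must be very strong to remain significant.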

BEYOND PROBED SNPS • Further consider that in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs) • But there are many more sites in the human genome where variation may exist. Will we then miss any causal variant outside the panel of ~1 Million? • Not necessarily

Linkage disequilibrium • Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD) • In this situation, a lack of recombination events between the two sites has made their inheritance dependent • If two such sites have high LD, then one site can serve as a proxy for the other

Linkage disequilibrium • So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y • In this way a reduced panel can represent a larger number (all?) of the common SNPs
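
One common way to quantify how well site X stands in for site Y is the squared correlation r² between their alleles. The sketch below assumes phased haplotypes are available (each string gives the allele at X followed by the allele at Y on one chromosome); the haplotypes and the function name `ld_r_squared` are hypothetical.

```python
def ld_r_squared(haplotypes, allele_x, allele_y):
    """r^2 between allele_x at site X and allele_y at site Y.

    D = P(xy) - P(x) * P(y); r^2 = D^2 / (P(x)(1-P(x)) P(y)(1-P(y))).
    Both sites must be polymorphic in the sample.
    """
    n = len(haplotypes)
    p_x = sum(h[0] == allele_x for h in haplotypes) / n
    p_y = sum(h[1] == allele_y for h in haplotypes) / n
    p_xy = sum(h[0] == allele_x and h[1] == allele_y for h in haplotypes) / n
    d = p_xy - p_x * p_y
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))


# Hypothetical phased haplotypes: first character = allele at X,
# second character = allele at Y.
haplotypes = ["AG", "AG", "AG", "TC", "TC", "TC", "AC", "TC"]

r2 = ld_r_squared(haplotypes, "A", "G")
print(f"r^2 between X and Y = {r2:.2f}")
```

An r² of 1 would mean that the allele at X predicts the allele at Y perfectly; values near 0 mean X carries little information about Y.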

Linkage disequilibrium • A problem is that if X correlates with a disease, the causal variant may be either X or Y

GWAS: DISCUSSION • In many cases, able to find SNPs that have significant association with disease. • GWAS Catalog : http://www.genome.gov/26525384 • Yet, final predictive power (ability to predict disease from genotype) is limited for complex diseases. • “Finding the Missing Heritability of Complex Diseases” http://www.genome.gov/27534229

GWAS: DISCUSSION • Increasingly, whole-exome and even whole-genome sequencing is used for variant detection • Taking on the non-coding variants: use functional genomics data as a template • Network-based analysis rather than single-site or site-pairs analysis • Complement GWAS with family-based studies

FUNCTIONAL EFFECTS • How do we predict how a variant is likely to be affecting protein function?

FUNCTIONAL EFFECTS • Case: • I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP. • Will that impact the protein’s function? • (And I don’t quite know how the protein functions in the first place ...)
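
The first step in this case, deciding whether a coding SNP is synonymous, can be sketched with the standard genetic code. The function name `classify_coding_snp` and the example codons are illustrative; the sketch assumes the codon is given on the coding strand and in the correct reading frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, offset, alt_base):
    """Return (reference amino acid, variant amino acid, classification)."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[mutated]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# GAG -> GTG changes glutamate (E) to valine (V): an nsSNP.
print(classify_coding_snp("GAG", 1, "T"))
# GAA -> GAG still encodes glutamate (E): a synonymous change.
print(classify_coding_snp("GAA", 2, "G"))
```

Knowing that a change is non-synonymous is only the starting point; whether it affects function is what the prediction tools on the following slides try to estimate.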

FUNCTIONAL EFFECTS • Two popular approaches: • PolyPhen 2.0 • Adzhubei, I. A. et al. (2010). Nat Methods 7(4):248-249 • SIFT • Kumar P. et al., (2009). Nat Protoc 4(7):1073-1081

One popular method • PolyPhen 2.0

PolyPhen 2.0 • The PolyPhen 2.0 pipeline uses existing data sets for training and for later evaluation of target data. • Specifically the HumDiv database, which is • A compilation of all the damaging mutations with known effects on molecular function • A collection of non-damaging differences between human proteins and those of closely related mammalian homologs

PolyPhen 2.0 features

Multiple sequence alignments • A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:

Evolutionary conservation score • Of interest is the Position-Specific Independent Counts (PSIC) score. • This score reflects the amino acid’s frequency at the specific position in the sequence, given an MSA
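
As a simplified illustration of what "frequency at a specific position given an MSA" means, the sketch below computes raw per-column amino acid frequencies from a small, hypothetical alignment. This is deliberately not the PSIC calculation: PSIC replaces the raw counts with counts adjusted for closely related sequences, which this example does not attempt.

```python
from collections import Counter


def column_frequencies(msa, position):
    """Raw amino acid frequencies in one alignment column, ignoring gaps."""
    column = [seq[position] for seq in msa if seq[position] != "-"]
    counts = Counter(column)
    total = len(column)
    return {aa: count / total for aa, count in counts.items()}


# Hypothetical alignment of five short protein fragments ("-" = gap).
MSA = [
    "MKVLA",
    "MKILA",
    "MRVLA",
    "MKVLS",
    "MKV-A",
]

for position in range(len(MSA[0])):
    print(position, column_frequencies(MSA, position))
```

A variant that introduces an amino acid rarely or never seen in its column (for example I at position 2 here) sits at a less tolerant site than one that matches a frequently observed residue.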

PSIC score • Example:

PSIC score • To derive the PSIC score we first calculate the frequency of each amino acid:

PSIC score • The idea: • The frequency is not based on the raw count of amino acid “a” at position $i$; rather, the count is adjusted for the many closely related sequences in the MSA • The PSIC score of a SNP at position $i$ is given by:

What to do with the score? • Ultimately your derived score can be compared with the existing scores from HumDiv

Combining all features of a SNP to predict impact: machine learning • Classification • Naive Bayes method • A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest”, etc. • Sometimes called “Machine Learning” • What is a classification algorithm? • What is a Naive Bayes method/classifier?

Classification or ‘supervised learning’ • Training data: positive examples (+, +, …) and negative examples (-, -, …) • “Supervised learning” turns the training data into a MODEL
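
To make the idea of learning a model from positive and negative examples concrete, here is a toy Naive Bayes classifier with binary features and add-one smoothing. It is a sketch of the general technique, not the PolyPhen 2.0 implementation. The three features (conserved position, inside a functional domain, radical amino acid change) and all training examples are hypothetical.

```python
from math import log


def train_naive_bayes(examples, labels):
    """Fit a Bernoulli Naive Bayes model with add-one smoothing.

    Returns {label: (prior, [P(feature_j = 1 | label), ...])}.
    """
    n_features = len(examples[0])
    model = {}
    for label in set(labels):
        rows = [x for x, y in zip(examples, labels) if y == label]
        prior = len(rows) / len(labels)
        probs = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[label] = (prior, probs)
    return model


def predict(model, x):
    """Return the label with the highest posterior for feature vector x."""
    def log_posterior(label):
        prior, probs = model[label]
        score = log(prior)
        # "Naive" assumption: features are independent given the label.
        for xj, pj in zip(x, probs):
            score += log(pj if xj else 1 - pj)
        return score
    return max(model, key=log_posterior)


# Hypothetical training data: (conserved, in_domain, radical_change).
examples = [
    (1, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1),   # positive: damaging
    (0, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0),   # negative: benign
]
labels = ["+", "+", "+", "+", "-", "-", "-", "-"]

model = train_naive_bayes(examples, labels)
print(predict(model, (1, 1, 1)))   # a new variant with all three features
print(predict(model, (0, 0, 0)))   # a new variant with none of them
```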
