
POLYMORPHISMS & ASSOCIATION TESTS

POLYMORPHISMS & ASSOCIATION TESTS. Saurabh Sinha, Mayo-Illinois Computational Genomics Workshop, June 14, 2019. Acknowledgment for some slides to Arián Avalos. Outline: Molecular Markers; Genome Wide Association Studies (GWAS); Functional Effects.


Presentation Transcript


  1. POLYMORPHISMS & ASSOCIATION TESTS Saurabh Sinha Mayo-Illinois Computational Genomics Workshop June 14, 2019 Acknowledgment for some slides to Arián Avalos

  2. OUTLINE • Molecular Markers • Genome Wide Association Studies (GWAS) • Functional Effects

  3. MOLECULAR MARKERS • What is a SNP and a SNV? • Single Nucleotide Polymorphism • Single Nucleotide Variant I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT

  4. MOLECULAR MARKERS • A SNV is any single-nucleotide change (e.g. a somatic mutation, even an artifact). • A SNP has defining criteria • It is a polymorphic SNV, with “major” and “minor” alleles • Sometimes defined by a frequency threshold (e.g. minor allele frequency of at least 5%) • For reference, the 1000 Genomes project identified ~41 million SNPs across ~1000 individuals.
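The "minor allele" criterion can be illustrated on the eight sequences from slide 3. The following is a minimal sketch, assuming each individual is represented by a single (haploid) sequence as on the slide; the function names are illustrative and not part of the original deck.

```python
from collections import Counter

# The two distinct sequences shown on slide 3.
REF = "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT"  # I1, I3, I4, I6, I7, I8
ALT = "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT"  # I2, I5
individuals = [REF, ALT, REF, REF, ALT, REF, REF, REF]  # I1..I8


def polymorphic_sites(seqs):
    """Return {position: Counter of alleles} for columns with more than one allele."""
    sites = {}
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:
            sites[pos] = counts
    return sites


def minor_allele_frequency(counts):
    """Frequency of the least common allele at one site."""
    return min(counts.values()) / sum(counts.values())


sites = polymorphic_sites(individuals)
for pos, counts in sites.items():
    print(pos, dict(counts), minor_allele_frequency(counts))
```

With a 5% threshold, the single variable site in this toy sample would qualify as a SNP, since its minor allele (A) is carried by 2 of 8 individuals.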

  5. MOLECULAR MARKERS • Both types of variants are relevant depending on the field • Population geneticists conducting association tests will focus on SNPs • Cancer geneticists will instead be interested in SNVs • The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)

  6. GENETIC LINKAGE ANALYSIS • Example: • Cystic Fibrosis and the CFTR gene mutations • Approach: Genetic Linkage Analysis • Genotype family members (some individuals carrying the disease) • Find a marker that correlates with the disease • Disease gene lies close to this marker

  7. GENETIC LINKAGE ANALYSIS • Limitations of Genetic Linkage Analysis • Requires data from entire families, preferably large ones, where the trait is segregating • Linkage analysis is less successful with common diseases, e.g., heart disease or cancers. • Requires single, large-effect loci

  8. GENOME-WIDE ASSOCIATION STUDIES (GWAS)

  9. GWAS • Hypothesize that common diseases are influenced by common genetic variation in the population • Implications: • Any individual variation (SNP) will have relatively small correlation with the disease • Multiple common alleles together influence the disease phenotype • This argues for population- rather than family-based studies.

  10. GWAS Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822

  11. GWAS: Resources • Zhang X. et al. (2012). PLoS Comput Biol 8(12): e1002828. • Bush W. S. & Moore J. H. (2012). PLoS Comput Biol 8(12): e1002822.

  12. GWAS: Genotyping • Microarray – can assay 0.5 – 1.0 Million or more SNPs • Whole-genome sequencing (WGS) – assays (near) complete SNP profile • In non-human genetics, reduced-representation methods provide a middle-ground.

  13. GWAS: Phenotyping • Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease) • Quantitative – continuous measure, usually of a complex phenotype (e.g. blood pressure, LDL levels) • Possible to look at more than one phenotype?

  14. GWAS • Case / Control Disease? I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT + I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT + I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT -

  15. GWAS • Before analysis and interpretations a few considerations: • Correlation is not causation

  16. GWAS • Before analysis and interpretations a few considerations: • Correlation is not causation • Linkage disequilibrium (see later) • Population structure (see later) • Phenotyping

  17. GWAS • Further consider that even if the analysis is successful, findings can be hard to interpret • Example: • SNP correlates well with heart disease • Biochemical link? Behavioral link (you particularly like bacon…)?

  18. GWAS: Statistics • Case vs. Control I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I2: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT + I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I8: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT + I9: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I10: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT + I11: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I12: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT - I13: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT - I14: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT +

  19. GWAS: Statistics • Case vs. Control • The Fisher’s Exact Test • All individuals: 14 • Cases: 4 • Individuals carrying A: 4 • Cases carrying A: 3 • p-value < 0.05
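The counts behind this test can be recomputed from the 14 individuals on slide 18. The sketch below assumes a one-sided Fisher's exact test (the probability of seeing at least as many cases among carriers of A as observed, under the hypergeometric null); the slide does not say whether a one-sided or two-sided test was used, so that choice is an assumption.

```python
from math import comb

# Allele at the polymorphic site and disease status for I1..I14 (slide 18).
alleles = "TTTTATTATTTTAA"
status = "----+--+-+---+"

n_total = len(alleles)
n_case = status.count("+")
n_carrier = alleles.count("A")
n_both = sum(1 for a, s in zip(alleles, status) if a == "A" and s == "+")


def fisher_one_sided(n_total, n_case, n_carrier, n_both):
    """P(at least n_both carriers among the cases) under the hypergeometric null."""
    upper = min(n_case, n_carrier)
    favourable = sum(
        comb(n_carrier, k) * comb(n_total - n_carrier, n_case - k)
        for k in range(n_both, upper + 1)
    )
    return favourable / comb(n_total, n_case)


p_value = fisher_one_sided(n_total, n_case, n_carrier, n_both)
print(n_total, n_case, n_carrier, n_both, p_value)
```

The result is 41/1001, roughly 0.041, which agrees with the "p-value < 0.05" shown on the slide.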

  20. GWAS: Statistics • A favored alternative to the Fisher’s Exact Test is the Chi-Squared Test. • We conduct this test on EACH SNP separately, and get a corresponding p-value. • The smallest p-values point to the SNPs most associated with the disease.
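For comparison, here is a minimal sketch of the Chi-Squared Test on the same 2x2 table derived from slide 18 (carriers of A: 3 cases, 1 control; non-carriers: 1 case, 9 controls). It assumes the plain Pearson statistic without continuity correction; with counts this small the chi-squared approximation is rough, so the numbers are illustrative only.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))


# Rows: carriers of A, non-carriers. Columns: cases, controls.
chi2 = chi_squared_2x2(3, 1, 1, 9)
p_value = p_value_1df(chi2)
print(chi2, p_value)
```

In a GWAS the same two functions would be applied to the table of every SNP, and the SNPs then ranked by p-value.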

  21. GWAS: Statistics • Both Fisher’s Exact Test and the Chi-Squared Test, as used above, are allelic association tests, i.e. we test if A instead of T at the polymorphic site correlates with the disease. • In a genotypic association test each position is a combination of two alleles, e.g. AA, TT, AT • We therefore correlate genotype with phenotype of the individual

  22. GWAS: Statistics • There are various options for a Case vs. Control genotypic association test • Example: • Dominant Model

  23. GWAS: Statistics • There are various options for a Case vs. Control genotypic association test • Example: • Dominant Model • Recessive Model

  24. GWAS: Statistics • There are various options for a Case vs. Control genotypic association test • Example: • Dominant Model • Recessive Model • 2x3 Table Chi-Squared Test
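The three options differ only in how the genotype columns are grouped before testing. The sketch below uses made-up genotype counts (not from the slides) and assumes A is the allele of interest: the dominant model compares carriers of at least one A against TT, the recessive model compares AA against everyone else, and the 2x3 test keeps all three genotypes separate.

```python
# Hypothetical genotype counts, for illustration only.
counts = {
    "case": {"AA": 10, "AT": 30, "TT": 60},
    "control": {"AA": 2, "AT": 18, "TT": 80},
}


def collapse(counts, model, risk="A", other="T"):
    """Collapse genotype counts into two columns under a dominant or recessive model."""
    hom_risk, het, hom_other = risk * 2, risk + other, other * 2
    table = {}
    for group, g in counts.items():
        if model == "dominant":  # one copy of the risk allele is enough
            table[group] = (g[hom_risk] + g[het], g[hom_other])
        elif model == "recessive":  # two copies are required
            table[group] = (g[hom_risk], g[het] + g[hom_other])
        else:
            raise ValueError(f"unknown model: {model}")
    return table


def chi_squared(rows):
    """Pearson chi-squared statistic for any r x c table of counts."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


dominant = collapse(counts, "dominant")
recessive = collapse(counts, "recessive")
full_table = [list(counts[g].values()) for g in ("case", "control")]

print(chi_squared([dominant["case"], dominant["control"]]))    # 1 degree of freedom
print(chi_squared([recessive["case"], recessive["control"]]))  # 1 degree of freedom
print(chi_squared(full_table))                                 # 2 degrees of freedom
```

The collapsed tables have one degree of freedom and the full 2x3 table has two, so the statistics must be compared against different chi-squared distributions.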

  25. GWAS: Statistics • Quantitative Phenotypes • If there is no association, the slope of phenotype against genotype is close to zero • The more the slope departs from zero, the stronger the association • This is called linear regression
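A minimal sketch of such a regression, assuming the genotype is coded as the number of copies of one allele (0, 1 or 2) and using invented phenotype values; the coding scheme and the data are assumptions for illustration, not taken from the slides.

```python
def least_squares(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: copies of allele A per individual and a quantitative trait.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [0.9, 1.1, 2.9, 3.1, 4.9, 5.1]

slope, intercept = least_squares(genotype, phenotype)
print(slope, intercept)
```

A slope near zero would indicate no association; in practice the slope is reported together with a standard error and p-value, which this sketch omits.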

  26. GWAS: Statistics • Quantitative Phenotypes • Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA) • Statistical models for GWAS can get quite involved (can give references on request)

  27. GWAS: Statistics Lambert et al., 2013: Nature Genetics 45, 1452

  28. GWAS: Statistics • Multiple Hypothesis Correction • What does p-value = 0.01 mean? • It means that, if there were no true association, a Genotype x Phenotype correlation at least this strong would arise just by chance with only 1% probability. • What if we repeat the test for 1 Million SNPs? Of those tests, about 1% (10,000 SNPs) are expected to show this level of correlation, just by chance (and by definition)

  29. GWAS: Statistics • Multiple Hypothesis Correction • Bonferroni (seen in statistics lecture) • Multiply the p-value by the number of tests • So if the original SNP had p-value p and N tests were performed, the new p-value is defined as p' = N × p (capped at 1) • With N in the millions, only a very small original p-value remains significant after correction • False Discovery Rate (seen in statistics lecture)
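Both corrections are short enough to sketch directly. The example assumes the Benjamini-Hochberg procedure as the False Discovery Rate method (the slide does not name a specific procedure) and uses four made-up p-values; the 1 Million figure is the number of SNPs mentioned on the previous slide.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value stays monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of chance hits at p = 0.01 across 1 Million SNPs.
n_tests = 1_000_000
expected_false_positives = round(0.01 * n_tests)
print(expected_false_positives)

pvalues = [0.01, 0.04, 0.03, 0.005]  # hypothetical
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two; the False Discovery Rate approach tolerates a controlled fraction of false positives among the reported hits.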

  30. BEYOND SINGLE LOCUS • So far we have tested each SNP separately. However, recall our hypothesis that common diseases are influenced by multiple common variants • Maybe considering two SNPs together will identify a stronger correlation with phenotype • Main problem: the number of pairs grows quadratically, with N SNPs giving N(N-1)/2 pairs
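To put a number on the pair count, the following sketch assumes a panel of 1 Million SNPs (the upper microarray figure quoted elsewhere in the deck) and a nominal significance level of 0.05, which is an assumption made here for illustration.

```python
from math import comb

n_snps = 1_000_000
n_pairs = comb(n_snps, 2)  # number of unordered SNP pairs
print(n_pairs)

# Bonferroni-style per-test threshold if every pair were tested at overall level 0.05.
per_pair_threshold = 0.05 / n_pairs
print(per_pair_threshold)
```

Roughly 5 x 10^11 tests is a burden both computationally and statistically, since the per-test threshold shrinks to about 10^-13.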

  31. BEYOND PROBED SNPS • Further consider that in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs) • But there are many more sites in the human genome where variation may exist. Will we then miss any causal variant outside the panel of ~1 Million? • Not necessarily

  32. Linkage disequilibrium • Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD) • In this situation, a lack of recombination events has made the inheritance of those two sites dependent • If two such sites have high LD, then one site can serve as proxy for the other

  33. Linkage disequilibrium • So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y • In this way a reduced panel can represent a larger number (all?) of the common SNPs

  34. Linkage disequilibrium • A problem is that if X correlates with a disease, the causal variant may be either X or Y
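One common way to quantify LD between two sites X and Y is the squared correlation r² between their alleles. The slides do not name a specific LD measure, so using r² here is an assumption, and the haplotypes below are invented for illustration.

```python
def ld_r_squared(haplotypes, allele_x, allele_y):
    """r^2 between allele_x at site X and allele_y at site Y.

    haplotypes is a list of (allele at X, allele at Y) pairs, one per chromosome.
    Both sites must be polymorphic in the sample, otherwise the denominator is zero.
    """
    n = len(haplotypes)
    p_x = sum(1 for x, _ in haplotypes if x == allele_x) / n
    p_y = sum(1 for _, y in haplotypes if y == allele_y) / n
    p_xy = sum(1 for x, y in haplotypes if x == allele_x and y == allele_y) / n
    d = p_xy - p_x * p_y  # deviation from independence
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))


# Hypothetical haplotypes: X and Y always travel together (perfect LD).
linked = [("A", "G")] * 3 + [("T", "C")] * 5
# Hypothetical haplotypes: every combination equally common (no LD).
unlinked = [("A", "G"), ("A", "C"), ("T", "G"), ("T", "C")]

print(ld_r_squared(linked, "A", "G"))
print(ld_r_squared(unlinked, "A", "G"))
```

An r² close to 1 means that genotyping X effectively reveals Y, which is why a reduced panel can stand in for many untyped SNPs, and also why an association at X cannot by itself distinguish X from Y as the causal variant.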

  35. GWAS: Discussion • In many cases, GWAS are able to find SNPs that have significant association with disease. • GWAS Catalog : http://www.genome.gov/26525384 • Yet, final predictive power (ability to predict disease from genotype) is limited for complex diseases. • “Finding the Missing Heritability of Complex Diseases” http://www.genome.gov/27534229

  36. GWAS: Discussion • Increasingly, whole-exome and even whole-genome sequencing are used for variant detection • Taking on the non-coding variants. Use functional genomics data as template • Network-based analysis rather than single-site or site-pairs analysis • Complement GWAS with family-based studies

  37. FUNCTIONAL EFFECTS • How do we predict how a variant is likely to be affecting protein function?

  38. FUNCTIONAL EFFECTS • Case: • I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP. • Will that impact the protein’s function? • (And I don’t quite know how the protein functions in the first place ...)

  39. FUNCTIONAL EFFECTS • Two popular approaches: • PolyPhen 2.0 • Adzhubei I. A. et al. (2010). Nat Methods 7(4):248-249 • SIFT • Kumar P. et al. (2009). Nat Protoc 4(7):1073-1081

  40. One popular method • PolyPhen 2.0

  41. PolyPhen 2.0 • The PolyPhen 2.0 pipeline uses existing data sets for training and later evaluation of target data. • Specifically the HumDiv database, which is • A compilation of all the damaging mutations with known effects on molecular function • A collection of non-damaging differences between human proteins and those of closely related mammalian homologs

  42. PolyPhen 2.0 features

  43. Multiple sequence alignments • A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:

  44. Evolutionary conservation score • Of interest is the Position-Specific Independent Counts (PSIC) score. • This score reflects the amino acid’s frequency at the specific position in the sequence, given an MSA

  45. PSIC score • Example:

  46. PSIC score • To derive the PSIC score we first calculate the frequency of each amino acid:

  47. PSIC score • The idea: • The frequency of amino acid a at position i is not the raw count of a at position i; rather, it is adjusted for the many closely related sequences in the MSA • The PSIC score of a SNP at position i is given by:
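As a starting point only, the sketch below computes the raw per-position amino acid frequencies from a toy alignment. It deliberately does not implement the adjustment for closely related sequences that distinguishes PSIC from raw counts, and the alignment itself is invented for illustration.

```python
from collections import Counter

# Hypothetical multiple sequence alignment (equal-length rows, no gaps).
msa = ["MKVLA", "MKVLA", "MKILA", "MRVLG"]


def column_frequencies(msa, position):
    """Raw frequency of each amino acid in one alignment column."""
    column = [seq[position] for seq in msa]
    counts = Counter(column)
    return {aa: c / len(column) for aa, c in counts.items()}


for position in range(len(msa[0])):
    print(position, column_frequencies(msa, position))
```

Note how the two identical rows each count fully toward the raw frequencies; near-duplicate sequences inflating the counts in this way is the effect that the PSIC adjustment is meant to correct.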

  48. What to do with the score? • Ultimately your derived score can be compared with the existing scores from HumDiv

  49. Combining all features of SNP to predict impact: machine learning • Classification • Naive Bayes method • A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest” etc. • Sometimes called “Machine Learning” • What is a classification algorithm? • What is a Naive Bayes method/classifier?

  50. Classification or ‘supervised learning’ • Figure: Training Data, made up of positive examples (+) and negative examples (-), is passed to “Supervised Learning”, which produces a MODEL
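To make the idea of a Naive Bayes classifier concrete, here is a minimal sketch for binary features with Laplace smoothing. The two features and the six training examples are hypothetical and are not the features PolyPhen 2.0 actually uses; the point is only to show how labelled positive and negative examples become a model that classifies a new variant.

```python
from math import log


def train(X, y):
    """Estimate class priors and smoothed P(feature = 1 | class) from labelled data."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace smoothing: add one pseudo-count for each feature value.
        probs = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model


def predict(model, x):
    """Return the class with the highest posterior, treating features as independent."""
    def log_posterior(c):
        prior, probs = model[c]
        return log(prior) + sum(
            log(p if value else 1 - p) for p, value in zip(probs, x)
        )
    return max(model, key=log_posterior)


# Hypothetical features per variant: [at a conserved position, inside a known domain].
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = ["+", "+", "+", "-", "-", "-"]  # "+" damaging, "-" non-damaging

model = train(X, y)
print(predict(model, [1, 1]))
print(predict(model, [0, 0]))
```

The "naive" part is the assumption that features are independent given the class, which lets the per-feature probabilities simply be multiplied (summed in log space).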
