Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Module 8 – Variants to Networks Part 1 – How to annotate variants and prioritize potentially relevant ones Jüri Reimand Bioinformatics for Cancer Genomics May 25-29, 2015 Informatics and Biocomputing Ontario Institute for Cancer Research

Learning Objectives of Module I have detected somatic variants in a cancer sample. What information can I use to interpret them? • What variant annotations can I use? • How do impact prediction models work? • How to use an annotation tool: Annovar (LAB)

Introduction

Variant vs Gene Information We have to consider information at two levels: • Gene • Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation) • Is the gene sensitive to perturbation? (e.g. haploinsufficiency) • Variant • What is the variant effect on the gene product?

Integrating Different Lines of Evidence • Variant recurrence • Gene product function / pathway • Variant effect on the gene product

On Variant Size Small: 1-50 bp • SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect • Small In/dels: a bit more challenging to detect • Most available in databases; can be mapped by exact coordinates Medium: 100-1,000 bp • Insertions, Deletions, Translocations, Complex re-arrangements • Most challenging to detect • More tolerant mapping (e.g. partners of gene fusion) Large: > 5 kbp • Copy number variants relatively easy to detect using arrays, more challenging using next generation sequencing • More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
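The 50% reciprocal overlap rule mentioned for large variants can be sketched as follows. This is a minimal illustration and not taken from any particular tool: the function names and the half-open coordinate convention are assumptions, and both intervals are assumed to lie on the same chromosome.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the smaller of the two overlap fractions.

    Intervals are half-open [start, end) on the same chromosome (assumed).
    """
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    # Each interval must be covered by the shared segment, so take the minimum.
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def same_cnv(a, b, min_fraction=0.5):
    """Treat two (start, end) intervals as the same event if the rule holds."""
    return reciprocal_overlap(*a, *b) >= min_fraction


called = (100_000, 200_000)   # hypothetical CNV call
known = (140_000, 220_000)    # hypothetical database entry
print(same_cnv(called, known))
```

Taking the minimum of the two fractions prevents a small variant nested inside a much larger one from counting as a match.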

Variant Annotation Components • Variant database mapping • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, ExAC) • dbSNP (sequence variation database) • COSMIC (somatic variant database) • Gene mapping (coding/splicing, UTR, intergenic) • Gene product effect type (e.g. loss of function, missense) • Coding Missense Effect Scoring • SIFT • PolyPhen2 • MutationAssessor • Other Effect Scoring • PhyloP (conservation) • CADD • Splicing-regulatory predictions

Variant databases and allele frequencies

1000 Genomes (Phase 3) • Goal: • Identify all variants at > 1% frequency in represented human populations • Subjects: 2,504 • Apparently healthy • Ethnicities: caucasian European, admixed Latin Americans, African, South Asians, East Asians • Platform: Illumina • Low coverage (2-4x) whole genome • Exon (50x coverage)

NHLBI-ESP • Goal: • discover heart, lung and blood disorder variants at frequency < 1% • Subjects: 6,503 (ESP 6500 release) • Not necessarily healthy (includes individuals with extreme subclinical traits and diseased) • Ethnicities: 2,203 African-Americans, 4,300 European-Americans • Platform: Illumina, exome sequencing (average 110x)

ExAC (Exome Aggregation Consortium) • Goal: • Compile the largest set of exomes ever • Subjects: 60,706 (unrelated) • Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia and cancer, but removed individuals with severe pediatric disease • Ethnicities: non-Finnish European, Finnish, Latin Americans, African, South Asians, East Asians, Other • Platform: Illumina, exome • Variant calling: • GATK

dbSNP • Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar) • Submissions before and after NGS era • Includes polymorphisms found in the general population • Includes rare germline variants that are disease-associated (or suspected to be) • Includes somatic variants (also in COSMIC) • Good for looking up variants • If you want to use it as a filter, make sure you first remove “clinically flagged” variants (somatic, germline) so that they are not filtered out
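A population-frequency filter that protects clinically flagged variants might look like the sketch below. The field names (`af_1000g`, `af_esp`, `af_exac`, `clinically_flagged`), the example records and the 1% threshold are all assumptions chosen for illustration; real annotation output will use different column names.

```python
FREQUENCY_FIELDS = ("af_1000g", "af_esp", "af_exac")  # hypothetical field names


def is_common_polymorphism(variant, max_af=0.01):
    """True if any reference data-set reports the variant above max_af.

    Clinically flagged variants are never treated as common polymorphisms,
    so they survive the filter and can be reviewed by hand.
    """
    if variant.get("clinically_flagged", False):
        return False
    freqs = [variant[f] for f in FREQUENCY_FIELDS if variant.get(f) is not None]
    return any(af > max_af for af in freqs)


def filter_candidates(variants, max_af=0.01):
    """Keep variants that are rare, absent from the data-sets, or flagged."""
    return [v for v in variants if not is_common_polymorphism(v, max_af)]


# Made-up records for illustration only.
variants = [
    {"id": "common_snp", "af_1000g": 0.12, "af_exac": 0.10},
    {"id": "rare_variant", "af_exac": 0.0002},
    {"id": "flagged_variant", "af_1000g": 0.03, "clinically_flagged": True},
    {"id": "novel_variant"},
]
kept = [v["id"] for v in filter_candidates(variants)]
print(kept)
```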

COSMIC • “Catalogue Of Somatic Mutations In Cancer” • Reference database for somatic variation in cancer • Worth following up variants matching COSMIC entries • How many studies/samples was it found in? One, or many? • Does the variant overlap a hotspot? • Is the gene frequently mutated?
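The recurrence questions above (how many samples, hotspot, frequently mutated gene) reduce to counting. The sketch below assumes a flat list of `(sample_id, gene, protein_position)` tuples; that layout and the toy gene names are hypothetical and do not reflect the COSMIC export format.

```python
from collections import Counter


def recurrence(observations):
    """Count distinct samples per protein position and per gene.

    observations: iterable of (sample_id, gene, protein_position) tuples.
    """
    unique = set(observations)  # ignore duplicated rows
    samples_per_position = Counter((gene, pos) for _, gene, pos in unique)
    # A sample with several mutations in one gene counts once for that gene.
    samples_per_gene = Counter(gene for _, gene in {(s, g) for s, g, _ in unique})
    return samples_per_position, samples_per_gene


# Made-up observations for illustration only.
observations = [
    ("S1", "GENE_A", 600),
    ("S2", "GENE_A", 600),
    ("S3", "GENE_A", 600),
    ("S3", "GENE_A", 469),
    ("S4", "GENE_B", 12),
]
by_position, by_gene = recurrence(observations)
print(by_position.most_common(1))
```

A position seen in many samples suggests a hotspot, while a gene mutated in many samples at scattered positions points to a frequently mutated gene.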

Gene mapping

Gene Mapping: Types of Genes Types of genes: • Protein-coding genes • Non-protein-coding RNA genes (e.g. miRNA) • Different functional relevance • Different knowledge of variant effects

Gene Mapping: Parts of Genes • Protein-coding genes have these parts: • UTR (transcribed, not translated) • Coding exons (translated) • Introns (spliced out, not translated) • Splice sites Also: • Upstream, downstream of the transcribed gene • Inter-genic

Gene Mapping: Annovar’s priority system • Gene types and parts: what if they overlap..? • Whenever more than one mapping is possible, Annovar will follow this priority system • You can also ask Annovar to report all possible effects

Gene Mapping: Annovar’s priority system [Figure: protein-coding gene G1 with its TSS (Transcription Start Site) marked, and non-coding RNA gene ncR1 (e.g. miRNA)]

Gene Mapping: Annovar’s priority system [Figure: the same locus with each region labelled by the annotation reported. Labels shown: G1 Intronic, G1 Upstream, ncR1, G1 Exonic, G1 UTR 3’, G1 UTR 5’, G1 Downstream, ncR1 Downstream, G1 Splicing. Splice sites after the first were omitted to avoid clutter.]
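The priority idea can be sketched as a simple rank lookup. The ordering used here (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic) is the precedence described in the ANNOVAR gene-based annotation documentation as I understand it; check it against the version you run. The function and the tuple layout are illustrative assumptions, not Annovar code.

```python
# Lower rank wins. Ordering assumed from the ANNOVAR documentation.
PRECEDENCE = {
    "exonic": 1, "splicing": 1,
    "ncRNA": 2,
    "UTR5": 3, "UTR3": 3,
    "intronic": 4,
    "upstream": 5, "downstream": 5,
    "intergenic": 6,
}


def prioritize(mappings):
    """Keep only the highest-priority (region, gene) mappings for a variant."""
    best = min(PRECEDENCE[region] for region, _ in mappings)
    return [(region, gene) for region, gene in mappings if PRECEDENCE[region] == best]


# A variant inside an intron of G1 that also hits the non-coding RNA ncR1.
print(prioritize([("intronic", "G1"), ("ncRNA", "ncR1")]))
# A variant downstream of ncR1 that falls in an exon of G1.
print(prioritize([("downstream", "ncR1"), ("exonic", "G1")]))
```

Ties are kept rather than broken, which mirrors the idea that a variant can be both exonic and splicing at once.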

Gene Mapping: Database • Goal: map our variants to (coding and non-coding) genes • RefSeq is the suggested database for transcribed gene and coding sequence definition • In the lab we will use Annovar with the RefSeq database • Other databases available: UCSC known genes, Ensembl

Gene product effect type

Gene Product Effect • Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed) • Protein-coding sequences: how is the protein sequence affected? • Definitely easier to chase after protein effects • But don’t forget that other gene products exist…

Gene Product Effect: Protein-coding • Stop-gain SNV: adds a STOP codon → truncated protein • Frameshift In/Del: shifts the reading frame → protein translated incorrectly from that point • Splicing: alters key sites guiding splicing • In-frame In/Del: removes/adds one or more amino acids • Stop-loss: loss of STOP codon → extra piece in the protein • Missense SNV: modifies one amino acid • Synonymous: no amino acid change
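The coding effect types above follow directly from the genetic code. The sketch below classifies a single-codon substitution and a simple in/del by length; it uses the standard codon table and assumes codons are given on the coding strand. It deliberately ignores splicing, multiple transcripts and complex in/dels, and the function names are illustrative.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(ref_codon, alt_codon):
    """Classify a substitution from reference and altered codon ('*' = STOP)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


def classify_indel(ref_length, alt_length):
    """In-frame if the length change is a multiple of 3, else frameshift."""
    return "in-frame" if (alt_length - ref_length) % 3 == 0 else "frameshift"


print(classify_snv("GTG", "GAG"))  # valine to glutamate
print(classify_snv("TGG", "TGA"))  # tryptophan to STOP
print(classify_indel(4, 1))        # 3 bp deletion
```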

Loss of Function (LOF) Variants Definition: Stop-gain, Frameshift, Splicing These are the more disruptive, BUT: • What percentage of the protein is affected? • Are there multiple transcript isoforms? • Splicing effect difficult to predict • Cryptic splice sites • Frameshift can be rescued by another frameshift or bypassed by splicing

Missense Variants: Tell Me More.. • How do we tell if a missense alters protein function? • Type of amino acid change (amino acid groups) • Conservation across species • Conserved protein domain • Secondary protein structure • Tertiary (3D) protein structure + simulation • Other functional features (e.g. phosphosite) • Machine learning model tying all of these together • What training set?

Missense Example: Back to BRAF • BRAF V600E (T>A): somatic, pathogenic • BRAF V600A (T>C): somatic / germline, pathogenicity untested

Conservation and Missense Variant Scoring Models

Conservation • Conservation is a powerful and broadly used idea • How conserved is a given nucleotide or genomic interval, comparing different species to human? • How conserved is an amino acid in a protein sequence? • Available from UCSC (nucleotide conservation): • PhyloP score – useful to assess single variants • PhastCons score/element – useful to assess putative regulatory regions and genes not coding proteins • Multi-species alignment – generally useful

Look for coding exons, UTRs and third nucleotide within codons

PhyloP • PhyloP: test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift • Only where aligned sequence available! • PhyloP score • Positive: conserved (e.g. PhyloP > 2) • Zero: neutral • Negative: more diverged than neutral • Species group: • All vertebrates • Only placental mammals • Only primates
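The sign convention of the PhyloP score can be captured in a small helper. The cutoff of 2 is the example value from the slide, not a universal threshold, and the category labels and function name are illustrative assumptions.

```python
def interpret_phylop(score, conserved_cutoff=2.0):
    """Map a PhyloP score to a coarse label.

    score is None when no aligned sequence is available at the position.
    """
    if score is None:
        return "no alignment"
    if score > conserved_cutoff:
        return "conserved"
    if score > 0:
        return "weakly conserved"
    if score == 0:
        return "neutral"
    return "faster than neutral"


for value in (5.1, 0.4, 0, -1.3, None):
    print(value, interpret_phylop(value))
```

Returning a separate label for missing scores keeps unaligned positions from being mistaken for neutral ones.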

Conservation • Main caveat: • Conservation at a given position does not directly tell you the effect of your variant; it only tells you whether the position is important!

Missense Variant Effect: Scoring Models Overview Criteria to keep in mind: • What features are used? • Nucleotide / amino acid conservation • Amino acid physicochemical properties • Direct scoring versus Machine learning • Machine learning models are heavily dependent on the training-set used • What data-set was used for assessment / learning / optimization? • E.g. Activating / gain-of-function versus inactivating / loss-of-function mutations • E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some are unique to cancer, e.g. drug resistance)

SIFT • Broadly used, relatively old (first published: 2001) • Designed to detect deleterious mutations (i.e. disruptive of protein function) • Based solely on protein sequence (amino acid) conservation • Start from query protein sequence • Identify similar protein sequences (PSI-BLAST) • Multiple alignment of protein sequences (orthologs and paralogs) • Amino acid x residue probability matrix (PSSM) • For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency) → Score: probability of observing the amino acid, normalized by residue conservation • Cut-off: 0.05 (based on case studies); scores below the cut-off are predicted deleterious Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
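The idea of a normalized amino acid probability with a 0.05 cut-off can be illustrated on a single alignment column. This toy is not the SIFT algorithm: it uses raw counts from one made-up column, with no PSI-BLAST search, sequence weighting or pseudocounts, and it normalizes by the most frequent amino acid as a stand-in for SIFT's scaling. All names are hypothetical.

```python
from collections import Counter


def scaled_probabilities(column):
    """Scale each amino acid's frequency by the most frequent one (gaps ignored)."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Return (score, label); amino acids never observed score 0."""
    score = scaled_probabilities(column).get(alt_aa, 0.0)
    return score, ("deleterious" if score < cutoff else "tolerated")


# Made-up alignment column: 38 sequences with V, 2 with I.
column = "V" * 38 + "I" * 2
print(predict(column, "I"))
print(predict(column, "E"))
```

The toy also shows a known limitation of conservation-only scoring: with few aligned sequences, any unobserved amino acid is called deleterious.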

PolyPhen2 • Integrates multiple features • 8 sequence-based, 3 structure-based (nucleotide and amino acid level) (e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics) • Machine learning method (Naïve Bayes) → Requires training set • Set 1: HumDiv • Positive: damaging alleles for known Mendelian disorders (Uniprot) • Negative: nondamaging differences between human proteins and related mammalian homologs • Performance (5-fold cross-validation): (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%) • Set 2: HumVar • Positive: all human disease-causing mutations (Uniprot) • Negative: non-synonymous SNPs without disease association • Richer model than SIFT • More biased towards training set(s) than SIFT A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.

MutationAssessor • Direct / theoretical model (no machine learning) • Based on amino acid conservation also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT) • Entropy-based score based on protein sequence alignment • Performs well for (recurrent) somatic variants Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118
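The building block behind an entropy-based conservation score is the Shannon entropy of an alignment column. The sketch below shows only that building block; it is not the MutationAssessor score, which additionally models conservation specific to protein subfamilies. Function names and example columns are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the amino acids in one alignment column.

    Gaps ('-') are ignored. 0 means fully conserved; higher means more variable.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(column_entropy("VVVVVVVV"))  # fully conserved column
print(column_entropy("VVVVEEEE"))  # two amino acids, equally frequent
print(column_entropy("VILMFAGT"))  # eight different amino acids
```

Low entropy marks positions where a substitution is more likely to matter, which is the same intuition as the conservation caveat: it flags the position, not the effect of a specific change.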

CADD • Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation • However, does not model gene-specific constraint in detail • Machine learning model (Linear SVM) • Negative training set: nearly fixed human alleles, variant if compared to inferred human-chimp ancestral genome • Positive training set: simulated variants based on mutation model aware of sequence context and primate substitution rates • Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, ENCODE tracks → includes missense predictions and nucleotide-level conservation • Performance assessment: using pathogenic variants from ClinVar, performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.

CADD Pathogenic ClinVar vs NHLBI-ESP > 5%

Splicing Regulatory Predictions • Goal: predict how SNVs affect exon inclusion / exclusion • Strategy: • Learn “Wild Type” splicing code based on reference genome sequence motifs and experimentally-measured splicing patterns in human tissues • “Mutant” code: predicts splicing change when variant alters splicing-guiding sequence motif • Does not learn based on known disease splicing alterations Science 2015

Phosphorylation and other protein modifications • Post-translational modifications (PTMs) extend protein function • Human: >130,000 PTM sites, 12% of protein sequence • Enriched in inherited disease and somatic cancer mutations • Negatively selected in population • Often not detected with mutation assessment tools Reimand et al, 2013 Mol Sys Bio; 2015 PLOS Genet

Effect Scoring: Concluding Remarks • Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected • Missense scoring models are powerful, but their strengths and weaknesses need to be understood • Variants should always be reviewed putting all information in context • Consider conservation and effect scores using different models • Review the amino acid change and sequence context • Look for clusters of somatic variants and protein domains • Don’t forget gene-level information!

We are on a Coffee Break & Networking Session
