BIG Data: Knowledge for Improving Vaccine Virus Selection

Richard H. Scheuermann, Ph.D., Director of Informatics, JCVI

Big Data

Big Data Volumes

Big Data in Biology

Big Data 3 V’s

Biological data types and analysis objectives
• Genomics
  • Data: nucleotide genome sequences, metagenomic sequences
  • Analysis: gene finding, functional annotation, sequence alignment, homology determination, comparative analysis, phylogenetic inferencing, association analysis, mutation functional prediction, species distribution analysis
• Transcriptomics
  • Data: RNA expression levels, transcription factor binding, chromatin structure information
  • Analysis: differential expression, clustering, functional enrichment, transcriptional regulation/causal reasoning
• Proteomics
  • Data: protein levels, protein structures, protein interactions
  • Analysis: protein identification, protein functional predictions, structural predictions, structural comparison, molecular dynamic simulation, mutation functional prediction, docking predictions, network analysis
• Metabolomics
  • Data: metabolite/small molecule levels
  • Analysis: pathway/network analysis
• Imaging
  • Data: microscopy images, MRI images, CT scans
  • Analysis: feature extraction, high content screening
• Cytometry
  • Data: cell levels, cell phenotypes
  • Analysis: cell population clustering, cell biomarker discovery
• Systems biology
  • Data: all of the above
  • Analysis: network analysis, causal reasoning, reverse causal reasoning, drug target prediction, regulatory network analysis, information flow, population dynamics, modeling and simulation

Variety

No Variety

Big Data
Volume + Variety = Value
Variety = Metadata

DMID Genomics (courtesy of Alison Yao, DMID)

Bioinformatics Resource Centers (BRCs)
• www.vectorbase.org
• www.eupathdb.org
• www.patricbrc.org
• www.viprbrc.org
• www.fludb.org

IRD Home Page: www.fludb.org
• Comprehensive collection of flu-related data and analysis tools
• Free use without restrictions
• Standardization and integration

Data in IRD: IRD Data Summary

GSC-BRC Metadata Working Group
• Collaboration between U.S. Genome Sequence Centers for Infectious Diseases and Bioinformatics Resource Centers
• What kind of data should be collected for a sequencing specimen?
• How should the information be represented?
• Decisions driven by usage

Specimen Isolation
[Diagram: ontology model of specimen isolation metadata. Entities shown include the specimen (ID, specimen type), the specimen source organism (ID, gender, age, health status, pathogenic disposition, organism part), environmental material and environment, temporal interval (date/time), spatial region (GPS location, geographic location), the specimen isolation procedure (type, isolation protocol, equipment in a specimen capture role), the specimen collector (person, affiliation name), the hypothesis, and IRB/IACUC approval. Entities are linked by relations such as denotes, has_part, has_quality, located_in, has_input, has_output, instance_of, is_about, has_specification, has_authorization, and has_affiliation. Core sample fields annotated: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
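As a rough illustration of how the specimen isolation metadata named in this diagram could be held in a single record, here is a minimal Python sketch. The class name, field names, and sample values are assumptions made for illustration; they are not the GSC-BRC field definitions.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SpecimenRecord:
    """Hypothetical record; field names are illustrative, not the GSC-BRC standard."""
    specimen_id: str
    collection_date: str                      # ISO 8601 date string
    geographic_location: str
    specimen_type: str
    host_organism: Optional[str] = None       # None for environmental specimens
    host_gender: Optional[str] = None
    host_age_years: Optional[float] = None
    host_health_status: Optional[str] = None
    gps_latitude: Optional[float] = None
    gps_longitude: Optional[float] = None
    collector_name: Optional[str] = None
    collector_affiliation: Optional[str] = None
    isolation_protocol: Optional[str] = None

    def missing_fields(self):
        """Return the names of optional fields that were left empty."""
        return [name for name, value in asdict(self).items() if value is None]


# Made-up example values, for illustration only.
record = SpecimenRecord(
    specimen_id="SPEC-0001",
    collection_date="2013-01-15",
    geographic_location="USA: California",
    specimen_type="nasal swab",
    host_organism="Homo sapiens",
    host_age_years=34.0,
)
print(record.missing_fields())
```

Listing the empty fields is one simple way to check how complete a submitted record is before it is integrated with sequence data.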

Metadata Processes
[Diagram: ontology model of the metadata for each process stage: Investigation, Specimen Isolation, Material Processing, Quality Assessment, Sequencing Assay, and Data Processing. Entities shown include the specimen source (organism or environmental), specimen isolation process (isolation protocol, specimen collector), specimen (type, ID), sample processing, enriched NA sample, microorganism and microorganism genomic NA, quality assessment assay and qualities, sequencing assay (input sample, reagents, technician, equipment), primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, and sequence data record (GenBank ID). Processes are located in temporal-spatial regions and are linked to materials and data by has_input and has_output relations.]

Data Standards Dugan V, et al. PLOS One 2014, submitted.

Genetic Drift and Escape from Protective Immunity
Can we monitor influenza genetic drift and predict when a new variant has escaped protective immunity?

Evolutionary drivers
Viruses experience two main drivers of evolution:
• Functional constraint (purifying selection): selection against deleterious amino acid substitutions in order to maintain important structural and functional elements
• Immune pressure (diversifying selection): selection for amino acid mutations that result in viruses that evade pre-existing immunity and have other characteristics of enhanced fitness

Selective Pressures on HA
• Hemagglutinin (HA) protein is:
  • Responsible for virus attachment and entry into the host cell
  • A major antigenic component of the virus
• If we can determine which regions of HA are targets of protective immunity, we can monitor genetic drift in those regions to predict escape.
• Regions undergoing diversifying selection as HA naturally evolves would correspond to the relevant epitopes for protective immunity
• This information could be used to help predict when new vaccine strains are warranted

Approach
• Map all experimentally defined immune epitopes on the H1 HA protein (pre-pandemic HA)
• Identify sites that have experienced diversifying selection in pre-pandemic H1N1 strains and use them to select immune epitopes likely to be targets of protective immunity (pre-pandemic HA)
• Determine whether these regions are being targeted for mutation during the ongoing evolution of the pandemic H1N1 lineage (pandemic HA)

B-cell Epitopes from Immune Epitope Database (IEDB)

Identifying Sites Experiencing Diversifying Selection
Selection pressure estimated using Fast Unconstrained Bayesian Approximation (FUBAR): Murrell B, et al. (2013) Mol. Biol. Evol. 30(5):1196–1205
• dN: rate of non-synonymous substitutions
• dS: rate of synonymous substitutions
• Non-synonymous substitution: CTA (Leu) → CCA (Pro)
• Synonymous substitution: CTA (Leu) → CTG (Leu)
• The non-synonymous and synonymous rates are estimated for each site by calculating the posterior probability Prob(dN_site, dS_site | Data_site, Tree, Codon Substitution Rate, Codon Freq).
• A site is considered to be under diversifying selection if (dN/dS)_observed > (dN/dS)_expected with a Bayesian score > 0.9.
• Calculated using all H1 NA sequences prior to the 2009 pandemic (pre-pandemic): 2105 full length HA protein sequences
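To make the synonymous versus non-synonymous distinction concrete, here is a minimal Python sketch that classifies a single codon change using the standard genetic code. It only illustrates the definitions of the two substitution types; it is not FUBAR and does not estimate dN or dS. The function name is an assumption made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before, codon_after):
    """Label a codon change as identical, synonymous, or non-synonymous."""
    before = codon_before.upper()
    after = codon_after.upper()
    if before == after:
        return "identical"
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    return "non-synonymous"


# The two examples from the slide.
print(classify_substitution("CTA", "CCA"))  # Leu to Pro
print(classify_substitution("CTA", "CTG"))  # Leu to Leu
```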

Sites Experiencing Diversifying Selection
Found 7 sites experiencing diversifying selection in pre-pandemic H1 HA (Bayesian score threshold = 0.9): 172, 177, 179, 203, 204, 278, 468

B-cell Epitopes with Diversified Sites
[Figure: diversifying sites 172, 177, 179, 203, 204, 278, and 468 mapped onto B-cell epitopes; p = .02]

Relevant B-cell Epitopes
[Figure: sites 172, 177, 179, 203, 204, 278, and 468 shown with the Sa and Sb antigenic sites (Caton et al. 1982)]
• 5/7 diversifying sites correspond to two well characterized B cell/antibody epitopes that may be targets of protective immunity
• 2/7 sites do not correspond to any previously characterized B cell/antibody epitope
• Highlight “evolutionary regions of interest”

Test Predictions on Pandemic Drift
• Meta-CATS (Pickett BE, et al. (2013) Virology, 447:45-51) is a statistical tool that uses a chi-squared statistic to determine whether the nucleotide or amino acid residues at each position in a multiple sequence alignment differ significantly between groups of sequences
• Group 1 (Early Pandemic Isolates):
  • Original outbreak sequences (21 earliest 2009 pandemic North American sequences)
• Group 2 (Late Pandemic Isolates):
  • California 12-13 and 13-14 season (15 sequences)
  • Florida 12-13 and 13-14 season (21 sequences)
  • New York 12-13 and 13-14 season (13 sequences)
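The kind of per-position comparison described above can be sketched as a Pearson chi-squared statistic on the residue counts of two groups at one alignment column. This is a minimal Python illustration under stated assumptions, not the Meta-CATS implementation: the function name is hypothetical, and the residues in the example are invented (only the group sizes, 21 and 15, come from the slide).

```python
from collections import Counter


def chi_squared_for_column(group1, group2):
    """Pearson chi-squared statistic and degrees of freedom for one alignment column.

    group1 and group2 are sequences of residues observed at the same position,
    one residue per sequence in each group.
    """
    counts = [Counter(group1), Counter(group2)]
    residues = sorted(set(counts[0]) | set(counts[1]))
    row_totals = [len(group1), len(group2)]
    total = sum(row_totals)
    # With a single residue type or an empty group there is nothing to compare.
    if len(residues) < 2 or min(row_totals) == 0:
        return 0.0, 0
    statistic = 0.0
    for row, row_total in zip(counts, row_totals):
        for residue in residues:
            column_total = counts[0][residue] + counts[1][residue]
            expected = row_total * column_total / total
            statistic += (row[residue] - expected) ** 2 / expected
    # Two groups, so degrees of freedom = (2 - 1) * (number of residue types - 1).
    return statistic, len(residues) - 1


# Invented residues: every early isolate carries K, every late isolate carries Q.
early = ["K"] * 21
late = ["Q"] * 15
statistic, dof = chi_squared_for_column(early, late)
print(statistic, dof)
```

Turning the statistic into a p-value requires the chi-squared distribution with the returned degrees of freedom, for example via `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available.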

Meta-CATS Results (California)
• Group 1: Early pandemic
• Group 2: Late CA pandemic (season 12-13 and 13-14)

Results
[Figure labels: Sa, Sb, T-cell]

Test Relevant B-cell Epitopes
[Figure: positions 114, 172, 177, 179, 180, 202, 203, 204, 220, 251, 273, 278, 300, 391, and 468 labeled, with antigenic sites Sa and Sb]

Tree Analysis
[Figure: trees for positions 114, 180, 273, 300, and 516 across flu seasons 08-09 to 13-14. Legend: dominant residue in outbreak strains; dominant residue in late pandemic strains; remaining amino acids]

Tree Analysis
[Figure: trees for positions 202, 220, 391, and 468 across flu seasons 08-09 to 13-14. Legend: dominant residue in outbreak strains; dominant residue in late pandemic strains; remaining amino acids]

Tree Analysis Summary
[Figure: tree spanning flu seasons 08-09 to 13-14, annotated with the substitutions A273T, K180Q, K300E, E516K, D114N (shown twice), S468N, S202T, E391K, and S220T]
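The labels in this summary use the common substitution shorthand: original residue, position, new residue. A minimal Python sketch for parsing the labels listed on this slide follows; the function and variable names are assumptions made for illustration.

```python
import re

# One-letter residue, numeric position, one-letter residue (for example "A273T").
SUBSTITUTION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label):
    """Split a label such as 'A273T' into (original, position, replacement)."""
    match = SUBSTITUTION_PATTERN.match(label.strip())
    if match is None:
        raise ValueError("not a substitution label: %r" % label)
    original, position, replacement = match.groups()
    return original, int(position), replacement


# Labels as they appear on the slide (D114N listed once).
labels = ["A273T", "K180Q", "K300E", "E516K", "D114N",
          "S468N", "S202T", "E391K", "S220T"]

# Sort by HA position so the substitutions can be compared with site lists.
by_position = sorted((parse_substitution(label) for label in labels),
                     key=lambda item: item[1])
for original, position, replacement in by_position:
    print(position, original, "to", replacement)
```

Once parsed, the positions can be compared directly with the sites found to be under diversifying selection.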

Big Data to Knowledge
Volume + Variety = Value
Variety = Metadata
Data + Metadata + Integration + Interpretation = Knowledge

Big Data for Vaccine Selection
• Large scale statistical genomic analysis can identify sites experiencing diversifying selection
• Help determine how much sequence data is needed
• When integrated with immune epitope data, could pinpoint those regions important for protective immunity and predict relevant antigenic drift
• Natural experiment to identify correlates of protective immunity
• Monitoring genetic drift in these regions could augment approaches like antigenic cartography/landscape analysis to determine when vaccine candidates should be adjusted

• U.T. Southwestern/JCVI: Richard Scheuermann (PI), Burke Squires, Jyothi Noronha, Alex Lee, Brian Aevermann, Brett Pickett, Yun Zhang
• MSSM: Adolfo Garcia-Sastre, Eric Bortz, Gina Conenello, Peter Palese
• Vecna: Chris Larsen, Al Ramsey
• LANL: Catherine Macken, Mira Dimitrijevic
• U.C. Davis: Nicole Baumgarth
• Northrop Grumman: Ed Klem, Mike Atassi, Kevin Biersack, Jon Dietrich, Wenjie Hua, Wei Jen, Sanjeev Kumar, Xiaomei Li, Zaigang Liu, Jason Lucas, Michelle Lu, Bruce Quesenberry, Barbara Rotchford, Hongbo Su, Bryan Walters, Jianjun Wang, Sam Zaremba, Liwei Zhou, Zhiping Gu

Acknowledgments
• IRD SWG
  • Gillian Air, OMRF
  • Carol Cardona, Univ. Minnesota
  • Adolfo Garcia-Sastre, Mt Sinai
  • Elodie Ghedin, Univ. Pittsburgh
  • Martha Nelson, Fogarty
  • Daniel Perez, Univ. Maryland
  • Gavin Smith, Duke Singapore
  • David Spiro, JCVI
  • Dave Stallknecht, Univ. Georgia
  • David Topham, Rochester
  • Richard Webby, St Jude
• USDA
  • David Suarez
• Sage Analytica
  • Robert Taylor
  • Lone Simonsen
• CEIRS Centers

N01AI40041, HHSN272201200005C
