BioPortal: Disease and Bioagent Information Sharing, Surveillance, Analysis, and Visualization • Research Team: University of Arizona; University of California, Davis; Kansas State University; Arizona Department of Public Health; University of Utah; New York State Department of Health/HRI; California Department of Health Services/PHFE; U.S. Geological Survey; The SIMI Group • Acknowledgements: NSF, ITIC, DHS, DOD/AFMIC, IDIWC, AZDPS
Research Partners and Supporters • University of Arizona; University of California, Davis; Kansas State University; University of Utah; Arizona Department of Public Health; New York State Department of Health/HRI; California Department of Health Services/PHFE; U.S. Geological Survey; The SIMI Group • NSF; CIA/ITIC; DHS; DOD/AFMIC; CDC; AZDPS
UA Team Members • Dr. Hsinchun Chen, Dr. Daniel Zeng, Lu Tseng, Cathy Larson, Kira Joslin, Wei Chang, James Ma, Hsinmin Lu, Ping Yan, Aaron Sun, Keith Alcock, Sapna Brahmanandam, Milind Chabbi, Yuan Wang
Outline • Project Background • BioPortal V1.0 Achievements • System Architecture • System Functionalities • BioPortal Collaboration Framework • New Developments • International Foot-and-Mouth Disease Monitoring • Syndromic Surveillance • Livestock Health Surveillance
BioPortal Background Acknowledgment: NSF, ITIC, NYSDH, CDHS, USGS (Drs. Kvach and Ascher)
Background (I) • In September, 2002, representatives of 18 different agencies, including DOD, DOE, DOJ, DHS, NIH/NLM, CDC, CIA, NSF, and NASA, were convened to discuss “disease surveillance.” • An interagency working group called Disease Informatics Senior Coordinating Committee (DISCC) was established. • DISCC established an Infectious Disease Informatics Working Committee (IDIWC) to survey the field and identify gaps. • IDIWC developed “requirements” for a National Infectious Disease Informatics Infrastructure (NIDII).
Background (II) • In June, 2003, IDIWC was charged with the task of developing one or more rapid prototype systems to demonstrate interoperability and innovation across species and jurisdictions. • Botulism and West Nile virus were selected as diseases. • States of New York and California were selected as partners. • The University of Arizona was chosen as integrator and was provided with a supplement to an existing NSF grant.
BioPortal Project Goals • Demonstrate and assess the technical feasibility and scalability of an infectious disease information sharing (across species and jurisdictions), alerting, and analysis framework. • Develop and assess advanced data mining and visualization techniques for infectious disease data analysis and predictive modeling. • Identify important technical and policy-related challenges in developing a national infectious disease information infrastructure.
BioPortal V1.0 Accomplishments • Prototype system design and development • Initial design and implementation of interoperable messaging backbones • Live prototype systems • Preliminary user evaluation • Information sharing • Data sharing agreements/memoranda of understanding (MOUs) developed • Many disease datasets integrated into the portal • Analysis and visualization • Hotspot analysis research • Spatial-Temporal Visualizer (STV)
Information Sharing Infrastructure Design [Architecture diagram. Components shown: data providers (NYSDOH, CADHS, new sources); PHINMS network and XML/HL7 network, secured with SSL/RSA; one adaptor per source in the info-sharing infrastructure; Data Ingest Control Module with cleansing/normalization; Portal Data Store (MS SQL 2000).]
Data Access Infrastructure Design [Architecture diagram. Components shown: users (public health professionals, researchers, policy makers, law enforcement agencies, and other users); browser (IE/Mozilla/…) over an SSL connection; web server (Tomcat 4.21 / Struts 1.2) hosting the WNV-BOT Portal with modules for spatial-temporal visualization, analysis/prediction, HAN or personal alert management, dataset privileges management, and data search and query; access privilege definitions; user access control API (Java); data store (MS SQL 2000).]
BioPortal Collaboration Framework • A Memorandum of Understanding (MOU) is used to document the relationship between parties that will be sharing data: • Who the entities are and how they will act independently and cooperatively • What the mutual interests, benefits, and purposes of sharing data are • How each party will maintain control over and share their resources, and what each party shall provide to the other (e.g., system accounts, portal access) • Which types of data are to be shared (e.g., dead bird surveillance)
Summary of MOU • Confidentiality • Data is not to be shared outside of the project. • Data is to be returned or destroyed after 5 years. • Ownership • Original data is owned by providers. • Data analysis is jointly owned. • Scope • Specific diseases are listed. • Additional diseases can be added. • Parties agree separately on which data elements can be shared (e.g., species, gender, etc.) • Purpose • Data may be used for system development, for example.
Communications/Messaging • Scalable, flexible, light-weight, and extendible. Easy to include: • New diseases • New jurisdictions • New techniques! • Messaging infrastructure – installed and tested • NYSDOH-UA: PHIN MS • CADHS-UA: Regional message broker • NWHC-UA: PHIN MS • XML generation/conversion • NY_DeadBird, NY_Alerts, NY_BotHuman, NY_WNVHuman, NY_CaptiveAnimal, NY_Mosquito • CA_BotHuman, CA_WNVHuman, CA_DeadBird, CA_Chicken, CA_Mosquito • USGS_Epizoo
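The XML generation step can be illustrated with a minimal sketch. The element and attribute names (`Message`, `Field`, `dataset`, `name`) and the sample record are hypothetical; the slides do not show the actual XML/HL7 schema or the PHIN MS payload format, so this only illustrates converting one provider record into an XML message tagged with its dataset name.

```python
import xml.etree.ElementTree as ET


def record_to_xml(dataset: str, record: dict) -> str:
    """Serialize one surveillance record as a small XML message.

    The element names here are illustrative assumptions, not the
    schema used by BioPortal or its data providers.
    """
    root = ET.Element("Message", attrib={"dataset": dataset})
    # Sort the fields so the output is deterministic.
    for field, value in sorted(record.items()):
        child = ET.SubElement(root, "Field", attrib={"name": field})
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up record for a dataset name taken from the list above.
message = record_to_xml("NY_DeadBird", {"county": "Albany", "species": "crow"})
print(message)
```

Keeping the dataset name as an attribute on the root element means a new disease or jurisdiction only needs a new dataset label, not a new message shape.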
BioPortal Research Framework • BioPortal – Demo: Develop the system for demonstration purposes using scrubbed data. Refine system functionality and performance based on user feedback. • BioPortal – Operation: Develop the system for production mode with real data and real users. • BioPortal – Research: Continue to develop advanced technologies and practical sharing policies. Expand to new diseases and jurisdictions.
Spatio-Temporal Data Mining & Hotspot Analysis • A hotspot is a condition indicating some form of clustering in a spatial and temporal distribution (Rogerson & Sun 2001; Theophilides et al. 2003; Patil & Tailie 2004). • For WNV, localized clusters of dead birds typically identify high-risk disease areas (Gotham et al. 2001). • Automatic detection of dead bird clusters using hotspot analysis can help predict disease outbreaks and aid in effective allocation of prevention/control resources.
Existing Hotspot Analysis Approach: SaTScan • The spatial scan statistical techniques implemented in SaTScan are widely used to detect and evaluate disease outbreaks (Kulldorff 2001). • NYSDOH has used SaTScan to develop an early warning system for WNV (Gotham et al. 2001). • An important factor considered by spatial scan statistical analysis is the baseline. • The significance of the density of dead birds depends on the historical distribution of bird deaths, human population, and so on.
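The role of the baseline in a spatial scan can be sketched as follows. This is a simplified illustration of a Kulldorff-style Poisson log-likelihood ratio over circular zones, not SaTScan's implementation: it assumes expected counts are proportional to baseline events inside each circle, uses only case locations as circle centers, and omits the Monte Carlo significance test. The function names and parameters are hypothetical.

```python
import math


def poisson_llr(c: float, e: float, total: float) -> float:
    """Log-likelihood ratio for a zone with c observed and e expected cases."""
    # Only zones with an excess of cases over expectation are scored.
    if e <= 0 or e >= total or c <= e:
        return 0.0
    inside = c * math.log(c / e)
    # When every case is inside the zone, the outside term vanishes.
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside


def best_circle(cases, baseline, radii):
    """Return (score, center, radius) of the highest-scoring circular zone."""
    total_cases, total_baseline = len(cases), len(baseline)
    best = (0.0, None, None)
    for center in cases:
        for r in radii:
            c = sum(math.dist(center, p) <= r for p in cases)
            b = sum(math.dist(center, p) <= r for p in baseline)
            # Expected cases if new cases followed the baseline distribution.
            # Zones with no baseline events get e == 0 and are skipped.
            e = total_cases * b / total_baseline
            score = poisson_llr(c, e, total_cases)
            if score > best[0]:
                best = (score, center, r)
    return best
```

With this scoring, a dense group of new cases only stands out if the baseline in the same area is comparatively sparse, which is the point made in the bullets above.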
Other Hotspot Analysis Approaches: CrimeStat and RSVC • Hotspot analysis techniques applied to crime analysis: CrimeStat (Levine 2002). • CrimeStat’s Risk-Adjusted Nearest Neighbor Hierarchical Clustering (RNNH): Uses a kernel density estimation obtained from baseline data to adjust the threshold that controls whether data points can be grouped together. • Risk-Adjusted Support Vector Machine Clustering (RSVC): It combines the power and flexibility of support vector machine-based clustering and the risk adjustment idea of RNNH.
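The risk-adjustment idea behind RNNH can be sketched under simplifying assumptions. The code below is not CrimeStat's implementation: it assumes a Gaussian kernel, evaluates the baseline density at the midpoint of a pair of case points, and takes the grouping threshold to be the expected nearest-neighbor distance of a random pattern with that density, 0.5 / sqrt(density). The function names and the bandwidth parameter are hypothetical.

```python
import math


def kernel_density(point, baseline, bandwidth):
    """Gaussian kernel estimate of baseline events per unit area at `point`."""
    norm = 2 * math.pi * bandwidth ** 2
    return sum(
        math.exp(-math.dist(point, b) ** 2 / (2 * bandwidth ** 2)) / norm
        for b in baseline
    )


def risk_adjusted_threshold(point, baseline, bandwidth):
    """Distance below which two case points count as unusually close."""
    density = kernel_density(point, baseline, bandwidth)
    # Dense baseline -> short expected neighbor distance -> stricter threshold.
    return 0.5 / math.sqrt(density) if density > 0 else math.inf


def can_link(a, b, baseline, bandwidth):
    """True if case points a and b are closer than the local threshold."""
    midpoint = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return math.dist(a, b) <= risk_adjusted_threshold(midpoint, baseline, bandwidth)
```

Two case points the same distance apart are grouped in an area with little baseline activity but not in an area where the baseline is already dense.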
Case Study (NY WNV) • On May 26, 2002, the first dead bird with WNV was found in NY. • Based on NY’s test dataset. [Timeline figure: marks at March 5, May 26, and July 2; periods labeled "baseline" and "new cases"; record counts shown: 140 records and 224 records.]
Analysis Results from SaTScan and RNNH (NY dead bird data, 2002) • SaTScan picks a large cluster: 71 new cases, 7 baseline cases. • RNNH picks a small cluster: 53 new cases, 6 baseline cases. [Figure: hotspot analysis results showing SaTScan #1, SaTScan #2, and the RNNH hotspot; zoomed-in panels of the high-density area show baseline cases alone and baseline plus new cases.]
Hotspot Analysis Findings • RSVC delivers similar recall levels and higher precision than SaTScan. • RNNH matches RSVC precision, but has very low recall. • RSVC significantly outperforms other methods in the F-measure. • Techniques could be complementary for different hotspot analysis tasks.
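For reference, the three measures compared above can be computed as in the sketch below. It assumes the evaluation unit is a set of case identifiers (those inside the detected hotspots versus those inside the true hotspots); the slides do not state how the comparison was scored, so that choice is an assumption.

```python
def precision_recall_f(found: set, true: set):
    """Return (precision, recall, F-measure) for detected vs. true cases."""
    true_positives = len(found & true)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Because the F-measure is a harmonic mean, a method with high precision but very low recall still scores low on it.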
Spatial-Temporal Visualization • Integrates four visualization techniques • GIS View • Periodic Pattern View • Timeline View • Central Time Slider • Visualizes the events in multiple dimensions to identify hidden patterns • Spatial • Temporal • Hotspot analysis • Phylogenetic (planned)
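The aggregation behind a periodic pattern view can be sketched as a count of events per calendar month, pooled across years. This is an assumption about how such a view might be fed; the slides do not describe STV's internals, and the function name is hypothetical.

```python
from collections import Counter
from datetime import date


def periodic_pattern(event_dates):
    """Count events per calendar month (January..December) across all years."""
    counts = Counter(d.month for d in event_dates)
    return [counts.get(month, 0) for month in range(1, 13)]
```

Pooling by month makes a seasonal concentration visible even when the individual years differ in total volume.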
[Screenshot walkthrough. Callouts: user login; user main page with the available dataset list; choose WNV disease data; select CA dead bird, CA chicken, and NY dead bird data; advanced search criteria (dataset name; spatial/temporal criteria such as county/state and time range; positive cases; bird species); results listed in a table; select background maps (NY/CA population, rivers and lakes); start STV.]
[STV screenshots: control panel with GIS, Timeline, and Periodic Pattern views. Callouts: spatial distribution pattern; NY dead bird temporal distribution pattern; zoom in on NY; view all 3 years of data; 1-year window in a 3-year span; year 2001 data; move time slider to year 2 and year 3 (similar time pattern); overall pattern concentrated in May/June; 2-week window.]
[STV screenshots: spatial distribution pattern over time. Callouts: move time slider; dead bird cases migrate from Long Island into upstate NY; season end; enable and overlay population map; dead bird cases distribute along populated areas near the Hudson River.]
BioPortal HotSpot Analysis: RSVC, SaTScan, and CrimeStat Integrated (first visual, real-time hotspot analysis system for disease surveillance) • West Nile virus in California
Hotspot Analysis-Enabled STV [Screenshot callouts: regular STV; select target geographic area; select baseline and case periods; select algorithms; hotspots found; select a hotspot to highlight its case points.]
BioPortal New Developments • NSF Infectious Disease Informatics Grant (2004-9) • International Foot-and-Mouth Disease BioPortal (2005-6); FMD Lab, UC Davis • Human Syndromic Surveillance System; Arizona State Department of Health (2005-6) • Livestock Syndromic Surveillance System; Kansas State University RSVP-A (2005-6)
New Research Directions • Analytical Algorithms • Prospective hotspot analysis & auto baseline discovery • Spatial-Temporal correlation analysis • Dynamic Network Analysis • Visualization • International FMD news visualization • Phylogenetic Spatial-Temporal visualization • Syndromic Surveillance • Syndromic surveillance system survey • Emergency room chief complaint syndromic classification • Livestock syndromic surveillance
Extended BioPortal Research Framework • BioPortal – Demo • BioPortal – Operation • BioPortal – Research • FMD – BioPortal: A dedicated instance of BioPortal customized for international foot-and-mouth disease monitoring. Additional functionalities such as gene sequence analysis and FMD News are added. • BioPortal – Syndromic Surveillance: A specialized BioPortal instance that processes chief complaints using a hybrid method of ontology and knowledge rules. • BioPortal – Livestock: A BioPortal instance devoted to livestock syndromic surveillance case management and data analysis.
International FMD BioPortal Acknowledgment: DHS, DOD, UC Davis (Drs. Thurmond and Lynch)
Introduction • Foot-and-mouth disease (FMD), which can infect all cloven-hoofed animals, is the top disease on the Office International des Epizooties (OIE) List A. • FMD is among the most contagious infectious diseases of livestock animals: • Massive shedding of virus and contamination of the environment. • Transmitted by direct or indirect contact (droplets), animate vectors (humans), and inanimate vectors (vehicles). • Serologically diverse, with seven distinct types (A, O, C, SAT1, SAT2, SAT3, Asia1), which makes diagnosis and vaccination problematic and genetic diversity likely. • Endemic in Africa, Asia, the Middle East, and South America. • Potential cost for U.S. outbreaks: >$10 billion. • Broader economic impact: trade and travel restrictions.
International FMD BioPortal Goals • Real-time, web-based situational awareness of FMD outbreaks worldwide through the establishment of an international information sharing and analysis system • FMDv characterization at the genomic level integrated with associated epidemiological information and modeling tools to forecast national, regional, and/or international spread and the prospect of import into the U.S. and the rest of North America • Web-based crisis management of resources—facilities, personnel, diagnostics, and therapeutics
Research Plans • Global FMD epidemiological data • (Near) real-time data collection • Web-based information sharing and analysis • International FMD news • Indexed collection of global FMD news • Search and visualization of the FMD news via the web • FMD genetic/sequence data • Predictive model using phylogenetic, spatial, and temporal information to stop FMD at the border • Visualization of FMD events in time, space, and genetic space
Preliminary Global FMD Dataset • Provider: UC Davis FMD Lab • Information sources: reference labs and OIE • Coverage: 28 countries globally • Time span: May, 1905 – March, 2005 • Dataset size: 30,000+ records of which 6789 records are complete • Host species: Cattle, Caprine, Ovine, Bovine, Swine, NK, Elephant, Buffalo, Sheep, Camelidae, Goat
FMD Migration Visualization using BioPortal (cases in South Asia) • FMD cases travel back and forth between countries
International FMD News • Provider: UC Davis FMD Lab • Information sources: Google, Yahoo, and open Internet sources • Time span: Oct 4, 2004 – present (real-time messaging under development) • Data size: 460 events (6/21/05) • Coverage: 51 countries (Africa:11, Asia:16, Europe:12, Americas:12)
Searching FMD News • http://fmd.ucdavis.edu/ • Searchable by • Date range • Country • Keyword
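The three search criteria can be combined as in the following sketch. The `NewsItem` record and the `search_news` function are hypothetical illustrations; they are not the interface of the site listed above, whose implementation the slides do not describe.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class NewsItem:
    published: date
    country: str
    text: str


def search_news(
    items: Iterable[NewsItem],
    start: Optional[date] = None,
    end: Optional[date] = None,
    country: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[NewsItem]:
    """Return items matching every criterion that was supplied."""
    results = []
    for item in items:
        if start is not None and item.published < start:
            continue
        if end is not None and item.published > end:
            continue
        # Country and keyword matching are case-insensitive.
        if country is not None and item.country.lower() != country.lower():
            continue
        if keyword is not None and keyword.lower() not in item.text.lower():
            continue
        results.append(item)
    return results
```

Leaving a criterion as `None` skips it, so a date-only or keyword-only search uses the same function.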
FMD Genetic Information Analysis • Genome clustering analysis • Phylogenetic clustering • Spatial clustering • Temporal clustering • Hotspot detection among gene sequences • Create a tree structure based on semantic distance between gene sequences. • Automatically detect the dense portion of the tree. • Identify the connection between the semantic cluster and the geographic pattern of gene sequences.
FMD Genetic Visualization • Goal: Extend STV to incorporate a third dimension, phylogenetic distance • Include a phylogenetic tree. • Identify phylogenetic groups and color-code the isolate points on the map. • Leverage available NCBI tools such as BLAST. • Proof of concept: SAT 2 & 3 analysis • Data: 54 partial DNA sequence records in South Africa received from UC Davis FMD Lab (Bastos, A.D. et al. 2000, 2003) • Date range: 1978-1998 • Countries covered: South Africa, Zimbabwe, Zambia, Namibia, Botswana
Sample FMD Sequence Records • Color-coded View (MEGA3) • Textual View of Gene Sequence
Interactive Phylogenetic Tree • Color coding shows similarity of sequences • User-adjustable grouping threshold to change clusters
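How an adjustable grouping threshold changes the clusters can be sketched with a deliberately simple stand-in technique. The code below uses the proportion of differing sites between aligned sequences (p-distance) and single-linkage grouping; it is not the method used by MEGA3 or by the interactive tree, and the sequence names and function names are hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be aligned, non-empty, and equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def group_sequences(seqs: dict, threshold: float):
    """Single-linkage groups: link sequences whose distance <= threshold."""
    names = list(seqs)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Raising the threshold merges groups and lowering it splits them, which is the behavior a user-adjustable control would expose.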
Phylogenetic Tree of Sample FMD Data • Identify 6 groups within 2 major families (MEGA3; based on sequence similarity) [Tree figure labels: Group 1 through Group 6.]