Data Interoperability -the vision of seamless data-sharing in Health Informatics Jenny Ure School of Informatics Univ. of Edinburgh Health Informatics Resource: Data Interoperability
Health informatics increasingly relates to data-sharing - across sites, scales and formats ………… Wrapper Genes Adapted from www.fbirn.net
Data Interoperability allows • pooling of data across sites and scales for knowledge discovery (think Google Earth) • faster turnaround times in translational research from lab bench to bedside • more re-use of research outcomes and less fragmentation and duplication of work in the same disease domain
Proteins sequence 2º structure 3º structure DNA sequences alignments This can involve huge datasets… billions Protein-Protein Interactions metabolism pathways receptor-ligand 4º structure Physiology Cellular biology Biochemistry Neurobiology Endocrinology etc. Polymorphism and Variants genetic variants individual patients epidemiology millions millions Hundredthousands ESTs Expression patterns Large-scale screens Genetics and Maps Linkage Cytogenetic Clone-based MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT... billions ...atcgaattccaggcgtcacattctcaattcca... millions
e.g. linking genetic factors to clinical or scan data.. .. Though you all have to agree on how to name, code and format data sets, and how they relate to a disease!
Bridges Project www.brc.dcs.gla.ac.uk/projects/bridges/
Or large scale computational analysis http://www.clinical-escience.org/
However - combining different datasets may help create a bigger picture……but it may be the wrong one
The social life of information • As in the expression of genetic information - health information is also shaped by factors in the local environment Seely Brown and Duguid, 2000 ‘The Social Life of Information’ Harvard School Press
Recurring problem scenarios at different stages the human process 1.sampling 2.collecting 3.coding 4.cleaning 5.linkage 6.analysis 7.use the technical process
Harmonisation across multiple national biobanks such as P3G identified issues in • Different populations • Different environments • Different study designs • Different tools • Different populations • Different formats
30% Collection Errors ? • Missing of helpful data i.e. data that was almost certainly known but was not filled in • Incomplete data e.g. the patient ID being specified but not the issuer of the patient ID • Incorrect data e.g. the patient's name being entered as "brain" • Incorrectly formatted data e.g. a patient name being specified so that the surname is “CameronDavid”. • Data in the wrong field e.g the series being described as "knee“ • Inconsistent data within a single file, e.g. if the patient's age is inconsistent with image date minus birth date.
So what about data cleaning?: • A 46:36 waist/hip ratio reading – is it an input error or just a sample from West Lothian?
Other Strategies • Wireless notepads for data collection • Provenance metadata • Links to original data • Local QA/ethics/linkage committees • Error trawls and spot checks combined with error-trapping software
The myth of shared protocols. • Trace a line around the region of interest in all subjects • Compare differences in area across control and experimental grops
Harmonising different tools and platforms • Microarray • In situ hybridisation • Scanners
Adapted from Keator et al (2006) Presentation to the UK-BIRN workshop Different Disease Effects or Different Scanners? Harmonisation strategies?
Effect or Artefact? • Different equipment • Different populations • Different raters • Different contexts • Different protocols • Different coding • Different metadata
Designing for e-Health: Recurring Scenarios in Developing Grid-based Medical Imaging Systems • Conclusions • In organic communities, the processes of structuring collaboration, coordination and control structures happens as a matter of course. NeuroGrid is employing an early prototype to generate engagement and dialogue, to enable early discussion of requirements for more complex services, compute capability and workflows, as well as data quality and configurational issues. • In addition to ameliorating the recurring issue of requirements ‘creep’, late in the design process, it allows disparate groups to engage with the real issues, and possible solutions in a shared context. • Introduction • NeuroGRID www.neurogrid.ac.uk is a three-year, £2.1M project funded through the UK Medical Research Council to:- • develop a Grid-based research environment to facilitate the sharing of MR and CT scans of the brain and clinical patient data in the diagnosis of psychoses, dementia and stroke • bring together clinicians, researchers and e-scientists at Oxford, Edinburgh, Nottingham and London • create a toolset for image registration, analysis, normalisation, anonymisation, real-time acquisition and error trapping • ensure rapid, reliable and secure access, authentication and data sharing Data Quality Issues: The Social Life of Information Challenge: The large scale aggregation of diverse datasets offers both potential benefits and risks, particularly if the outputs are to be used with patients in a clinical context. Thus aggregating data is a key issue for e-Health, yet data is not independent of the context in which it is generated. Within small communities of practice a degree of shared and updated knowledge and experience allows judicious use of resources whose provenance is known and whose weaknesses are often already transparent. The same is not true of aggregated data from multiple sources. Approach: Early use of prototypes to provide a ‘sandpit’ for promoting both technical and inter-community dialogue and engagement, and start the process of identifying, sharing and updating knowledge of emerging issues. Early trials with known datasets aim to generate an awareness of the types of variance that can arise and ways in which it might be minimized, harmonized, or made transparent to users Socio-technical Issues Aligning Technical and Human Systems Challenge:Integrating the technical work of system building, with the socio-political work of generating the governance of the new risks and opportunities they generate Approach: The creation of real and virtual ‘shared spaces’ (e.g. via Access Grid) and the use of an early prototype for engagement in areas of shared professional concern, to help this new hybrid community develop its own rules of engagement, and start making collective sense of local requirements in relation to common project goals. • Semantic Issues Il nome della rosa • Challenge: • Multi-site studies raise issues such as different naming conventions for files, different coding and classification systems, different protocols, and different conceptualisations of domains. • Approach: • The project agreed on core and node specific metadata and will use an OWL-based ontology (logic-based domain map) to allow human and machine-readable searching and basic reasoning across the datasets. In this there is a trade-off between the benefits of share-ability and automated reasoning, on the one hand, and the formalisation of concepts and relationships that are evolving. • Challenge: • Aligning and representing datasets at different levels of granularity. While NeuroGrid uses MR and CT scans, other relevant datasets such as diffusion tensor imaging, genetic, proteomic datasets also contribute to an understanding of neurological processes. • Approach: • The project is adopting a two –pronged approach • developing task specific ontologies • developing a reference ontology based on the Foundational Model of Anatomy adopted by the BIRN • Human Brain Project. • This allows a degree of alignment between datasets and ontologies in future collaborations Acknowledgements The authors would like to acknowledge the support of the UK Medical Research Council (Grant Ref no: GO600623 ID number 77729), the UK e-Science programme and the NeuroGrid Consortium. Imaging Issues: Artefact or Actuality? Researchers use innovative imaging techniques to detect features that can refine a diagnosis, classify cases, track normal or often subtle physiological changes over time and improve understanding of the structural correlates of clinical features. Variance is attributable to a complex variety of procedures involved in image acquisition, transfer and storage, and it is crucial, but difficult, for true disease-related effects to be separated from those which are artifacts of the process For further information For information on this and related projects contact Jenny.Ure@ed.ac.uk or go to www.neurogrid.ac.uk Designing for e-Health: Recurring Scenarios in Developing Grid-based Medical Imaging Systems John Geddesa, Clare Mackaya, Sharon Lloydb, Andrew Simpsonb , David Powerb, Douglas Russellb, Marina Jirotkab, Mila Katzarovab, Martin Rossorc, Nick Foxc, Jonathon Fletcherc, Derek Hilld, Kate McLeishd, Yu Chend , Joseph V Hajnale, Stephen Lawrief, Dominic Jobf, Andrew McIntoshf, Joanna Wardlawg, Peter Sandercockg, Jeb Palmerg, Dave Perryg, Rob Procterh, Jenny Ureh,, Mark Hartswoodh, Roger Slackh, Alex Vossh, Kate Hoh, Philip Bathi, Wim Clarkei, Graham Watsoni aDepartment of Psychiatry, University of Oxford, bComputing Laboratory, University of Oxford, cInstitute of Neurology, University College London, dCentre for Medical Image Computing (MedIC), University College London, eImaging Sciences Department, Imperial College London, fDepartment of Psychiatry, University of Edinburgh, gDepartment of Clinical NeuroSciences, University of Edinburgh, hSchool of Informatics, University of Edinburgh, iInstitute of Neuroscience, University of Nottingham  Corresponding Author: Jenny Ure, School of Informatics, University of Edinburgh, Jenny.Ure@ed.ac.uk The concept of the collaboratory is central to the e-Science vision, yet there has been limited concern with the generation of the community and coordination infrastructures which will coordinate and sustain it. • Real or artefactual differences? • Different scanners • Different populations • Different raters • Different centres • Different protocols
…there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know.
Usability issues in data integration • across sites(horizontal) • across scales (vertical …think Google Earth
across time-scales DGEMapwww.dgemap.org HDBR http://www.hdbr.org EMAGEhttp://genex.hgu.mrc.ac.uk
How to agree a common spatio-temporal infrastructure for sharing data? Site 3 Site 2 Site 1 organs organs organs tissues tissues tissues cells cells cells Stage 1 Stage 2 Stage 3
Shared frames of reference for imaging data • Shared anatomical ‘map’ reference points • Somewhere to hang distributed data BIRN www.fbirn.net
so technical infrastructure needs community infrastructure to define.. • Shared spaces • Shared frames of reference • Shared tools • Shared naming conventions • Shared ethical and legal conventions • Shared costs and risks
Current examples of data curation communitiessuch as wikipediacan • Can achieve shared aims faster – re-use • Can create de facto standards • Can cut cost & risk • Can achieve critical mass for funding
Open Source projects in eHealth such as http://www.nbirn.net/ www.prg.org
Identifying risks and the opportunities at each stage the human process 1.sampling 2.collecting 3.coding 4.cleaning 5.linkage 6.analysis 7.use the technical process
The eHealth Vision of seamless data sharing? • Still more of a vision than a reality !