
  1. Using Clinical and Administrative Information Systems to Support the Research Enterprise: The Good, the Bad and the Ugly Mark Weiner, MD Institute for Translational Medicine and Therapeutics (ITMAT) Center for Clinical Epidemiology and Biostatistics (CCEB) Clinical Research Computing Unit (CRCU) Office of Human Research (OHR) University of Pennsylvania School of Medicine Philadelphia, PA 19104-6021

  2. In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purpose of comparison. If they could be obtained, they would enable us to decide many other questions besides the one alluded to. They would show the subscribers how their money was being spent, what good was really being done with it, or whether the money was not doing mischief rather than good. (Florence Nightingale, Notes on Hospitals)

  3. Framing Questions • Can data from clinical information systems be used • To enable comparisons of the relative effectiveness of competing therapies? • To evaluate risks of therapies after they reach the market? • To inform drug development towards new innovations where existing therapies are ineffective or risky? • To ensure earlier studies of effectiveness are still relevant given more recent drug developments? • To find interesting cohorts with characteristics that are especially relevant for genomic analysis?

  4. The Database underlying our Research Enterprise • Pennsylvania Integrated Clinical and Administrative Research Database: the PICARD System

  5. Data Sources - Billing • IDX – Professional charges for ambulatory and inpatient activity • 200 primary care and subspecialty practice sites • 1.5 million ambulatory visits/year (primary and subspecialty) among 449,000 patients • 602K visits/year to primary care practices among 222K patients • SMS – Facility charges for Admissions and ED encounters and hospital-based ambulatory procedures (ancillary tests, labs) • 36K admissions/year, 34K ED visits/year (HUP) • 13.5K admissions/year, 21.4K ED visits/year (PMC) • 25.5K admissions/year, 18K ED visits/year (PAH)

  6. Data Sources - Clinical • Cerner – Laboratory and pathology results, both inpatient and outpatient • 75 common chemistry, hematology and serology results, August 1997 – February 1999 • Since February 1999 – all labs with numerical results • Since 2001 – Microbiology results

  7. Data Volume • 400 GB storage on an Oracle 9i server • 1,843,922 patients (cumulative since 1997) • 25,297,970 visits (ambulatory encounters, nursing home visits, inpatient consultations) • 46,474,033 diagnoses assigned • 153,097,826 labs

  8. What is available? • Ambulatory Data • Primary and Subspecialty Data - Jan 1997-Present • Patient information • Location, Gender, Race, Birthdate, Insurance carrier • Scheduling Information • When was the visit scheduled? • Status of visit (arrived, cancelled, no show) • Visit information • Date, Location, Physician, Diagnoses, Procedures with charges and reimbursements

  9. What is available? • Inpatient data • Patient information • Admission detail – detailed data since FY2000 for HUP, Presbyterian and Pennsylvania Hospital • Admission and discharge dates, LOS, discharge disposition • DRG, Diagnoses • Major Procedures • Charges for minor procedures/room/ancillary services etc. • Medications

  10. Data Sources – Electronic Health Record • EPIC – Ambulatory Electronic Medical Record • In use at about 60 ambulatory care sites, 8 of which are primary care • Patient counts: • 184,000 patients with at least 1 visit (overall) • >100,000 patients seen within ambulatory care practices since 2001 • 72,900 primary care patients since Jan 2001 • 90,000 patients with at least 1 visit to any EPIC provider in past year • 55,000 patients with at least 1 visit to EPIC primary care practices in past year

  11. What is available? • Electronic Medical Record Data • Records patient history and physical exam as unstructured text • Linked to SQL Server database containing discrete components of EpicCare • Vital Signs • Medications Ordered • Social History (smoking, alcohol use) • Family History • Problem lists

  12. Work in Progress • Sunrise Clinical Manager – Order entry and nursing documentation from HUP and PMC inpatient settings • Recent conversion to “linkable” systems • Cardiology Data - Cath, EKG, Stress tests, Echo • Pulmonary Function Tests • Available, but need to draw discrete content from semi-structured reports • Pathology Data • Radiology results • Endoscopy/bronchoscopy results

  13. How Accurate is PICARD? • Accuracy = truth? • If PICARD says a patient has asthma, then the patient has asthma • Accuracy = true representation of the source data? • If PICARD says a patient has asthma, then the source data for the patient includes a code for asthma or other diagnostic testing results consistent with asthma

  14. How Accurate is PICARD? • We have worked to make PICARD a true representation of the underlying data, but several realities limit how well that data reflects the truth: • Physician-patient communication/misunderstandings • Busy doctors don't code/document everything • Idiosyncrasies of the clinical setting in which data are collected • Ambiguity inherent in the practice of medicine • Code creep – early codes assigned before a diagnosis of gallstones is confirmed may suggest simple abdominal pain • URI/bronchitis/tracheitis/pharyngitis/sinusitis all have similar symptoms. Adapted and expanded from O'Malley KJ, Cook KF, Price MD, et al. Measuring Diagnoses: ICD Code Accuracy. Health Services Research 2005;40:1620-39.

  15. Idiosyncrasies in Data • Research using PICARD must account for all of the realities inherent in the underlying data • Absence of evidence is not evidence of absence • Just because you don’t see evidence of a disease doesn’t mean the patient doesn’t have the disease • To find patients with a certain disease, you need to consider all the ways the disease may be represented in diagnosis codes and ancillary test results • The first instance of a disease in the database is not necessarily the time the disease first appeared
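
The point about considering every representation of a disease can be made concrete with a small sketch. The diagnosis code set, drug list, and lab threshold below are assumptions invented for the example, not definitions drawn from PICARD; the idea is simply that a phenotype rule should accept a diagnosis code, a characteristic medication, or a qualifying test result as evidence.

```python
# Illustrative phenotype rule: accept any of several representations of
# diabetes as evidence. The code set, drug list, and HbA1c cutoff are
# assumptions for this sketch, not PICARD definitions.

DIABETES_DX_CODES = {"250.00", "250.01", "250.02"}    # hypothetical ICD-9 codes
DIABETES_DRUGS = {"metformin", "glipizide", "insulin"}
HBA1C_THRESHOLD = 6.5                                  # percent, assumed cutoff


def meets_diabetes_phenotype(diagnoses, medications, labs):
    """Return True if any representation of diabetes appears in the record.

    diagnoses   -- iterable of ICD-9 code strings
    medications -- iterable of drug-name strings
    labs        -- iterable of (test_name, numeric_result) pairs
    """
    if any(code in DIABETES_DX_CODES for code in diagnoses):
        return True
    if any(drug.lower() in DIABETES_DRUGS for drug in medications):
        return True
    return any(name == "HbA1c" and value >= HBA1C_THRESHOLD
               for name, value in labs)


# No diagnosis code was ever billed, but the lab result still qualifies.
print(meets_diabetes_phenotype([], ["lisinopril"], [("HbA1c", 7.2)]))  # True
```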

  16. Addressing the Idiosyncrasies • Corrections for problems may increase the accuracy of the data, but make the analytical data set less generalizable • Requiring an echocardiogram to definitively rule in or rule out a diagnosis of CHF limits your cohort to people who were sick enough to require an echo – even among patients who turn out NOT to have CHF by echo • Incident cases can be approximated by limiting the cohort to patients who have existed in the system for some time without evidence of the disease of interest, and for whom a code for the disease then appears for the first time (see the sketch below).
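
A minimal sketch of that washout logic, assuming each patient is reduced to a first-contact date and the dates on which the code of interest was assigned; the one-year window is an arbitrary choice for illustration.

```python
from datetime import date, timedelta

# Washout rule for incident cases: require a minimum period of observation
# in the system before the first code for the disease appears. The window
# length is an arbitrary illustrative choice.
WASHOUT = timedelta(days=365)


def is_incident_case(first_contact, coded_dates):
    """first_contact -- date the patient first appears in the database
    coded_dates   -- dates on which the disease code of interest was assigned
    """
    if not coded_dates:
        return False                                   # never coded at all
    return min(coded_dates) - first_contact >= WASHOUT


print(is_incident_case(date(2001, 1, 5), [date(2003, 6, 1)]))  # True
print(is_incident_case(date(2003, 5, 1), [date(2003, 6, 1)]))  # False: too little prior history
```

This mirrors the echocardiogram trade-off on the same slide: the longer the required washout, the more confident you can be that a case is truly incident, and the smaller and less generalizable the resulting cohort becomes.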

  17. The good… • Discrete data enables searching for reasonably objective clinical information that can refine coarse billing diagnoses. • Not all patients with Hypertension are equally hypertensive • Not all patients with Hypertension are treated as aggressively or require the same aggressive treatment • More homogeneous cohort specification, or at least better ability to recognize and adjust for imbalance of clinical characteristics. • Better assessment of differences in care, resource utilization and outcomes – Clinical trial simulation • More contextual data leads to more rational attribution of cause and effect
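
As a small illustration of how discrete data can separate patients who share the same coarse billing diagnosis, the sketch below buckets coded hypertensives by their recorded systolic pressures; the cutoffs and the toy cohort are assumptions made for the example, not PICARD content.

```python
# Patients who share a billing code for hypertension can look very different
# once discrete vital signs are considered. The severity cutoffs below are
# assumed for this sketch.

def classify_hypertension(systolic_readings):
    """Crude severity bucket based on a patient's recorded systolic pressures."""
    if not systolic_readings:
        return "coded, but no BP recorded"
    worst = max(systolic_readings)
    if worst >= 180:
        return "severe"
    if worst >= 160:
        return "moderate"
    if worst >= 140:
        return "mild"
    return "coded, but controlled"


cohort = {
    "patient_a": [132, 128, 138],   # coded hypertensive, pressures controlled
    "patient_b": [172, 165],        # same billing code, clearly not controlled
    "patient_c": [],                # same billing code, no discrete readings at all
}
for patient, readings in cohort.items():
    print(patient, "->", classify_hypertension(readings))
```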

  18. The (currently) bad … • Still a great deal of data from which it is difficult to pull discrete concepts (EKGs, echos, pathology?) • Not all discrete concepts are as accurate as we’d like • Not all discrete concepts mean what we think they mean! • Even if discrete concepts were extractable, it is difficult to resolve pervasive conflicts in concepts within clinical charts for a single patient across different providers and notes • How to deal with uncaptured clinical care data from unaffiliated health care settings? • Corollary: How do you know whether the first instance of a condition in the chart is truly an incident occurrence?

  19. The (potentially) ugly • Even if you were able to derive discrete concepts from unstructured text, you still need to account for the idiosyncrasies of clinical care as opposed to the research setting • Patients often seek medical care when they are not feeling well, rather than being seen at regular intervals per protocol • Diagnostic testing and interventions are usually provided for a specific reason, not in a randomized fashion • Doctors are busy and don’t record every piece of information every time • How can we express/account for ambiguity? • Appropriate use of this data for research (particularly clinical trials simulation) requires larger data sets than you think

  20. Proposed solutions • More robust data integration with semantic interoperability enabled by data standards and a truly Universal Identifier. • This is easier said than done!

  21. Obstacles for which purely automated solutions will be challenging • Integrating more databases offers the promise of filling in gaps in the continuum of care, • But it also increases the likelihood of finding clinical conflicts in the data for an individual • Semantic interoperability will enable different information technology systems to understand the true meaning of data being sent • But have you ever seen two DOCTORS agree upon the meaning of what they hear?

  22. Obstacles for which purely automated solutions will be challenging • Standards will enable computer systems to share a common language to describe clinical concepts • But the precision inherent in these vocabularies often exceeds the precision of medicine • The Universal Identifier problem for identifying individuals across systems will be solved with better algorithms • But then the state of the science will demand defined links between family members!
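
In the absence of a universal identifier, cross-system linkage falls back on matching algorithms. The sketch below is a deliberately simplified weighted match on a few demographic fields; the weights and threshold are assumptions for illustration, and real linkage work relies on probabilistic methods with far more careful handling of names and near-matches.

```python
# Simplified cross-system patient matching. The field weights and threshold
# are assumptions for this sketch, not a production linkage method.

WEIGHTS = {"last_name": 0.4, "first_name": 0.2, "birthdate": 0.3, "sex": 0.1}
MATCH_THRESHOLD = 0.8


def match_score(rec_a, rec_b):
    """Sum the weights of the fields that agree exactly (case-insensitive)."""
    return sum(weight for field, weight in WEIGHTS.items()
               if str(rec_a.get(field, "")).lower() == str(rec_b.get(field, "")).lower())


def same_patient(rec_a, rec_b):
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD


hospital = {"last_name": "Smith", "first_name": "Ann", "birthdate": "1950-02-01", "sex": "F"}
clinic = {"last_name": "SMITH", "first_name": "A.", "birthdate": "1950-02-01", "sex": "F"}
print(same_patient(hospital, clinic))  # True: enough fields agree despite the abbreviated first name
```

A same-sex twin who shares the surname would score identically, which hints at why better matching alone does not settle the identity problem and why explicit links between family members become the next demand.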

  23. Issues in moving forward • Is PICARD best characterized as • A database? • Does it have its own well-defined data model? • An interface? • Can users interact with it on their own? • Is it self-populating from other information resources? • A service? • Do we provide facilitated, as opposed to direct, access? • How do we link our clinical practice database to tissue and genomic databases that are supposedly anonymous? • How do we work with departmental owners of data to achieve comfort in sharing information centrally?

  24. Where we are today

  25. Where we’d like to be

  26. In Summary • With their more comprehensive, longitudinal contents, Clinical Practice Databases like PICARD overcome several of the limitations of older, mostly administrative databases used in research • Can be used to help find patients for recruitment into clinical trials • Can provide data for clinical trials simulation • to extend the generalizability of known clinical trial results • to confirm whether older results are still valid • to provide insight into a research question when a formal clinical trial would be prohibitively expensive or unethical to conduct • Further work is needed to optimize semantic interoperability among component systems.
