This presentation by Dr. Nicoleta Serban explores the dimensions of Big Data, focusing on size and complexity in healthcare analytics. It highlights critical aspects, such as data storage and retrieval infrastructures, privacy safeguards, and computational methods. Additionally, the Medicaid Project serves as a proof of concept, illustrating the challenges and opportunities in managing vast datasets, particularly within pediatric asthma care pathways. The goal is to advance methodologies in health analytics while ensuring compliance and integrity in processing and utilizing patient data.
Big Data: Size, Complexity and Analytics Nicoleta Serban, PhD Associate Professor H. Milton Stewart School of Industrial & Systems Engineering Georgia Institute of Technology
What is Big? Size or Quantity • Gigabyte (10^9 bytes) vs. Terabyte (10^12 bytes) vs. Petabyte (10^15 bytes) vs. Exabyte (10^18 bytes) Complexity or Heterogeneity • Dependencies: temporal, spatial or network • Randomness: sampling scheme • High dimensionality: multiple features • Depth: multiple hierarchies
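As a quick sense of scale, the unit sizes on this slide can be computed directly. This is a minimal Python sketch, not part of the presentation; it assumes decimal (SI) prefixes rather than binary ones (GiB, TiB), and the helper name `to_bytes` is illustrative.

```python
# Decimal (SI) storage units, expressed as powers of ten.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}

def to_bytes(value, unit):
    """Convert a quantity in the given SI unit to bytes (hypothetical helper)."""
    return value * UNITS[unit]

if __name__ == "__main__":
    for unit, size in UNITS.items():
        print(f"1 {unit} = {size:,} bytes")
```

Each step up the list is a factor of 1,000, so an exabyte is a billion gigabytes.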
Why Size Matters? Infrastructure for managing information: • Storage – relational database vs. distributed systems vs. cloud computing • Retrieval – random vs. sequential access • Representation – level of knowledge vs. derivation of features • Safeguards – protection of privacy and confidentiality
Why Complexity Matters? Translation of information to data to knowledge: • Infrastructure – supercomputers vs. distributed computers • Computation – single-threaded vs. parallelizable computational methods • Analytics – exponentially growing number of hypotheses • Inference – the dangers of ‘blind’ data mining vs. mathematical rigor
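The point about a growing number of hypotheses can be made concrete with the standard family-wise error rate calculation: if m independent tests are each run at level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m. The sketch below is illustrative and not taken from the presentation; it assumes independent tests and uses the Bonferroni correction as one common remedy, not necessarily the one the author has in mind.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / m

if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        fwer = family_wise_error_rate(0.05, m)
        level = bonferroni_threshold(0.05, m)
        print(f"m={m:>4}  FWER={fwer:.3f}  Bonferroni level={level:.6f}")
```

With 100 uncorrected tests at the 5% level, a spurious "finding" is close to certain, which is one way to read the warning about 'blind' data mining.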
Data Science Framework • Data • Representation • Sampling • Information • Infrastructure • Management • Decisions • System engineering • Knowledge • Computation • Tools • Data architectures • Data integration, sharing and federation • Data privacy rules • Data wrangling • Deriving hypotheses • Validating hypotheses • Eliciting causal relations • Designing, planning, and optimizing • Testing, ranking, scoring • System dynamics • Data mining • Machine learning • Statistical inference • Network analysis • Simulations • Visualization
A Proof of Concept: Medicaid Project • Information: • Identifiable patient-level claims data • 5 years + 14 states = 266,839,307,070 observations • 2 terabytes of information • Data: • Represented as patient care trajectories: utilization, cost and patient characteristics • Sampled by disease • Challenge #1: HIPAA and CMS data safeguards compliance - data environment: access, sharing, linking, storage • Challenge #2: Database backbone - projected research needs - projected computational needs • Challenge #3: Data processing - unavailability of tools to process-mine claims - additional data and information needs - expert opinion & collaborations
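The two size figures on this slide can be related with simple arithmetic. The sketch below is a back-of-the-envelope check, not something the presentation states; it assumes "2 Terabytes" means decimal terabytes (2 × 10^12 bytes) and that the figure is the stored size of those observations.

```python
OBSERVATIONS = 266_839_307_070   # figure quoted on the slide
TOTAL_BYTES = 2 * 10**12         # assumption: decimal terabytes

def bytes_per_observation(total_bytes, observations):
    """Average storage per observation under the assumptions above."""
    return total_bytes / observations

if __name__ == "__main__":
    avg = bytes_per_observation(TOTAL_BYTES, OBSERVATIONS)
    print(f"{avg:.2f} bytes per observation")
```

Under these assumptions the average is roughly 7.5 bytes per observation, which suggests an "observation" is closer to a single recorded value than to a full claim record. The slides do not say which.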
Medicaid Project: Health Analytics • Data: • Condition: Pediatric Asthma • Baseline Metrics • Care Pathway • Access & Outcomes • Knowledge: • Systematic disparities in access, outcomes and cost • Network of providers • Profiles of patient-level care pathways • Process Mining • Spatial Statistical Models • Functional Data Analysis • Unsupervised classification • Sequence clustering • Markov decision processes • Optimization
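To illustrate what profiling care pathways can look like in code, the sketch below estimates first-order transition probabilities between care events from a set of sequences. It is a generic Markov-chain estimate, not the project's actual method; the event labels and pathways are made up for illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(pathways):
    """Estimate first-order transition probabilities from event sequences."""
    counts = defaultdict(Counter)
    for path in pathways:
        # Count each consecutive (current, next) pair of events.
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

if __name__ == "__main__":
    # Hypothetical pathways; the event names are illustrative only.
    pathways = [
        ["office_visit", "prescription", "office_visit"],
        ["emergency_visit", "hospitalization", "office_visit"],
        ["office_visit", "emergency_visit", "office_visit"],
    ]
    for state, row in transition_probabilities(pathways).items():
        print(state, row)
```

A transition table of this kind is one possible input to the sequence clustering and Markov decision process methods listed on the slide.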
Medicaid Project: Health Analytics • Knowledge: • Systematic disparities in access, outcomes and cost • Network of providers • Profiles of patient-level care pathways • Decision Making: • Policy interventions • Network Interventions Markov-decision processes Causal Inference Optimization Modeling Simulations
Medicaid Project: Resources • Legal Process & CMS Approval (~ 2yrs) • Costly IT infrastructure implementation • Extensive IT support • Constrained computing infrastructure • Large team of students • Funding & Deliverables • Visibility
Medicaid Project: Opportunities • Developing a proof of concept for building larger infrastructures for protected information • Becoming the center for deployment of tools for mining claims data • Advancing rigor in health analytics • Educating students and visiting researchers • Informing policy making in understanding and managing the healthcare system
Acknowledgements Co-Principal investigator: Dr. Swann Supporting Institutes and Organizations • National Science Foundation (CAREER Award) • Institute for People and Technology • Children’s Healthcare of Atlanta Research Team IT Staff: Matthew Sanders and Paul Diederich Postdoctoral fellow: Dr. Monica Gentili Undergraduate students: Yuchen Zheng, Alex Terry, Pravara Harati, Qiming Zhang, Sean Monahan Graduate students: Kevin Johnson (MS), Erin Garcia, Ben Johnson, Zihao Li, Ross Hilton
Contact Us Nicoleta Serban nserban@isye.gatech.edu Julie Swann jswann@isye.gatech.edu