This presentation examines what makes data big, in both size and complexity, and what that means for analytics. It covers infrastructure management, the translation of information into knowledge, and a data science framework, then presents a proof of concept, the Medicaid Project, with its challenges, analytics, resources, and opportunities for advancing health analytics. It closes by acknowledging the supporting institutes and individuals.
Big Data: Size, Complexity and Analytics Nicoleta Serban, PhD Associate Professor H. Milton Stewart School of Industrial & Systems Engineering Georgia Institute of Technology
What is Big? Size or Quantity • Gigabyte (10^9 bytes) vs. Terabyte (10^12 bytes) vs. Petabyte (10^15 bytes) vs. Exabyte (10^18 bytes) Complexity or Heterogeneity • Dependencies: temporal, spatial or network • Randomness: sampling scheme • High dimensionality: multiple features • Depth: multiple hierarchies
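To make the size scale concrete, here is a minimal sketch that converts between the units named on this slide. It assumes the decimal (SI) definitions, in which each step from gigabyte to exabyte is a factor of 1,000; the helper names `to_bytes` and `humanize` are illustrative and do not come from the presentation.

```python
# Decimal (SI) byte units named on the slide; each step is a factor of 1,000.
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}


def to_bytes(value, unit):
    """Convert a quantity in the given unit to a number of bytes."""
    return value * UNITS[unit.lower()]


def humanize(n_bytes):
    """Express a byte count in the largest unit for which the value is >= 1."""
    for name, size in sorted(UNITS.items(), key=lambda kv: kv[1], reverse=True):
        if n_bytes >= size:
            return f"{n_bytes / size:g} {name}s"
    return f"{n_bytes} bytes"


if __name__ == "__main__":
    print(to_bytes(2, "terabyte"))
    print(humanize(3 * 10**15))
```

On this scale, one exabyte holds a billion gigabytes, which is why storage and retrieval choices change as data grows.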
Why Size Matters? Infrastructure for managing information: • Storage – relational database vs. distributed systems vs. cloud computing • Retrieval – random vs. sequential access • Representation – level of knowledge vs. derivation of features • Safeguards – protection of privacy and confidentiality
Why Complexity Matters? Translation of information to data to knowledge: • Infrastructure – supercomputers vs. distributed computers • Computation - single-threaded vs. parallelizable computational methods • Analytics – exponentially growing number of hypotheses • Inference – the dangers of ‘blind’ data mining vs. mathematical rigor
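One way to read the "exponentially growing number of hypotheses" bullet: with p candidate features, there are 2^p - 1 non-empty feature subsets that could each be tested. The sketch below is an illustration under that assumption, not the speaker's own calculation. It also assumes independent tests at a fixed level alpha and uses the standard Bonferroni correction as one simple example of the rigor the slide contrasts with blind data mining.

```python
def n_subset_hypotheses(p):
    """Number of non-empty subsets of p features (one hypothesis per subset)."""
    return 2**p - 1


def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(m, alpha=0.05):
    """Per-test significance level that keeps the family-wise error <= alpha."""
    return alpha / m


if __name__ == "__main__":
    for p in (5, 10, 20):
        m = n_subset_hypotheses(p)
        print(p, m, familywise_error(m), bonferroni_threshold(m))
```

Even with 10 features the number of subset hypotheses exceeds a thousand, so an uncorrected search is almost certain to report at least one spurious finding.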
Data Science Framework • Data • Representation • Sampling • Information • Infrastructure • Management • Decisions • System engineering • Knowledge • Computation • Tools • Data architectures • Data integration, sharing and federation • Data privacy rules • Data wrangling • Deriving hypotheses • Validating hypotheses • Eliciting causal relations • Designing, planning, and optimizing • Testing, ranking, scoring • System dynamics • Data mining • Machine learning • Statistical inference • Network analysis • Simulations • Visualization
A Proof of Concept: Medicaid Project
• Information:
  • Identifiable patient-level claims data
  • 5 years + 14 states =
    • 266,839,307,070 observations
    • 2 terabytes of information
• Data:
  • Represented as patient care trajectories: utilization, cost and patient characteristics
  • Sampled by disease
Challenge #1: HIPAA and CMS data safeguards compliance
  - data environment: access, sharing, linking, storage
Challenge #2: Database backbone
  - projected research needs
  - projected computational needs
Challenge #3: Data Processing
  - unavailability of tools to process-mine claims
  - additional data and information needs
  - expert opinion & collaborations
Medicaid Project: Health Analytics • Data: • Condition: Pediatric Asthma • Baseline Metrics • Care Pathway • Access & Outcomes • Knowledge • Systematic disparities in access, outcomes and cost • Network of providers • Profiles of patient-level care pathways Process Mining Spatial Statistical Models Functional Data Analysis Unsupervised classification Sequence clustering Markov-decision processes Optimization
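As a rough illustration of treating care pathways as sequences, the sketch below estimates first-order transition probabilities between care settings from a handful of trajectories. This is a simplified stand-in, not the project's method: the slide names Markov-decision processes and sequence clustering, which go further. The trajectories and care-setting labels are invented toy data, and `transition_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def transition_probabilities(trajectories):
    """Estimate P(next setting | current setting) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in trajectories:
        # Count each consecutive pair (current, next) within one trajectory.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


# Toy trajectories (invented for illustration; not Medicaid data).
toy = [
    ["primary_care", "pharmacy", "primary_care"],
    ["emergency", "hospital", "primary_care"],
    ["primary_care", "emergency", "hospital"],
]

probs = transition_probabilities(toy)

if __name__ == "__main__":
    for state, row in probs.items():
        print(state, row)
```

Each row of the result sums to one, so it can be read as the empirical distribution of the next care setting given the current one. Profiles of this kind could then feed a clustering or decision model.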
Medicaid Project: Health Analytics • Knowledge: • Systematic disparities in access, outcomes and cost • Network of providers • Profiles of patient-level care pathways • Decision Making: • Policy interventions • Network Interventions Markov-decision processes Causal Inference Optimization Modeling Simulations
Medicaid Project: Resources • Legal Process & CMS Approval (~ 2yrs) • Costly IT infrastructure implementation • Extensive IT support • Constrained computing infrastructure • Large team of students • Funding & Deliverables • Visibility
Medicaid Project: Opportunities • Providing a proof of concept for developing larger infrastructures for protected information • Becoming the center for deployment of tools for mining claims data • Advancing rigor in health analytics • Educating students and visiting researchers • Informing policy making in understanding and managing the healthcare system
Acknowledgements
Co-Principal investigator: Dr. Swann
Supporting Institutes and Organizations
• National Science Foundation (CAREER Award)
• Institute for People and Technology
• Children’s Healthcare of Atlanta
Research Team
• IT Staff: Matthew Sanders and Paul Diederich
• Postdoctoral fellow: Dr. Monica Gentili
• Undergraduate students: Yuchen Zheng, Alex Terry, Pravara Harati, Qiming Zhang, Sean Monahan
• Graduate students: Kevin Johnson (MS), Erin Garcia, Ben Johnson, Zihao Li, Ross Hilton
Contact Us
Nicoleta Serban: nserban@isye.gatech.edu
Julie Swann: jswann@isye.gatech.edu