
Big Data Kirk Borne George Mason University LSST All Hands Meeting August 13 - 17, 2012



Presentation Transcript


  1. Big Data. Kirk Borne, George Mason University. LSST All Hands Meeting, August 13 - 17, 2012

  2. Characteristics of Big Data – 1a Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.
  3. Characteristics of Big Data – 1b Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.
  4. Characteristics of Big Data – 1c Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc.
  5. Characteristics of Big Data – 2 Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, gov, social networks, etc. But… What do we mean by “big”? Gigabytes? Terabytes? Petabytes? Exabytes? The meaning of “big” is domain-specific and resource-dependent (data storage, I/O throughput, computation cycles, communication costs). I say … we are all dealing with our own “tonnabytes”.
  6. Characteristics of Big Data – 3 There are 4 dimensions to the Big Data challenge: Volume (“tonnabytes” data challenge); Variety (complexity, curse of dimensionality); Velocity (rate of data and information flowing at us); Verification (verifying inference-based models from data). Therefore, we need something better to cope with the tonnabytes …
  7. Data Science – Informatics & Statistics
  8. This graphic says it all … Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization! Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify! Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise! (Graphic provided by S. G. Djorgovski, Caltech.)
  9. Data-Enabled Science: Scientific KDD (Knowledge Discovery from Data). Characterize the new (clustering, unsupervised learning); assign the known (classification, supervised learning); discover the unknown (outlier detection, semi-supervised learning). The two major benefits of BIG DATA: (1) best statistical analysis of “typical” events; (2) automated search for “rare” events. (Graphics from S. G. Djorgovski.)
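The classification and outlier-detection steps on this slide can be illustrated together with a nearest-centroid rule. This is only a minimal sketch, not the method used in the talk: the function name `classify_or_flag`, the class labels, and the `max_distance` threshold are all hypothetical, and each known class is assumed to be summarised by a single centroid.

```python
import math


def classify_or_flag(item, centroids, max_distance):
    """Return the label of the nearest class centroid, or None when the item
    is farther than max_distance from every known class (a "surprise")."""
    label, dist = min(
        ((name, math.dist(item, centre)) for name, centre in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= max_distance else None


# Hypothetical two-class example in a 2-D feature space.
known_classes = {"star": (0.0, 0.0), "galaxy": (5.0, 5.0)}
print(classify_or_flag((0.5, 0.2), known_classes, max_distance=2.0))
print(classify_or_flag((20.0, -20.0), known_classes, max_distance=2.0))
```

The first item falls inside a known class and is classified; the second fits no class and is returned as `None`, which is the point at which an outlier would be handed to a human or a follow-up pipeline.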
  10. Basic Astronomical Knowledge Problems – 1 The clustering problem: finding clusters of objects within a data set. What is the significance of the clusters (statistically and scientifically)? What is the optimal algorithm for finding friends-of-friends or nearest neighbors? N is > 10^10, so what is the most efficient way to sort? The number of dimensions is ~1000; therefore, we have an enormous subspace search problem. Are there pair-wise (2-point) or higher-order (N-way) correlations? N is > 10^10, so what is the most efficient way to do an N-point correlation? Algorithms that scale as N^2 log N won’t get us there.
  11. Basic Astronomical Knowledge Problems – 2 Outlier detection (unknown unknowns): finding the objects and events that are outside the bounds of our expectations (outside known clusters). These may be real scientific discoveries or garbage. Outlier detection is therefore useful for: Novelty Discovery – is my Nobel prize waiting? Anomaly Detection – is the detector system working? Data Quality Assurance – is the data pipeline working? How does one optimally find outliers in a 10^3-D parameter space, or in interesting subspaces (in lower dimensions)? How do we measure their “interestingness”?
  12. Basic Astronomical Knowledge Problems – 3 The dimension reduction problem: finding correlations and “fundamental planes” of parameters. The number of attributes can be hundreds or thousands: the Curse of High Dimensionality! Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another? Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
  13. Basic Astronomical Knowledge Problems – 4 The superposition / decomposition problem: finding the defining features that separate different classes of objects that overlap in simple parameter spaces. What if there are 10^10 objects that overlap in a 10^3-D parameter space? What is the optimal way to separate and extract the different unique classes of objects?
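To make the friends-of-friends question concrete, here is a minimal sketch that avoids comparing all pairs by binning points into a uniform grid and merging groups with union-find. The function name `friends_of_friends` and the `linking_length` parameter are illustrative assumptions, not something taken from the talk. The grid visits 3^d neighbouring cells per cell, so it is only practical in a few dimensions; that limitation is exactly why the ~1000-dimensional case on the slide is hard.

```python
from collections import defaultdict
from itertools import product
import math


def friends_of_friends(points, linking_length):
    """Label points so that any two points within linking_length of each
    other (directly or through a chain of friends) share a label."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Cells have side linking_length, so every friend of a point lies in
    # the point's own cell or in one of the adjacent cells.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / linking_length) for c in p)].append(idx)

    dims = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dims))
    for cell, members in grid.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in grid.get(neighbour, ()):
                    if i < j and math.dist(points[i], points[j]) <= linking_length:
                        union(i, j)

    return [find(i) for i in range(len(points))]


pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0), (10.4, 10.0)]
print(friends_of_friends(pts, linking_length=0.6))
```

The first three points are chained into one group even though the end points are more than one linking length apart; the last two form a second group.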
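One simple way to quantify “interestingness” is the distance from each object to its k-th nearest neighbour: isolated objects get large scores. The sketch below is an assumption for illustration (the name `knn_outlier_scores` and the choice of score are not from the talk). It is deliberately brute force, so it scales as N^2 log N and shows the idea only on small samples, not at LSST scale.

```python
import math


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbour.
    Larger scores mean the point sits in a sparser region.
    Requires k <= len(points) - 1."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = knn_outlier_scores(sample, k=2)
most_isolated = sample[scores.index(max(scores))]
print(most_isolated)
```

Ranking objects by such a score leaves open the question the slide raises: whether a high-scoring object is a discovery, a detector fault, or a pipeline error still has to be decided downstream.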
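The question about eigenvectors can be illustrated with the leading principal component, found here by power iteration on the sample covariance matrix. This is a standard-library-only sketch under stated assumptions: the function name `first_principal_component` is hypothetical, the starting vector is arbitrary (a start exactly orthogonal to the leading eigenvector would not converge), and real work would use a linear algebra library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it) for the leading principal
    component, using power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Arbitrary starting direction with unequal components.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction; stop iterating
        v = [x / norm for x in w]

    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                   for a in range(d))
    return v, variance


# Two perfectly correlated parameters collapse onto a single direction.
rows = [(float(x), 2.0 * x) for x in range(10)]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

When two observed parameters are strongly correlated, nearly all of the variance lies along one direction, which is the sense in which a condensed representation can stand in for the full parameter set.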
  14. The LSST Big Data Manifesto More data is not just more data … more is different! Discover the unknown unknowns. Massive Data-to-Knowledge challenge.
  15. The LSST Big Data Challenges (1) Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years). (2) Massive 20-Petabyte database: more than 20 billion objects need to be classified, and most will be monitored for important variations in real time. (3) Massive event stream: knowledge extraction in real time for ~2,000,000 events each night. Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3. Look at these in more detail ...
  16. LSST big data challenges #1, #2 Each night for 10 years, LSST will obtain roughly the same amount of data as was obtained by the entire Sloan Digital Sky Survey. Our grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data): a truckload of CDs each and every day for 10 yrs; cumulatively, a football stadium full of 100 million CDs after 10 yrs. The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data. Yes, more is most definitely different!
  17. LSST big data challenge #3 Approximately 2,000,000 times each night for 10 years, LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events? Characterize first! (Unsupervised Learning) Classify later. (Figure: light curve, flux versus time.)
  18. Characterization includes … Feature Detection and Extraction: identifying and describing features in the data via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science); extracting feature descriptors from the data; curating these features for search, re-use, & discovery; finding other parameters and features from other archives, other databases, other information sources, and using those to help characterize (ultimately classify) each new event. … hence, coping with a highly multivariate parameter space.
  19. Data-driven Discovery (Unsupervised Learning), i.e., What can I do with characterizations? Class Discovery – Clustering; Principal Component Analysis – Dimension Reduction; Outlier Detection – Surprise / Anomaly / Deviation / Novelty Discovery; Link Analysis – Association Analysis – Network Analysis; and more.
  20. Addressing the D2K (Data-to-Knowledge) Challenge Complete end-to-end application of Data Science: data management, metadata management, data search, information extraction, data mining, knowledge discovery. Applies to any discipline (not just science disciplines). Skilled workforce needed to take data to knowledge.
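The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes). The slide does not state a CD capacity or the number of observing nights, so those values are derived from the quoted numbers, not quoted from the talk. The ~2 TB per hour rate comes from the preceding slide.

```python
TB = 10**12  # decimal terabyte in bytes (an assumption about the slide's units)
MB = 10**6

nightly_bytes = 20 * TB      # ~20 TB each night
hourly_bytes = 2 * TB        # ~2 TB of image data per hour
cds_per_night = 40_000       # quoted CD equivalent per night
total_cds = 100_000_000      # quoted CD total after 10 years
years = 10

# Implied capacity of one CD, observing hours per night, and nights per year.
implied_cd_capacity_mb = nightly_bytes / cds_per_night / MB
hours_per_night = nightly_bytes / hourly_bytes
implied_nights_per_year = total_cds / cds_per_night / years

print(implied_cd_capacity_mb, hours_per_night, implied_nights_per_year)
```

Under these assumptions the quoted numbers imply roughly 500 MB per CD, about 10 hours of observing per night, and about 250 observing nights per year, so the “100 million CDs” total reads as a round order-of-magnitude figure, not an exact count.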
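“Characterize first” can be read as reducing each event's flux-versus-time series to a handful of descriptors before any class label is attempted. The sketch below is illustrative only: the function name `characterize_light_curve` and the particular features are assumptions, not the feature set used by LSST, and the time stamps are assumed to be strictly increasing.

```python
import statistics


def characterize_light_curve(times, fluxes):
    """Compute a few simple descriptors of a flux-versus-time series.
    Assumes times are strictly increasing and matched one-to-one with fluxes."""
    times, fluxes = list(times), list(fluxes)
    if len(times) != len(fluxes) or len(times) < 2:
        raise ValueError("need at least two matched (time, flux) samples")

    # Rate of change between consecutive samples.
    slopes = [
        (fluxes[i + 1] - fluxes[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]
    return {
        "median_flux": statistics.median(fluxes),
        "amplitude": max(fluxes) - min(fluxes),
        "scatter": statistics.stdev(fluxes),
        "peak_time": times[fluxes.index(max(fluxes))],
        "max_rise_rate": max(slopes),
    }


features = characterize_light_curve([0, 1, 2, 3, 4], [1.0, 1.0, 5.0, 2.0, 1.0])
print(features)
```

Descriptors like these form the feature vector that clustering, outlier detection, and later classification would operate on.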
  21. Informatics in Education and An Education in Informatics
  22. Data Science Education: Two Perspectives (1) Informatics in Education – working with data in all learning settings. Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning. Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied. http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”). Example: CSI The Cosmos. (2) An Education in Informatics – students are specifically trained: … to access large distributed data repositories; … to conduct meaningful inquiries into the data; … to mine, visualize, and analyze the data; … to make objective data-driven inferences, discoveries, and decisions. Numerous Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, Indiana U., … ). http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)
  23. Responses to Big Data – 1 2.5 approaches to dealing with Big Data: (1) Data Science = Informatics & Statistics (and data-intensive computing); (2) Citizen Science = Human Computation; (2.5) or else … (where possible) combine these two: use the very effective human cognitive skills of pattern recognition and anomaly detection to generate training sets of relevant features (characterizations) to improve the machine algorithms.
  24. Responses to Big Data – 2 LSST Informatics & Statistics Science Collaboration: breakout @ 11am in TB-A. New Journal: Astronomy & Computing (poster and flyers available in hallway), http://www.journals.elsevier.com/astronomy-and-computing/ New AAS Working Group on Astroinformatics & Astrostatistics. Members: “Bill” Zeljko Ivezic (chair), Kirk Borne, George Djorgovski, Eric Feigelson, Eric Ford, Alyssa Goodman, Aneta Siemiginowska, Alex Szalay, Rick White. Visit https://www.facebook.com/AstroInformatics
  25. LSST Informatics & Statistics Breakout Session 11:00am-12:30pm today – Tortolita Ballroom A. Brief “lightning” talks by 7 team members: Jogesh Babu: Statistical Resources; Kirk Borne: Outlier Detection for Surprise Discovery in Big Data; Matthew Graham: Characterizing and Classifying CRTS; Joseph Richards: Time-Domain Discovery and Classification; Sam Schmidt: Upcoming Challenges for Photometric Redshifts; Lior Shamir: Automatic Analysis of Galaxy Morphology; John Wallin: Citizen Science and Machine Learning. Open Discussion: LSST Publication Reviews: informatics & statistics participation; LSST Science Book chapter; Research Roadmap document.