Knowledge Discovery from Mining Big Data

(the Borne Identity) Data Literacy for all ! (the Borne ultimatum) Knowledge Discovery from Mining Big Data Kirk Borne @KirkDBorne George Mason University School of Physics, Astronomy, & Computational Sciences http://classweb.gmu.edu/kborne/

The Big Data Manifesto (the Borne Ultimatum) • More data is not just more data … more is different! • Discover the unknown unknowns. • Address massive Data-to-Knowledge (D2K) challenge. • Data Literacy for all !

Ever since we first began to explore our world…

… humans have asked questions and … We have collected evidence (data) to help answer those questions. The journey from traditional science to …

… Data-intensive Science is a Big Challenge

http://www.economist.com/specialreports/displaystory.cfm?story_id=15557443

Scary News: Big Data is taking us to a Tipping Point http://bit.ly/HUqmu5 http://goo.gl/Aj30t

Promising News: Big Data leads to Big Insights and New Discoveries http://news.nationalgeographic.com/news/2010/11/photogalleries/101103-nasa-space-shuttle-discovery-firsts-pictures/

Good News: Big Data is Sexy http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1 http://dilbert.com/strips/comic/2012-09-05/

Characteristics of Big Data – 1 • Computing power doubles every 18 months (Moore’s Law): ~100x improvement in 10 years. • I/O bandwidth increases ~10% per year: <3x improvement in 10 years. • The amount of data doubles every year: ~1000x in 10 years, and ~1,000,000x in 20 years. • How much data are there in the world? • From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data. • In 2011, the same amount was created every two days. • In 2013, the same amount is created every 10 minutes. • http://money.cnn.com/gallery/technology/2012/09/10/big-data.fortune/index.html

Characteristics of Big Data – 1 • Computing power doubles every 18 months (Moore’s Law): ~100x improvement in 10 years. • The amount of data doubles every year (or faster!): ~1000x in 10 years, and ~1,000,000x in 20 years. • I/O bandwidth increases ~10% per year: <3x improvement in 10 years. • Moore’s Law of Slacking will not help ! http://arxiv.org/abs/astro-ph/9912202
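
The growth factors quoted on this slide follow from simple compounding. The sketch below is only a check of that arithmetic; the function and variable names are illustrative and do not come from the original slides.

```python
# Growth over a span of years, either from a doubling time or from a
# fixed annual growth rate. Names here are illustrative assumptions.

def factor_from_doubling(doubling_years: float, span_years: float) -> float:
    """Growth factor for a quantity that doubles every `doubling_years`."""
    return 2.0 ** (span_years / doubling_years)

def factor_from_annual_rate(rate: float, span_years: float) -> float:
    """Growth factor for a quantity that grows by `rate` (e.g. 0.10) per year."""
    return (1.0 + rate) ** span_years

compute_10y = factor_from_doubling(1.5, 10)   # Moore's Law: doubles every 18 months
data_10y = factor_from_doubling(1.0, 10)      # data volume: doubles every year
data_20y = factor_from_doubling(1.0, 20)
io_10y = factor_from_annual_rate(0.10, 10)    # I/O bandwidth: ~10% per year

print(f"computing power over 10 years: ~{compute_10y:.0f}x")
print(f"data volume over 10 years:     ~{data_10y:.0f}x")
print(f"data volume over 20 years:     ~{data_20y:.0f}x")
print(f"I/O bandwidth over 10 years:   ~{io_10y:.1f}x")
print(f"data growth / compute growth over 10 years: ~{data_10y / compute_10y:.0f}x")
```

The last line is the point of the slide: under these rates, data volume outgrows computing power by roughly a factor of ten per decade, and outgrows I/O bandwidth by far more.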

Characteristics of Big Data – 2 • Big quantities of data are acquired everywhere. • It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, national security, media, etc. • LSST project (www.lsst.org): • 20 Terabytes of astronomical imaging every night • 100-200 Petabyte image archive after 10 years • 20-40 Petabyte database • 2-10 million new sky events nightly that need to be characterized and classified – potential new discoveries!

Characteristics of Big Data – 3 • Job opportunities are sky-rocketing • Extremely high demand for Data Science skills • Demand will continue to increase • Old: “100 applicants per job”. New: “100 jobs per applicant” • McKinsey Report (2011**): • Big Data is the new “gold rush”, the “new oil” • Shortage of 1.5 million skilled data scientists within 5 years • **http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

Data Sciences: A National Imperative
1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data (1997), http://www.nap.edu/catalog.php?record_id=5504
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries (2003), downloaded from http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education (2003), downloaded from http://www.ncar.ucar.edu/cyber/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005), downloaded from http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda (2005), downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure (2005), downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe (2006), downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf
8. NSF report: Cyberinfrastructure Vision for 21st Century Discovery (2007), downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf
9. JISC/NSF Workshop report on Data-Driven Science & Repositories (2007), downloaded from http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf
10. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale (2007), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
11. DOE report: Mathematics for Analysis of Petascale Data Workshop Report (2008), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
12. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society (2009), downloaded from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
13. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (2009), downloaded from http://www.nap.edu/catalog.php?record_id=12615
14. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences (2010), downloaded from http://www.cra.org/ccc/docs/reports/DES-report_final.pdf
15. National Big Data Research and Development Initiative (2012), downloaded from http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf

The Fourth Paradigm: Data-Intensive Scientific Discovery http://research.microsoft.com/en-us/collaboration/fourthparadigm/ • The 4 Scientific Paradigms: • Experiment (sensors) • Theory (modeling) • Simulation (HPC) • Data Exploration (KDD)

Characteristics of Big Data – 4 • The emergence of Data Science and Data-Oriented Science (the 4th paradigm of science). • “Computational literacy and data literacy are critical for all.” - Kirk Borne • A complete data collection on any complex domain (e.g., Earth, or the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered. • “Somewhere, something incredible is waiting to be known.” - Carl Sagan • We call this “X-Informatics”: addressing the D2K (Data-to-Knowledge) Challenge in any discipline X using Data Science. • Examples: Astroinformatics, Bioinformatics, Geoinformatics, Climate Informatics, Ecological Informatics, Biodiversity Informatics, Environmental Informatics, Health Informatics, Medical Informatics, Neuroinformatics, Crystal Informatics, Cheminformatics, Discovery Informatics, and more.

Characterizing the Big Data Hype • If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”. • Big Data characteristics: the 3+n V’s = • Volume (lots of data = “Tonnabytes”) • Variety (complexity, curse of dimensionality) • Velocity (rate of data and information flow) • Veracity (verifying inference-based models from comprehensive data collections) … as I said earlier: a complete data collection on any complex domain (e.g., Earth, or the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered. • Variability • Venue • Vocabulary • Value

Characterizing the Big Data Hype • If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”. • Big Data characteristics: the 3+n V’s = • Volume • Variety: this one helps us to discriminate subtle new classes (= Class Discovery) • Velocity • Veracity • Variability • Venue • Vocabulary • Value • Big Data example:

Insufficient Variety: stars & galaxies are not separated in this parameter

Sufficient Variety: stars & galaxies are separated in this parameter

4 Categories of Scientific KDD (Knowledge Discovery in Databases) • Class Discovery • Finding new classes of objects and behaviors • Learning the rules that constrain the class boundaries • Novelty Discovery • Finding new, rare, one-in-a-million(billion)(trillion) objects and events • Correlation Discovery • Finding new patterns and dependencies, which reveal new natural laws or new scientific principles • Association Discovery • Finding unusual (improbable) co-occurring associations

This graphic says it all … • Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization ! • Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify ! • Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise ! Graphic provided by Professor S. G. Djorgovski, Caltech
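
The three steps on this slide can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the method behind the graphic: the data are synthetic, k-means stands in for "find the clusters", nearest-centroid assignment stands in for "classify", and the outlier threshold (the largest distance seen in the training data) is an arbitrary choice.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Synthetic 2-D data: two well-separated clouds (illustrative only).
rng = random.Random(0)
cloud_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
cloud_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]
data = cloud_a + cloud_b

# 1. Clustering = Characterization: find the clouds without using labels.
centroids = kmeans(data, k=2)

# 2. Classification: place a new item in the nearest known cluster.
def classify(item):
    return nearest(item, centroids)

# 3. Outlier detection = Surprise: flag items farther from every cluster
#    than any training item was from its own cluster (assumed threshold).
radius = max(math.dist(p, centroids[nearest(p, centroids)]) for p in data)

def is_outlier(item):
    return math.dist(item, centroids[classify(item)]) > radius

print("cluster centers:", centroids)
print("cluster index for (9.5, 10.2):", classify((9.5, 10.2)))
print("is (5, -5) an outlier?", is_outlier((5.0, -5.0)))
```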

Scientists have been doing Data Mining for centuries “The data are mine, and you can’t have them!” • Seriously ... • Scientists love to classify things ... (Supervised Learning, e.g., classification) • Scientists love to characterize things ... (Unsupervised Learning, e.g., clustering) • And we love to discover new things ... (Semi-supervised Learning, e.g., outlier detection)

Data-Driven Discovery: Scientific KDD (Knowledge Discovery from Data) • Class Discovery • Novelty Discovery • Correlation Discovery • Association Discovery • Benefits of very large datasets: • best statistical analysis of “typical” events • automated search for “rare” events Graphic from S. G. Djorgovski

Scientific Data-to-Knowledge Problem 1-a • The Class Discovery Problem (clustering): • Find distinct clusters of multivariate scientific parameters that separate objects within a data set. • Find new classes of objects or new behaviors. • What is the significance of the clusters (statistically and scientifically)? • What is the optimal algorithm for finding friends-of-friends or nearest neighbors in very high dimensions (complex data with Variety)? • N is > 10^10, so what is the most efficient way to sort? • Number of dimensions > 1000 – therefore, we have an enormous subspace search problem
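
To make "friends-of-friends" concrete, here is a minimal brute-force sketch: two points are friends if they lie within a linking length of each other, and a cluster is any set of points connected through chains of friends. The sample points and the linking length are made up for illustration. The all-pairs loop is O(N^2), which is exactly what becomes infeasible at the N > 10^10 scale the slide describes; a real pipeline would need a spatial index or an approximate method.

```python
import math
from itertools import combinations

def friends_of_friends(points, linking_length):
    """Group point indices into clusters linked by chains of close pairs."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pair search: fine for a toy example, not for big data.
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= linking_length:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)

# Illustrative points: a chain of three, a pair, and an isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.2), (5.0, 5.0), (5.4, 5.1), (9.0, 0.0)]
groups = friends_of_friends(points, linking_length=0.8)
print(groups)
```

Points 0 and 2 are farther apart than the linking length but still share a cluster, because each is a friend of point 1.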

Scientific Data-to-Knowledge Problem 1-b • The superposition / decomposition problem: • Finding the parameters or combinations of parameters (out of 100’s or 1000’s) that most cleanly and optimally (parsimoniously) distinguish different object classes • What if there are 10^10 objects that overlap in a 10^3-D parameter space? • What is the optimal way to separate and accurately classify the different unique classes of objects?

Class Discovery: feature separation and discrimination of classes • The separation of classes improves when the “correct” features are chosen for investigation, as in the following star-galaxy discrimination test: the “Star-Galaxy Separation” Problem [Figure panels: Good / Not good] Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
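
One simple way to ask which feature is the "correct" one is to score each feature by how far apart the class means are relative to the spread within each class. The sketch below uses a Fisher-style ratio for that purpose. It is an assumption-laden illustration: the feature names `color` and `concentration` and all the numbers are invented, and this is not the procedure used in the referenced star-galaxy study.

```python
import random
import statistics

def fisher_score(a, b):
    """Squared difference of class means divided by the sum of class variances."""
    gap = statistics.fmean(a) - statistics.fmean(b)
    return gap ** 2 / (statistics.variance(a) + statistics.variance(b))

# Synthetic catalog with two hypothetical features per object.
rng = random.Random(42)
stars = {
    "color": [rng.gauss(1.0, 0.5) for _ in range(500)],          # overlaps with galaxies
    "concentration": [rng.gauss(0.0, 0.1) for _ in range(500)],  # point-like sources
}
galaxies = {
    "color": [rng.gauss(1.1, 0.5) for _ in range(500)],
    "concentration": [rng.gauss(1.0, 0.3) for _ in range(500)],  # extended sources
}

scores = {name: fisher_score(stars[name], galaxies[name]) for name in stars}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: separation score {score:.2f}")
print("best single feature:", best)
```

A per-feature score like this only ranks individual parameters. The slide's harder question, finding combinations of parameters out of hundreds or thousands, needs a multivariate method.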

Scientific Data-to-Knowledge Problem 2 • The Novelty Discovery Problem: • Anomaly Detection, Deviation Detection, Surprise Discovery, Novelty Discovery: Finding objects and events that are outside the bounds of our expectations (outside known clusters) • Finding new, rare, one-in-a-million(billion)(trillion) objects and events – Finding the Unknown Unknowns • These may be real scientific discoveries or garbage • Outlier detection is therefore useful for: • Anomaly Detection – is the detector system working? • Data Quality Assurance – is the data pipeline working? • Novelty Discovery – is my Nobel prize waiting? • How does one optimally find outliers in 10^3-D parameter space? or in interesting subspaces (in lower dimensions)? • How do we measure their “interestingness”?
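
One common, simple stand-in for "interestingness" is the distance from each object to its k-th nearest neighbor: isolated objects get large scores. The sketch below is a toy under stated assumptions (3-D synthetic data, one planted anomaly, k = 5, brute-force distances) and does not address the 10^3-D or subspace questions raised on the slide.

```python
import math
import random

def knn_outlier_scores(points, k=5):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Synthetic data: one dense cloud plus a single planted anomaly.
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
points.append((8.0, 8.0, 8.0))  # planted anomaly, index 300

scores = knn_outlier_scores(points, k=5)
ranked = sorted(range(len(points)), key=scores.__getitem__, reverse=True)

print("most isolated object index:", ranked[0])
print("its score:", round(scores[ranked[0]], 2))
```

Whether a high-scoring object is a discovery, a detector fault, or a pipeline error still has to be decided by follow-up, as the slide notes.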

Novelty Discovery: Improved Discovery of Rare Objects or Events across Multiple Data Sources

Scientific Data-to-Knowledge Problem 3 • The Correlation Discovery Problem = Dimension Reduction Problem: • Finding new correlations and “fundamental planes” of parameters. • Such correlations, patterns, and dependencies may reveal new physics or new scientific relations. • Number of attributes can be hundreds or thousands = The Curse of High Dimensionality ! • Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
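
A "fundamental plane" means that objects measured in several parameters actually lie close to a two-dimensional surface, so the first two principal components capture nearly all the variance. The sketch below illustrates that idea on synthetic data with three made-up attributes, using power iteration with deflation as a dependency-free substitute for a library eigensolver. It is not the analysis behind any result cited in these slides.

```python
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def leading_eigenpair(m, iters=500):
    """Largest eigenvalue and its eigenvector, by power iteration."""
    v = [1.0 / (j + 1) for j in range(len(m))]  # arbitrary non-zero start
    for _ in range(iters):
        w = matvec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(m, v)))
    return eigenvalue, v

def deflate(m, eigenvalue, v):
    """Remove a found component so the next one becomes the leading one."""
    d = len(m)
    return [[m[a][b] - eigenvalue * v[a] * v[b] for b in range(d)] for a in range(d)]

# Synthetic objects: the third attribute is (almost) a linear combination
# of the first two, so the points lie near a plane in 3-D.
rng = random.Random(7)
rows = []
for _ in range(1000):
    u, w = rng.gauss(0, 2), rng.gauss(0, 1)
    rows.append((u, w, 0.5 * u + 0.5 * w + rng.gauss(0, 0.05)))

cov = covariance_matrix(rows)
lam1, pc1 = leading_eigenpair(cov)
lam2, pc2 = leading_eigenpair(deflate(cov, lam1, pc1))
total_variance = sum(cov[j][j] for j in range(len(cov)))
captured = (lam1 + lam2) / total_variance

print(f"variance captured by PC1+PC2: {100 * captured:.2f}%")
```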

Fundamental Plane for 156,000 Elliptical Galaxies: plot shows variance captured by first 2 Principal Components as a function of local galaxy density. Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008. [Figure axes: % of variance captured by PC1+PC2; Local Galaxy Density (low to high)]

Scientific Data-to-Knowledge Problem 4 • The Association Discovery Problem: Link Analysis – Network Analysis – Graph Mining • Identify connections between different events (or objects) • Find unusual (improbable) co-occurring combinations of data attribute values • Find data items that have far fewer than “6 degrees of separation” • Identifying such connectivity in our scientific databases and knowledge repositories can lead to new insights, new knowledge, new discoveries.
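
"Unusual (improbable) co-occurring combinations" can be quantified with lift: how much more often two attribute values appear together than they would if they were independent. The sketch below computes pairwise lift on a tiny invented catalog. The attribute values (`blue`, `radio_loud`, and so on) are hypothetical labels, and lift is one standard measure among several, chosen here for simplicity.

```python
from collections import Counter
from itertools import combinations

def pair_lift(records):
    """Lift of every co-occurring pair: P(a and b) / (P(a) * P(b))."""
    n = len(records)
    item_counts = Counter(item for r in records for item in set(r))
    pair_counts = Counter(
        pair for r in records for pair in combinations(sorted(set(r)), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }

# Each record lists the attribute values observed for one object (made up).
records = [
    ["blue", "extended"],
    ["blue", "extended"],
    ["blue", "extended"],
    ["blue", "extended"],
    ["red", "extended"],
    ["red", "extended"],
    ["red", "compact"],
    ["blue", "compact"],
    ["variable", "radio_loud", "compact"],
    ["variable", "radio_loud", "blue"],
]

lifts = pair_lift(records)
strongest = max(lifts, key=lifts.get)

for pair, lift in sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(lift, 2))
```

In this toy catalog the rare values `radio_loud` and `variable` always appear together, so their pair stands out even though it occurs in only two records. With so few records, a high lift can also be chance, which is why large samples matter.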

There are many technologies associated with Big Data http://siliconangle.com/blog/2012/07/13/big-data-nightmares/

One approach to Big Data: Computational Science (Hadoop, Map/Reduce) http://www.bigdatabytes.com/wp-content/uploads/2012/01/big-data.jpg

Another approach to Big Data: Data Science (Informatics)

A third approach to Big Data: Citizen Science (crowdsourcing)

Galaxy Zoo: example of Citizen Science (crowdsourcing) http://www.zooniverse.org http://astrophysics.gsfc.nasa.gov/outreach/podcast/wordpress/index.php/2010/10/08/saras-blog-be-a-scientist/

Astroinformatics research paper available ! Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and Education”, Journal of Earth Science Informatics, vol. 3, pp. 5-17. See also http://arxiv.org/abs/0909.3892 The paper addresses the data science challenges, research agenda, application areas, use cases, and recommendations for the new science of Astroinformatics.

LSST = Large Synoptic Survey Telescope http://www.lsst.org/ • 8.4-meter diameter primary mirror, 10 square degree field of view! • 100-200 Petabyte image archive • 20-40 Petabyte database catalog

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (~2021-2031), with time domain sampling in log(time) intervals (to capture dynamic range of transients). • LSST (Large Synoptic Survey Telescope): • Ten-year time series imaging of the night sky – mapping the Universe ! • ~2,000,000 events each night – anything that goes bump in the night ! • Cosmic Cinematography! The New Sky! @ http://www.lsst.org/
