Mainlining Data Mining:
Presentation Transcript
Mainlining Data Mining
Jim Gray, Microsoft
Panel talk at ICDE 2000, San Diego, 2 Mar 2000
Is data mining still a niche technology?
• 97,363 items on Northern Light re “data mining”
• 9,075,288 items re “data base” or “database”
• Is 100,000 items a niche? (OR: 14K, XML: 250K)
• Today: data mining tools for experts (statisticians). (Decision Trees, Clusters, K-means, Neural nets…)
• High tech and High Touch, aka consulting and license fees. And the vendors like it that way.
• Claim that you MUST understand the technology to use it.
But.. The Petabytes are Coming!!
• We will be/are drowning in data/email/web..
• Abstraction & categorization are key technologies
• But,
    • They have to work.
    • They have to be trivial to learn.
• Successful Ubiquitous data mining (clustering/classifiers…)
    • Mail Filters/Classifiers
    • Resume readers
    • Shopping recommendations, Community finders
    • Web search engines
Key technical/research issues for transition to the mainstream?
PROCESS PROBLEMS:
• Getting data into tool is hell
• Scrubbing data is hell
• Then comes the easy part: mining
• Then comes the really hard part: visualization and understanding
• Most of us:
    • Can’t understand neural nets (that’s bad).
    • Can’t understand statistics (that’s a fact).
Key technical/research issues for transition to the mainstream?
Opportunities: It’s not just numbers
• Text mining
• Time series
• Domain specific
    • Web logs
    • Protein patterns
    • Spatial (e.g. geology, astronomy)
    • Image
New opportunities for KDM?
[Figure: data cube with dimensions Make (FORD, CHEVY), Year (1990, 1991, 1992, 1993), and Color (RED, WHITE, BLUE); aggregates By Year, By Make, By Color, By Make & Year, By Color & Year, By Make & Color, and Sum]
• Make data capture/scrub/import trivial
• Provide intuitive manipulation interfaces
• Provide simpler analysis concepts
    • support/confidence concept
    • precision/recall
    • ranking
    • pivot & rollup & cube
• Provide interactive visual data explorer.
• Case in point: I have yet to see a nice data cube visualizer.
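The cube on this slide can be made concrete with a small sketch. The Python below is only an illustration, not anything shown in the talk: the rows, the `units` measure, and the names `ROWS`, `DIMS`, `ALL`, and `cube` are all assumptions made up for the example. It sums one measure over every subset of the three dimensions (Make, Year, Color), which yields the same family of aggregates the figure labels (By Make, By Year, By Make & Color, Sum, and so on).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (make, year, color, units). Values are invented.
ROWS = [
    ("FORD", 1990, "RED", 5),
    ("FORD", 1990, "WHITE", 3),
    ("FORD", 1991, "BLUE", 4),
    ("CHEVY", 1990, "RED", 2),
    ("CHEVY", 1991, "WHITE", 6),
    ("CHEVY", 1992, "BLUE", 1),
]
DIMS = ("make", "year", "color")
ALL = "ALL"  # placeholder marking a dimension that has been aggregated away


def cube(rows, n_dims):
    """Sum the measure over every subset of dimensions (2**n_dims group-bys)."""
    totals = defaultdict(int)
    for row in rows:
        keys, measure = row[:n_dims], row[n_dims]
        # Each row contributes to one cell in every possible group-by.
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                cell = tuple(keys[i] if i in kept else ALL for i in range(n_dims))
                totals[cell] += measure
    return dict(totals)


result = cube(ROWS, len(DIMS))
print(result[("FORD", ALL, ALL)])    # By Make
print(result[(ALL, 1990, ALL)])      # By Year
print(result[("CHEVY", ALL, "BLUE")])  # By Make & Color
print(result[(ALL, ALL, ALL)])       # Sum
```

Each result key is a (make, year, color) tuple in which `ALL` stands for "summed over this dimension", so a visualizer would only need to pick which positions to hold fixed.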
Research challenges that will impact data mining?
• Simpler analysis concepts
• Visualization tools to navigate data
• Better algorithms = Better answers
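As one example of a "simpler analysis concept", the support/confidence idea for association rules fits in a few lines. This is a rough sketch under assumptions: the baskets and the function names `support` and `confidence` are hypothetical, chosen only to illustrate the standard definitions (support = fraction of transactions containing an itemset; confidence of A → B = support(A ∪ B) / support(A)).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    base = support(transactions, lhs)
    both = support(transactions, set(lhs) | set(rhs))
    return both / base if base else 0.0


# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(BASKETS, {"bread", "milk"})
rule_confidence = confidence(BASKETS, {"bread"}, {"milk"})
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually reported only when both numbers clear user-chosen thresholds, which is part of what makes the concept easy to explain to non-statisticians.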