
Data Mining Status and Risks

Data Mining Status and Risks. Dr. Gregory Newby, UNC-Chapel Hill, http://ils.unc.edu/gbnewby. Overview: What is data mining and related concepts? Fundamentals of the science and practice of data mining. What data sources are available? Causality and correlation. Risks of data mining.


Presentation Transcript


  1. Data Mining Status and Risks • Dr. Gregory Newby • UNC-Chapel Hill • http://ils.unc.edu/gbnewby

  2. Overview • What is data mining and related concepts? • Fundamentals of the science and practice of data mining • What data sources are available? • Causality and correlation • Risks of data mining • Future moves

  3. Data Mining • “An information extraction activity whose goal is to discover hidden facts contained in databases. …[D]ata mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.” (Via http://www.twocrows.com/glossary.htm)

  4. Data Mining • Is: Seeking new information from relations among data, possibly from different sources • Is: An important area of academic, corporate and government research • Is: Important from a security standpoint, because data mining might yield emergent information that would otherwise remain unknown

  5. The Bigger Picture • [Diagram labels: Information retrieval; Data mining; Data fusion; The Data Universe (all data, all sources)]

  6. The Data Universe • All data • All topics • All sources • Numeric, textual • Discrete, longitudinal • Lots and lots of data! • The data universe is growing constantly, and many new data sources are being created as a result of security concerns & technological progress

  7. Challenges of the Data Universe • Scale: too much data to deal with • Format: many different formats which are difficult to merge or query • Access: most data (over 90%?) are not Web-accessible, because they sit in databases, are proprietary or internal, or have formatting problems

  8. Solutions • Figure out how to get data from one format to another. Standards such as XML and EDI help • Develop cooperative relationships among data holders for data exchange. This is happening much more in government • Develop tools to identify relationships among data. This is the focus of data mining
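To illustrate the first solution on this slide (getting data from one format to another), here is a minimal sketch that reads a small XML document and re-emits it as JSON using only the Python standard library. The record layout, the field names, and the `xml_to_dicts` helper are assumptions made up for illustration; they do not come from the presentation.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML export from one data holder (layout invented for this sketch).
XML_RECORDS = """<records>
  <record id="1"><name>Alice Example</name><city>Chapel Hill</city></record>
  <record id="2"><name>Bob Example</name><city>Fairbanks</city></record>
</records>"""


def xml_to_dicts(xml_text):
    """Turn each <record> element into a flat dict of its id and child fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.findall("record"):
        row = {"id": rec.get("id")}
        for child in rec:
            row[child.tag] = child.text
        rows.append(row)
    return rows


rows = xml_to_dicts(XML_RECORDS)
# JSON is used here only as an example target format.
as_json = json.dumps(rows, indent=2)
print(as_json)
```

Once two sources have been brought into a common structure like this, merging or querying them becomes an ordinary programming task rather than a format problem.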

  9. Data Mining != Web Searching • On the Web, we’re doing high-precision information retrieval • We want the first ranked documents to be relevant • We don’t want to see irrelevant documents • The data universe for Web search engines is vast, so there are usually enough relevant documents to fill the top ranks, making this a relatively straightforward problem (though a big engineering challenge!)

  10. Data Mining != Web Searching • Data mining is all about recall, not precision • Recall means finding all of the relevant documents, regardless of how many irrelevant documents come back with them • This is a tougher problem, since the set of responses to a given inquiry can be huge • It’s also tougher because of data formats, data merging, access, etc. • The data miner’s goal is to set a threshold over which relationships are “interesting” • Data miners can also search for particular patterns, e.g. those related to an individual or group
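The precision/recall contrast in slides 9 and 10 can be made concrete with a small calculation. The sketch below uses the standard definitions (precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant). The document IDs and the two result lists are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}

# Web-search style: a short list in which every hit is relevant.
web_style = ["d1", "d2"]
# Data-mining style: a long list that catches everything, plus noise.
mining_style = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

web_p, web_r = precision_recall(web_style, relevant)
mining_p, mining_r = precision_recall(mining_style, relevant)
print(f"web search:  precision={web_p:.2f} recall={web_r:.2f}")
print(f"data mining: precision={mining_p:.2f} recall={mining_r:.2f}")
```

In this toy case the short list has perfect precision but misses half of the relevant documents, while the long list finds all of them at the cost of half its results being irrelevant.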

  11. Today • Law enforcement, industry and government are making their data sources more open to each other (these data sources are not generally publicly available) • Data integrity issues are a major concern • Data mining is still tough: “false positive” relationships are easy to find • Correlation vs. causality • Seek and ye shall find • Lots of data yields lots of matches
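The point that "lots of data yields lots of matches" can be sketched with purely random data. The example below generates columns of unrelated random numbers, computes the Pearson correlation for every pair, and counts how many pairs exceed an "interesting" threshold. The column count, sample size, threshold, and seed are arbitrary assumptions chosen for illustration, not values from the presentation.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
# 40 columns of 20 values each, all independent noise (no real relationships).
columns = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(40)]

THRESHOLD = 0.4  # assumed cutoff for calling a relationship "interesting"
pairs = list(itertools.combinations(range(len(columns)), 2))
flagged = [
    (i, j) for i, j in pairs
    if abs(pearson(columns[i], columns[j])) > THRESHOLD
]
print(f"{len(flagged)} of {len(pairs)} unrelated pairs look 'interesting'")
```

Every pair flagged here is a false positive by construction, since none of the columns are related. Adding more columns increases the number of pairs, and with it the number of spurious matches.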

  12. Today’s Data Sources • Credit and other financials • Law enforcement records • Travel history • Health data • Whatever you put on the Internet • If you are targeted: • Wiretap data (‘net, phone, etc.) • Surveillance data • HUMINT, etc., etc.

  13. Tomorrow • Decreased barriers among different data sources (this is a main impact of PATRIOT, but more is coming) • Increased data collection (via PATRIOT plus technological trends) • Better tools for data mining, and new technologies making data sharing and integration easier

  14. Contact Info • Greg Newby is moving from UNC to UAF • New position: Research Faculty at the Arctic Region Supercomputing Center, University of Alaska, Fairbanks • newby@arsc.edu
