
Special Topics in Database Systems

Martin Ester, Simon Fraser University, School of Computing Science. CMPT 884, Spring 2009.


Presentation Transcript


  1. Special Topics in Database Systems
     Martin Ester, Simon Fraser University, School of Computing Science
     CMPT 884, Spring 2009
     (Slide footer: CMPT 884, SFU, Martin Ester, 1-09)

  2. Introduction [Fayyad, Piatetsky-Shapiro & Smyth 96]
     • Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
       o valid
       o previously unknown
       o and potentially useful.
     • Remarks
       o (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
       o valid: in the statistical sense.
       o previously unknown: not explicit, no "common sense knowledge".
       o potentially useful: for some given application.

  3. Introduction
     • Statistics [Hand, Mannila & Smyth 2001]
       o representation of uncertainty
       o model-based inferences
       o focus on numeric data
     • Machine Learning [Mitchell 1997]
       o knowledge representation
       o search strategies
       o focus on symbolic data
     • Database Systems [Han & Kamber 2000]
       o data management
       o integration of data mining with DBS
       o scalability for large databases

  4. Introduction
     • KDD Process [Han & Kamber 2000]
       [Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
     • KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
       [Figure: Database → Focussing → Pre-processing → Transformation → Data Mining → Pattern → Evaluation → Knowledge]

  5. Data Mining
     • Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
       o Data Mining is the application of efficient algorithms to determine the patterns contained in some database.
     • Data-Mining Tasks
       o clustering
       o classification
       o association rules (e.g., A and B → C)
       o generalisation
       o other tasks: regression, outlier detection, . . .
     [Figure: small illustrations of the tasks, including point sets labelled a and b.]
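As an illustration of the association-rule task on slide 5, the following minimal Python sketch scores a rule of the form A and B → C on a toy transaction set. Support and confidence are the usual textbook measures for such rules; they are not named on the slide, and the function names and data are assumptions made for this example only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    s_antecedent = support(transactions, antecedent)
    if s_antecedent == 0:
        return 0.0  # the rule never fires, so no evidence for it
    s_both = support(transactions, set(antecedent) | set(consequent))
    return s_both / s_antecedent


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
]

rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

Here the rule holds in two of the three transactions that contain both A and B, so a miner with a confidence threshold above 2/3 would discard it.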

  6. Trends in KDD Research
     • KDD 2000 Conference
       o New Data Mining Algorithms
       o Efficiency and Scalability of Data Mining Algorithms
       o Interactive Data Exploration
       o Visualization
       o Constraints and Evaluation in the KDD Process

  7. Trends in KDD Research
     • KDD 2002 Conference
       o Statistical Methods
       o Frequent Patterns
       o Streams and Time Series
       o Visualization
       o Web Search and Navigation
       o Text and Web Page Classification
       o Intrusion and Privacy
       o Applications

  8. Trends in KDD Research
     • KDD 2004 Conference
       o Frequent Patterns / Association Rules
       o Clustering
       o Mining Spatio-Temporal Data
       o Mining Data Streams
       o Dimensionality Reduction
       o Privacy-Preserving Data Mining
       o Mining Biological Data
       o Applications (Web, biological data, security, . . .)

  9. Trends in KDD Research
     • KDD 2006 Conference
       o Clustering
       o Classification / supervised ML
       o Privacy
       o Web / Graph Mining
       o Web / Text Mining
       o Frequent Pattern Mining
       o Structured Data

  10. Trends in KDD Research
     • KDD 2008 Conference
       o Text Mining
       o Data Integration
       o Social Networks
       o Graph Mining
       o Distance Functions and Metric Learning
       o Active and Semi-supervised Learning
       o Pattern Mining
       o Collaborative Filtering

  11. Trends in KDD Research
     • Some Hot Topics
       o Social networks: THE hot topic of KDD 08 and the topic of its only panel
       o Graph mining
       o Text mining and information extraction / integration
       o Collaborative filtering and, more generally, recommender systems ($1M Netflix Prize)

  12. Overview of this Course
     • Prerequisites
       o Foundations of database systems and statistics
       o Introductory graduate data mining course or equivalent
     • Objectives
       o Introduction to some hot topics of data mining research
       o Training in research methodology
       o Presentation skills
       o Start thesis work after this class!

  13. Overview of this Course
     • Topics
       o Graph mining: social network analysis and analysis of biological networks as driving applications
       o Recommender systems: in particular, trust-based recommendation
       o Information extraction and integration: integration with existing databases

  14. Overview of this Course
     • Format
       o Tutorial surveys by instructor
       o Written research paper reviews by students
       o Research paper presentations by students, with discussions in class
       o Course research projects by students on a topic of their choice

  15. Overview of this Course
     • Tentative Grading Scheme
       o Paper review (20%)
       o Paper presentation (20%)
       o Course project report (40%), in two steps: project proposal, final project report
       o Course project presentation (20%)
       → Marking criteria: originality, technical quality, presentation

  16. Overview of this Course
     • Types of Course Projects
       o Literature survey: summarize the state of the art and identify open research problems
       o New problem: introduce and analyze a new problem
       o New algorithm for a known problem: implement and evaluate the algorithm
       o Improvement of an existing algorithm: implement and compare the algorithm
       o Comparison of existing algorithms on a new, interesting dataset: identify criteria for the choice of algorithms / open research problems

  17. Graph Mining
     • Motivating Applications
       o Social network analysis
         - What communities exist?
         - How does information about a new product spread?
         - What customers should be targeted to maximize the profit of a marketing campaign?
       o Analysis of biological networks
         - What are the functional modules of an organism?
         - How do biological networks evolve in the course of time?
         - What protein should be targeted to inhibit some virulent bacteria?

  18. Graph Mining
     • Methods
       o Frequent subgraph mining: frequent pattern mining approach
       o Graph clustering: e.g., normalized cut, i.e. minimize the number of edges between graph components / clusters
       o Graph generative models: probabilistic models that generate graphs similar to real graphs / networks
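To make the normalized-cut criterion on slide 18 concrete, here is a minimal Python sketch that evaluates a given two-way partition of an undirected, unweighted graph. It uses the standard definition Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where vol is the sum of node degrees on each side. The slide does not spell this formula out, so the definition, the function name and the toy graph are assumptions of this example. The sketch only scores a partition; it does not search for the best one.

```python
def normalized_cut(edges, part_a):
    """Normalized cut of the partition (part_a, rest) of an undirected graph.

    edges:  iterable of (u, v) pairs, each undirected edge listed once
    part_a: nodes on one side; all other nodes form the other side
    """
    part_a = set(part_a)
    cut = 0    # number of edges crossing the partition
    vol_a = 0  # sum of degrees of the nodes in part_a
    vol_b = 0  # sum of degrees of the remaining nodes
    for u, v in edges:
        for node in (u, v):
            if node in part_a:
                vol_a += 1
            else:
                vol_b += 1
        if (u in part_a) != (v in part_a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        raise ValueError("both sides of the partition need at least one edge endpoint")
    return cut / vol_a + cut / vol_b


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

balanced = normalized_cut(edges, {1, 2, 3})  # cuts only the bridge edge
lopsided = normalized_cut(edges, {1, 2})     # cuts two edges of one triangle
print(balanced, lopsided)
```

Dividing by the volumes is what keeps the criterion from favouring tiny clusters: splitting off a single low-degree node cuts few edges but yields a small volume and hence a large score.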

  19. Graph Mining
     • Challenges
       o Complexity of graph algorithms
         - Many graph mining problems are NP-hard.
         - Real graphs tend to be extremely large.
         → need efficient algorithms
       o Attribute data
         - Many graphs have attributes associated with the nodes.
         - Transformation into a weighted graph loses a lot of information.
         → need new models / algorithms that consider both relationship and attribute data

  20. Recommender Systems
     • Motivating Applications
       o Motivation
         - The internet provides a flood of information on all kinds of items.
         - There is a great need for personalized recommendations.
         - The internet also provides a wealth of item ratings / reviews.
       o Typical applications
         - Movie recommendation
         - Product recommendation
         - Keyword recommendation

  21. Recommender Systems
     • Methods
       o Collaborative filtering
         - Uses only a database of user-item ratings.
         - Recommendation based on ratings by users with similar rating patterns.
       o Content-based recommender systems
         - Use information about the content of items and / or the properties of users.
         - Recommend items that have content similar to items liked by the user.
       o Trust-based recommender systems
         - Assume a social network / trust network. Trust can be defined explicitly or implicitly.
         - Recommendation based on ratings by trusted neighbors.
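The difference between collaborative filtering and trust-based recommendation on slide 21 can be sketched with one prediction routine and two weighting schemes. This is a minimal illustration under assumptions not found in the slides: ratings are stored as nested dictionaries, rating-pattern similarity is measured by cosine similarity over co-rated items, and the prediction is a weighted average of other users' ratings. All names and data are hypothetical.

```python
from math import sqrt


def cosine_similarity(r_u, r_v):
    """Cosine similarity of two users' ratings over their co-rated items."""
    common = set(r_u) & set(r_v)
    if not common:
        return 0.0
    dot = sum(r_u[i] * r_v[i] for i in common)
    norm_u = sqrt(sum(r_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(r_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item, weight):
    """Weighted average of other users' ratings for `item`.

    weight(user, other) returns a non-negative weight: a similarity for
    collaborative filtering, a trust value for trust-based recommendation.
    Returns None if no user with positive weight has rated the item.
    """
    numerator = denominator = 0.0
    for other, r_other in ratings.items():
        if other == user or item not in r_other:
            continue
        w = weight(user, other)
        if w <= 0:
            continue
        numerator += w * r_other[item]
        denominator += w
    return numerator / denominator if denominator else None


# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"m1": 5, "m2": 3},
    "bob": {"m1": 5, "m2": 3, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}
# Hypothetical explicit trust network: alice trusts only bob.
trust = {"alice": {"bob": 1.0}}

cf_prediction = predict(
    ratings, "alice", "m3",
    lambda u, v: cosine_similarity(ratings[u], ratings[v]),
)
trust_prediction = predict(
    ratings, "alice", "m3",
    lambda u, v: trust.get(u, {}).get(v, 0.0),
)
print(cf_prediction, trust_prediction)
```

In the similarity-based call every user who rated the item contributes, so a dissimilar user such as carol still pulls the estimate down. In the trust-based call only ratings by trusted neighbors count.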

  22. Recommender Systems
     • Challenges
       o High dimensionality and sparsity of data
         - The overwhelming majority (> 99%) of user-item ratings are unknown.
         - Recommendation is especially hard for cold-start users and controversial items.
         → dimensionality reduction, model-based methods, trust-based approach
       o Fraud
         - Memory-based collaborative filtering can be easily manipulated by adding fraudulent ratings.
         → trust-based approach is more robust to fraud
       o Privacy issues with trust network data
         - Only very few trust networks are in the public domain.

  23. Information Extraction and Integration
     • Motivating Applications
       o Importance of unstructured text data
         - The overwhelming majority (>= 80%) of human-generated information is not in structured form, but in unstructured text.
       o Biomedical literature
         - Contains a wealth of valuable information that cannot be processed / searched automatically.
         - Extraction of entities and relationships such as proteins and their localizations.
       o Online product reviews
         - A lot of product "reviews" are available online in community databases or blogs.
         - Companies want to know what customers think of their products.

  24. Information Extraction and Integration
     • Methods
       o Basic NLP methods
         - Part-of-speech tagging
         - Lexica, ontologies, . . .
       o Machine learning methods
         - Typically, supervised classification.
         - CRFs and similar methods are state of the art.
       o Bootstrapping approach
         - Using a small labeled training dataset, find textual extraction patterns.
         - Using these patterns, extract further entities / relationships and continue.
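The bootstrapping loop on slide 24 can be sketched in a few lines of Python. This is a deliberately naive illustration, not a method from the cited tutorials: a "pattern" is simply the literal text found between the two entities of a known pair, entities are single word tokens, and the sentences and seed pair are invented for the example.

```python
import re


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns and extracting new pairs.

    seeds:   set of known (entity, value) pairs, the small labeled dataset
    returns: (all known pairs, all learned patterns)
    """
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: learn the text between the entities of each known pair.
        for sentence in sentences:
            for x, y in known:
                match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if match:
                    patterns.add(match.group(1))
        # Step 2: apply every pattern to extract further pairs.
        for sentence in sentences:
            for pattern in patterns:
                match = re.search(r"(\w+)" + re.escape(pattern) + r"(\w+)", sentence)
                if match:
                    known.add((match.group(1), match.group(2)))
    return known, patterns


# Hypothetical toy corpus of protein localization statements.
sentences = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the membrane.",
    "ProtB is found in the membrane.",
    "ProtC is found in the cytoplasm.",
]
seeds = {("ProtA", "nucleus")}

known, patterns = bootstrap(sentences, seeds)
print(sorted(known))
print(sorted(patterns))
```

The first round learns one pattern from the seed and uses it to find the ProtB pair. The second round learns a new pattern from that pair and reaches ProtC, which no seed-derived pattern could extract directly. Real systems also score patterns to limit the error propagation that this sketch ignores.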

  25. Information Extraction and Integration
     • Challenges
       o Text data is hard to understand
         - Many of the NLP problems are still essentially unsolved.
         → relatively simple NLP methods are often sufficient for information extraction
       o Portability across domains
         - Extraction methods need to be portable from one domain to another.
         - Knowledge engineering approach (domain expert defines rules) is labor-intensive and expensive.
         → machine learning methods
       o Entity mentions need to be resolved
         - Information extraction produces strings referencing an entity of a given type.
         - Without mapping to known real-world entities, extracted information is of limited usefulness.
         → need to integrate extracted information with existing databases
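The last challenge on slide 25, mapping extracted strings to known real-world entities, can be illustrated with approximate string matching from Python's standard library. This is only a sketch under strong assumptions: the "existing database" is a dictionary from canonical names to hypothetical identifiers, and string similarity alone decides the match. Practical entity resolution typically also uses synonym lists and context.

```python
import difflib


def resolve(mention, known_entities, cutoff=0.8):
    """Map an extracted mention to the id of the closest known entity.

    known_entities: dict of canonical name -> entity id (hypothetical schema)
    cutoff:         minimum similarity ratio in [0, 1] to accept a match
    returns:        the entity id, or None if nothing is similar enough
    """
    by_normalized_name = {name.lower(): eid for name, eid in known_entities.items()}
    candidates = difflib.get_close_matches(
        mention.lower().strip(), list(by_normalized_name), n=1, cutoff=cutoff
    )
    return by_normalized_name[candidates[0]] if candidates else None


# Hypothetical entity table standing in for an existing database.
known_entities = {"Cytochrome c": "P1", "Hemoglobin": "P2"}

for mention in ["cytochrome-c", "Haemoglobin", "insulin"]:
    print(mention, "->", resolve(mention, known_entities))
```

Mentions that resolve to `None` are the interesting cases for integration: they are either new entities or variants the matcher cannot recognize, and the cutoff trades one kind of error against the other.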

  26. References
     • Graph mining
       - X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 08
       - Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007
     • Recommender systems
       - Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008
     • Information extraction and integration
       - Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 06
       - AnHai Doan, Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006
