
Course on Data Mining (581550-4): Seminar Meetings



Presentation Transcript


  1. Course on Data Mining (581550-4): Seminar Meetings. Schedule overview: seminar dates 02.11., 09.11., 16.11., 23.11. and 30.11.; topics Association Rules, Clustering, Episodes, KDD Process, Text Mining and Home Exam; M = seminar by Mika, P = seminar by Pirjo.

  2. Course on Data Mining (581550-4): Seminar Meetings. Today, 16.11.2001: • R. Feldman, M. Fresko, H. Hirsh, et al.: "Knowledge Management: A Text Mining Approach", Proc. of the 2nd Int'l Conf. on Practical Aspects of Knowledge Management (PAKM98), 1998. • B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1997.

  3. Course on Data Mining (581550-4): Seminar Meetings. Good to read as background: • Both papers refer to the Agrawal and Srikant paper we covered last week: Rakesh Agrawal and Ramakrishnan Srikant: "Mining Sequential Patterns", Int'l Conference on Data Engineering, 1995.

  4. Knowledge Management: A Text Mining Approach. R. Feldman, M. Fresko, H. Hirsh, et al. Bar-Ilan University and Instinct Software, Israel; Rutgers University, USA; LIA-EPFL, Switzerland. Published in PAKM'98 (Int'l Conf. on Practical Aspects of Knowledge Management). Data Mining course, Autumn 2001, University of Helsinki. Summary by Mika Klemettinen.

  5. KM: A Text Mining Approach
     • Basic idea (see selected phases on the next slides):
       1. Get the input data in SGML (or XML) format. Select only the contents of the desired elements (title, abstract, etc.).
       2. Do linguistic preprocessing:
          2.1 Term extraction (use linguistic software for this)
          2.2 Term generation (combine adjacent terms into morpho-syntactic patterns such as "noun-noun", "adj.-noun", etc. by calculating association coefficients)
          2.3 Term filtering (select only the top M most frequent terms)
       3. Create taxonomies (there is a tool for this)
       4. Generate associations (you may constrain the creation)
       5. Visualize/explore the results
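
A minimal sketch of steps 2.3 and 4 above, under stated assumptions: the slide does not say how term frequency is measured or which association algorithm the authors use, so this version ranks terms by document frequency and generates only pairwise rules with a support count and a confidence. The function names `top_terms` and `pair_associations` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_terms(docs, m):
    """Step 2.3 (simplified): keep the m terms that occur in the most documents."""
    counts = Counter(term for doc in docs for term in set(doc))
    return {term for term, _ in counts.most_common(m)}


def pair_associations(docs, keep, min_support, min_confidence):
    """Step 4 (simplified): rules lhs -> rhs between kept terms.

    Returns sorted tuples (lhs, rhs, support_count, confidence).
    """
    filtered = [set(doc) & keep for doc in docs]
    single = Counter(term for doc in filtered for term in doc)
    pairs = Counter(
        pair for doc in filtered for pair in combinations(sorted(doc), 2)
    )
    rules = []
    for (a, b), n in pairs.items():
        if n < min_support:
            continue
        # Try the rule in both directions; confidence = support(a, b) / support(lhs).
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / single[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, n, confidence))
    return sorted(rules)


# Hypothetical documents, each already reduced to a list of extracted terms.
docs = [
    ["data mining", "text mining", "knowledge management"],
    ["data mining", "text mining"],
    ["data mining", "knowledge management"],
    ["text mining", "taxonomy"],
]
keep = top_terms(docs, 3)
rules = pair_associations(docs, keep, min_support=2, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support}, confidence={confidence:.2f})")
```

Constraining the creation of associations, as the slide mentions, could be modelled here by restricting which terms are allowed as `lhs` or `rhs`; that part is left out of the sketch.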

  6. 2.1: Term Extraction

  7. 3: Taxonomy Construction

  8. 4: Association Rule Generation

  9. 4: Association Rule Generation

  10. 5.1: Visualization/Exploration

  11. 5.2: Visualization/Exploration

  12. Discovering Trends in Text Databases. Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center, USA. Published in KDD'97. Data Mining course, Autumn 2001, University of Helsinki. Summary by Mika Klemettinen.

  13. Discovering Trends in Text Databases
      • Basic ideas:
        • Identify frequent phrases using sequential pattern mining (see the slides and summaries for the Agrawal and Srikant paper "Mining Sequential Patterns" (MSP))
        • Generate histories of phrases
        • Find phrases that satisfy a specified trend
      • Definitions:
        • Phrase: a phrase p is ⟨(w1)(w2)…(wn)⟩, where each wi is a word
        • 1-phrase: ⟨⟨(IBM)⟩ ⟨(data)(mining)⟩⟩
        • 2-phrase: ⟨⟨(IBM)⟩ ⟨(data)(mining)⟩⟩ ⟨⟨(Anderson)(Consulting)⟩ ⟨(decision)(support)⟩⟩
        • Itemset, sequence, is contained, etc.: as in the MSP paper
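
The three basic ideas above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the method of the paper: contiguous word n-grams stand in for phrases found by sequential pattern mining, support is the fraction of documents in a period that contain the n-gram, and a plain "strictly rising" test stands in for a trend query. The names `phrase_histories` and `is_rising` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def phrase_histories(partitions, n=2, min_support=0.5):
    """Build per-period support histories for word n-grams.

    partitions: list of (period_label, documents); each document is a word list.
    Keeps only n-grams whose support reaches min_support in at least one period.
    """
    periods = [label for label, _ in partitions]
    histories = defaultdict(lambda: [0.0] * len(periods))
    for i, (_, docs) in enumerate(partitions):
        counts = Counter()
        for words in docs:
            # Count each n-gram at most once per document.
            grams = {tuple(words[j:j + n]) for j in range(len(words) - n + 1)}
            counts.update(grams)
        for gram, count in counts.items():
            histories[gram][i] = count / len(docs)
    frequent = {g: h for g, h in histories.items() if max(h) >= min_support}
    return periods, frequent


def is_rising(history):
    """A very simple trend: support strictly increases from period to period."""
    return all(a < b for a, b in zip(history, history[1:]))


# Hypothetical documents partitioned by year.
partitions = [
    ("1995", [["decision", "support", "system"], ["decision", "support", "tools"],
              ["data", "mining", "tools"], ["text", "retrieval"]]),
    ("1996", [["data", "mining", "methods"], ["data", "mining", "tools"],
              ["decision", "support", "system"], ["text", "retrieval"]]),
    ("1997", [["data", "mining", "methods"], ["data", "mining", "tools"],
              ["data", "mining", "system"], ["decision", "support", "system"]]),
]
periods, frequent = phrase_histories(partitions)
rising = sorted(g for g, h in frequent.items() if is_rising(h))
print(periods, rising)
```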

  14. Discovering Trends in Text Databases
      • Gaps: minimum and maximum gaps between adjacent words identify relations of words/phrases inside sentences or paragraphs, between words/phrases in different paragraphs, between words/phrases in different sections, etc.
        • Sentence boundary: 1,000
        • Paragraph boundary: 100,000
        • Section boundary: 10,000,000
      • Phases:
        • Partition the data/documents based on their time stamps and create phrases for each partition (Lent et al. use patent documents)
        • Select the frequent phrases and save their frequencies
        • Define shape queries using SDL (Shape Definition Language)
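
One way to read the boundary values above is as offsets added to a running word position, so that a gap constraint alone tells whether two words share a sentence, a paragraph or a section. The sketch below follows that reading; the exact numbering scheme is an assumption (here each boundary adds only its own offset, and sentences are assumed to be shorter than 1,000 words), and `number_words` and `within_gap` are hypothetical names.

```python
SENTENCE, PARAGRAPH, SECTION = 1_000, 100_000, 10_000_000


def number_words(sections):
    """Assign a position to every word.

    sections: list of sections -> paragraphs -> sentences -> words.
    Consecutive words differ by 1; crossing a sentence, paragraph or section
    boundary adds the corresponding offset (assumed scheme).
    """
    numbered = []
    pos = 0
    for s_i, section in enumerate(sections):
        if s_i:
            pos += SECTION
        for p_i, paragraph in enumerate(section):
            if p_i:
                pos += PARAGRAPH
            for t_i, sentence in enumerate(paragraph):
                if t_i:
                    pos += SENTENCE
                for word in sentence:
                    pos += 1
                    numbered.append((word, pos))
    return numbered


def within_gap(pos_a, pos_b, min_gap, max_gap):
    """True if the distance from pos_a to pos_b lies in [min_gap, max_gap]."""
    return min_gap <= pos_b - pos_a <= max_gap


# Hypothetical document: two sections; the first has two paragraphs.
sections = [
    [
        [["IBM", "data", "mining"], ["decision", "support"]],
        [["patent", "data"]],
    ],
    [
        [["trend", "analysis"]],
    ],
]
numbered = number_words(sections)
for word, pos in numbered:
    print(f"{pos:>10}  {word}")

mining_pos, decision_pos = numbered[2][1], numbered[3][1]
print(within_gap(mining_pos, decision_pos, 1, SENTENCE - 1))   # same sentence?
print(within_gap(mining_pos, decision_pos, 1, PARAGRAPH - 1))  # same paragraph?
```

With this numbering, a maximum gap below 1,000 restricts a phrase to a single sentence, while a maximum gap below 100,000 allows it to span sentences within one paragraph.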

  15. Discovering Trends in Text Databases

  16. Discovering Trends in Text Databases

  17. Discovering Trends in Text Databases
