
COMP3410 DB32: Technologies for Knowledge Management

COMP3410 DB32: Technologies for Knowledge Management. 10 : Introduction to Knowledge Discovery By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts,


Presentation Transcript


  1. COMP3410 DB32: Technologies for Knowledge Management
     10: Introduction to Knowledge Discovery
     By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

  2. What has Machine Learning got to do with Computing / Information Systems?
     “Most international organizations produce more information in a week than many people could read in a lifetime” (Adriaans and Zantinge)

  3. Objectives of knowledge discovery or data mining
     Data mining is about discovering patterns in data. For this we need:
     - KD/DM techniques, algorithms and tools, e.g. BootCat, WEKA
     - A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM

  4. Data Mining, Knowledge Discovery, Text Mining
     - Data Mining was originally about “learning” patterns from databases: data structured as records and fields.
     - Knowledge Discovery is an “exotic term” for DM???
     - Increasingly, data is unstructured text (WWW), so Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data.

  5. define: data mining
     Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
     Source: en.wikipedia.org/wiki/Data_mining

  6. define: text mining
     Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. ...
     Source: en.wikipedia.org/wiki/Text_mining

  7. define: knowledge discovery
     - Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test.
       Source: www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html
     - Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
       Source: en.wikipedia.org/wiki/Knowledge_discovery

  8. Data Mining: Overview
     (Diagram: Concepts; Instances or examples; Attributes; Data Mining; Concept Descriptions)
     Each instance is an example of the concept to be learned or described. The instance may be described by the values of its attributes.

  9. Instances
     - Input to a data mining algorithm is in the form of a set of examples, or instances.
     - Each instance is represented as a set of features or attributes.
     - Usually in DB data mining this set takes the form of a flat file: each instance is a record in the file, and each attribute is a field in the record.
     - In text mining, an instance is a word or term in a corpus.
     - The concepts to be learned are formed from patterns discovered within the set of instances.

  10. Concepts
      The types of concepts we try to ‘learn’ include:
      - Key “differences”: terms specific to our domain corpus.
      - Clusters or ‘natural’ partitions, e.g. we might cluster customers according to their shopping habits.
      - Rules for classifying examples into pre-defined classes, e.g. “Mature students studying information systems with high grade for General Studies A level are likely to get a 1st class degree”.
      - General associations, e.g. “People who buy nappies are in general likely also to buy beer”.
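The "nappies and beer" association can be made concrete with a small sketch. The slide names no measure, so the two used here (support and confidence, common in association-rule mining) are an assumption, and the baskets are invented purely for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    both = support(baskets, set(antecedent) | set(consequent))
    ante = support(baskets, antecedent)
    return both / ante if ante else 0.0


# Hypothetical shopping baskets, NOT real data.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
]

print(support(baskets, {"nappies", "beer"}))       # 0.5
print(confidence(baskets, {"nappies"}, {"beer"}))  # 2 of the 3 nappy baskets
```

A rule such as "nappies => beer" is usually reported only when both numbers exceed thresholds chosen by the analyst.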

  11. More concepts
      The types of concepts we try to ‘learn’ include:
      - Numerical prediction, e.g. look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result. This may give us an equation:
            Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
        (but are Gender, Programme really numbers???)
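One common answer to "are Gender, Programme really numbers?" is to replace each categorical attribute with 0/1 indicator values (one-hot encoding) before fitting the equation. The sketch below is an illustration under assumptions: the field names, category lists and example record are hypothetical, and no coefficients are fitted.

```python
def encode(record, categories):
    """Turn one record into a list of numbers.

    Numeric fields pass through unchanged; each categorical field becomes
    one 0/1 indicator per listed category (one-hot encoding).
    """
    features = []
    for field, value in record.items():
        if field in categories:
            features.extend(1.0 if value == c else 0.0 for c in categories[field])
        else:
            features.append(float(value))
    return features


def predict(features, weights, intercept=0.0):
    """Evaluate the linear equation: intercept + sum of weight * feature."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical category lists and record, for illustration only.
CATEGORIES = {"gender": ["F", "M"], "programme": ["IS", "CS", "AI"]}
graduate = {"a_level_points": 24, "age": 22, "gender": "F", "programme": "IS"}

print(encode(graduate, CATEGORIES))  # [24.0, 22.0, 1.0, 0.0, 1.0, 0.0, 0.0]
```

With this encoding the single coefficient c*Gender becomes one coefficient per category, so the equation no longer pretends that the categories lie on a numeric scale.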

  12. DB Example: weather to play?

  13. /usr/local/weka-3-4-5/data/weather.arff
      @relation weather
      @attribute outlook {sunny,overcast,rainy}
      @attribute temperature real
      @attribute humidity real
      @attribute windy {TRUE, FALSE}
      @attribute play {yes, no}
      @data
      sunny,85,85,FALSE,no
      sunny,80,90,TRUE,no
      overcast,83,86,FALSE,yes
      rainy,70,96,FALSE,yes
      rainy,68,80,FALSE,yes
      rainy,65,70,TRUE,no
      overcast,64,65,TRUE,yes
      sunny,72,95,FALSE,no
      sunny,69,70,FALSE,yes
      rainy,75,80,FALSE,yes
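To see how the ARFF file maps onto "instances" and "attributes", here is a minimal sketch that reads the rows shown on the slide into one dictionary per instance. It is not WEKA's own parser: it assumes the simple subset of ARFF used here (no quoted values, no missing values, no sparse rows), and the function name `parse_arff` is made up for this example.

```python
ARFF_TEXT = """\
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
"""


def parse_arff(text):
    """Return (attribute_names, instances); each instance is a dict."""
    names, numeric, instances = [], set(), []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            names.append(name)
            if kind.strip().lower() in ("real", "numeric", "integer"):
                numeric.add(name)
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            instances.append(
                {n: float(v) if n in numeric else v for n, v in zip(names, values)}
            )
    return names, instances


attribute_names, instances = parse_arff(ARFF_TEXT)
print(attribute_names)
print(len(instances), instances[0])
```

Each dictionary is one record of the flat file; its keys are the fields named by the @attribute lines.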

  14. Text mining example: discovering terms in a domain, using WWW-BootCat
      “First catch your rabbit” (Mrs Beeton’s cookbook)
      - Other tools are possible, but WWW-BootCat *should* be easier to use …
      - First: sign up for Domain, SketchEngine account, Google key; download seeds-en from http://corpus.leeds.ac.uk/internet.html (see coursework spec for URLs).

  15. First collect your corpus
      Advanced Search option with parameter settings:
      - Seed words: Serge Sharoff's seeds-en (http://corpus.leeds.ac.uk/internet/seeds-en), a list of typical medium-frequency English words
      - Google key: set to the key which I set up beforehand at https://www.google.com/accounts/NewAccount
      - Language: set to English
      - Select URLs: ticked, so I can cut-and-paste the list of URLs to a text file (TO HAND IN WITH CW)
      - Corpus name: set to EnglishUK (in my case), or English?? (change ?? to your Domain)
      - Email address: set to USERNAME@comp.leeds.ac.uk
      - Query Extension: set to site:.uk (in my case), or site:.?? (change ?? to your Domain)
      - Other Advanced Options: left at default values...???
      Then click on Build a corpus!, follow the instructions as they appear, and (after some wait) download the corpus in raw and vertical formats (either direct from the URL, or wait for the email that tells you the URL…).

  16. Problems?
      - WWW-BootCat: log in; under Advanced options, upload seeds-en, tick Select URLs, set site:.??; then Build Corpus.
      - If it crashes (bad HTML in a website?), try again.
      - Download your corpus, because there is a 500,000-word quota: room for 2 corpora (only), so you can only compare 2 at a time in WWW-BootCat.
      - Or compare on your Linux account: /home/www/db32/cw/EnglishUS, EnglishUK

  17. Comparing text corpora
      Aim: to find terms in C1 not in C2, and terms in C2 not in C1.
      Sort C1 and C2 in vertical format (1 word per line) to give C1termlist and C2termlist, then compare them:
          sort C1 > C1termlist; sort C2 > C2termlist
          diff C1termlist C2termlist
      BUT this shows LOTS of differences, many of them “not significant”: terms with only 1 example (hapax legomena).
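The same aim (terms in C1 not in C2, and vice versa) can be sketched with set difference. This is an alternative to the sort/diff commands, not a re-implementation of them: it works on distinct terms, so repeated words do not produce extra differences. The two tiny corpora are made up to show the shape of the result.

```python
def term_set(vertical_text):
    """Distinct non-empty lines of a vertical-format corpus (one word per line)."""
    return {line.strip() for line in vertical_text.splitlines() if line.strip()}


def unique_terms(c1_text, c2_text):
    """Return (terms only in C1, terms only in C2), each sorted alphabetically."""
    t1, t2 = term_set(c1_text), term_set(c2_text)
    return sorted(t1 - t2), sorted(t2 - t1)


# Tiny invented corpora, for illustration only.
c1 = "the\ncolour\nof\nthe\nlorry\n"
c2 = "the\ncolor\nof\nthe\ntruck\n"

only_c1, only_c2 = unique_terms(c1, c2)
print(only_c1)  # ['colour', 'lorry']
print(only_c2)  # ['color', 'truck']
```

On a real corpus both lists would still be dominated by words that occur once, which is the problem the slide points out.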

  18. Comparing “significant” terms
      Better: to find “significant” terms in C1 not in C2:
          sort C1 | uniq -c | sort -n -r > C1termlist
      This gives terms with frequencies, most common first. The lists can be compared “OLAP-style”: you can spot high-frequency words in one list but not the other. No need for further processing?
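A rough Python counterpart of the `sort | uniq -c | sort -n -r` pipeline, plus a helper that does the "spot high-frequency words in one list but not the other" step. The helper name and its `min_count` threshold are assumptions for this sketch, and the corpora are invented. Terms with equal counts may come out in a different order than the shell pipeline gives.

```python
from collections import Counter


def term_frequencies(vertical_text):
    """List of (term, count) pairs, most frequent first."""
    counts = Counter(
        line.strip() for line in vertical_text.splitlines() if line.strip()
    )
    return counts.most_common()


def frequent_only_in_first(freq1, freq2, min_count=2):
    """Terms seen at least `min_count` times in list 1 and never in list 2."""
    seen_in_second = {term for term, _ in freq2}
    return [(t, n) for t, n in freq1 if n >= min_count and t not in seen_in_second]


# Tiny invented corpora, for illustration only.
c1 = "the\nlorry\nthe\nlorry\nqueue\n"
c2 = "the\ntruck\nthe\nline\n"

f1, f2 = term_frequencies(c1), term_frequencies(c2)
print(f1)                              # [('the', 2), ('lorry', 2), ('queue', 1)]
print(frequent_only_in_first(f1, f2))  # [('lorry', 2)]
```

Raising `min_count` filters out the hapax legomena that swamp a plain diff.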

  19. Comparing word-frequencies
      - BootCat (and others, e.g. Paul Rayson) offer tools to compare frequencies of words, to find words used MUCH MORE in one corpus than another.
      - Several different metrics are available, e.g. “mutual information”, “normalised frequency difference”, …
      - Not necessary for DB32 coursework (probably) … BUT I will be impressed if you do use these advanced metrics!
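The slides do not define the metrics, so the following is only a guess at the simplest kind of frequency comparison: convert each count to occurrences per million words (so corpora of different sizes are comparable) and rank terms by the difference. It is not claimed to match the formula BootCat or any other tool uses, and the word lists are invented.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Occurrences per million words."""
    return 1_000_000 * count / corpus_size


def frequency_differences(words1, words2):
    """Rank terms by per-million frequency in corpus 1 minus corpus 2."""
    if not words1 or not words2:
        raise ValueError("both corpora must contain at least one word")
    c1, c2 = Counter(words1), Counter(words2)
    n1, n2 = len(words1), len(words2)
    diffs = {
        term: per_million(c1[term], n1) - per_million(c2[term], n2)
        for term in set(c1) | set(c2)
    }
    return sorted(diffs.items(), key=lambda pair: pair[1], reverse=True)


# Tiny invented corpora, for illustration only.
words1 = "the lorry the lorry queue".split()
words2 = "the truck the line".split()

ranked = frequency_differences(words1, words2)
print(ranked[0])   # term used much more in corpus 1
print(ranked[-1])  # term used much more in corpus 2
```

Terms at the top of the ranking are over-used in the first corpus, terms at the bottom in the second; a word like "the" that is frequent in both ends up near the middle.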

  20. Knowledge Discovery: Key points
      - Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.
      - Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules) …
      - … and in terms of the output they provide (e.g. clustering algorithms provide a set of subclasses).
      - Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered?

  21. Self-test
      You should be able to:
      - Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.
      - Decide which is the most appropriate form of output.
