Text Mining Overview

Piotr Gawrysiak, gawrysia@ii.pw.edu.pl
Warsaw University of Technology, Data Mining Group
22 November 2001

WUT DMG NOV 2001
Topics
• Natural Language Processing
• Text Mining vs. Data Mining
• The toolbox
  • Language processing methods
  • Single document processing
  • Document corpora processing
• Document categorization – a closer look
• Applications
  • Classic
  • Profiled document delivery
• Related areas
  • Web Content Mining & Web Farming

Natural Language Processing
• Natural language – test for Artificial Intelligence
  • Alan Turing
  • NLP and NLU
    • Natural language processing (NLP): anything that deals with text content
    • Natural language understanding (NLU): semantics and logic
• Linguistics – exploring mysteries of a language
  • William Jones
  • Comparative linguistics – Jakob Grimm, Rasmus Rask
  • Noam Chomsky
    • I-Language and E-Language
    • poverty of stimulus
• Statistical approaches – Markov and Shannon

Information explosion
• Increasing popularity of the Internet as a publishing medium
• Electronic media’s minimal duplication costs
• Primitive information retrieval and data management tools

Data Mining
Data Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. – Piatetsky-Shapiro
• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization

Knowledge pyramid
[Figure: knowledge pyramid with the levels Wisdom, Knowledge, Information, Data, Signals; axes: semantic level, resources occupied; the data mining area is marked on the pyramid]

Text Mining – a definition
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics

Text Mining tools
• Language tools
  • Linguistic analysis
  • Thesauri, dictionaries, grammar analysers etc.
  • Machine translation
• Single document tools
  • Automatic feature extraction
  • Automatic summarization
• Multiple document tools
  • Document categorization
  • Document clustering
  • Information retrieval
  • Visualization methods

Language analysis
• Syntactic analysers construction
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation
This is not that simple – consider for example:
  This is a delicious butter - noun
  You should butter your toast - verb
Rule based systems or self-learning classification systems (using VMM and HMM)

Thesaurus construction
• Thesaurus (semantic network) stores information about relationships between terms
  • Ascriptor-Descriptor relations
  • "Broader term" – "Narrower term" relations
  • "Related term" relations
Construction can be manual (but this is a laborious process) or automatic.
[Figure: thesaurus fragment with AD, BT and RT links between the terms Telephone, Cell phone, Fax machine, Electronic mail, Telecommunications, Data transmission network, Post and telecom]
Example sentences:
  The U.S.S. Nashville arrived in Colon harbour with 42 marines
  With the warship in Colon harbour, the Colombian troops withdrew

Machine translation
Problems
• Different vocabularies
• Different grammars and inflection rules
• Even different character sets
Example (source: Polish, target: English), translating "W łóżku jest szybka" at successive levels:
• Word level: In bed is window-pane
• Syntactic level: She is a window-pane in bed
• Semantic level: She is quick in bed
• Knowledge representation (via a formal knowledge representation language): She is quick in bed

Fully automatic approach
Based on learning word usage patterns from large corpora of translated documents (bitext)
• Problems
  • Relatively few bitexts exist so far
  • Sentences must be aligned prior to learning
    • Keyword matching
    • Sentence length based alignment
  • Parameterisation is necessary
    Książka okazała się <adjective> → The book turned out to be <adjective>

Feature extraction
Not all words are equally important
• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers
• Discovering important terms
• Finding lexical affinities
• Gap variance measurement
• Dictionary-based methods
• Grammar based heuristics
[Figure: example term variants: Data bases, Databases; Micro$oft, MineIT, Microsoft; Knowledge discovery in databases, Knowledge discovery in large databases, Knowledge discovery in big databases]

Document summarization
[Figure: types of summaries: Abstracts, Extracts, Indicative summaries, Informative summaries]
• Summary creation methods:
  • statistical analysis of sentence and word frequency + dictionary analysis (e.g. words such as "abstract", "conclusion" etc.)
  • text representation methods – grammatical analysis of sentences
  • document structure analysis (question-answer patterns, formatting, vocabulary shifts etc.)

Document categorization & clustering
Clustering – dividing a set of documents into groups
Categorization – grouping based on a predefined category scheme
Typical categorization scenario
Step 1: Create training hierarchy
Step 2: Perform training
Step 3: Actual classification
[Figure: a repository with Class 1 and Class 2; class fingerprints are used to categorize an unknown document]
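The three-step scenario can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: it assumes that a class "fingerprint" is the summed bag-of-words vector of the class's training documents and that an unknown document goes to the class with the most similar fingerprint (cosine similarity). The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Convert a document into a unigram count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(training_set):
    """Step 2: build one fingerprint per class (assumed: summed word counts)."""
    fingerprints = {}
    for label, docs in training_set.items():
        fingerprint = Counter()
        for doc in docs:
            fingerprint.update(bag_of_words(doc))
        fingerprints[label] = fingerprint
    return fingerprints

def classify(fingerprints, document):
    """Step 3: pick the class whose fingerprint is closest to the document."""
    vec = bag_of_words(document)
    return max(fingerprints, key=lambda label: cosine(fingerprints[label], vec))

# Step 1: a toy training hierarchy with two classes (made-up data).
training_set = {
    "class 1": ["stock market prices fall", "market shares and stock trading"],
    "class 2": ["football match result", "the team won the football cup"],
}
fingerprints = train(training_set)
predicted = classify(fingerprints, "stock prices rise on the market")
print(predicted)
```

Any of the classic algorithms (kNN, decision trees, Bayes) could replace the nearest-fingerprint rule without changing the overall train/classify split.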

Categorization/clustering system
[Figure: system diagram with the elements Documents, Representation conversion, Representation processing, Deriving metrics, Classic DM algorithm]
• Clustering – k-means, agglomerative, ...
• Categorization – kNN, DT, Bayes, ...

Information retrieval
• Two types of search methods
  • exact match – in most cases uses some simple Boolean query specification language
  • fuzzy – uses statistical methods to estimate relevance of the document
Modern IR tools seem to be very effective...
- 1999 data: Scooter (AltaVista crawler) – 1.5GB RAM, 30GB disk, 4x533 MHz Alpha, 1GB/s I/O; about 1 month needed to recrawl
- 2000 data: only 40-50% of the Web indexed at all

IR – exact match
Most popular method – inverted files
• Very fast
• Boolean queries very easy to process
• Very simple
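A minimal sketch of an inverted file, to show why Boolean queries are easy to process: each word maps to the set of documents containing it, so AND, OR and NOT become set intersection, union and difference. The function names and the three toy documents are assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map every word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def query_and(index, *terms):
    """Documents containing all of the terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def query_or(index, *terms):
    """Documents containing at least one of the terms."""
    return set().union(*(index.get(t, set()) for t in terms))

def query_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

documents = {
    1: "text mining overview",
    2: "data mining tools",
    3: "text retrieval tools",
}
index = build_inverted_file(documents)

print(query_and(index, "text", "mining"))         # {1}
print(query_or(index, "overview", "retrieval"))   # {1, 3}
print(query_and(index, "tools") & query_not(index, documents, "data"))  # {3}
```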

IR – fuzzy search
Documents are represented as vectors over word (feature) space
Query can be a set of keywords, a document, or even a set of documents – also represented as a vector
It’s possible to perform it iteratively – relevance feedback
[Figure: relevance feedback loop: Initial query → IR over Repository → Output → Selection → IR → Output]
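The vector view of fuzzy search can be illustrated as follows. The sketch assumes raw word counts as vector components and cosine similarity as the relevance estimate; the feedback step simply adds the vector of a document the user selected to the query vector, which is a simplification chosen for this example rather than a specific published feedback formula. All names and documents are made up.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query_vec, repository):
    """Return ids of documents with non-zero similarity, best match first."""
    scored = [(cosine(query_vec, vectorize(text)), doc_id)
              for doc_id, text in repository.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

repository = {
    "d1": "text mining of document collections",
    "d2": "mining coal in deep shafts",
    "d3": "document categorization and text clustering",
}

# Initial query: a set of keywords.
query = vectorize("text mining")
first_pass = search(query, repository)

# Relevance feedback: the user marks d3 as relevant, so its vector is
# added to the query and the search is repeated.
query = query + vectorize(repository["d3"])
second_pass = search(query, repository)
print(first_pass[0], second_pass)
```

After the feedback step the selected document and texts similar to it move up, while the unrelated "mining coal" document drops to the bottom of the ranking.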

Document visualization
• Island represents several documents sharing a similar subject and separated from others, hence creating a group of interest
• Water represents assorted documents, creating semantic noise
• Peak represents many strongly related documents

Document visualization

Document categorization – a closer look

Measuring quality
Binary categorization scenario is analogous to document retrieval
• DB – document database
• dr – relevant documents
• ds – documents labelled as relevant

Metrics
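The formulas from this slide did not survive in the transcript. As a hedged illustration, the sketch below computes the metrics most commonly used in this setting (precision, recall, accuracy and F1) from the sets DB, dr and ds defined on the previous slide. Whether the slide listed exactly these metrics is an assumption, and the example sets are made up.

```python
def binary_metrics(DB, dr, ds):
    """Quality metrics for binary categorization.

    DB - all documents, dr - relevant documents,
    ds - documents labelled as relevant by the system.
    """
    hits = dr & ds                        # relevant and labelled as relevant
    true_negatives = DB - dr - ds         # irrelevant and not labelled
    precision = len(hits) / len(ds) if ds else 0.0
    recall = len(hits) / len(dr) if dr else 0.0
    accuracy = (len(hits) + len(true_negatives)) / len(DB)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

DB = set(range(1, 11))   # ten documents in the database
dr = {1, 2, 3, 4}        # four are actually relevant
ds = {3, 4, 5}           # the system labelled three as relevant
metrics = binary_metrics(DB, dr, ds)
print(metrics)
```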

Multiple class scenario
M = {M1, M2, ..., Ml} (elements Mk)
PR = {PR1, PR2, ..., PRl}
• Micro-averaging
• Macro-averaging
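The difference between the two averaging schemes is easiest to see on numbers. The sketch assumes that each class k has its own contingency counts (here only true positives and false positives are needed) and uses precision as the example metric: macro-averaging averages the per-class values, micro-averaging pools the counts first. The dictionary layout and the counts are hypothetical.

```python
def precision(tp, fp):
    """Precision from true-positive and false-positive counts."""
    return tp / (tp + fp) if tp + fp else 0.0

def macro_precision(tables):
    """Average of per-class precision values: every class weighs the same."""
    return sum(precision(t["tp"], t["fp"]) for t in tables) / len(tables)

def micro_precision(tables):
    """Precision of the pooled counts: every decision weighs the same."""
    tp = sum(t["tp"] for t in tables)
    fp = sum(t["fp"] for t in tables)
    return precision(tp, fp)

# One large, well-recognised class and one small, poorly recognised class.
tables = [
    {"tp": 90, "fp": 10},   # per-class precision 0.9
    {"tp": 1, "fp": 9},     # per-class precision 0.1
]
macro = macro_precision(tables)
micro = micro_precision(tables)
print(macro, micro)
```

Micro-averaging is dominated by the large class, while macro-averaging exposes the weak result on the small one.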

Categorization example

Document representations
• unigram representations (bag-of-words)
  • binary
  • multivariate
• n-gram representations
• -gram representation
• positional representation

Bigram example
Twas brillig, and the slithy toves
Did gyre and gimble in the wabe

Probabilistic interpretation
• Operations:
  • R(D) – creating representation R from document D
  • G(R) – generating document D based on representation R
Text generated from a unigram representation:
said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut prisoner no
Text generated from a bigram representation:
Consider your white queen shook his head and rang through my punishments. She ought to me and alice said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and without the room in a thing that a king and butter.
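The two operations can be sketched for a bigram representation. This is an illustrative assumption about how R and G might work: R(D) counts, for every word, which words follow it; G(R) walks those counts, drawing each next word with probability proportional to its count. The tokenizer, the `start`, `length` and `seed` parameters are choices made for this example only.

```python
import random
import re
from collections import Counter, defaultdict

def R(document):
    """Create a bigram representation: word -> counts of following words."""
    words = re.findall(r"[a-z']+", document.lower())
    representation = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        representation[first][second] += 1
    return representation

def G(representation, start, length, seed=0):
    """Generate a text by sampling successive words from the bigram counts."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        followers = representation.get(word)
        if not followers:          # the word was never followed by anything
            break
        candidates, weights = zip(*followers.items())
        word = rng.choices(candidates, weights=weights)[0]
        output.append(word)
    return output

D = "Twas brillig, and the slithy toves Did gyre and gimble in the wabe"
rep = R(D)
generated = G(rep, "and", 8)
print(" ".join(generated))
```

With such a short source text the generated sequence mostly replays the original; on a large corpus the same procedure yields locally plausible but globally meaningless text like the bigram sample above.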

Positional representation

Creating positional representation
[Figure: word occurrences along the document, with a window of width 2r around position k; f(k)=2 (before norm.)]
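The figure is lost, so the following is only one possible reading of it: for a given word, f(k) counts the occurrences of that word that fall within r positions on either side of position k (a window of width 2r), and the resulting function is then normalized so that it sums to 1. The function name, the normalization and the toy document are assumptions.

```python
def positional_density(words, target, r):
    """Return (raw, normalized) density of `target` over document positions.

    raw[k] is the number of occurrences of `target` within r positions
    of k; normalized is raw divided by its sum (assumed normalization).
    """
    occurrences = [i for i, w in enumerate(words) if w == target]
    raw = [sum(1 for i in occurrences if abs(i - k) <= r)
           for k in range(len(words))]
    total = sum(raw)
    normalized = [v / total for v in raw] if total else [0.0] * len(words)
    return raw, normalized

words = "a b a c a d e f".split()
raw, density = positional_density(words, "a", r=1)
print(raw)       # [1, 2, 1, 2, 1, 1, 0, 0]
```

At position 1 the window covers the occurrences at positions 0 and 2, giving f(1)=2 before normalization, which matches the situation annotated on the slide.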

Examples

Processing representations
• Zipf’s law
• Stopwords?
  There is no information about penguins in this document → information penguins document
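A small sketch of both ideas. The stopword list is a hypothetical one, chosen only to reproduce the example from the slide; real lists are much longer. The second function tabulates rank times frequency, the quantity that Zipf's law says stays roughly constant across the vocabulary of a large corpus.

```python
from collections import Counter

# Assumed, deliberately tiny stopword list for this example.
STOPWORDS = {"there", "is", "no", "about", "in", "this"}

def remove_stopwords(text):
    """Drop stopwords and keep the remaining words in order."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def rank_frequency(words):
    """List (rank, word, frequency, rank * frequency), most frequent first."""
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(Counter(words).most_common(), 1)]

filtered = remove_stopwords(
    "There is no information about penguins in this document")
print(filtered)   # ['information', 'penguins', 'document']

table = rank_frequency("a a a b b c".split())
print(table)
```

The slide's question mark is justified by the example itself: dropping "no" removes the negation, so the filtered text suggests the document is about penguins.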

Expanding and trimming
• Expanding
• Trimming
  • Scaling functions
  • Attribute selection
  • Remapping attribute space

Representation processing
Scaling
• TF/IDF: term frequency tfi, document frequency dfi, N – all documents in system
  • Attribute present in one document
  • Attribute present in all documents
Expanding
• Laplace
• Lidstone
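The slide's formulas are missing from the transcript. The sketch below uses the common textbook forms, which may differ in detail from the ones shown in the talk: TF/IDF as tf_i · log(N / df_i), and Lidstone smoothing as (count + λ) / (total + λ·V), with Laplace smoothing as the special case λ = 1. The parameter names are chosen for this example.

```python
import math

def tf_idf(tf, df, N):
    """Scale term frequency by inverse document frequency (assumed form)."""
    return tf * math.log(N / df)

def lidstone(count, total, vocabulary_size, lam=1.0):
    """Smoothed probability estimate; lam=1.0 gives Laplace smoothing."""
    return (count + lam) / (total + lam * vocabulary_size)

N = 10
# Attribute present in all documents carries no information: weight 0.
in_all = tf_idf(tf=3, df=N, N=N)
# Attribute present in one document gets the largest possible boost.
in_one = tf_idf(tf=3, df=1, N=N)

# Expanding: an unseen word (count 0) still receives non-zero probability.
laplace_p = lidstone(count=0, total=10, vocabulary_size=5)
lidstone_p = lidstone(count=0, total=10, vocabulary_size=5, lam=0.5)
print(in_all, in_one, laplace_p, lidstone_p)
```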

Attribute selection
Example – Information Gain
• P(wi) – probability of encountering attribute wi in a randomly selected document
• P(kj) – probability that a randomly selected document belongs to class kj
• P(kj|wi) – probability that a document selected from those containing wi belongs to class kj
Statistical tests can also be applied to check whether a feature-class correlation exists
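The Information Gain formula itself is not preserved in the transcript. The sketch below uses the usual definition built from the three probabilities listed above, summing over the attribute being present and absent: IG(wi) = Σ over w ∈ {wi, not wi} and over classes kj of P(w) · P(kj|w) · log2(P(kj|w) / P(kj)). The data layout (a list of word-set and label pairs) and the toy documents are assumptions.

```python
import math

def information_gain(documents, attribute):
    """Information gain of `attribute` with respect to the class labels.

    documents: list of (set_of_words, class_label) pairs.
    """
    n = len(documents)
    labels = [label for _, label in documents]
    gain = 0.0
    for present in (True, False):
        # Labels of documents that do (or do not) contain the attribute.
        subset = [label for words, label in documents
                  if (attribute in words) == present]
        if not subset:
            continue
        p_w = len(subset) / n
        for k in set(labels):
            p_k = labels.count(k) / n
            p_k_given_w = subset.count(k) / len(subset)
            if p_k_given_w > 0:
                gain += p_w * p_k_given_w * math.log2(p_k_given_w / p_k)
    return gain

documents = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"stock", "market"}, "finance"),
    ({"market", "team"}, "finance"),
]
ig_goal = information_gain(documents, "goal")   # separates the classes
ig_team = information_gain(documents, "team")   # tells nothing about the class
print(ig_goal, ig_team)
```

Attributes with the lowest gain, such as "team" here, are the first candidates for removal when trimming the representation.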

Attribute space remapping
• Attribute clustering
  • Semantic clustering
  • Attribute – class clustering
  • Clustering according to density function similarity
• Representation matrix processing (example - SVD)

Applications
• Classic
  • Mail analysis and mail routing
  • Event tracking
• Internet related
  • Web Content Mining and Web Farming
  • Focused crawling and assisted browsing

Thank you
