
Text Mining

Text Mining. Walter Daelemans, CNTS, Department of Linguistics, University of Antwerp, walter.daelemans@ua.ac.be.


Presentation Transcript


  1. Text Mining Walter Daelemans CNTS Department of Linguistics University of Antwerp walter.daelemans@ua.ac.be

  2. Centre for Dutch Language and Speech (CNTS) • Part of department of linguistics, University of Antwerp • Staff 2 tenured + 10-15 with temporary funding from EU, IWT, FWO, NTU, language industry, BOF, … • Topics • Corpus Linguistics (mainly Dutch) • Child language acquisition / computational psycholinguistics • Language Technology • machine learning of language • shallow parsing • text mining

  3. Information Overload • Language is the most natural and most used knowledge representation formalism • Non-structured or weakly structured information • Text • Databases with text fields • Web-pages, e-mail messages, blogs, chat, … • (Non-structured) information overload • Doubles every three months (Gardner) • Hampers knowledge management and business intelligence • Translation bottleneck

  4. Natural Language Understanding? • Word meaning: morphological analysis, complex word interpretation, word sense disambiguation • Fremdzugehen / External train marriages • The box is in the pen • Sentence meaning: syntactic structure (parsing), sentence interpretation • I eat a pizza with extra cheese • I eat a pizza with a fork • I eat a pizza with my daughter • Discourse meaning: world knowledge (frames, scenarios, grounding, intentions, …) • The mayors didn’t want the students to strike because they feared violence • The mayors didn’t want the students to strike because they preached the revolution

  5. State of the Art • Robust, efficient, accurate, unrestricted language understanding will not be available for a long time • AI-complete problem • Alternative: • text mining: automatic extraction of reusable knowledge from text, based on linguistic analysis of the text

  6. Approach • Text analysis tools (shallow instead of deep understanding) • Robust / Efficient / Accurate • Text Mining applications • Question Answering • Summarization • Ontology extraction • Information extraction • Text categorization For embedding in • End user applications related to knowledge search / management / discovery / communication

  7. Examples • Application Areas: • Data mining (KDD) from unstructured and semi-structured data • (Corporate) Knowledge Management • “Intelligence” • Example Applications: • Email routing and filtering (spam filtering) • Finding protein interactions in biomedical text • Brokering • Matching on-line resumes and vacancies • Buying and selling property • …

  8. Text Data Mining (Discovery) • Find relevant information • Information extraction • Text categorization • Analyze the text • Text mining • Discover new information • Integrate different sources • Data mining

  9. Magnesium deficiency implicated in migraine (?) Text analysis output Don Swanson 1981: medical hypothesis generation • stress is associated with migraines • stress can lead to loss of magnesium • calcium channel blockers prevent some migraines • magnesium is a natural calcium channel blocker • spreading cortical depression (SCD) is implicated in some migraines • high levels of magnesium inhibit SCD • migraine patients have high platelet aggregability • magnesium can suppress platelet aggregability • …
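The reasoning pattern on this slide (two topics that are never linked directly but share intermediate concepts) can be sketched in a few lines. This is only an illustration under stated assumptions: the concept sets are transcribed by hand from the bullets above, the `bone density` entry is a hypothetical distractor, and `shared_intermediates` is a name invented here, not part of any CNTS tool.

```python
# Illustrative data only: intermediate concepts mentioned in each literature,
# transcribed from the bullet points on the slide.
migraine_links = {
    "stress",
    "calcium channel blockers",
    "spreading cortical depression",
    "platelet aggregability",
}
magnesium_links = {
    "stress",
    "calcium channel blockers",
    "spreading cortical depression",
    "platelet aggregability",
    "bone density",  # hypothetical extra link with no migraine counterpart
}


def shared_intermediates(a_links, c_links):
    """Return the intermediate concepts that connect topic A and topic C."""
    return sorted(a_links & c_links)


bridges = shared_intermediates(migraine_links, magnesium_links)
print(bridges)
```

Each shared concept is a candidate bridge supporting the hypothesis that the two topics are related; in a real system the sets would come from text analysis output rather than from a hand-written list.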

  10. CNTS text analysis tools • MBSP • Flexible and adaptable • Dutch and English • State of the Art accuracy and efficiency • ~ 90% sentences / ~ 1000 words/sec • Configurable combination of linguistic modules • Modules developed using Machine Learning • TiMBL • Adaptation through re-training and semi-supervised learning • Client-server set-up

  11. Text Tokenisation POS tagging NP chunking NER Relation finding CNTS shallow understanding

  12. Text Tokenisation POS tagging NP chunking NER Relation finding • Insulatard is an isophane insulin suspension (NPH).

  13. Text Tokenisation POS tagging NP chunking NER Relation finding • Insulatard is an isophane insulin suspension (NPH). • Insulatard • is • an • isophane • insulin • suspension • ( • NPH • ) • .
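A minimal sketch of the tokenisation step, assuming a simple regular-expression rule (runs of word characters, or single punctuation marks). This is not the MBSP tokeniser, only a stand-in that reproduces the token list shown on the slide for this sentence.

```python
import re

# Assumption: a token is either a run of word characters or one
# non-space, non-word character (punctuation, brackets).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Insulatard is an isophane insulin suspension (NPH).")
print(tokens)
```

A rule this simple will split abbreviations, decimal numbers and hyphenated words incorrectly, which is one reason production tokenisers are trained or hand-tuned per language.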

  14. Text Tokenisation POS tagging NP chunking NER Relation finding • Insulatard is an isophane insulin suspension (NPH). Insulatard NNP is VBZ an DT isophane JJ insulin NN suspension NN ( Punc NPH NNP ) Punc . Punc

  15. Text Tokenisation POS tagging NP chunking NER Relation finding • Insulatard is an isophane insulin suspension (NPH). [NP Insulatard] [VP is] [NP an isophane insulin suspension ( NPH )]
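To illustrate how chunks can be derived from POS tags, here is a rule-based sketch that groups consecutive nominal tags into NP and verbal tags into VP. It is an assumption-laden toy, not the memory-based chunker used in MBSP: the tag sets are chosen by hand, and unlike the slide output it does not attach the parenthetical "( NPH )" to the preceding noun phrase.

```python
from itertools import groupby

# Assumption: these tags may appear inside a noun phrase.
NOMINAL_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}


def chunk_type(tag):
    """Map a POS tag to a chunk label, or None if it stays outside chunks."""
    if tag in NOMINAL_TAGS:
        return "NP"
    if tag.startswith("VB"):
        return "VP"
    return None


def chunk(tagged_tokens):
    """Bracket maximal runs of tokens that share the same chunk label."""
    parts = []
    for label, group in groupby(tagged_tokens, key=lambda pair: chunk_type(pair[1])):
        words = [word for word, _ in group]
        if label is None:
            parts.extend(words)  # punctuation is left unbracketed
        else:
            parts.append("[%s %s]" % (label, " ".join(words)))
    return " ".join(parts)


# POS tags as given on the tagging slide.
tagged = [("Insulatard", "NNP"), ("is", "VBZ"), ("an", "DT"),
          ("isophane", "JJ"), ("insulin", "NN"), ("suspension", "NN"),
          ("(", "Punc"), ("NPH", "NNP"), (")", "Punc"), (".", "Punc")]
chunked = chunk(tagged)
print(chunked)
```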

  16. Text Tokenisation POS tagging NP chunking NER Relation finding • Insulatard is an isophane insulin suspension (NPH). Insulatard = Medicine name NPH = Hormone

  17. Text Tokenisation POS tagging NP chunking NER Relation finding • Insulatard is an isophane insulin suspension (NPH). [SBJ Insulatard] is [PREDC an isophane insulin suspension ( NPH )]

  18. Application: Question Answering • Give an answer to a question (in contrast to document retrieval, which finds documents relevant to a query) • Who invented the telephone? • Alexander Graham Bell • When was the telephone invented? • 1876

  19. QA System: Shapaqa • Parse question When was the telephone invented? • Which slots are given? • Verb invented • Object telephone • Which slots are asked? • Temporal phrase linked to verb • Document retrieval on internet with given slot keywords • Parsing of sentences with all given slots • Count most frequent entry found in asked slot (temporal phrase)

  20. Shapaqa: example • When was the telephone invented? • Google: invented AND "the telephone" • produces 835 pages • 53 parsed sentences with both input slots and with a temporal phrase, for example: • "… is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876, with the intent of helping Deaf and hard of hearing …" • "The telephone was invented by Alexander Graham Bell in 1876" • "When Alexander Graham Bell invented the telephone in 1876, he hoped that these same electrical signals could …"

  21. Shapaqa: frequency ranking • So when was the phone invented? • Internet answer is noisy, but robust • 17: 1876 • 3: 1874 • 2: ago • 2: later • 1: Bell • … • System was developed quickly • Precision 76% (Google 31%) • International competition (TREC): MRR 0.45
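The frequency ranking and the MRR score mentioned above can be sketched as follows. This is not Shapaqa's code: the function names are invented for illustration, the candidate list is rebuilt from the counts on the slide, and the ranks passed to `mean_reciprocal_rank` are hypothetical values, not TREC results.

```python
from collections import Counter


def rank_answers(candidates):
    """Rank the fillers of the asked slot by how often they were extracted."""
    return Counter(candidates).most_common()


def mean_reciprocal_rank(ranks):
    """Average 1/rank of the first correct answer per question.

    ranks holds 1-based ranks; None means no correct answer was returned
    and contributes 0 to the sum.
    """
    return sum(1.0 / rank for rank in ranks if rank) / len(ranks)


# Temporal-slot fillers with the frequencies listed on the slide.
candidates = ["1876"] * 17 + ["1874"] * 3 + ["ago"] * 2 + ["later"] * 2 + ["Bell"]
ranking = rank_answers(candidates)
print(ranking[0])

# Hypothetical ranks for four questions, only to show the arithmetic.
mrr = mean_reciprocal_rank([1, 2, None, 4])
print(mrr)
```

Counting makes the answer robust to noise: wrong fillers such as "ago" or "Bell" do occur, but they are outvoted as long as the correct answer is extracted more often than any single error.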

  22. Application: Biomedical text mining (EU project BioMinT) IR IE Linguistic / Semantic Features Templates Factoids Text Analysis Medline abstracts

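  23. (Partial) Factoids • The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens. • [Figure: the sentence annotated with semantic classes and relations. "The mouse lymphoma assay" and "MLA" are labelled CELL-LINE; "the Tk gene" is labelled DNA part; S and O arcs link "utilizing" to its subject and object; an O arc links "is widely used to identify" to "chemical mutagens".]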

  24. <!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
      <MBSP>
        <S cnt="s1">
          <NP rel="SBJ" of="s1_1">
            <W pos="DT">The</W>
            <W pos="NN" sem="cell_line">mouse</W>
            <W pos="NN" sem="cell_line">lymphoma</W>
            <W pos="NN">assay</W>
          </NP>
          <W pos="openparen">(</W>
          <NP>
            <W pos="NN" sem="cell_line">MLA</W>
          </NP>
          <W pos="closeparen">)</W>
          <VP id="s1_1">
            <W pos="VBG">utilizing</W>
          </VP>
          <NP rel="OBJ" of="s1_1">
            <W pos="DT">the</W>
            <W pos="NN" sem="DNA_part">Tk</W>
            <W pos="NN" sem="DNA_part">gene</W>
          </NP>
          <VP id="s1_2">
            <W pos="VBZ">is</W>
            <W pos="RB">widely</W>
            <W pos="VBN">used</W>
          </VP>
          <VP id="s1_3">
            <W pos="TO">to</W>
            <W pos="VB">identify</W>
          </VP>
          <NP rel="OBJ" of="s1_3">
            <W pos="JJ">chemical</W>
            <W pos="NNS">mutagens</W>
          </NP>
          <W pos="period">.</W>
        </S>
      </MBSP>
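As a sketch of how such output could be consumed, the following reads a shortened, hand-written fragment in the same format with Python's standard `xml.etree.ElementTree` and pairs each NP carrying a `rel` attribute with the VP whose `id` matches its `of` attribute. The interpretation of `rel`, `of` and `id` is inferred from the example on the slide, and the helper names are invented here.

```python
import xml.etree.ElementTree as ET

# Shortened fragment modelled on the slide; not complete MBSP output.
SAMPLE = """
<MBSP>
  <S cnt="s1">
    <NP rel="SBJ" of="s1_1">
      <W pos="DT">The</W>
      <W pos="NN" sem="cell_line">mouse</W>
      <W pos="NN" sem="cell_line">lymphoma</W>
      <W pos="NN">assay</W>
    </NP>
    <VP id="s1_1"><W pos="VBG">utilizing</W></VP>
    <NP rel="OBJ" of="s1_1">
      <W pos="DT">the</W>
      <W pos="NN" sem="DNA_part">Tk</W>
      <W pos="NN" sem="DNA_part">gene</W>
    </NP>
  </S>
</MBSP>
"""


def chunk_text(element):
    """Join the words of a chunk into one string."""
    return " ".join(word.text for word in element.iter("W"))


def extract_relations(xml_text):
    """Return (relation, verb, noun phrase) triples found in the parse."""
    root = ET.fromstring(xml_text.strip())
    verbs = {vp.get("id"): chunk_text(vp) for vp in root.iter("VP")}
    triples = []
    for np in root.iter("NP"):
        relation, verb_id = np.get("rel"), np.get("of")
        if relation and verb_id in verbs:
            triples.append((relation, verbs[verb_id], chunk_text(np)))
    return triples


triples = extract_relations(SAMPLE)
print(triples)
```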

  25. Extracted IEX Templates from shallow parser output
      • NP(<X protein>) contain NP(Y "domain") gives EVENT: contain, PROTEIN: <protein>, DOMAIN: "domain"
      • NP(<X protein>) be associated with NP(Y "disease") gives EVENT: associated_with, PROTEIN: <protein>, DISEASE: "head"
      • NP(<X protein>) regulate NP(Y) gives EVENT: regulate, PROTEIN: <protein>, Y:
      • Notation: (): to be extracted, <>: semantic constraint, "": lexical constraint
      • Jee-Hyub Kim (Geneva)
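One way to picture how such a template is applied to shallow parser output is the sketch below. Everything in it is hypothetical: the `Chunk` and `Template` classes, their fields, and the example phrases are invented for illustration and do not reflect the BioMinT implementation. The semantic constraint is checked on the subject chunk and the lexical constraint on the head of the object chunk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str                  # surface string of the noun phrase
    head: str                  # head noun of the phrase
    sem: Optional[str] = None  # semantic class from NER, if any


@dataclass
class Template:
    event: str
    verb: str                   # verb lemma that triggers the template
    subject_sem: str            # semantic constraint on the subject (<>)
    object_head: Optional[str]  # lexical constraint on the object head ("")
    object_slot: str            # name of the slot filled by the object

    def match(self, subject, verb_lemma, obj):
        """Return a filled template, or None if a constraint fails."""
        if verb_lemma != self.verb:
            return None
        if subject.sem != self.subject_sem:
            return None
        if self.object_head is not None and obj.head != self.object_head:
            return None
        return {"EVENT": self.event,
                self.subject_sem.upper(): subject.text,
                self.object_slot: obj.text}


contain = Template("contain", "contain", "protein", "domain", "DOMAIN")

# Hypothetical parser output for a sentence like "The kinase contains an SH2 domain".
filled = contain.match(Chunk("the kinase", "kinase", "protein"),
                       "contain",
                       Chunk("an SH2 domain", "domain"))
rejected = contain.match(Chunk("the assay", "assay"),
                         "contain",
                         Chunk("an SH2 domain", "domain"))
print(filled)
```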

  26. Application: Ontology Extraction • Clustering of head nouns of Subject-Verb and Verb-Object relations • Combine with pattern matching and heuristics • Case study: Medline 4 million words hepatitis, SwissProt corpus • Results: • Better clusters with shallow parsing • Useful in knowledge management, thesaurus development, … Ontobasis (IWT)

  27. Example (SwissProt corpus)
      • gene | show | significant homology
      • amino_acid_sequence | have/indicate/lack/reveal/show | homology
      • protein | show | homology, immunoreactivity, reactivity, sequence similarity
      • protein | inhibit | catalytic activity, apoptosis, protein synthesis...
      • protein | exhibit | significant homology
      • protein | bind | copper, ubiquitin
      • protein | correspond | isoelectric point
      • induction | requires | protein synthesis
      • Edman degradation | of | intact protein
      • regulatory subunit | of | cAMP-dependent protein kinase
      • …
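Clustering head nouns presupposes some measure of how similar two nouns are in terms of the relations they take part in. The sketch below shows one common choice, cosine similarity over verb co-occurrence counts, applied to subject-verb pairs copied from the example above. Whether the Ontobasis work used this particular measure is not stated in the slides, so treat it as an assumption; the function names are invented here.

```python
import math
from collections import Counter, defaultdict

# (subject head noun, verb) pairs transcribed from the SwissProt example.
PAIRS = [
    ("gene", "show"),
    ("amino_acid_sequence", "have"), ("amino_acid_sequence", "indicate"),
    ("amino_acid_sequence", "lack"), ("amino_acid_sequence", "reveal"),
    ("amino_acid_sequence", "show"),
    ("protein", "show"), ("protein", "inhibit"), ("protein", "exhibit"),
    ("protein", "bind"), ("protein", "correspond"),
    ("induction", "require"),
]


def build_vectors(pairs):
    """Represent each noun by the counts of verbs it occurs with."""
    vectors = defaultdict(Counter)
    for noun, verb in pairs:
        vectors[noun][verb] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


vectors = build_vectors(PAIRS)
sim_gene_protein = cosine(vectors["gene"], vectors["protein"])
sim_gene_induction = cosine(vectors["gene"], vectors["induction"])
print(sim_gene_protein, sim_gene_induction)
```

Nouns that share verbs ("gene" and "protein" both occur with "show") come out as more similar than nouns that share none, and a clustering algorithm can then group them on that basis.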

  28. [Figure: fragment of the extracted ontology. Nodes: liver, hepatitis, cirrhosis, infection, disease, HBV, antibody, immunization, vaccination, culture, antisera. Edge labels: related_to, sim, prevented by, produced by.]

  29. Further development • Semantic roles • Faster adaptation to new domains • Domain semantics (NER / concept tagging) • Active Learning / semi-supervised learning • More analytic power • Negation, modality, quantification • Limited event and scenario recognition

  30. Conclusions • Text Mining tasks benefit from text analysis • Understanding can be formulated as a flexible heterarchy of classifiers • These classifiers can be trained / adapted on annotated corpora and can eventually approximate deep understanding

  31. Questions? • Walter Daelemans • A1.10 Campus Drie Eiken • (September: Stadscampus) • Walter.daelemans@ua.ac.be
