Causality Knowledge Extraction based on A Single Sentence from Thai Textual Data

Presentation Transcript


  1. Causality Knowledge Extraction based on A Single Sentence from Thai Textual Data. Chaveevan Pechsiri (Dhurakij Pundit University); Assoc. Prof. Dr. Asanee Kawtrakul (NAiST Laboratory, Kasetsart University). SNLP 2007, 14 December 2007

  2. Outline • Motivation • Introduction • Related work • Crucial Problems • System Overview • Evaluation • Conclusion Specialty Research Unit in Natural Language Processing & Intelligent Information System Technology Department of Computer Engineering, Kasetsart University

  3. Motivation • Most knowledge is spread throughout text. • Instead of reading a huge number of reports, we need an automatic knowledge extraction system that obtains causality knowledge from text for problem diagnosis, decision support, or question answering systems.

  4. Introduction • What is knowledge? • “Knowledge is the awareness and understanding of facts, truths or information gained in the form of experience or learning.” (Wikipedia, 2006) • “The information, understanding, and skills that you gain through education or experience” (Oxford Advanced Learner’s Dictionary, 2000) • Knowledge types (Jana Trnková and Wolfgang Theilmann, 2004) • Orientation knowledge (“know what a topic is about”) • Action knowledge (“know how”) • Explanation knowledge (“know why something is the way it is”) • Reference knowledge (“know where to find additional information”)

  5. Introduction • What is causality? • Causality refers to the set of all particular “causal” or “cause-and-effect” relations (Wikipedia: http://en.wikipedia.org/wiki/Main_Page) • The relationship between something that happens and the reason for it happening (Oxford Advanced Learner’s Dictionary, 2000)

  6. Causality Knowledge • Inter-causal EDU (20%) • If aphids suck sap from a plant, its leaves will turn yellow and its flowers will start to drop. • Plant leaves shrink because the aphids destroy the plant. • Intra-causal EDU (7%) • An earthquake generates a tsunami. (NP1 V NP2) • Bird flu is caused by the virus ‘H5N1’. (NP1 cue NP2) • Leaves have black spots from bacteria. (NP1 V NP2 Prep NP3)

  7. Related Work

  8. Causal Verb (linking verb)

  9. Crucial Problems • How to identify causality within one sentence • Implicit noun phrases, i.e., zero anaphora

  10. How to identify causality • By using a causal verb (linking verb) • List of causal verbs from Girju, 2002 (1%): ราในถั่วผลิตอัลฟาทอกซิน / Fungus in peanut produces alpha toxin. • Cue phrase set (Chang and Choi, 2004) (2%): ไวรัส H5N1 เป็นสาเหตุให้เกิดโรคไข้หวัดนก / Bird flu is caused by the virus ‘H5N1’. • General verb + information + preposition phrase • Verb + preposition phrase (4%)

  11. Causal Verb • General verb + information + preposition phrase • General verb = {เป็น/be, มี/have, ได้รับ/get, …} • Information = {แผล/scar, จุด/spot, รอย/mark, ขีด/scratch, ตำหนิ/defect, โรค/disease, …} • Preposition = {from, with} • Pattern: “NP1 Verb [NP2] Prep NP3” • For example, เป็น/be + โรค/disease = get a disease: A kid gets a disease from the virus ‘H5N1’.

  12. How to identify causality • Pattern: “NP1 Verb [NP2] Prep NP3” • Examples: 1. “พืช/plant เป็น/is โรค/disease จาก/from เชื้อรา/fungi” 2. “โรค/disease เกิด/occurs จาก/from ไวรัส/virus” 3. “เด็ก/kid ตาย/dies ด้วย/with โรคไข้หวัดนก/the Bird flu disease” 4. “เด็ก/kid ได้รับ/gets เชื้อ/disease จาก/from การสัมผัสไก่ติดเชื้อไข้หวัดนก/touching the infected chicken”
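
The pattern “NP1 Verb [NP2] Prep NP3” can be sketched as a small slot matcher. This is only an illustrative sketch, not the authors' implementation: it assumes each EDU has already been segmented and tagged into (tag, word) pairs, and the tag names and the function `match_pattern` are hypothetical. English glosses stand in for the Thai words.

```python
def match_pattern(edu):
    """Return the slots of 'NP1 Verb [NP2] Prep NP3', or None if the EDU does not fit.

    `edu` is a list of (tag, word) pairs; the tag set NP/V/PREP is an assumption.
    """
    tags = [tag for tag, _ in edu]
    words = [word for _, word in edu]
    if tags == ["NP", "V", "NP", "PREP", "NP"]:
        np1, verb, np2, prep, np3 = words
    elif tags == ["NP", "V", "PREP", "NP"]:
        np1, verb, prep, np3 = words
        np2 = None  # the optional NP2 slot is absent
    else:
        return None
    return {"NP1": np1, "Verb": verb, "NP2": np2, "Prep": prep, "NP3": np3}


# Examples 1 and 2 from the slide, written with their English glosses.
example_1 = [("NP", "plant"), ("V", "be"), ("NP", "disease"),
             ("PREP", "from"), ("NP", "fungi")]
example_2 = [("NP", "disease"), ("V", "occur"), ("PREP", "from"), ("NP", "virus")]

slots_1 = match_pattern(example_1)
slots_2 = match_pattern(example_2)
```

Matching the pattern only selects candidate EDUs; whether a candidate is actually causal still has to be decided by the learned model, because the same verb and preposition can appear in non-causal sentences.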

  13. Problems of using causal verbs • Verb ambiguity • Causality: • “ใบพืช/plant leaf มี/has จุดสีน้ำตาล/brown spots จาก/from เชื้อรา/fungi” • “คนไข้/the patient ตาย/dies ด้วย/with โรคมะเร็ง/cancer” • Non-causality: • “ใบพืช/plant leaf มี/has จุดสีน้ำตาล/brown spots จาก/from โคนใบ/the leaf base” • “คนไข้/the patient ตาย/dies ด้วย/with ความสงสัย/suspicion”

  14. Zero Anaphora Problem • For example: “โรคไข้หวัดนก/The Bird flu disease เป็น/is โรคที่สำคัญโรคหนึ่ง/an important disease. เกิด/occurs จาก/from ไวรัส H5N1/H5N1 virus.” The subject of เกิด/occurs is omitted; this zero anaphora refers to the Bird flu disease.
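
The zero anaphora example can be illustrated with a deliberately simple sketch that fills a missing NP1 with the most recent explicit NP1. This simplification is an assumption made here for illustration; it is not the heuristic rule of Yeh and Mellish (1997) that the system uses, and the dictionary layout and the function `restore_subject` are hypothetical.

```python
def restore_subject(edus):
    """Fill a missing NP1 with the most recent explicit NP1 from earlier EDUs.

    Assumption: an omitted subject refers to the subject of the nearest
    preceding EDU that has one. The input list is not modified.
    """
    resolved = []
    last_subject = None
    for edu in edus:
        current = dict(edu)  # copy so the caller's data stays unchanged
        if current.get("NP1") is None:
            current["NP1"] = last_subject
        else:
            last_subject = current["NP1"]
        resolved.append(current)
    return resolved


# The two EDUs of the slide's example, written with their English glosses.
edus = [
    {"NP1": "Bird flu disease", "Verb": "be", "NP2": "important disease"},
    {"NP1": None, "Verb": "occur", "Prep": "from", "NP3": "H5N1 virus"},
]
resolved = restore_subject(edus)
```

After resolution, the second EDU reads as "Bird flu disease occurs from H5N1 virus", which fits the “NP1 Verb [NP2] Prep NP3” pattern.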

  15. System Overview [Diagram; components: Text; Corpus Preparation; WordNet, Lexitron, Plant encyclopedia; Causality learning; Learnt model; Causality extraction; Cause-effect relation; Knowledge base]

  16. Corpus Preparation • Word segmentation (Sudprasert and Kawtrakul, 2003) • Named entity determination (Chanlekha and Kawtrakul, 2004) • EDU segmentation (Charoensuk et al., 2005) • An EDU (Elementary Discourse Unit) is the minimal building block of a discourse tree (Mann and Thompson, 1988, p. 244), e.g., a simple sentence or a clause.

  17. Corpus Preparation • Manual feature annotation for learning (with reference to WordNet, the Plant encyclopedia, and the Lexitron dictionary): <EDU type=causality> <NP1 concept=‘plant organ#1’>ใบพืช</NP1> <Verb =‘have’>มี</Verb> <NP2 concept=‘symptom#1’>จุดสีน้ำตาล</NP2> <Preposition =‘from’>จาก</Preposition> <NP3 concept=‘fungus#1’>เชื้อรา</NP3> </EDU>

  18. Causality Learning • ID3 (Mitchell T.M., 1997) • SVM (Cristianini and Shawe-Taylor, 2000)

  19. ID3 ID3 uses a statistical property called information gain, defined with an entropy measure, to quantify how well a given attribute (A; e.g., NP1, Verb, NP2, Preposition, NP3) separates the collected examples (S) according to their target classification: $$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v), \qquad Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$ where Values(A) is the set of possible values of A, S_v is the subset of S for which A has value v, the entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (Charniak E., 1993), c is the number of different values of the target attribute, and p_i is the proportion of S belonging to class i.
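
The information gain that ID3 computes for each attribute can be sketched in a few lines of Python. The toy examples below are invented for illustration from the verb-ambiguity sentences shown earlier; they are not the authors' corpus, and the dictionary layout and function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target="class"):
    """Entropy of the whole set minus the weighted entropy after splitting on `attribute`."""
    gain = entropy([ex[target] for ex in examples])
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain


# Invented toy data: two causal and two non-causal EDUs.
toy = [
    {"Verb": "have", "Prep": "from", "NP3": "fungus", "class": "causality"},
    {"Verb": "have", "Prep": "from", "NP3": "leaf base", "class": "non-causality"},
    {"Verb": "die", "Prep": "with", "NP3": "cancer", "class": "causality"},
    {"Verb": "die", "Prep": "with", "NP3": "suspicion", "class": "non-causality"},
]

gain_np3 = information_gain(toy, "NP3")
gain_verb = information_gain(toy, "Verb")
```

On this toy set NP3 separates the classes completely while Verb separates nothing, so ID3 would split on NP3 first. That is consistent with the verb-ambiguity problem: the verb alone does not decide causality.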

  20. ID3 [Decision tree figure; root NP3 with three paths: NP3 = pathogen → prep = from → verb = be → Causality; NP3 = food poisoning → prep = with → verb = have → Causality; NP3 = contraction → prep = from → verb = infect → Causality]

  21. ID3 • Rule mining using the Weka tool

  22. ID3 • Rule generalization and verification • Some rules share the same general concept and can be combined into one rule, as in the following example:
     R1: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=fungi> THEN causality
     R2: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=bacteria> THEN causality
     R3: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=pathogen> THEN causality
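
One way to picture this generalization step is to map each NP3 concept to a more general concept and merge rules that become identical. The sketch below is an assumption-laden illustration: the `HYPERNYM` table is hypothetical (the slides only say that concepts refer to WordNet), and `generalize` and `matches` are not the authors' procedure.

```python
# Hypothetical hypernym table; a real system would look concepts up in WordNet.
HYPERNYM = {"fungi": "pathogen", "bacteria": "pathogen"}

R1 = {"NP1": "*", "Verb": "be", "NP2": "*", "Prep": "from", "NP3": "fungi"}
R2 = {"NP1": "*", "Verb": "be", "NP2": "*", "Prep": "from", "NP3": "bacteria"}
R3 = {"NP1": "*", "Verb": "be", "NP2": "*", "Prep": "from", "NP3": "pathogen"}


def generalize(rules):
    """Replace each NP3 concept by its hypernym and drop duplicate rules."""
    merged = []
    for rule in rules:
        general = dict(rule, NP3=HYPERNYM.get(rule["NP3"], rule["NP3"]))
        if general not in merged:
            merged.append(general)
    return merged


def matches(rule, edu):
    """True if every non-wildcard slot equals the EDU value or the value's hypernym."""
    for slot, expected in rule.items():
        if expected == "*":
            continue
        value = edu.get(slot)
        if value != expected and HYPERNYM.get(value) != expected:
            return False
    return True


general_rules = generalize([R1, R2, R3])
causal_edu = {"NP1": "plant", "Verb": "be", "NP2": "disease", "Prep": "from", "NP3": "fungi"}
other_edu = {"NP1": "plant", "Verb": "be", "NP2": "spot", "Prep": "from", "NP3": "leaf base"}
```

Under these assumptions the three rules collapse into a single rule with NP3 = pathogen, which still fires for an EDU whose NP3 is fungi but not for one whose NP3 is the leaf base.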

  23. ID3 • Verifying rules • The testing corpus of 2000 EDUs from the agricultural and health news domains contains 102 EDUs of the specified sentence pattern, of which only 87 are causality EDUs under the 20 causal verb rules.

  24. SVM A linear function f(x) of the input x = (x_1…x_n) is used: x is assigned to the positive class if f(x) ≥ 0, and to the negative class if f(x) < 0, with $$f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{n} w_i x_i + b$$ where w is the weight vector, b is the bias, and x_i is each of the five features (NP1, Verb, NP2, Preposition, and NP3) of the specified sentence pattern from the annotated corpus.
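
The sign test on the linear function can be sketched as follows. The slides do not say how the five categorical features are turned into numbers, so the one-hot encoding here is an assumption, and the vocabulary, weights, and bias are invented illustrative values rather than a trained model.

```python
def encode(edu, vocabulary):
    """One-hot encode an EDU against a fixed list of (slot, value) pairs (assumed encoding)."""
    return [1.0 if edu.get(slot) == value else 0.0 for slot, value in vocabulary]


def decision(weights, bias, x):
    """Linear function f(x) = sum_i w_i * x_i + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Positive class (causality) if f(x) >= 0, otherwise the negative class."""
    return "causality" if decision(weights, bias, x) >= 0 else "non-causality"


# Illustrative values only; a real model would learn these from the annotated corpus.
VOCAB = [("Verb", "have"), ("Prep", "from"), ("NP3", "fungus"), ("NP3", "leaf base")]
WEIGHTS = [0.2, 0.3, 1.0, -1.5]
BIAS = -0.4

causal = encode({"Verb": "have", "Prep": "from", "NP3": "fungus"}, VOCAB)
non_causal = encode({"Verb": "have", "Prep": "from", "NP3": "leaf base"}, VOCAB)
label_causal = classify(WEIGHTS, BIAS, causal)
label_non_causal = classify(WEIGHTS, BIAS, non_causal)
```

With these made-up weights the two EDUs share the same verb and preposition, and the NP3 feature alone moves f(x) across zero.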


  26. Causality Extraction • Causality identification • Use the causal verb rules from ID3 • Use the weight vector and bias from SVM • Solving zero anaphora • Use the heuristic rule of Ching-Long Yeh and Chris Mellish (1997)

  27. Evaluation • 2000 EDUs from agricultural and health news are used for training, and 2000 EDUs for testing, evaluated by precision and recall. • The results are then judged by experts using max-win voting.
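
The evaluation procedure can be sketched as expert voting followed by precision and recall. The sketch assumes that "max win voting" means taking the label chosen by most experts; the votes and predictions below are invented, not the reported results.

```python
from collections import Counter


def majority_vote(votes):
    """Label chosen by most experts (assumed reading of 'max win voting')."""
    return Counter(votes).most_common(1)[0][0]


def precision_recall(predicted, gold):
    """Precision and recall of the positive (causality) class."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: three experts judge four EDUs (True = causality).
expert_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [True, False, True],
]
gold = [majority_vote(votes) for votes in expert_votes]
predicted = [True, True, True, False]  # hypothetical system output
precision, recall = precision_recall(predicted, gold)
```

Using an odd number of experts avoids ties in the vote; with an even number, a tie-breaking rule would have to be stated.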

  28. Evaluation

  29. Discussion • The precision of extraction with SVM is higher than with ID3 because ID3 depends on feature occurrences, which do not affect SVM. • The 73% recall could be increased by using a larger corpus.

  30. Conclusion • Our model will be very beneficial for causal question answering and for causal generalization in knowledge discovery.

  31. Future work • Knowledge generalization

  32. Thank you
