
Information Extraction from Scientific Texts

Information Extraction from Scientific Texts. Junichi Tsujii Graduate School of Science University of Tokyo Japan. Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with


Presentation Transcript


  1. Information Extraction from Scientific Texts. Junichi Tsujii, Graduate School of Science, University of Tokyo, Japan

  2. Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with other sources such as databases and numerical data. Natural Language Processing -- IE

  3. Overview of GENIA System (diagram)
     • Retrieval Module: Request enhancement; Spawn request; Classify documents
     • Information Extraction Module: Identify & classify terms; Identify events
     • Interface Module: GUI; HTML conversion; System integration
     • Corpus Module: Markup generation / compilation; Annotated corpus construction
     • Database Module: DB design / access / management; DB construction
     • Concept Module: BK design / construction / compilation
     Other diagram labels: Corpus, Database, Ontology, Markup language, Data model, Raw (OCR) Text, Structure, Annotated, Background Knowledge, Document, Named-Entity, Event, MEDLINE, User, IR Request, Abstract, Full Paper, Security

  4. Plan • What is IE? • General Framework of NLP • Basic IE techniques • IE in Biology Automatic Term Recognition (S. Ananiadou)

  5. What is IE?

  6. Application Tasks of NLP
     (1) Information Retrieval/Detection: to search and retrieve documents in response to queries for information
     (2) Passage Retrieval: to search and retrieve parts of documents in response to queries for information
     (3) Information Extraction: to extract information that fits pre-defined database schemas or templates, which specify the output formats
     (4) Question/Answering Tasks: to answer general questions by using texts as a knowledge base (fact retrieval, a combination of IR and IE)
     (5) Text Understanding: to understand texts as people do (Artificial Intelligence)

  7. Ranges of Queries: (1) Information Retrieval/Detection (2) Passage Retrieval (3) Information Extraction (4) Question/Answering Tasks (5) Text Understanding. For (3) Information Extraction the queries are pre-defined: fixed aspects of information carried in texts.

  8. Example of IE: FASTUS (1993)
     Text: Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
     ACTIVITY-1
       Activity: PRODUCTION
       Company: “Bridgestone Sports Taiwan Co.”
       Product: “iron and ‘metal wood’ clubs”
       Start Date: DURING: January 1990
     TIE-UP-1
       Relationship: TIE-UP
       Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
       Joint Venture Company: “Bridgestone Sports Taiwan Co.”
       Activity: ACTIVITY-1
       Amount: NT$20000000


  13. FASTUS: based on finite state automata (FSA)
      1. Complex Words: recognition of multi-words and proper names (e.g. "set up", "new Taiwan dollars")
      2. Basic Phrases: simple noun groups, verb groups and particles (e.g. "a Japanese trading house", "had set up")
      3. Complex Phrases: complex noun groups and verb groups (e.g. "production of 20,000 iron and metal wood clubs")
      4. Domain Events: patterns for events of interest to the application, e.g. [company] [set up] [Joint-Venture] with [company]. Basic templates are built at this stage.
      5. Merging Structures: templates from different parts of the text are merged if they provide information about the same entity or event.
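
The domain-event pattern on this slide can be illustrated with a small sketch. It is not FASTUS itself: FASTUS cascades finite-state transducers over phrase chunks, whereas this sketch collapses the stages into plain regular expressions. The names `COMPANY`, `TIE_UP` and `extract_tie_up` are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

SENTENCE = (
    "Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan "
    "with a local concern and a Japanese trading house to produce golf clubs "
    "to be supplied to Japan."
)

# Assumed stand-in for the "complex words" stage: a company name is a run of
# capitalised words followed by "Co.".
COMPANY = r"(?:[A-Z][a-z]+ )+Co\."

# Assumed stand-in for the "domain events" stage:
# [company] ... [set up] [joint venture] (in [location]) with [partners]
TIE_UP = re.compile(
    rf"(?P<company>{COMPANY}) .*?set up a joint venture"
    r"(?: in (?P<location>[A-Z][a-z]+))?"
    r" with (?P<partners>.+?) to "
)


def extract_tie_up(sentence: str) -> Optional[dict]:
    """Return a partially filled TIE-UP template, or None if no pattern matches."""
    match = TIE_UP.search(sentence)
    if match is None:
        return None
    # Split the coordinated partner phrase on "and" to get one entity per partner.
    partners = [p.strip() for p in match.group("partners").split(" and ")]
    return {
        "relationship": "TIE-UP",
        "entities": [match.group("company"), *partners],
        "location": match.group("location"),
    }


template = extract_tie_up(SENTENCE)
print(template)
```

Sentences that do not match the pattern simply yield no template, which is the sense in which this style of IE ignores irrelevant parts of the text.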


  15. Information Extraction
      Text: ………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. ……….
      Name of the Venture: Yaxing Benz
      Products: buses and bus chassis
      Location: Yangzhou, China
      Companies involved:
        (1) Name: X? Country: Germany
        (2) Name: Y? Country: China

  16. Information Extraction: a different template for crimes
      Text: A German vehicle-firm executive was stabbed to death …. ………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. ……….
      Crime-Type: Murder
      Type: Stabbing
      The killed:
        Name: Jurgen Pfrang
        Age: 51
        Profession: Deputy general manager
      Location: Nanjing, China

  17. User User System System System Interpretation of Texts (1)Information Retrieval/Detection (2)Passage Retrieval (3)Information Extraction (4) Question/Answering Tasks (5)Text Understanding

  18. Characterization of Texts Queries IR System Collection of Texts

  19. Knowledge Interpretation Characterization of Texts Queries IR System Collection of Texts

  20. Knowledge Interpretation Characterization of Texts Queries Passage IR System Collection of Texts

  21. Knowledge Interpretation Characterization of Texts Queries Structures of Sentences NLP Templates Passage IR System IE System Collection of Texts Texts

  22. Knowledge Interpretation IE System Templates Texts

  23. Knowledge General Framework of NLP/NLU IE as compromise NLP Interpretation IE System Templates Texts Predefined

  24. Performance Evaluation: (1) Information Retrieval/Detection: rather clear (2) Passage Retrieval: a bit vague (3) Information Extraction: rather clear (4) Question/Answering Tasks: a bit vague (5) Text Understanding: very vague

  25. Query over a collection of documents. N: number of correct documents; M: number of retrieved documents; C: number of correct documents that are actually retrieved.
      Precision: $P = C / M$
      Recall: $R = C / N$
      F-Value: $F = \frac{2 P R}{P + R}$

  26. Query over a collection of documents. N: number of correct templates; M: number of retrieved templates; C: number of correct templates that are actually retrieved.
      Precision: $P = C / M$
      Recall: $R = C / N$
      F-Value: $F = \frac{2 P R}{P + R}$
      Evaluation is more complicated than for documents because templates can be partially filled.
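
As a worked example of the three measures, the sketch below computes them from the counts N, M and C defined on the slides. The function name `precision_recall_f` and the sample counts are illustrative assumptions. The partial-credit scoring needed for partially filled templates is not modelled here.

```python
def precision_recall_f(n_correct: int, m_retrieved: int, c_hits: int):
    """Compute (P, R, F) from N correct, M retrieved and C correct-and-retrieved items."""
    precision = c_hits / m_retrieved if m_retrieved else 0.0
    recall = c_hits / n_correct if n_correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 10 correct items exist, 8 were retrieved, 6 of those are correct.
p, r, f = precision_recall_f(n_correct=10, m_retrieved=8, c_hits=6)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```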

  27. General Framework of NLP

  28. General Framework of NLP John runs. Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation

  29. General Framework of NLP, input "John runs.". Morphological and Lexical Processing: John run+s, with John tagged P-N and run+s tagged V 3-pre or N plu. Remaining stages: Syntactic Analysis, Semantic Analysis, Context processing, Interpretation

  30. General Framework of NLP, input "John runs.". Morphological and Lexical Processing: John run+s (P-N; V 3-pre or N plu). Syntactic Analysis: S -> NP VP, NP -> P-N (John), VP -> V (run). Remaining stages: Semantic Analysis, Context processing, Interpretation

  31. General Framework of NLP, input "John runs.". Morphological and Lexical Processing: John run+s (P-N; V 3-pre or N plu). Syntactic Analysis: S -> NP VP, NP -> P-N (John), VP -> V (run). Semantic Analysis: Pred: RUN, Agent: John. Remaining stages: Context processing, Interpretation

  32. General Framework of NLP, input "John runs.". Morphological and Lexical Processing: John run+s (P-N; V 3-pre or N plu). Syntactic Analysis: S -> NP VP, NP -> P-N (John), VP -> V (run). Semantic Analysis: Pred: RUN, Agent: John. Context processing and Interpretation: "John is a student. He runs."

  33. General Framework of NLP Tokenization Morphological and Lexical Processing Part of Speech Tagging Inflection/Derivation Compounding Syntactic Analysis Term recognition (Ananiadou) Semantic Analysis Context processing Interpretation Domain Analysis Appelt:1999

  34. Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation

  35. Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Incomplete Lexicons Open class words Terms Term recognition Named Entities Company names Locations Numerical expressions Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation

  36. Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing Incomplete Grammar Syntactic Coverage Domain Specific Constructions Ungrammatical Constructions Syntactic Analysis Semantic Analysis Context processing Interpretation

  37. Predefined Aspects of Information Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Incomplete Domain Knowledge Interpretation Rules Context processing Interpretation

  38. Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing (2) Ambiguities: Combinatorial Explosion Syntactic Analysis Semantic Analysis Context processing Interpretation

  39. Difficulties of NLP (General Framework of NLP). (1) Robustness: Incomplete Knowledge. (2) Ambiguities: Combinatorial Explosion. At Morphological and Lexical Processing, most words in English are ambiguous in terms of their parts of speech, e.g. runs: v/3pre, n/plu; clubs: v/3pre, n/plu, and two meanings. Later stages: Syntactic Analysis, Semantic Analysis, Context processing, Interpretation
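
To see why part-of-speech ambiguity leads to combinatorial explosion, the sketch below enumerates every tag sequence for a short word string. The toy `LEXICON` and the word string "John runs clubs" are assumptions built from the tags on the slides; they are not a real lexicon.

```python
from itertools import product

# Toy lexicon (assumed): each word maps to its possible part-of-speech tags.
LEXICON = {
    "John": ["P-N"],
    "runs": ["V", "N"],
    "clubs": ["V", "N"],
}


def tag_sequences(words):
    """Enumerate every combination of tags, one tag per word."""
    return list(product(*(LEXICON[w] for w in words)))


sequences = tag_sequences(["John", "runs", "clubs"])
for seq in sequences:
    print(seq)
print(len(sequences), "candidate tag sequences")
```

The number of candidates is the product of the tag counts per word, so ten words with two tags each would already give 2**10 = 1024 sequences before any syntactic ambiguity is considered.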

  40. Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing (2) Ambiguities: Combinatorial Explosion Syntactic Analysis Structural Ambiguities Predicate-argument Ambiguities Semantic Analysis Context processing Interpretation

  41. Structural Ambiguities
      (1) Attachment Ambiguities: "John bought a car with large seats." / "John bought a car with $3000." / "The manager of Yaxing Benz, a Sino-German joint venture" / "The manager of Yaxing Benz, Mr. John Smith"
      (2) Scope Ambiguities: "young women and men in the room"
      (3) Analytical Ambiguities: "Visiting relatives can be boring."
      Semantic Ambiguities (1): "John bought a car with Mary." / "$3000 can buy a nice car."
      Semantic Ambiguities (2): "Every man loves a woman."
      Co-reference Ambiguities

  42. Combinatorial Explosion Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing (2) Ambiguities: Combinatorial Explosion Syntactic Analysis Structural Ambiguities Predicate-argument Ambiguities Semantic Analysis Context processing Interpretation

  43. Note: Ambiguities vs Robustness
      • More comprehensive knowledge (big dictionaries, comprehensive grammar): more robust
      • More comprehensive knowledge: more ambiguities
      • Adaptability: Tuning, Learning

  44. Framework of IE IE as compromise NLP

  45. Predefined Aspects of Information Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Incomplete Domain Knowledge Interpretation Rules Context processing Interpretation


  47. Techniques in IE
      (1) Domain Specific Partial Knowledge: knowledge relevant to the information to be extracted
      (2) Ambiguities: ignoring irrelevant ambiguities; simpler NLP techniques
      (3) Robustness: coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
      (4) Adaptation Techniques: Machine Learning, trainable systems

  48. General Framework of NLP: IE techniques at the Morphological and Lexical Processing stage
      • Part of Speech Tagger (95 %): FSA rules; statistical taggers (local context, statistical bias)
      • Open class words: Named entity recognition (F-Value 90, domain dependent), e.g. Locations, Persons, Companies, Organizations, Position names
        - Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
        - Machine Learning: HMM, Decision Trees
        - Rules + Machine Learning
      Later stages: Syntactic Analysis, Semantic Analysis, Context processing, Interpretation
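
The two domain-specific rules on this slide can be approximated with regular expressions. This is a sketch under two assumptions: that `<Word>` means a capitalised word and that `<Cpt-L>` means a single capital letter. The sample sentence, including the company name in it, is invented for illustration.

```python
import re

# Rule-based named entity patterns (assumed reading of the slide's notation).
RULES = [
    # <Word><Word>, Inc.  ->  two capitalised words followed by ", Inc."
    ("COMPANY", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+, Inc\.")),
    # Mr. <Cpt-L>. <Word>  ->  "Mr.", an initial, then a capitalised surname
    ("PERSON", re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")),
]


def find_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, span) for _, label, span in sorted(found)]


entities = find_entities("Mr. J. Smith joined Acme Widgets, Inc. last year.")
print(entities)
```

Rules of this kind need no dictionary entry for the name itself, which is how they cope with open class words. They remain domain dependent, as the slide notes.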

  49. FASTUS: based on finite state automata (FSA)
      1. Complex Words: recognition of multi-words and proper names
      2. Basic Phrases: simple noun groups, verb groups and particles
      3. Complex Phrases: complex noun groups and verb groups
      4. Domain Events: patterns for events of interest to the application. Basic templates are built at this stage.
      5. Merging Structures: templates from different parts of the text are merged if they provide information about the same entity or event.
      The steps are shown alongside the stages of the General Framework of NLP: Morphological and Lexical Processing, Syntactic Analysis, Semantic Analysis, Context processing, Interpretation
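
The merging step can be sketched as a compatibility check between partial templates. The criterion used here is an assumption made for illustration, not FASTUS's actual merging criterion: two templates merge only if they share at least one slot and agree on every slot they share. The slot names are hypothetical.

```python
from typing import Optional


def merge_templates(a: dict, b: dict) -> Optional[dict]:
    """Merge two partial templates, or return None if they are incompatible."""
    shared = a.keys() & b.keys()
    # No common slot: nothing indicates the templates describe the same entity or event.
    if not shared:
        return None
    # A conflicting value in any shared slot blocks the merge.
    if any(a[slot] != b[slot] for slot in shared):
        return None
    return {**a, **b}


first = {"company": "Bridgestone Sports Taiwan Co.", "activity": "PRODUCTION"}
second = {"company": "Bridgestone Sports Taiwan Co.", "start_date": "January 1990"}
other = {"company": "Yaxing Benz", "activity": "PRODUCTION"}

print(merge_templates(first, second))
print(merge_templates(first, other))
```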
