
Information Extraction

Information Extraction. Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology. Outline. Information Extraction Introduction Applications Table Reading Citation Extraction Chinese Named Entity Recognition. Introduction. Information Extraction.


Presentation Transcript


  1. Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology

  2. Outline • Information Extraction • Introduction • Applications • Table Reading • Citation Extraction • Chinese Named Entity Recognition

  3. Introduction

  4. Information Extraction • “extracts pieces of information that are salient to the user's needs”

  5. Message Understanding Conferences (MUC) Evaluations • The evaluations provide prepared data and task definitions, along with fully automated scoring software for measuring machine and human performance. • The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated. • The multilingual portion was known as the "Multilingual Entity Task (MET)".

  6. Examples The following fictional news story portrays the levels of detail that systems can extract: Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.

  7. Entities:

  8. Attributes:

  9. Facts:

  10. Events: COMPANY-FORMATION_EVENT: RELEASE-EVENT:

  11. Information Extraction • Current indicators of the state of the art:
      Items of Information | Percentile Reliability
      Entities             | 90
      Attributes           | 80
      Facts                | 70
      Events               | 60

  12. Technical definition of IE • The process of creating database entries by skimming a text and looking for occurrences of a particular class of object or event and for relationships among those objects and events [Russell, Norvig 2003]

  13. Basic IE tasks • Extract addresses from Web pages • target: street, city, state, and zip code • Extract storms from weather reports • target: temperature, wind speed, and precipitation

  14. IE Applications • Competitive intelligence • find instances of corporate mergers and joint ventures • Intelligence gathering • terrorist activities • any damage to buildings or the infrastructure, as well as the time and location of the event • Health care delivery • summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments

  15. Technology • Methods in the literature • Regular expressions • Cascaded finite-state transducers • Our approaches • Ontological domain knowledge • Machine learning • Hybrid method

  16. Regular expression approach example • From the text • “17in SXGA Monitor for only $249.99” • Extract • \( \exists m\; m \in \mathit{ComputerMonitors} \land \mathit{Size}(m, \mathit{Inches}(17)) \land \mathit{Price}(m, \$(249.99)) \land \mathit{Resolution}(m, 1280 \times 1024) \)

  17. Regular expressions and what they match
      Regular expression         | Matches
      [0-9]                      | Any digit from 0 to 9
      [0-9]+                     | One or more digits
      \.[0-9][0-9]               | A period followed by two digits
      (\.[0-9][0-9])?            | A period followed by two digits, or nothing
      \$[0-9]+(\.[0-9][0-9])?    | $249.99, $1.23, $100000, …
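The price pattern on this slide can be tried directly. The following is a minimal Python sketch, assuming the standard `re` module and with the period and dollar sign escaped so that they match literally; `first_price` is a hypothetical helper name, not something from the slides.

```python
import re

# A dollar sign, one or more digits, then optionally a period
# followed by exactly two digits (the last pattern on the slide).
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")


def first_price(text):
    """Return the first price-like substring in text, or None if there is none."""
    match = PRICE.search(text)
    return match.group(0) if match else None


print(first_price("17in SXGA Monitor for only $249.99"))
```

The sketch only locates the price string; turning it into a fact such as Price(m, $(249.99)) still needs the monitor entity to attach it to.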

  18. Weakness • What’s the price ? • “List price $99.00, special sale price $78.00, shipping $3.00.”
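The weakness is easy to reproduce: the same pattern fires three times on this sentence and gives no hint about which match is "the" price. The Python sketch below shows this, and then pairs each price with the words in front of it as one possible way to expose the context. `labelled_prices` and the split on commas are assumptions made for this illustration; they fit this sentence only and are not a method from the slides.

```python
import re

PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")

TEXT = "List price $99.00, special sale price $78.00, shipping $3.00."


def all_prices(text):
    """Return every price-like substring; the pattern cannot rank them."""
    return PRICE.findall(text)


def labelled_prices(text):
    """Pair each price with the words that precede it in its comma-separated clause."""
    pairs = []
    for clause in text.split(","):
        match = PRICE.search(clause)
        if match:
            label = clause[:match.start()].strip().lower()
            pairs.append((label, match.group(0)))
    return pairs


print(all_prices(TEXT))
print(labelled_prices(TEXT))
```

Deciding that "special sale price" is the relevant label still requires knowledge outside the regular expression, which is the point of the slide.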

  19. Cascaded finite-state transducers approach example • From • “Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.” • Extract • \( \exists e\; e \in \mathit{JointVentures} \land \mathit{Product}(e, \text{"golf clubs"}) \land \mathit{Date}(e, \text{"Friday"}) \land \mathit{Entity}(e, \text{"Bridgestone Sports Co"}) \land \mathit{Entity}(e, \text{"a local concern"}) \land \mathit{Entity}(e, \text{"a Japanese trading house"}) \)

  20. Cascaded finite-state transducers • A typical relational extraction system consists of the following five stages: • Tokenization • Complex word handling • Basic group handling • Complex phrase handling • Structure merging

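  21. Tokenization • Word segmentation • 土地公有政策 -> 土地|公有|政策 ("land | public ownership | policy") or 土地公|有政策 ("Earth God | has policy") • Complex word handling • “Bridgestone Sports Co.” • CapitalizedWord+(“Company”|“Co”|“Inc”|“Ltd”) • “Intel Chairman Andy Grove” • CapitalizedWord+(“Grove”|“Forest”|“Village”|…) • 謝深“山” (Hsieh Shen-shan; 山 means "mountain"), 郝柏“村” (Hau Pei-tsun; 村 means "village")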
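Both points on this slide can be sketched in a few lines of Python. The first function enumerates every dictionary-based segmentation of a string, which shows why 土地公有政策 is ambiguous; the second applies the company-name rule as a regular expression. The lexicon is a tiny hypothetical word list chosen for this example, and the names `segmentations`, `LEXICON`, and `COMPANY` are assumptions, not part of the slides.

```python
import re


def segmentations(text, lexicon):
    """Return every way to split text into words that all appear in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical mini-lexicon: land, Earth God, public ownership, to have, policy.
LEXICON = {"土地", "土地公", "公有", "有", "政策"}

for words in segmentations("土地公有政策", LEXICON):
    print("|".join(words))

# CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd"), with an optional final period.
COMPANY = re.compile(r"(?:[A-Z][a-z]+ )+(?:Company|Co|Inc|Ltd)\b\.?")

match = COMPANY.search("Bridgestone Sports Co. said Friday it has set up a joint venture")
print(match.group(0) if match else None)
```

With this lexicon the second reading comes out as 土地公|有|政策 rather than the two-part split on the slide, because 有政策 is not a lexicon entry. The same kind of surface rule applied to place-name suffixes would misfire on "Andy Grove", which is what the slide's counterexamples illustrate.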

  22. Basic group handling • NG = noun group, VG = verb group, PR = preposition, CJ = conjunction
      1 NG: Bridgestone Sports Co.
      2 VG: said
      3 NG: Friday
      4 NG: it
      5 VG: has set up
      6 NG: a joint venture
      7 PR: in
      8 NG: Taiwan
      9 PR: with
      10 NG: a local concern
      11 CJ: and
      12 NG: a Japanese trading house
      13 VG: to produce
      14 NG: golf clubs
      15 VG: to be shipped
      16 PR: to
      17 NG: Japan

  23. Complex phrase handling • Company+ SetUp JointVenture (“with” Company+)? • Structure merging • Merge structures when the next sentence says something about the same event.
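The complex-phrase rule can be illustrated on the chunked Bridgestone sentence. This Python sketch maps each relevant chunk to a one-letter symbol and runs an ordinary regular expression over the symbol string. It is a simplification under stated assumptions: chunks with no symbol (such as "said", "Friday", "in Taiwan", and the conjunction) are simply skipped, and `COMPANY_NGS` is a hand-written stand-in for a real company recognizer. None of these names come from the slides.

```python
import re

# Chunked sentence: (tag, text) pairs following the basic-group output.
CHUNKS = [
    ("NG", "Bridgestone Sports Co."), ("VG", "said"), ("NG", "Friday"),
    ("NG", "it"), ("VG", "has set up"), ("NG", "a joint venture"),
    ("PR", "in"), ("NG", "Taiwan"), ("PR", "with"),
    ("NG", "a local concern"), ("CJ", "and"),
    ("NG", "a Japanese trading house"), ("VG", "to produce"),
    ("NG", "golf clubs"), ("VG", "to be shipped"), ("PR", "to"),
    ("NG", "Japan"),
]

# Assumption: noun groups already known to denote companies.
COMPANY_NGS = {"Bridgestone Sports Co.", "a local concern", "a Japanese trading house"}


def symbol(tag, text):
    """Map a chunk to C(ompany), S(etUp), J(ointVenture), W(ith), or None."""
    if tag == "NG" and text in COMPANY_NGS:
        return "C"
    if tag == "VG" and text.endswith("set up"):
        return "S"
    if tag == "NG" and text.endswith("joint venture"):
        return "J"
    if tag == "PR" and text == "with":
        return "W"
    return None


# Company+ SetUp JointVenture ("with" Company+)?
PATTERN = re.compile(r"(C+)SJ(?:W(C+))?")


def find_joint_venture(chunks):
    """Return founders and partners if the chunk sequence matches the rule."""
    kept = [(symbol(tag, text), text) for tag, text in chunks]
    kept = [(sym, text) for sym, text in kept if sym is not None]
    match = PATTERN.search("".join(sym for sym, _ in kept))
    if match is None:
        return None
    founders = [kept[i][1] for i in range(match.start(1), match.end(1))]
    partners = []
    if match.group(2):
        partners = [kept[i][1] for i in range(match.start(2), match.end(2))]
    return {"founders": founders, "partners": partners}


print(find_joint_venture(CHUNKS))
```

Because each symbol is one character, match offsets in the symbol string index directly into the list of kept chunks, which keeps the mapping back to text trivial.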

  24. A brief remark • IE works well for a restricted domain • The subjects, and the ways they are mentioned, must be determined in advance

  25. Applications

  26. Table Reading • Citation Extraction • Chinese NER

  27. Semantic Search on Internet Tabular Information Extraction for Answering Queries CIKM 2000

  28. Table Reading • Gives an algorithm to interpret tables of the type shown below, where some cells span multiple rows or columns. • An example of an interpretation is: (Attribute)=>(Value) (Adult-Price-Single Room-Economic class)=>35,450
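To make the (Attribute)=>(Value) reading concrete, here is a minimal Python sketch that expands spanning cells into a full grid and then builds each attribute path from the row and column headers. It is plain span expansion, not the tagging and layout-recognition method of the paper. The table is hypothetical: only the value 35,450 and its attribute path come from the slide, and every other cell was invented for illustration.

```python
def expand(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of strings."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip positions filled by a span from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def interpret(grid, header_rows, header_cols):
    """Map 'row headers-column headers' attribute paths to cell values."""
    result = {}
    for r in range(header_rows, len(grid)):
        for c in range(header_cols, len(grid[r])):
            row_path = [grid[r][k] for k in range(header_cols)]
            col_path = [grid[k][c] for k in range(header_rows)]
            result["-".join(row_path + col_path)] = grid[r][c]
    return result


# Hypothetical table; each cell is (text, rowspan, colspan).
TABLE = [
    [("", 2, 2), ("Single Room", 1, 2)],
    [("Economic class", 1, 1), ("First class", 1, 1)],
    [("Adult", 2, 1), ("Price", 1, 1), ("35,450", 1, 1), ("40,000", 1, 1)],
    [("Extra bed", 1, 1), ("3,000", 1, 1), ("4,000", 1, 1)],
]

facts = interpret(expand(TABLE), header_rows=2, header_cols=2)
print(facts["Adult-Price-Single Room-Economic class"])
```

The sketch is told how many header rows and columns there are; recognizing that layout automatically is the harder problem the slides go on to address.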

  29. Table Reading

  30. Method [Figure: method overview. Labels: HTML Table; Tagging (Ambiguous); Relations of Cells; Layout Recognition; C-I Table Layout Description; Layout Transformation; Layout Transition Rule; Database Table]

  31. Method • Tagging • Layout Identifying • Layout Transformation

  32. Airline Schedule Ontology

  33. Tagging [Figure: tagged cells and their relations. Concept vs. Descent Concept: C: Departure Information, C: Departure City, C: Arrival City; Concept vs. Instance of the Concept: C: Departure City, I: Departure City; Instance vs. Instance of the same Concept: I: Departure City, I: Arrival City]

  34. Four Relations of Table Cells • Relations of Concept - Instances • Concept - Instance of the Concept • Concept - Descent Concept • Concept - Instance of Descent Concept • Instance - Instance of the same Concept
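The four relations can be written as a small check over tagged cells. The Python sketch below is illustrative only: the parent link between "Departure City" and "Departure Information" is an assumption read off the tagging slide, the city names are hypothetical instances, and `relation` is not a function from the paper.

```python
# Assumed concept hierarchy: child concept -> parent concept.
PARENT = {
    "Departure City": "Departure Information",
}

# Hypothetical instances: instance text -> concept it belongs to.
INSTANCE_OF = {
    "Taipei": "Departure City",
    "Kaohsiung": "Departure City",
    "Tokyo": "Arrival City",
}


def is_descendant(concept, ancestor):
    """True if ancestor is reachable from concept by following PARENT links."""
    while concept in PARENT:
        concept = PARENT[concept]
        if concept == ancestor:
            return True
    return False


def relation(a, b):
    """Classify the relation of cell a to cell b; each is (kind, name), kind 'C' or 'I'."""
    kind_a, name_a = a
    kind_b, name_b = b
    if kind_a == "C" and kind_b == "C":
        if is_descendant(name_b, name_a):
            return "Concept - Descent Concept"
        return None
    if kind_a == "C" and kind_b == "I":
        concept = INSTANCE_OF.get(name_b)
        if concept == name_a:
            return "Concept - Instance of the Concept"
        if concept is not None and is_descendant(concept, name_a):
            return "Concept - Instance of Descent Concept"
        return None
    if kind_a == "I" and kind_b == "I":
        concept_a = INSTANCE_OF.get(name_a)
        if concept_a is not None and concept_a == INSTANCE_OF.get(name_b):
            return "Instance - Instance of the same Concept"
    return None


print(relation(("C", "Departure Information"), ("I", "Taipei")))
```

The check is directional (the concept cell comes first), and a `None` result means the two cells stand in none of the four relations under this toy ontology.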

  35. Layout Recognition [Figure: C-I Table; Layout Descriptions (defined by Layout Syntax Grammar); Template Matching; Matched Layout Description]

  36. Layout Transformation [Figure: Origin Layout Description; Destination Layout Description]

  37. Experiments • 23 tables from 23 web pages • 13 two-dimensional tables, 10 complex tables • A table counts as a success only if there are no misses; any miss counts as a failure

  38. Conclusion & Future Work • Layout Transformation from complex tables to simple tables (1D, 2D). • A general approach • 1. Tagging • 2. Semantic Layout Recognition • 3. Layout Transformation • Ambiguity is reduced by checking cell relations

  39. References • Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu, W. K. Shih, "Semantic Search on Internet Tabular Information Extraction for Answering Queries," Ninth International Conference on Information and Knowledge Management (CIKM-2000), McLean, VA, November 6-11, 2000, pp. 243-249. (EI) • H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, "Mining Tables from Large Scale HTML Texts," in Proc. 18th International Conference on Computational Linguistics, Saarbrücken, Germany, July 2000.

  40. A Knowledge-based Approach to Citation Extraction IRI-2005

  41. Introduction • Integration of the bibliographical information of scholarly publications available on the Internet • Accurate reference metadata extraction from heterogeneous reference sources. • We propose a knowledge-based approach to reference metadata extraction • INFOMAP: ontological knowledge representation framework • Automatically extract the reference metadata.

  42. Proposed Approach

  43. Phase 1 Reference Data Collection • Journal Spider (journal agent) • collects journal data from the Journal Citation Reports (JCR) indexed by the ISI and from digital libraries on the Web. • Citation data sources • ISI Web of Science • DBLP • CiteSeer • PubMed

  44. Phase 2 Domain Knowledge

  45. INFOMAP • INFOMAP as an ontological knowledge representation framework • extracts important citation concepts from natural language text. • Features of INFOMAP • represents and matches complicated template structures • hierarchical matching • regular expressions • semantic template matching • frame (non-linear relations) matching • Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.

  46. Phase 3 Reference Metadata Extraction Table 1. Examples of different journal reference styles

  47. Phase 4 Knowledge-based Reference Metadata Extraction - Online Service

  48. Citation Extraction: From Text to BibTeX
      Figure 3. The system input of knowledge-based RME
      W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
      W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
      W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.
      Figure 5. The system output in BibTeX format
      @article{ Author = {W. L. Hsu}, Title = {The coloring and maximum independent set problems on planar perfect graphs,"}, Journal = {J. Assoc. Comput. Machin.}, Volume = {}, Number = {}, Pages = {535-563}, Year = {1988 }}
      @article{ Author = {W. L. Hsu}, Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"}, Journal = {Management Science}, Volume = {29}, Number = {}, Pages = {93-105}, Year = {1983 }}
      @article{ Author = {W. L. Hsu}, Title = {The distance-domination numbers of trees,"}, Journal = {Operations Research Letters}, Volume = {1}, Number = {3}, Pages = {96-100}, Year = {1982 }}
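For the single reference style shown on this slide, the text-to-BibTeX step can be approximated with one regular expression. The Python sketch below is a stand-in for illustration, not the INFOMAP knowledge-based matcher described in the talk, and it handles only this one style. `parse_reference`, `to_bibtex`, and the citation key passed to `to_bibtex` are hypothetical names introduced here.

```python
import re

# Author, "Title," Journal [Volume], [(Number),] (Year), Pages.
REFERENCE = re.compile(
    r'^(?P<author>.+?), "(?P<title>.+?),?" '
    r'(?P<journal>.+?)(?: (?P<volume>\d+))?, '
    r'(?:\((?P<number>\d+)\), )?'
    r'\((?P<year>\d{4})\), (?P<pages>\d+-\d+)\.$'
)


def parse_reference(reference):
    """Return a dict of fields ('' for missing ones), or None if the style does not match."""
    match = REFERENCE.match(reference.strip())
    if match is None:
        return None
    return {name: value or "" for name, value in match.groupdict().items()}


def to_bibtex(fields, key):
    """Render parsed fields as a BibTeX @article entry under the given citation key."""
    lines = [f"@article{{{key},"]
    for name in ("author", "title", "journal", "volume", "number", "pages", "year"):
        lines.append(f"  {name} = {{{fields[name]}}},")
    lines.append("}")
    return "\n".join(lines)


REF = ('W. L. Hsu, "The distance-domination numbers of trees," '
       "Operations Research Letters 1, (3), (1982), 96-100.")

print(to_bibtex(parse_reference(REF), key="hsu1982"))
```

A hand-written pattern like this has to be rewritten for every reference style, which is the maintenance problem a template-based knowledge representation is meant to reduce.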

  49. Figure 6. The online service of knowledge-based RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/) [Screenshot panels: System Input (plain text); System Output; Output BibTeX]

  50. Experimental Results and Discussion • Experimental data • We used EndNote to collect Bioinformatics citation data for 2004 from PubMed. • A total of 907 bibliography records were collected from PubMed digital libraries on the Web. • Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB). • We randomly selected 500 records for testing from each of the six reference styles.
