
Using Automatically Extracted Information in Species Page Retrieval: a use case

Xiaoya Tang and P. Bryan Heidorn. Biodiversity Standards Conference, September 2007, Bratislava, Slovakia.


Presentation Transcript


  1. Using Automatically Extracted Information in Species Page Retrieval: a use case. Xiaoya Tang and P. Bryan Heidorn. Biodiversity Standards Conference, September 2007, Bratislava, Slovakia

  2. Legacy(NL) - Modernist(RDF) bridge • Taxonomic Literature contains facts • Large collections available (BDL) • OCR and extraction tools limit current use • Need rapid search and discovery beyond full text • Need concept normalization 2007

  3. Criticism: "Knowledge extraction tools are not 100% complete or accurate, so extraction is not worth doing." Bunk!!! • If you use Google, you are already doing probabilistic search, and it is useful • Controlled experiments provide evidence that it is useful.

  4. Goal: Keys + Google together • Information needed for plant identification • Key-like information • Accurate • Specific • Keyword-based retrieval on semi-structured collections • Keywords as poor content representations • Difficulties in creating keyword queries, esp. for end users • Not able to make use of the document structure

  5. An Example Document Excerpt ……….. Plants, flowering to 2 m. Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, finely appressed-scaly; sheath pale or slightly rust colored, ovate, not inflated, not forming pseudobulb, 6–15 cm wide; blade linear-triangular, leathery, channeled to involute, apex attenuate. Inflorescences: scape, erect, 20–50 cm, 6–12 mm diam.; bracts densely imbricate proximally, often lax distally, erect to spreading, like leaves but gradually smaller; spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, linear, with laxly appressed bracts, 15–40 × 10–15 cm, apex acute; branches 5–40 (rarely simple). Floral bracts widely spaced, erect, green or tinged purple, exposing most of rachis at anthesis, ovate, not keeled, 1.2–2 cm, leathery, venation slight, apex acute, glabrous. Flowers 10–200, conspicuous; sepals free, elliptic, not keeled, 1.4–2 cm, thin-leathery, veined, apex obtuse; corolla tubular, somewhat bilaterally symmetric, petals erect, slightly twisted, white, ligulate, to 4 cm; stamens exserted; stigma exserted, conduplicate-spiral. Fruits to 4 cm. n = 25. …………..

  6. SDD + TDWG-Lit ?= • SDD: All structured + NL • Literature: Rich, human friendly and semi-structured facts • We want to associate a set of characters and states with links to evidence in the text for an assertion without destroying the text. • Mixture of key and text retrieval

  7. Location of expressions • External Standoff Markup is-a External Document/Object, requires unique text identifier + offset • Internal Standoff markup part-of Literature Document Markup, requires offset • Internal Integrated markup impossible

  8. Why bother? • Full structured coding of natural language taxonomic descriptions is out of our reach • Partial extraction of facts can aid identification. • There is a need to accumulate information over time. • Prior fact patterns can be used to find similar patterns in new texts without human intervention. • Potential for ontology induction

  9. Location of expressions • External Standoff Markup is-a External Document/Object, requires unique text identifier + offset • Allows information merging from multiple sources • Internal Standoff markup part-of Literature Document Markup, requires offset • Internal Integrated markup impossible
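
The external standoff option (a unique text identifier plus an offset) can be illustrated with a minimal Python sketch. The class name `StandoffAnnotation`, the identifier `fna:example-1`, and the field names are hypothetical and chosen only for illustration; the slides do not specify a concrete format.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StandoffAnnotation:
    """A fact stored outside the text, pointing back at its evidence."""
    text_id: str    # hypothetical unique identifier of the source text
    start: int      # character offset where the evidence begins
    end: int        # character offset just past the evidence
    subject: str
    predicate: str
    value: str

    def evidence(self, texts: Dict[str, str]) -> str:
        # Resolve the identifier + offsets back to the untouched source text.
        return texts[self.text_id][self.start:self.end]


# The source text is never modified; annotations live beside it.
texts = {
    "fna:example-1": "Leaves 20–75, many-ranked, spreading and recurved, "
                     "to 1 m × 1.5–3.5 cm."
}
span = "to 1 m"
start = texts["fna:example-1"].index(span)
ann = StandoffAnnotation("fna:example-1", start, start + len(span),
                         "leaves", "have-length", "to 1 m")
print(ann.evidence(texts))
```

Because each annotation carries its own text identifier, annotations produced by different tools or people can be pooled in one list without touching the documents they describe.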

  10. Example 1: False match between query and index terms • User query: “3 leaves” • Document text: …... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, ……. • False match: “3 and leaves”

  11. Example 2: Different vocabularies in queries and documents • User query: “long leaves” • Description of leaf length in texts: …... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …….

  12. What’s the Problem? • Keyword-based retrieval only allows queries and documents to be matched • Based on string occurrence • Not based on semantic meaning • Earlier examples revisited • Query “3 leaves”: string match “3 and leaves”; semantic match “Leaf number is 3” • Query “long leaves”: string match “long and leaves”; semantic match “Leaf length > 50cm”
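
The two failure modes from Example 1 and Example 2 can be reproduced with a small Python sketch of Boolean AND keyword matching. The tokenizer and the function names are assumptions made for illustration; they are not the retrieval system described in the talk.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphabetic and numeric tokens."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def keyword_and_match(query, document):
    """True when every query token occurs somewhere in the document."""
    return tokens(query) <= tokens(document)


doc = ("Leaves 20–75, many-ranked, spreading and recurved, "
       "to 1 m × 1.5–3.5 cm. Inflorescences: spikes very laxly "
       "6–11-flowered, erect to spreading, 2–3-pinnate.")

# Example 1: "3" is found inside "2–3-pinnate", so the document matches
# even though it describes 20–75 leaves.
print(keyword_and_match("3 leaves", doc))

# Example 2: the text gives leaf length as "to 1 m" and never says "long",
# so a relevant document is missed.
print(keyword_and_match("long leaves", doc))
```

The first query produces a false match and the second a miss, which is the gap that matching on extracted semantic information is meant to close.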

  13. Approach • Identify useful semantic information within full-text documents using Information Extraction techniques • Allow users to search based on semantic meaning via structured semantic information

  14. Approach [Diagram. Labels: Keyword query, String match, Text, Identifying Semantic Information, Semantic information (twice), Semantic match.]

  15. Add facts to the text • <leaves have-cardinality 3> • <leaves have-length .gt. 50cm>
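
A minimal Python sketch of matching on facts like these, rather than on strings, follows. The fact table is hand-entered from the example excerpt (leaves 20–75, length to 1 m), and the representation (a dictionary of ranges keyed by part and attribute) is an assumption for illustration, not the format used by the authors.

```python
# Facts for one document, hand-entered from the excerpt for illustration.
# Each value is a (low, high) range; None means the bound is not stated.
doc_facts = {
    ("leaves", "cardinality"): (20, 75),
    ("leaves", "length_cm"): (None, 100.0),  # "to 1 m"
}


def cardinality_is(n):
    """Query condition corresponding to <leaves have-cardinality n>."""
    return lambda rng: rng[0] <= n <= rng[1]


def length_gt(cm):
    """Query condition corresponding to <leaves have-length .gt. cm>."""
    return lambda rng: rng[1] is not None and rng[1] > cm


def semantic_match(facts, part, attribute, condition):
    """True when the document has the fact and it satisfies the condition."""
    value = facts.get((part, attribute))
    return value is not None and condition(value)


# "3 leaves": the document has 20–75 leaves, so it no longer matches.
print(semantic_match(doc_facts, "leaves", "cardinality", cardinality_is(3)))

# "long leaves" read as length > 50 cm: leaves reach 1 m, so it matches.
print(semantic_match(doc_facts, "leaves", "length_cm", length_gt(50)))
```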

  16. Morphological Information Extraction System • Features: computationally inexpensive, automatic, portable, accurate • IE techniques: machine learning, knowledge bases, partial parsing

  17. Training [Diagram. Labels: Doc1, Doc2, …, Doc60, each paired with "Extracted information for" that document; Machine Learning; Rules; Information extraction system.]

  18. IE System Adaptation [Diagram. Components: FNA documents, Pre-processing, Rule creation module, Rules, Extraction, Knowledge bases, Templates for useful information, Query analysis, Structured information. Annotations: "Modified in the new domain", "Automatically learned in the new domain", "Updated in the new domain".]

  19. IE System Training [Diagram. Components: Training documents, Pre-processing module, Manually tagged instances, Knowledge bases, Learning module, Learned Rules.]

  20. Information Extraction From FNA • User log analysis • Templates for useful information: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, ….. • Knowledge bases: PartBlade: Leaf blade, Blades, blade, …… • Rules: (1) Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' * Output:: leaf {leafShape $1} (2) Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase> Output:: leaf {bladeDimension $1} • Original documents: ……….. Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate,. ……………… • Structured information (extraction output): Leaf_Shape obovate; Leaf_Shape orbiculate; Blade_Dimension 3--9 × 3--8 cm; …………..
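
The two rules on this slide can be approximated with regular expressions, as in the Python sketch below. This is not the authors' rule engine: the exact semantics of the Pattern/Output notation are not given, so the regexes are a rough reading of it. The `PartBlade` entries come from the slide; the contents of `LEAF_SHAPE`, `LENG_UNIT`, and `PART_BASE` are assumed for illustration.

```python
import re

PART_BLADE = ["Leaf blade", "Blades", "blade"]  # knowledge base from the slide
LEAF_SHAPE = ["obovate", "orbiculate"]          # assumed entries
LENG_UNIT = ["cm", "mm", "m"]                   # assumed entries
PART_BASE = ["base"]                            # assumed entries
RANGE = r"\d+(?:\.\d+)?--\d+(?:\.\d+)?"         # e.g. "3--9"


def alt(terms):
    """Build a regex alternation, longest term first."""
    ordered = sorted(terms, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"


def extract(sentence):
    """Return (template slot, value) pairs found in one description."""
    found = []
    # Rule 1 (approximation): shape terms between the blade term and
    # the first comma fill the Leaf_Shape slot.
    m = re.search(alt(PART_BLADE) + r" ([^,]*),", sentence)
    if m:
        for shape in re.findall(alt(LEAF_SHAPE), m.group(1)):
            found.append(("Leaf_Shape", shape))
    # Rule 2 (approximation): a range followed by a length unit, located
    # between the blade term and the base term, fills Blade_Dimension.
    pattern = (alt(PART_BLADE) + r".*?, (" + RANGE + r" .*?"
               + alt(LENG_UNIT) + r")\b.*?" + alt(PART_BASE))
    m = re.search(pattern, sentence)
    if m:
        found.append(("Blade_Dimension", m.group(1)))
    return found


sentence = ("Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, "
            "leathery, base obtuse to broadly cuneate, margins flat")
for slot, value in extract(sentence):
    print(slot, value)
```

Keeping the term lists separate from the patterns mirrors the slide's split between knowledge bases and rules: adapting to a new collection would mean updating the lists and relearning the patterns.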

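  21. Results - IE • Recall = correct/possible (correctly extracted items over all items that should have been extracted) • Precision = correct/actual (correctly extracted items over all items the system actually extracted)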
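
The two measures can be computed as in the following Python sketch. The gold-standard and extracted sets are toy data made up for illustration; they are not results from the talk.

```python
def recall(correct, possible):
    """Recall = correct / possible."""
    return correct / possible if possible else 0.0


def precision(correct, actual):
    """Precision = correct / actual."""
    return correct / actual if actual else 0.0


def score(extracted, gold):
    """Compare extracted (slot, value) pairs with a hand-tagged gold set."""
    correct = len(extracted & gold)
    return precision(correct, len(extracted)), recall(correct, len(gold))


# Toy data, invented for illustration only.
gold = {
    ("Leaf_Shape", "obovate"),
    ("Leaf_Shape", "orbiculate"),
    ("Blade_Dimension", "3--9 × 3--8 cm"),
}
extracted = {
    ("Leaf_Shape", "obovate"),
    ("Leaf_Shape", "serrate"),   # a wrong extraction
}

p, r = score(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```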

  22. Retrieval System Design and User Evaluation • FNA collection • SEARF: Keyword retrieval • Information extraction • SEARFA: Retrieval with keywords + structured semantic information • User evaluation of each system • Performance comparison

  25. Results – System Performance • NT: number of tasks accomplished in total • NTH: number of tasks accomplished per hour • TSR: task success rate • SSR: search success rate • NSST: number of searches to accomplish a task • TST: time spent to accomplish a task • NDVST: number of documents viewed to accomplish a task

  26. Results – User Satisfaction

  27. Limitations and Future Work • Generalization of text collections • Other collections in the same domain and other domains • Generalization of IE applications • Document representations • A wider range of attributes • Query formulation and interface design • Online term definitions • Visualized search interface • Retrieval algorithms • More accurate matching
