Using Automatically Extracted Information in Species Page Retrieval: a use case

Xiaoya Tang and P. Bryan Heidorn
Biodiversity Standards Conference, September 2007, Bratislava, Slovakia

Legacy (NL) - Modernist (RDF) bridge
• Taxonomic literature contains facts
• Large collections available (BDL)
• OCR and extraction tools limit current use
• Need rapid search and discovery beyond full text
• Need concept normalization

Criticism
Knowledge extraction tools are not 100% complete or accurate, so extraction is not worth doing. Bunk!!!
• If you use Google you are already doing probabilistic search, and it is useful
• Controlled-experiment evidence shows that it is useful

Goal: Keys + Google together
• Information needed for plant identification
  • Key-like information
  • Accurate
  • Specific
• Keyword-based retrieval on semi-structured collections
  • Keywords as poor content representations
  • Difficulties in creating keyword queries, esp. for end users
  • Not able to make use of the document structure

An Example Document Excerpt
……….. Plants, flowering to 2 m. Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, finely appressed-scaly; sheath pale or slightly rust colored, ovate, not inflated, not forming pseudobulb, 6–15 cm wide; blade linear-triangular, leathery, channeled to involute, apex attenuate. Inflorescences: scape, erect, 20–50 cm, 6–12 mm diam.; bracts densely imbricate proximally, often lax distally, erect to spreading, like leaves but gradually smaller; spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, linear, with laxly appressed bracts, 15–40 × 10–15 cm, apex acute; branches 5–40 (rarely simple). Floral bracts widely spaced, erect, green or tinged purple, exposing most of rachis at anthesis, ovate, not keeled, 1.2–2 cm, leathery, venation slight, apex acute, glabrous. Flowers 10–200, conspicuous; sepals free, elliptic, not keeled, 1.4–2 cm, thin-leathery, veined, apex obtuse; corolla tubular, somewhat bilaterally symmetric, petals erect, slightly twisted, white, ligulate, to 4 cm; stamens exserted; stigma exserted, conduplicate-spiral. Fruits to 4 cm. n = 25. …………..

SDD + TDWG-Lit ?=
• SDD: All structured + NL
• Literature: Rich, human-friendly and semi-structured facts
• We want to associate a set of characters and states with links to evidence in the text for an assertion, without destroying the text.
• Mixture of key and text retrieval

Location of expressions
• External standoff markup is-a External Document/Object; requires unique text identifier + offset
• Internal standoff markup part-of Literature Document Markup; requires offset
• Internal integrated markup impossible

Why bother?
• Full structured coding of natural language taxonomic descriptions is out of our reach.
• Partial extraction of facts can aid identification.
• There is a need to accumulate information over time.
• Prior fact patterns can be used to find similar patterns in new texts without human intervention.
• Potential for ontology induction.

Location of expressions
• External standoff markup is-a External Document/Object; requires unique text identifier + offset
  • Allows information merging from multiple sources
• Internal standoff markup part-of Literature Document Markup; requires offset
• Internal integrated markup impossible
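External standoff markup can be pictured as a record that lives outside the text and points back into it by text identifier and character offsets. The following is a minimal sketch under stated assumptions, not the format used in the talk: the class name `StandoffFact`, its fields, and the identifier `doc-1` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffFact:
    """One extracted assertion, stored outside the text it describes (hypothetical layout)."""
    doc_id: str     # unique text identifier
    start: int      # character offset where the evidence begins
    end: int        # character offset just past the evidence
    subject: str
    predicate: str
    value: str


def evidence(fact: StandoffFact, texts: dict[str, str]) -> str:
    """Return the span of source text that supports the fact."""
    return texts[fact.doc_id][fact.start:fact.end]


# The source text is stored once and is never modified by the markup.
texts = {"doc-1": "Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green"}

span = "20–75"
start = texts["doc-1"].index(span)
fact = StandoffFact("doc-1", start, start + len(span), "leaves", "have-cardinality", span)

print(evidence(fact, texts))
```

Because the facts only reference the text, records from several sources can be merged for the same `doc_id` without touching the document itself.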

Example 1: False match between query and index terms
User query: 3 leaves
Document text: …... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …….
False match: “3 and leaves”

Example 2: Different vocabularies in queries and documents
User query: Long leaves
Description of leaf length in texts: …... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …….

What’s the Problem?
• Keyword-based retrieval only allows queries and documents to be matched
  • Based on string occurrence
  • Not based on semantic meaning
• Early examples revisited
  • Query “3 leaves”: string match is “3 and leaves”; semantic match is “Leaf number is 3”
  • Query “long leaves”: string match is “long and leaves”; semantic match is “Leaf length > 50cm”
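The false match from Example 1 can be reproduced with a few lines of code. This is a minimal sketch of Boolean AND keyword matching, assuming a simple tokenizer that splits on anything other than letters and digits; the function names are hypothetical, and the document string is abridged from the excerpt shown earlier.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; hyphens and dashes act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_match(query: str, document: str) -> bool:
    """Boolean AND over query terms, based on string occurrence only."""
    return tokens(query) <= tokens(document)


# Abridged from the example excerpt: the leaf count is 20–75, not 3.
doc = ("Leaves 20–75, many-ranked, spreading and recurved, not twisted, "
       "spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate")

print(keyword_match("3 leaves", doc))     # matches: "3" comes from "2–3-pinnate"
print(keyword_match("long leaves", doc))  # no match: "long" never occurs as a string
```

The first query matches a document that does not have three leaves, and the second misses a document whose leaves reach 1 m, which is the pair of failures the slide describes.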

Approach
• Identify useful semantic information within full-text documents using Information Extraction techniques
• Allow users to search based on semantic meaning via structured semantic information

[Diagram: Approach. A keyword query reaches the text by string match. After identifying semantic information, semantic information from the query reaches semantic information from the text by semantic match.]

Add facts to the text
• <leaves have-cardinality 3>
• <leaves have-length .gt. 50cm>
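Once facts like these are attached to a text, a query can be evaluated against values rather than strings. The sketch below is one possible reading, not the system described in the talk: `RangeFact`, `has_value`, and `exceeds` are hypothetical names, the facts are hand-entered from the example excerpt ("Leaves 20–75", "to 1 m"), and the lower length bound of 0 is a placeholder because the excerpt only gives an upper bound.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RangeFact:
    """A numeric fact about a plant part, stored as a closed range (hypothetical layout)."""
    subject: str
    predicate: str
    low: float
    high: float
    unit: str = ""


def has_value(facts: list[RangeFact], subject: str, predicate: str, value: float) -> bool:
    """True if some fact's range contains the queried value."""
    return any(f.subject == subject and f.predicate == predicate
               and f.low <= value <= f.high for f in facts)


def exceeds(facts: list[RangeFact], subject: str, predicate: str, threshold: float) -> bool:
    """True if some fact's upper bound is above the threshold (the .gt. query)."""
    return any(f.subject == subject and f.predicate == predicate
               and f.high > threshold for f in facts)


# Hand-entered from the excerpt; low=0 for length is an assumption.
doc_facts = [
    RangeFact("leaves", "have-cardinality", 20, 75),
    RangeFact("leaves", "have-length", 0, 100, "cm"),
]

print(has_value(doc_facts, "leaves", "have-cardinality", 3))  # False
print(exceeds(doc_facts, "leaves", "have-length", 50))        # True
```

Here the query for three leaves no longer matches the 20–75-leaved plant, while the query for leaves longer than 50 cm does.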

Morphological Information Extraction
• System features: computationally inexpensive, automatic, portable, accurate
• IE techniques: machine learning, knowledge bases, partial parsing

[Diagram: Machine Learning. Training uses documents Doc1, Doc2, through Doc60, each paired with its extracted information. The learned rules feed the information extraction system.]

[Diagram: IE System Adaptation. Components: pre-processing, rule creation module, rules, knowledge bases, templates for useful information, query analysis, and extraction from FNA documents into structured information. Annotations: modified in the new domain; automatically learned in the new domain; updated in the new domain.]

[Diagram: IE System Training. Components: training documents, pre-processing module, manually tagged instances, learning module, knowledge bases, learned rules.]

Information Extraction From FNA

Templates for useful information (from user log analysis): Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, …..

Knowledge bases:
PartBlade: Leaf blade, Blades, blade, ……

Rules:
Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' *
Output:: leaf {leafShape $1}
Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase>
Output:: leaf {bladeDimension $1}

Original documents:
……….. Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate, ………………

Structured information (after extraction):
Leaf_Shape obovate
Leaf_Shape orbiculate
Blade_Dimension 3--9 × 3--8 cm
…………..
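The interplay of knowledge bases, rules, and templates can be approximated with regular expressions. This is a hand-written sketch, not the system from the talk, whose rules are learned automatically and use their own pattern syntax. Only the `PartBlade` entries come from the slide; the `leafShape` and `LengUnit` lists, the function names, and the clause-splitting heuristic are assumptions made for illustration.

```python
from __future__ import annotations

import re

# Knowledge bases. Only the PartBlade entries appear on the slide; the rest are assumed.
KB = {
    "PartBlade": ["Leaf blade", "Blades", "blade"],
    "leafShape": ["obovate", "orbiculate", "ovate", "linear"],
    "LengUnit": ["mm", "cm", "m"],
}

RANGE = r"\d+(?:\.\d+)?\s*--\s*\d+(?:\.\d+)?"


def alt(name: str) -> str:
    """Regex alternation over a knowledge-base entry, longest term first."""
    terms = sorted(KB[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in terms) + ")"


def extract(text: str) -> list[tuple[str, str]]:
    """Fill Leaf_Shape and Blade_Dimension slots from a blade description."""
    results: list[tuple[str, str]] = []
    blade = re.search(alt("PartBlade") + r"\s+([^;]*)", text)
    if blade is None:
        return results
    clause = blade.group(1)
    # Shape terms are expected before the first comma, as in the first rule.
    for shape in re.findall(r"\b" + alt("leafShape") + r"\b", clause.split(",")[0]):
        results.append(("Leaf_Shape", shape))
    # A range, optionally "× range", followed by a length unit, as in the second rule.
    dim = re.search("(" + RANGE + r"(?:\s*×\s*" + RANGE + r")?\s*" + alt("LengUnit") + r")\b", clause)
    if dim:
        results.append(("Blade_Dimension", dim.group(1)))
    return results


sample = ("Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, "
          "base obtuse to broadly cuneate, margins flat")

for slot, value in extract(sample):
    print(slot, value)
```

Sorting terms longest-first keeps "Leaf blade" from being shadowed by "blade", and the word boundaries keep "ovate" from matching inside "obovate".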

Results - IE
• Recall = correct/possible
• Precision = correct/actual
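The two measures can be computed directly from sets of slot fills. In this sketch, "possible" is read as the number of fills in a hand-made answer key and "actual" as the number of fills the system produced, which is an assumption about the slide's terms. The `gold` and `extracted` sets are invented for illustration and are not results from the talk.

```python
def recall(correct: int, possible: int) -> float:
    """Share of the answer-key fills that the system found."""
    return correct / possible if possible else 0.0


def precision(correct: int, actual: int) -> float:
    """Share of the system's fills that are right."""
    return correct / actual if actual else 0.0


# Invented example data, not figures from the talk.
gold = {
    ("Leaf_Shape", "obovate"),
    ("Leaf_Shape", "orbiculate"),
    ("Blade_Dimension", "3--9 × 3--8 cm"),
    ("Leaf_Base", "obtuse"),
}
extracted = {
    ("Leaf_Shape", "obovate"),
    ("Leaf_Shape", "orbiculate"),
    ("Leaf_Apex", "acute"),
}

correct = len(gold & extracted)
print("recall:", recall(correct, len(gold)))
print("precision:", precision(correct, len(extracted)))
```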

Retrieval System Design and User Evaluation
• SEARF: Keyword retrieval
• SEARFA: Retrieval with keywords + structured semantic information (from information extraction)
• Both run on the FNA collection and undergo user evaluation, followed by a performance comparison


Results – System Performance
• NT: number of tasks accomplished in total
• NTH: number of tasks accomplished per hour
• TSR: task success rate
• SSR: search success rate
• NSST: number of searches to accomplish a task
• TST: time spent to accomplish a task
• NDVST: number of documents viewed to accomplish a task

Results – User Satisfaction

Limitations and Future Work
• Generalization of text collections
  • Other collections in the same domain and other domains
• Generalization of IE applications
  • Document representations
  • A wider range of attributes
• Query formulation and interface design
  • Online term definitions
  • Visualized search interface
• Retrieval algorithms
  • More accurate matching
