
Taxonomic Database Working Group 2005

Machine Learning for Extracting Darwin Core Data from Museum Labels. Taxonomic Database Working Group 2005. P. Bryan Heidorn, Wensheng Wu, Hong Zhang, and Reed Beaman.



Presentation Transcript

  1. Machine Learning for Extracting Darwin Core Data from Museum Labels • Taxonomic Database Working Group 2005 • P. Bryan Heidorn (1), Wensheng Wu (2), Hong Zhang (1), Reed Beaman (3) • (1) Graduate School of Library and Information Science and (2) Computer Science, University of Illinois; (3) Yale, Peabody Museum

  2. Ontology of a Label • Determination? • Synonymy • Collector • Date • Location • Habitat • ….

  3. Ontology of a Label • bc - barcode • bt - barcode text • cm - common/colloquial name • cn - collection number • co - collector • cd - collection date • fm - family name • ft - footer info

  4. Ontology of a Label • gn - genus name • hd - header info • in - infra name • ina - infra name author • lc - location • pd - plant description • sa - scientific name author • sp - species name

  5. BAKIUM <bt>Yale University Herbarium</bt> <bc>YU.000081</bc> <bc>YU.000081</bc> <hd>Herbarium of Yale University Plants of San Luis, Peten, Guatemala</hd> No: 301 Family:<fm>Boraginaceae</fm> Scientific Name: <gn>Heliotropum</gn> <sp>angiospermum^</sp> Mopan Mayan Name: <cm>U p'ot kutz</cm> Colloquial Spanish Name: <cm>mOCO de chompipe</cm> Location: <lc>in pueblo (village) along path</lc> <cd>2 9 May 1976</cd> Comments:<pd>herbaceous plant reaching 60 cm. small yellow flowers</pd> (det.L.Brown) Collected by <co>Pierre Ventur</co>, <ft>Yale Department of Anthropology</ft>
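The marked-up label above can be read back into fields with a few lines of code. The following is a minimal sketch, assuming the tags are well-formed `<tag>text</tag>` pairs; the function name `extract_fields` and the `TAG_NAMES` table are illustrative and not part of the original presentation.

```python
import re
from collections import defaultdict

# Tag codes as listed on the "Ontology of a Label" slides.
TAG_NAMES = {
    "bc": "barcode", "bt": "barcode text", "cm": "common/colloquial name",
    "cn": "collection number", "co": "collector", "cd": "collection date",
    "fm": "family name", "ft": "footer info", "gn": "genus name",
    "hd": "header info", "in": "infra name", "ina": "infra name author",
    "lc": "location", "pd": "plant description",
    "sa": "scientific name author", "sp": "species name",
}

# Matches <tag>text</tag> where the closing tag repeats the opening tag.
TAGGED = re.compile(r"<([a-z]+)>(.*?)</\1>", re.DOTALL)


def extract_fields(marked_up):
    """Return {tag: [text, ...]} for every well-formed tagged span (hypothetical helper)."""
    fields = defaultdict(list)
    for tag, text in TAGGED.findall(marked_up):
        fields[tag].append(text.strip())
    return dict(fields)


# A fragment of the label shown on the slide, OCR errors left untouched.
label = ("Family:<fm>Boraginaceae</fm> Scientific Name: <gn>Heliotropum</gn> "
         "<sp>angiospermum^</sp> Collected by <co>Pierre Ventur</co>")
fields = extract_fields(label)
for tag, values in fields.items():
    print(TAG_NAMES.get(tag, tag), "->", values)
```

Spans whose closing tag is missing or malformed are silently skipped by this sketch, so real training data would need a validation pass first.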

  6. <bt>Yale University Herbarium</bt> 1 <bc>YU.000122</bc> <hd>Herbarium of Yale University</hd> No: Family: <fm>Umbelliferae</fm> Scientific Name: <gn>Zizia</gn> <sp>aurea</sp> <sa>(L.] Koch.</sa> Common Locality: Habitat: Name: <cm>Golden Alexanders</cm> <lc>New Hampshire Keene/Swanzey Yale Forest</lc> Comments: Collector: <co>Oran</co> Ph.D <co>B. Stanley</co> Botany Y'36 Date: <cd>8 Sept. 1932</cd> <bc>YU.000122</bc> (/.) \ .. . . .

  7. A Bad Example: Poor OCR

  8. f <bt>Yale University Herbarium</bt> <bc>YU.000022</bc> <bc>YU.000022</bc> <cd>1954</cd> NameSc. Name Com. No. 54,157 <cm>Ono glowered Cancer Root</cm> Family Date <fm>Orobanchaceae</fm> Town State <lc>Barkhamsted</lc> Location <lc>Youngsdale</lc> Soil and Site <hb>Mge of moist woods</hb> Deter, by •? Asso. Sps. Remarks: <pd>Flower color</pd> <ft>COLLECTION OF PLANTS NORTH EASTERN UNITED STATES</ft>

  9. Learning Architecture • Training Phase: Marked-up Specimens → Machine Learner → Trained Model • Application Phase: Unmarked-up Specimens → Chunker - Segmentor (segments text) → Machine Classifier → Marked-up Specimens
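The two phases can be illustrated with a toy classifier. The slides do not say which learner was used, so the sketch below substitutes a small multinomial Naive Bayes model with add-one smoothing as a stand-in; the class name `NaiveBayes`, the tiny training set, and the choice of whole segments as training units are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Stand-in learner: multinomial Naive Bayes over lower-cased tokens."""

    def fit(self, segments, labels):
        # Training phase: count classes and per-class token frequencies.
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for segment, label in zip(segments, labels):
            tokens = segment.lower().split()
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, segment):
        # Application phase: pick the class with the highest log-probability.
        tokens = segment.lower().split()
        total = sum(self.class_counts.values())

        def score(label):
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            log_p = math.log(self.class_counts[label] / total)
            for token in tokens:
                log_p += math.log((counts[token] + 1) / denom)
            return log_p

        return max(self.class_counts, key=score)


# Hypothetical marked-up training segments (tag codes from the label ontology).
training = [
    ("Collected by Pierre Ventur", "co"),
    ("Collector: B. Stanley", "co"),
    ("Family: Boraginaceae", "fm"),
    ("Family: Umbelliferae", "fm"),
    ("8 Sept. 1932", "cd"),
    ("29 May 1976", "cd"),
]
model = NaiveBayes().fit([s for s, _ in training], [l for _, l in training])
print(model.predict("Family: Orobanchaceae"))
print(model.predict("12 May 1954"))
```

With only token features this corresponds to the baseline configuration; a realistic model would need far more training labels than the six used here.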

  10. Segmentation Flowchart • Specimen record → Line breaker → Lines of text → Pattern-based chunker → Segments • Patterns for a segment boundary: • More than one whitespace character, not preceded by a comma • A period followed by at least one whitespace character • A semicolon followed by at least one whitespace character • The chunker uses a set of patterns to recognize the segment boundaries in each line of text
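The three boundary patterns listed on this slide translate fairly directly into regular expressions. The sketch below is one possible reading of them, not the authors' implementation; the names `BOUNDARY`, `break_lines`, `chunk_line`, and `segment` are hypothetical, and the sample record is made up from fragments of the Yale label.

```python
import re

# One alternative per boundary pattern from the slide.
BOUNDARY = re.compile(
    r"(?<![,\s])\s{2,}"   # a run of 2+ whitespace characters not preceded by a comma
    r"|(?<=\.)\s+"        # a period followed by at least one whitespace character
    r"|(?<=;)\s+"         # a semicolon followed by at least one whitespace character
)


def break_lines(record):
    """Line breaker: split a specimen record into non-empty lines."""
    return [line.strip() for line in record.splitlines() if line.strip()]


def chunk_line(line):
    """Pattern-based chunker: split one line at every segment boundary."""
    return [seg for seg in BOUNDARY.split(line) if seg]


def segment(record):
    """Run the line breaker, then the chunker, and return all segments."""
    return [seg for line in break_lines(record) for seg in chunk_line(line)]


record = ("Family: Umbelliferae   Zizia aurea\n"
          "Collector: Oran;  Date: 8 Sept. 1932,  Keene")
for seg in segment(record):
    print(repr(seg))
```

Note that the period rule also splits after abbreviations such as "Sept.", so a date can end up in two segments; the comma exception keeps "1932,  Keene" together.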

  11. Unsegmented Label Yale University Herbarium 1 YU.000122 Herbarium of Yale University No: Family: Umbelliferae Scientific Name: Zizia aurea (L.] Koch. Common Locality: Habitat: Name: Golden Alexanders New Hampshire Keene/Swanzey Yale Forest Comments: Collector: Oran Ph.D B. Stanley Botany Y'36 Date: 8 Sept. 1932 YU.000122 (/.) \ .. . . .

  12. Yale University Herbarium 1 YU.000122 Herbarium of Yale University No: Family: Umbelliferae Scientific Name: Zizia aurea (L.] Koch. Common Locality: Habitat: Name: Golden Alexanders New Hampshire Keene/Swanzey Yale Forest Comments: Collector: Oran Ph.D Stanley Botany Y'36 Date: 8 Sept. 1932 YU.000122 (/.) \ .. . . .

  13. <bt>Yale University Herbarium</bt> 1 <bc>YU.000122</bc> <hd>Herbarium of Yale University</hd> No: Family: <fm>Umbelliferae</fm> Scientific Name: <gn>Zizia</gn> <sp>aurea</sp> <sa>(L.] Koch.</sa> Common Locality: Habitat: Name: <cm>Golden Alexanders</cm> <lc>New Hampshire Keene/Swanzey Yale Forest</lc> Comments: Collector: <co>Oran</co> Ph.D <co>B. Stanley</co> Botany Y'36 Date: <cd>8 Sept. 1932</cd> <bc>YU.000122</bc> (/.) \ .. . . .

  14. Overall Performance • Baseline: features include only tokens • Gazetteer: additional features for genus, species, and person names • Fuzzy: approximate matching on genus, species, and family

  15. Gazetteer Breakdown • Effect on performance of each individual gazetteer • None: baseline (no gazetteer used) • All: all gazetteers included
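To make the difference between the baseline and gazetteer configurations concrete, here is a small feature-extraction sketch. The gazetteer contents, the feature naming scheme, and the function names are assumptions for illustration; the slides state only that genus, species, and person-name lists were used as additional features.

```python
import re

# Hypothetical, tiny gazetteers; real ones would hold many thousands of entries.
GAZETTEERS = {
    "genus": {"zizia", "heliotropium"},
    "species": {"aurea", "angiospermum"},
    "person": {"stanley", "ventur", "brown"},
}


def tokenize(segment):
    """Lower-case alphabetic tokens of a segment."""
    return re.findall(r"[a-z]+", segment.lower())


def features(segment, gazetteers=None):
    """Token features (baseline) plus one flag per gazetteer that matches."""
    tokens = tokenize(segment)
    feats = {f"tok={t}" for t in tokens}
    for name, entries in (gazetteers or {}).items():
        if any(t in entries for t in tokens):
            feats.add(f"in_{name}_gazetteer")
    return feats


print(sorted(features("Zizia aurea")))               # baseline: tokens only
print(sorted(features("Zizia aurea", GAZETTEERS)))   # all gazetteers
```

Passing a dictionary with a single entry reproduces the "individual gazetteer" runs described on the slide.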

  16. Fuzzy Match • Edit Distance • Levenshtein distance • N-Gram (computationally complex)
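Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, which is what lets an OCR error such as "Heliotropum" still match a gazetteer entry. The sketch below uses the standard dynamic-programming formulation; the `fuzzy_lookup` helper and its `max_distance` threshold of 2 are illustrative assumptions, not values from the presentation.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(token, gazetteer, max_distance=2):
    """Return the closest gazetteer entry within max_distance, else None."""
    best, best_distance = None, max_distance + 1
    for entry in gazetteer:
        distance = levenshtein(token.lower(), entry.lower())
        if distance < best_distance:
            best, best_distance = entry, distance
    return best


genera = ["Heliotropium", "Zizia"]  # hypothetical genus gazetteer
print(levenshtein("Heliotropum", "Heliotropium"))
print(fuzzy_lookup("Heliotropum", genera))
print(fuzzy_lookup("Quercus", genera))
```

Each comparison costs time proportional to the product of the two string lengths, so scanning a large gazetteer this way is slow without some form of indexing.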

  17. Weka - T2K Demo

  18. Location "Ridges about 2 miles west of inlet between Peters Lake and Schraders Lake - Drainage Canal, steep north-facing slope. ALASKA: Schraders Lake - Peters Lake area, just northwest of Mount Chamberline, Franklin Mountains, Brooks Range, approx 69 22 N. Lat., 145 03 W Long., about 3000 ft. altitude."

  19. “between” Frame • Type: Relation • Location1: Peters Lake • Location2: Schraders Lake • Verification: inlet • Resolution: __________

  20. Features (1) • ADDR: Street address • ADM: Administrative unit • F: Feature. Anything that could potentially be found in a gazetteer • FS: Subdivision of a feature. “~ part of Feature” • J: Junction. Any intersection of linear features • NF/NJ: Near Feature / Near Junction

  21. Features (2) • P: Path. A linear feature such as a road, trail, boundary, or river. A description with a path followed by an offset from a feature at a heading should be calculated as a clause of the type rather than as the intersection of a path and a clause. • POM: Path Offset Marker, e.g. Mile 49.5 Sterling Hwy. • PS: Subdivision of Path • TRS: Township, Range, Section • TRSS: Township, Range, Section Subdivision

  22. Coordinates • LL : Latitude and Longitude coordinate • UTM: Universal Transverse Mercator coordinates
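An LL clause such as the "69 22 N. Lat., 145 03 W Long." fragment in the Alaska locality quoted earlier can be converted to signed decimal degrees. The sketch below assumes coordinates written as degrees and minutes (optionally seconds) followed by a hemisphere letter; the regular expression and the function name `parse_lat_long` are illustrative, and UTM is not handled.

```python
import re

# Degrees, minutes, optional seconds, then a hemisphere letter (N, S, E or W).
COORD = re.compile(r"\b(\d{1,3})\s+(\d{1,2})(?:\s+(\d{1,2}))?\s*([NSEW])\b")


def parse_lat_long(text):
    """Return (latitude, longitude) in signed decimal degrees, or None where absent."""
    lat = lon = None
    for deg, minutes, seconds, hemi in COORD.findall(text):
        value = int(deg) + int(minutes) / 60 + int(seconds or 0) / 3600
        if hemi in "SW":          # south and west are negative
            value = -value
        if hemi in "NS":
            lat = value
        else:
            lon = value
    return lat, lon


locality = "approx 69 22 N. Lat., 145 03 W Long., about 3000 ft. altitude."
lat, lon = parse_lat_long(locality)
print(round(lat, 4), round(lon, 4))
```

The altitude figure is ignored because it is not followed by a hemisphere letter.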

  23. Offsets • +2P: orthogonal offsets from two paths • FO: offset from a feature, no heading • FOH: offset from a feature at heading • FO+: orthogonal offsets from a feature • JOH: offset from a junction at heading • FPOH: offset from a feature at heading along a path • PO: offset along a path, no feature or heading

  24. MaNIS Data • The following locality types are not found • Coordinates: OGS, UTM • Offsets: +2P

  25. FRAME (1) • A frame defines general properties that hold among a class of objects, called frame instances. Frames contain slots (roughly, attributes). • Some frames are complex in that they refer to sequences of transitions, each of which can itself be separately described as a frame.

  26. FRAME (2) • Each locality type can serve as a sub-frame. • A sub-frame can be combined with other sub-frames. For example: • [ FEATURE [ CITY = Canas ] ] • [ PATH [ PLACE = Rio Higueron ] ] • FOH; P 10 MI SW CANAS; RIO HIGUERON • OFFSET VALUE = 10, DIRECTION = SW, UNIT = mile, HEADING [ FEATURE [ CITY = Canas ] ], PATH [ PLACE = Rio Higueron ]

  27. FRAME (3) • JOH: offset from a junction at heading • e.g. 0.5 mi. W Sandhill and Hagadorn Roads • [ FEATURE [ CITY = Sandhill ] ] • [ FEATURE [ ROAD = Hagadorn Roads ] ] • OFFSET VALUE = 0.5, DIRECTION = W, UNIT = mile, JUNCTION [ FEATURE [ CITY = Sandhill ] ] [ FEATURE [ ROAD = Hagadorn Roads ] ]
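Frames with slots and nested sub-frames map naturally onto small record types. The following sketch models the FOH; P and JOH examples from the slides as Python dataclasses; the class and field names are assumptions chosen to mirror the slot names, and `render` is a hypothetical helper that prints a frame in a bracketed notation similar to the one used above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    kind: str   # slot name, e.g. CITY or ROAD
    name: str


@dataclass
class Path:
    place: str


@dataclass
class Offset:
    value: float
    direction: str
    unit: str


@dataclass
class LocalityFrame:
    locality_type: str                      # e.g. "FOH; P" or "JOH"
    offset: Optional[Offset] = None
    heading: Optional[Feature] = None       # sub-frame the heading is measured from
    path: Optional[Path] = None
    junction: List[Feature] = field(default_factory=list)


def render(frame):
    """Print-friendly bracketed form of a frame (hypothetical notation)."""
    parts = [frame.locality_type]
    if frame.offset:
        o = frame.offset
        parts.append(f"OFFSET VALUE = {o.value:g} DIRECTION = {o.direction} UNIT = {o.unit}")
    if frame.heading:
        parts.append(f"HEADING [ FEATURE [ {frame.heading.kind} = {frame.heading.name} ] ]")
    if frame.path:
        parts.append(f"PATH [ PLACE = {frame.path.place} ]")
    if frame.junction:
        inner = " ".join(f"[ FEATURE [ {f.kind} = {f.name} ] ]" for f in frame.junction)
        parts.append(f"JUNCTION {inner}")
    return " | ".join(parts)


foh = LocalityFrame("FOH; P", Offset(10, "SW", "mile"),
                    heading=Feature("CITY", "Canas"), path=Path("Rio Higueron"))
joh = LocalityFrame("JOH", Offset(0.5, "W", "mile"),
                    junction=[Feature("CITY", "Sandhill"), Feature("ROAD", "Hagadorn Roads")])
print(render(foh))
print(render(joh))
```

Because every sub-frame is its own type, new locality types can be added by combining existing pieces rather than by writing a new parser for each one.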

  28. Structure equals flexibility
