Machine Learning for Extracting Darwin Core Data from Museum Labels. Taxonomic Database Working Group 2005. P. Bryan Heidorn 1 , Wensheng Wu 2 & Hong Zhang 1 , Reed Beaman 3
Machine Learning for Extracting Darwin Core Data from Museum Labels. Taxonomic Database Working Group 2005. P. Bryan Heidorn (1), Wensheng Wu (2) & Hong Zhang (1), Reed Beaman (3). (1) Graduate School of Library and Information Science, (2) Computer Science, University of Illinois; (3) Yale, Peabody Museum
Ontology of a Label • Determination? • Synonymy • Collector • Date • Location • Habitat • ….
Ontology of a Label • bc - barcode • bt - barcode text • cm - common/colloquial name • cn - collection number • co - collector • cd - collection date • fm - family name • ft - footer info
Ontology of a Label • gn - genus name • hd - header info • in - infra name • ina - infra name author • lc - location • pd - plant description • sa - scientific name author • sp - species name
BAKIUM <bt>Yale University Herbarium</bt> <bc>YU.000081</bc> <bc>YU.000081</bc> <hd>Herbarium of Yale University Plants of San Luis, Peten, Guatemala</hd> No: 301 Family:<fm>Boraginaceae</fm> Scientific Name: <gn>Heliotropum</gn> <sp>angiospermum^</sp> Mopan Mayan Name: <cm>U p'ot kutz</cm> Colloquial Spanish Name: <cm>mOCO de chompipe</cm> Location: <lc>in pueblo (village) along path</lc> <cd>2 9 May 1976</cd> Comments:<pd>herbaceous plant reaching 60 cm. small yellow flowers</pd> (det.L.Brown) Collected by <co>Pierre Ventur</co>, <ft>Yale Department of Anthropology</ft>
<bt>Yale University Herbarium</bt> 1 <bc>YU.000122</bc> <hd>Herbarium of Yale University</hd> No: Family: <fm>Umbelliferae</fm> Scientific Name: <gn>Zizia</gn> <sp>aurea</sp> <sa>(L.] Koch.</sa> Common Locality: Habitat: Name: <cm>Golden Alexanders</cm> <lc>New Hampshire Keene/Swanzey Yale Forest</lc> Comments: Collector: <co>Oran</co> Ph.D <co>B. Stanley</co> Botany Y'36 Date: <cd>8 Sept. 1932</cd> <bc>YU.000122</bc> (/.) \ .. . . .
A Bad Example: Poor OCR
f <bt>Yale University Herbarium</bt> <bc>YU.000022</bc> <bc>YU.000022</bc> <cd>1954</cd> NameSc. Name Com. No. 54,157 <cm>Ono glowered Cancer Root</cm> Family Date <fm>Orobanchaceae</fm> Town State <lc>Barkhamsted</lc> Location <lc>Youngsdale</lc> Soil and Site <hb>Mge of moist woods</hb> Deter, by •? Asso. Sps. Remarks: <pd>Flower color</pd> <ft>COLLECTION OF PLANTS NORTH EASTERN UNITED STATES</ft>
Learning Architecture • Training Phase: Marked-up Specimens → Machine Learner → Trained Model • Application Phase: Unmarked-up Specimens → Chunker / Segmentor → Segmented Text → Machine Classifier → Marked-up Specimens
Segmentation Flowchart • Specimen record → Line breaker → Lines of text → Pattern-based chunker → Segments • The chunker uses a set of patterns to recognize segment boundaries in each line of text • Patterns for a segment boundary: • More than one white space not preceded by a comma • A period followed by at least one white space • A semicolon followed by at least one white space
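The three boundary patterns on this slide can be approximated with a single regular expression. The following is a minimal sketch, not the authors' implementation: the names `BOUNDARY`, `break_lines`, `chunk_line`, and `segment` are hypothetical, and the sketch assumes that the period or semicolon stays attached to the segment it ends.

```python
import re

# Boundary patterns paraphrased from the slide (an approximation):
#   1. two or more whitespace characters not preceded by a comma
#   2. a period followed by at least one whitespace character
#   3. a semicolon followed by at least one whitespace character
# The "\s" inside the first lookbehind forces a match to start at the
# beginning of a whitespace run, so ",   " is never split part-way through.
BOUNDARY = re.compile(r"(?<![,\s])\s{2,}|(?<=[.;])\s+")


def break_lines(record: str) -> list[str]:
    """Line breaker: split a specimen record into non-empty lines."""
    return [line.strip() for line in record.splitlines() if line.strip()]


def chunk_line(line: str) -> list[str]:
    """Pattern-based chunker: split one line at the boundary patterns."""
    return [seg for seg in BOUNDARY.split(line) if seg]


def segment(record: str) -> list[list[str]]:
    """Return the segments of every line in the record."""
    return [chunk_line(line) for line in break_lines(record)]
```

Because the period rule is purely pattern-based, an abbreviation such as "Sept." followed by a space also ends a segment; a date like "8 Sept. 1932" is therefore split in two under this sketch.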
Unsegmented Label Yale University Herbarium 1 YU.000122 Herbarium of Yale University No: Family: Umbelliferae Scientific Name: Zizia aurea (L.] Koch. Common Locality: Habitat: Name: Golden Alexanders New Hampshire Keene/Swanzey Yale Forest Comments: Collector: Oran Ph.D B. Stanley Botany Y'36 Date: 8 Sept. 1932 YU.000122 (/.) \ .. . . .
Yale University Herbarium 1 YU.000122 Herbarium of Yale University No: Family: Umbelliferae Scientific Name: Zizia aurea (L.] Koch. Common Locality: Habitat: Name: Golden Alexanders New Hampshire Keene/Swanzey Yale Forest Comments: Collector: Oran Ph.D Stanley Botany Y'36 Date: 8 Sept. 1932 YU.000122 (/.) \ .. . . .
<bt>Yale University Herbarium</bt> 1 <bc>YU.000122</bc> <hd>Herbarium of Yale University</hd> No: Family: <fm>Umbelliferae</fm> Scientific Name: <gn>Zizia</gn> <sp>aurea</sp> <sa>(L.] Koch.</sa> Common Locality: Habitat: Name: <cm>Golden Alexanders</cm> <lc>New Hampshire Keene/Swanzey Yale Forest</lc> Comments: Collector: <co>Oran</co> Ph.D <co>B. Stanley</co> Botany Y'36 Date: <cd>8 Sept. 1932</cd> <bc>YU.000122</bc> (/.) \ .. . . .
Overall Performance • Baseline: features include only tokens • Gazetteer: additional features from genus, species, and person names • Fuzzy: approximate matching on genus, species, and family
Gazetteer Breakdown • Effect on performance of each individual gazetteer • None: baseline (no gazetteer used) • All: all gazetteers included
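One way to picture the difference between the baseline and gazetteer configurations is as a feature function over tokens. The sketch below is illustrative only: the slides do not say how features were encoded or which learner consumed them, so the function name `token_features`, the feature-name strings, and the dictionary representation are all assumptions.

```python
def token_features(token: str, gazetteers: dict[str, set[str]]) -> dict[str, int]:
    """Build a sparse feature dict for one token.

    With an empty `gazetteers` mapping this is the token-only baseline;
    each gazetteer (e.g. "genus", "species", "person") adds one
    membership feature when the lower-cased token is listed in it.
    """
    key = token.lower()
    features = {"token=" + key: 1}
    for name, entries in gazetteers.items():
        if key in entries:
            features["in_" + name] = 1
    return features
```

Passing a mapping with a single gazetteer corresponds to the per-gazetteer breakdown, and passing all of them corresponds to the "All" condition.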
Fuzzy Match • Edit Distance • Levenshtein distance • N-Gram (computationally complex)
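Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, which makes it a natural fit for OCR errors such as "Heliotropum" for the genus Heliotropium. The sketch below is a standard dynamic-programming version; the helper `fuzzy_lookup` and its `max_distance` threshold are hypothetical and are not taken from the slides.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(token: str, gazetteer: Iterable[str],
                 max_distance: int = 2) -> Optional[tuple[str, int]]:
    """Return the closest gazetteer entry within max_distance, else None."""
    best = None
    for entry in gazetteer:
        d = levenshtein(token.lower(), entry.lower())
        if d <= max_distance and (best is None or d < best[1]):
            best = (entry, d)
    return best
```

For example, `fuzzy_lookup("Heliotropum", ["Zizia", "Heliotropium"])` finds "Heliotropium" at distance 1, since a single inserted "i" repairs the OCR output.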
Location "Ridges about 2 miles west of inlet between Peters Lake and Schraders Lake - Drainage Canal, steep north-facing slope. ALASKA: Schraders Lake - Peters Lake area, just northwest of Mount Chamberline, Franklin Mountains, Brooks Range, approx 69 22 N. Lat., 145 03 W Long., about 3000 ft. altitude."
“between” Frame • Type: Relation • Location1: Peters Lake • Location2: Schraders Lake • Verification: inlet • Resolution: __________
Features (1) • ADDR: Street address • ADM: Administrative unit • F: Feature. Anything that could potentially be found in a gazetteer • FS: Subdivision of a feature. “~ part of Feature” • J: Junction. Any intersection of linear features • NF/NJ: Near Feature / Near Junction
Features (2) • P: Path. A linear feature such as a road, trail, boundary, or river. A description with a path followed by an offset from a feature at a heading should be calculated as a clause of that type rather than as the intersection of a path and a clause. • POM: Path Offset Marker, e.g. Mile 49.5 Sterling Hwy. • PS: Subdivision of Path • TRS: Township, Range, Section • TRSS: Township, Range, Section Subdivision
Coordinates • LL : Latitude and Longitude coordinate • UTM: Universal Transverse Mercator coordinates
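An LL locality written in degrees and minutes, such as the "69 22 N. Lat., 145 03 W Long." in the Alaska label quoted earlier, can be converted to signed decimal degrees by adding minutes/60 to the degrees and negating for the southern and western hemispheres. The function below is a small sketch of that arithmetic; its name and signature are hypothetical, and it does not attempt to parse the label text itself.

```python
def dm_to_decimal(degrees: int, minutes: float, hemisphere: str) -> float:
    """Convert degrees and minutes plus a hemisphere letter to decimal degrees.

    North and East are positive; South and West are negative.
    """
    hemi = hemisphere.strip().upper()
    if hemi not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60.0
    return -value if hemi in ("S", "W") else value


# The coordinates from the Alaska label: 69 22 N, 145 03 W
latitude = dm_to_decimal(69, 22, "N")    # about 69.3667
longitude = dm_to_decimal(145, 3, "W")   # about -145.05
```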
Offsets • +2P: orthogonal offsets from two paths • FO: offset from a feature, no heading • FOH: offset from a feature at a heading • FO+: orthogonal offsets from a feature • JOH: offset from a junction at a heading • FPOH: offset from a feature at a heading along a path • PO: offset along a path, no feature or heading
MaNIS Data • The following locality types are not found • Coordinates: OGS, UTM • Offsets: +2P
FRAME (1) • A frame defines general properties that hold among a class of objects, called frame instances. Frames contain slots, which are roughly attributes. • Some frames are complex in that they refer to sequences of transitions, each of which can itself be described separately as a frame.
FRAME (2) • Each locality type can serve as a sub-frame. • A sub-frame can be combined with other sub-frames. For example: • [ FEATURE [ CITY = Canas ] ] • [ PATH [ PLACE = Rio Higueron ] ] • FOH; P: 10 MI SW CANAS; RIO HIGUERON • OFFSET: VALUE = 10, DIRECTION = SW, UNIT = mile • HEADING [ FEATURE [ CITY = Canas ] ] • PATH [ PLACE = Rio Higueron ]
FRAME (3) • JOH: offset from a junction at a heading • e.g. 0.5 mi. W Sandhill and Hagadorn Roads • [ FEATURE [ CITY = Sandhill ] ] • [ FEATURE [ ROAD = Hagadorn Roads ] ] • OFFSET: VALUE = 0.5, DIRECTION = W, UNIT = mile • JUNCTION [ FEATURE [ CITY = Sandhill ] ] [ FEATURE [ ROAD = Hagadorn Roads ] ]
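The frame notation on these slides maps fairly directly onto nested records: a locality frame has a type code and slots for an offset, one or more features, and an optional path. The dataclasses below are one possible encoding, offered as a sketch; the class and field names are hypothetical, and the slot values are copied from the two slide examples (the city name follows the locality string "10 MI SW CANAS").

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    kind: str  # slot name used on the slide, e.g. "CITY" or "ROAD"
    name: str


@dataclass
class Path:
    place: str


@dataclass
class Offset:
    value: float
    direction: str
    unit: str


@dataclass
class LocalityFrame:
    locality_type: str                 # type code such as "FOH; P" or "JOH"
    offset: Optional[Offset] = None
    features: list[Feature] = field(default_factory=list)
    path: Optional[Path] = None


# "10 MI SW CANAS; RIO HIGUERON": offset from a feature at a heading, plus a path
foh_p = LocalityFrame(
    locality_type="FOH; P",
    offset=Offset(10, "SW", "mile"),
    features=[Feature("CITY", "Canas")],
    path=Path("Rio Higueron"),
)

# "0.5 mi. W Sandhill and Hagadorn Roads": offset from a junction at a heading
joh = LocalityFrame(
    locality_type="JOH",
    offset=Offset(0.5, "W", "mile"),
    features=[Feature("CITY", "Sandhill"), Feature("ROAD", "Hagadorn Roads")],
)
```

In this encoding a junction is represented simply as a frame with two features; a dedicated junction sub-frame would be an equally reasonable choice.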