This paper discusses the lessons learned from linking the International Plant Names Index (IPNI) and TROPICOS databases, highlighting the complexities involved in matching species names and record fields. Key topics include the importance of speed in matching processes, strategies for effective data comparison, handling predictable variations, and minimizing false positives through rigorous matching criteria. By optimizing the data linking process, researchers can enhance botanical information accessibility and improve collaboration among taxonomists.
Untangling Names: Lessons learned (so far) from the linking of IPNI and TROPICOS • Julius Welby • RBG Kew • j.welby@kew.org
Variation • Calophyllum kiong K.Schum. & Lauterb. Fl. Deutsch. Sudsee, 450. • Calophyllum kiong Lauterb. & K.Schum. Die Flora der Deutschen Schutzgebiete in der Sudsee 1900
Duplication • Poa annua L. -- Sp. Pl. 68. 1753 (GCI) • Poa annua L. -- Species Plantarum 2 1753 (APNI) • Poa annua L. -- Sp. Pl. 68. (IK)
Duplication • Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK) • Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK) • Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)
Fields
1 | Calophyllum | Calophyllum
2 | kiong | kiong
3 | K.Schum. & Lauterb. | Lauterb. & K.Schum.
4 | Fl. Deutsch. Sudsee | Die Flora der Deutschen…
5 | 450. | 1900
Lesson 1 Speed matters
Speed matters • 2,500 by 2,000 by 4 fields • 20,000,000 comparisons • ~5.5 hours at 1 ms per comparison
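The arithmetic on this slide can be checked in a few lines. This is only a worked calculation; the variable names are illustrative and the 1 ms cost per comparison is the slide's own assumption.

```python
# Figures taken from the slide; names are illustrative only.
names_a = 2_500          # size of the first dataset
names_b = 2_000          # size of the second dataset
fields = 4               # fields compared per candidate pair
ms_per_comparison = 1    # assumed cost of one field comparison

comparisons = names_a * names_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(f"{comparisons:,} comparisons, about {hours:.1f} hours")
```

Because the cost grows with the product of the two dataset sizes, anything that removes field comparisons from the inner loop pays off quickly.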
Be lazy • Do as little as possible • Do easy things if possible • Do hard things only if necessary • Only expend effort when it’s worth it
Be lazy • Do as little as possible • Specify fields as ‘must match’ • If a ‘must match’ field fails • Mark the match as failed • Stop comparing fields
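A minimal sketch of the 'must match' idea, assuming records are plain dictionaries and each field has a comparison function. `FieldRule`, `match_records` and the record layout are hypothetical names chosen for illustration, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldRule:
    name: str                           # field to compare
    must_match: bool                    # if True, a failure ends the comparison
    compare: Callable[[str, str], bool]


def exact(a: str, b: str) -> bool:
    return a == b


def match_records(rec_a: dict, rec_b: dict, rules: list) -> tuple:
    """Return (matched, per-field results), stopping at the first failed
    'must match' field so later fields are never compared."""
    results = {}
    for rule in rules:
        ok = rule.compare(rec_a[rule.name], rec_b[rule.name])
        results[rule.name] = ok
        if not ok and rule.must_match:
            return False, results       # mark as failed, stop comparing
    return True, results


RULES = [
    FieldRule("genus", True, exact),
    FieldRule("species", True, exact),
    FieldRule("authors", False, exact),  # recorded, but not decisive
]

a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}
c = {"genus": "Poa", "species": "annua", "authors": "L."}

matched_ab, results_ab = match_records(a, b, RULES)
matched_ac, results_ac = match_records(a, c, RULES)
```

For the pair `a`/`c` only the genus is ever compared: the first 'must match' failure ends the work for that pair.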
Parameterised matching • species • infragenus • infraspecies • authors • rank • …
Optimising • The order of field matching is important • Choose suitable fields to match first • Aim to fail matches early • Significant speed-up
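The effect of field order can be illustrated by counting comparisons on a toy dataset. The sketch below assumes exact comparison and a made-up set of three names in one genus; `count_comparisons` is a hypothetical helper, not part of the framework.

```python
from itertools import product


def count_comparisons(dataset_a, dataset_b, field_order):
    """Count field comparisons when every pair is abandoned at the first mismatch."""
    count = 0
    for a, b in product(dataset_a, dataset_b):
        for field in field_order:
            count += 1
            if a[field] != b[field]:
                break                   # fail early, skip remaining fields
    return count


# Toy data: every record shares a genus, so genus never rules a pair out.
names = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Calophyllum", "species": "inophyllum"},
]

genus_first = count_comparisons(names, names, ["genus", "species"])
species_first = count_comparisons(names, names, ["species", "genus"])
print(genus_first, species_first)
```

Here the species epithet is the more discriminating field, so testing it first rejects non-matching pairs after a single comparison. Which field discriminates best depends on the datasets being linked.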
Also, for speed • Do as little as possible • Do escaping or standardisation once • Done on import for each dataset • Keep field matching functions clean
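One way to do the standardisation once, on import, is sketched below. The specific rules (HTML unescaping, stripping diacritics, lower-casing, collapsing whitespace) are assumptions chosen for illustration; the slides do not say which clean-up steps were actually applied.

```python
import html
import unicodedata


def standardise(value: str) -> str:
    """Clean one field value; meant to be called once per value at import time."""
    value = html.unescape(value)                       # e.g. "&amp;" -> "&"
    value = unicodedata.normalize("NFKD", value)       # split letters from accents
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    return " ".join(value.lower().split())             # lower-case, tidy spaces


def import_dataset(rows, fields):
    """Return cleaned copies of the rows, keeping only the fields to be matched."""
    return [{f: standardise(row[f]) for f in fields} for row in rows]


raw = [{"authors": "K.Schum. &amp;  Lauterb.", "publication": "Fl. Deutsch. Südsee"}]
clean = import_dataset(raw, ["authors", "publication"])
```

With the cleaning done up front, the field matching functions can compare values directly and stay small.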
More speed optimisation • Do easy things if possible • Define cascading tests • Do easy tests first, if practical • Length comparisons • Composition comparisons
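A cascade of tests might look like the following sketch: an equality check, then a length comparison, then a character-composition comparison, and only then a more expensive similarity measure. The thresholds and the use of `difflib.SequenceMatcher` as the final test are assumptions for illustration, not the tests used in the framework.

```python
from collections import Counter
from difflib import SequenceMatcher


def cascading_match(a, b, max_len_diff=2, max_char_diff=2, min_ratio=0.85):
    """Run cheap tests first and fall through to the costly one only if needed."""
    if a == b:                                   # cheapest test: identical strings
        return True
    if abs(len(a) - len(b)) > max_len_diff:      # length comparison
        return False
    count_a, count_b = Counter(a), Counter(b)    # composition comparison
    unshared = sum(((count_a - count_b) + (count_b - count_a)).values())
    if unshared > max_char_diff:
        return False
    # Expensive test, reached only by pairs that survived the cheap ones.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


same = cascading_match("kiong", "kiong")
near = cascading_match("welwitschii", "welwitschi")
too_long = cascading_match("annua", "microphyllum")
wrong_letters = cascading_match("kiong", "kxyzq")
```

The two failing pairs are rejected by the length and composition tests respectively, without ever reaching the similarity calculation.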
Speed Lessons • Speed matters • Minimise comparisons made • ‘Must match’ parameters • Match fields in an efficient order • Do data cleaning once, up front • Look for ways to fail matches cheaply
Accuracy • False negatives (F-) • OK • False positives (F+)
Strict match • F- • OK
Fuzzy match • OK • F+
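The trade-off between the two kinds of error can be shown with a small sketch. Strict matching is taken here to mean exact string equality, and fuzzy matching a `difflib.SequenceMatcher` ratio above a hypothetical threshold of 0.9; both are illustrative assumptions. The author strings come from the earlier Variation slide; the two epithets are chosen as an example of distinct names that differ by one letter.

```python
from difflib import SequenceMatcher


def strict_match(a, b):
    return a == b


def fuzzy_match(a, b, threshold=0.9):
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Same author team, written in a different order: should match.
same_authors = ("K.Schum. & Lauterb.", "Lauterb. & K.Schum.")
# Two distinct epithets that differ by a single letter: should not match.
different_epithets = ("microphyllum", "macrophyllum")

false_negative = not strict_match(*same_authors)    # strict misses a true match
false_positive = fuzzy_match(*different_epithets)   # fuzzy accepts a wrong match
strict_ok = not strict_match(*different_epithets)   # strict gets this one right
```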
One approach • Currently, best results come from: • Tending towards strictness • Handling false negatives • Failures on ‘rightmost’ fields can be written to a report • These are checked and fed back in as escapes • The match is then rerun
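The report-and-rerun loop could be sketched as below. This assumes that an 'escape' is a checked pair of values to be treated as equivalent for a given field, and that the 'rightmost' field is the last one in the matching order; both readings are interpretations of the slide, and all names are hypothetical.

```python
FIELD_ORDER = ["genus", "species", "authors"]    # assumed matching order


def match_with_escapes(rec_a, rec_b, escapes, report):
    """Strict match; a failure on the last field is logged for manual checking."""
    last = len(FIELD_ORDER) - 1
    for i, field in enumerate(FIELD_ORDER):
        a, b = rec_a[field], rec_b[field]
        if a == b or (field, a, b) in escapes:
            continue
        if i == last:
            report.append((field, a, b))         # near miss: worth a human look
        return False
    return True


a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}

report = []
first_run = match_with_escapes(a, b, set(), report)

# After a person has checked the report, approved entries become escapes.
escapes = set(report)
second_run = match_with_escapes(a, b, escapes, [])
```

Matching stays strict, and each confirmed false negative is fixed once and stays fixed on later runs.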
Predictable variation • Gendered endings • Common alternatives • Endings: • ii, i • iae, ae • Dataset specific quirks: • &amp;, &
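Predictable variation of this kind can be folded into a comparison key before matching. The sketch below assumes the ending pairs listed on the slide (ii/i, iae/ae), treats the Latin endings -us, -a and -um as the gendered endings, and assumes the ampersand quirk refers to HTML-escaped ampersands; the rule set and function names are illustrative only.

```python
import html
import re

# Assumed rules: collapse the alternative endings to a single form.
ENDING_RULES = [
    (re.compile(r"iae$"), "ae"),
    (re.compile(r"ii$"), "i"),
]
# Assumed gendered endings, stripped so that albus / alba / album share a key.
GENDER_ENDING = re.compile(r"(us|um|a)$")


def epithet_key(epithet: str) -> str:
    """Return a comparison key that ignores the predictable ending variations."""
    key = epithet.lower()
    for pattern, replacement in ENDING_RULES:
        key = pattern.sub(replacement, key)
    return GENDER_ENDING.sub("", key)


def author_key(authors: str) -> str:
    """Undo HTML escaping so '&amp;' and '&' compare as equal."""
    return html.unescape(authors)


ii_vs_i = epithet_key("welwitschii") == epithet_key("welwitschi")
iae_vs_ae = epithet_key("hookeriae") == epithet_key("hookerae")
gendered = epithet_key("albus") == epithet_key("alba") == epithet_key("album")
ampersand = author_key("K.Schum. &amp; Lauterb.") == author_key("K.Schum. & Lauterb.")
```

Keys like these are only for comparison; the original spellings should be kept for reporting. Stripping endings this bluntly can also merge epithets that are genuinely different, so it fits best alongside strict matching on the other fields.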
The framework • Python • Psyco • Modular • Extensible • In progress • More details will be available on the TDWG website • Source code availability
The framework • Some results (HTML)
Thanks to • Bob Magill • Sally Hinchcliffe • The Moore Foundation • Contact: • j.welby@kew.org • or after Jan 2007: julius.welby@gmail.com