
An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library (BHL)

Figure 1: BHL digitization process and location of errors. (Diagram labels: Internet Archive, OCR, OCR Error, Image & Text Database, Unstructured Data, BHL (Text Database), Text Mining, TNR, TNR Error, uBio Database, Authority File Error, Structured Data, BHL Web.)



Presentation Transcript


Qin Wei (1), Chris Freeland (2) and P. Bryan Heidorn
(1) Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, qinwei2@uiuc.edu
(2) Missouri Botanical Garden, chris.freeland@mobot.org
http://www.biodiversitylibrary.org

The BHL has incorporated TaxonFinder, a taxonomic name finding algorithm and service provided by uBio.org, into its portal to identify and verify taxonomic name strings found within the digitized BHL corpus. An eight-week evaluation was performed to determine the factors affecting the accuracy of the results returned. We explored and analyzed three factors:

1) the performance of Optical Character Recognition (OCR) in transforming images into text,
2) the performance of TNR matching algorithms in identifying taxonomic names in text,
3) the completeness of NameBank, which is used as an authority file for name verification.

Table 2: Performances
* TaxonFinder Error = 3003 - 1056 (OCR Error) - 92 (NameBank) - 621 (Correctly Found Names) = 1234

Table 5: Performances of TaxonFinder and FAT
* "Without_OCR_Error" means that names not correctly recognized by OCR are excluded from the evaluation. "With_OCR_Error" means that all names, whether correctly or incorrectly recognized by OCR, are included in the evaluation.
* The numbers of names identified by biologists differ because TaxonFinder and FAT use different mechanisms: TaxonFinder removes duplicate names within a page, while FAT does not. To compare the algorithms fairly, we evaluate each one using the same mechanism it applies.
* TaxonFinder is developed by uBio. FAT, short for "Finds All Taxonomic names", is developed by Sautter et al.
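The arithmetic in the Table 2 footnote can be checked with a short sketch. The four counts are taken from the footnote; reading 3003 as the total number of names in the evaluation set is an assumption, since the transcript does not label it, and the variable names are illustrative only.

```python
# Counts from the Table 2 footnote. Treating 3003 as the total number of
# names in the evaluation set is an assumption; the poster does not label it.
TOTAL_NAMES = 3003
OCR_ERROR = 1056          # attributed to OCR errors
NAMEBANK_MISSING = 92     # attributed to NameBank (authority file)
CORRECTLY_FOUND = 621     # names TaxonFinder found correctly

# Whatever is left over is attributed to the TaxonFinder algorithm itself.
taxonfinder_error = TOTAL_NAMES - OCR_ERROR - NAMEBANK_MISSING - CORRECTLY_FOUND

breakdown = {
    "OCR error": OCR_ERROR,
    "NameBank": NAMEBANK_MISSING,
    "Correctly found": CORRECTLY_FOUND,
    "TaxonFinder error": taxonfinder_error,
}

# Print each category with its share of the total.
for label, count in breakdown.items():
    print(f"{label:<20}{count:>6}{count / TOTAL_NAMES:>8.1%}")
```

Under that reading, the residual of 1234 matches the footnote, and the four categories account for every one of the 3003 names.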
The performance of the whole text mining system is evaluated with two measures from information retrieval: Precision (P) and Recall (R). Precision is the proportion of matched strings that are valid names; here it reflects the algorithm's ability to exclude invalid names from its results. Recall is the proportion of valid names in the whole database that were returned as true positives; it reflects the ability to find all valid names in the database. To express the tradeoff between the two, we use a single measure, the F-score, which is the harmonic mean of precision and recall:

F-score = 2 * (Precision * Recall) / (Precision + Recall)

Table 3: Top OCR errors

Table 4: Similarities between TaxonFinder and FAT
* "Same" is the number of names found by both TaxonFinder and FAT. "Same Is Valid Name" is the number of those shared names that were also identified by domain experts (biologists).

Table 6: NameBank Evaluation for TaxonFinder

Table 7: Overall NameBank Evaluation

For additional information about the evaluation, including datasets used, visit: http://bhlnameevaluation.wikispaces.com/
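As a minimal sketch of how Precision, Recall and the F-score defined above can be computed, the following compares a set of names returned by a name finder against a set of names identified by experts. The function name and the example names are made up for illustration and are not data from the evaluation; using sets also removes duplicate names, which is an assumption about how names are counted.

```python
def precision_recall_f(found, valid):
    """Return (precision, recall, f_score) for two collections of name strings.

    found: names returned by the name-finding algorithm
    valid: names identified by domain experts
    Both are converted to sets, so duplicate names are counted once.
    """
    found, valid = set(found), set(valid)
    true_positives = len(found & valid)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(valid) if valid else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example: the third string mimics an OCR-damaged name.
expert_names = {"Quercus alba", "Acer rubrum", "Pinus strobus", "Betula nigra"}
tool_names = {"Quercus alba", "Acer rubrum", "Acer rubnun"}

p, r, f = precision_recall_f(tool_names, expert_names)
print(f"Precision={p:.3f} Recall={r:.3f} F-score={f:.3f}")
```

In this made-up case two of the three returned strings are valid and two of the four expert names are found, so precision is 2/3, recall is 1/2 and the F-score is 4/7.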
