1 / 37

Detecting spelling errors in taxonomic databases

Detecting spelling errors in taxonomic databases. Eduardo Dalcin Instituto Nacional de Pesquisas da Amazônia eduardo@dalcin.org Richard White , Alex Gray Computer Science, Cardiff University. Outline.

kimama
Télécharger la présentation

Detecting spelling errors in taxonomic databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Detecting spelling errors in taxonomic databases Eduardo Dalcin Instituto Nacional de Pesquisas da Amazônia eduardo@dalcin.org Richard White, Alex Gray Computer Science, Cardiff University Eduardo Dalcin - TDWG - St Petersburg

  2. Outline • Experiences with techniques for detecting and correcting spelling errors in scientific names • Dictionaries of complete scientific names are often unavailable • Algorithms for detecting errors without using vocabularies • Results of some experiments • database error rates • performance of the algorithms • cost effectiveness Eduardo Dalcin - TDWG - St Petersburg

  3. Introduction • Many problems can affect the quality of data in taxonomic databases, and some of the ways to resolve these issues are discussed in other talks at this meeting. • Species names play a key role in biodiversity information systems • They (or meaningless unique codes corresponding to them) are the most common values linking related information in different tables and databases • Because of their unfamiliarity to many of the contributors and editors of databases, they are particularly prone to errors. • The causes and frequencies of different types of errors in scientific names are usually unknown. Eduardo Dalcin - TDWG - St Petersburg

  4. A problem in Biodiversity Informatics Species names are the key to biodiversity information – inaccurate names cause data loss: • In single databases where species records may not be found if the species name is wrong • In large biodiversity information systems assembled from several sources, where the names are (or have been) used to link the data sources together Eduardo Dalcin - TDWG - St Petersburg

  5. A problem in Biodiversity Informatics • Thus the accuracy of species names may affect the reliability and usability of species information systems • Techniques to handle the problem semi-automatically can be developed • Litchi (later – this afternoon) • Error detection and correction Eduardo Dalcin - TDWG - St Petersburg

  6. Previous research in error detection • Automatic spelling error detection and correction have been researched since the early 1960’s in relation to three main problems: • spelling checkers in word processor systems • retrieval of the correct records in databases • word recognition in OCR and “pen-based interface” devices Eduardo Dalcin - TDWG - St Petersburg

  7. Types of error detection and correction • Use of vocabularies of scientific name components • akin to those used by conventional spelling checkers. However, suitable dictionaries of complete scientific names are frequently unavailable. • Without the use of vocabularies Eduardo Dalcin - TDWG - St Petersburg

  8. Availability of dictionaries • Spelling error detection in taxonomic databases is straightforward when it involves scientific names of higher taxa such as family names, where reference dictionaries could be used • In contrast, species names, as a binomial combination of genus name and specific epithet, need a different approach because, in most practical situations, a reliable independent dictionary of species names in the same taxonomic domain is not available. Eduardo Dalcin - TDWG - St Petersburg

  9. Objective • Detecting scientific name spelling errors in taxonomic databases aims to minimise “one of the more tedious tasks of taxonomists: to check and update long lists of species names” (Froese, 1997) Eduardo Dalcin - TDWG - St Petersburg

  10. Experiments in error detection • The objective is to detect potential spelling errors in scientific names, using similarity algorithms in order to identify pairs of scientific names which have a high degree of similarity to each other but are not exactly the same. • The results of some experiments will be discussed and summarised, particularly in terms of database error rates and cost effectiveness Eduardo Dalcin - TDWG - St Petersburg

  11. Databases used in tests • ILDIS: International Legume Database and Information Service (plant family Fabaceae) • MARINE: Marine Invertebrates of New Zealand (National Institute of Water & Atmosphere, Wellington ) • SP2K: Species 2000 (Annual Checklist 2002) • PMA: Atlantic Rain Forest Plant Database (Rio de Janeiro Botanic Garden ) • CNIP: North-east Brazil Plant Checklist Database (Federal University of Pernambuco) • A “control” database into which known errors had ben introduced Eduardo Dalcin - TDWG - St Petersburg

  12. Preliminary processing • In an initial step, the presence of invalid characters was detected in all databases • These characters were removed in order to avoid compromising the execution of the spelling error algorithms Eduardo Dalcin - TDWG - St Petersburg

  13. Sample character errors Eduardo Dalcin - TDWG - St Petersburg

  14. Algorithms 1: phonetic similarity • Phonetic matching is used to identify strings that may be of similar pronunciation, regardless of their actual spelling • Phonetic similarity algorithms encode strings of letters (names) according to their spoken sounds. Names that share the same phonetic codes are assumed to be similar and may include erroneous names. Eduardo Dalcin - TDWG - St Petersburg

  15. Phonetic similarity algorithms • The Soundex algorithm (originally developed to search for names in the 1890 USA census files) • For example, reynold and renauld are both reduced to r543 • The Phonix algorithm (developed to be used in a library package for retriieval from bibliographical databases) • The Phonix algorithm is a Soundex variant. Letters are mapped to codes using a similar algorithm with a slightly different set of codes, but before this mapping about 160 letter-group transformations are used to standardise the string and resulting codes Eduardo Dalcin - TDWG - St Petersburg

  16. Algorithms 2: string similarity String similarity algorithms compute a numerical estimate of the similarity between two string • N-grams (Bi-grams and tri-grams) • Levenshtein distance • Skeleton-Key Eduardo Dalcin - TDWG - St Petersburg

  17. N-gram algorithm • N-gramsare n-letter subsequences of words or strings, where n is usually one, two or three letters • If n is greater than 1, n-grams may overlap, in other words a letter may appear in more than one n-gram. • The n-gramalgorithm counts the number of n-grams which the two strings have in common, and assigns a similarity coefficient. • Bi-grams and tri-grams were investigated Eduardo Dalcin - TDWG - St Petersburg

  18. Levenshtein algorithm • The Levenshtein distance is the number of single-character deletions, insertions, or substitutions required to transform one string into the other. Eduardo Dalcin - TDWG - St Petersburg

  19. Skeleton-Key algorithm • Assumes that consonants carry more information than vowels, but is not phonetic • The Skeleton-key of a word consists of its first letter, followed by the remaining consonants and finally the remaining vowels, both in order of appearance. Duplicated letters are eliminated Eduardo Dalcin - TDWG - St Petersburg

  20. Test environment • A series of software tools were developed in Object Pascal for the Delphi compiler • The first set of tools were used to test the algoorithm implementation Eduardo Dalcin - TDWG - St Petersburg

  21. Error classification tool • Single errors (Hit) • extra or one missing letter • one wrong letter • transposition of two adjacent letters • Multiple errors (Hit) • more than one “single error” (Vicia faba : Viica fabba) • Unrelated (false alarm) • different species epithet • different genus • Doubtful Eduardo Dalcin - TDWG - St Petersburg

  22. Eduardo Dalcin - TDWG - St Petersburg

  23. Number of comparisons made Eduardo Dalcin - TDWG - St Petersburg

  24. Error detection and correction algorithms • Algorithms based on character encoding, such as Soundex, Phonix, and Skeleton Key, are faster because the codes can be generated in advance for each name • Bigram, Trigram and Levenshtein Distance algorithms depend on comparison of the original strings, and thus showed the worst performance Eduardo Dalcin - TDWG - St Petersburg

  25. Large number of “false alarms” Eduardo Dalcin - TDWG - St Petersburg

  26. “False alarms” in ILDIS • The large number of false alarms in the ILDIS database can be explained by the large number of scientific names which belong to each genus, because it is a database with a restricted taxonomic domain (one family), increasing the average similarity of names Eduardo Dalcin - TDWG - St Petersburg

  27. The Marine Invertebrates of New Zealand Database showed a particularly large percentage of wrong names, explained by the process of construction of the database, from free text with no data quality control at data entry Percentage of errors by database Eduardo Dalcin - TDWG - St Petersburg

  28. Percentage of errors • Despite the building techniques and database administration efforts and tools, all five databases tested show spelling errors • However, databases built with no data entry control, such as the Marine Invertebrates database, show the highest number of spelling errors Eduardo Dalcin - TDWG - St Petersburg

  29. Types of error Eduardo Dalcin - TDWG - St Petersburg

  30. Genuine errors v. false alarms Eduardo Dalcin - TDWG - St Petersburg

  31. No need for a dictionary • The approach used a variety of detection techniques, using the database or names list itself as a data dictionary; and correction techniques, using similarity algorithms to detect scientific names with a degree of similarity which suggest a spelling error • Lack of dependence on a separate dictionary makes for an independent process, which can be executed by stand-alone tools without dependence on external dictionaries, which may not exist in many groups Eduardo Dalcin - TDWG - St Petersburg

  32. Referring hits to a dictionary • However, a test program which used the IPNI database as a reference was able to classify 92% of “hits” automatically as false alarms, thus greatly reducing the human input required Eduardo Dalcin - TDWG - St Petersburg

  33. But a list of genera is useful • Large numbers of “false alarms” associated with algorithms that find more genuine errors: • Need for auxiliary tools to minimize the requirement for human interaction in order to identify the genuine errors • 14% of false alarms were generated by differences in the generic name, so the use of a reliable genus dictionary could reduce the number of “false alarms”, and enhance the practical efficiency of these tools. Eduardo Dalcin - TDWG - St Petersburg

  34. Eduardo Dalcin - TDWG - St Petersburg

  35. Error limitation policies • Low error rate in the genus name for the first three databases (CNIP, PMA and ILDIS) because those database were populated using a specialised taxonomic database management system (Alice) which offers a genus dictionary from which editors may choose to select existing genus names • The low rate for genus errors in the Species 2000 database reflects the adoption of data entry control polices and data control during the compilation process Eduardo Dalcin - TDWG - St Petersburg

  36. Future work • Improved algorithms for detecting close matches, including Soundex and Phonix modified to be more appropriate for scientific names based on Latin and Greek • Incorporation of dictionaries, where available • Extension to authority strings in scientific names • Extension to other data types, such as the names of geographical areas, etc. • Integration with other tools such as Litchi (later talk – this afternoon) to help bioinformaticians deal with nomenclature Eduardo Dalcin - TDWG - St Petersburg

  37. References • Dalcin E.C. (2004) PhD thesis (Southampton University, unpublished) • MSS in preparation • Reports by Arthur Chapman on GBIF web site Acknowledgements • Plantas do Nordeste (Brazil – Kew collaboration with funding from Brazilian and UK governments) • Arthur Chapman, Andrew Jones Eduardo Dalcin - TDWG - St Petersburg

More Related