
Computational Linguistic Techniques Applied to Drugname Matching

Bonnie J. Dorr, University of Maryland; Greg Kondrak, University of Alberta. June 26, 2003.


Presentation Transcript


  1. Computational Linguistic Techniques Applied to Drugname Matching
     Bonnie J. Dorr, University of Maryland
     Greg Kondrak, University of Alberta
     June 26, 2003

  2. Drugname Matching
     • String matching to rank similarity between drug names
     • Two classes of string matching
       • orthographic: compare strings in terms of spelling, without reference to sound
       • phonological: compare strings on the basis of a phonetic representation
     • Two methods of matching
       • distance: How far apart are two strings?
       • similarity: How close are two strings?

  3. Distance and Similarity Measures: Orthographic/Phonological
     • Orthographic
       • Distance: string-edit
         Ex: contac / zantac = 2/6 = 0.33
       • Similarity: LCSR, DICE
         Ex (LCSR): contac / zantac = 4/6 = 0.67
         Ex (DICE): co on nt ta ac / za an nt ta ac = (2 ∙ 3)/(5+5) = 6/10 = 0.60
     • Phonological
       • Distance: Soundex
         Ex: contac / zantac = 1/4 = 0.25
       • Similarity: ALINE
         Ex: contac / zantac = 0.64

  4. Distance vs. Similarity: Examples
     • Example 1: hordes vs. lords
       • Distance = 2 (replace h with l, and delete e).
       • Similarity = 2 (bigrams or and rd in common).
     • Example 2: water vs. wine
       • Distance = 3 (replace a with i, replace t with n, delete r).
       • Similarity = 0 (no bigrams in common).
     • We can compare (global) similarity and distance:
       • sim(w1,w2)/length
       • 1 − dist(w1,w2)/length

  5. Orthographic Distance: string-edit
     • Count the number of steps it takes to transform one string into another
     • Examples:
       • Distance between hordes and lords is 2.
       • Distance between water and wine is 3.
     • For "global distance", we can divide by the length of the longest string: 2/6 and 3/5 above
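The step counting described on this slide can be sketched with the standard Levenshtein dynamic program. This is a minimal illustration, assuming unit cost for each insertion, deletion, and substitution (the slides do not state the costs); the function names `edit_distance` and `global_distance` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def global_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("hordes", "lords"))  # 2
print(edit_distance("water", "wine"))    # 3
print(round(global_distance("contac", "zantac"), 2))  # 0.33
```

Subtracting the global distance from 1 turns it into a similarity, which is how a distance measure can be ranked side by side with LCSR or DICE.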

  6. Orthographic Similarity: LCSR, DICE
     • LCSR: Divide the length of the longest common subsequence by the length of the longest string
       • Example: reagir and repair have longest common subsequence reair. Similarity score = 5/max(6,6) = 5/6 = 0.83
     • DICE: Double the number of shared character bigrams and divide by the total number of bigrams in the two strings
       • Example: reagir and repair have bigram sets {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir}, respectively, and the shared bigrams are {re,ir}. Similarity score = (2 ∙ 2)/(5+5) = 2/5 = 0.40
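Both measures on this slide are short enough to sketch directly. The code below is an illustration rather than the authors' implementation: it assumes that repeated bigrams are counted with multiplicity (the slide only shows examples without repeats), and the function names are hypothetical.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> list:
    """All overlapping character bigrams of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the number of shared bigrams over the total bigram count."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    shared = sum((ba & bb).values())  # multiset intersection (assumption)
    total = sum(ba.values()) + sum(bb.values())
    return 2 * shared / total if total else 0.0


print(round(lcsr("reagir", "repair"), 2))  # 0.83
print(round(dice("reagir", "repair"), 2))  # 0.4
```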

  7. Phonological Matching
     • Distance-based phonological matching
       • Soundex
     • Similarity-based phonological matching
       • ALINE

  8. Phonological Distance
     • Soundex Examples:
       • king and khyngge reduce to k52
       • knight and night reduce to k523 and n23
       • pulpit and phlebotomy reduce to p413
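The reductions above can be reproduced with a small Soundex sketch. It assumes the common American Soundex rules (keep the first letter, encode consonants as digits, ignore vowels, let h and w pass without breaking a run of identical digits, then pad or truncate to four characters). The slide prints codes in lowercase and unpadded, so K520 here corresponds to k52 there. The helper `soundex_distance` is hypothetical: it reflects one reading of how the 1/4 distance for contac/zantac could arise, namely the fraction of code positions that differ.

```python
# Consonant classes of American Soundex; vowels, h, w, and y get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Four-character Soundex code (first letter plus three digits)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    last = CODES.get(first, "")  # the first letter's digit also blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate identical digits
        code = CODES.get(ch, "")
        if code and code != last:
            digits.append(code)
        last = code  # a vowel resets this, so a repeat across a vowel counts
    return (first.upper() + "".join(digits) + "000")[:4]


def soundex_distance(a: str, b: str) -> float:
    """Fraction of the four code positions that differ (assumed reading)."""
    ca, cb = soundex(a), soundex(b)
    return sum(x != y for x, y in zip(ca, cb)) / 4


for w in ("king", "khyngge", "knight", "night", "pulpit", "phlebotomy"):
    print(w, soundex(w))
print(soundex_distance("contac", "zantac"))  # 0.25
```

The output shows the three behaviours the slide points at: king/khyngge match as intended, knight/night differ although they sound alike, and pulpit/phlebotomy collide although they do not.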

  9. What went wrong?
     • Truncation of word to four characters
       • Alternative: Use entire string
     • Ignoring vowels
       • Alternative: Use more sophisticated phonetic rules
     • Using numbers instead of decomposable features
       • Alternative: Use decomposable features

  10. Phonological Similarity
      • Another possible approach: Compare syllable count, initial/final sounds, stress locations
        • Misses frequently confused pairs
      • Alternative: Use phonological features to compare two words by their sounds.
        • x#→k(s): +consonantal, +velar, +stop, -voice
        • #x→z: +consonantal, +alveolar, +fricative, +voice
      • Phonological similarity of two words: optimal match between their phonological features.
        • Zantac
        • Xanax

  11. Kondrak – ALINE (2000)
      • Two fundamental components of ALINE:
        • Similarity function: uses linguistic feature analysis, with measurements based on salience, e.g., ±alveolar and ±stop are more salient than ±voice
        • Method for choosing the optimal alignment: creates an alignment based on a weighted multi-feature analysis
      • Designed to align phonetic sequences for many different CL applications
      • Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur)
      • Feature weights can be fine-tuned for a specific application.
      • Efficient: dynamic programming algorithm with quadratic running time
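The two components named on this slide, a salience-weighted similarity function and a dynamic-programming alignment, can be illustrated with a toy sketch. This is not ALINE itself: it uses plain global alignment with a fixed skip penalty, and every number is a placeholder chosen for illustration (only the manner values 1.0 for stops and 0.8 for fricatives come from slide 14). The rough transcriptions `zantak` and `zanaks` and all function names are likewise assumptions.

```python
# Placeholder saliences: place and manner outweigh voice, as the slide says.
SALIENCE = {"place": 40.0, "manner": 50.0, "voice": 10.0}
MAX_SCORE = 100.0  # hypothetical score for two identical sounds
SKIP = -10.0       # hypothetical insertion/deletion penalty

# Hypothetical feature values for a tiny sound inventory.
SOUNDS = {
    "z": {"place": 0.85, "manner": 0.8, "voice": 1.0},
    "s": {"place": 0.85, "manner": 0.8, "voice": 0.0},
    "t": {"place": 0.85, "manner": 1.0, "voice": 0.0},
    "k": {"place": 0.60, "manner": 1.0, "voice": 0.0},
    "n": {"place": 0.85, "manner": 0.6, "voice": 1.0},
    "a": {"place": 0.50, "manner": 0.0, "voice": 1.0},
}


def sound_similarity(p: str, q: str) -> float:
    """MAX_SCORE minus the salience-weighted feature differences."""
    fp, fq = SOUNDS[p], SOUNDS[q]
    penalty = sum(w * abs(fp[f] - fq[f]) for f, w in SALIENCE.items())
    return MAX_SCORE - penalty


def align_score(a: str, b: str) -> float:
    """Best global alignment score; O(len(a) * len(b)) time."""
    rows, cols = len(a) + 1, len(b) + 1
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = best[i - 1][0] + SKIP
    for j in range(1, cols):
        best[0][j] = best[0][j - 1] + SKIP
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(
                best[i - 1][j - 1] + sound_similarity(a[i - 1], b[j - 1]),
                best[i - 1][j] + SKIP,   # skip a sound of a
                best[i][j - 1] + SKIP)   # skip a sound of b
    return best[-1][-1]


def similarity(a: str, b: str) -> float:
    """Alignment score normalized by the larger self-alignment score."""
    top = max(align_score(a, a), align_score(b, b))
    return align_score(a, b) / top if top else 0.0


# Rough phonetic transcriptions (assumed): Zantac and Xanax both start with z.
print(similarity("zantak", "zanaks"))
```

Because the comparison runs on sounds rather than letters, the initial x of Xanax and the z of Zantac count as identical, which an orthographic measure cannot see.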

  12. ALINE Features: Weights and Values

  13. Places of Articulation: Numerical Values

  14. Manner of Articulation: Numerical Values
      • stop 1.0 (Example: p, b)
      • affricate 0.9 (Example: th)
      • fricative 0.8 (Example: f, v)

  15. Tuning of ALINE Parameters
      • Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching
      • Parameter tuning:
        • calculate weights for drugname matching
        • "Hill Climbing" search against gold standard
      • Tuned parameters for drugname task
        • maximum score
        • insertion/deletion penalty
        • vowel penalty
        • phonological feature values
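A hill-climbing search of the kind mentioned on this slide can be sketched generically. The slides do not describe the actual search procedure, step size, or objective, so everything below is an assumption: `hill_climb` perturbs one parameter at a time by a fixed step and keeps a change only when the objective improves, and `toy_objective` is a made-up stand-in for "score against the gold standard".

```python
def hill_climb(params, objective, step=0.1, max_rounds=100):
    """Greedy coordinate-wise search; returns (best_params, best_score)."""
    best = dict(params)
    best_score = objective(best)
    names = list(best)
    for _ in range(max_rounds):
        improved = False
        for name in names:
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] = best[name] + delta
                score = objective(candidate)
                if score > best_score:  # accept strict improvements only
                    best, best_score, improved = candidate, score, True
        if not improved:
            break  # local optimum for this step size
    return best, best_score


def toy_objective(p):
    """Hypothetical objective with its peak at (0.5, 0.2).
    A real run would score the ranking against the gold standard."""
    return -((p["skip_penalty"] - 0.5) ** 2 + (p["vowel_penalty"] - 0.2) ** 2)


start = {"skip_penalty": 0.0, "vowel_penalty": 0.0}
tuned, score = hill_climb(start, toy_objective)
print(tuned, score)
```

Hill climbing only finds a local optimum, so in practice the result depends on the starting values and the step size.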

  16. Comparison of Outputs

      Pair              ALINE   EDIT    LCSR    DICE
      zantac / xanax    0.792   0.500   0.545   0.222
      zantac / contac   0.639   0.667   0.667   0.600
      xanax / contac    0.486   0.333   0.364   0.000

  17. Evaluation
      • Precision and recall against online gold standard: USP Quality Review, Mar. 2001.
      • 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced)
      • Example (using DICE):
        + 0.889 atgam ratgam
        + 0.875 herceptin perceptin
        - 0.870 zolmitriptan zolomitriptan
        + 0.857 quinidine quinine
        - 0.857 cytosar cytosar-u
        + 0.842 amantadine rimantadine
        : : : :
        - 0.800 erythrocin erythromycin
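The evaluation setup can be sketched as follows. The pair count is simply the number of unordered pairs of 582 names. For the precision/recall part, the code assumes that "+" marks a pair listed in the gold standard and "-" a pair that is not, and it treats the seven rows shown on the slide as a toy ranking; the real evaluation used all 169,071 pairs and 399 gold pairs, so the numbers printed here are not the reported results. The function name `precision_recall` is hypothetical.

```python
from math import comb

# 582 names yield C(582, 2) unordered candidate pairs.
print(comb(582, 2))  # 169071


def precision_recall(ranked, gold):
    """(precision, recall) after each cutoff of a ranking.
    ranked: pairs ordered by decreasing score; gold: set of frozensets."""
    hits, curve = 0, []
    for k, pair in enumerate(ranked, start=1):
        if frozenset(pair) in gold:
            hits += 1
        recall = hits / len(gold) if gold else 0.0
        curve.append((hits / k, recall))
    return curve


# Toy data: the rows from the slide, in the order listed.
rows = [
    ("+", "atgam", "ratgam"),
    ("+", "herceptin", "perceptin"),
    ("-", "zolmitriptan", "zolomitriptan"),
    ("+", "quinidine", "quinine"),
    ("-", "cytosar", "cytosar-u"),
    ("+", "amantadine", "rimantadine"),
    ("-", "erythrocin", "erythromycin"),
]
ranked = [(a, b) for _, a, b in rows]
gold = {frozenset((a, b)) for mark, a, b in rows if mark == "+"}

for (a, b), (p, r) in zip(ranked, precision_recall(ranked, gold)):
    print(f"{a:>13} {b:<14} precision={p:.2f} recall={r:.2f}")
```

Reading precision at fixed recall levels from such a curve is what allows the different similarity measures to be compared on equal terms.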

  18. Comparison of Precision at Different Recall Values

  19. Precision of Techniques with Phonetic Transcription

  20. Conclusion
      • Experimentation with different algorithms and their combinations against gold standard.
      • ALINE: Strong foundation for search modules in automating the minimization of medication errors
      • Fine-tuning based on comparisons with gold standard (e.g., re-weighting of phonological features).
      • Related to pattern recognition: Discover patterns of predictable matches based on feature values
