1 / 26

Text Search and Fuzzy Matching

Text Search and Fuzzy Matching. Presented by Andrei Kossoroukov, Sunaptic Solutions Andrei.Kossoroukov@sunaptic.com. Focus of the Presentation. Text Search in Big Databases Data Cleansing in ETL Word Matching Usage of Different Matching Algorithms. Scenarios. Scenario 1.

Télécharger la présentation

Text Search and Fuzzy Matching

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Text Search and Fuzzy Matching • Presented by Andrei Kossoroukov, Sunaptic Solutions Andrei.Kossoroukov@sunaptic.com

  2. Focus of the Presentation • Text Search in Big Databases • Data Cleansing in ETL • Word Matching • Usage of Different Matching Algorithms

  3. Scenarios Scenario 1 Scenario 2 (ETL) Scenario 3 User Interface User Interface “Dirty” Data Other Systems Other Systems “clean” request for search “dirty” request for search “Dirty” Data “Clean” Data “Clean” Data

  4. Text Search Challenges • Improving Search Speed • Searching for a substring in a string regardless of the substring nature. • Improving Relevance of Results • Searching for words of a human language. • Domain dependence.

  5. Word Matching Approaches • Exact Matching • Partial Matching (Pattern Matching) • Grammatical Algorithms: Stemming Matching and Synonym Matching (Semantics) • Phonetic Matching • Fuzzy Matching

  6. Exact Matching • No additional challenge except speed. • Domain does not really matter. • Example: search in a file using notepad program. • Example (SQL): SELECT field FROM table WHERE field = ‘string’. • MS SQL Server: Proper indexing improves speed.

  7. Partial Matching • Domain does not really matter. • Example: wildcards, search patterns. • Example (SQL): SELECT field FROM table WHERE field LIKE ‘string%’. • MS SQL Server: Proper indexing improves speed.

  8. Full-text Searchin MS SQL Server • Needs MS Search Service (for SQL Server 2000) • Included in MS SQL Server 2005 as SQL Server Full Text Search Service • CONTAINS Predicate • Unlike LIKE, CONTAINS matches words. • Can search for a word inflectionally generated from another (stemming matching). • Can search for a word near another word. • SQL Server discards noise words from the search criteria. • FREETEXT Predicate • A word or phrase close to the search word or phrase. • Needs Additional Space on Disk

  9. Full-text Search Architecture in MS SQL 2000

  10. Full-text Search Architecture in MS SQL 2005

  11. Grammatical Algorithms • Stemming Match • We already saw SQL Server Full Text search. • Google example: “cutting and paste”. • Why needs dictionary: to determine the stem. • MS Search Service provides only inflectional, not derivational, word generation. • Synonym Match • Most Grammatical Algorithms are Based on Dictionaries • Quasi Stemming Match • Can be developed without a main dictionary (using quasi–endings tree). • Relatively low relevance.

  12. Phonetic Matching • Phonetic Matching Algorithms (or Phonetic Encoding, or “Sounds Alike” Algorithms) • Language Dependent • Domain Dependent

  13. Phonetic Matching Algorithms • The original SoundEx Algorithm • Has been used in US census since late 1890s. • Was patented by Margaret O'Dell and Robert C. Russell in 1918. • Improvements: Phonix (1988), Editex (phonetic distance measuring, circa 2000), etc. • Metaphone and Double Metaphone Algorithms • Author: Lawrence Phillips. 1990 and 2000.

  14. SoundEx Algorithm • 1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step. • 2. Retain the first letter of the word. • 3. Change all occurrence of the following letters to '0' (zero):  'A', E', 'I', 'O', 'U', 'H', 'W', 'Y'. • 4. Change letters from the following sets into the digit given: • 1 = 'B', 'F', 'P', 'V' • 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' • 3 = 'D','T' • 4 = 'L' • 5 = 'M','N' • 6 = 'R' • 5. Remove all pairs of digits which occur beside each other from the string that resulted after step 4. • 6. Remove all zeros from the string that results from step 5 (placed there in step 3). • 7. Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

  15. More About SoundEx • Example (SQL) • DIFFERENCE • Oracle SOUNDEX – Slightly Different from SQL Server SOUNDEX • Seems That Major DBMSs (SQL Server, Oracle, DB2) Don’t Have a Better Phonetic Matching • Enhancements • Replace DG with G etc. • Phonix algorithm.

  16. SoundEx Limitations • SoundEx is only usable in applications that can tolerate high false positives (when words that don't match the sound of the inquiry are returned) and high false negatives (when words that match the sound of the inquiry are NOT returned). • In many instances, unreliable interfaces are used as a foundation, upon which a reliable layer may be built. Interfaces that build a reliable layer, based on context, over a SoundEx foundation may also be possible. • SQL: word can’t start with a space. • Mistake in first letter results in 100% mismatch.

  17. Metaphone and Double Metaphone • Metaphone • An algorithm to code English words phonetically by reducing them to 16 consonant sounds. • Double Metaphone • An algorithm to code English words (and foreign words often heard in the United States) phonetically by reducing them to 12 consonant sounds. • Author: Lawrence Phillips, 1990 and 2000 • Metaphone Description and Demo: http://www.wbrogden.com/phonetic/ • SQL Example

  18. Double Metaphone Advantages and Limitations • Free, Efficient, and Easy to Use • Provides Better Results Compare to SOUNDEX • Returns Two Possible Matches • Works Best with Proper Names • May Fail to Match Misspelled Words • Much Slower than SOUNDEX

  19. Fuzzy Matching • What is Fuzzy Matching? • Fuzzy query in Index Server are simple prefix matching, like dog* returns dogmatic and doghouse, + stem matching. • Originally Meant “Not Exact Matching” • Web Search Engines • Edit Distance Based Algorithms • Simple: Hamming distance algorithms. • Most popular:Levenshtein distance algorithm. • Q-Gram Based Algorithms • Both Types of Algorithms Are Language and Domain Independent

  20. Levenshtein Distance • Developed in 1965 • LD is a Measure of the Similarity Between Two Strings • It is the smallest number of insertions, deletions, and substitutions required to change one string into another. • Language and Domain Independent • Demo • http://www.merriampark.com/ld.htm

  21. Q-Grams • Q-Grams Are Obtained by Sliding a Window of Size Q over the Characters of a Given String • Example • 2-grams of “john smith” are $j jo oh hn n_ _s sm mi it th h# • IDEA: If Strings Match, They Have Many Common Q-Grams • Example: “john smith” and jonh smith” have 9 common q-grams. • Language and Domain Independent

  22. “Fuzzy” SSIS • Fuzzy Lookup enables to match input records with clean, standardized records in a reference table. • Fuzzy Grouping enables to identify groups of records in a table where each record in the group potentially corresponds to the same real-world entity. • Designed for data cleanup. • Based on Q-Grams and Levenshtein Distance (?).

  23. Design a Simple SSIS Fuzzy Lookup Package • Setting Up • String Data Types (DT_STR and DT_WSTR) • ETI (Error-Tolerant Index), Tokens, Delimiters • Tokens are not Q-Grams • Similarity Threshold • Number of Matches

  24. Can Fuzzy Lookup Be Accessed From C# Code? NOT YET

  25. Conclusions • Language and Domain Knowledge is Important • No Implementations? – Develop Yourself! • Questions?

  26. For More Information Contact • Andrei Kossoroukov, Sunaptic Solutions • Andrei.Kossoroukov@sunaptic.com • 604-629-0891 ext. 105

More Related