NISS Workshop on Computational Advertising, November 2009. Spelling Correction for Advertising: How “Noise” Can Help. Silviu Cucerzan, Microsoft Research, Text Mining Search and Navigation.
Buying Cheap(er) on eBay: “Canon 30d” vs. “Cannon 30d”. Not good for the sellers. Not good for most buyers. Not good for the middle man.
Good Ads for Bad Queries espresso machines cingular wireless
Is a Trusted Dictionary Enough?
• Search: max payne chats and codes; new humwee pics
• Music: selin dion color of my love; cristina aquillara
• Shopping: pansonic dvd reorders; brita water filer
• Help and Support: printer divers for window vista; insert flash flies into power point
Intended words: cheats, celine, colour, christina aguilera, panasonic, recorders, filter, drivers, windows, files, powerpoint
Web Query Logs as Corpora
• Web search: on the order of 1 billion queries per day
• 10-15% of the queries contain spelling errors
• highly dynamic domain: many new names and concepts become popular every day, e.g. divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx
• extremely difficult to maintain a high-coverage lexicon
• difficult to define what a valid web query is
The problem / The solution
Problems To Be Handled
• Context-sensitive correction of out-of-lexicon words: power crd → power cord; video crd → video card
• Context-sensitive correction of in-lexicon words: chicken sop → chicken soup; sop opera → soap opera
• Concatenate and split: cheese cake factory → cheesecake factory; chat inspanich → chat in spanish
• Recognize out-of-lexicon valid words: amd processors → amd processors
• Change in-lexicon words to out-of-lexicon words: gun dam fighter → gundam fighter
An HMM Architecture for Spelling Correction
input query: brita water filer
states: all alternative spellings from the query log
• brita: brita, brit, brit., brits, briat, rita
• water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
• filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
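The lattice above can be searched with a standard Viterbi pass. The following is a minimal sketch, not the speller's actual scoring: the counts in `UNIGRAM` and `BIGRAM`, the flat `edit_cost` penalty, the add-one smoothing, and the reduced candidate lists are all assumptions made for illustration.

```python
import math

# Hypothetical toy counts standing in for query-log statistics (not from the talk).
UNIGRAM = {"brita": 50, "brit": 20, "water": 900, "waiter": 80,
           "filer": 5, "filter": 400, "file": 700}
BIGRAM = {("brita", "water"): 30, ("water", "filter"): 120,
          ("water", "filer"): 1, ("waiter", "file"): 1}

# Reduced candidate lists (states) for each input word; each includes the word itself.
CANDIDATES = {"brita": ["brita", "brit"],
              "water": ["water", "waiter"],
              "filer": ["filer", "filter", "file"]}

def edit_cost(observed, candidate):
    """Assumed channel cost: free if unchanged, flat penalty otherwise."""
    return 0.0 if observed == candidate else 2.0

def transition_cost(prev, cur):
    """Negative log of an add-one-smoothed bigram relative frequency."""
    num = BIGRAM.get((prev, cur), 0) + 1
    den = UNIGRAM.get(prev, 0) + len(UNIGRAM)
    return -math.log(num / den)

def viterbi(query, candidates):
    """Return the lowest-cost sequence of candidate spellings for the query."""
    words = query.split()
    total = sum(UNIGRAM.values())
    # best maps each state to (cost of the best path ending there, that path)
    best = {}
    for c in candidates[words[0]]:
        start = -math.log((UNIGRAM.get(c, 0) + 1) / (total + len(UNIGRAM)))
        best[c] = (start + edit_cost(words[0], c), [c])
    for w in words[1:]:
        nxt = {}
        for c in candidates[w]:
            cost, path = min(
                ((pc + transition_cost(p, c), pp) for p, (pc, pp) in best.items()),
                key=lambda t: t[0])
            nxt[c] = (cost + edit_cost(w, c), path + [c])
        best = nxt
    return " ".join(min(best.values(), key=lambda t: t[0])[1])

print(viterbi("brita water filer", CANDIDATES))
```

With these made-up counts the strong `water filter` bigram outweighs the penalty for changing `filer`, so the sketch prefers `brita water filter`.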
What about terrible misspellings?
• input: arnol shwartzeggar
• desired output: arnold schwarzenegger
unweighted edit distance: 5
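For reference, unweighted edit distance can be computed with the usual dynamic program. This is a generic Levenshtein sketch (insertions, deletions, and substitutions all cost 1), which is an assumption about what "unweighted" means on the slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

for wrong, right in [("arnol", "arnold"), ("shwartzeggar", "schwarzenegger")]:
    print(wrong, right, levenshtein(wrong, right))
```

Under this definition the surname alone is 5 edits from its correct form and the first name is 1 edit away, which is far beyond what a single small-radius lookup would cover.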
An Iterative Approach
• Misspelled query: arnol shwartzeggar
• First iteration: arnold schwartzneggar
• Second iteration: arnold schwartzenegger
• Third iteration: arnold schwaxrzenegger
• Fourth iteration: arnold schwarzenegger
• Speller output: no more changes
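The iteration can be sketched as a loop that repeatedly replaces each word with a more frequent nearby form until nothing changes. The counts in `LOG_COUNTS`, the per-iteration radius `MAX_EDITS`, and the word-by-word treatment are illustrative assumptions, not the speller's real statistics or search; the chain this toy produces also skips one of the intermediate forms listed above.

```python
# Hypothetical query-log counts; a real system would read these from the log.
LOG_COUNTS = {
    "arnol": 2, "arnold": 5000,
    "schwartzneggar": 3, "schwartzenegger": 40, "schwarzenegger": 1000,
}
MAX_EDITS = 2  # assumed search radius for a single iteration

def levenshtein(a, b):
    """Unweighted edit distance (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_word(word):
    """Most frequent logged form within MAX_EDITS of word, else word itself."""
    best, best_count = word, LOG_COUNTS.get(word, 0)
    for form, count in LOG_COUNTS.items():
        if count > best_count and levenshtein(word, form) <= MAX_EDITS:
            best, best_count = form, count
    return best

def iterate(query, max_rounds=10):
    """Apply one-step correction until a fixed point; return every stage."""
    history = [query]
    for _ in range(max_rounds):
        nxt = " ".join(correct_word(w) for w in history[-1].split())
        if nxt == history[-1]:
            break  # no more changes
        history.append(nxt)
    return history

for stage in iterate("arnol shwartzeggar"):
    print(stage)
```

Each step only moves to a form within two edits, yet the chain of progressively more frequent forms reaches a spelling that is five edits from the input surname.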
Some Intuition: search query log statistics. hunny moon → honeymoon via the iterative spelling correction process.
Basic Assumptions about the “Noise” • query logs contain a lot of different misspellings for most words • the better spelled a word form, the more frequent it is • the correct forms are much more frequent than their misspellings
Concatenation and Splitting Store word unigrams and bigrams in the same searchable trie structure. Find alternative spellings for the input words in this common structure.
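One way to picture the shared structure is a character trie in which the space is an ordinary character, so a bounded edit-distance search over it can return a unigram for a two-word input and a bigram for a one-word input. This is a sketch under assumptions: the `Trie` class, the entries, and their counts are invented for illustration, and the talk does not say how the search itself is implemented.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # > 0 marks the end of a stored unigram or bigram

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, entry, count):
        node = self.root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.count = count

    def search(self, text, max_edits):
        """Return {entry: (edits, count)} for stored entries within max_edits."""
        results = {}
        first_row = list(range(len(text) + 1))
        for ch, child in self.root.children.items():
            self._walk(child, ch, ch, text, first_row, max_edits, results)
        return results

    def _walk(self, node, ch, prefix, text, prev_row, max_edits, results):
        # One Levenshtein row per trie level: prefix so far vs. every prefix of text.
        row = [prev_row[0] + 1]
        for j in range(1, len(text) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (text[j - 1] != ch)))
        if node.count and row[-1] <= max_edits:
            results[prefix] = (row[-1], node.count)
        if min(row) <= max_edits:  # prune branches that can no longer match
            for nxt, child in node.children.items():
                self._walk(child, nxt, prefix + nxt, text, row, max_edits, results)

# Hypothetical unigram and bigram counts stored side by side.
trie = Trie()
for entry, count in [("cheese", 800), ("cake", 600), ("cheesecake", 300),
                     ("cheese cake", 40), ("in", 9000), ("spanish", 500),
                     ("in spanish", 90)]:
    trie.insert(entry, count)

print(trie.search("cheese cake", 1))  # concatenation candidate appears
print(trie.search("inspanich", 2))    # split candidate appears
```

Because the space is just another symbol, concatenation costs one deletion and splitting costs one insertion, so no separate mechanism is needed for either case.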
Avoid Changing the User’s Intent
input query: brita water filer; in-lexicon alternatives that would change the intent: brit, file, waiter
• brita: brita, brit, brit., brits, briat, rita
• water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
• filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
Modified Viterbi Search – Fringes. e.g.: water filer / waiter file (in-lexicon words); k1·k2 → k1+k2 paths
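The path counts can be illustrated with a small enumeration. The sketch assumes that the restriction is "two adjacent in-lexicon words are not both changed at once", which is one reading of this slide rather than something it states outright; the candidate lists are shortened, and `restricted_pairs` is a hypothetical helper.

```python
from itertools import product

def unrestricted_pairs(cands1, cands2):
    """Every combination of candidates for two adjacent words: k1 * k2 paths."""
    return list(product(cands1, cands2))

def restricted_pairs(word1, word2, cands1, cands2):
    """Assumed restriction: at most one of the two adjacent words may change."""
    return [(a, b) for a, b in product(cands1, cands2)
            if a == word1 or b == word2]

# Shortened candidate lists; each includes the original word.
water_cands = ["water", "waiter", "wafer"]          # k1 = 3
filer_cands = ["filer", "filter", "file", "fiber"]  # k2 = 4

full = unrestricted_pairs(water_cands, filer_cands)
fringe = restricted_pairs("water", "filer", water_cands, filer_cands)
print(len(full), len(fringe))
print(("waiter", "file") in fringe)
```

With the original word counted among the candidates, the restricted set has k1 + k2 - 1 pairs instead of k1 · k2, and a double change such as `waiter file` is never considered.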
Modified Viterbi Search – Stop words. e.g.: lord of teh rigs → lord of the rings
A Closer Look at the Results
• 81.8% overall agreement with the annotators
• inter-annotator agreement rate: 91.3%
• Errors:
• alternative queries for valid queries: many false positives are reasonable suggestions, e.g. cowboy robes → cowboy ropes
• alternative queries for misspelled queries: some suggestions could be valid (user’s intent not known), e.g. massanger → massager / messenger
Evaluation – When we “know” user’s intent
• (audio flie, audio file) → audio file
• (bueavista, buena vista) → buena vista
• (carrabean nooms, carrabean rooms) → caribbean rooms
368 queries
Learning Curve. Silviu Cucerzan and Eric Brill, “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004