50 likes | 282 Vues
Spell Checker. By: Myles Maxfield and Jonathan Reed. Methods.
E N D
Spell Checker By: Myles Maxfield and Jonathan Reed Methods To quantitatively test the quality of the spell checker, the program was executed on predefined “test beds” of words for numerous trials, ranging from thousands to millions of times. The test beds varied widely in content because they were freshly constructed from the latest statistical data of the most commonly searched words on two products of our sponsor company. To ensure that the time efficiency results were accurate, the data from each trial was recorded and averaged. In addition, each corrected result and the frequency of said result were recorded to show the accuracy and consistency of the program. Results 94.2% Correctness Analysis The spell checker works relatively well.The theoretical run time using Big-Oh notation is O(m*n2), where m is a small constant (0<m<1) and n is the length of the inputted word. The scalar m is used, because the Kd tree filters out many impossible matches and therefore cuts down on the data size of the algorithms. The n2 comes from the use of the Damerau-Levenshtein edit distance algorithm. The runtime of the algorithm on our sponsor company’s servers is even less, and is therefore within the acceptable range. The correctness of the algorithm is relatively high. The algorithm gets the correct answer for all but four cases of the 70-word test-bed that the algorithm was run on. The algorithm did not find an answer for two of the four test cases that did not get the right answer. Because the spell checker output will not be displayed if it does not return an answer, the spell checker will only suggest an incorrect word 2.9% of the time.
Discussion Purpose The purpose of the project was to create a spell checker that could check through a corpus specific dictionary of at least 100,000 words long and correct a search query from a user in 500 milliseconds or less. The program was required to identify incorrectly spelled phrases of words and to use words in the phrase to guess the correct spelling of the rest of the words. The accuracy and time efficiency results shown for the final spell checker program exceeded the performance of the spell checker previously used by our sponsor company and also versions using the Landau-Vishkin algorithm and a modified-Landau-Vishkin algorithm. The previous spell checker program used was an “out of the box” spell checker built into the coding of Lucene, an open source software package that was used to store and access webpage data. The lucene Spell Checker works by ranking words using the Levenshtein edit distance. The Levenshtein edit distance algorithm operates in O(M*N) efficiency and requires M*N member space, where M and N are the lengths of the two strings being compared. The algorithm is very simple and is extremely fast when comparing short strings, but inefficient for long strings (Black, 2006). The Damerau-Levenshtein edit distance algorithm implemented in the final spell checker relies on the same method and is very similar; however, the Damerau-Levenshtein algorithm considers transposed letters as one mistake, and the Levenshtein algorithm treats them as two. The Lucene spell checker provides quick results for small test cases and is easy to use, but was unsatisfactory because of the low accuracy in returned results and poor time efficiency for long test cases (Black, 2006). The spell checker produced by the project far exceeds the Lucene spell checker in speed for longer test cases. Background The spelling mistakes of a person can be divided into two categories: typos and guesses. Typos are random mistakes made by accident. Guesses are attempts at spelling a word by sounding it out. With typos, the correct word can be found by using string matching algorithms, such as the Demarau-Levenstein edit distance algorithm, to search a dictionary and find the closest word that matches. Correcting guesses requires the words to be translated into their phonetic equivalents, which can be done using programming libraries such as soundex and metaphone. The closest match of the misspelled word can be found by comparing the phonetic equivalents using the exact same string matching algorithms used for typos. However, checking through an entire dictionary for each misspelled word is inefficient and time costly. Therefore, the process can be optimized by filtering out words in the dictionary that are cannot be matches. A Kd tree is a special data structure that intelligently organizes the dictionary so only words that could possibly match with the misspelled word are found. This acts like a filter for the core algorithm and eliminates excess calculations. In addition to checking individual words for mismatches, words that are commonly used together as phrases can be utilized to “hint” at what the correct spelling of the rest of the phrase is. This can be used to reduce excess work even more and increase efficiency. Conclusion This study was designed to create a spell checker for an Internet-based search engine. Different algorithms were created and unified under a common spell checker program. The resulting program returned the correct spelling of an inputted word most of the time, and the program ran fast enough that the user would not notice a delay. Users now can search the Internet using our sponsor company’s search engine and have their spelling corrected.