This chapter delves into various methodologies for indexing and searching within information retrieval systems. It explores brute force methods like sequential searching, as well as advanced techniques such as Knuth-Morris-Pratt and Boyer-Moore algorithms. Key concepts include left-to-right and right-to-left scanning, and strategies such as the bad character shift and good suffix shift rules. The text highlights sub-linear time methods that enhance search efficiency by examining fewer characters, ultimately leading to optimized retrieval performance.
Modern Information Retrieval Chapter 8 Indexing and Searching
Sequential searching • brute force approach
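The brute force approach can be sketched as follows. This is a minimal illustration, not code from the slides; the function name `brute_force_search` and the choice to return 0-based start positions are assumptions made for the example.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
    return matches


print(brute_force_search("abababac", "abac"))  # [4]
```

Every alignment is tried and up to m characters are compared at each one, so the worst case is proportional to the product of the two lengths.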
Knuth-Morris-Pratt approach • Left-to-right scan • Shifting rule • Example (figure): pattern a b a c aligned at successive shifts against the text a b a b a b a c
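A minimal sketch of the Knuth-Morris-Pratt idea, using the common failure-table formulation. The names `failure_table` and `kmp_search` are hypothetical, and the slides may present the shifting rule in a different but equivalent form.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Left-to-right scan that never moves backwards in the text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, shift the pattern using the failure table.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches


print(failure_table("abac"))           # [0, 0, 1, 0]
print(kmp_search("abababac", "abac"))  # [4]
```

The failure table is what lets the scan keep its place in the text after a mismatch: only the pattern is shifted, so each text character is read once.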
Boyer-Moore approach • Right-to-left scan • Bad character shift rule • Good suffix shift rule • Sub-linear time method • Examines fewer than m+n characters
Right-to-left scan • Shift one place when a mismatch occurs • O(nm) • Example: T = xpbctbxabpqx, P = tpabxab
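The right-to-left scan with a one-place shift can be sketched as below. This is an illustrative baseline only; the function name `right_to_left_naive` and the 0-based result positions are assumptions.

```python
def right_to_left_naive(text: str, pattern: str) -> list[int]:
    """Compare the pattern from its last character backwards and
    shift by exactly one place after every alignment."""
    text_len, pat_len = len(text), len(pattern)
    matches = []
    start = 0
    while start <= text_len - pat_len:
        j = pat_len - 1
        # Scan the current alignment from right to left.
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(start)
        start += 1  # no shift rule yet: always move one place
    return matches


print(right_to_left_naive("xpbctbxabpqx", "tpabxab"))  # []
print(right_to_left_naive("xpbctbxabpqx", "bxab"))     # [5]
```

Without a smarter shift rule the scan direction alone gains nothing, which is why this version still has quadratic worst-case cost.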
Bad character rule • R(x): right-most position in P of each character x • Let y = T(k) be the text character mismatched against position i of P • If R(y) < i, shift P right by i - R(y) positions
Bad character rule • If R(y) > i, shift 1 position • If R(y) = 0 (y does not occur in P), shift n-i+1 positions
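A minimal sketch of searching with the bad character rule alone. It assumes the common textbook form of the shift, max(1, i - R(y)), with i counted from the left end of P starting at 1; the slide's own indexing for the R(y) = 0 case may differ. The names `rightmost_positions` and `bad_character_search` are hypothetical.

```python
def rightmost_positions(pattern: str) -> dict[str, int]:
    """R(x): 1-based right-most position of each character x in the pattern.
    Characters that do not occur are treated as R(x) = 0 by the caller."""
    return {ch: idx for idx, ch in enumerate(pattern, start=1)}


def bad_character_search(text: str, pattern: str) -> list[int]:
    text_len, pat_len = len(text), len(pattern)
    if pat_len == 0:
        return []
    R = rightmost_positions(pattern)
    matches = []
    start = 0  # 0-based offset of the pattern in the text
    while start <= text_len - pat_len:
        i = pat_len  # 1-based position in the pattern, scanned right to left
        while i >= 1 and pattern[i - 1] == text[start + i - 1]:
            i -= 1
        if i == 0:
            matches.append(start)
            start += 1
        else:
            y = text[start + i - 1]  # the mismatched text character
            # Assumed form of the rule: shift i - R(y), but at least 1.
            start += max(1, i - R.get(y, 0))
    return matches


print(bad_character_search("abcaaab", "aab"))  # [4]
```

When the mismatched character is absent from the pattern, the whole pattern jumps past it, which is how the scan can skip text characters without reading them.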
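The strong good suffix rule • t: suffix of P that matches T; y: character of P just left of t, mismatched against text character x • t': right-most other copy of t in P whose preceding character z differs from y • Shift P so that t' lines up with t in T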
The strong good suffix rule • Figure: text region x t aligned against a pattern containing repeated copies of y t
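The strong good suffix rule can be tabulated ahead of the search. The sketch below uses one widely taught border-based preprocessing, which is not necessarily the construction the slides have in mind; `good_suffix_shifts` and `good_suffix_search` are hypothetical names, and the bad character rule is deliberately left out to keep the example short.

```python
def good_suffix_shifts(pattern: str) -> list[int]:
    """shift[j] = how far to move the pattern when the suffix pattern[j:]
    matched and the character before it did not (shift[0]: after a full match)."""
    m = len(pattern)
    shift = [0] * (m + 1)
    border = [0] * (m + 1)  # border[i]: start of the widest border of pattern[i:]
    i, j = m, m + 1
    border[i] = j
    # Case 1: another copy of the matched suffix exists, preceded by a
    # different character (the "strong" condition).
    while i > 0:
        while j <= m and pattern[i - 1] != pattern[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j
    # Case 2: only a prefix of the pattern matches a suffix of the matched part.
    j = border[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = border[j]
    return shift


def good_suffix_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    shift = good_suffix_shifts(pattern)
    matches = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # right-to-left scan
            j -= 1
        if j < 0:
            matches.append(s)
            s += shift[0]
        else:
            s += shift[j + 1]
    return matches


print(good_suffix_shifts("aab"))             # [3, 3, 3, 1]
print(good_suffix_search("abcaaab", "aab"))  # [4]
```

A full Boyer-Moore search would take the larger of this shift and the bad character shift at each mismatch.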
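Shift-Or approach • An example of the shift-or algorithm for p = aab and s = abcaaab • For each text character, E becomes S(E) OR T[character], where S(E) shifts E by one position and brings in a 0 • A 0 in the last position of E signals a match

Table T (rows: pattern characters; columns: text characters; 0 marks equality)
      a  b  c
  a   0  1  1
  a   0  1  1
  b   1  0  1

States (bits listed for pattern positions a a b)
  text char   S(E)    T[char]   E
  (start)                       1 1 1
  a           0 1 1   0 0 1     0 1 1
  b           0 0 1   1 1 0     1 1 1
  c           0 1 1   1 1 1     1 1 1
  a           0 1 1   0 0 1     0 1 1
  a           0 0 1   0 0 1     0 0 1
  a           0 0 0   0 0 1     0 0 1
  b           0 0 0   1 1 0     1 1 0   (match: p ends at position 7 of s)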
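The shift-or update can be sketched with Python integers as bit vectors. This is a minimal illustration under stated assumptions: bit i of each mask stands for pattern position i (0-based), a 0 bit means "matches so far", and the name `shift_or_search` is hypothetical.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[ch] has a 0 bit at every position where the pattern holds ch.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones  # E starts as all 1s: nothing matched yet
    matches = []
    for pos, ch in enumerate(text):
        # E <- S(E) OR T[ch]; the left shift brings in a 0 at position 0.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        # A 0 in the last pattern position means a full match ends here.
        if state & (1 << (m - 1)) == 0:
            matches.append(pos - m + 1)
    return matches


print(shift_or_search("abcaaab", "aab"))  # [4]
```

For p = aab and s = abcaaab the only match starts at 0-based index 4, that is, it ends at the seventh text character. Each text character costs one shift and one OR, provided the pattern fits in a machine word; Python integers hide that limit.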