
Search Algorithms Winter Semester 2004/2005 15 Nov 2004 5th Lecture



  1. Search Algorithms, Winter Semester 2004/2005, 15 Nov 2004, 5th Lecture. Christian Schindelhauer, schindel@upb.de

  2. Chapter II: Searching in Compressed Text (15 Nov 2004)

  3. Searching in Compressed Text (Overview)
  • What is Text Compression
  • Definition
  • The Shannon Bound
  • Huffman Codes
  • The Kolmogorov Measure
  • Searching in Non-adaptive Codes
  • KMP in Huffman Codes
  • Searching in Adaptive Codes
  • The Lempel-Ziv Codes
  • Pattern Matching in Z-Compressed Files
  • Adapting Compression for Searching

  4. Ziv-Lempel-Welch (LZW) Codes
  • From the Ziv-Lempel family: LZ77, LZSS, LZ78, LZW, LZMW, LZAP
  • Literature:
  • LZW: Terry A. Welch: "A Technique for High Performance Data Compression", IEEE Computer, vol. 17, no. 6, June 1984, pp. 8-19
  • LZ77: J. Ziv, A. Lempel: "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, pp. 337-343
  • LZ78: J. Ziv, A. Lempel: "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, pp. 530-536
  • Known as the Unix command "compress"
  • Uses: TRIES

  5. Trie = "reTRIEval TREE"
  • Name taken from "reTRIEval"
  • Tree for storing/encoding text; allows efficient search for equal prefixes
  • Structure: edges are labelled with letters, nodes are numbered
  • Mapping: every node encodes a word of the text; the text of a node can be read on the path from the root to the node (e.g. node 1 = "m", node 6 = "at")
  • Inverse direction: every word uniquely points to a node (or at least some prefix points to a leaf), e.g. "it" = node 11; "manaman" points with "m" to node 1
  • Encoding of "manamanatapitipitipi": 1,2,3,4,5,6,7,8,9,10,11,12 or 1,5,4,5,6,7,11,10,11,10,8
  • Decoding of 5,11,2: "an", "it", "a" = "anita"
  [Figure: trie with root 0, edges labelled m, a, n, i, t, p, and nodes numbered 1-12]
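The node/edge mapping described on this slide can be sketched in a few lines of Python; the class name and the list-based representation are illustrative assumptions, not part of the lecture:

```python
# A minimal trie sketch: nodes are numbered, edges carry letters,
# and every node encodes the word spelled on the root-to-node path.
class Trie:
    def __init__(self):
        self.children = [{}]          # children[u]: letter -> child node number
        self.parent = [(None, "")]    # parent[u] = (parent node, edge letter)

    def insert(self, u, letter):
        """Append a new leaf below node u with the given edge letter."""
        v = len(self.children)
        self.children.append({})
        self.parent.append((u, letter))
        self.children[u][letter] = v
        return v

    def word(self, u):
        """Read the word of node u along the path back to the root."""
        letters = []
        while self.parent[u][0] is not None:
            parent, letter = self.parent[u]
            letters.append(letter)
            u = parent
        return "".join(reversed(letters))
```

With leaves inserted for "a" and "n", `word()` recovers the slide's node-to-word mapping in the inverse direction.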

  6. How LZW Builds a Trie
  • LZW works bytewise
  • Starts with the 256-leaf trie with leaves "a", "b", ..., numbered "a", "b", ...

  LZW-Trie-Builder(T)
    n ← length(T)
    i ← 1
    TRIE ← start-TRIE
    m ← number of nodes in TRIE
    u ← root(TRIE)
    while i ≤ n do
      if no edge with label T[i] under u then
        m ← m+1
        append leaf m to u with edge label T[i]
        u ← root(TRIE)
      else
        u ← node under u with edge label T[i]
      fi
      i ← i+1
    od

  [Figure: start trie with leaves a, b, c, ..., z; example text "nanananananana", scanned prefix "na" adds the leaf "na"]

  7. How LZW Builds a Trie (continued)
  • Same procedure as on the previous slide, with the example advanced
  [Figure: example text "nanananananana"; scanned so far "na...", current path "nan", residual part of the text still to be read]

  8. How LZW Produces the Encoding

  LZW-Encoder(T)
    n ← length(T)
    i ← 1
    TRIE ← start-TRIE
    m ← number of nodes in TRIE
    u ← root(TRIE)
    while i ≤ n do
      if no edge with label T[i] under u then
        output (m, u, T[i])
        m ← m+1
        append leaf m to u with edge label T[i]
        u ← root(TRIE)
      else
        u ← node under u with edge label T[i]
      fi
      i ← i+1
    od
    if u ≠ root(TRIE) then output (u) fi

  • start-TRIE = 256-leaf trie with bytes encoded as 0, 1, 2, ..., 255
  • The output m is predictable: 256, 257, 258, ... Therefore use only output (u, T[i])
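The encoder loop above can be sketched in Python; as an assumed simplification, the trie is kept as a dictionary mapping each node's text to its code number, and pairs (u, T[i]) are emitted exactly as the slide's output statement suggests:

```python
def lzw_encode(text):
    """LZW encoding sketch: output pairs (u, c) of a node number and the
    next letter, as on the slide; the trailing 1-tuple is the residual node."""
    codes = {chr(b): b for b in range(256)}   # start trie: one leaf per byte
    m = 255                                   # highest code assigned so far
    out = []
    w = ""                                    # current node u, as its text
    for ch in text:
        if w + ch in codes:                   # edge with label ch exists below u
            w += ch
        else:
            out.append((codes[w], ch))        # output (u, T[i])
            m += 1
            codes[w + ch] = m                 # append leaf m below u
            w = ""                            # back to the root
    if w:                                     # u != root at end of text
        out.append((codes[w],))
    return out
```

Running it on the lecture's example "manamanatapitipitipi" reproduces the output sequence (m,a) (n,a) (256,n) (a,t) (a,p) (i,t) (i,p) (261,i) (p,i) shown on the next slide.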

  9. An Example Encoding
  (encoder as on the previous slide, using output (u, T[i]))
  Encoding of "m a n a m a n a t a p i t i p i t i p i":
  • output: (m,a) (n,a) (256,n) (a,t) (a,p) (i,t) (i,p) (261,i) (p,i)
  • new nodes: 256 = "ma", 257 = "na", 258 = "man", 259 = "at", 260 = "ap", 261 = "it", 262 = "ip", 263 = "iti", 264 = "pi"
  [Figure: trie with root 0 and the nodes 256-264 appended below the byte leaves m, n, a, i, t, p]

  10. The Decoder

  LZW-Decoder(Code)
    TRIE ← start-TRIE
    m ← 255
    for i ← 0 to 255 do C(i) ← letter with byte code i od
    while not end of file do
      (u,c) ← read-next-two-symbols(Code)
      if c exists then
        output (C(u), c)
        m ← m+1
        append leaf m to u with edge label c
        C(m) ← (C(u), c)
      else
        output (C(u))
      fi
    od

  • If the last string of the code did not produce a new node in the trie, then output the corresponding string
  • Example: (m,a) (n,a) (256,n) (a,t) (a,p) (i,t) (i,p) (261,i) (p,i) decodes to "ma na (ma)n at ap it ip (it)i pi" = "manamanatapitipitipi"
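A matching decoder sketch for the (u, c) pair format described above; as on the slide, the code table is rebuilt while reading, and a bare (u,) is the residual node at the end (dict-based simplification assumed):

```python
def lzw_decode(pairs):
    """Decoder sketch for (node, letter) pairs: rebuild the code table
    while reading; a 1-tuple is the residual node output at end of text."""
    table = {b: chr(b) for b in range(256)}   # start trie: one leaf per byte
    m = 255
    out = []
    for pair in pairs:
        if len(pair) == 2:
            u, c = pair
            out.append(table[u] + c)          # output (C(u), c)
            m += 1
            table[m] = table[u] + c           # append leaf m, as in the encoder
        else:                                 # last code: no new node created
            out.append(table[pair[0]])
    return "".join(out)
```

Because the letter c is transmitted explicitly with every node, the decoder never sees a code it has not yet built, so no special case is needed.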

  11. Performance of LZW
  • Encoding can be performed in time O(n), where n is the length of the given text
  • Decoding can be performed in time O(n), where n is the length of the uncompressed output
  • The memory consumption is linear in the size of the compressed code
  • LZW can be nicely implemented in hardware
  • There is no software patent, so it is very popular; see "compress" for UNIX
  • LZW can be further compressed using Huffman codes: every second character is a plain copy from the text!
  • Search in LZW is difficult:
  • The encoding is embedded in the text (adaptive encoding)
  • For one search in a text there is a linear number of possibilities of encodings of the search pattern (!)

  12. The Algorithm of Amir, Benson & Farach: "Let Sleeping Files Lie"
  • Ideas:
  • Build the trie, but do not decode
  • Use the KMP-Matcher with the nodes of the LZW-trie
  • Prepare a data structure based on the pattern; then scan the text and update this data structure
  • Goal: running time of O(n + f(m)), where n is the code length and f(m) is some small polynomial in the pattern length m
  • For well-compressed codes and f(m) < n it should be faster than decoding followed by text search

  13. Searching in LZW-Codes: Inside a Node
  • Example: search for "tapioca"
  • If "tapioca" is "inside" a node, i.e. the text of the node contains the pattern, then we have found tapioca
  • For all nodes u of the trie set is_inside[u] = 1 if the text of u contains the pattern
  [Figure: text "...abtapiocaab..." encoded by a single node; tapioca lies inside that node]

  14. Searching in LZW-Codes: Torn Apart
  • Example: search for "tapioca"
  • The parts of the pattern are hidden in several nodes of the LZW-trie
  • The pattern starts somewhere inside a node; its end is the start of another node
  [Figure: the pattern split as "tap" | "io" | "ca..." over consecutive nodes]

  15. Finding the Start: longest_prefix (Suffix of the Node = Prefix of the Pattern)
  • Is the suffix of a node's text a prefix of the pattern, and if yes, how long is it?
  • Classify all nodes of the trie
  • For very long text encoded by a node, only the last m letters matter
  • Can be computed using the KMP-Matcher algorithm while building the trie
  • Example for pattern "manamana":
  • node text "...pamana": the last four letters are the first four of the pattern, result: 4
  • node text "...papa": result: 0
  • node text "...mana": result: 4
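The classification above can be sketched by feeding a node's text through the KMP automaton of the pattern: the automaton state after the last letter is exactly the length of the longest suffix of the node text that is a prefix of the pattern. The function names are illustrative:

```python
def kmp_failure(P):
    """Standard KMP failure function: fail[i] = length of the longest
    proper prefix of P[:i] that is also a suffix of P[:i]."""
    fail = [0] * (len(P) + 1)
    k = 0
    for i in range(1, len(P)):
        while k > 0 and P[i] != P[k]:
            k = fail[k]
        if P[i] == P[k]:
            k += 1
        fail[i + 1] = k
    return fail

def longest_prefix(node_text, P):
    """Length of the longest suffix of node_text that is a prefix of P."""
    fail = kmp_failure(P)
    k = 0                       # current KMP automaton state
    for ch in node_text:
        while k > 0 and ch != P[k]:
            k = fail[k]
        if ch == P[k]:
            k += 1
        if k == len(P):         # full match of P inside the node text
            k = fail[k]
    return k
```

For pattern "manamana" this reproduces the slide's examples: "pamana" gives 4, "papa" gives 0.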

  16. Is the Node Inside the Pattern?
  • Find the positions where the text of the node occurs inside the pattern; several occurrences are possible (e.g. for a single letter)
  • There are at most m(m-1)/2 encodings of such sub-strings
  • For every sub-string there is exactly one node that fits
  • Define a table Inside-Node of size O(m²): Inside-Node[start,end] := node that encodes pattern P[start]..P[end]
  • From Inside-Node[start,end] one can derive Inside-Node[start,end+1] as soon as the corresponding node is created
  • To quickly find all occurrences, use a pointer Next-inside-occurrence(start,end) that indicates the next position where the substring lies; it is initialized for start = end with the next occurrence of the letter
  • Example for pattern "manamana": "ana" could be at positions 2-4 or 6-8; "anam" → (2,5); "rorororororo" is not in the pattern → (0,0)

  17. Finding the End: longest_suffix (Prefix of the Node = Suffix of the Pattern)
  • Is the prefix of a node's text a suffix of the pattern? And if yes, does it complete the pattern if i letters have already been found?
  • Classify all nodes of the trie
  • For very long text encoded by a node, only the first m letters matter
  • Since the text is appended at the right side, this property can be derived from the ancestor
  • Example for pattern "manamana": node text "ananimal...": both 3 and 1 are possible solutions; we take 3, because 1 can be derived from 3 using the technique from the KMP-Matcher (the prefix function π on the reversed string)
  • node text "manamanamana...": result: 8; node text "panamacanal...": result: 0

  18. How Does It Fit?
  • On the left side we have the maximal prefix of the pattern (matched at the end of the text scanned so far); on the right side the maximal suffix of the pattern (at the start of the new node)
  • Example: 10-letter pattern "pamapamapa"; an 8-letter prefix was found, followed by a 6-letter suffix: 14 letters in total
  • Yet the pattern is inside, since the first 8 letters followed by the last 6 letters of the pattern together contain the pattern
  • Solution: define a prefix-suffix table PS-T[p,s] = 1 if the p-letter prefix of P followed by the s-letter suffix of P contains the pattern

  19. Computing the PS-Table in Time O(m³)
  • For all p and s with p + s ≥ m compute PS-T[p,s]
  • Run the KMP-Matcher for pattern P in P[1..p]P[m-s+1..m]; needs time O(m) for each combination of p and s
  • Leads to a run time of O(m³)
  • Example: 10-letter pattern "pamapamapa": the pattern is found in P[1..8]P[5..10] = "pamapamapamapa", hence PS-T[8,6] = 1
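The O(m³) computation above can be sketched directly; as an assumed simplification, Python's substring test stands in for the KMP run over the prefix-suffix concatenation (indices are 0-based here, so PS[p][s] refers to the p-letter prefix and s-letter suffix):

```python
def compute_ps_table(P):
    """Naive PS-table sketch: PS[p][s] = 1 iff the p-letter prefix of P
    followed by the s-letter suffix of P contains P as a substring."""
    m = len(P)
    PS = [[0] * (m + 1) for _ in range(m + 1)]
    for p in range(m + 1):
        for s in range(m + 1):
            # the concatenation is shorter than P unless p + s >= m,
            # so only those entries can ever become 1
            if P in P[:p] + P[m - s:]:
                PS[p][s] = 1
    return PS
```

For the slide's pattern "pamapamapa", the 8-letter prefix plus the 6-letter suffix give "pamapamapamapa", which contains the pattern, so PS[8][6] = 1.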

  20. Computing the Prefix-Suffix-Table in Time O(m²): Preparation
  • ptr[i,j] = next position to the left of i where the suffix of P of length j occurs
  • ptr[i,j] = max({k < i | P[m-j+1..m] = P[k..k+j-1]} ∪ {0})
  [Figure: occurrences of the short suffixes of "pamapamapa" marked inside the pattern]

  21. Computing the Prefix-Suffix-Table in Time O(m²): Initialization

  Init-ptr(P)
    m ← length(P)
    for i ← 1 to m do ptr[i,0] ← i-1 od
    for j ← 1 to m-1 do
      last ← m-j+1
      i ← ptr[last+1, j-1] - 1
      while i > 0 do
        if P[i] = P[last] then
          ptr[last, j] ← i
          last ← i
        fi
        i ← ptr[i+1, j-1] - 1
      od
    od

  • Run time: O(m²)

  22. Computing the Prefix-Suffix-Table in Time O(m²)

  Init-PS-T(P)
    m ← length(P)
    ptr ← Init-ptr(P)
    for i ← 1 to m-1 do
      j ← i+1
      while j ≠ 0 do
        PS-T[i, m-j+1] ← 1
        j ← ptr[j, m-i]
      od
    od

  • Example for "pamapamapa": PS-T[8,8] = 1, PS-T[8,6] = 1, PS-T[8,2] = 1 (found via the pointers ptr[9,2] and ptr[5,2])

  23. ABF-LZW-Matcher(LZW-Code C, uncompressed pattern P)
    n ← length(C), m ← length(P)
    Init-PS-T(P)
    longest_prefix[P[1]] ← 1                 /* longest prefix of P can be found in node P[1] */
    longest_suffix[P[m]] ← 1                 /* longest suffix of P can be found in node P[m] */
    for i ← 1 to m do inside_node[i,i] ← P[i] od   /* only single-node characters can be inside P */
    π ← Compute-Prefix(P)
    TRIE ← start-TRIE                        /* standard LZW-trie initialization */
    v ← 255
    prefix ← 0
    for i ← 0 to 255 do C(i) ← letter with byte code i od
    for l ← 1 to n do
      (u,c) ← read-next-two-symbols(Code)
      v ← v+1
      Update_DS()                            /* insert new node v into the data structure */
      Check_for_Occurrences()                /* check for occurrences of P */
    od

  24. Update the Data Structure

  Update_DS()
    length[v] ← length[u]+1                  /* omitted: C[v] ← C[u]c */
    is_inside[v] ← is_inside[u]              /* if u contains the pattern, so does v */
    if longest_prefix[u] < m and P[longest_prefix[u]+1] = c then
      longest_prefix[v] ← longest_prefix[u]+1
    fi
    if length[u] < m then
      for all entries (start,end) of u in inside_node do
        if P[end+1] = c and end < m then
          inside_node[start,end+1] ← v
          link new entry of v
        fi
      od
    fi
    if longest_suffix[u] < length[u] or P[length[v]] ≠ c then
      longest_suffix[v] ← longest_suffix[u]
    else
      longest_suffix[v] ← 1 + longest_suffix[u]
      if longest_suffix[v] = m then is_inside[v] ← 1 fi
    fi

  • There is a linked list for u of all positions in inside_node pointing to u; this occurs at most m² times over all rounds
  • Examples: "xyzmana" + "m" = "xyzmanam"; "manamm" + "x" = "manammx"; "manama" + "n" = "manaman"

  25. Check for Occurrences

  Check_for_Occurrences()
    if is_inside[v] = 1 then
      report "pattern found at l"
      prefix ← longest_prefix[v]
    else if prefix = 0 then
      prefix ← longest_prefix[v]
    else if prefix + length[v] < m then
      while prefix ≠ 0 and inside_node[prefix+1, prefix+length[v]] ≠ v do
        prefix ← π(prefix)                   /* like in the KMP-Matcher */
      od
      if prefix = 0 then prefix ← longest_prefix[v]
      else prefix ← prefix + length[v]
      fi
    else
      suffix ← longest_suffix[v]
      if PS-T[prefix, suffix] = 1 then
        report "pattern found at l"
      fi
      prefix ← longest_prefix[v]
    fi

  • The KMP-style while loop occurs at most |Σ| m² times
  [Figure: text "...xyzmanamanaxyz..." split over the nodes "xyzmana" and "namanaxy..."]

  26. Running Time of the Matcher
  • Initialization needs time O(m²)
  • Amortized analysis leads to (additional) time for checking of inner words of O(min{N, |Σ| m²})
  • Every inner word occurs at most |Σ| times
  • N is the length of the uncompressed text; n is the length of the compressed text
  • Run time: O(n + m² + min{N, |Σ| m²})
  • For small search patterns faster than the alternative, which is: decompress and apply the Boyer-Moore-Matcher

  27. Text Compression Allowing Fast Searching Directly
  "A Text Compression Scheme that Allows Fast Searching Directly in the Compressed File", Udi Manber, ACM Trans. Inf. Systems, Vol. 15, No. 2, 1997, pp. 124-136
  • Idea:
  • Do not use LZ compression or Huffman codes
  • Combine some letter pairs (a,b) and encode them into the "free" ASCII space (128-255)
  • Let f(a,b) denote the weight of such a pair
  • Encode the 128 most frequent pairs into a letter of {128,...,255} each
  • Use only pairs from V1 × V2, where the sets V1 and V2 are disjoint, so that pair encodings cannot overlap
  • The sum of the weights f(a,b) gives the compression ratio
  • Then one can apply the Boyer-Moore algorithm directly on the code, since pattern and text will be encoded with the same byte string
  • Problem: choosing these sets optimally is NP-complete!
  • Solution: a greedy heuristic (of unclear performance) gives a compression rate of 28-33%
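The pairing idea above can be sketched in Python. This is a deliberately simplified greedy version: it counts adjacent pairs, assigns the most frequent ones to codes 128-255, and scans left to right; the paper's V1 × V2 disjointness constraint (which makes the encoding position-independent and hence searchable) is skipped here, and the function name is an illustrative assumption:

```python
from collections import Counter

def pair_compress(text, num_codes=128):
    """Greedy sketch of pair encoding into the free ASCII space:
    replace frequent adjacent letter pairs by single bytes 128..255."""
    # f(a,b): frequency of each adjacent pair in the text
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    chosen = [p for p, _ in pairs.most_common(num_codes)]
    code = {p: 128 + k for k, p in enumerate(chosen)}
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in code:      # emit one code byte for the pair
            out.append(code[text[i:i + 2]])
            i += 2
        else:                          # copy the single letter through
            out.append(ord(text[i]))
            i += 1
    return out, code
```

Because the pattern is encoded with the same table, a byte-level matcher such as Boyer-Moore can then run directly on the compressed form, which is the point of the scheme.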

  28. Example
  • The most common "digraphs": th er on an re he in ed nd ha ...
  • Encoding: f(th) = 128, f(er) = 129, f(on) = 130, f(an) = 131, f(ed) = 132
  • No compression for: re, he, in, nd, ha
  [Figure: the sets V1 and V2 for the encoded pairs]

  29. Chapter III: Searching the Web (15 Nov 2004)

  30. Problems of Searching the Web
  • Currently (Nov 2004) more than 8 billion = 8,000 million web pages
  • 10,000 words cover more than 95% of each text, so there are many more web pages than words
  • Users hardly ever look through more than 40 results
  • The problem is not to find a pattern, but to find the most important pages
  • Problems:
  • Important pages do not contain the search pattern: www.porsche.com does not contain "sports car" or even "car"; www.google.com does not contain "web search engine"; www.airbus.com does not contain "airplane"
  • Certain pages contain nearly every word (dictionaries)
  • Names are misleading: http://www.whitehouse.org/ is not the web site of the White House; www.theonion.com is not about vegetables
  • Certain patterns can be found everywhere, e.g. "page", "web", "windows", ...

  31. How to Rank Web Pages
  • The main problem in searching the web is ranking pages by importance
  • Links are very helpful:
  • Links are usually introduced on purpose
  • The context of a link gives some clues about the meaning of the target page
  • Pages that many people point to are probably very important
  • Most search engines rely on links
  • Other approach: an ontology of words
  • Compare the combination of words with the search word
  • Good for comparing texts; difficult if single-word patterns are given

  32. Thanks for your attention. End of 5th lecture.
  Next lecture: Mon 22 Nov 2004, 11.15 am, FU 116
  Next exercise class: Mon 15 Nov 2004, 1.15 pm, F0.530, or Wed 17 Nov 2004, 1.00 pm, E2.316
