Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA

The 8th: Misc. topics of pattern matching
• Method for multi-byte code texts
• Toward an intelligent pattern matching:
  • Pattern matching for XML data
  • Pattern matching on texts with arc annotation
  • Pattern matching with taxonomy data
• Appendix: Randomized algorithm

Method for multi-byte code texts (Japanese texts)
• Synchronization problem of codewords:
  • False detections occur when we run pattern matching on a Japanese text byte by byte, as if it were ASCII text.
  • We must determine the boundaries of characters, just as for Huffman codes.
• Example (Japanese EUC encoded text):
  • Text T = TFT液晶の時代 ("the age of the TFT liquid crystal display") → byte sequence 54 46 54 B1 D5 BE BD A4 CE BB FE C2 E5
  • Pattern P = 修了 ("completion") → byte sequence BD A4 CE BB
  • The bytes of P occur in T from the second byte of 晶 (BE BD) to the first byte of 時 (BB FE), although the characters 修了 do not occur in T.
[Figure: AC machine for the pattern P = "BD A4 CE BB" (修了), states 0 to 4, with a loop on state 0 labeled ∑ − {BD}]
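
The false detection above can be reproduced with a few lines of Python. This is only a minimal sketch: the byte values are those on the slide, the lead-byte ranges follow the code automaton shown later in this lecture, and the function names (`euc_char_starts`, `naive_find_all`, `synchronized_find_all`) are hypothetical helpers, not part of any library.

```python
def euc_char_starts(data: bytes) -> set[int]:
    """Return the byte offsets at which an EUC character starts."""
    starts, pos = set(), 0
    while pos < len(data):
        starts.add(pos)
        lead = data[pos]
        if lead == 0x8F:                    # lead byte of a three-byte code
            pos += 3
        elif lead == 0x8E or lead >= 0xA0:  # lead byte of a two-byte code
            pos += 2
        else:                               # single-byte code (e.g. ASCII)
            pos += 1
    return starts


def naive_find_all(text: bytes, pattern: bytes) -> list[int]:
    """Byte-wise search that ignores character boundaries."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def synchronized_find_all(text: bytes, pattern: bytes) -> list[int]:
    """Keep only the hits that begin on a character boundary."""
    starts = euc_char_starts(text)
    return [s for s in naive_find_all(text, pattern) if s in starts]


TEXT = bytes.fromhex("544654B1D5BEBDA4CEBBFEC2E5")  # TFT液晶の時代
PATTERN = bytes.fromhex("BDA4CEBB")                 # 修了

print(naive_find_all(TEXT, PATTERN))         # byte-wise hits
print(synchronized_find_all(TEXT, PATTERN))  # hits on character boundaries
```

Filtering after the search is a simplification chosen for brevity; the approach in the lecture builds the synchronization into the automaton itself so that the text is scanned only once.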

Review: Solution by automaton with synchronization
M. Miyazaki, S. Fukamachi, M. Takeda: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPSJ, Vol. 39, No. 9, pp.2638-2648, 1998.
• Pattern P = DEC, Huffman encoded pattern E(P) = 011001
• Text T = ABECA…, Huffman encoded text E(T) = 0000000110010000…
[Figure: the Huffman tree over A, B, C, D, E; the ordinary KMP automaton for E(P); and the KMP automaton with synchronization]
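
The same synchronization problem can be checked on the Huffman example. The sketch below assumes the code table A=0000, B=0001, C=001, D=01, E=1; the tree itself is lost in this transcript, so the table is inferred as one prefix code that reproduces both E(P) and E(T) given on the slide. The helper names are hypothetical.

```python
# Assumed code table, inferred from E(P) and E(T) on the slide.
CODE = {"A": "0000", "B": "0001", "C": "001", "D": "01", "E": "1"}
DECODE = {bits: symbol for symbol, bits in CODE.items()}


def encode(symbols: str) -> str:
    return "".join(CODE[s] for s in symbols)


def codeword_starts(bits: str) -> set[int]:
    """Decode the bit string and collect the offsets where codewords start."""
    starts, current = {0}, ""
    for i, bit in enumerate(bits):
        current += bit
        if current in DECODE:      # a complete codeword has been read
            starts.add(i + 1)
            current = ""
    return starts


def find_all(bits: str, pattern: str) -> list[int]:
    hits, start = [], bits.find(pattern)
    while start != -1:
        hits.append(start)
        start = bits.find(pattern, start + 1)
    return hits


encoded_text = encode("ABECA")
encoded_pattern = encode("DEC")
bit_hits = find_all(encoded_text, encoded_pattern)
synced_hits = [h for h in bit_hits if h in codeword_starts(encoded_text)]
print(encoded_text, encoded_pattern, bit_hits, synced_hits)
```

The bit-level search reports a hit even though DEC does not occur in ABECA, because the hit starts in the middle of the codeword for B; restricting hits to codeword boundaries removes it.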

PM on multi-byte code texts by an automaton with synchronization
M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.
• Text T = TFT液晶の時代 (Japanese EUC encoded text) → byte sequence 54 46 54 B1 D5 BE BD A4 CE BB FE C2 E5
• Pattern P = 修了 → BD A4 CE BB
• An AC machine with synchronization correctly detects the (EUC encoded) pattern P = "修了".
• The part for synchronization is a code automaton accepting any EUC code.
[Figure: the code automaton, with transitions labeled [00-8D, 90-9F], [8E, A0-FF], [8F], and [A0-FF] and outputs {half-width char.} and {full-width char.}, combined with the AC machine (states 0 to 4) for BD A4 CE BB; state 0 is left on BD, and the bytes [8E, A0-FF] ∖ [BD] lead into the part for synchronization]

Idea of bit-parallel technique
• Pattern P: ababb
• Text T: abababba
• Mask table M (bit j of M(c) is 1 if and only if P[j] = c):
     j:  1 2 3 4 5
     P:  a b a b b
  M(a):  1 0 1 0 0
  M(b):  0 1 0 1 1
• R_i = ((R_{i-1} << 1) | 1) & M(T[i])
  This can be calculated in O(1) time.
• State vectors (bit 1 written first):
     i:     0     1     2     3     4     5     6     7     8
  T[i]:           a     b     a     b     a     b     b     a
   R_i: 00000 10000 01000 10100 01010 10100 01010 00001 10000
  Bit 5 is set at i = 7, so an occurrence of P ends there.
※ Taking the AND with the mask bits M keeps only the bits that were shifted along a correct transition.
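
The recurrence can be tried directly in Python. The following is a minimal sketch of the Shift-And method for patterns short enough to fit in one integer; the function name `shift_and` is a hypothetical helper, and bit j-1 of the integer plays the role of bit j on the slide.

```python
def shift_and(text: str, pattern: str) -> list[int]:
    """Return the 0-based end positions of the occurrences of pattern in text."""
    # Mask table M: bit i of masks[c] is 1 iff pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (len(pattern) - 1)   # bit of the last pattern position
    state, ends = 0, []
    for pos, ch in enumerate(text):
        # R_i = ((R_{i-1} << 1) | 1) & M(T[i])
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and("abababba", "ababb"))
```

Each text character costs one shift, one OR, and one AND, which is the O(1) update stated above as long as the pattern length does not exceed the word length.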

Bit-parallel method for multi-byte code texts
Heikki Hyyrö, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda: On Bit-Parallel Processing of Multi-byte Strings, Proc. of Asia Information Retrieval Symposium, pp.190-196, 2004.
• Basic idea:
  • We construct a pattern matching machine (code automaton) that determines the boundaries of codewords and recognizes each multi-byte character in the input pattern.
  • The code automaton reads the text byte by byte and outputs the mask bit sequence corresponding to each character in the input pattern.
  • We simulate an arbitrary bit-parallel algorithm by using the output M(T[i]) of the code automaton instead of reading T[i].
• Example: a code automaton that determines the boundaries of EUC codes and recognizes "修" (BD A4) and "了" (CE BB), with outputs M[修] = 01 and M[了] = 10.
• The bit-parallel algorithm itself is unchanged, e.g. R_i = ((R_{i-1} << 1) | 1) & M(T[i]).
[Figure: the code automaton, with transitions labeled [00-8D, 90-9F], [8E, A0-FF]/[BD, CE], [8F], [A0-FF], BD, A4, CE, and BB]
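
The combination of the two components can be sketched as follows. This is a simplified stand-in, not the construction of Hyyrö et al.: instead of a byte-level automaton, it cuts the text into characters with the lead-byte ranges from the slide and looks the mask up in a dictionary. The names `euc_char_len` and `multibyte_shift_and` are hypothetical.

```python
def euc_char_len(lead: int) -> int:
    """Length in bytes of the EUC character starting with this lead byte."""
    if lead == 0x8F:
        return 3
    if lead == 0x8E or lead >= 0xA0:
        return 2
    return 1


def multibyte_shift_and(text: bytes, pattern_chars: list[bytes]) -> list[int]:
    """Return the byte offsets at which the character sequence occurs."""
    # Mask per pattern character (the "output" of the code automaton).
    masks = {}
    for i, ch in enumerate(pattern_chars):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (len(pattern_chars) - 1)
    pattern_bytes = sum(len(ch) for ch in pattern_chars)
    state, pos, hits = 0, 0, []
    while pos < len(text):
        n = euc_char_len(text[pos])
        ch = text[pos:pos + n]          # one whole character
        pos += n
        # Characters that are not in the pattern get the all-zero mask.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - pattern_bytes)
    return hits


SHURYO = [bytes.fromhex("BDA4"), bytes.fromhex("CEBB")]   # 修, 了
print(multibyte_shift_and(bytes.fromhex("544654B1D5BEBDA4CEBBFEC2E5"), SHURYO))
print(multibyte_shift_and(bytes.fromhex("A4CEBDA4CEBB"), SHURYO))
```

The state is updated once per character rather than once per byte, so a byte sequence that straddles two characters can never set the accepting bit.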

Toward an intelligent pattern matching
• Until now …
  • Text = just a sequence of characters (we have ignored the background knowledge about the text and the meaning of sentences).
  • Fast! Fast! Fast!
• From now on …
  • Text = a sequence of sentences that have meanings and/or structures
  • We need an intelligent pattern matching (of course, at high speed!)
• Pattern matching in consideration of the structure of the text
  • Pattern matching for XML texts
  • Pattern matching for texts with arc annotation
  • etc.
• Pattern matching in consideration of the meaning of the text (cooperating with ontology data)
  • Pattern matching in consideration of the taxonomic information
  • Thesaurus, inductive rules, etc.

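Pattern matching for XML texts: previous ones
• The XML document is read by an XML parser and stored in memory or in an RDB.
• The application program accesses it through the DOM API or SQL.
[Figure: XML document → XML parser → tree in memory (person → name → first "Makiko", last "Tanaka") or RDB table of paths (person, person/name, person/name/first = Makiko, person/name/last = Tanaka) → application program via DOM API or SQL]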

Pattern matching for XML texts: our approach
M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.
• The pattern matching algorithm scans the XML document as is and passes its output to the application program.
• Example XML document:
  <person> <name> <first> Makiko </first> <last> Tanaka </last> </name> </person>
[Figure: XML document → pattern matching algorithm (in memory) → application program]

Advantage of pattern matching approach
• It can process a huge XML document, or a large number of documents, in a batch.
• It can handle many queries at once.
• Fast processing, in little memory space, for various applications.
[Figure: XML document and its tree structure]

Problem in a simple pattern matching algorithm
• Wrong detection: a pattern may match part of a tag name.
• Is the match inside or outside of a tag?
• Pattern Π = {other, <mother>}
  <body> <h1>That TVCM</h1> <p> <mother> "mother" </mother> </p> </body>
  If we remove "m", it becomes <other> "other" </other>.

A solution
• An ordinary AC machine for Π = {other, <mother>}: states 0 to 5 spell "other" and states 6 to 13 spell "<mother>"; state 0 loops on ∑.
• An AC machine in consideration of XML tags: the loops are restricted to characters other than '<', and the additional states 14 and 15 separate the inside of a tag from the outside.
[Figure: the ordinary AC machine and the AC machine in consideration of XML tags, both with outputs "other" and "<mother>"]

Handling of attributes
• <mother>, <mother nature="tender">, <mother nature="hard">, … are all the same tag <mother>.
[Figure: the AC machine in consideration of XML tags, before and after adding state 16; transitions are labeled "other than '<'" (characters other than <) and "other than '>'" (characters other than >)]
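
The effect of the tag-aware machine can be imitated with a simple scanner. The sketch below is not an AC machine: it uses `str.find` for brevity and only illustrates the two rules, namely that keywords are matched outside tags and that a tag matches by name regardless of its attributes. The function name `xml_aware_search` is hypothetical.

```python
def xml_aware_search(doc: str, keywords: list[str], tag_names: list[str]):
    """Return (offset, pattern) pairs for keywords outside tags and for tags by name."""
    hits, pos = [], 0
    while pos < len(doc):
        if doc[pos] == "<":
            # Inside a tag: read up to '>' and compare the tag name only.
            end = doc.find(">", pos)
            if end == -1:
                break
            parts = doc[pos + 1:end].split()
            name = parts[0] if parts else ""
            if name in tag_names:
                hits.append((pos, "<" + name + ">"))
            pos = end + 1
        else:
            # Outside tags: search keywords in the text up to the next '<'.
            end = doc.find("<", pos)
            if end == -1:
                end = len(doc)
            segment = doc[pos:end]
            for kw in keywords:
                start = segment.find(kw)
                while start != -1:
                    hits.append((pos + start, kw))
                    start = segment.find(kw, start + 1)
            pos = end
    return hits


doc = '<mother nature="tender">other</mother>'
print(doc.count("other"))                               # naive substring count
print(xml_aware_search(doc, ["other"], ["mother"]))     # tag-aware result
```

The naive count also sees "other" inside the opening and closing tags; the scanner reports only the opening tag <mother> and the keyword in the text content.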

Pattern matching in consideration of XML path
• Query: I want to look for the persons whose family name is "Tanaka" (in XPath terms, the element //person/name/last is equal to "Tanaka").
• Tag patterns: {<person>, </person>, <name>, </name>, <last>, </last>, …}; keyword patterns: {Tanaka}
• A stack of (tag, state) pairs records the current path, e.g. (<xml>,0), (<person>,0), (<name>,1), (<last>,2).
[Figure: a path automaton with states 0 to 3 that advances on <person>, <name>, <last> (state 0 loops on tags other than <person>), combined with the AC machine in consideration of XML tags]
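
The role of the stack can be illustrated with a short scanner. This is a rough sketch under simplifying assumptions: tags are tokenized with a regular expression instead of an AC machine, the stack holds tag names only (not the (tag, state) pairs of the slide), and self-closing tags, comments, and CDATA are not handled. The function name `find_by_path` is hypothetical.

```python
import re

# A token is an opening/closing tag or a run of text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*>|([^<]+)")


def find_by_path(doc: str, path: list[str], keyword: str) -> list[int]:
    """Offsets of text nodes equal to keyword whose enclosing tags end with path."""
    stack, hits = [], []
    for tok in TOKEN.finditer(doc):
        closing, name, text = tok.groups()
        if name is not None:
            if closing:
                if stack:
                    stack.pop()         # leave the element
            else:
                stack.append(name)      # enter the element
        elif text.strip() == keyword and stack[-len(path):] == path:
            hits.append(tok.start() + text.index(keyword))
    return hits


doc = ("<person> <name> <first> Makiko </first> "
       "<last> Tanaka </last> </name> </person>")
print(find_by_path(doc, ["person", "name", "last"], "Tanaka"))
```

Comparing only the top of the stack with the path corresponds to the leading // of the query: the matched person element may appear at any depth.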

Processible subset of XPath
• Limitation of pattern matching approach
  • We cannot specify the predecessor nodes.
  • Complex filter specifications remarkably decrease the processing speed.
• Grammar of the processible subset:
  LocationPath ::= '/' RelativeLocationPath
  RelativeLocationPath ::= Step | RelativeLocationPath '/' Step
  Step ::= AxisSpecifier NodeTest
  AxisSpecifier ::= AxisName '::'
  AxisName ::= 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace'
  NodeTest ::= QName | NodeType '(' ')'
  NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction'
• Example: /descendant::cars/child::car/attribute::node() (abbreviated form: //cars/car/@*)

Speed comparison with Sgrep
• Comparison with Sgrep (J. Jaakkola and P. Kilpeläinen)
• Text: 110MB (English text)
• CPU: Celeron 366MHz
• Memory: 128MB
• OS: Kondara MNU/Linux 2.1 RC2
[Figure: CPU time (sec.)]

Pattern matching for texts with arc-annotation
• Definition:
  • The arc annotation A that accompanies a sequence S is a set of pairs of integers from {1, 2, …, |S|}.
  • Each element (iL, iR) ∈ A is called an arc.
  • S[iL] and S[iR] are called the left endpoint and the right endpoint, respectively.
  • For an arbitrary arc, we assume that iL < iR.
  • Moreover, no two arcs share the same integer; that is, no two arcs share an endpoint.
[Figure: an example of a text with arc annotation, the sequence A G T C A C G C C C G T at positions 1 to 12 with arcs drawn over it]
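
The conditions of the definition translate into a small validity check. The sketch below represents an arc annotation as a set of 1-based pairs; the function name and the example arcs are hypothetical, since the arcs of the figure are not recoverable from this transcript.

```python
def is_valid_arc_annotation(length: int, arcs) -> bool:
    """Check the conditions of the definition for a sequence of the given length."""
    used = set()
    for left, right in arcs:
        # Each arc (iL, iR) must satisfy 1 <= iL < iR <= |S|.
        if not (1 <= left < right <= length):
            return False
        # No two arcs may share an endpoint.
        if left in used or right in used:
            return False
        used.update((left, right))
    return True


S = "AGTCACGCCCGT"
print(is_valid_arc_annotation(len(S), {(1, 12), (3, 7)}))   # hypothetical arcs
print(is_valid_arc_annotation(len(S), {(1, 5), (5, 9)}))    # position 5 is shared
```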

Example of text with arc annotation
• An example of the tRNA (tRNAPhe) two-dimensional structure: …ACACCUAGCΨTGUGU…
• The string has nested arcs.

Arc-preserving subsequence (APS) problem
• Given a text S1 = S1[1 : n] and a pattern S2 = S2[1 : m] with arc annotations A1 and A2, respectively, the APS problem is to answer whether the following conditions are satisfied:
  • S2 is a subsequence of S1.
  • Two positions of the pattern are joined by an arc if the corresponding positions of the text are joined by an arc, and vice versa.
• Example: Text S1 := A G T C A C G C C C G T, Pattern S2 := A T G C T
[Figure: an embedding of S2 into S1, marked ○ for base match and × for arc match]
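
For very small inputs the two conditions can be checked by brute force. The sketch below enumerates all embeddings and is exponential in general; it only pins down the definition and is not the GGN algorithm or the algorithm of Kida[2005]. The function name and the example arcs are hypothetical.

```python
from itertools import combinations


def aps_bruteforce(s1: str, arcs1, s2: str, arcs2) -> bool:
    """True if s2 with arcs2 is an arc-preserving subsequence of s1 with arcs1."""
    a1, a2 = set(arcs1), set(arcs2)
    m = len(s2)
    # Try every increasing choice of m text positions (1-based).
    for positions in combinations(range(1, len(s1) + 1), m):
        # Condition 1: the chosen bases spell the pattern.
        if any(s1[p - 1] != s2[k] for k, p in enumerate(positions)):
            continue
        # Condition 2: arcs are preserved in both directions.
        if all(((i + 1, j + 1) in a2) == ((positions[i], positions[j]) in a1)
               for i in range(m) for j in range(i + 1, m)):
            return True
    return False


print(aps_bruteforce("ACGU", {(1, 4)}, "AU", {(1, 2)}))  # arc kept
print(aps_bruteforce("ACGU", {(1, 4)}, "AU", set()))     # arc missing in pattern
```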

APS(TYPE1, TYPE2)
• The difficulty of the APS problem depends on the structure of the arc annotations.
• APS(TYPE1, TYPE2)
  • TYPE1: arc structure of the text
  • TYPE2: arc structure of the pattern
• Example: APS(nested, chain)
  • The arc structure of the text is "nested".
  • The arc structure of the pattern is "chain".
• Arc structures, from loose limitation (high difficulty) to strict limitation (low difficulty): Crossing, Nested, Chain, Plain

Result of Kida[2005]
Kida: Faster Pattern Matching Algorithm for Arc-Annotated Sequences, Proc. of Federation over the Web, LNAI (to appear)
• The previous work on the APS problem:
  • J. Gramm, J. Guo, and R. Niedermeier. "Pattern matching for arc-annotated sequences." In Proc. 22nd FSTTCS, volume 2556 of LNCS, pages 182–193. Springer, 2002.
• The result of Kida[2005]:
  • Proposed an improved algorithm based on the Gramm-Guo-Niedermeier (GGN) algorithm.
    • However, the worst-case complexity is the same as that of GGN.
  • Corrected an error in the GGN algorithm.
    • The original GGN algorithm includes an error.
  • Implemented both algorithms and ran experiments.
    • The proposed algorithm runs 2 to 5 times faster than GGN.
• APS(nested, nested) is solved in O(nm).

Change with the text length n
• |A1| = 20% of n, m = 20, |A2| = 4

Change with the pattern length m
• |A2| = 20% of m, n = 1000, |A1| = 100

Take a breath
• Summary to here
  • Method for multi-byte code texts (Japanese texts)
    • Embedding the code automaton into the AC machine for synchronization
    • Combining the code automaton that outputs mask bit sequences with bit-parallel methods
  • Pattern matching in consideration of the structure of the text
    • Pattern matching for XML texts
    • Pattern matching for arc-annotated texts
～Trivia～
How to compute min(x, y) without conditional branching, when two integers x and y are represented as m-bit sequences:
  S ← ((x | 10^m) − y) & 10^m
  S ← S − (S ≫ m)
  min(x, y) ← (~S & x) | (S & y)
Here 10^m denotes the bit sequence consisting of a 1 followed by m zeros. (However, we need m+1 bits for each.)
[Photo: King Penguin flying in water (2005.8.12 in Asahiyama Zoo)]
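
The trivia can be checked exhaustively for small m. The sketch below follows the three assignments literally; the function name `branchless_min` is hypothetical, and Python's unbounded integers stand in for the (m+1)-bit words.

```python
def branchless_min(x: int, y: int, m: int) -> int:
    """min(x, y) for 0 <= x, y < 2**m without a conditional branch."""
    top = 1 << m                  # the constant written 10^m on the slide
    s = ((x | top) - y) & top     # the top bit survives iff x >= y
    s = s - (s >> m)              # m ones if x >= y, otherwise 0
    return (~s & x) | (s & y)     # select y if x >= y, otherwise x


m = 4
ok = all(branchless_min(x, y, m) == min(x, y)
         for x in range(1 << m) for y in range(1 << m))
print(ok)
```

Setting bit m of x before the subtraction guarantees that the difference never becomes negative, so the borrow out of the low m bits is visible as a cleared top bit.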

Example of pattern matching in consideration of taxonomic information (PMTX)
• Taxonomic information: Gene Ontology
[Figure: part of the Gene Ontology, with the concepts molecular function, catalytic activity, lyase activity, hyaluronate, cell, cell surface, cell envelope, cell wall, insoluble fraction, membrane fraction, vesicular fraction, microsome]
• Pattern P: (cell) (receptor) (for) (catalytic activity)
• Text T:
  Pub:1: Cell. 1990 Jun 29;61(7):1303-13.
  Title: CD44 is the principal cell surface receptor for hyaluronate.
  Authors: Aruffo A, Stamenkovic I, Melnick M, Underhill CB, Seed B.

Result of Kida&Arimura[2004]
T. Kida and H. Arimura: Pattern Matching with Taxonomic Information, Proc. of Asia Information Retrieval Symposium (AIRS2004), pp. 265-268, Oct. 2004.
• In general:
  • O(m+mh/w) time for preprocessing
  • O(m|∑|/w) space
  • O(mn/w) time for scanning the text
• It works well when m < w, since the bounds then become:
  • O(m+h) time for preprocessing
  • O(|∑|) space
  • O(n) time for scanning the text
• Notation:
  • m: the length of pattern P ∈ ∑*
  • n: the length of text T ∈ ∑*
  • h: the size of taxonomic information H
  • |∑|: the size of the set ∑ of concepts
  • w: the length of a machine word (say, 32 or 64)

Taxonomic information and sorted alphabet
• Sorted alphabet (∑, ≤)
  • ∑: a finite alphabet (a set of concepts)
  • ≤: a partial order relation
• We assume that a pattern and a text are given as sequences of concepts: P ∈ ∑* and T ∈ ∑*.
  • Pattern P := A B E F
  • Text T := A B C B D F C B
• Concept E corresponds to the character class [A, B, C, D, E].
[Figure: an example of a DAG H representing (∑, ≤), over the concepts A to G]
※ This is also called a Hasse diagram.

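Examples of sorted alphabet
• (1) flat alphabet: A, B, C, D, …, Z (no concept is above another)
• (2) class of characters: ? is above [0-9] and [a-z]; [0-9] is above 0, 1, 2, …, 9; [a-z] is above a, …, z
• (3) letter-sets alphabet: (abc) is above (ab), (bc), (ac); these are above (a), (b), (c); φ is at the bottom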

We can utilize the Shift-And method!
• Pattern P: ab[ab]bb
• Text T: ababbbba
• Mask table M:
     j:  1 2 3    4 5
     P:  a b [ab] b b
  M(a):  1 0 1    0 0
  M(b):  0 1 1    1 1
  The difference is just here: bit 3 is set in both M(a) and M(b).
• This is the same: R_i = ((R_{i-1} << 1) | 1) & M(T[i])
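
Only the construction of the mask table changes when character classes are allowed. The sketch below represents the pattern as a list of strings, each listing the characters allowed at that position; this representation and the function name `shift_and_classes` are assumptions made for the example.

```python
def shift_and_classes(text: str, pattern: list[str]) -> list[int]:
    """Shift-And where pattern[i] is the set of characters allowed at position i."""
    masks = {}
    for i, allowed in enumerate(pattern):
        for ch in allowed:                 # a class sets bit i in several masks
            masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (len(pattern) - 1)
    state, ends = 0, []
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)   # unchanged recurrence
        if state & accept:
            ends.append(pos)
    return ends


# ab[ab]bb on the text of the slide; results are 0-based end positions.
print(shift_and_classes("ababbbba", ["a", "b", "ab", "b", "b"]))
```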

Toward taxonomic information
• Taxonomic information H: [Figure: DAG over the concepts A to G]
• Pattern P := A B E F
• Text T := A B C B D F C B
• Mask table M' (one row of mask bits for each of A, B, C, D, E, F, G)
• Does computing M' take O(mh) time?

Computation of M'(a)
• Lemma 1: Let (∑, ≤) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑, it holds that
  M'(a) = ∪_{x ∈ Upb(a)} M(x).
• Lemma 2: Let (∑, ≤) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑, it holds that
  M'(a) = M(a) ∪ ∪_{x ∈ Par(a)} M'(x).
Here Upb(a) denotes the set of upper bounds of a, and Par(a) denotes the set of parents of a in H.

Pseudo code for computing M'(a)
Preprocess_M' (P = p1…pm)   /* Assume H is a global variable */
  initialize M(a) as follows:   … O(m)
    M(a) = {1 ≦ i ≦ m | P[i] = a};
  for each a ∈ ∑ do
    CalculateM'(a);
  end of for

Function CalculateM'(a)
  if M'(a) has been computed then return M'(a)
  else do
    M'(a) = M(a);
    for each x ∈ Par(a) do   … O(h) in total
      M'(a) = M'(a) ∪ (CalculateM'(x));   … O(m/w) for each union
    end of for
    return M'(a);

Total: O(m+mh/w)
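
The pseudo code maps naturally onto a memoized recursion over the parent lists. In the sketch below the taxonomy `PARENTS` is hypothetical (the DAG of the slides is not recoverable from this transcript), the sets M(a) and M'(a) are stored as Python integers used as bit vectors, and the function names are assumptions made for the example.

```python
def compute_taxonomic_masks(pattern: str, parents: dict) -> dict:
    """Compute M'(a) = M(a) | union of M'(x) over the parents x of a (Lemma 2)."""
    base = {}                                  # M(a): positions where P[i] = a
    for i, concept in enumerate(pattern):
        base[concept] = base.get(concept, 0) | (1 << i)

    memo = {}

    def calculate(a):
        if a not in memo:                      # "has been computed" check
            mask = base.get(a, 0)
            for x in parents.get(a, ()):
                mask |= calculate(x)           # one bit-vector union per edge
            memo[a] = mask
        return memo[a]

    for a in parents:
        calculate(a)
    return memo


def pmtx_search(text: str, pattern: str, parents: dict) -> list[int]:
    """Shift-And scan that uses M' instead of M; returns 0-based end positions."""
    masks = compute_taxonomic_masks(pattern, parents)
    accept = 1 << (len(pattern) - 1)
    state, ends = 0, []
    for pos, concept in enumerate(text):
        state = ((state << 1) | 1) & masks.get(concept, 0)
        if state & accept:
            ends.append(pos)
    return ends


# Hypothetical taxonomy: each concept lists its parents (more general concepts).
PARENTS = {"A": ["C"], "B": ["C", "D"], "C": ["E"], "D": ["E"],
           "E": ["G"], "F": ["G"], "G": []}

print(compute_taxonomic_masks("ABEF", PARENTS))
print(pmtx_search("CABDFG", "ABEF", PARENTS))
```

Thanks to the memo table, each edge of H triggers exactly one union, which is where the O(mh/w) term of the total comes from when a mask spans m/w machine words.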

Overview of retrieval system with PMTX algorithm
• We have to parse the text into a sequence of concepts.
• Translator: a replace automaton (Arikawa and Shiraishi [1984]), or a morphological parser for natural language texts such as ChaSen.
[Figure: the Text DB, the Pattern, and the Taxonomic information DB feed, through Translators, the pattern matching machine, which outputs the occurrences; annotated with O(h+n)]

The 7th summary
• Method for multi-byte code texts (Japanese texts)
  • Embedding the code automaton into the AC machine for synchronization
  • Combining the code automaton that outputs mask bit sequences with bit-parallel methods
• Toward an intelligent pattern matching:
  • Pattern matching in consideration of the structure of the text
    • Pattern matching for XML texts
    • Pattern matching for arc-annotated texts
  • Pattern matching in consideration of the meanings of the text (in cooperation with ontology data)
    • Pattern matching in consideration of taxonomic information
• Prof. Arimura will take charge of this class from the next lecture.
  • Efficient data structures for information retrieval
  • Data mining from the web, etc.

Karp-Rabin algorithm
KARP R.M., RABIN M.O., Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2):249-260, 1987.
• It is a randomized algorithm using a hashing technique.
• It matches a string by regarding it as an integer!
• The worst case takes O(mn) time, but the average case takes O(n+m) time.
• The extra space we need is only O(1).
• Example: ∑ = {0, 1, 2, …, 9}, values taken mod 13
  • Pattern: 3 1 4 1 5 → 31415 mod 13 = 7
  • Text: 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
  • Values of the windows of length 5, mod 13: 8 9 3 11 0 1 7 8 4 5 10 11 7 9 11
  • The window 3 1 4 1 5 has the value 7: Correct!
  • The window 6 7 3 9 9 also has the value 7: Wrong!
• Updating the value when the window slides from 3 1 4 1 5 to 1 4 1 5 2 (7 → 8):
  14152 ≡ (31415 − 3×10000)×10 + 2 (mod 13)
        ≡ (7 − 3×3)×10 + 2 (mod 13)
        ≡ 8 (mod 13)
  Here 3 is the highest digit in the previous step, and 2 is the lowest digit, which is newly input.

Pseudo code
Karp-Rabin (P, T, d, q)
  m ← length[P].
  n ← length[T].
  h ← d^(m−1) mod q.
  p ← 0.
  t_0 ← 0.
  for i ← 1 to m do
    p ← (d・p + P[i]) mod q;
    t_0 ← (d・t_0 + T[i]) mod q.
  for s ← 0 to n − m do
    if p = t_s then
      if P[1…m] = T[s+1…s+m] then   /* Check if the candidate is an occurrence */
        report an occurrence at s;
    if s < n − m then
      t_{s+1} ← (d・(t_s − T[s+1]・h) + T[s+m+1]) mod q.
Note: the hash value must be updated for every shift s < n − m, whether or not p = t_s.
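
The pseudo code can be run on the example of the previous slide. The sketch below takes the pattern and text as lists of digits with d = 10 and q = 13; returning the spurious hits separately is an addition made for illustration, and the function name `karp_rabin` is hypothetical.

```python
def karp_rabin(pattern: list[int], text: list[int], d: int = 10, q: int = 13):
    """Return (occurrences, spurious_hits) as lists of 0-based shifts."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)               # weight of the highest digit, mod q
    p = t = 0
    for i in range(m):                 # hash of the pattern and the first window
        p = (d * p + pattern[i]) % q
        t = (d * t + text[i]) % q

    occurrences, spurious = [], []
    for s in range(n - m + 1):
        if p == t:
            # Check if the candidate is really an occurrence.
            if pattern == text[s:s + m]:
                occurrences.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Drop the highest digit, shift, and append the new lowest digit.
            t = (d * (t - text[s] * h) + text[s + m]) % q
    return occurrences, spurious


P = [int(c) for c in "31415"]
T = [int(c) for c in "2359023141526739921"]
print(karp_rabin(P, T))
```

With q = 13 the window 67399 collides with the pattern, which is why the explicit comparison is needed; a larger modulus makes such spurious hits rarer.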

Randomized approximate pattern matching using FFT
K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa. A Note on Randomized Algorithm for String Matching with Mismatches. Nordic Journal of Computing, 10(1):2-12, 2003.
K. Baba (Kyushu Univ.)
• The Fast Fourier Transform (FFT) can be computed at high speed on hardware.
• They do (approximate) pattern matching by converting the strings into sequences of numbers and then computing the score vector at high speed by FFT.
• Example:
     i:  1 2 3 4 5 6 7 8 9 10
  T[i] = a c b a b b a c c b
  P = a b b a c
  Score vector: c_i = 3 1 1 5 2 0 (i = 1, …, 6), where c_i is the number of positions at which P agrees with T[i … i+4].
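
The score vector of the example can be reproduced with correlations computed by FFT. The sketch below is the plain deterministic variant that runs one correlation per pattern character; it is not the randomized algorithm of Baba et al., which maps characters to numbers so that fewer transforms are needed. The small recursive FFT is written out only to keep the example free of external libraries, and all function names are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (no scaling)."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out


def correlate(x, y):
    """c[i] = sum_j x[i+j] * y[j] for i = 0 .. len(x)-len(y), via FFT."""
    n, m = len(x), len(y)
    size = 1
    while size < n + m:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    fy = fft([complex(v) for v in reversed(y)] + [0j] * (size - m))
    conv = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # Reversing y turns the convolution into a correlation shifted by m-1.
    return [round(conv[i + m - 1].real / size) for i in range(n - m + 1)]


def score_vector(text: str, pattern: str) -> list[int]:
    """Number of agreeing positions for every alignment of pattern in text."""
    scores = [0] * (len(text) - len(pattern) + 1)
    for ch in set(pattern):
        x = [1 if c == ch else 0 for c in text]      # indicator of ch in text
        y = [1 if c == ch else 0 for c in pattern]   # indicator of ch in pattern
        for i, v in enumerate(correlate(x, y)):
            scores[i] += v
    return scores


print(score_vector("acbabbaccb", "abbac"))
```

An alignment with score m is an exact occurrence, and an alignment with score at least m − k is an occurrence with at most k mismatches.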
