This document provides an extensive overview of string matching algorithms, focusing on both exact and approximate matching techniques. It discusses various data structures such as suffix trees and suffix arrays, and details algorithms like Horspool, Wu-Manber, and Backward Oracle Matching, including their experimental efficiencies. The text delves into the complexities of handling multiple patterns and explores advanced concepts such as dynamic programming, sequence alignment, and probabilistic search methods using Hidden Markov Models.
Information Retrieval • Modern Information Retrieval (1999) • Ricardo Baeza-Yates and Berthier Ribeiro-Neto • Flexible Pattern Matching in Strings (2002) • Gonzalo Navarro and Mathieu Raffinot • http://www-igm.univ-mlv.fr/~lecroq/string/index.html Algorithms for: • Pattern search, exact and approximate (string matching and pattern matching) • Text indexing: suffix trees, suffix arrays
String Matching String matching: the definition of the problem (text, pattern) depends on what we have, the text or the patterns • Exact matching: • The patterns ---> Data structures for the patterns • 1 pattern ---> The algorithm depends on |p| and |Σ| • k patterns ---> The algorithm depends on k, |p| and |Σ| • Extensions • Regular expressions • The text ---> Data structure for the text (suffix tree, ...) • Approximate matching: • Dynamic programming • Sequence alignment (pairwise and multiple) • Sequence assembly: hash algorithm • Probabilistic search: Hidden Markov Models
Exact string matching: one pattern (text on-line) Experimental efficiency (Navarro & Raffinot) • BNDM: Backward Nondeterministic Dawg Matching • BOM: Backward Oracle Matching [Figure: map of the fastest algorithm (Horspool, BOM, BNDM) by alphabet size |Σ| (vertical axis, 2 to 64) and pattern length (horizontal axis, 2 to 256, with w marked on the axis)]
Multiple string matching [Figure: four panels (5, 10, 100 and 1000 strings) mapping the fastest algorithm (Wu-Manber, SBOM, Ad AC) by alphabet size |Σ| (vertical axis, 2 to 8) and lmin (horizontal axis, 5 to 45)]
Trie Construct the trie of GTATGTA,GTAT,TAATA,GTGTA
Trie Construct the trie of GTATGTA, GTAT, TAATA, GTGTA [Figure: the resulting trie] What is the cost?
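The trie construction can be sketched as follows. This is a minimal illustration, not the deck's own code: the nested-dict representation, the "$" end marker and the names `build_trie` and `count_nodes` are assumptions made for the example. Each symbol of each pattern costs one dictionary step, so building takes time proportional to the total length of the patterns (assuming constant-time child lookup).

```python
def build_trie(patterns):
    """Return a trie as nested dicts; the key "$" (assumed not to be a
    pattern symbol) stores the pattern that ends at that node."""
    root = {}
    for pattern in patterns:
        node = root
        for symbol in pattern:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(symbol, {})
        node["$"] = pattern
    return root


def count_nodes(node):
    """Count trie nodes, including the root and excluding "$" markers."""
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != "$")


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(count_nodes(trie))  # 16: the root plus 7 + 0 + 5 + 3 new nodes
```

GTAT adds no node because it is a prefix of GTATGTA, and GTGTA shares the path GT with it.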
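Set Horspool algorithm • How is the comparison made? By suffixes: the text window is read backwards and matched against the trie of all the reversed patterns • What is the next position of the window? Let a be the last character of the window: we shift until a is aligned with its nearest occurrence in the trie (below the first level), provided the shift is shorter than lmin; otherwise we shift by lmin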
Set Horspool algorithm Search for ATGTATG, TATG, ATAAT, ATGTG 1. Construct the trie of the reversed patterns GTATGTA, GTAT, TAATA and GTGTA [Figure: trie of the reversed patterns] 2. Determine lmin = 4 3. Determine the shift table: A 1, C 4 (lmin), G 2, T 1 4. Find the patterns
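Steps 2 and 3 can be reproduced with a short sketch. The function name `set_horspool_tables` and the explicit `alphabet` parameter are assumptions made for this example. The rule it applies is the one read off the slide's table: for each symbol, take the smallest distance from the end of any pattern, skipping the last symbol itself, and cap it at lmin.

```python
def set_horspool_tables(patterns, alphabet="ACGT"):
    """Return (lmin, shift) for a set of patterns.

    shift[c] is the smallest distance (at least 1) between an occurrence
    of c and the end of any pattern, or lmin if c does not occur within
    the last lmin symbols of any pattern.
    """
    lmin = min(len(p) for p in patterns)
    shift = {symbol: lmin for symbol in alphabet}
    for p in patterns:
        # Distance 0 is the last symbol of the pattern; it is skipped so
        # that every shift moves the window forward.
        for distance in range(1, lmin):
            symbol = p[len(p) - 1 - distance]
            shift[symbol] = min(shift.get(symbol, lmin), distance)
    return lmin, shift


lmin, shift = set_horspool_tables(["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(lmin, shift)
```

For the four example patterns this gives lmin = 4 and the table A 1, C 4, G 2, T 1.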
Set Horspool algorithm Search for ATGTATG, TATG, ATAAT, ATGTG [Figure: trie of the reversed patterns] Shift table: A 1, C 4 (lmin), G 2, T 1 text: ACATGCTATGTGACA…
Set Horspool algorithm The more patterns we search for, the shorter the shifts! Is the expected length of the shifts related to the number of patterns?
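The whole search can be sketched in a few lines, under the same assumptions as the slides: the window ends at `pos`, the text is read backwards through the trie of the reversed patterns, and the shift is looked up with the last character of the window. The function name, the nested-dict trie and the "$" marker (assumed not to occur in the text) are illustrative choices, not part of the original material.

```python
def set_horspool_search(text, patterns, alphabet="ACGT"):
    """Return (start, pattern) for every occurrence of a pattern in text."""
    lmin = min(len(p) for p in patterns)
    shift = {symbol: lmin for symbol in alphabet}
    trie = {}
    for p in patterns:
        # Shift table: nearest occurrence of each symbol before the end.
        for distance in range(1, lmin):
            symbol = p[-1 - distance]
            shift[symbol] = min(shift.get(symbol, lmin), distance)
        # Trie of the reversed pattern; "$" marks a complete pattern.
        node = trie
        for symbol in reversed(p):
            node = node.setdefault(symbol, {})
        node["$"] = p

    matches = []
    pos = lmin - 1                      # index of the last window symbol
    while pos < len(text):
        node, i = trie, pos
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            i -= 1
            if "$" in node:             # a pattern ends at pos
                matches.append((i + 1, node["$"]))
        pos += shift.get(text[pos], lmin)
    return matches


found = set_horspool_search("ACATGCTATGTGACA",
                            ["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(found)  # [(6, 'TATG'), (7, 'ATGTG')]
```

On this text the window ends visit positions 3, 4, 6, 7, 8, 9, 11 and 13: most shifts are 1 or 2, because A, G and T all occur close to the end of some pattern.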
From the Set Horspool algorithm to the Wu-Manber algorithm How can the length of the shifts be increased? By reading blocks of symbols instead of only one! Given ATGTATG, TATG, ATAAT, ATGTG: • Shift table for 1 symbol: A 1, C 4 (lmin), G 2, T 1 • Shift table for 2 symbols: AA 1, AT 1, GT 1, TA 2, TG 2; every other block (AC, AG, CA, CC, CG, …) 3 (lmin-L+1, where L is the block length)
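The block shift table can be checked with a small sketch. The function name `block_shift_table` is an assumption, and the rule is the one that reproduces the slide's values: for every block, take its smallest distance (at least 1) from the end of any pattern, with a default of lmin - L + 1 for blocks that do not occur closer than that.

```python
def block_shift_table(patterns, block_len=2):
    """Return (default_shift, shift) for blocks of block_len symbols.

    Blocks missing from the table get default_shift = lmin - block_len + 1.
    """
    lmin = min(len(p) for p in patterns)
    default_shift = lmin - block_len + 1
    shift = {}
    for p in patterns:
        for distance in range(1, default_shift):
            end = len(p) - distance          # block ends `distance` before the end
            block = p[end - block_len:end]
            shift[block] = min(shift.get(block, default_shift), distance)
    return default_shift, shift


default_shift, shift2 = block_shift_table(
    ["ATGTATG", "TATG", "ATAAT", "ATGTG"], block_len=2)
print(default_shift, sorted(shift2.items()))
```

With `block_len=1` the same function gives the Set Horspool entries (A 1, G 2, T 1, default 4), which shows that the one-symbol table is the special case L = 1.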
Wu-Manber algorithm Search for ATGTATG, TATG, ATAAT, ATGTG [Figure: trie of the reversed patterns] Shift table (2 symbols): AA 1, AT 1, GT 1, TA 2, TG 2 text: ACATGCTATGTGACATAATA
Wu-Manber algorithm But given k patterns, how many symbols should we take? … log_|Σ| (2 · lmin · k)
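A sketch of the block-based search on the example text follows. It uses the shift rule from the slides, but as a simplification it verifies candidates by direct string comparison instead of walking the trie. The names `wu_manber_search` and `suggested_block_length` are assumptions for this example, and the slide does not say how the logarithm should be rounded, so the helper returns the raw value.

```python
import math


def suggested_block_length(alphabet_size, lmin, k):
    """Evaluate log_|alphabet| (2 * lmin * k), left unrounded."""
    return math.log(2 * lmin * k, alphabet_size)


def wu_manber_search(text, patterns, block_len=2):
    """Return (start, pattern) for every occurrence; needs block_len <= lmin."""
    lmin = min(len(p) for p in patterns)
    default_shift = lmin - block_len + 1
    shift, by_last_block = {}, {}
    for p in patterns:
        # Candidates to verify when the window ends with this block.
        by_last_block.setdefault(p[-block_len:], []).append(p)
        for distance in range(1, default_shift):
            end = len(p) - distance
            block = p[end - block_len:end]
            shift[block] = min(shift.get(block, default_shift), distance)

    matches = []
    pos = lmin - 1                           # index of the last window symbol
    while pos < len(text):
        block = text[pos - block_len + 1:pos + 1]
        for p in by_last_block.get(block, []):
            start = pos - len(p) + 1
            if start >= 0 and text[start:pos + 1] == p:
                matches.append((start, p))
        pos += shift.get(block, default_shift)
    return matches


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
found = wu_manber_search("ACATGCTATGTGACATAATA", patterns)
print(found)  # [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
print(suggested_block_length(4, 4, len(patterns)))  # about 2.5
```

For the DNA alphabet with lmin = 4 and k = 4 the formula gives log_4 32 = 2.5, so blocks of 2 or 3 symbols are in the suggested range for this example.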
BOM algorithm (Backward Oracle Matching) • How is the comparison made? The window is read backwards with the Factor Oracle automaton, which checks whether the suffix read so far can be a factor of the pattern • What is the next position of the window? The position determined by the last character of the text that has a transition in the automaton
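For a single pattern, the idea can be sketched as below. The on-line factor oracle construction with a supply link per state is the standard one, but the function names and the list-of-dicts representation are assumptions made for this example. The oracle is built for the reversed pattern because the window is read right to left.

```python
def build_factor_oracle(word):
    """Return the transitions of the factor oracle of `word` as a list of
    dicts, one per state (state 0 is the initial state)."""
    transitions = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, symbol in enumerate(word, start=1):
        transitions[i - 1][symbol] = i          # internal transition
        k = supply[i - 1]
        while k > -1 and symbol not in transitions[k]:
            transitions[k][symbol] = i          # external transition
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][symbol]
    return transitions


def bom_search(text, pattern):
    """Return the start positions of pattern in text."""
    m = len(pattern)
    oracle = build_factor_oracle(pattern[::-1])
    matches = []
    pos = 0                                     # window start
    while pos <= len(text) - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = oracle[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            # All m symbols were read: the only word of length m that the
            # oracle accepts is the reversed pattern itself.
            matches.append(pos)
        pos += j + 1                            # restart after the failing symbol
    return matches


print(bom_search("ACATGCTATGTGACA", "ATGTG"))  # [7]
```

The oracle accepts every factor of its word but may also accept other strings: the oracle of GTATGTA accepts GTG, which is not a factor. This only makes some shifts shorter than they could be; it never causes a missed or false match.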
Factor Oracle of k strings How can we build the Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA? [Figure: the resulting oracle, with states labeled 1,4, 3 and 2]
Factor Oracle of k strings We start from the Factor Oracle of GTATGTA [Figure: the oracle of GTATGTA, built one state at a time; label 1]
Factor Oracle of k strings Given the Factor Oracle of GTATGTA, we insert GTAA [Figure: oracle after inserting GTAA; labels 1 and 2]
Factor Oracle of k strings Given the Factor Oracle of GTATGTA and GTAA, we insert TAATA [Figure: oracle after inserting TAATA; labels 1, 2 and 3]
Factor Oracle of k strings Given the Factor Oracle of GTATGTA, GTAA and TAATA, we insert GTGTA [Figure: oracle while inserting GTGTA]
Factor Oracle of k strings This is the Factor Oracle automaton of GTATGTA, GTAA, TAATA and GTGTA [Figure: the complete oracle, with states labeled 1,4, 3 and 2]
SBOM algorithm • How is the comparison made? The window is read backwards with the Factor Oracle automaton of the reversed patterns, truncated to length lmin, which checks whether the suffix read so far can be a factor of any pattern • What is the next position of the window? The position determined by the last character of the text that has a transition in the automaton
SBOM algorithm: example We search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG … then we build the Factor Oracle automaton of GTATG, GTAAT, TAATA and GTGTA, of length lmin=5 [Figure: the oracle, with states labeled 1, 4, 2 and 3]
SBOM algorithm: example Search for ATGTATG, TAATG, TAATAAT and AATGTG [Figure: the oracle, with states labeled 1, 4, 2 and 3] text: ACATGCTAGCTATAATAATGTATG