1 / 17

Recuperació de la informació

Recuperació de la informació. Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq

minnie
Télécharger la présentation

Recuperació de la informació

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recuperació de la informació • Modern Information Retrieval (1999) • Ricardo-Baeza Yates and Berthier Ribeiro-Neto • Flexible Pattern Matching in Strings (2002) • Gonzalo Navarro and Mathieu Raffinot • Algorithms on strings (2001) • M. Crochemore, C. Hancart and T. Lecroq • http://www-igm.univ-mlv.fr/~lecroq/string/index.html

  2. String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns • Exact matching: • The patterns ---> Data structures for the patterns • 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and || • Extensions • Regular Expressions • The text ----> Data structure for the text (suffix tree, ...) • Approximate matching: • Dynamic programming • Sequence alignment (pairwise and multiple) • Sequence assembly: hash algorithm • Probabilistic search: Hidden Markov Models

  3. String matching: one pattern Text : Pattern : How does the matching algorithms made the search? There is a sliding window along the text against which the pattern is compared: At each step the comparison is made and the window is shifted to the right. Which are the facts that differentiate the algorithms? • How the comparison is made. • The length of the shift.

  4. Alg. Cerca exacta d’un patró (text on-line) Algorismes més eficients (Navarro & Raffinot) BNDM : Backward Nondeterministic Dawg Matching | | BOM : Backward Oracle Matching 64 32 16 Horspool 8 BOM BNDM 4 Long. patró 2 w 2 4 8 16 32 64 128 256

  5. Automaton Given a pattern p, we find for an automaton A that verifies the properties: A is acyclic. A recognizes at least the factors of p. A has the fewer states as possible. A has a linear number of transitions according to the length of p. … and at the end of the last century …

  6. Automaton Factor Oracle: properties G T A G T T A G T A G T A G T T A G T A This state recognizes all the factors that ends in the fourth letter that have not been accepted before: GTAT, TAT, AT (note that T had been recognized before). All the factors of the first four letters have been recognized. Given the word G T A T G T A All states are accepting states ==> Recognize all the factors …. and more Hip: recognize all factors of GTA

  7. Automaton Factor Oracle: algorithm ? Algorithm: for i=1 to p do Add those transitions that recognize the new factors that end in letter i;

  8. Automaton Factor Oracle: algorithm T T What happens if the transition is in the automaton?

  9. Automaton Factor Oracle: algorithm T T What happens if transition isn’t in the automaton?

  10. Automaton Factor Oracle and BOM. G T A G T T A G T A This automaton recognizes words that are not factors of GTATGTA like GTGTA => the affirmative answer is not informative, …. but The negative answer ==> the word isn’t a factor! Is the strategy of the BOM algorithm.

  11. Algorithm BOM (Backward Oracle Matching) • How the comparison is made? Text : Pattern : Autòmaton Factor Oracle of the reverse pattern Check if the suffix is a factor of the pattern. • Which is the next position of the window? a • If some letter is not found • If the pattern is found:

  12. BOM algorithm G T A G T T A G T A • And the search phase : G T A C T A G A A T G T G T A G A C A T G T A T G G T G A... • How the comparison is made? • Given the pattern ATGTATG we construct the automaton of the reverse pattern: A T G T A T G

  13. BOM algorithm G T A G T T A G T A • And the search phase : G T A C T A G A A T G T G T A G A C A T G T A T G G T G A T G T A T G • How the comparison is made? • Given the pattern ATGTATG we construct the automaton of the reverse pattern A T G T A T G

  14. BOM algorithm G T A G T T A G T A • And the search phase : G T A C T A G A A T G T G T A G A C A T G T A T G G T G A T G T A T G A T G T A T G • How the comparison is made? • Given the pattern ATGTATG we construct the automaton of the reverse pattern A T G T A T G

  15. BOM algorithm G T A G T T A G T A • And the search phase : G T A C T A G A A T G T G T A G A C A T G T A T G G T G A T G T A T G A T G T A T G A T G T A T G • How the comparison is made? • Given the pattern ATGTATG we construct the automaton of the reverse pattern A T G T A T G

  16. BOM algorithm G T A G T T A G T A • And the search phase : G T A C T A G A A T G T G T A G A C A T G T A T G G T G ... A T G T A T G A T G T A T G A T G T A T G A T G T A T G • How the comparison is made? • Given the pattern ATGTATG we construct the automaton of the reverse pattern A T G T A T G

  17. BOM algorithm G T A G T T A G T A • And the search phase : G T A C T A G A A T G T G T A G A C A T G T A T G G T G ... A T G T A T G A T G T A T G A T G T A T G A T G T A T G A T G T A T G • How the comparison is made? • Given the pattern ATGTATG we construct the automaton of the reverse pattern A T G T A T G

More Related