ISP 433/633 Week 4

  1. ISP 433/633 Week 4: Text operations, indexing, and search

  2. Document Process Steps

  3. Example Collection
     Documents
     D1: It is a dog eat dog world!
     D2: While the world sleeps.
     D3: Let sleeping dogs lie.
     D4: I will eat my hat.
     D5: My dog wears a hat.

  4. Step 1: Parse Text Into Words
     • Break at spaces and punctuation
     D1: IT IS A DOG EAT DOG WORLD
     D2: WHILE THE WORLD SLEEPS
     D3: LET SLEEPING DOGS LIE
     D4: I WILL EAT MY HAT
     D5: MY DOG WEARS A HAT
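
A minimal sketch of Step 1 in Python, assuming that "words" are maximal runs of ASCII letters and that tokens are upper-cased as on the slide. The function name `tokenize` is illustrative and not part of the course material.

```python
import re

# Example collection from the slides.
DOCS = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}

def tokenize(text):
    """Break text at spaces and punctuation; return upper-cased words."""
    # Assumption: a word is a maximal run of ASCII letters.
    return [word.upper() for word in re.findall(r"[A-Za-z]+", text)]

if __name__ == "__main__":
    for doc_id, text in DOCS.items():
        print(doc_id + ":", " ".join(tokenize(text)))
```

A real tokenizer also has to decide how to treat digits, hyphens, and apostrophes; this sketch simply drops them.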

  5. Step 2: Stop Words Elimination
     • Remove non-distinguishing words
     • Pronouns, … prepositions, … articles, ... to Be, to Have, to Do
     • I, MY, IT, YOUR, … OF, BY, ON, … A, THE, THIS, …, IS, HAS, WILL, …
     D1: DOG EAT DOG WORLD
     D2: WORLD SLEEPS
     D3: LET SLEEPING DOGS LIE
     D4: EAT HAT
     D5: DOG WEARS HAT
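
Stop word elimination can be sketched as a set lookup. The stop list below contains only the words named on the slide, plus WHILE, which is an assumption added here because the slide drops it from D2. A real stop list would be much longer.

```python
# Stop words named on the slide; WHILE is an assumed addition (see lead-in).
STOP_WORDS = {
    "I", "MY", "IT", "YOUR",      # pronouns
    "OF", "BY", "ON",             # prepositions
    "A", "THE", "THIS",           # articles and determiners
    "IS", "HAS", "WILL",          # forms of to be / to have / auxiliaries
    "WHILE",                      # assumption: not listed on the slide
}

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list."""
    return [t for t in tokens if t.upper() not in STOP_WORDS]

if __name__ == "__main__":
    print(remove_stop_words("MY DOG WEARS A HAT".split()))
```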

  6. Stop Words List
     • The 250-300 most common words in English account for 50% or more of a given text.
     • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” account for another 10%. The next 12 words account for another 10%.
     • Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens).
     • The top 65 types cover 1132 tokens (> 50%).
     • Token/type ratio: 2256/859 ≈ 2.63
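
The type and token counts above can be computed for any token list. The sketch below is illustrative: the helper name `type_token_stats` is hypothetical, and it runs on the small example collection rather than on Moby Dick.

```python
from collections import Counter

def type_token_stats(tokens, top_k):
    """Return (types, tokens, token/type ratio, share of tokens covered by the top_k types)."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    covered = sum(c for _, c in counts.most_common(top_k))
    return n_types, n_tokens, n_tokens / n_types, covered / n_tokens

if __name__ == "__main__":
    sample = ("IT IS A DOG EAT DOG WORLD WHILE THE WORLD SLEEPS "
              "LET SLEEPING DOGS LIE I WILL EAT MY HAT MY DOG WEARS A HAT").split()
    print(type_token_stats(sample, top_k=5))
    # The slide's Moby Dick figure is the same division:
    print(round(2256 / 859, 2))
```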

  7. Step 3: Stemming
     • Goal: “normalize” similar words
     D1: DOG EAT DOG WORLD
     D2: WORLD SLEEP
     D3: LET SLEEP DOG LIE
     D4: EAT HAT
     D5: DOG WEAR HAT

  8. Stemming and Morphological Analysis
     Morphology (“form” of words)
     • Inflectional Morphology
       • E.g., inflect verb endings and noun number
       • Never changes grammatical class
       • dog, dogs
     • Derivational Morphology
       • Derives one word from another
       • Often changes grammatical class
       • build, building; health, healthy

  9. Simple “S” stemming
     • IF a word ends in “ies”, but not “eies” or “aies”
       • THEN “ies” -> “y”
     • IF a word ends in “es”, but not “aes”, “ees”, or “oes”
       • THEN “es” -> “e”
     • IF a word ends in “s”, but not “us” or “ss”
       • THEN “s” -> NULL
     Harman, JASIS 1991
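
The three rules can be sketched directly in Python. This sketch assumes the rules are tried in order and that the first rule whose full condition holds (suffix present, no excluded ending) is the only one applied. The slide does not spell this out, so treat the ordering as an assumption; the name `s_stem` is illustrative.

```python
def s_stem(word):
    """Apply the simple "S" stemming rules to a single word."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

if __name__ == "__main__":
    for word in ["dogs", "ponies", "makes", "glass", "corpus"]:
        print(word, "->", s_stem(word))
```

Note that this stemmer only handles plural-like endings. It maps SLEEPS to SLEEP but leaves SLEEPING unchanged, so it does not by itself reproduce the Step 3 output for D3.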

  10. Porter’s Algorithm
     • An effective, simple, and popular English stemmer
     • Official URL: http://www.tartarus.org/~martin/PorterStemmer/
     • A demo: http://snowball.tartarus.org/demo.php

  11. Porter’s Algorithm
     1. The measure, m, of a stem counts its vowel-consonant sequences. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
        m=0: free, why
        m=1: frees, whose
        m=2: prologue, compute
     2. *<X> - stem ends with letter X
     3. *v* - stem contains a vowel
     4. *d - stem ends in a double consonant
     5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
     Porter, Program 1980
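
The measure m can be computed by counting vowel-to-consonant transitions. The sketch below follows Porter's convention that "y" counts as a vowel when it follows a consonant, which is why "why" has m=0. The function names are illustrative, and the code covers only the measure, not the full stemmer.

```python
def is_consonant(word, i):
    """True if word[i] acts as a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC repeats m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m

if __name__ == "__main__":
    for w in ["free", "why", "frees", "whose", "prologue", "compute"]:
        print(w, measure(w))
```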

  12. Porter’s Algorithm
     • Suffix conditions take the form current_suffix == pattern
     • Actions are in the form old_suffix -> new_suffix
     • Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

     STEP  CONDITION  SUFFIX  REPLACEMENT  EXAMPLE
     1a    NULL       sses    ss           stresses -> stress
     1b    *v*        ing     NULL         making -> mak
     1b1   NULL       at      ate          inflat(ed) -> inflate
     1c    *v*        y       i            happy -> happi
     2     m>0        aliti   al           formaliti -> formal
     3     m>0        icate   ic           duplicate -> duplic
     4     m>1        able    NULL         adjustable -> adjust
     5a    m>1        e       NULL         inflate -> inflat

  13. Problems of Porter’s Algorithm
     • Unreadable results
     • Does not handle some irregular verbs and adjectives
       • Take/took
       • Bad/worse
     • Possible errors:

  14. Step 4: Indexing
     • Inverted Files

     Vocabulary  Occurrences
     DOG         D1 D3 D5
     EAT         D1 D4
     HAT         D4 D5
     LET         D3
     LIE         D3
     SLEEP       D2 D3
     WEAR        D5
     WORLD       D1 D2
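
A minimal sketch of building this inverted file in Python from the stemmed documents of Step 3, reading D3 as LET SLEEP DOG LIE. The names `build_inverted_index` and `search_and` are illustrative.

```python
from collections import defaultdict

# Stemmed documents from Step 3.
STEMMED_DOCS = {
    "D1": ["DOG", "EAT", "DOG", "WORLD"],
    "D2": ["WORLD", "SLEEP"],
    "D3": ["LET", "SLEEP", "DOG", "LIE"],
    "D4": ["EAT", "HAT"],
    "D5": ["DOG", "WEAR", "HAT"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            postings = index[term]
            # Documents are visited in order, so a repeat is always the last entry.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(sorted(index.items()))

def search_and(index, terms):
    """Return the documents that contain every query term."""
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

if __name__ == "__main__":
    index = build_inverted_index(STEMMED_DOCS)
    for term, postings in index.items():
        print(f"{term:6} {' '.join(postings)}")
    print(search_and(index, ["DOG", "HAT"]))
```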

  15. Inverted Files
     • Occurrences can point to
       • Documents
       • Positions in a document
       • Weight
     • Most commonly used indexing method
     • Based on words
       • Queries such as phrases are expensive to solve
       • Some data does not have words
         • Genetic data
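
When occurrences point to positions rather than whole documents, phrase queries become possible, at the cost of checking positions for every candidate document. The sketch below is one simple way to do this, not an optimized one. It uses the stemmed example collection, and the function names are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> document -> list of word positions (0-based)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return the documents in which the terms of `phrase` occur consecutively."""
    if not phrase:
        return []
    first, rest = phrase[0], phrase[1:]
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        for p in positions:
            # Every following term must appear at the next position in the same document.
            if all(p + k in index.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(rest, start=1)):
                hits.append(doc_id)
                break
    return sorted(hits)

if __name__ == "__main__":
    docs = {
        "D1": ["DOG", "EAT", "DOG", "WORLD"],
        "D2": ["WORLD", "SLEEP"],
        "D3": ["LET", "SLEEP", "DOG", "LIE"],
        "D4": ["EAT", "HAT"],
        "D5": ["DOG", "WEAR", "HAT"],
    }
    index = build_positional_index(docs)
    print(phrase_search(index, ["EAT", "DOG"]))
```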

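  16. Suffix Trees
     Text with character positions (1-based):
     1234567890123456789012345678901234567890123456789012345678901234567
     This is a text. A text has many words. Words are made from letters.
     [Figure: Patricia tree over the suffixes that start at the indexed words. Each path ends in the start position of its suffix: l -> 60; m-a-d -> 50; m-a-n -> 28; t-e-x-t-‘ ‘ -> 19; t-e-x-t-. -> 11; w-o-r-d-s-‘ ‘ -> 40; w-o-r-d-s-. -> 33]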

  17. Text Compression
     • Represent text in fewer bits
     • Symbols to be compressed are words
     • Method of choice
       • Huffman coding

  18. Huffman Coding
     • Developed by David Huffman (1952)
     • Average of 5 bits per character
     • Based on frequency distributions of symbols
     • Idea: assign shorter codes to more frequent symbols
     • Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols
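
The tree-building idea can be sketched with a priority queue. The code below is an illustrative word-based version, not the exact tree used in the slides: ties between equal frequencies are broken by insertion order, so a different but equally good code may result. Symbols are assumed to be strings, and the function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Build a prefix code {symbol: bit string} from {symbol: frequency}."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

def decode(bits, codes):
    """Decode a bit string by reading it left to right; relies on the prefix property."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

if __name__ == "__main__":
    words = "dog eat dog world world sleep let sleep dog lie eat hat dog wear hat".split()
    codes = huffman_code(Counter(words))
    bits = "".join(codes[w] for w in words)
    print(codes)
    print(decode(bits, codes) == words)
```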

  19. An Example
     [Figure: Huffman code tree over the symbols a to j, with branches labeled 0 and 1]

  20. Example Coding
     [Figure: Huffman code tree over the symbols a to j, with branches labeled 0 and 1]

  21. Exercise
     • Consider the bit string:
       011011011110001001100011101001110001101011010111
     • Use the Huffman code from the example to decode it.
     • Try inserting, deleting, and switching some bits at random locations, then try decoding again.

  22. Huffman Code
     • Prefix property
       • No word in the code is a prefix of any other word in the code
     • Random access
       • Decompress starting from anywhere
     • Not the fastest

  23. Sequential string searching
     • Boyer-Moore algorithm
     • Example: search for “cats” in “the catalog of all cats”
     • Some preprocessing is needed.
     • Demos: http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
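
The sketch below runs the slide's example with the Horspool simplification of Boyer-Moore, which keeps only a bad-character shift table. It is a swapped-in simpler variant, not the full Boyer-Moore algorithm with its good-suffix rule. The preprocessing step is the construction of the shift table; the function name is illustrative.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Preprocessing: for each character of the pattern except the last one,
    # record how far its rightmost occurrence is from the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        # Compare the pattern against the current window from right to left.
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i
        # Shift by the table entry for the window's last character;
        # characters not in the table allow a full shift of m.
        i += shift.get(text[i + m - 1], m)
    return -1

if __name__ == "__main__":
    print(horspool_search("the catalog of all cats", "cats"))
```

On this example the search skips ahead by up to four characters at a time, so most characters of the text are never compared against the pattern.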
