
Phrase Hierarchy Inference

Phrase Hierarchy Inference. Gordon Paynter (UC Riverside), Craig Nevill-Manning (Google), Ian Witten (University of Waikato). Outline: overlapping vs non-overlapping phrases, memory-based algorithm, suffix trees, suffix arrays, multipass algorithm.


Presentation Transcript


  1. Phrase Hierarchy Inference
     Gordon Paynter, UC Riverside
     Craig Nevill-Manning, Google
     Ian Witten, University of Waikato

  2. Outline
     • Overlapping vs non-overlapping phrases
     • Memory-based algorithm
     • Suffix trees
     • Suffix arrays
     • Multipass algorithm

  3. Non-overlapping phrases
     • Given a text, parse it into a tree of repeated phrases
     • Advantage
       • Based on existing data compression algorithms
     • Disadvantage
       • Sometimes arbitrary association of words
     Example: In the beginning, God created the heaven and the earth

  4. Overlapping Phrases
     • Instead, count all repeating phrases, even if two phrases overlap
     • Limit phrase length to, say, ten words
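As a rough illustration of counting overlapping phrases, the sketch below counts every phrase of up to ten words and keeps those that repeat. The function name `count_repeated_phrases`, the parameter `max_len`, and the sample sentence are assumptions made for this example, and everything is held in memory, which the later slides try to avoid.

```python
from collections import Counter

def count_repeated_phrases(words, max_len=10):
    """Count every phrase of 1..max_len words; keep those seen more than once."""
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break  # phrase would run past the end of the text
            counts[tuple(words[start:start + length])] += 1
    return {phrase: n for phrase, n in counts.items() if n > 1}

# Hypothetical sample text, not taken from the slides.
text = "the old man and the sea and the old man".split()
repeated = count_repeated_phrases(text)
print(repeated[("the", "old", "man")])
```

Overlapping phrases such as "the old" and "old man" are both counted, which is the difference from the non-overlapping parse.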

  5. Memory-based Algorithm
     • For each word w:
       • Everywhere that word occurs, consider the phrase formed by the word plus the word to its left (aw)
       • Similarly for the word to its right (wa)
       • If the phrase is always preceded or followed by the same word, extend the phrase
       • If the phrase begins or ends with a stopword, extend the phrase
       • Add all the extended phrases to the list of expansions for w
     • For each phrase p:
       • …
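The "always preceded or followed by the same word" rule can be sketched as below. This is only one reading of that single rule: the stopword rule and the per-phrase step are left out, the helper names `occurrences` and `extend_phrase` are invented for this example, and occurrences are found by a plain linear scan rather than any index.

```python
def occurrences(words, phrase):
    """Start offsets of every occurrence of phrase (a list of words)."""
    n = len(phrase)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase]

def extend_phrase(words, phrase):
    """Grow phrase while every occurrence has the same right or left neighbour."""
    phrase = list(phrase)
    while True:
        starts = occurrences(words, phrase)
        n = len(phrase)
        # None stands for "no neighbour" at either end of the text.
        right = {words[i + n] if i + n < len(words) else None for i in starts}
        if len(right) == 1 and None not in right:
            phrase.append(right.pop())
            continue
        left = {words[i - 1] if i > 0 else None for i in starts}
        if len(left) == 1 and None not in left:
            phrase.insert(0, left.pop())
            continue
        return phrase

# Hypothetical sample text.
words = "gone with the wind and gone with the sea".split()
print(extend_phrase(words, ["with"]))
```

Each call to `occurrences` rescans the whole text, so the cost of finding neighbours dominates.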

  6. Memory-based Algorithm
     • Problem:
       • How to efficiently find the words to the right and left of every occurrence of a word or a phrase?
     • Solution:
       • Suffix trees

  7. Suffix Tree
     • A compacted trie of suffixes
     • Trie: a tree containing a set of strings
     Example: she sells sea shells on the sea shore
     [Figure: trie of the example's words, one letter per edge]

  8. Suffix Tree
     • Compacted trie: no nodes with only one child
     [Figure: the same trie after compaction; edge labels: s, h, e, lls, ore, e, lls, a, on, the]

  9. Suffix Tree
     • Compacted trie of all suffixes (· marks a leading space)
     she sells sea shells on the sea shore
     he sells sea shells on the sea shore
     e sells sea shells on the sea shore
     ·sells sea shells on the sea shore
     sells sea shells on the sea shore
     ells sea shells on the sea shore
     lls sea shells on the sea shore
     ls sea shells on the sea shore
     s sea shells on the sea shore
     ·sea shells on the sea shore
     sea shells on the sea shore
     …

  10. Two Surprising Facts
      • Even though there are O(n²) characters in all the suffixes:
        • Suffix trees consume O(n) space
        • Suffix trees take O(n) time to compute
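The quadratic figure follows from summing the suffix lengths: a text of n characters has one suffix of each length from 1 to n. As a worked instance, the example sentence "she sells sea shells on the sea shore" has 37 characters counting spaces.

```latex
\sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
\qquad \text{e.g. } n = 37:\ \frac{37 \cdot 38}{2} = 703 .
```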

  11. Suffix Tree
      • How does the suffix tree help us?
        • Build a suffix tree of words (instead of single letters)
        • For any word, the words to the right are its children in the tree
        • Compaction means that the longest unique sequence is already computed
        • For words to the left, build a suffix tree for the reversed sequence
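To make the "children in the tree" idea concrete, here is a small word-level sketch. It is not a true suffix tree: it builds an uncompacted, depth-limited trie of nested dicts in quadratic time, whereas a real suffix tree is compacted and built in linear time. The names `build_word_suffix_trie`, `words_to_the_right`, and `max_depth` are assumptions for this example.

```python
def build_word_suffix_trie(words, max_depth=3):
    """Insert the first max_depth words of every suffix into a nested-dict trie."""
    root = {}
    for start in range(len(words)):
        node = root
        for word in words[start:start + max_depth]:
            node = node.setdefault(word, {})
    return root

def words_to_the_right(trie, phrase):
    """Children of the node reached by phrase, i.e. every word that follows it."""
    node = trie
    for word in phrase:
        node = node.get(word)
        if node is None:
            return []  # phrase does not occur in the text
    return sorted(node)

words = "she sells sea shells on the sea shore".split()
forward = build_word_suffix_trie(words)
backward = build_word_suffix_trie(words[::-1])  # reversed text gives left neighbours

print(words_to_the_right(forward, ["sea"]))   # words that follow "sea"
print(words_to_the_right(backward, ["sea"]))  # words that precede "sea"
```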

  12. Suffix Array
      • Sorted list of suffixes
      ·sea·shells·on·the·sea·shore
      ·sells·sea·shells·on·the·sea·shore
      e·sells·sea·shells·on·the·sea·shore
      ells·sea·shells·on·the·sea·shore
      he·sells·sea·shells·on·the·sea·shore
      lls·sea·shells·on·the·sea·shore
      ls·sea·shells·on·the·sea·shore
      s·sea·shells·on·the·sea·shore
      sea·shells·on·the·sea·shore
      sells·sea·shells·on·the·sea·shore
      she·sells·sea·shells·on·the·sea·shore

  13. Suffix Array
      • Advantages
        • Simple: 10 lines of code
        • Space efficient: one array of pointers
      • Disadvantages
        • More expensive to create: O(n log n)
        • More expensive to operate on (linear scans instead of following an edge)
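A suffix array really can be written in a few lines. The sketch below is a naive version: sorting by full suffix slices is simple but slower than the O(n log n) quoted on the slide, which needs a dedicated construction algorithm. The function names are assumptions for this example.

```python
def build_suffix_array(text):
    """Suffix start offsets, sorted by the suffix each one points to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search for the first matching suffix, then scan linearly."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text.startswith(pattern, sa[lo]):
        hits.append(sa[lo])  # matching suffixes are adjacent in the array
        lo += 1
    return sorted(hits)

text = "she sells sea shells on the sea shore"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "sea"))
```

The final loop is the linear scan the slide lists as a disadvantage: all occurrences sit next to each other in the array, but they have to be walked one by one.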

  14. Multi-pass Algorithm
      • Disk seeks dominate
        • minimize disk seeks
        • fit within available memory
      • Disk reads are cheap, seeks are expensive
      • Make multiple passes over the data, using as little memory as possible

  15. Three Phases
      • Phase 1: count all single words, two-word phrases, three-word phrases…
      • Phase 2: make expansion lists for each phrase
      • Phase 3: delete uninteresting phrases

  16. Phase 1: Count Phrases
      • Make one pass over the data, counting individual words
        • Write out all words that appear more than once
      • Make a second pass over the data, counting pairs of words, where both words appear more than once
        • Write out all pairs that appear more than once
      • Make a third pass over the data, counting triples of words, where both overlapping pairs appear more than once
        • Write out all triples that appear more than once
      • …
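The passes above might be sketched as follows. The sketch assumes a caller-supplied `read_words` callable that returns a fresh iterator over the text on each call (standing in for re-reading the file sequentially), and it keeps the per-length results in a dict instead of writing them out. Both are simplifications for illustration, and the sample text is hypothetical.

```python
from collections import Counter

def count_phrases_multipass(read_words, max_len=3):
    """Return {n: counts of n-word phrases seen more than once}, one pass per n."""
    frequent = {}
    for n in range(1, max_len + 1):
        counts = Counter()
        window = []
        for word in read_words():  # one sequential pass over the data
            window = (window + [word])[-n:]
            if len(window) < n:
                continue
            gram = tuple(window)
            # Count only if both overlapping (n-1)-word sub-phrases were frequent.
            if n > 1 and (gram[:-1] not in frequent[n - 1]
                          or gram[1:] not in frequent[n - 1]):
                continue
            counts[gram] += 1
        frequent[n] = {g: c for g, c in counts.items() if c > 1}
        if not frequent[n]:
            break  # no longer phrase can be frequent
    return frequent

words = "the old man and the sea and the old man".split()
result = count_phrases_multipass(lambda: iter(words))
print(result[3])
```

Because a pass only counts candidates whose shorter parts already repeated, the counter for each pass stays much smaller than a count of all n-word phrases would be.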

  17. Phase 1: Output
      words: and 31, Gone 2, man 4, old 12, sea 8, the 57, Wind 3, with 17
      pairs of words: and the 25, Gone with 2, man and 3, old man 2, The old 5, the sea 3, the Wind 2, with the 13
      triples of words: and the sea 3, Gone with the 2, man and the 2, old man and 2, The old man 2, with the Wind 2

  18. Phase 2: Make Expansion Lists
      • Read all pairs of words that appear more than once (from phase 1)
        • Insert each pair in the list for each word
      • Read all frequent triples
        • Insert each triple in the list for each overlapping pair
      • …
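One way to read this step is sketched below, using the pair and triple counts shown on the Phase 1 output slide as input. The function name `make_expansion_lists` and the dict-of-lists representation are assumptions for this example. Note that matching is case-sensitive here, so "The" and "the" are treated as different words.

```python
from collections import defaultdict

def make_expansion_lists(frequent):
    """Map each phrase to the frequent phrases one word longer that contain it."""
    expansions = defaultdict(list)
    for n in sorted(frequent):
        if n == 1:
            continue
        for gram in sorted(frequent[n]):
            # The two overlapping (n-1)-word sub-phrases; for pairs, the two words.
            for sub in {gram[:-1], gram[1:]}:
                expansions[sub].append(gram)
    return dict(expansions)

# Counts copied from the Phase 1 output slide.
frequent = {
    2: {("and", "the"): 25, ("Gone", "with"): 2, ("man", "and"): 3,
        ("old", "man"): 2, ("The", "old"): 5, ("the", "sea"): 3,
        ("the", "Wind"): 2, ("with", "the"): 13},
    3: {("and", "the", "sea"): 3, ("Gone", "with", "the"): 2,
        ("man", "and", "the"): 2, ("old", "man", "and"): 2,
        ("The", "old", "man"): 2, ("with", "the", "Wind"): 2},
}
expansions = make_expansion_lists(frequent)
print(expansions[("and", "the")])
```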

  19. Phase 2: Output
      words: and 31, Gone 2, man 4, old 12, sea 8, the 57, Wind 3, with 17
      pairs of words: and the 25, Gone with 2, man and 3, old man 2, The old 5, the sea 3, the Wind 2, with the 13
      triples of words: and the sea 3, Gone with the 2, man and the 2, old man and 2, The old man 2, with the Wind 2
      …

  20. Phase 3
      • Delete each phrase in the hierarchy if
        • it begins or ends in a stopword (“man and”)
        • it occurs in a particular longer phrase more than 75% of the time (“theoretical computer”)
      • Pointers to that phrase now point to that phrase’s expansions
      • Process is recursive
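The deletion rules could be sketched like this. The sketch interprets "occurs in a particular longer phrase more than 75% of the time" as count(longer) / count(phrase) > 0.75, which is an assumption. The names `should_delete` and `resolve`, the stopword set, the phrase "theoretical computer science", and all counts below are hypothetical and chosen only to exercise both rules.

```python
def should_delete(phrase, counts, expansions, stopwords, threshold=0.75):
    """Apply the two deletion rules to one phrase (a tuple of words)."""
    if phrase[0].lower() in stopwords or phrase[-1].lower() in stopwords:
        return True
    return any(counts[longer] / counts[phrase] > threshold
               for longer in expansions.get(phrase, []))

def resolve(phrase, counts, expansions, stopwords):
    """Surviving phrases that a pointer to `phrase` should lead to (recursive)."""
    if not should_delete(phrase, counts, expansions, stopwords):
        return [phrase]
    survivors = []
    for longer in expansions.get(phrase, []):
        for kept in resolve(longer, counts, expansions, stopwords):
            if kept not in survivors:
                survivors.append(kept)
    return survivors

# Hypothetical data, not taken from the slides.
stopwords = {"and", "the", "with"}
counts = {("theoretical", "computer"): 8,
          ("theoretical", "computer", "science"): 7}
expansions = {("theoretical", "computer"): [("theoretical", "computer", "science")]}

print(resolve(("theoretical", "computer"), counts, expansions, stopwords))
print(resolve(("man", "and"), counts, expansions, stopwords))
```

With these made-up counts, 7 of the 8 occurrences of "theoretical computer" fall inside the longer phrase, so a pointer to it is redirected to the longer phrase, while "man and" is dropped because it ends in a stopword and has no surviving expansion.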

  21. Phase 3: Output
      words: and 31, Gone 2, man 4, old 12, sea 8, the 57, Wind 3, with 17
      pairs of words: and the 25, Gone with 2, man and 3, old man 2, The old 5, the sea 3, the Wind 2, with the 13
      triples of words: and the sea 3, Gone with the 2, man and the 2, old man and 2, The old man 2, with the Wind 2
      …
