N-gram Search Engine on Wikipedia

N-gram Search Engine on Wikipedia Satoshi Sekine (NYU) Kapil Dalwani (JHU)

Hammer: Fast and multi-functional n-gram search engine
• Search ngram: FAST
• INPUT: token, POS, chunk, NE
• OUTPUT: frequency to text ngrams
Lexical Knowledge from Ngrams

Characteristics
• Search up to 7-grams with wildcards
• Multi-level input
  • Token, POS, chunk, NE, combinations
  • NOT, OR for POS, chunk, NE
• Multi-level output
  • Token, POS, chunk, NE
  • Document information
  • Original sentences, KWIC, ngram
• Display
  • Show the results in the order of frequency
• Running environment
  • Single CPU, PC-Linux, 400MB process, 500GB disk

Demo
http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

Available for you
• Web system
  • At NYU: http://nlp.cs.nyu.edu/nsearch
  • At JHU?
• USB hard drive

Implementation: Overview
• Input: search request
• Steps
  • 1. Search candidates
  • 2. Filtering
  • 3. Display
• Data
  • Suffix array for text
  • N-gram data
  • Inverted index for n-gram data
  • Wikipedia text
  • POS, chunk, NE for N-gram data
  • Wikipedia POS, chunk, NE

Implementation: Overview, step 1: Search candidates

From n-gram to Inverted Index
• Example: 3-grams
• Posting list
  • A pos=1
  • A pos=2
  • B pos=1
  • B pos=2
  • B pos=3
  • C pos=3
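The slide's posting-list keys can be reproduced with a small sketch. It assumes the index is keyed by a (token, position) pair and that an n-gram ID is simply the n-gram's index in a list; the function name `build_inverted_index` and the toy 3-grams are illustrative and not taken from Hammer.

```python
from collections import defaultdict


def build_inverted_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs (positions are 1-based)."""
    index = defaultdict(list)
    for ngram_id, ngram in enumerate(ngrams):
        for pos, token in enumerate(ngram, start=1):
            # IDs are appended in increasing order, so each list stays sorted.
            index[(token, pos)].append(ngram_id)
    return dict(index)


# Toy 3-grams (assumed data, for illustration only).
ngrams = [("A", "B", "C"), ("A", "A", "B"), ("B", "A", "C")]
index = build_inverted_index(ngrams)

for key in sorted(index):
    print(key, index[key])
```

With these three toy 3-grams the index has exactly the six keys listed on the slide: A at positions 1 and 2, B at positions 1, 2 and 3, and C at position 3.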

Posting list
• Wide variation of posting list size (in 7-gram: 1.27B)
  • “#EOS#” (100,906,888), “,” (55,644,989), “the” (33,762,672)
  • conscipcuous, consiety, Mizuk, (1)
• 3 types for faster speed and smaller index size
  • Bitmap (freq > 1%): #EOS# 1.27B bits (bitmap) <-> 3.2B bits (list)
  • List of ngramID
  • Encoded into pointer (freq = 1)
• [Diagram: C pos=3, C pos=3, 5]
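The bitmap-versus-list figures can be checked with a short calculation. This is a sketch under two assumptions that are not stated on the slide: n-gram IDs take 32 bits each in a list, and "freq > 1%" means the posting list covers more than 1% of all 7-grams. The names `choose_encoding`, `list_bits` and `bitmap_bits` are hypothetical.

```python
ID_BITS = 32  # assumption: each n-gram ID in a list costs 32 bits


def bitmap_bits(total_ngrams):
    """A bitmap needs one bit per n-gram, regardless of posting list size."""
    return total_ngrams


def list_bits(posting_size, id_bits=ID_BITS):
    """An explicit list needs one ID per posting."""
    return posting_size * id_bits


def choose_encoding(posting_size, total_ngrams, bitmap_threshold=0.01):
    """Pick one of the three posting-list types named on the slide."""
    if posting_size == 1:
        return "pointer"  # the single ID is stored in place of a list pointer
    if posting_size / total_ngrams > bitmap_threshold:
        return "bitmap"
    return "list"


TOTAL_7GRAMS = 1_270_000_000  # 1.27B, rounded as on the slide
EOS_POSTINGS = 100_906_888    # "#EOS#" posting list size from the slide

print(choose_encoding(EOS_POSTINGS, TOTAL_7GRAMS))
print(bitmap_bits(TOTAL_7GRAMS), "bits as bitmap")
print(list_bits(EOS_POSTINGS), "bits as list")
```

Under the 32-bit assumption the “#EOS#” list costs 100,906,888 × 32 = 3,229,020,416 bits, which agrees with the 3.2B bits quoted on the slide, against 1.27B bits for the bitmap.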

Search
• Given an n-gram request (A B C)
• Get posting lists for A, B and C
• Search intersections of posting lists
• Use “look ahead” to speed up the search
  • Look ahead size = sqrt(size of posting list)
  • Moffat and Zobel (1996)
• [Diagram label: SKIP]
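The intersection with look-ahead might look roughly like the sketch below. It is not Hammer's code: it assumes posting lists are sorted Python lists of n-gram IDs, uses a block size of sqrt(list length) as the look-ahead, and finishes each probe with a binary search inside the block. The function names are illustrative.

```python
import math
from bisect import bisect_left


def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, skipping through `long` in sqrt-sized blocks."""
    step = max(1, math.isqrt(len(long)))  # look-ahead size = sqrt(list size)
    result = []
    lo = 0
    for target in short:
        # Look ahead: jump a whole block while its far end is still <= target.
        while lo + step < len(long) and long[lo + step] <= target:
            lo += step
        # The target, if present, now lies inside long[lo:hi].
        hi = min(lo + step, len(long))
        i = bisect_left(long, target, lo, hi)
        if i < hi and long[i] == target:
            result.append(target)
        lo = i  # everything before i is smaller than the next target
    return result


def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_skips(result, other)
    return result


a = [3, 5, 9]
b = [1, 2, 3, 4, 5, 6, 7, 8, 10]
c = [0, 3, 5, 11]
print(intersect_with_skips(a, b))
print(intersect_all([a, b, c]))
```

Starting from the shortest list keeps the number of probes small, which matters given how widely posting list sizes vary.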

Implementation: Overview, step 2: Filtering

Filtering
• Not all candidate ngramIDs match the request
• We need frequency and sentence information for the matched n-grams
• POS, chunk and NE information is represented as IDs
  • Reduces the index by more than 200GB
• [Diagram: A B Freq=123; NN PERSON VB Freq=10; LOC Freq=5]
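One way to picture the filtering step is sketched below. The data layout is an assumption made for illustration: each candidate n-gram ID maps to a list of (tag sequence, frequency) variants, and a request gives one constraint per position, either `None`, `("or", tags)` or `("not", tags)`. None of these names come from Hammer.

```python
def tag_ok(tag, constraint):
    """Check one tag against None, ("or", {tags}) or ("not", {tags})."""
    if constraint is None:
        return True
    mode, tags = constraint
    return (tag in tags) if mode == "or" else (tag not in tags)


def filter_candidates(candidates, annotations, constraints):
    """Return {ngram_id: summed frequency of the variants that satisfy the request}."""
    matched = {}
    for ngram_id in candidates:
        freq = sum(
            variant_freq
            for tags, variant_freq in annotations.get(ngram_id, [])
            if len(tags) == len(constraints)
            and all(tag_ok(t, c) for t, c in zip(tags, constraints))
        )
        if freq > 0:
            matched[ngram_id] = freq
    return matched


# Hypothetical POS variants for two candidate 2-grams.
annotations = {
    0: [(("NN", "VB"), 10), (("NN", "NN"), 5)],
    1: [(("DT", "NN"), 7)],
}
# Request: first token is NN, second token is NOT NN.
request = [("or", {"NN"}), ("not", {"NN"})]
print(filter_candidates([0, 1], annotations, request))
```

Candidate 1 is dropped because none of its variants satisfy the request, and candidate 0 keeps only the frequency of its matching variant.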

Implementation: Overview, step 3: Display

Display
• N-grams are displayed in descending order of frequency
  • N-gram IDs are ordered by frequency
• Sentences are searched using the suffix array
• POS, chunk, NE are displayed with sentence, KWIC, ngram
• Doc ID and Wikipedia title (and possibly other document features) are displayed with sentences and KWIC
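Looking up the occurrences of an n-gram through a suffix array can be sketched as follows. This is a toy, token-level version that assumes the whole text fits in memory as a list of tokens; sorting suffixes by slicing is fine for a demo but far too slow for Wikipedia-scale text. The function names are illustrative.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted by the token sequence they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, query):
    """Return sorted start positions where `query` (a token list) occurs."""
    query = list(query)
    n = len(query)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    # Lower bound: first suffix whose n-token prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose n-token prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)
for start in find_occurrences(tokens, sa, ["the", "cat"]):
    # Simple KWIC-style line: the match plus one token of right context.
    print(start, " ".join(tokens[start:start + 3]))
```

Because all suffixes that start with the query sit in one contiguous range of the sorted array, two binary searches are enough to find every occurrence.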

Size of data
8 GB
Text
• 1.7 G words
• 200M sentences
• 2.4M articles
Ngram
  n    count
  1    8M
  2    93M
  3    377M
  4    733M
  5    1.00B
  6    1.17B
  7    1.27B
Total 530GB
[Diagram labels, in extracted order: Suffix array for text; 260 GB; N-gram data; 108 GB; Inverted index for n-gram data; 8 GB; Wikipedia text; 100 GB; POS, chunk, NE for N-gram data; 6 GB; Wikipedia POS, chunk, NE; 40 GB; Others]

Future Work
• Other information (e.g., parse, coref, relation, genre, discourse…)
• Longer n-grams
• Compress index, dictionary
• Ease the indexing load
• Now we need a big-memory machine
• Distributed indexing
• Union operation for tokens

Available for you
• Web demo
  • At NYU: http://nlp.cs.nyu.edu/nsearch
  • At JHU?
• USB hard drive
