Inverted index, compressing the inverted index, and computing scores in a complete search system
Chintan Mistry, Mrugank Dalal
Indexing in Search Engine
• Documents → linguistic preprocessing → normalized terms → inverted index
• User query → look up the documents that contain the terms in the already-built inverted index → rank the returned documents according to their relevance → results
Forward index
• What is an INVERTED INDEX? First look at the FORWARD INDEX, which maps each document to the words it contains (documents → words)
• Querying the forward index requires iterating sequentially through every document and every word in it to find the matching documents
• Too much time, memory, and resources required!
What is an inverted index?
• As opposed to the forward index, it stores the list of documents for each word
• The list for a word is its postings list; each entry in that list is one posting
• Gives direct access to the set of documents containing the word
How to build an inverted index? (1/3)
• Build the index in advance:
1. Collect the documents
2. Turn each document into a list of tokens
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4. Index the documents in which each term occurs: the terms form the dictionary, and the documents listed under each term are its postings
How to build an inverted index? (2/3)
• Given two documents:
• Document 1: This is first document. Microsoft’s products are office, visio, and sql server.
• Document 2: This is second document. Google’s services are gmail, google labs and google code.
How to build an inverted index? (3/3)
• Sort-based indexing:
1. Sort the terms alphabetically
2. Group instances of the same term by term and then by documentID
3. Separate the terms from the documentIDs, giving the dictionary and the postings lists
• Reduces the storage requirement
• The dictionary is commonly kept in memory while the postings lists are kept on disk
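The sort-based steps can be sketched on the two example documents. This is a minimal illustration, not the presenters' code: `tokenize` and `build_inverted_index` are hypothetical names, and the normalization (lowercasing, keeping alphabetic runs) is an assumed stand-in for real linguistic preprocessing.

```python
import re
from itertools import groupby

def tokenize(text):
    # Simplified normalization (an assumption): lowercase, keep alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # Collect (term, docID) pairs from every document.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Sort by term, then by docID; set() drops repeated pairs first.
    pairs = sorted(set(pairs))
    # Group by term so each dictionary entry points to one postings list.
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(pairs, key=lambda p: p[0])}

docs = {
    1: "This is first document. Microsoft's products are office, visio, and sql server",
    2: "This is second document. Google's services are gmail, google labs and google code.",
}
index = build_inverted_index(docs)
print(index["document"])  # [1, 2]
print(index["google"])    # [2]
```

Here the whole index lives in one Python dict; in the setting described on the slide, only the dictionary keys would stay in memory and the postings lists would be written to disk.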
Blocked sort-based indexing
• Use termID instead of term
• When main memory is insufficient to hold all termID-docID pairs, we need an external sorting algorithm that uses disk
• Segment the collection into parts of equal size
• Sort and group the termID-docID pairs of each part in memory
• Store the intermediate results on disk
• Merge all intermediate results into the final index
• Running time: O(T log T), where T is the number of termID-docID pairs
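A rough sketch of the blocked approach, under the simplifying assumption that sorted in-memory lists stand in for the intermediate files on disk. The function name `bsbi` and the `block_size` parameter are illustrative, not taken from the slides.

```python
import heapq
from itertools import groupby

def bsbi(pairs, block_size):
    # Split the (termID, docID) pairs into equal-size blocks and sort each block.
    # Each sorted block stands in for one intermediate file on disk.
    blocks = [sorted(pairs[i:i + block_size]) for i in range(0, len(pairs), block_size)]
    # Merge all sorted blocks in a single pass, then group postings by termID.
    merged = heapq.merge(*blocks)
    index = {}
    for term_id, group in groupby(merged, key=lambda p: p[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:  # skip repeated docIDs
                postings.append(doc_id)
        index[term_id] = postings
    return index

pairs = [(2, 1), (1, 1), (3, 2), (1, 2), (2, 3), (1, 1)]
print(bsbi(pairs, block_size=2))  # {1: [1, 2], 2: [1, 3], 3: [2]}
```

`heapq.merge` reads its inputs lazily, which mirrors how the final merge only needs a small buffer per block rather than the whole collection in memory.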
Single-pass in-memory indexing
• SPIMI uses terms instead of termIDs
• Assume we have a stream of term-docID pairs
• Tokens are processed one by one; when a term occurs for the first time, it is added to the dictionary and a new postings list is created
• Writes each block’s dictionary to disk, and then starts a new dictionary for the next block
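The SPIMI loop might look roughly like this. It is a sketch under assumptions: `max_postings` is a hypothetical stand-in for "memory is full", yielding a block stands in for writing it to disk, and docIDs are assumed to arrive in non-decreasing order.

```python
def spimi_blocks(token_stream, max_postings):
    # Yield one sorted block dictionary each time the (hypothetical) memory budget is hit.
    dictionary, used = {}, 0
    for term, doc_id in token_stream:
        # The first occurrence of a term creates its postings list.
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            used += 1
        if used >= max_postings:
            yield dict(sorted(dictionary.items()))  # stands in for "write block to disk"
            dictionary, used = {}, 0                # start a new dictionary
    if dictionary:
        yield dict(sorted(dictionary.items()))

stream = [("car", 1), ("auto", 1), ("car", 2), ("best", 2), ("auto", 3)]
blocks = list(spimi_blocks(stream, max_postings=3))
print(blocks)  # [{'auto': [1], 'car': [1, 2]}, {'auto': [3], 'best': [2]}]
```

Terms are sorted only when a block is flushed, so postings are appended directly to their list without a global sort of term-docID pairs.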
Distributed Indexing (1/4)
• For web-scale collections, index construction cannot be performed on a single computer, so web search engines use distributed indexing algorithms
• Partition the work across several machines
• Use the MapReduce architecture:
• A general architecture for distributed computing
• Divides the work into chunks that are easy to assign and reassign
• Consists of a map phase and a reduce phase
Distributed Indexing (3/4)
• MAP PHASE:
• Map the splits of the input data to key-value pairs
• The machines that do this are called parsers
• Each parser writes its output to local segment files
• REDUCE PHASE:
• Partition the keys into j term partitions
• The parsers write the key-value pairs for each term partition into a separate segment file, one per term partition
Distributed Indexing (4/4)
• REDUCE PHASE (cont.):
• Collecting all values (docIDs) for a given key (termID) into one list is the task of the inverters
• The master assigns each term partition to a different inverter
• Finally, the list of values is sorted for each key and written to the final sorted postings list
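A single-process sketch of the parser and inverter roles. It only illustrates the data flow: there is no master or real distribution, terms are used as keys instead of termIDs, and `partition_of` is a made-up partitioning rule that assumes lowercase a-z terms.

```python
from collections import defaultdict

def partition_of(term, num_partitions):
    # Hypothetical partitioning rule: by first letter, so the example is deterministic.
    return (ord(term[0]) - ord("a")) % num_partitions

def map_phase(split, num_partitions):
    # A parser turns one split into (term, docID) pairs, bucketed by term partition.
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term, num_partitions)].append((term, doc_id))
    return segments

def reduce_phase(segment_files):
    # An inverter collects all docIDs for each term in its partition and sorts them.
    postings = defaultdict(set)
    for segment in segment_files:
        for term, doc_id in segment:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

splits = [[(1, "caesar came"), (2, "caesar died")], [(3, "brutus came")]]
parsed = [map_phase(split, num_partitions=2) for split in splits]
index = {}
for p in range(2):  # one inverter per term partition
    index.update(reduce_phase([segs[p] for segs in parsed if p in segs]))
print(index["caesar"], index["came"])  # [1, 2] [1, 3]
```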
Dynamic indexing
• Motivation: so far we have assumed a static collection of documents; what if a document is added, updated, or deleted?
• Maintain two indexes: main and auxiliary
• The auxiliary index is kept in memory; searches are run across both indexes and the results are merged
• When the auxiliary index becomes too large, merge it into the main index
• Deleted documents can be filtered out before returning the results
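The main/auxiliary scheme can be sketched as a small class. Everything here is illustrative: the class name, the `max_aux_postings` merge threshold, and keeping the main index in a dict rather than on disk are all assumptions.

```python
class DynamicIndex:
    def __init__(self, max_aux_postings=4):
        self.main, self.aux = {}, {}
        self.deleted = set()
        self.max_aux_postings = max_aux_postings  # hypothetical merge threshold

    def add(self, doc_id, terms):
        # New documents go into the in-memory auxiliary index.
        for term in set(terms):
            self.aux.setdefault(term, set()).add(doc_id)
        if sum(len(ids) for ids in self.aux.values()) > self.max_aux_postings:
            self.merge()

    def delete(self, doc_id):
        # Deletions are only recorded; the documents are filtered at query time.
        self.deleted.add(doc_id)

    def merge(self):
        # Fold the auxiliary index into the main index and start a fresh one.
        for term, ids in self.aux.items():
            self.main.setdefault(term, set()).update(ids)
        self.aux = {}

    def search(self, term):
        # Search both indexes, merge the results, and drop deleted documents.
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(hits - self.deleted)

idx = DynamicIndex(max_aux_postings=2)
idx.add(1, ["car", "auto"])
idx.add(2, ["car"])   # pushes the auxiliary index over the threshold, so it is merged
idx.add(3, ["car"])
idx.delete(2)
print(idx.search("car"))  # [1, 3]
```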
Querying distributed indexes (1/2)
• Partition by terms:
• Partition the dictionary of index terms into subsets, each stored on a node along with the postings lists of those terms
• A query is routed only to the nodes that hold its terms, which allows greater concurrency
• But long postings lists must be sent between nodes for merging; this cost is very high and outweighs the greater concurrency
• Partition by documents:
• Each node contains the index for a subset of all documents
• The query is distributed to all nodes, then the results are merged
Querying distributed indexes (2/2)
• Partition by documents (cont.):
• Problem: idf must be calculated over the entire collection, even though the index at a single node contains only a subset of the documents
• The query is broadcast to every node, and the top k results from each node are merged to find the top k documents for the query
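The top-k merge for a document-partitioned index could be sketched as follows. It assumes each node has already scored its own documents using collection-wide idf values; the function names and the scores are made up for illustration.

```python
import heapq

def node_top_k(local_scores, k):
    # Each node returns its k best (docID, score) pairs.
    return heapq.nlargest(k, local_scores.items(), key=lambda item: item[1])

def global_top_k(nodes, k):
    # The broker merges the per-node results and keeps the overall top k.
    candidates = [hit for node in nodes for hit in node_top_k(node, k)]
    return heapq.nlargest(k, candidates, key=lambda item: item[1])

# Made-up per-node scores for one query.
nodes = [{"d1": 0.9, "d2": 0.2}, {"d3": 0.7, "d4": 0.8}, {"d5": 0.1}]
print(global_top_k(nodes, k=2))  # [('d1', 0.9), ('d4', 0.8)]
```

Asking every node for k results is enough: any document in the global top k is necessarily in the top k of the node that holds it.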
Index compression (1/8)
• Compression techniques for the dictionary and the postings lists
• Advantages:
• Less disk space
• Better use of caching: frequently used terms can be cached in memory for faster processing, and compression allows more terms to be stored in memory
• Faster data transfer from disk to memory: given an efficient decompression algorithm, transferring compressed data from disk and decompressing it takes less total time than transferring the uncompressed data
Index compression (2/8)
• Dictionary compression:
• The dictionary is small compared to the postings lists, so why compress it?
• Because when a large part of the dictionary (think of millions of terms!) is on disk, many more disk seeks are necessary
• The goal is to fit the dictionary into memory for fast response times
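Index compression (3/8)
• 1. Dictionary as an array:
• Can be stored in an array of fixed-width entries
• E.g. with 400,000 terms in the dictionary (20 bytes for the term, 4 for its document frequency, 4 for the pointer to its postings list):
• 400,000 * (20 + 4 + 4) bytes = 11.2 MB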
Index compression (4/8)
• Any problem in storing the dictionary as an array?
• 1. The average length of a term in English is about eight characters, so on average we waste 12 of the 20 characters
• 2. No way of storing terms longer than 20 characters, such as hydrochlorofluorocarbons
• SOLUTION?
• 2. Dictionary as a string:
• Store the terms as one long string of characters
• A term pointer marks the end of the preceding term and the beginning of the next
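Index compression (5/8)
• 2. Dictionary as a string (cont.):
• 400,000 * (4 + 4 + 3 + 8) bytes = 7.6 MB (compared to 11.2 MB earlier), counting 4 bytes each for the frequency and the postings pointer, 3 for the term pointer, and 8 on average for the term itself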
Index compression (6/8)
• 3. Blocked storage:
• Group the terms in the string into blocks of size k and keep a term pointer only for the first term of each block
• With k = 4, we save (k - 1) * 3 = 9 bytes of term pointers per block
• But we need an additional k = 4 bytes per block to store the term lengths
• Net saving: 400,000 * (1/4) * 5 bytes = 0.5 MB, giving 7.1 MB (compared to 7.6 MB)
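The size estimates for the three dictionary layouts can be checked with a few lines of arithmetic. The reading of the byte counts (20 bytes per fixed-width term, 4 for the frequency, 4 for the postings pointer, 3 for a term pointer, 8 for an average term, 1 length byte per term in a block) is an assumption based on the figures on the slides.

```python
TERMS = 400_000
MB = 1_000_000

# Fixed-width array: term + document frequency + postings pointer.
fixed_width = TERMS * (20 + 4 + 4)
# One long string: frequency + postings pointer + term pointer + average term length.
as_string = TERMS * (4 + 4 + 3 + 8)
# Blocked storage: per block of k terms, drop k-1 term pointers but add k length bytes.
k = 4
saved_per_block = (k - 1) * 3 - k
blocked = as_string - (TERMS // k) * saved_per_block

print(fixed_width / MB, as_string / MB, blocked / MB)  # 11.2 7.6 7.1
```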
Index compression (7/8)
• 4. Blocked storage with front coding:
• Consecutive terms in the sorted dictionary share common prefixes, which are stored only once
• According to an experiment conducted by the author: size reduced to 5.9 MB (compared to 7.1 MB)
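To make the common-prefix idea concrete, here is a simplified sketch of front coding for one block. It is a variant chosen for brevity, not necessarily the exact scheme behind the 5.9 MB figure: every later term is stored as the length of the prefix it shares with the block's first term plus the remaining suffix.

```python
import os

def front_code(block):
    # Store the first term whole, then each later term as (shared prefix length, suffix).
    first = block[0]
    encoded = [(0, first)]
    for term in block[1:]:
        shared = len(os.path.commonprefix([first, term]))
        encoded.append((shared, term[shared:]))
    return encoded

def front_decode(encoded):
    # Rebuild each term from the first term's prefix plus its own suffix.
    first = encoded[0][1]
    return [first[:shared] + suffix for shared, suffix in encoded]

block = ["automata", "automate", "automatic", "automation"]
coded = front_code(block)
print(coded)  # [(0, 'automata'), (7, 'e'), (7, 'ic'), (7, 'ion')]
```

The four terms take 35 characters in full but only 14 characters of term text after coding, plus one small integer per term.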
Index compression (8/8)
• Postings file compression:
• By encoding gaps: the gaps between consecutive docIDs in a postings list are smaller numbers than the docIDs themselves, so we store the gaps rather than the docIDs
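Gap encoding itself is a small transformation, sketched below with made-up docIDs. The sketch only converts between docIDs and gaps; a real index would additionally store the gaps with a variable-length code so that small numbers take fewer bits.

```python
from itertools import accumulate

def to_gaps(postings):
    # Keep the first docID, then store each difference to the previous docID.
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    # A running sum restores the original docIDs.
    return list(accumulate(gaps))

postings = [283042, 283043, 283044, 283045]  # made-up sorted docIDs
gaps = to_gaps(postings)
print(gaps)  # [283042, 1, 1, 1]
```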
Review: Scoring, term weighting
• Metadata: information about a document
• Metadata generally consists of “fields”
• E.g. date of creation, authors, title, etc.
• Zone: similar to a field
• Difference: the content of a zone is arbitrary free text
• E.g. abstract, overview
Review: Scoring, term weighting
• Term frequency ($\mathrm{tf}_{t,d}$): number of occurrences of term t in document d
• Problem: raw counts grow with document size => inappropriate ranking
• Document frequency ($\mathrm{df}_t$): number of documents in the collection that contain term t
• Inverse document frequency ($\mathrm{idf}_t$): $\mathrm{idf}_t = \log(N / \mathrm{df}_t)$, where N = total number of documents
• Significance of idf:
• If low, it is a common term (e.g. a stop word)
• If high, it is a rare word (e.g. apothecary)
Review: Scoring, term weighting
• Tf-idf weighting: $\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
• High: when the term occurs many times within a small number of docs
• Lower: when it occurs fewer times in a doc or occurs in many docs
• Lowest: when the term is in almost all documents
• Score of a document: $\mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$
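The scoring formula can be tried on a toy collection. This is a sketch under assumptions: the slides do not specify the base of the logarithm, so base 10 is used here, and the documents are already tokenized lists of made-up terms.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    n = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # df_t: number of documents that contain term t.
    df = Counter(term for counts in tf.values() for term in counts)
    scores = {}
    for doc_id, counts in tf.items():
        scores[doc_id] = sum(
            counts[t] * math.log10(n / df[t])   # tf_{t,d} * idf_t (log base 10 assumed)
            for t in query_terms if df[t] > 0   # skip terms absent from the collection
        )
    return scores

docs = {1: ["the", "car", "car"], 2: ["the", "auto"], 3: ["the"]}
scores = tf_idf_scores(["the", "car"], docs)
print(scores)
```

Because "the" occurs in all three documents, its idf is log(3/3) = 0 and it contributes nothing; only document 1, which contains "car" twice, gets a non-zero score of 2 * log10(3).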
Inexact top K document retrieval
• Motivation: reduce the cost of calculating scores for all N documents
• We calculate scores ONLY for documents whose scores are likely to be high w.r.t. the given query
• How:
• Find a set A of contender documents, where K < |A| << N
• Return the K top-scoring docs from A
Index Elimination
• Preset idf threshold:
• Only traverse the postings of query terms with high idf
• Benefit: the postings lists of low-idf terms are long, so skipping them removes many documents from score computation
• Require many query terms:
• Only score documents that contain many (or all) of the query terms
• Danger: we may end up with fewer than K docs
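Both elimination heuristics can be sketched in a few lines. The function names, the threshold, and the idf values and postings below are all made up for illustration.

```python
from collections import Counter

def high_idf_terms(query_terms, idf, threshold):
    # Keep only the query terms whose idf meets the preset threshold.
    return [t for t in query_terms if idf.get(t, 0.0) >= threshold]

def docs_with_many_terms(postings, query_terms, min_terms):
    # Count how many distinct query terms each document contains.
    counts = Counter(d for t in set(query_terms) for d in set(postings.get(t, [])))
    return sorted(d for d, c in counts.items() if c >= min_terms)

idf = {"the": 0.01, "catcher": 2.5, "rye": 3.0}  # made-up values
postings = {"the": [1, 2, 3, 4], "catcher": [1, 2], "rye": [2, 3]}
terms = high_idf_terms(["the", "catcher", "rye"], idf, threshold=1.0)
print(terms)                                     # ['catcher', 'rye']
print(docs_with_many_terms(postings, terms, 2))  # [2]
```

With `min_terms=2` only document 2 survives, which shows the danger noted on the slide: the contender set can shrink below K.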
Champion lists
• Champion list = fancy list = top docs
• For each term t in the dictionary, a precomputed set of the r documents with the highest weights for t
• How to create set A:
• Take the union of the champion lists of the terms in the query
• Compute scores only for the docs in this union
• How and when to decide r:
• Highly application dependent
• Create the lists at the time of indexing the documents
• Problem: ????????
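A minimal sketch of building champion lists and forming the contender set A from them. The weights are made-up numbers and the function names are illustrative.

```python
def build_champion_lists(weights, r):
    # weights: term -> {docID: weight}; keep the r highest-weight docs per term.
    return {term: sorted(docs, key=docs.get, reverse=True)[:r]
            for term, docs in weights.items()}

def candidate_set(champions, query_terms):
    # Set A is the union of the champion lists of the query terms.
    result = set()
    for term in query_terms:
        result.update(champions.get(term, []))
    return result

weights = {"car": {1: 0.9, 2: 0.1, 3: 0.5}, "auto": {2: 0.8, 4: 0.7, 1: 0.2}}
champions = build_champion_lists(weights, r=2)
print(champions)                                  # {'car': [1, 3], 'auto': [2, 4]}
print(candidate_set(champions, ["car", "auto"]))  # {1, 2, 3, 4}
```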
Static quality scores and ordering
• In many search engines we have:
• A measure of quality g(d) for each document
• The net score is calculated as a combination of g(d) and the tf-idf score
• How to achieve this:
• Order each postings list by decreasing g(d)
• So we need to traverse only the first few documents in each list
• Global champion list:
• Choose the r documents with the highest values of g(d) + tf-idf
Cluster pruning (1/2)
• We cluster the documents in a preprocessing step
• Pick √N documents: call them ‘leaders’
• For each document that is not a leader, we compute its nearest leader
• Followers: docs that are not leaders
• Each leader has approximately √N followers
Cluster pruning (2/2)
• How does it help:
• Given a query q, find the leader L nearest to q
• i.e. calculate scores for only the √N leaders
• Set A contains leader L together with its roughly √N followers
• i.e. calculate scores for only about √N more docs
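Cluster pruning can be sketched on tiny two-dimensional vectors. Several things here are assumptions: cosine similarity as the nearness measure, picking the leaders by random sampling in `pick_leaders`, and the toy vectors. The usage example passes the leaders explicitly so that the result is reproducible.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_leaders(vectors, seed=0):
    # Assumed selection rule: sample round(sqrt(N)) documents at random.
    return random.Random(seed).sample(sorted(vectors), round(math.sqrt(len(vectors))))

def build_clusters(vectors, leaders):
    # Attach every non-leader document (follower) to its nearest leader.
    clusters = {leader: [leader] for leader in leaders}
    for doc_id, vec in vectors.items():
        if doc_id not in clusters:
            nearest = max(leaders, key=lambda l: cosine(vec, vectors[l]))
            clusters[nearest].append(doc_id)
    return clusters

def search(query, vectors, clusters, k):
    # Compare the query with the leaders only, then score just that leader's cluster.
    leader = max(clusters, key=lambda l: cosine(query, vectors[l]))
    candidates = clusters[leader]  # this is set A
    return sorted(candidates, key=lambda d: cosine(query, vectors[d]), reverse=True)[:k]

vectors = {1: (1.0, 0.0), 2: (0.9, 0.1), 3: (0.0, 1.0), 4: (0.1, 0.9)}
clusters = build_clusters(vectors, leaders=[1, 3])
print(clusters)                                     # {1: [1, 2], 3: [3, 4]}
print(search((1.0, 0.2), vectors, clusters, k=2))   # [2, 1]
```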
Tiered indexes
• [Figure: a two-tier index over the terms auto, car, and best and the documents Doc 1 to Doc 4; Tier 1 uses a preset threshold value of 20, Tier 2 a preset threshold value of 10]
• Addresses the issue of the contender set A containing fewer than K documents
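The fallback between tiers can be sketched as follows. The tier contents are hypothetical and do not reproduce the figure; the sketch assumes tiers are ordered from the highest threshold to the lowest and that a lower tier is consulted only when the higher ones yield fewer than K documents.

```python
def tiered_search(tiers, term, k):
    # Walk the tiers from highest threshold down until at least k docs are collected.
    results = []
    for tier in tiers:
        for doc_id in tier.get(term, []):
            if doc_id not in results:
                results.append(doc_id)
        if len(results) >= k:
            break
    return results

# Hypothetical tiers: tier 1 holds the highest-weight postings, tier 2 the rest.
tiers = [{"car": [1]}, {"car": [2, 3]}]
print(tiered_search(tiers, "car", k=1))  # [1]
print(tiered_search(tiers, "car", k=2))  # [1, 2, 3]
```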
A complete search system
• [Figure: system architecture with the components: documents, parsing, linguistics, indexers, documents cache, user query, free text query parser, spell correction, scoring and ranking, scoring parameters, MLR, training set, result page]
Questions? Thank you