Fast Phrase Querying With Combined Indexes. Hugh E. Williams, Justin Zobel, and Dirk Bahle, RMIT University, 2004. Burak Görener, 201195001, Doğuş University.
Search Engines. Search engines need to evaluate queries extremely fast. Many queries involve phrases, and phrase queries must be supported with low disk overheads.
Introduction. Most queries consist of a simple list of words. In some queries the terms must be ordered and adjacent, which is typically indicated by enclosing them in quotation marks. The standard way to evaluate phrase queries is to use an inverted index. An inverted index stores, for each term, a list of postings (each posting includes a document ID) and a list of offsets (ordinal word positions). A phrase query is evaluated by combining the postings lists of the query terms to find the documents in which the terms occur as a phrase. This process is usually fast, but not always, because of common words.
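The evaluation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `build_inverted_index` and `phrase_query`, the in-memory dictionaries, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [ordinal word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Return {doc_id: [start positions]} where the phrase occurs exactly."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Start from the first term's postings, then keep only the starts
    # where each later term appears at the expected offset.
    matches = {d: list(p) for d, p in index[terms[0]].items()}
    for offset, term in enumerate(terms[1:], start=1):
        postings = index[term]
        narrowed = {}
        for doc_id, starts in matches.items():
            positions = set(postings.get(doc_id, ()))
            kept = [s for s in starts if s + offset in positions]
            if kept:
                narrowed[doc_id] = kept
        matches = narrowed
    return matches

# Hypothetical sample documents.
docs = {
    1: "the tower of london is old",
    2: "london has a tower of stone",
    3: "flights to london",
}
index = build_inverted_index(docs)
print(phrase_query(index, "tower of london"))
```

Note that the postings for of must be read in full even though the word carries little information, which is the cost that common words impose on this approach.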
Introduction (cont.). The postings list of a common term can require several megabytes for each gigabyte of indexed data. A crude solution is stopping, that is, ignoring common words. Google neglected common words in phrase queries until 2002; until then, many queries were evaluated incorrectly.
Introduction (cont.). A nextword index is similar to an inverted index, but its index terms are word pairs: a firstword and a nextword. For each index term (firstword) it stores a list of the words (nextwords) that follow that term, so firstword and nextword occur as a pair. Its disadvantage is its storage size, and the nextword list must be processed linearly. Directly indexing the 10,000 most common phrase queries reduces query evaluation time by over 10%.
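A nextword index can be pictured as a two-level mapping from firstword to nextword to postings. The sketch below is an assumption-laden illustration: `build_nextword_index`, `pair_postings`, and the sample documents are hypothetical, and a real index would store compressed lists on disk rather than nested dictionaries.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Map firstword -> nextword -> {doc_id: [positions of the firstword]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        words = text.lower().split()
        for position in range(len(words) - 1):
            first, nxt = words[position], words[position + 1]
            index[first][nxt].setdefault(doc_id, []).append(position)
    return index

def pair_postings(index, first, nxt):
    """Look up one firstword-nextword pair without creating new entries."""
    return index.get(first, {}).get(nxt, {})

# Hypothetical sample documents.
docs = {1: "the who are in the news", 2: "who are we"}
nextword_index = build_nextword_index(docs)
print(sorted(nextword_index["the"]))            # nextwords that follow "the"
print(pair_postings(nextword_index, "who", "are"))
```

A two-word phrase is then answered by a single pair lookup instead of a merge of two postings lists.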
Next: Introduction (finished), Properties of Phrase Queries, Inverted Index in Phrase Queries, Partial Phrase and Nextword Indexing, Combining Phrase and Inverted Indexing, Experimental Results, Conclusion.
Properties of Queries. This research used query logs from Excite from 1997 and 1999; the two logs have similar properties. They contain 1,583,922 queries, including duplicates, of which 8.3% were explicit phrase queries. In total, 5-10% of queries are explicit phrase queries. The queries were matched against a Web dataset of around 20 GB. Of the phrase queries, 11,103 (8.4%) include one of the three common words the, to, and of. In total, 14.4% of phrase queries include one of the 20 commonest terms.
Properties of Queries. Common words play an important role. In tower of london, the common word can be safely neglected during evaluation. But in special names such as movie or brand names, for example end of days or the who, it cannot: these queries are difficult to evaluate with stopwords removed. The query logs also include to be or not to be, who are we, and all in all.
Properties of Queries. Stopping may yield an efficiency gain, but a significant number of queries cannot be evaluated correctly. The basic query tower of london is evaluated as tower – london. With the 3 commonest words stopped, the result is 309 x 10^6 matches; with the 20 commonest words stopped, 490 x 10^6 matches; with the 254 commonest words stopped, 1693 x 10^6 matches. The most frequent confusion is between from and to: after stopping, flights from london and flights to london are mismatched.
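The mismatch can be seen directly by applying stopping to the example queries. This is only a sketch: the `stop` helper and the contents of `STOPWORDS` are hypothetical and do not reproduce the stopword lists used in the research.

```python
def stop(query, stopwords):
    """Remove stopwords from a query, keeping the remaining word order."""
    return [word for word in query.lower().split() if word not in stopwords]

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "to", "of", "from", "in", "on"}

print(stop("tower of london", STOPWORDS))
print(stop("flights from london", STOPWORDS))
print(stop("flights to london", STOPWORDS))
print(stop("the who", STOPWORDS))
```

After stopping, the two flights queries become the same word list, and the who loses the word that makes it a name.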
Properties of Queries. Other examples of mismatches: so many roads -> how many road; man in the moon -> man on the moon. Among the phrase queries, most have 2 words, 34% have 3 words, and 1.3% have 6 or more words.
Properties of Queries. The test data is the WT10g collection: 10.27 GB of Web data (HTML) in 1.67 million documents, crawled in 1997.
Inverted Index. The inverted index is a standard method for supporting queries on large text databases, and it is fast for ranked query evaluation. It uses a two-level structure: the upper level is a vocabulary or lexicon, and the lower level is a set of postings lists. In the notation of Zobel and Moffat (1998), d is the document ID, f_dt is the frequency of term t in document d, and o_x is a position of the term in document d.
Inverted Index. Let's look at "hatful of hollow". This is the general structure of an inverted index: it contains term and document frequencies, and word positions are ordinal.
Inverted Index. The inverted index evaluator is the open-source MG text retrieval engine described by Witten et al. (1999). The inverted index for WT10g is 1,429 MB. The data for the stopped words is 427 MB (490 stopwords), so the stopped inverted index is 1,002 MB.
Inverted Index. Results of inverted index performance.
Phrase Indexes. A phrase index is an inverted index in which the stored items are word sequences. Figure: a partial phrase index with a vocabulary of five popular phrases.
Phrase Indexes. A phrase index for phrases of length L = 3 cannot be used efficiently for 2-word queries. Phrases of length L >= 2 are stored as terms, as in a conventional inverted index; the case L = 2 is organized as a partial nextword index. In the notation of the partial phrase index, d is the document ID and f_dp is the frequency of the phrase in the document. Offsets are not stored. This saves the cost of merging lists.
Phrase Indexes. Examples of popular phrase queries in 2001 are lord of the rings (19) and britney spears (59), where the number in parentheses is the number of times the same query was requested. Given a stream of queries over a long period and a fixed volume of memory, it may also be necessary to update the vocabulary by replacing the least frequently used queries. This research does not experiment with that approach.
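Choosing a fixed vocabulary of popular phrases from a query log can be sketched as a simple frequency count. This is a static selection under stated assumptions, not the dynamic replacement scheme mentioned above; the `phrase_vocabulary` function and the sample log are hypothetical.

```python
from collections import Counter

def phrase_vocabulary(query_log, size):
    """Return the `size` most frequent phrase queries, most frequent first."""
    # Normalize case and whitespace so repeated requests count together.
    counts = Counter(" ".join(query.lower().split()) for query in query_log)
    return [phrase for phrase, _ in counts.most_common(size)]

# Hypothetical query log with made-up request counts.
log = (["britney spears"] * 3) + (["Lord of the  Rings"] * 2) + ["end of days"]
print(phrase_vocabulary(log, 2))
```

Only the phrases returned here would get their own entries in the partial phrase index; every other query falls back to the other indexes.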
Nextword Indexes. A phrase query can never be shorter than two words. A nextword index is similar to an inverted index. In its term representation, f_wp is the document frequency of the word pair, d is the document ID, f_dwp is the frequency of the pair in document d, and o_x is a position of the pair in d.
Nextword Indexes. Figure: a nextword index with two firstwords. An example: the query boulder municipal employee credit union can be grouped into the pairs boulder-municipal, employee-credit, and credit-union. Another example: in historical railroads in new hampshire, the pair railroads in is chosen in preference to in new, because railroads is much less common than in.
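One way to reproduce the groupings above is a greedy selection that prefers pairs with rare firstwords and skips pairs that cover no new word. The sketch below is an assumption: the selection rule, the `select_pairs` name, and the word frequencies are invented for illustration and may differ from the procedure used in the research.

```python
def select_pairs(query, frequency):
    """Pick firstword-nextword pairs that cover every query word,
    preferring pairs whose firstword is rare."""
    words = query.lower().split()
    pairs = [(i, words[i], words[i + 1]) for i in range(len(words) - 1)]
    # Rarest firstword first; ties are broken by position in the query.
    pairs.sort(key=lambda pair: (frequency.get(pair[1], 0), pair[0]))
    covered, chosen = set(), []
    for i, first, nxt in pairs:
        if not {i, i + 1} <= covered:   # pair covers at least one new word
            chosen.append((i, first, nxt))
            covered.update((i, i + 1))
    chosen.sort()                        # restore query order
    return [(first, nxt) for _, first, nxt in chosen]

# Made-up collection frequencies for the example words.
freq = {
    "boulder": 5, "municipal": 50, "employee": 30, "credit": 40, "union": 60,
    "historical": 20, "railroads": 10, "in": 1000, "new": 500, "hampshire": 15,
}
print(select_pairs("boulder municipal employee credit union", freq))
print(select_pairs("historical railroads in new hampshire", freq))
```

With these frequencies the pair in new is never selected, because both of its words are already covered by pairs with much rarer firstwords.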
Nextword Indexes. The nextword index for the WT10g collection is 2.75 GB in size, roughly twice the size of the inverted index file. Processing with a nextword index involves more complex structures than processing with an inverted index. Figure: differences between the inverted index and the nextword index in queries.
Combining Nextword and Inverted Indexing. The proposal is that only common words be used as firstwords in a partial nextword index.
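Under this proposal, a query is split between the two indexes: pairs that begin with a common word go to the partial nextword index, and the remaining words go to the inverted index. The planner below is a hedged sketch; `plan_lookups`, the tuple layout of a plan step, and the fallback to the inverted index for a trailing common word are assumptions for the example.

```python
def plan_lookups(query, common_words):
    """Return a list of (index_kind, position, words) lookup steps."""
    words = query.lower().split()
    plan, covered = [], set()
    # A common word can be a firstword only if another word follows it.
    for i, word in enumerate(words[:-1]):
        if word in common_words:
            plan.append(("nextword", i, (word, words[i + 1])))
            covered.update((i, i + 1))
    # Every word not covered by a pair is fetched from the inverted index.
    for i, word in enumerate(words):
        if i not in covered:
            plan.append(("inverted", i, (word,)))
    plan.sort(key=lambda step: step[1])
    return plan

COMMON = {"the", "to", "of"}  # the three common words named in the slides
print(plan_lookups("tower of london", COMMON))
print(plan_lookups("to be or not to be", COMMON))
```

The postings list for of is never read in the first query; only the much shorter list for the pair of london is needed.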
Combining Phrase and Inverted Indexing. As an example, the query new york city can be resolved by using the partial phrase index to find the locations of new york and merging them with the inverted index postings list for city.
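The merge step for this example can be sketched as follows, assuming the phrase index entry for new york keeps start positions for each document. That assumption, the `merge_phrase_with_term` name, and the postings values are all hypothetical.

```python
def merge_phrase_with_term(phrase_postings, phrase_length, term_postings):
    """Keep the phrase start positions that are immediately followed by the term."""
    result = {}
    for doc_id, starts in phrase_postings.items():
        term_positions = set(term_postings.get(doc_id, ()))
        kept = [s for s in starts if s + phrase_length in term_positions]
        if kept:
            result[doc_id] = kept
    return result

# Hypothetical postings in the form {doc_id: [word positions]}.
new_york = {1: [4, 20], 2: [0], 3: [7]}   # from the partial phrase index
city = {1: [6, 30], 2: [2], 4: [1]}       # from the inverted index
print(merge_phrase_with_term(new_york, 2, city))
```

Only one merge is needed for the three-word query, because the first two words were already resolved as a unit.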
Three-Way Index Combination. This combination includes a partial nextword index, a partial phrase index, and a full inverted index.
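A simple way to picture the three-way combination is a dispatcher that tries the cheapest applicable index first. The order of checks, the `choose_strategy` name, and the sample vocabulary are assumptions for illustration, not a description of the authors' evaluator.

```python
def choose_strategy(query, phrase_vocabulary, common_words):
    """Decide which index combination answers a phrase query."""
    words = query.lower().split()
    if " ".join(words) in phrase_vocabulary:
        return "phrase"                  # whole query is an indexed phrase
    if any(word in common_words for word in words[:-1]):
        return "nextword+inverted"       # common firstwords use nextword pairs
    return "inverted"                    # no common firstword: inverted index only

PHRASES = {"lord of the rings", "britney spears"}  # hypothetical vocabulary
COMMON_WORDS = {"the", "to", "of"}
for q in ("britney spears", "tower of london", "boulder municipal"):
    print(q, "->", choose_strategy(q, PHRASES, COMMON_WORDS))
```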
Experimental Results. All experiments were run on an Intel 700 MHz Pentium III based server with 2 GB of memory. Results of inverted and nextword indexing: this table includes the memory usage of the combinations.
Results of Inverted and Nextword Indexing. Results for n-term queries with inverted and nextword indexing.
Results of Inverted Index and Phrase Index. This test was evaluated with the 100, 1,000, and 10,000 most frequent distinct queries. The phrase index was less than 0.1% of the collection size: 2.1 MB, 4.8 MB, and 12.8 MB. In the query logs, an american dictionary of the english language and los angeles department of water and power are among the 10,000 most common queries.
Results of Inverted Index, Nextword Index, and Phrase Index. This result is based on testing 66,000 queries using a phrase index of the 10,000 most common queries, a nextword index (stopped words only), and an inverted index.