Indexing and Complexity
Agenda • Inverted indexes • Computational complexity
Some Interesting Questions • How long will it take to find a document? • Is there any work we can do in advance? • If so, how long will that take? • How big a computer will I need? • How much disk space? How much RAM? • What if more documents arrive? • How much of the advance work must be repeated? • Will searching become slower? • How much more disk space will be needed?
A Cautionary Tale • Searching is easy - just ask Microsoft! • “Find” can search my 1 GB disk in 30 seconds • Well, actually it only looks at the file names... • How long do you think “Find” would take for • The 100 GB disk we just got? • The World Wide Web? • Computers are getting faster, but… • How does AltaVista give answers in 5 seconds?
The “Inverted File” Trick • Organize the bag of words matrix by terms • You know the terms that you are looking for • Look up terms like you search phone books • For each letter, jump directly to the right spot • For terms of reasonable length, this is very fast • For each term, store the document identifiers • For every document that contains that term • At query time, use the document identifiers • Consult a “postings file”
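The phone-book style lookup above can be sketched as a character trie, where each letter of the term selects the next node, so lookup cost grows with term length rather than vocabulary size. This is only an illustration held in memory: the names `TrieNode`, `insert_term`, and `lookup_term` are hypothetical, and a real inverted file would live on disk.

```python
class TrieNode:
    """One node of a character trie (illustrative, in-memory only)."""

    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.postings = None  # document ids, set only where a term ends


def insert_term(root, term, doc_ids):
    """Walk or create one node per letter, then attach the postings."""
    node = root
    for letter in term:
        node = node.children.setdefault(letter, TrieNode())
    node.postings = doc_ids


def lookup_term(root, term):
    """Follow one child link per letter; return [] if the term is absent."""
    node = root
    for letter in term:
        node = node.children.get(letter)
        if node is None:
            return []
    return node.postings or []


root = TrieNode()
insert_term(root, "aid", [4, 8])
insert_term(root, "all", [2, 4, 6])
insert_term(root, "back", [1, 3, 7])

print(lookup_term(root, "all"))    # [2, 4, 6]
print(lookup_term(root, "ai"))     # [] because "ai" is only a prefix
```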
An Example
Term    Doc 1 2 3 4 5 6 7 8   Inverted File   Postings
aid         0 0 0 1 0 0 0 1   A / AI          4, 8
all         0 1 0 1 0 1 0 0   A / AL          2, 4, 6
back        1 0 1 0 0 0 1 0   B / BA          1, 3, 7
brown       1 0 1 0 1 0 1 0   B / BR          1, 3, 5, 7
come        0 1 0 1 0 1 0 1   C               2, 4, 6, 8
dog         0 0 1 0 1 0 0 0   D               3, 5
fox         0 0 1 0 1 0 1 0   F               3, 5, 7
good        0 1 0 1 0 1 0 1   G               2, 4, 6, 8
jump        0 0 1 0 0 0 0 0   J               3
lazy        1 0 1 0 1 0 1 0   L               1, 3, 5, 7
men         0 1 0 1 0 0 0 1   M               2, 4, 8
now         0 1 0 0 0 1 0 1   N               2, 6, 8
over        1 0 1 0 1 0 1 1   O               1, 3, 5, 7, 8
party       0 0 0 0 0 1 0 1   P               6, 8
quick       1 0 1 0 0 0 0 0   Q               1, 3
their       1 0 0 0 1 0 1 0   T / TH          1, 5, 7
time        0 1 0 1 0 1 0 0   T / TI          2, 4, 6
The Finished Product
Term    Inverted File   Postings
aid     A / AI          4, 8
all     A / AL          2, 4, 6
back    B / BA          1, 3, 7
brown   B / BR          1, 3, 5, 7
come    C               2, 4, 6, 8
dog     D               3, 5
fox     F               3, 5, 7
good    G               2, 4, 6, 8
jump    J               3
lazy    L               1, 3, 5, 7
men     M               2, 4, 8
now     N               2, 6, 8
over    O               1, 3, 5, 7, 8
party   P               6, 8
quick   Q               1, 3
their   T / TH          1, 5, 7
time    T / TI          2, 4, 6
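As a rough sketch of how the postings in the example follow from the term-document matrix, the snippet below converts a few matrix rows into postings lists and answers a Boolean AND query by merging two sorted lists. The function names `build_postings` and `intersect` are illustrative, not taken from the slides.

```python
# A few rows of the example term-document matrix (columns are Doc 1..Doc 8).
matrix = {
    "brown": [1, 0, 1, 0, 1, 0, 1, 0],
    "dog":   [0, 0, 1, 0, 1, 0, 0, 0],
    "fox":   [0, 0, 1, 0, 1, 0, 1, 0],
    "quick": [1, 0, 1, 0, 0, 0, 0, 0],
}


def build_postings(matrix):
    """For each term, keep only the numbers of the documents that contain it."""
    return {
        term: [doc for doc, bit in enumerate(row, start=1) if bit]
        for term, row in matrix.items()
    }


def intersect(a, b):
    """Boolean AND: merge two sorted postings lists in one pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


postings = build_postings(matrix)
print(postings["fox"])                                   # [3, 5, 7]
print(intersect(postings["quick"], postings["brown"]))   # [1, 3]
```

Because both lists are kept in document order, the merge touches each posting at most once.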
What Goes in a Postings File? • Boolean retrieval • Just the document number • Ranked Retrieval • Document number and term weight (TF*IDF, ...) • Proximity operators • Word offsets for each occurrence of the term • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
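To illustrate why word offsets are needed for proximity operators, here is a minimal sketch of positional postings. The entry for `term_a` mirrors the example above (Doc 3 at offsets 17 and 36, Doc 13 at offsets 3 and 45); `term_b` and its offsets are invented for the illustration, as is the function name `within`.

```python
# Positional postings: term -> {document number: sorted word offsets}.
# "term_a" follows the slide's example; "term_b" is made-up data.
positional = {
    "term_a": {3: [17, 36], 13: [3, 45]},
    "term_b": {3: [18, 90], 13: [20]},
}


def within(positional, t1, t2, k):
    """Return documents where t1 and t2 occur within k words of each other."""
    p1 = positional.get(t1, {})
    p2 = positional.get(t2, {})
    hits = []
    for doc in sorted(p1.keys() & p2.keys()):
        # Brute-force comparison of offsets; fine for short lists.
        if any(abs(a - b) <= k for a in p1[doc] for b in p2[doc]):
            hits.append(doc)
    return hits


print(within(positional, "term_a", "term_b", 2))    # [3]
print(within(positional, "term_a", "term_b", 20))   # [3, 13]
```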
How Big Is the Postings File? • Very compact for Boolean retrieval • About 10% of the size of the documents • If an aggressive stopword list is used! • Not much larger for ranked retrieval • Perhaps 20% • Enormous for proximity operators • Sometimes larger than the documents! • But access is fast - you know where to look
Building an Inverted Index • Simplest solution is a single sorted array • Fast lookup using binary search • But sorting large files on disk is very slow • And adding one document means starting over • Unbalanced tree structures allow easy insertion • But the worst-case lookup time is linear • Balanced trees provide the best of both • Fast lookup and easy insertion • But they require 45% more disk space
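The sorted-array option can be sketched with Python's standard `bisect` module. This in-memory version is only an analogy for the on-disk case: lookup is a binary search, while inserting a new term shifts every later entry, which on disk amounts to rewriting much of the file. The names `find_term` and `add_term` are illustrative.

```python
import bisect

# Parallel arrays kept in term order (a stand-in for a sorted file on disk).
terms = ["aid", "all", "back", "brown", "dog"]
postings = [[4, 8], [2, 4, 6], [1, 3, 7], [1, 3, 5, 7], [3, 5]]


def find_term(term):
    """Binary search: about log2(len(terms)) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []


def add_term(term, doc_ids):
    """Insert in sorted position; every later entry has to move."""
    i = bisect.bisect_left(terms, term)
    terms.insert(i, term)
    postings.insert(i, doc_ids)


add_term("come", [2, 4, 6, 8])
print(terms)               # ['aid', 'all', 'back', 'brown', 'come', 'dog']
print(find_term("come"))   # [2, 4, 6, 8]
```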
Starting a B+ Tree Inverted File • Text indexed so far: “Now is the time for all good …” • [Figure: B+ tree with index keys “aaaaa” and “now”; leaf nodes “all, good” and “now, time”]
Adding a New Term • Text indexed so far: “Now is the time for all good men …” • [Figure: B+ tree with index nodes “aaaaa, now” and “aaaaa, men”; leaf nodes “all, good”, “men”, and “now, time”]
How Big is the Inverted Index? • Typically smaller than the postings file • Depends on number of terms, not documents • Eventually almost all terms will be indexed • But the postings file will continue to grow • Postings dominate asymptotic space complexity • Linear in the number of documents • Assuming that the documents remain about the same size
Some Facts About Disks • It takes a long time to get the first byte • A Pentium can do 1,000,000 operations in 10 ms • But you can get 1,000 bytes just about as fast • 40 MB/sec transfer rates are typical • So it pays to put related stuff in each “block” • M-ary B+ trees are better than binary B+ trees • Time complexity is measured in disk blocks read • Since computing time is negligible by comparison
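The arithmetic behind these points can be worked through in a few lines. The 10 ms access time and 40 MB/sec transfer rate come from the bullets above; the vocabulary size of 1,000,000 terms and the fanout of 100 are assumptions chosen only for illustration, and the sketch assumes one block read per tree level with no caching.

```python
SEEK_MS = 10.0                             # time to reach the first byte
TRANSFER_BYTES_PER_MS = 40_000_000 / 1000  # 40 MB/sec, taking 1 MB = 10**6 bytes


def read_ms(n_bytes):
    """Time to fetch one block: fixed access time plus transfer time."""
    return SEEK_MS + n_bytes / TRANSFER_BYTES_PER_MS


def levels(n_terms, fanout):
    """Smallest number of tree levels h with fanout**h >= n_terms."""
    h, reach = 0, 1
    while reach < n_terms:
        reach *= fanout
        h += 1
    return h


N_TERMS = 1_000_000   # assumed vocabulary size (not from the slides)

for fanout in (2, 100):   # binary tree versus an assumed 100-way tree
    h = levels(N_TERMS, fanout)
    print(fanout, h, h * read_ms(1000))
```

Reading 1,000 bytes adds only 0.025 ms to the 10 ms access time, so a block costs almost the same as a single byte. Under these assumptions the binary tree needs 20 block reads (about 200 ms) per lookup and the 100-way tree needs 3 (about 30 ms).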
Time Complexity • Indexing • Walk the inverted file, splitting if needed • Insert into the postings file in sorted order • Hours or days for large collections • Query processing • Walk the inverted file • Read the postings file • Seconds, even for enormous collections
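The indexing and query steps can be sketched in memory as follows. Each document's terms are looked up in the index (created if missing) and the document number is inserted into the postings in sorted order; a query is a single index lookup followed by reading the postings. The stopword list and the names `add_document` and `query` are assumptions made for this example, and a Python dictionary stands in for the on-disk inverted file.

```python
import bisect
from collections import defaultdict

STOPWORDS = {"is", "the", "for"}   # assumed stopword list, illustration only

index = defaultdict(list)          # term -> sorted list of document numbers


def add_document(doc_id, text):
    """Indexing: find each term, then insert doc_id in sorted position."""
    for term in set(text.lower().split()) - STOPWORDS:
        plist = index[term]
        pos = bisect.bisect_left(plist, doc_id)
        if pos == len(plist) or plist[pos] != doc_id:
            plist.insert(pos, doc_id)


def query(term):
    """Query processing: one index lookup, then read the postings."""
    return index.get(term, [])


# Documents arrive out of order to show that postings stay sorted.
add_document(2, "Now is the time for all good men")
add_document(1, "all good men")

print(query("men"))    # [1, 2]
print(query("time"))   # [2]
print(query("the"))    # [] because stopwords are not indexed
```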
Summary • Slow indexing yields fast query processing • We use extra disk space to save query time • Index space is in addition to document space • Time and space complexity must be balanced • Disk block reads are the critical resource • Fast disks are more useful than fast computers
A Question • If insertions are more common than queries (for example, filtering news stories as they arrive and then never looking at them again), what kind of an index should you build?