
Algorithms for Information Retrieval

Dive into algorithmic design challenges with toy problems such as the Max Subarray and Top-frequent Elements. Explore the mathematical solutions and implementation steps for these problems, and discover insights on indexing, sorting, and B-trees. Learn how to optimize algorithms for better performance and efficiency in various scenarios.

rmatthews




Presentation Transcript


  1. Algorithms for Information Retrieval • Is algorithmic design a 5-minute thinking task???

  2. Toy problem #1: Max Subarray
  • Goal: Find the time window achieving the best “market performance”.
  • Math Problem: Find the subarray of maximum sum.
  Algorithm:
  • Compute P[1,n], the array of Prefix-Sums over A (with P[0] = 0).
  • Compute M[1,n], the array of running Mins over P.
  • Find end such that P[end] − M[end] is maximum; start is the position where P attains that minimum.
  A = 2 -5 6 1 -2 4 3 -13 9 -6 7
  P = 2 -3 3 4 2 6 9 -4 5 -1 6
  M = 2 -3 -3 -3 -3 -3 -3 -4 -4 -4 -4
  • Note: max_{x<y} sum A[x+1..y] = max_{x<y} ( P[y] − P[x] ) = max_y [ P[y] − min_{x<y} P[x] ]
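As a sketch, the prefix-sum solution above can be written in Python as follows (the function name is mine, and the arrays P and M are fused into two running scalars):

```python
def max_subarray_prefix(A):
    """Max subarray via prefix sums: sum(A[x+1..y]) = P[y] - P[x],
    so we track the running minimum prefix and the best difference."""
    best = float("-inf")
    prefix = 0       # P[y], with P[0] = 0
    min_prefix = 0   # min over x <= y of P[x]
    for a in A:
        prefix += a
        best = max(best, prefix - min_prefix)
        min_prefix = min(min_prefix, prefix)
    return best

# The slide's array: the optimum window is 6 + 1 - 2 + 4 + 3 = 12.
print(max_subarray_prefix([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))  # 12
```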

  3. Toy problem #1 (solution 2)
  Algorithm:
  • sum = 0;
  • for i = 1, ..., n do
  •   if (sum + A[i] ≤ 0) sum = 0;
  •   else { max_sum = MAX(max_sum, sum + A[i]); sum += A[i]; }
  A = 2 -5 6 1 -2 4 3 -13 9 -6 7
  [Figure: the array with the optimum subarray marked; the running sum is ≤ 0 just before OPT and ≥ 0 inside it]
  • Note:
  • sum = 0 when OPT starts;
  • sum > 0 within OPT.
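A minimal Python sketch of solution 2 (the function name is mine; I also update the maximum on reset so that all-negative inputs, which the slide's pseudocode leaves undefined, still return the best single element):

```python
def max_subarray_scan(A):
    """One-pass scan: reset the running sum as soon as it can no
    longer contribute positively to a later window."""
    max_sum = float("-inf")
    running = 0
    for a in A:
        if running + a <= 0:
            running = 0
            max_sum = max(max_sum, a)  # covers all-negative inputs
        else:
            running += a
            max_sum = max(max_sum, running)
    return max_sum

print(max_subarray_scan([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))  # 12
```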

  4. Toy problem #2: Top-freq elements
  • Goal: Top queries over a stream of n items (alphabet S large).
  • Math Problem: Find the item y whose frequency is > n/2, using the smallest space (i.e. the mode, provided it occurs > n/2 times). Problems arise if the mode occurs ≤ n/2 times: the returned item need not be the mode.
  Algorithm:
  • Use a pair of variables <X,C>.
  • For each item s of the stream: if (X == s) then C++; else { C--; if (C == 0) { X = s; C = 1; } }
  • Return X.
  A = b a c c c d c b a a a c c b c c c
  Trace: <b,1> <a,1> <c,1> <c,2> <c,3> <c,2> <c,3> <c,2> <c,1> <a,1> <a,2> <a,1> <c,1> <b,1> <c,1> <c,2> <c,3>
  Proof: if the final X ≠ y, then every occurrence of y has a distinct “negative” mate (an item that cancelled it), so #mates ≥ #occ(y). Mates and occurrences together give 2 · #occ(y) ≤ n, contradicting #occ(y) > n/2.
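This is the classic Boyer-Moore majority vote; a Python sketch of the slide's <X,C> scheme (function name mine):

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote in O(1) space.  Returns the true
    majority item if one occurs > n/2 times; otherwise the result
    is arbitrary and must be verified with a second counting pass."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1       # adopt s as the new candidate
        elif X == s:
            C += 1            # s supports the candidate
        else:
            C -= 1            # s cancels one occurrence of X
    return X

# The slide's stream: 'c' occurs 9 times out of 17 (> n/2).
print(majority_candidate("bacccdcbaaaccbccc"))  # 'c'
```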

  5. Toy problem #3: Indexing
  • Consider the following TREC collection:
  • N = 6 * 10^9 characters (collection size)
  • n = 10^6 documents
  • TotT = 10^9 total term occurrences (avg term length is 6 chars)
  • t = 5 * 10^5 distinct terms
  • What kind of data structure should we build to support word-based searches?

  6. Solution 1: Term-Doc matrix
  • n = 1 million documents, t = 500K terms
  • Entry (term, doc) is 1 if the document contains the word, 0 otherwise.
  • Space is 500K * 1M = 5 * 10^11 bits ≈ 500 Gbits!
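A quick sanity check of the matrix size (reading the slide's "Gb" as gigabits):

```python
# One bit per (term, document) pair in the incidence matrix.
n_docs, n_terms = 10**6, 5 * 10**5
bits = n_docs * n_terms
print(bits)                # 500000000000 bits = 500 Gbits
print(bits / 8 / 10**9)    # 62.5 GB
```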

  7. Solution 2: Inverted index
  Brutus → 2 4 8 16 32 64 128
  Caesar → 1 2 3 5 8 13 21 34
  Calpurnia → 13 16
  • Typically a <termID, docID, pos> entry uses about 12 bytes.
  • We have 10^9 total term occurrences → at least 12 GB of space.
  • Compressing the 6 GB of documents gets 1.5 GB of data: a better index, but still > 10 times the text!
  • We can do still better: i.e. 30-50% of the original text.
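A toy positional inverted index can be sketched like this (function name and the tiny example documents are mine, not from the TREC collection):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to its postings list of (docID, position) pairs,
    i.e. the <termID, docID, pos> entries of the slide."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index[term].append((doc_id, pos))
    return index

idx = build_index(["brutus killed caesar", "caesar met calpurnia"])
print(idx["caesar"])  # [(0, 2), (1, 0)]
```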

  8. Toy problem #4: sorting
  • How to sort tuples (objects) on disk?
  • 10^9 objects of 12 bytes each, hence 12 GB.
  • Key observation:
  • The array A to sort is an “array of pointers to objects”.
  • Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j].
  • If we use qsort, this is an indirect sort: Ω(n log n) random memory accesses!! (I/Os?)
  [Figure: array A of pointers into the memory containing the tuples (objects)]

  9. Cost of Quicksort on large data
  • Some typical parameter settings:
  • N = 10^9 tuples of 12 bytes each
  • Typical disk (Seagate Cheetah, 150 GB): seek time ~5 ms
  • Analysis of qsort on disk:
  • qsort is an indirect sort: Ω(n log2 n) random memory accesses
  • [5 ms] * n log2 n = 10^9 * log2(10^9) * 5 ms ≥ 3 years
  • In practice a little bit better because of caching, but...
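Checking the slide's back-of-the-envelope estimate:

```python
import math

# One 5 ms random access per comparison, n log2 n comparisons.
n, seek = 10**9, 5e-3
years = n * math.log2(n) * seek / (365 * 24 * 3600)
print(round(years, 1))  # ~4.7 years, hence the slide's ">= 3 years"
```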

  10. What about listing tuples in order? B-trees for sorting?
  • Using a well-tuned B-tree library: Berkeley DB
  • n = 10^9 insertions → data get distributed arbitrarily!!!
  [Figure: B-tree internal nodes; B-tree leaves (“tuple pointers”); the tuples themselves on disk]
  • Possibly 10^9 random I/Os = 10^9 * 5 ms ≈ 2 months

  11. Binary Merge-Sort
  Merge-Sort(A)
    if length(A) > 1 then
      Copy the first half of A into array A1
      Copy the second half of A into array A2
      Merge-Sort(A1)
      Merge-Sort(A2)
      Merge(A, A1, A2)
  Divide / Conquer / Combine
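The pseudocode above translates almost line for line into Python (a sketch; I return a new list instead of merging in place):

```python
def merge_sort(A):
    """Binary merge-sort: divide, conquer, combine."""
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    A1, A2 = merge_sort(A[:mid]), merge_sort(A[mid:])
    # Merge(A, A1, A2): repeatedly take the smaller head element.
    out, i, j = [], 0, 0
    while i < len(A1) and j < len(A2):
        if A1[i] <= A2[j]:
            out.append(A1[i]); i += 1
        else:
            out.append(A2[j]); j += 1
    return out + A1[i:] + A2[j:]

# The 16 keys from the recursion-tree slide.
print(merge_sort([13, 8, 7, 15, 4, 19, 3, 12, 2, 9, 6, 11, 1, 5, 10, 17]))
```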

  12. Merge-Sort Recursion Tree
  [Figure: recursion tree of Merge-Sort on 16 keys (13 8 7 15 4 19 3 12 2 9 6 11 1 5 10 17), merged level by level into the sorted sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19; the tree has log2 n levels]
  How do we exploit the disk features??

  13. External Binary Merge-Sort
  • Increase the size of the initial runs to be merged!
  [Figure: main-memory sorts turn memory-sized chunks of the input into N/M sorted runs; external two-way merges then combine the runs level by level into one fully sorted file]
  • N/M initial runs; each merge level is 2 passes (R/W) over the data.

  14. Cost of External Binary Merge-Sort
  • Some typical parameter settings:
  • n = 10^9 tuples of 12 bytes each, N = 12 GB of data
  • Typical disk (Seagate): seek time ~8 ms
  • avg transfer rate is 100 MB per sec = 10^-8 secs/byte
  • Analysis of binary merge-sort on disk (M = 10 MB ≈ 10^6 tuples):
  • Data divided into N/M runs: ≈ 10^3 runs
  • #levels is log2(N/M) ≈ 10
  • It executes 2 * log2(N/M) ≈ 20 passes (R/W) over the data
  • I/O-scanning cost: 20 * [12 * 10^9] * 10^-8 ≈ 2400 sec ≈ 40 min
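Verifying the scanning cost with the slide's numbers:

```python
# 20 R/W passes over 12 GB at 100 MB/s (= 10^8 bytes/sec).
N_bytes = 12 * 10**9
passes = 2 * 10              # 2 passes per level, ~10 levels
seconds = passes * N_bytes / 10**8
print(seconds, seconds / 60)  # 2400.0 sec = 40.0 min
```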

  15. Multi-way Merge-Sort
  • Sort N items using internal memory M and disk pages of size B:
  • Pass 1: produce N/M sorted runs.
  • Passes 2, …: merge X = M/B runs each pass.
  [Figure: X input buffers and one output buffer in main memory, each holding B items, streaming the runs from disk and the merged run back to disk]

  16. Multiway Merging
  [Figure: X = M/B input buffers Bf1…BfX, one per run, each with a page pointer p1…pX; one output buffer Bfo]
  • At each step, move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to Bfo.
  • Fetch the next page of run i when pi = B (current page exhausted), until EOF.
  • Flush Bfo to the output file (the merged run) when it is full.
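The min() over the run heads is typically maintained with a heap; a Python sketch of slide 16's multiway merging (function name mine; the page buffering is elided, each run is just an in-memory sorted list):

```python
import heapq

def multiway_merge(runs):
    """k-way merge of sorted runs via a min-heap over the run heads."""
    # Each heap entry: (head value, run index, position within run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                      # "move min(...) to Bfo"
        if j + 1 < len(runs[i]):             # "fetch next item of run i"
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 5, 9], [2, 6], [3, 4, 8]]))
# [1, 2, 3, 4, 5, 6, 8, 9]
```

The standard library's `heapq.merge` does the same job lazily over iterators, which is closer in spirit to streaming runs from disk.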

  17. Cost of Multi-way Merge-Sort
  • Number of merge passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
  • Cost of a pass = 2 * (N/B) I/Os; tuning depends on disk features.
  • Parameters: M = 10 MB; B = 8 KB; N = 12 GB
  • N/M ≈ 10^3 runs; #passes = log_{M/B} (N/M) ≈ 1 !!!
  • I/O-scanning: 20 passes (40 min) → 2 passes (4 min)
  • Increasing the fan-out (M/B) reduces #passes but increases #I/Os per pass!
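Checking that a single merge pass really suffices with these parameters:

```python
import math

# Slide parameters: M = 10 MB memory, B = 8 KB pages, N = 12 GB data.
M, B, N = 10 * 10**6, 8 * 10**3, 12 * 10**9
runs = N / M                  # ~1200 initial sorted runs
fanout = M // B               # merge M/B = 1250 runs per pass
print(math.ceil(math.log(runs, fanout)))  # 1 merge pass suffices
```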

  18. Can compression help?
  • Goal: enlarge M (more items fit in memory) and reduce N
  • #passes = O(log_{M/B} (N/M))
  • Cost of a pass = O(N/B)

  19. Please!! Do not underestimate the features of disks in algorithmic design.
