Dive into algorithmic design challenges with toy problems such as Max Subarray Algorithm and Top-frequent Elements. Explore the mathematical solutions and implementation steps for these problems. Additionally, discover insights on sorting, indexing, and B-trees for sorting and searching. Learn how to optimize algorithms for better performance and efficiency in various scenarios.
Algorithms for Information Retrieval
Is algorithmic design a 5-mins thinking task ???
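Toy problem #1: Max Subarray
• Goal: Find the time window achieving the best "market performance".
• Math Problem: Find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
• Compute P[1,n], the array of prefix sums over A
• Compute M[1,n], the array of running minima over P
• Find end such that P[end] - M[end] is maximum.
• start is such that P[start] is minimum, i.e. P[start] = M[end]; the window found is A[start+1, end].
P = 2 -3 3 4 2 6 9 -4 5 -1 6
M = 2 -3 -3 -3 -3 -3 -3 -4 -4 -4 -4
• Note: the sum of A[x+1, y] equals P[y] - P[x], so
• max sum = max_{x≤y} ( P[y] - P[x] )
• = max_y [ P[y] - (min_{x≤y} P[x]) ]
• (taking P[0] = 0 also covers windows that begin at A[1])
• In the example: end = 7, start = 2, window A[3,7] = 6 1 -2 4 3 of sum 9 - (-3) = 12.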
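The prefix-sum idea of this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `max_subarray_prefix` is made up here, positions are reported 1-based as on the slide, and P[0] = 0 is included so that windows starting at A[1] are covered. P and M are kept as running values rather than full arrays.

```python
def max_subarray_prefix(a):
    """Return (best_sum, first, last) with 1-based inclusive positions.

    Hypothetical helper: tracks the running prefix sum P[end] and the
    smallest earlier prefix sum (P[0] = 0 is an assumption added here).
    """
    best = None
    prefix = 0       # P[end]
    min_prefix = 0   # min of P[x] over x < end, starting from P[0] = 0
    min_at = 0       # position x where that minimum occurs
    for end, value in enumerate(a, start=1):
        prefix += value
        gain = prefix - min_prefix          # sum of A[min_at+1 .. end]
        if best is None or gain > best[0]:
            best = (gain, min_at + 1, end)
        if prefix < min_prefix:
            min_prefix, min_at = prefix, end
    return best


A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_prefix(A))  # (12, 3, 7)
```

On the slide's array this reports the window A[3,7], whose sum is P[7] - P[2] = 9 - (-3) = 12.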
Toy problem #1 (solution 2)
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Algorithm
• sum = 0;
• For i = 1,...,n do
• If (sum + A[i] ≤ 0) sum = 0; else { sum += A[i]; max_sum = MAX(max_sum, sum); }
[Figure: array A with the Optimum window highlighted; labels "≤0" and "≥0"]
• Note:
• Sum = 0 when OPT starts;
• Sum > 0 within OPT
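A minimal Python sketch of this single-scan solution follows. The slide does not say how max_sum is initialised; starting it at 0 (the empty window) is an assumption made here, which means an all-negative array yields 0. The function name is hypothetical.

```python
def max_subarray_scan(a):
    """One left-to-right scan: reset the running sum whenever it would drop to <= 0."""
    max_sum = 0   # assumption: the empty window (sum 0) is allowed
    running = 0
    for value in a:
        if running + value <= 0:
            running = 0                      # a new candidate window starts after this item
        else:
            running += value
            max_sum = max(max_sum, running)
    return max_sum


A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_scan(A))  # 12
```

The running sum is reset after -5 and after -13, so it is 0 exactly when the optimal window 6 1 -2 4 3 begins and stays positive inside it, as the note on the slide states.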
Toy problem #2: Top-freq elements
• Goal: Top queries over a stream of n items (Σ large).
• Math Problem: Find the item y whose frequency is > n/2, using the smallest space (i.e. if the mode occurs > n/2 times).
Algorithm
• Use a pair of variables <X,C>
• For each item s of the stream,
• if (X==s) then C++ else { C--; if (C==0) { X=s; C=1; } }
• Return X;
A = b a c c c d c b a a a c c b c c c
<b,1> <a,1> <c,1> <c,2> <c,3> <c,2> <c,3> <c,2> <c,1> <a,1> <a,2> <a,1> <c,1> <b,1> <c,1> <c,2> <c,3>
Proof
If X≠y, then every one of y's occurrences has a "negative" mate. Hence these mates should be ≥ #occ(y), so the stream would need at least 2 * #occ(y) items; but 2 * #occ(y) > n...
• Problems arise if the mode occurs ≤ n/2 times: the returned X is then not guaranteed to be the mode.
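The <X,C> scheme can be traced with a short Python sketch. It follows the update rule as written on the slide (on a mismatch the counter is decremented, and the mismatching item takes over when the counter reaches 0). The slide does not state the initial state; setting <X,C> from the first item is an assumption made here, and the function name is hypothetical.

```python
def majority_candidate(stream):
    """Return (X, trace) where trace lists the <X,C> pair after each item."""
    x, c = None, 0
    trace = []
    for s in stream:
        if c == 0:            # only true before the first item (assumed initialisation)
            x, c = s, 1
        elif s == x:
            c += 1
        else:
            c -= 1
            if c == 0:        # the mismatching item becomes the new candidate
                x, c = s, 1
        trace.append((x, c))
    return x, trace


A = "bacccdcbaaaccbccc"
candidate, trace = majority_candidate(A)
print(candidate)   # c
print(trace[:5])   # [('b', 1), ('a', 1), ('c', 1), ('c', 2), ('c', 3)]
```

Only two variables are kept, whatever the stream length. When no item is known to occur more than n/2 times, a second pass that counts the occurrences of the returned candidate is needed to confirm it.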
Toy problem #3: Indexing
• Consider the following TREC collection:
• N = 6 * 10^9 bytes (size)
• n = 10^6 documents
• TotT = 10^9 total terms (avg term length is 6 chars)
• t = 5 * 10^5 distinct terms
• What kind of data structure do we build to support word-based searches?
Solution 1: Term-Doc matrix
[Figure: term-document matrix, n = 1 million by t = 500K; an entry is 1 if play contains word, 0 otherwise]
• Space is 500Gb !
Solution 2: Inverted index
[Figure: posting lists for the terms Brutus, Calpurnia, Caesar; docID lists shown: 2 4 8 16 32 64 128 | 1 2 3 5 8 13 21 34 | 13 16]
• Typically <termID,docID,pos> use about 12 bytes
• We have 10^9 total terms → at least 12Gb space
• Compressing 6Gb documents gets 1.5Gb data
• Better index but yet it is >10 times the text !!!!
• We can do still better: i.e. 30-50% of the original text
Toy problem #4: Sorting
• How to sort tuples (objects) on disk
• 10^9 objects of 12 bytes each, hence 12 Gb
• Key observation:
• Array A to sort is an "array of pointers to objects"
• For each object-to-object comparison A[i] vs A[j]:
• 2 random accesses to memory locations A[i] and A[j]
• If we use qsort, this is an indirect sort !!!
• Ω(n log n) random memory accesses !! (I/Os ?)
[Figure: array A of pointers into the memory containing the tuples (objects)]
Cost of Quicksort on large data
• Some typical parameter settings
• n = 10^9 tuples of 12 bytes each
• Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
• Analysis of qsort on disk:
• qsort is an indirect sort: Ω(n log_2 n) random memory accesses
• [5ms] * n log_2 n = 10^9 * log_2(10^9) * 5ms ≥ 3 years
• In practice a little bit better because of caching, but...
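The back-of-the-envelope estimate on this slide can be checked with a few lines of Python. The figures are the slide's; treating every comparison as one 5 ms seek and using a 365-day year are simplifications assumed here.

```python
import math

n = 10**9            # tuples
seek_s = 5e-3        # one random access = one seek of ~5 ms

accesses = n * math.log2(n)              # ~ n log_2 n random accesses
seconds = accesses * seek_s
years = seconds / (365 * 24 * 3600)

print(f"log2(n) ~ {math.log2(n):.1f}, total ~ {years:.1f} years")
```

Since log_2(10^9) is just under 30, the total comes out between 4 and 5 years, consistent with the slide's "≥ 3 years" bound.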
B-trees for sorting ?
• What about listing tuples in order ?
• Using a well-tuned B-tree library: Berkeley DB
• n = 10^9 insertions
• Data get distributed arbitrarily !!!
[Figure: B-tree internal nodes, B-tree leaves ("tuple pointers"), Tuples]
• Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
Binary Merge-Sort
Merge-Sort(A)
  if length(A) > 1 then
    Copy the first half of A into array A1     (Divide)
    Copy the second half of A into array A2    (Divide)
    Merge-Sort(A1)                             (Conquer)
    Merge-Sort(A2)                             (Conquer)
    Merge(A, A1, A2)                           (Combine)
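As a runnable counterpart to the pseudocode, here is a small in-memory Python sketch. It returns a new list instead of merging back into A, which is a simplification chosen here; the function names are illustrative.

```python
def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])    # at most one of these two tails is non-empty
    out.extend(right[j:])
    return out


def merge_sort(a):
    """Divide into halves, sort each recursively, then merge."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))


print(merge_sort([10, 2, 5, 1, 13, 19, 9, 7]))  # [1, 2, 5, 7, 9, 10, 13, 19]
```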
Merge-Sort Recursion Tree
[Figure: recursion tree of height log_2 n; sorted runs are merged pairwise, level by level, up to the fully sorted sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19]
• How do we exploit the disk features ??
External Binary Merge-Sort
• Increase the size of initial runs to be merged!
[Figure: main-memory sorts produce the initial runs 1 2 5 10, 7 9 13 19, 3 4 8 15 and 6 11 12 17; external two-way merges combine them into 1 2 5 7 9 10 13 19 and 3 4 6 8 11 12 15 17, and a final external two-way merge yields 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19]
• N/M runs, each level is 2 passes (R/W) over the data
Cost of External Binary Merge-Sort
• Some typical parameter settings:
• n = 10^9 tuples of 12 bytes each, N = 12 Gb of data
• Typical Disk (Seagate): seek time ~8ms
• avg transfer rate is 100Mb per sec = 10^-8 secs/byte
• Analysis of binary-mergesort on disk (M = 10Mb = 10^6 tuples):
• Data divided into (N/M) runs: ≈ 10^3 runs
• #levels is log_2 (N/M) ≈ 10
• It executes 2 * log_2 (N/M) ≈ 20 passes (R/W) over the data
• I/O-scanning cost: 20 * [12 * 10^9] * 10^-8 ≈ 2400 sec = 40 min
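The arithmetic of this slide, spelled out in Python. The parameters are the slide's; decimal units (1 Mb = 10^6 bytes) and rounding the number of levels to the nearest integer, as the slide does, are assumptions made here. Seek time is ignored because the scans are sequential.

```python
import math

N = 12 * 10**9          # bytes of data
M = 10 * 10**6          # bytes of internal memory
sec_per_byte = 1e-8     # 100 Mb per second transfer rate

runs = N // M                           # 1200, which the slide rounds to ~10^3
levels = round(math.log2(runs))         # ~10 levels of two-way merging
passes = 2 * levels                     # each level reads and writes the data
seconds = passes * N * sec_per_byte

print(runs, levels, passes, round(seconds), round(seconds / 60))
```

Each pass moves 12 * 10^9 bytes in about 120 seconds, so 20 passes take about 2400 seconds, i.e. 40 minutes.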
Multi-way Merge-Sort
• Sort N items using internal memory M and disk pages of size B:
• Pass 1: Produce (N/M) sorted runs.
• Pass 2, …: merge X = M/B runs each pass.
[Figure: X input buffers (INPUT 1, INPUT 2, ..., INPUT X) and one OUTPUT buffer, each of B items, held in main memory between the input disk and the output disk]
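Multiway Merging
[Figure: runs 1, 2, ..., X = M/B each keep their current page in a buffer Bf1, Bf2, ..., Bfx with pointers p1, p2, ..., pX; the output buffer Bfo with pointer po feeds the out file (merged run); EOF marks the end of a run]
• Repeatedly output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into Bfo
• Fetch the next page of run i, if pi = B
• Flush Bfo, if Bfo is full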
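The merging step can be sketched in Python with a heap that holds the current head of every run. This is an in-memory illustration only: Python iterators stand in for the page buffers Bf1..Bfx (advancing an iterator plays the role of moving pi and fetching a page), the output is yielded item by item instead of being flushed from Bfo, and the name `multiway_merge` is hypothetical.

```python
import heapq

_END = object()  # sentinel marking an exhausted run (the "EOF" of the figure)


def multiway_merge(runs):
    """Yield the items of several sorted runs in globally sorted order."""
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        head = next(it, _END)
        if head is not _END:
            heap.append((head, idx))     # (current item, run it came from)
    heapq.heapify(heap)
    while heap:
        value, idx = heapq.heappop(heap)     # min over the current heads
        yield value
        head = next(iters[idx], _END)        # advance the pointer of that run
        if head is not _END:
            heapq.heappush(heap, (head, idx))


runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
print(list(multiway_merge(runs)))
```

With X runs, each output item costs O(log X) comparisons on the heap rather than a linear scan over all X buffer heads. The standard library's `heapq.merge` offers the same behaviour ready-made.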
Cost of Multi-way Merge-Sort
• Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
• Cost of a pass = 2 * (N/B) I/Os
• Parameters
• M = 10Mb; B = 8Kb; N = 12 Gb;
• N/M ≈ 10^3 runs; #passes = log_{M/B} (N/M) ≈ 1 !!!
• I/O-scanning: 20 passes (40 min) → 2 passes (4 min)
• Tuning depends on disk features
• Increasing the fan-out (M/B) increases #I/Os per pass!
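The pass count above can be reproduced with a short Python calculation. The parameters are the slide's; decimal units (1 Kb = 10^3 bytes) and the 10^-8 secs/byte transfer rate used for the timing are assumptions carried over for the sake of the example.

```python
import math

N = 12 * 10**9          # bytes of data
M = 10 * 10**6          # bytes of internal memory
B = 8 * 10**3           # bytes per disk page
sec_per_byte = 1e-8     # assumed transfer rate of 100 Mb per second

fan_out = M // B                                      # runs merged at once
runs = math.ceil(N / M)                               # initial sorted runs
merge_levels = math.ceil(math.log(runs, fan_out))     # log_{M/B} (N/M), rounded up
scans = 2 * merge_levels                              # read + write per level
seconds = scans * N * sec_per_byte

print(fan_out, runs, merge_levels, scans, round(seconds / 60))
```

Because the fan-out (1250) exceeds the number of runs (1200), a single merge level suffices, and the 20 scans of the binary version shrink to 2, i.e. about 4 minutes of I/O scanning.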
Can compression help?
• Goal: enlarge M and reduce N
• #passes = O(log_{M/B} (N/M))
• Cost of a pass = O(N/B)
Please !! Do not underestimate the features of disks in algorithmic design