
Algoritmi per IR


Presentation Transcript


  1. Algoritmi per IR Prologo

  2. What Google is searching for... • Algorithm Complexity: You need to know Big-O. [….] • Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting algorithm. [….] • Hashtables: Arguably the single most important data structure known to mankind. [….] • Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS and DFS, and know the difference between inorder, postorder and preorder. • Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms: breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*. • Other data structures: You should study up on as many other data structures and algorithms as possible. You should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NP-complete means. • Mathematics: … • Operating Systems: … • Coding: …

  3. References • Managing Gigabytes, A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999. • Mining the Web: Discovering Knowledge from..., S. Chakrabarti, Morgan Kaufmann Publishers, 2003. • A bunch of scientific papers available on the course site!!

  4. About this course • It is a mix of algorithms for: • data compression • data indexing • data streaming (and sketching) • data searching • data mining • ...over massive data!!

  5. Paradigm shift... Web 2.0 is about the many.

  6. Big DATA, Big PC? • We have three types of algorithms: • T1(n) = n, T2(n) = n^2, T3(n) = 2^n ... and assume that 1 step = 1 time unit • How many input items n can each algorithm process within t time units? • n1 = t, n2 = √t, n3 = log2 t • What about a k-times faster processor? ...or, what is n when the time budget is k*t? • n1 = k*t, n2 = √k * √t, n3 = log2(k*t) = log2 k + log2 t
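
The arithmetic on this slide is easy to check with a few lines of Python. A minimal sketch (ours, not course material; the budget t and speed-up k below are illustrative):

```python
# How far a time budget t reaches for T1(n)=n, T2(n)=n^2, T3(n)=2^n,
# and how little a k-times faster processor helps the exponential case.
import math

def max_input(t):
    """Largest n solvable within t steps for each of the three cost functions."""
    return {"n": t,                      # T1(n) = n    ->  n1 = t
            "n^2": math.isqrt(t),        # T2(n) = n^2  ->  n2 = sqrt(t)
            "2^n": int(math.log2(t))}    # T3(n) = 2^n  ->  n3 = log2(t)

t, k = 10**6, 100                        # illustrative budget and speed-up
print(max_input(t))      # {'n': 1000000, 'n^2': 1000, '2^n': 19}
print(max_input(k * t))  # {'n': 100000000, 'n^2': 10000, '2^n': 26}
```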

  7. A new scenario • Data are more available than ever before: n ➜ ∞ is more than a theoretical assumption • The RAM model is too simple: step cost is Ω(1) time

  8. [Figure: the memory hierarchy: CPU with registers and L1/L2 caches (few MBs, some nanosecs, few words fetched), RAM (few GBs, tens of nanosecs, some words fetched), HD (few TBs, few millisecs, B = 32K page), network (many TBs, even secs, packets).] Not just MIN #steps… You should be "??-aware programmers"

  9. I/O-conscious Algorithms [Figure: disk with tracks, magnetic surface, read/write arm and head.] Spatial locality vs. temporal locality. "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)

  10. The space issue • M = memory size, N = problem size • T(n) = time complexity of an algorithm using linear space • p = fraction of memory accesses [0.3-0.4 (Hennessy-Patterson)] • C = cost of an I/O [10^5-10^6 (Hennessy-Patterson)] • If N = (1+f)M, then the average disk cost per step (D-avg) is: C * p * f/(1+f), which is at least 10^4 * f/(1+f) • If we fetch B ≈ 4KB in time C, and the algorithm uses all of them: (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)
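
As a rough sanity check of the formula above, here is a tiny sketch (ours; the values of p, C and f are samples from the quoted ranges, so the resulting constant differs from the slide's 30) that evaluates C * p * f/(1+f) with and without amortizing the I/O over a block of B items:

```python
# Average disk cost per step when the input exceeds memory by a factor (1+f).
def avg_step_cost(p, C, f, B=None):
    """p: fraction of memory accesses, C: cost of one I/O (in steps),
    f: excess factor with N = (1+f)*M, B: items fetched per I/O (None = 1)."""
    cost = C * p * f / (1 + f)
    return cost if B is None else cost / B

p, C, f = 0.3, 10**5, 1.0                 # sample values from the quoted ranges
print(avg_step_cost(p, C, f))             # ~15000 extra steps per step!
print(avg_step_cost(p, C, f, B=4096))     # ~3.7 if the whole 4KB block is used
```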

  11. Space-conscious Algorithms: compressed data structures that support search and access operations with few I/Os.

  12. Streaming Algorithms [Figure: disk with read/write head, arm, track, magnetic surface.] Data arrive continuously, or we wish FEW scans • Streaming algorithms: • use few scans • handle each element fast • use small space

  13. Cache-Oblivious Algorithms [Figure: the memory hierarchy of slide 8.] Unknown and/or changing devices • Block access is important at all levels of the memory hierarchy • But memory hierarchies are very diverse • Cache-oblivious algorithms: • explicitly, they do not assume any model parameters • implicitly, they use blocks efficiently at all memory levels

  14. Toy problem #1: Max Subarray • Goal: given a stock and its D-performance over time, find the time window in which it achieved the best "market performance". • Math Problem: find the subarray of maximum sum. A = 2 -5 6 1 -2 4 3 -13 9 -6 7

  15. An optimal solution. We assume every subsum ≠ 0. Algorithm: • sum = 0; max = -1; • for i = 1,...,n do • if (sum + A[i] ≤ 0) then sum = 0; else { sum += A[i]; max = MAX{max, sum}; } A = 2 -5 6 1 -2 4 3 -13 9 -6 7 • Note: • sum < 0 when OPT starts; • sum > 0 within OPT. (A runnable rendering follows below.)
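
A runnable rendering (ours) of the scan above, essentially Kadane's algorithm; variable names are ours and, as on the slide, an array with no positive window is not handled.

```python
# Maximum subarray sum in one left-to-right scan.
def max_subarray_sum(A):
    running, best = 0, float("-inf")
    for x in A:
        if running + x <= 0:
            running = 0              # restart: no window ending here can help
        else:
            running += x             # extend the current window
            best = max(best, running)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))           # 12, achieved by the window 6 1 -2 4 3
```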

  16. Toy problem #2: sorting • How to sort tuples (objects) on disk? • Key observation: • Array A is an "array of pointers to objects" • For each object-to-object comparison A[i] vs A[j]: • 2 random accesses to the memory locations pointed to by A[i] and A[j] • MergeSort ⇒ Θ(n log n) random memory accesses (I/Os??)

  17. What about listing tuples in order? B-trees for sorting? Using a well-tuned B-tree library (Berkeley DB): • n insertions ⇒ data get distributed arbitrarily across the leaves!!! [Figure: B-tree internal nodes, B-tree leaves ("tuple pointers"), tuples.] • Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

  18. Binary Merge-Sort Merge-Sort(A,i,j): if (i < j) then { m = (i+j)/2; // Divide Merge-Sort(A,i,m); Merge-Sort(A,m+1,j); // Conquer Merge(A,i,m,j) } // Combine
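
A runnable Python rendering (ours) of the pseudocode above; the Merge routine, left implicit on the slide, is spelled out.

```python
def merge_sort(A, i, j):
    if i < j:                        # divide
        m = (i + j) // 2
        merge_sort(A, i, m)          # conquer the left half
        merge_sort(A, m + 1, j)      # conquer the right half
        merge(A, i, m, j)            # combine the two sorted halves

def merge(A, i, m, j):
    left, right = A[i:m + 1], A[m + 1:j + 1]
    k, l, r = i, 0, 0
    while l < len(left) and r < len(right):
        if left[l] <= right[r]:
            A[k] = left[l]; l += 1
        else:
            A[k] = right[r]; r += 1
        k += 1
    A[k:j + 1] = left[l:] + right[r:]   # copy whichever run is left over

A = [19, 12, 7, 15, 4, 8, 3, 13, 11, 9, 6, 1, 5, 2, 10, 17]
merge_sort(A, 0, len(A) - 1)
print(A)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19]
```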

  19. Cost of Mergesort on large data • Take Wikipedia in Italian, compute word frequencies: • n = 10^9 tuples ⇒ a few GBs • Typical disk (Seagate Cheetah 150GB): seek time ~5ms • Analysis of mergesort on disk: • it is an indirect sort ⇒ Θ(n log2 n) random I/Os • [5ms] * n log2 n ≈ 1.5 years. In practice it is faster, because of caching...

  20. Merge-Sort Recursion Tree [Figure: recursion tree of Merge-Sort over 16 keys; the tree has log2 N levels and each merge level costs 2 passes (R/W) over the data.] If the run-size is larger than B (i.e. after the first merge step!!), fetching all of it in memory for merging does not help. How do we deploy the disk/memory features? Produce N/M runs, each sorted in internal memory (no I/Os); the I/O-cost for merging them pairwise is then ≈ 2 (N/B) log2 (N/M).

  21. Multi-way Merge-Sort • The key is to balance run-size and #runs to merge • Sort N items with main memory M and disk pages of B items: • Pass 1: produce N/M sorted runs. • Pass i: merge X = M/B runs at a time ⇒ log_{M/B}(N/M) merge passes. [Figure: X input buffers and one output buffer, each of B items, in main memory; runs are streamed from and back to disk.]

  22. Multiway Merging [Figure: one buffer Bf_i with pointer p_i per run (Run 1 ... Run X = M/B) and an output buffer Bf_o; at each step output min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]); fetch the next page of run i when p_i reaches B; flush Bf_o to the merged output file when it is full.]
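
A minimal in-memory sketch (ours) of the buffer scheme just described: X sorted runs are merged through a heap of their current heads, refilling each input buffer one "page" of B items at a time. Real external-memory code would read and write disk pages instead of slicing Python lists.

```python
import heapq

def multiway_merge(runs, B):
    """runs: list of sorted lists standing in for the sorted disk runs."""
    buffers = [run[:B] for run in runs]       # current page of each run
    next_off = [B] * len(runs)                # offset of the next page per run
    heap = [(buf[0], i, 0) for i, buf in enumerate(buffers) if buf]
    heapq.heapify(heap)
    out = []                                  # stands in for the output buffer
    while heap:
        val, i, j = heapq.heappop(heap)       # smallest head among the X runs
        out.append(val)
        j += 1
        if j == len(buffers[i]):              # page exhausted: fetch the next one
            buffers[i] = runs[i][next_off[i]:next_off[i] + B]
            next_off[i] += B
            j = 0
        if j < len(buffers[i]):
            heapq.heappush(heap, (buffers[i][j], i, j))
    return out

print(multiway_merge([[1, 5, 9], [2, 3, 8], [4, 6, 7]], B=2))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```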

  23. Cost of Multi-way Merge-Sort • Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M) • Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os • In practice: • M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1 • One multiway merge ⇒ 2 passes = a few minutes • Tuning depends on disk features: • a large fan-out (M/B) decreases #passes • compression would decrease the cost of a pass!

  24. Can compression help? • Goal: enlarge M and reduce N • #passes = O(log_{M/B}(N/M)) • Cost of a pass = O(N/B)

  25. Part of Vitter's paper… addressing issues related to: • Disk striping: sorting easily on D disks • Distribution sort: top-down sorting • Lower bounds: how far we can go

  26. Toy problem #3: Top-freq elements • Goal: top queries over a stream of N items (N large). • Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, if it occurs > N/2 times). Algorithm: • use a pair of variables <X,C> • for each item s of the stream: if (X==s) then C++ else { C--; if (C==0) { X=s; C=1; } } • return X. Example: A = b a c c c d c b a a a c c b c c c [Trace of <X,C> along the stream omitted; it ends with <c,3>.] Proof: if X≠y at the end, then every one of y's occurrences has a "negative" mate, hence these mates are ≥ #occ(y); but then N ≥ 2 * #occ(y), contradicting #occ(y) > N/2. Problems arise if the top frequency is ≤ N/2. (A runnable sketch follows below.)
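
A runnable sketch (ours) of the <X,C> scan; the initialization is our choice. If no item occurs more than N/2 times, the returned candidate may be wrong, which is exactly the "problems if ≤ N/2" caveat, so a second verification pass is needed in that case.

```python
# Majority-vote scan (Boyer-Moore style): returns a candidate X which is the
# majority element whenever some item occurs more than N/2 times.
def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if s == X:
            C += 1               # the current candidate gets another vote
        else:
            C -= 1               # a mismatch cancels one vote
            if C <= 0:           # all votes cancelled: adopt s as new candidate
                X, C = s, 1
    return X

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))     # 'c' (it occurs 9 times out of 17 > N/2)
```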

  27. Toy problem #4: Indexing • Consider the following TREC collection: • N = 6 * 10^9 characters, size ≈ 6GB • n = 10^6 documents • TotT = 10^9 total term occurrences (avg term length is 6 chars) • t = 5 * 10^5 distinct terms • What kind of data structure should we build to support word-based searches?

  28. Solution 1: Term-Doc matrix • n = 1 million documents, t = 500K terms • Entry (w,d) is 1 if document d contains word w, 0 otherwise • Space is 500GB!

  29. Solution 2: Inverted index [Example posting lists for Brutus, Calpurnia and Caesar: 2 4 8 16 32 64 128; 1 2 3 5 8 13 21 34; 13 16.] We can still do better, i.e. 30-50% of the original text. • Typically an entry <doc, pos, rankinfo> uses about 12 bytes • We have 10^9 total term occurrences ⇒ at least 12GB of space • Compressing the 6GB of documents gets ≈1.5GB of data • A better index, but it is still >10 times the text!!!!
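
To make the idea concrete, here is a toy sketch (ours, with made-up documents) of a word-based inverted index and the posting-list intersection used to answer an AND query:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc-ids containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)        # doc_ids are appended in order
    return index

def intersect(p1, p2):
    """Merge-based intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

docs = ["brutus killed caesar", "caesar married calpurnia", "brutus met calpurnia"]
idx = build_index(docs)
print(intersect(idx["brutus"], idx["calpurnia"]))   # [2]: the only doc with both
```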

  30. Please !! Do not underestimate the features of disks in algorithmic design

  31. Algoritmi per IR Basics + Huffman coding

  32. How much can we compress? Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand. Take all messages of length n: is it possible to compress ALL OF THEM into fewer bits? NO: there are 2^n of them, but only 2^n - 1 binary strings shorter than n bits, so there are fewer compressed messages than messages… We need to talk about stochastic sources.

  33. Entropy (Shannon, 1948) For a set of symbols S where symbol s has probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits. Lower probability ⇒ higher information. Entropy is the weighted average of i(s): H(S) = Σ_s p(s) * log2(1/p(s)).
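
A tiny sketch (ours) of these definitions, applied to the distribution used later in the Huffman running example; the Huffman code built there has average length 1.8 bits, within 1 bit of this entropy.

```python
import math

def self_information(p):
    return math.log2(1 / p)          # i(s) = log2(1/p(s)) bits

def entropy(probs):
    return sum(p * self_information(p) for p in probs)

probs = [0.1, 0.2, 0.2, 0.5]         # p(a), p(b), p(c), p(d) of the later example
print(round(entropy(probs), 3))      # 1.761 bits per symbol
```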

  34. Statistical Coding How do we use probability p(s) to encode s? • Prefix codes and relationship to Entropy • Huffman codes • Arithmetic codes

  35. Uniquely Decodable Codes A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It could be "ca" (101·1) or "ad" (1·011). A uniquely decodable code is one in which every bit sequence can be decomposed into codewords in at most one way.

  36. Prefix Codes A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie whose edges are labeled 0/1 and whose leaves are the symbols.

  37. Average Length For a code C whose codeword for symbol s has length L[s], the average length is defined as La(C) = Σ_s p(s) * L[s]. We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C').

  38. A property of optimal codes Theorem (Kraft-McMillan). For any optimal uniquely decodable code there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa… Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].

  39. Relationship to Entropy Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C). Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1. The Shannon code takes ⌈log2(1/p(s))⌉ bits per symbol.

  40. Huffman Codes Invented by Huffman as a class assignment in the early '50s. Used in most compression algorithms: • gzip, bzip, jpeg (as an option), fax compression,… Properties: • Generates optimal prefix codes • Cheap to encode and decode • La(Huff) = H if probabilities are powers of 2 • Otherwise, at most 1 extra bit per symbol!!!

  41. Running Example p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5 [Figure: Huffman tree with internal nodes of weight .3, .5 and 1.] a = 000, b = 001, c = 01, d = 1. There are 2^(n-1) "equivalent" Huffman trees. What about ties (and thus tree depth)?
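
A compact sketch (ours) of Huffman's construction on this distribution, maintaining a heap of partial trees. Ties are broken arbitrarily, so it may output one of the "equivalent" Huffman codes rather than exactly a=000, b=001, c=01, d=1, but the average length (1.8 bits) is the same.

```python
import heapq

def huffman_codes(freqs):
    # Heap items: (probability, tie-breaker, {symbol: codeword-so-far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)     # the two least probable trees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}        # ...are merged
        merged.update({s: "1" + w for s, w in c2.items()})  # under a new node
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
print(codes)   # {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: average length 1.8
```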

  42. Encoding and Decoding Encoding: emit the root-to-leaf path leading to the symbol to be encoded. Decoding: start at the root and take the branch indicated by each bit received; when at a leaf, output its symbol and return to the root. [Example on the previous tree: "abc..." → 00000101...; 101001... → "dcb..."]

  43. A property on tree contraction: something like substituting the two symbols x, y with one new symbol x+y ...by induction, optimality follows…

  44. Optimum vs. Huffman

  45. Model size may be large Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding: the canonical Huffman tree. We store, for every level L: • firstcode[L] • Symbol[L,i], for each i in level L. This takes ≤ h^2 + |S| log |S| bits, where h is the tree height.

  46. Canonical Huffman Encoding [Figure: example of canonical codeword assignment over levels 1-5.]

  47. Canonical Huffman Decoding firstcode[1] = 2, firstcode[2] = 1, firstcode[3] = 1, firstcode[4] = 2, firstcode[5] = 0. T = ...00010...
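
A sketch (ours) of the standard canonical-Huffman decoding loop driven by firstcode[]; since the slide's Symbol[L,i] table is not shown, the function returns the (level, offset) pair that would index it. On the bits 00010, with the firstcode values above, it stops at level 5 with offset 2.

```python
def canonical_decode_one(bits, firstcode):
    """bits: iterator over 0/1 ints; firstcode: dict level -> first code value."""
    v = next(bits)
    level = 1
    while v < firstcode[level]:         # codeword not yet complete at this level
        v = 2 * v + next(bits)          # append the next bit to the code value
        level += 1
    return level, v - firstcode[level]  # indexes Symbol[level, offset]

firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}
print(canonical_decode_one(iter([0, 0, 0, 1, 0]), firstcode))   # (5, 2)
```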

  48. Problem with Huffman Coding Consider a symbol with probability .999. Its self-information is log2(1/.999) ≈ .00144 bits. If we were to send 1000 such symbols, we might hope to use 1000 * .00144 ≈ 1.44 bits. Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

  49. What can we do? Macro-symbol = block of k symbols • 1 extra bit per macro-symbol = 1/k extra bits per symbol • Larger model to be transmitted Shannon took infinite sequences, and k → ∞!! In practice, we have: • The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|^k) • It holds H_0(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

  50. Compress + Search? [Moura et al, 98] • Compressed text derived from a word-based Huffman code: • the symbols of the Huffman tree are the words of T • the Huffman tree has fan-out 128 • codewords are byte-aligned and tagged: each byte carries 7 bits of the Huffman codeword plus 1 tagging bit that marks the first byte of a codeword. [Figure: the byte-aligned, tagged codewords C(T) for T = "bzip or not bzip", with the words "bzip", "or", "not" and the space as symbols.]
