
Algorithms for Large Data Sets



Presentation Transcript


  1. Algorithms for Large Data Sets Giuseppe F. (Pino) Italiano Università di Roma “Tor Vergata” italiano@disp.uniroma2.it

  2. Examples of Large Data Sets: Astronomy • Astronomical sky surveys • 120 Gigabytes/week • 6.5 Terabytes/year The Hubble Telescope

  3. Examples of Large Data Sets: Phone call billing records • 250M calls/day • 60G calls/year • 40 bytes/call • 2.5 Terabytes/year

  4. Examples of Large Data Sets: Credit card transactions • 47.5 billion transactions in 2005 worldwide • 115 Terabytes of data transmitted to VisaNet data processing center in 2004

  5. Examples of Large Data Sets: Internet traffic • Traffic in a typical router: • 42 kB/second • 3.5 Gigabytes/day • 1.3 Terabytes/year

  6. Examples of Large Data Sets: The World-Wide Web • 25 billion pages indexed • 10 kB/page • 250 Terabytes of indexed text data • “Deep web” is supposedly 100 times as large

  7. Reasons for Large Data Sets: Better technology • Storage & disks • Cheaper • More volume • Physically smaller • More efficient ⇒ Large data sets are affordable

  8. Reasons for Large Data Sets: Better networking • High speed Internet • Cellular phones • Wireless LAN ⇒ More data consumers, more data producers

  9. Reasons for Large Data Sets: Better IT • More processes are automatic • E-commerce • Online and telephone banking • Online and telephone customer service • E-learning • Chats, news, blogs • Online journals • Digital libraries • More enterprises are digital • Companies • Banks • Governmental institutions • Universities ⇒ More data is available in digital form. World’s yearly production of data: billions of Gigabytes

  10. More and More Digital Data • Amount of data to be processed increasing at a faster rate than computing power • Digital data created in the last few years is larger than the amount of data created in all previous history (57 billion GB)

  11. The Digital Universe is growing fast • Digital Universe = amount of digital information created and replicated in a year. • YouTube hosts 100 million video streams a day • More than a billion songs a day are shared over the Internet • London’s 200 traffic surveillance cameras send 64 trillion bits a day to the command center • Chevron accumulates data at the rate of TB / day • TV broadcasting is going all-digital in most countries • …

  12. We Ain’t Seen Nothing Yet… • In 2009, despite the global recession, the Digital Universe grew by 62% to nearly 800,000 PetaBytes (1 PB = 1 million GB), i.e., a stack of DVDs reaching from the Earth to the Moon and back. • In 2010, the Digital Universe was expected to grow almost as fast, to 1.2 million PB, or 1.2 Zettabytes (ZB). • With this trend, in 2020 the Digital Universe will be 35 ZB, i.e., 44 TIMES AS BIG as it was in 2009. The stack of DVDs would now reach halfway to Mars!

  13. The RAM Model of Computation • The simple uniform memory model (i.e., unit time per memory access) is no longer adequate for large data sets • Internal memory (RAM) has a typical size of only a few GB. Let’s see this with two very simple experiments

  14. Experiment 1: Sequential vs. Random Access • 2GB RAM • Write (sequentially) a file with 2 billion 32-bit integers (7.45 GB) • Read (randomly) from the same file • Which is faster? Why?

  15. Platform • MacOS X 10.5.5 (2.16 GHz Intel Core Duo) • 2GB SDRAM, 2MB L2 cache • HD Hitachi HTS542525K9SA00 232.89 GB serial ATA (speed 1.5 Gigabit) • File system Journaled HFS+ • Compiler gcc 4.0.1

  16. Sequential Write

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ItemType;  /* type of file items */

int main(int argc, char** argv) {
    FILE* f;
    long long N, i;
    if (argc < 3) exit(printf("Usage: ./MakeRandFile fileName numItems\n"));  /* check command line parameters */
    N = atoll(argv[2]);  /* convert number of items from string to integer format */
    printf("file offset: %d bit\n", (int)(sizeof(off_t) * 8));
    printf("creating random file of %lld 32 bit integers...\n", N);
    f = fopen(argv[1], "w+");  /* open file for writing */
    if (f == NULL) exit(printf("can't open file\n"));
    /* make N sequential file accesses */
    for (i = 0; i < N; ++i) {
        ItemType val = rand();
        fwrite(&val, sizeof(ItemType), 1, f);
    }
    fclose(f);
    return 0;
}

  17. Sequential Write

    …
    /* make N sequential file accesses */
    for (i = 0; i < N; ++i) {
        ItemType val = rand();
        fwrite(&val, sizeof(ItemType), 1, f);
    }
    …

  18. Random Read

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned long ItemType;  /* type of file items */

int main(int argc, char** argv) {
    FILE* f;
    long long N, i, R;
    if (argc < 3) exit(printf("Usage: ./RandFileScan fileName numReads\n"));  /* check command line parameters */
    R = atoll(argv[2]);  /* convert number of accesses from string to integer format */
    f = fopen(argv[1], "r");  /* open file for reading */
    if (f == NULL) exit(printf("can't open file %s\n", argv[1]));
    fseeko(f, 0LL, SEEK_END);  /* compute number N of elements in the file */
    N = ftello(f) / sizeof(ItemType);
    printf("file offset: %d bit\n", (int)(sizeof(off_t) * 8));
    printf("make %lld random accesses to file of %lld 32 bit integers...\n", R, N);
    srand(clock());  /* init pseudo-random generator seed */
    for (i = 0; i < R; ++i) {  /* make R random file accesses */
        ItemType val;
        long long j = (long long)(N * ((double)rand() / RAND_MAX));
        fseeko(f, j * sizeof(ItemType), SEEK_SET);
        fread(&val, sizeof(ItemType), 1, f);
    }
    fclose(f);
    return 0;
}

  19. Random Read

    …
    for (i = 0; i < R; ++i) {  /* make R random file accesses */
        ItemType val;
        long long j = (long long)(N * ((double)rand() / RAND_MAX));
        fseeko(f, j * sizeof(ItemType), SEEK_SET);
        fread(&val, sizeof(ItemType), 1, f);
    }
    …

  20. Outcome of the Experiment • Random Read: • Time to read randomly 10,000 integers in a file with 2 billion 32-bit integers (7.45 GB) is ≈118.419 sec. (i.e., 1 min. and 58.419 sec.). That’s ≈11.8 msec. per integer. • Throughput: ≈ 337.8 byte/sec ≈ 0.0003 MB/sec. • CPU Usage: ≈ 1.6% • Sequential Write: • Time to write sequentially a file with 2 billion 32-bit integers (7.45 GB) is ≈250.685 sec. (i.e., 4 min. and 10.685 sec.). That’s ≈120 nanosec. per integer. • Throughput: ≈ 31.8 MB/sec. • CPU Usage: ≈ 77% • Sequential access is roughly 100,000 times faster than random!
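A quick check of the arithmetic behind the per-integer figures, using only the numbers reported above:

\[
\frac{118.419\ \text{s}}{10{,}000\ \text{reads}} \approx 11.8\ \text{ms per integer},
\qquad
\frac{250.685\ \text{s}}{2 \cdot 10^{9}\ \text{writes}} \approx 125\ \text{ns per integer},
\qquad
\frac{11.8\ \text{ms}}{125\ \text{ns}} \approx 10^{5}.
\]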

  21. What is More Realistic Doug Comer: “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk”

  22. Magnetic Disk Drives as Secondary Memory Actually, disk access is about a million times slower… More like going Around the World in 80 Days! Time for rotation ≈ Time for seek. Amortize the seek time by transferring large blocks, so that: Time for rotation ≈ Time for seek ≈ Time to transfer data Solution 1: Exploit locality – take advantage of data locality Solution 2: Use disks in parallel
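A rough back-of-the-envelope calculation of the block size this calls for (assuming a combined seek-plus-rotation delay of about 10 ms, a typical figure for a laptop drive of that era, and the ≈30 MB/s sequential throughput measured in Experiment 1):

\[
30\ \text{MB/s} \times 10\ \text{ms} \approx 300\ \text{KB},
\]

i.e., blocks of a few hundred kilobytes make the transfer time comparable to the positioning time.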

  23. Another Experiment Experiment 2: Copy a 2048 x 2048 array of 32-bit integers copyij: copy by rows copyji: copy by columns

  24. Array Copy Access by rows:

void copyij(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (i = 0; i < 2048; i++)
        for (j = 0; j < 2048; j++)
            dst[i][j] = src[i][j];
}

  25. Array Copy Access by columns:

void copyji(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (j = 0; j < 2048; j++)
        for (i = 0; i < 2048; i++)
            dst[i][j] = src[i][j];
}

  26. Array Copy copyij and copyji differ only in their access patterns: copyij accesses the array by rows, copyji by columns. On an Intel Core i7 at 2.7 GHz: copyij takes 5.2 msec, copyji takes 162 msec (≈ 30X slower!) Arrays are stored in row-major order (this depends on the language / compiler), so copyij makes better use of locality.
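A minimal timing harness for reproducing the comparison (a sketch, not the benchmark behind the figures above; absolute numbers will vary with machine and compiler):

#include <stdio.h>
#include <time.h>

#define N 2048

/* Two 16 MB arrays: static, since they would overflow the stack. */
static int src[N][N], dst[N][N];

static void copyij(int s[N][N], int d[N][N]) {   /* row by row: follows the row-major layout */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = s[i][j];
}

static void copyji(int s[N][N], int d[N][N]) {   /* column by column: 8 KB stride between accesses */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            d[i][j] = s[i][j];
}

static double msec(void (*copy)(int[N][N], int[N][N])) {
    clock_t t0 = clock();
    copy(src, dst);
    return 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;  /* elapsed CPU time in msec */
}

int main(void) {
    printf("copyij: %.1f ms\n", msec(copyij));
    printf("copyji: %.1f ms\n", msec(copyji));
    printf("check: %d\n", dst[N - 1][N - 1]);         /* keep the copies from being optimized away */
    return 0;
}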

  27. Is this due to External Memory?

  28. A Refined Memory Model [memory-hierarchy figure: caches (KB – MB), main memory (GB), disk (100 GB / TB)]

  29. Outline • Algorithms for Large Data Sets • External Memory (Disk): I/O-Efficient Algorithms • Cache: Cache-Oblivious Algorithms • Large and Inexpensive Memories: Resilient Algorithms

  30. Outline • Important issues we are NOT touching • Algs for data streaming • Algs for multicore architectures: Threads, parallelism, etc… • Programming models (MapReduce) • How to write fast code • …

  31. I/O-Efficient Algorithms

  32. Model N: Elements in structure (input size) B: Elements per block M: Elements in main memory Problem starts out on disk Solution is to be written to disk Cost of an algorithm is the number of input and output (I/O) operations. [figure: processor P, main memory M, and disk D, connected by block I/O]

  33. I/O-Efficient Algorithms • Will start with “Simple” Problems • Scanning • Sorting • List ranking

  34. Scanning Scanning N elements stored in blocks costs Θ(N/B) I/Os. Will refer to this bound as scan(N).
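As a concrete (hypothetical) illustration, a scan that reads B items per fread call touches ⌈N/B⌉ blocks. A sketch, usable on the file produced in Experiment 1:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ItemType;

/* Scan a file of N items, reading one block of B items per I/O.
   The number of block reads is ceil(N/B) = Theta(N/B). */
long long scanFile(const char* name, size_t B) {
    FILE* f = fopen(name, "r");
    if (f == NULL) exit(printf("can't open file %s\n", name));
    ItemType* block = malloc(B * sizeof(ItemType));
    if (block == NULL) exit(printf("out of memory\n"));
    long long ios = 0;
    unsigned long long sum = 0;                    /* do something with the data: add it up */
    size_t got;
    while ((got = fread(block, sizeof(ItemType), B, f)) > 0) {
        ios++;
        for (size_t i = 0; i < got; i++) sum += block[i];
    }
    printf("sum = %llu, block I/Os = %lld\n", sum, ios);
    free(block);
    fclose(f);
    return ios;
}

int main(int argc, char** argv) {
    if (argc < 3) exit(printf("Usage: ./Scan fileName itemsPerBlock\n"));
    scanFile(argv[1], (size_t)atoll(argv[2]));
    return 0;
}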

  35. Sorting Sorting is one of the most-studied problems in computer science. In external memory, sorting plays a particularly important role, because it is often a lower bound, and even an upper bound, for other problems. The original paper of Aggarwal and Vitter [AV88] proved that the number of memory transfers needed to sort in the comparison model is Θ((N/B) log_{M/B}(N/B)). We will denote this bound by sort(N). Clearly, sort(N) = Ω(scan(N)).
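One way to see the last claim: the ratio of the two bounds is the logarithmic factor, which is at least 1 as soon as the input does not fit in memory:

\[
\frac{\mathrm{sort}(N)}{\mathrm{scan}(N)} \;=\; \log_{M/B}\frac{N}{B} \;\ge\; 1
\qquad \text{whenever } \frac{N}{B} \ge \frac{M}{B},\ \text{i.e., } N \ge M.
\]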

  36. External Memory Sorting The simplest external-memory algorithm that achieves this bound [AV88] is an (M/B)-way mergesort. During a merge, main memory maintains the first B elements of each list (one block per list), and when a block empties, the next block from that list is loaded. So a merge effectively corresponds to scanning through the entire data, for an overall cost of Θ(N/B) I/Os.
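A toy sketch of that merge step, with in-memory arrays standing in for disk-resident sorted runs and deliberately tiny B and K (an illustration of the one-block-per-run buffering idea, not the course's implementation):

#include <stdio.h>

#define B 4              /* elements per block (toy value) */
#define K 3              /* number of runs merged at once (M/B in the model) */

/* One sorted input run living "on disk" (here: a plain array). */
typedef struct {
    const int *data;     /* the whole run */
    long len;            /* elements in the run */
    long next;           /* next element still "on disk" */
    int  buf[B];         /* the one block of this run kept in memory */
    int  bufLen, bufPos; /* fill level and read position of the buffer */
    long blockReads;     /* number of simulated block I/Os */
} Run;

/* Load the next block of a run into its in-memory buffer. */
static void refill(Run *r) {
    long take = r->len - r->next;
    if (take > B) take = B;
    for (long i = 0; i < take; i++) r->buf[i] = r->data[r->next + i];
    r->next += take;
    r->bufLen = (int)take;
    r->bufPos = 0;
    if (take > 0) r->blockReads++;
}

/* Merge K sorted runs into out[], keeping only one block per run in memory. */
static void kWayMerge(Run runs[K], int *out) {
    long o = 0;
    for (int i = 0; i < K; i++) refill(&runs[i]);
    for (;;) {
        int best = -1;
        for (int i = 0; i < K; i++) {
            if (runs[i].bufPos == runs[i].bufLen) continue;   /* run exhausted */
            if (best < 0 || runs[i].buf[runs[i].bufPos] < runs[best].buf[runs[best].bufPos])
                best = i;
        }
        if (best < 0) break;                                  /* all runs exhausted */
        out[o++] = runs[best].buf[runs[best].bufPos++];
        if (runs[best].bufPos == runs[best].bufLen && runs[best].next < runs[best].len)
            refill(&runs[best]);                              /* block emptied: load the next block */
    }
}

int main(void) {
    int a[] = {1, 4, 7, 10, 13}, b[] = {2, 5, 8, 11}, c[] = {3, 6, 9, 12, 14, 15};
    Run runs[K] = {
        { a, 5, 0, {0}, 0, 0, 0 },
        { b, 4, 0, {0}, 0, 0, 0 },
        { c, 6, 0, {0}, 0, 0, 0 }
    };
    int out[15];
    kWayMerge(runs, out);
    for (int i = 0; i < 15; i++) printf("%d ", out[i]);       /* prints 1 ... 15 in order */
    printf("\nblock reads per run: %ld %ld %ld\n",
           runs[0].blockReads, runs[1].blockReads, runs[2].blockReads);
    return 0;
}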

  37. External Memory Sorting Mainly of theoretical interest… Luckily, there are more practical I/O-efficient algorithms for sorting. The total number of I/Os for this sorting algorithm is given by the recurrence T(N) = (M/B) · T(N / (M/B)) + Θ(N/B), with a base case of T(O(B)) = O(1).

  38. External Memory Sorting T(N) = (M/B) · T(N / (M/B)) + Θ(N/B), T(O(B)) = O(1). At level i of the recursion tree: (M/B)^i nodes, problem sizes N_i = N / (M/B)^i. The number of levels in the recursion tree is O(log_{M/B}(N/B)). The divide-and-merge cost at any level i is Θ(N/B): the recursion tree has Θ(N/B) leaves, for a leaf cost of Θ(N/B); the root node has divide-and-merge cost Θ(N/B) as well, as do all levels in between: (M/B)^i · Θ(N_i/B) = (M/B)^i · Θ((N/(M/B)^i)/B) = Θ(N/B). So the total cost is Θ((N/B) log_{M/B}(N/B)).
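Summing the Θ(N/B) per-level cost over all levels gives the stated bound (a worked version of the argument above):

\[
\sum_{i=0}^{O(\log_{M/B}(N/B))} \left(\frac{M}{B}\right)^{i} \Theta\!\left(\frac{N_i}{B}\right)
\;=\; \sum_{i=0}^{O(\log_{M/B}(N/B))} \Theta\!\left(\frac{N}{B}\right)
\;=\; \Theta\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right).
\]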

  39. List Ranking Given a linked list L, compute for each item in the list its distance from the head. [figure: a six-node list with ranks 1 2 3 4 5 6]

  40. Weighted List Ranking • Can be generalized to weighted ranks: • Given a linked list L, compute for each item in the list its weighted distance from the head. [figure: a six-node list with weights 3 1 5 2 3 1 and weighted ranks 3 4 9 11 14 15]

  41. Why Is List Ranking Non-Trivial? • The internal memory algorithm spends Ω(N) I/Os in the worst case (LRU assumed). [figure: a 16-node list whose consecutive elements are spread across blocks 1 5 9 13 | 2 6 10 14 | 3 7 11 15 | 4 8 12 16, so following successor pointers touches a different block at every step]

  42. I/O-Efficient List Ranking Alg Proposed by Chiang et al. [1995] If the list L fits into internal memory (|L| ≤ M): 1. L is read into internal memory in O(scan(|L|)) I/Os 2. Use trivial list ranking in internal memory 3. The element ranks are written to disk in O(scan(|L|)) I/Os The difficult part is when |L| > M (not a surprise…)
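For concreteness, the “trivial list ranking in internal memory” of step 2 might look like the following array-based sketch (the representation and names are illustrative, not from the slides; the weights are those of the running example):

#include <stdio.h>

#define NIL -1

/* In-memory weighted list ranking: node v has successor succ[v] and weight w[v];
   rank[v] becomes the weighted distance of v from the head (v included). */
void listRank(int head, const int succ[], const int w[], long rank[]) {
    long r = 0;
    for (int v = head; v != NIL; v = succ[v]) {
        r += w[v];
        rank[v] = r;
    }
}

int main(void) {
    /* The six-node example from the slides: weights 3 1 5 2 3 1. */
    int succ[6] = {1, 2, 3, 4, 5, NIL};
    int w[6]    = {3, 1, 5, 2, 3, 1};
    long rank[6];
    listRank(0, succ, w, rank);
    for (int i = 0; i < 6; i++) printf("%ld ", rank[i]);   /* prints 3 4 9 11 14 15 */
    printf("\n");
    return 0;
}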

  43. List Ranking for |L| > M • Assume an independent set I of size at least N/3 can be found in O(sort(N)) I/Os (we’ll see later how). [figure: the list with weights 3 1 5 2 3 1 and the elements of I marked — Scan(|L|)]

  44. List Ranking for |L| > M [figure: the elements of I are bridged out, each removed element’s weight being added to its successor’s; weights 3 1 5 2 3 1 become 3 1 7 4]

  45. Step Analysis • Assume each vertex has a unique numerical ID • Sort elements in L \ I by their numbers • Sort elements in I by the numbers of their successors • Scan the two lists to update the label of succ(v), for every element v ∈ I

  46. Step Analysis • Each vertex has a unique numerical ID • Sort elements in I by their numbers • Sort elements in L \ I by the numbers of their successors • Scan the two lists to update the label of succ(v), for every element v ∈ L \ I

  47. List Ranking for |L| > M Recursive step: rank the contracted list recursively. [figure: the contracted list with weights 3 1 7 4 gets ranks 3 4 11 15]

  48. List Ranking for |L| > M Reintegrate the elements of I, computing their ranks from their predecessors’, in O(Sort(|L|) + Scan(|L|)) I/Os (as before). [figure: the ranks 3 4 11 15 of the contracted list extend to ranks 3 4 9 11 14 15 of the full list]

  49. Recap of the Algorithm [figure: the whole run on the example — weights 3 1 5 2 3 1; the independent set is bridged out, giving contracted weights 3 1 7 4; the contracted list is ranked recursively (3 4 11 15); the ranks are extended back to the full list (3 4 9 11 14 15)]
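Putting the pieces together (a summary; the recurrence is not spelled out on these slides): finding and bridging out the independent set costs O(sort(N)), the recursive call is on a list with at most 2N/3 elements, and reintegration costs O(sort(N) + scan(N)) again, so the total I/O cost T(N) satisfies

\[
T(N) \;\le\; T\!\left(\frac{2N}{3}\right) + O(\mathrm{sort}(N))
\qquad\Longrightarrow\qquad
T(N) \;=\; O(\mathrm{sort}(N)).
\]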
