Algorithms and data structures for big data , what ’ s next?

Paolo Ferragina, University of Pisa.

Presentation Transcript


  1. Algorithms and data structures for big data, what’s next? Paolo Ferragina University of Pisa

  2. Is Big Data a buzzword?

  3. “Big Data” vs “Grid Computing”

  4. VLDB has existed since 1992

  5. Big data, big impact!

  6. Big data are everywhere!

  7. NoSQL [Proc. OSDI 2006] • Hadoop • Cassandra • HyperTable • Cosmos

  8. From macro to micro-users: energy is related to time and memory accesses in an intricate way, so the issue “algo + memory levels” is key for everyday users, not only for big players

  9. Our driving moral... Big steps come from theory ... but do NOT forget practice ;-)

  10. Our running example

  11. (String-)Dictionary Problem: given a dictionary D of K strings, of total length N, store them so that we can efficiently support prefix searches for a pattern P. Exact search ⇒ Hashing
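A minimal sketch of the prefix-search interface stated on this slide, assuming the dictionary fits in memory as a sorted Python list. The function name `prefix_search` and the list layout are illustrative assumptions, not taken from the slides. Binary search locates the first candidate, then a short scan collects the matches.

```python
from bisect import bisect_left

def prefix_search(sorted_dict, pattern):
    """Return all strings in sorted_dict that start with pattern."""
    # First position whose string is >= pattern in lexicographic order.
    lo = bisect_left(sorted_dict, pattern)
    hi = lo
    # Strings sharing a prefix are contiguous in a sorted list.
    while hi < len(sorted_dict) and sorted_dict[hi].startswith(pattern):
        hi += 1
    return sorted_dict[lo:hi]

# The running example dictionary used throughout the slides.
D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
```

Calling `prefix_search(D, "syzyg")` returns the three strings that start with `syzyg`. Hashing would answer exact lookups only, which is why the slide separates the two cases.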

  12. (Compacted) Trie [Fredkin, CACM 1960] • Dominated the string-matching scene in the ’80s-’90s; the best known is the Suffix Tree • Performance: • Search ≈ O(|P|) time • Space ≈ O(N) • Software engineers objected: • Search: random memory accesses • Space: pointers + strings • Lexicographic search: P = systo • [Figure: compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo, stored contiguously as systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo]
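The trie idea can be sketched as follows. This is a plain (uncompacted) trie with one node per character, chosen for brevity; the compacted variant on the slide merges unary paths into single edges. The names `TrieNode`, `trie_insert` and `trie_prefix_search` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a dictionary string ends here

def trie_insert(root, s):
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def trie_prefix_search(root, pattern):
    """Return, in lexicographic order, all stored strings starting with pattern."""
    node = root
    for ch in pattern:               # O(|P|) navigation, one pointer hop per char
        node = node.children.get(ch)
        if node is None:
            return []
    out, stack = [], [(node, pattern)]
    while stack:                     # depth-first visit of the subtree below P
        node, acc = stack.pop()
        if node.is_end:
            out.append(acc)
        # Push children in reverse order so the smallest char is popped first.
        for ch in sorted(node.children, reverse=True):
            stack.append((node.children[ch], acc + ch))
    return out

root = TrieNode()
for word in ["systile", "syzygetic", "syzygial", "syzygy",
             "szaibelyite", "szczecin", "szomo"]:
    trie_insert(root, word)
```

Each step of the navigation follows a pointer to a node that may live anywhere in memory, which is the "random memory accesses" objection listed on the slide.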

  13. Timeline: theory and practice... Trie (’60) • Suffix Tree (’70-’80) • ’90: What about software engineers??

  14. What did systems implement? • They used the compacted trie, of course, but with 2 further concerns because of large data

  15. 1st issue: space concern • Front Coding • Example: systile, syzygetic, syzygial, syzygy, … is stored as 0,systile 2,zygetic 5,ial 5,y … • (slide annotation: 3345%) • URL example, plain: http://checkmate.com/All_Natural/ http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html http://checkmate.com/All/Natural/Washcloth.html ... • URL example, front-coded: 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html 25 Essence_Oils.html 25 Mineral_Bath_Crystals.html 38 Salt.html 33 Cream.html 0 http://checkmate.com/All/Natural/Washcloth.html ...
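Front coding as shown on this slide can be sketched in a few lines: each string is stored as the length of the prefix it shares with its predecessor, followed by the remaining suffix. The function names `front_encode` and `front_decode` are illustrative assumptions, and the input is assumed to be already sorted.

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length with previous string, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        k = 0
        while k < min(len(prev), len(s)) and prev[k] == s[k]:
            k += 1
        out.append((k, s[k:]))
        prev = s
    return out

def front_decode(pairs):
    """Rebuild the original strings by copying k chars from the previously decoded one."""
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
coded = front_encode(words)
```

On the slide's four example strings this produces the pairs 0,systile 2,zygetic 5,ial 5,y. Decoding is sequential, so reaching one string requires decoding all the strings before it in the same run.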

  16. 2nd issue: disk memory • 2 main features: • Seek time = I/Os are costly • Blocked access = B items per I/O • Count I/Os • Why are strings challenging? Strings may be arbitrarily long • [Figure: CPU, internal memory, and a disk track with a block of size B]

  17. 2-level indexing • Internal memory: compacted trie (CT) on a sample of the strings (systile, szaibelyite) • Disk: front-coded buckets of size B: …. 0systile 2zygetic 5ial 5y | 0szaibelyite 2czecin 2omo …. • 2 advantages: • Search ≈ typically 1 disk access • Space ≈ front coding over buckets • One main limitation: sampling rate & lengths of sampled strings • Trade-off between speed and space (because of bucket size) • (Prefix) B-tree
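A rough sketch of the 2-level scheme, under simplifying assumptions: the in-memory sample is a sorted list searched by binary search (the slide uses a compacted trie), and each "disk" bucket is a plain Python list (the slide front-codes it). The names `build_two_level` and `lookup` are hypothetical.

```python
from bisect import bisect_right

def build_two_level(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets and sample the first string of each."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [b[0] for b in buckets]   # kept in internal memory
    return sample, buckets             # buckets stand in for disk blocks

def lookup(sample, buckets, pattern):
    """Return True if pattern is in the dictionary, touching a single bucket."""
    i = bisect_right(sample, pattern) - 1   # last bucket whose first string <= pattern
    if i < 0:
        return False                        # pattern precedes every stored string
    return pattern in buckets[i]            # the one simulated disk access

D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
sample, buckets = build_two_level(D, 4)
```

With a bucket size of 4 the sample is `systile`, `szaibelyite`, matching the slide. Larger buckets shrink the in-memory sample but make each bucket scan longer, which is the speed versus space trade-off the slide mentions.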

  18. Timeline: theory and practice... Trie (’60) • Suffix Tree (’70-’80) • 2-level indexing (’90) • String B-tree (1995) • Space + hierarchical memory • Do we need to trade space for I/Os?

  19. An old idea: Patricia Trie [Morrison, J.ACM 1968] • [Figure: Patricia trie over the strings kept on disk as ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….]

  20. A new (lexicographic) search [Ferragina-Grossi, J.ACM 1999] • Lexicographic search: P = syzytea • Search(P): • Phase 1: tree navigation • Phase 2: compute LCP • Phase 3: tree navigation • Result: lexicographic position of P • Only 1 string is checked on disk • Trie space ≈ #strings, NOT their length • [Figure: Patricia trie storing only branching characters, over the strings kept on disk as ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….]

  21. The String B-tree [Ferragina-Grossi, J.ACM 1999] • Search(P): O((p/B) log_B K) + O(occ/B) I/Os • Check 1 string = O(p/B) I/Os • O(log_B K) levels • It is dynamic... • > 15 US patents cite it!! • Knuth, vol. 3, p. 489: “elegant” • [Figure: B-tree whose nodes each hold a Patricia trie (PT) over pointers to the strings; the search ends at the lexicographic position of P]

  22. I/O-aware algorithms & data structures • I/Os were the main concern [CACM 1988] [2006] • Huge literature!!

  23. Timeline: theory and practice... Trie (’60) • Suffix Tree (’70-’80) • 2-level indexing (’90) • String B-tree (1995) • 1999: not just 2 memory levels (CPU registers, L1/L2 cache, RAM, HD, net) • Cache-oblivious solutions, aka parameter-free algo+ds • Anywhere, anytime, anyway... I/O-optimal!!

  24. Timeline: theory and practice... Trie (’60) • Suffix Tree (’70-’80) • 2-level indexing (’90) • String B-tree (1995) • Cache-oblivious data structures (1999): not just 2 memory levels • Compressed data structures: space

  25. A challenging question [Ken Church, AT&T 1995] • Software engineers use “squeezing heuristics” that compress data and still support fast access to them • Can we “automate” and “guarantee” the process?

  26. Opportunistic Data Structures with Applications, P. Ferragina, G. Manzini (...now, J.ACM 2005) • Aka: compressed self-indexes • Space for text+index ≈ space for compressed text only (≈ H_k) • Query/decompression time ≈ theoretically (quasi-)optimal

  27. The big (unconscious) step... [Burrows-Wheeler, 1994] • Let us be given a text T = mississippi# • Write all its rotations: mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, issippi#miss, ssippi#missi, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi • Sort the rows: #mississippi, i#mississipp, ippi#mississ, issippi#miss, ississippi#m, mississippi#, pi#mississip, ppi#mississi, sippi#missis, sissippi#mis, ssippi#missi, ssissippi#mi • Highly compressible, but…

  28. The big (unconscious) step... [Burrows-Wheeler, 1994] • Let us be given a text T = mississippi# • Sort the rows (the rotations of T) • bwt(T) = last column of the sorted rows = ipssm#pissii • bzip2 = BWT + other simple compressors
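The transform on these two slides can be reproduced with a deliberately naive sketch that sorts all rotations explicitly. It assumes the text ends with a unique end marker `#` that sorts before every other character, as in the slide's example. The function names are illustrative, and real compressors avoid materializing the rotations.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations of text."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, end="#"):
    """Invert the transform by repeatedly prepending the last column and re-sorting."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        table = sorted(last[i] + table[i] for i in range(n))
    # The original text is the only rotation that ends with the end marker.
    return next(row for row in table if row.endswith(end))

T = "mississippi#"
L = bwt(T)
```

The output groups equal characters into runs (for example `ss` and `ii`), which is what makes it easy to compress with simple follow-up coders.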

  29. From practice to theory... [Ferragina-Manzini, IEEE FOCS ’00] • sa(T) with the sorted rows: 12 #mississippi, 11 i#mississipp, 8 ippi#mississ, 5 issippi#miss, 2 ississippi#m, 1 mississippi#, 10 pi#mississip, 9 ppi#mississi, 7 sippi#missis, 4 sissippi#mis, 6 ssippi#missi, 3 ssissippi#mi • bwt(T) = last column of the sorted rows • FM-index = BWT is searchable • ...or Suffix Array is compressible • Space = l |T| H_k + o(|T|) bits • Search(P) = O(p + occ * polylog(|T|)) • Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
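"BWT is searchable" can be illustrated with a backward-search sketch that counts the occurrences of a pattern using only the last column and a table of character counts. This is an uncompressed toy version under stated assumptions: the rank queries are done by rescanning the string, whereas a real FM-index answers them from compressed structures. The names `build_index` and `count_occurrences` are hypothetical.

```python
from collections import Counter

def build_index(text):
    """Return (last, c_table): the BWT of text and, per char, how many chars are smaller."""
    n = len(text)
    rows = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(r[-1] for r in rows)
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return last, c_table

def count_occurrences(last, c_table, pattern):
    """Count occurrences of pattern by scanning it right to left (backward search)."""
    lo, hi = 0, len(last)            # current range of sorted rows, half-open
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Naive rank: number of ch in last[:i]. An FM-index makes this fast.
        lo = c_table[ch] + last[:lo].count(ch)
        hi = c_table[ch] + last[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

last, c_table = build_index("mississippi#")
```

The loop performs one step per pattern character, which corresponds to the `p` term in the slide's search bound. Locating the positions of the occurrences would need sampled suffix-array entries, which this sketch omits.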

  30. Compressed & searchable data formats • After our paper in FOCS 2000, about texts, we nowadays find compressed indexes for: • Trees • Labeled trees and graphs • Functions • Integer sets • Geometry • Images • ...

  31. From theory to practice… December 2003

  32. ACM J. on Experimental Algorithmics, 2009

  33. > 10^3 times faster than Smith-Waterman • > 10^2 times faster than SOAP & Maq

  34. What about the Web? [Ferragina-Manzini, ACM WSDM 2010]

  35. IEEE FOCS 2005 • WWW 2006 • J. ACM 2009 • US Patent 2012 • An XML excerpt: <dblp> <book> <author> Donald E. Knuth</author> <title> The TeXbook </title> <publisher> Addison-Wesley </publisher> <year> 1986 </year> </book> <article> <author> Donald E. Knuth </author> <author> Ronald W. Moore </author> <title> An Analysis of Alpha-Beta Pruning </title> <pages> 293-326 </pages> <year> 1975 </year> <volume> 6 </volume> <journal> Artificial Intelligence </journal> </article> ... </dblp>

  36. A tree interpretation: XBW transform • XML document exploration → Tree navigation • XML document search → Labeled subpath searches

  37. XBW Transform: some performance figures • Xerces is better on smaller files • Xerces is worse on larger files • Xerces uses 10x space • [Plot: number of searches per second on larger and larger datasets]

  38. Where we are nowadays • Trie (’60) • Suffix Tree (’70-’80) • 2-level indexing (’90) • String B-tree (1995) • Cache-oblivious data structures (1999) • Compressed data structures • Something is known... yet very preliminary • Lower bounds derived from geometry: text search = 2d range search

  39. New food for research... • Solid-state disks: no mechanical parts (40 GB, about $100) [E. Gal, S. Toledo. ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009] • ... very fast reads, but slow writes & wear leveling • Self-adjusting or weighted design • Time of ops depends on some (un/known) distribution • Challenge: no pointers, self-adjust (perf) vs compression (space) [Ferragina et al, ESA 2011]

  40. The energy challenge IEEE Computer, 2007

  41. Browsing a web site The most used!

  42. Yet today, it is a problem... • Apple is still working on the battery life problem: “The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues.” (Nov 2011, wired.com) • “Windows 8's power hygiene: the scheduler will ignore the unused software” (Feb 2012, MSDN)

  43. Energy-aware Algo+Ds? • Memory-level impacts • Locality pays off • I/Os and compression are obviously important, BUT here there is a new twist

  44. Battery life!! MIPS per Watt? • Who cares whether your application: • is y% slower than optimal, but more energy efficient? • takes x% more space than optimal, but is more energy efficient? • Idea: multi-objective optimization in data-structure design • Approach it in a principled way

  45. A preliminary step • Took inspiration from BigTable (Google), ... • Design a compressed storage scheme that can trade, in a principled way, between space and decompression time [vs energy efficiency] • Requirements: gzip-like compression [like Snappy by Google, or LZ4] • Goal: fix the space occupancy, then find the best compression that achieves that space and minimizes the decompression time (or vice versa) • Example: with [abrac] already parsed, the rest of abracadabra is parsed as (a) (d) (abra) and encoded as <2,1> (copy back), <0,d> (new char), <7,4> (copy back)
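The phrase notation in the example can be made concrete with a small decoder sketch. It assumes the convention read off the slide: `<d, l>` with d > 0 copies l characters starting d positions back, and `<0, c>` emits the new character c. How the first five characters `abrac` were parsed is not shown on the slide, so they are supplied here as literals; `lz_decode` is a hypothetical name.

```python
def lz_decode(phrases):
    """Decode a list of (distance, arg) phrases into the original text."""
    out = []
    for dist, arg in phrases:
        if dist == 0:
            out.append(arg)              # <0, c>: emit the new character c
        else:
            for _ in range(arg):         # <d, l>: copy l chars from d positions back
                out.append(out[-dist])
    return "".join(out)

# Assumption: the already-parsed prefix "abrac" is given as five literals.
prefix = [(0, "a"), (0, "b"), (0, "r"), (0, "a"), (0, "c")]
phrases = prefix + [(2, 1), (0, "d"), (7, 4)]
text = lz_decode(phrases)
```

Different parsings of the same text decode to the same output but differ in the number of phrases and in how far back the copies reach, which is the space versus decompression-time trade-off the slide wants to control.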

  46. A preliminary step... • Modeled as a Constrained Shortest Path problem: • Nodes = one per char of the text to be compressed • Edges = single char or copy-back substrings • 2 edge weights = decompression time (t) and compressed space (c) • LZ-parsing = path from 1 to 12 • NP-hard in general; this special case is POLY: O(n^3), but n is huge and m might be n^2 • We solved it heuristically (Lagrangian Dual) and provably (Path Swap)
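The flavor of the Lagrangian approach can be shown on a toy graph. This is only a sketch under stated assumptions, not the authors' algorithm: the two weights are folded into the single cost t + λ·c, the resulting shortest path on the DAG is found by one forward pass, and λ is bisected until the space budget is met. The names and the edge format `(from, to, time, space)` are hypothetical, and the returned path is feasible but not guaranteed optimal.

```python
def shortest_path(n, edges, lam):
    """Cheapest path from node 0 to node n under the cost time + lam * space."""
    inf = float("inf")
    cost = [inf] * (n + 1)
    back = [None] * (n + 1)
    cost[0] = 0.0
    # Edges only go forward in the text, so sorting by source is a topological order.
    for e in sorted(edges):
        u, v, t, c = e
        if cost[u] + t + lam * c < cost[v]:
            cost[v] = cost[u] + t + lam * c
            back[v] = e
    if cost[n] == inf:
        return None
    path, v = [], n
    while v != 0:
        path.append(back[v])
        v = back[v][0]
    path.reverse()
    return sum(e[2] for e in path), sum(e[3] for e in path), path

def constrained_parse(n, edges, budget, rounds=60):
    """Heuristic: return (time, space, path) with space <= budget, or None."""
    best = shortest_path(n, edges, 0.0)
    if best is None or best[1] <= budget:
        return best                      # the fastest path already fits (or no path)
    hi = 1.0
    cand = shortest_path(n, edges, hi)
    while cand[1] > budget:              # raise the space penalty until feasible
        hi *= 2
        if hi > 1e12:
            return None                  # even the smallest parsing exceeds the budget
        cand = shortest_path(n, edges, hi)
    lo = 0.0
    for _ in range(rounds):              # bisect for the smallest feasible penalty
        mid = (lo + hi) / 2
        trial = shortest_path(n, edges, mid)
        if trial[1] <= budget:
            hi, cand = mid, trial
        else:
            lo = mid
    return cand

# Toy instance: two single-char edges (fast, large) versus one copy edge (slow, small).
toy_edges = [(0, 1, 1, 4), (1, 2, 1, 4), (0, 2, 5, 3)]
```

With a budget of 8 the sketch keeps the fast two-edge path; with a budget of 5 it switches to the single slower but smaller edge; with a budget of 2 it reports that no parsing fits.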

  47. A preliminary step...
