
Data structures: time, I/Os, entropy, joules


Presentation Transcript


  1. Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Università di Pisa

  2. Our driving moral: big steps come from theory... but do NOT forget practice ;-)

  3. Strings... why?
  • Ubiquitous: any datum is a sequence of bits, hence a string
  • They spur new problems in many areas:
    • Geometry: string-similarity search → points in high-dimensional space and NN-search; lower/upper bounds to indexing → via reductions to geometric problems
    • Graphs: doc-doc similarity graph → ubiquitous in Text/Web mining; query-log graphs → edge iff 2 queries clicked on the same result page
    • Data compression: shortest paths on char-based weighted graphs [Ferragina et al, SODA 09, ESA 09]

  4. (String-)Dictionary Problem
  Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P.
  Exact search → Hashing [Mitzenmacher, ESA invited '09]

  5. (Compacted) Trie [Fredkin, CACM 1960]
  Dominated the string-matching scene in the '80s-'90s with its suffix-version: the Suffix Tree
  • Performance:
    • Search ≈ O(|P|) time
    • Space ≈ O(K + N)
  [Figure: compacted trie over systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo, with skip values and edge labels]
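
Since the next slides measure everything against this structure, here is a minimal, illustrative Python sketch of the trie idea: a plain (non-compacted) trie whose descent costs O(|P|) character comparisons. All names and the toy dictionary are assumptions made for illustration; a compacted trie would additionally merge unary paths to reach the O(K + N) space bound.

```python
# Minimal (non-compacted) trie sketch for the prefix-search problem above:
# store K strings, report all those starting with pattern P.

class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_end = False     # True if a dictionary string ends here

def insert(root, s):
    node = root
    for c in s:
        node = node.children.setdefault(c, TrieNode())
    node.is_end = True

def prefix_search(root, p):
    """Return all stored strings that have p as a prefix."""
    node = root
    for c in p:                 # O(|P|) descent, as on the slide
        if c not in node.children:
            return []
        node = node.children[c]
    out, stack = [], [(node, p)]
    while stack:                # collect the whole subtree under p
        n, s = stack.pop()
        if n.is_end:
            out.append(s)
        for c, child in n.children.items():
            stack.append((child, s + c))
    return out

root = TrieNode()
for w in ["systile", "syzygetic", "syzygial", "syzygy",
          "szaibelyite", "szczecin", "szomo"]:
    insert(root, w)
print(sorted(prefix_search(root, "syz")))   # syzygetic, syzygial, syzygy
```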

  6. Timeline: theory and practice... What about Software Engineers?
  Trie ('60)  →  Suffix Tree ('70-'80)  →  ... ('90)

  7. (Compacted) Trie [Fredkin, CACM 1960]
  Dominated the string-matching scene in the '80s-'90s with its suffix-version: the Suffix Tree
  • Performance:
    • Search ≈ O(|P|) time
    • Space ≈ O(K + N)
  • ... But in practice...
    • Search: random memory accesses
    • Space: lengths + pointers + strings
  [Figure: the same compacted trie as before]

  8. What did systems implement? They used the compacted trie, of course, but with 2 other concerns because of the large data size.

  9. 1st issue: space concern → Front Coding
  Each string stores the length of the prefix shared with the previous (sorted) string, plus its remaining suffix.
  ….systile syzygetic syzygial syzygy….
  http://checkmate.com/All_Natural/
  http://checkmate.com/All_Natural/Applied.html
  http://checkmate.com/All_Natural/Aroma.html
  http://checkmate.com/All_Natural/Aroma1.html
  http://checkmate.com/All_Natural/Aromatic_Art.html
  http://checkmate.com/All_Natural/Ayate.html
  http://checkmate.com/All_Natural/Ayer_Soap.html
  http://checkmate.com/All_Natural/Ayurvedic_Soap.html
  http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
  http://checkmate.com/All_Natural/Bath_Salts.html
  http://checkmate.com/All/Essence_Oils.html
  http://checkmate.com/All/Mineral_Bath_Crystals.html
  http://checkmate.com/All/Mineral_Bath_Salt.html
  http://checkmate.com/All/Mineral_Cream.html
  http://checkmate.com/All/Natural/Washcloth.html ...
  becomes
  0 http://checkmate.com/All_Natural/  33 Applied.html  34 roma.html  38 1.html  38 tic_Art.html  34 yate.html  35 er_Soap.html  35 urvedic_Soap.html  33 Bath_Salt_Bulk.html  42 s.html  25 Essence_Oils.html  25 Mineral_Bath_Crystals.html  38 Salt.html  33 Cream.html ...
  Bender et al., PODS 2006; Ferragina et al., PODS 2008
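
To make the space saving concrete, here is a small, hedged sketch of front coding over a sorted list. Function names and the three-URL example are illustrative assumptions, not the slide's actual code.

```python
# Front coding sketch: each string becomes (length of prefix shared with the
# previous string, remaining suffix).  Works best on sorted input, as above.

def front_encode(sorted_strings):
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))   # e.g. (33, "Applied.html")
        prev = s
    return encoded

def front_decode(encoded):
    out, prev = [], ""
    for lcp, suffix in encoded:
        s = prev[:lcp] + suffix
        out.append(s)
        prev = s
    return out

urls = ["http://checkmate.com/All_Natural/",
        "http://checkmate.com/All_Natural/Applied.html",
        "http://checkmate.com/All_Natural/Aroma.html"]
enc = front_encode(urls)           # [(0, ...), (33, 'Applied.html'), (34, 'roma.html')]
assert front_decode(enc) == urls   # losslessly reversible
```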

  10. 2nd issue: hierarchical memory
  • Spatial locality or temporal locality → caching → fewer I/Os
  • Goal: fewer and faster I/Os
  [Figure: external-memory model — CPU, internal memory of size M, disk (HD) accessed in blocks of size B; the cost measure counts I/Os]
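
A tiny illustration of "count I/Os": under the external-memory accounting sketched above, a blocked sequential scan costs about N/B block transfers while N random accesses cost about N of them. The numbers below are made-up assumptions.

```python
# Toy I/O accounting: cost = number of block transfers of size B between
# disk and internal memory, not CPU steps.

import math

def ios_sequential_scan(n, B):
    return math.ceil(n / B)        # contiguous layout: ~n/B I/Os

def ios_random_accesses(n):
    return n                       # one block transfer per random access

N, B = 10_000_000, 4096
print(ios_sequential_scan(N, B))   # ~2442 block transfers
print(ios_random_accesses(N))      # 10,000,000 block transfers
```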

  11. 2-level indexing
  Internal memory holds a (Prefix) B-tree / compacted trie built on a sample of the strings; the disk holds the full string set, front-coded inside buckets.
  • 2 advantages:
    • Search ≈ typically 1 I/O
    • Space ≈ front coding over the buckets
  • 2 limitations:
    • Sampling rate ≈ depends on the lengths of the sampled strings
    • Trade-off ≈ speed vs space (because of the bucket size)
  [Figure: (Prefix) B-tree over sampled strings in internal memory pointing to front-coded buckets on disk, e.g. ….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….]
  (A toy sketch of the idea follows.)
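
A minimal sketch of the 2-level idea, under assumed names: sample one string per disk bucket, keep the samples in RAM, and pay (typically) one bucket fetch per query. Real systems use a Prefix B-tree or compacted trie over the samples and front-code the buckets; this toy version also ignores prefix ranges that span several buckets.

```python
# Two-level index sketch: binary search in RAM over the samples, then one
# (simulated) disk access to the selected bucket.

import bisect

class TwoLevelIndex:
    def __init__(self, sorted_strings, bucket_size=4):
        self.buckets = [sorted_strings[i:i + bucket_size]
                        for i in range(0, len(sorted_strings), bucket_size)]
        # in-memory sample: the first string of each bucket
        self.samples = [b[0] for b in self.buckets]

    def lookup(self, p):
        """Return the matches in the bucket where p would fall (~1 I/O).
        Simplification: a prefix range spanning several buckets would
        require scanning the following buckets as well."""
        i = max(bisect.bisect_right(self.samples, p) - 1, 0)
        bucket = self.buckets[i]            # <- the single disk access
        return [s for s in bucket if s.startswith(p)]

idx = TwoLevelIndex(["systile", "syzygetic", "syzygial", "syzygy",
                     "szaibelyite", "szczecin", "szomo"], bucket_size=3)
print(idx.lookup("syzygi"))                 # ['syzygial']
```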

  12. Timeline: theory and practice... Space + Hierarchical Memory
  Do we need to trade space for I/Os?
  Trie ('60)  →  Suffix Tree ('70-'80)  →  2-level indexing ('90)  →  String B-tree (1995)

  13. An old idea: the Patricia Trie [Morrison, J.ACM 1968]
  [Figure: Patricia trie over the same string set, where each node stores only a skip value and one branching character per edge]

  14. A new search [Ferragina-Grossi, J.ACM 1999]
  Three-phase search for P = syzyyea:
  • Phase 1: tree navigation (blind descent, comparing only the branching characters)
  • Phase 2: compute the LCP between P and the single string reached
  • Phase 3: tree navigation to locate P's position
  Only 1 string is checked; trie space ≈ #strings, NOT their total length.
  [Figure: the three phases on the Patricia trie of the previous slide, ending at P's position in ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….]
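
Below is a hedged, partial Python sketch of the first two phases of this search on a toy Patricia trie (phase 3 is only described in a comment). The node layout, the builder and all names are assumptions made for illustration; the point it demonstrates is the slide's one: the descent touches one character per branching node, and a single full string comparison then yields the longest common prefix of P with the whole set.

```python
from itertools import groupby

class Node:
    def __init__(self, skip, leaf=None):
        self.skip = skip            # character position at which children branch
        self.children = {}          # branching char -> child Node
        self.leaf = leaf            # the full string, stored only at leaves

def build(strings, depth=0):
    """Toy Patricia trie over a sorted, prefix-free set of distinct strings."""
    if len(strings) == 1:
        return Node(skip=len(strings[0]), leaf=strings[0])
    lcp = depth                      # extend the common prefix of the group
    while len({s[lcp] for s in strings}) == 1:
        lcp += 1
    node = Node(skip=lcp)
    for ch, grp in groupby(strings, key=lambda s: s[lcp]):
        node.children[ch] = build(list(grp), lcp + 1)
    return node

def blind_descent(root, p):
    """Phase 1: compare only one character per branching node."""
    node = root
    while node.leaf is None:
        ch = p[node.skip] if node.skip < len(p) else ""
        # on a mismatch just follow any edge: the leaf reached still shares
        # a longest common prefix with p among all stored strings
        node = node.children.get(ch) or next(iter(node.children.values()))
    return node.leaf

def lcp_with_set(p, dictionary):
    leaf = blind_descent(build(sorted(dictionary)), p)   # phase 1
    lcp = 0                                              # phase 2: 1 string checked
    while lcp < min(len(p), len(leaf)) and p[lcp] == leaf[lcp]:
        lcp += 1
    # Phase 3 (omitted here): re-descend using lcp and p[lcp] to pin down
    # p's exact lexicographic position, still without touching other strings.
    return leaf, lcp

D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
print(lcp_with_set("syzyyea", D))    # ('syzygetic', 4): max LCP with the set is 4
```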

  15. The String B-tree [Ferragina-Grossi, J.ACM 1999]
  A B-tree with a Patricia trie (PT) in each node: O(log_B K) levels, and only 1 string checked per level in O(p/B) I/Os.
  • Search(P):
    • O((p/B) log_B K) I/Os
    • + O(occ/B) I/Os to report the occurrences
  • It is dynamic...
  More than 15 US patents cite it!  [Handbook of Comp. Biology, 2009]
  Knuth, vol. 3, pag. 489: "elegant"
  [Figure: levels of Patricia tries over string identifiers routing P to its lexicographic position]

  16. I/O-aware algorithms & data structures
  I/Os were the main concern [CACM 1988] [2006]. Huge literature!!

  17. Timeline: theory and practice... Not just 2 memory levels
  Trie ('60)  →  Suffix Tree ('70-'80)  →  2-level indexing ('90)  →  String B-tree (1995)  →  Cache-oblivious algorithms and data structures (1999)
  • Parameter-free solutions: anywhere, anytime, anyway... I/O-optimal!
  See the chapter by Arge, Brodal, Fagerberg
  [Figure: memory hierarchy — registers, L1/L2 caches, RAM, HD, network]

  18. Some precious achievements...
  • Cache-oblivious trie: static dictionary of strings [Brodal et al, SODA 2006]
  • Cache-oblivious String B-tree: dynamic dictionary of strings [Bender et al, PODS 2006]
  • Cache-oblivious tree mapping: Split-and-Refine, which applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003]; worst-case solution [Demaine et al, manuscript 2004] → Patricia Trie

  19. Timeline: theory and practice... Not just 2 memory levels
  Trie ('60)  →  Suffix Tree ('70-'80)  →  2-level indexing ('90)  →  String B-tree (1995)  →  Cache-oblivious data structures (1999)  →  Compressed data structures

  20. A challenging question [Ken Church, AT&T 1995]
  Software engineers use many "squeezing heuristics" that compress data and still support fast access to them.
  Can we "automate" and "guarantee" the process?

  21. Opportunistic Data Structures with Applications — P. Ferragina, G. Manzini (...now J.ACM 2005)
  Aka: compressed self-indexes
  • Space for text + (full-text) index → compressed text (≈ H_k)
  • Query/decompression time → theoretically (quasi-)optimal

  22. The big (unconscious) step... [Burrows-Wheeler, 1994]
  Let us be given a text T = mississippi#. Write down all of its cyclic rotations, then sort the rows:

      mississippi#                #mississippi
      ississippi#m                i#mississipp
      ssissippi#mi                ippi#mississ
      sissippi#mis                issippi#miss
      issippi#miss                ississippi#m
      ssippi#missi   sort the     mississippi#
      sippi#missis     rows       pi#mississip
      ippi#mississ   ------->     ppi#mississi
      ppi#mississi                sippi#missis
      pi#mississip                sissippi#mis
      i#mississipp                ssippi#missi
      #mississippi                ssissippi#mi

  Can we compress it?

  23. The big (unconscious) step... [Burrows-Wheeler, 1994]
  The last column of the sorted rotation matrix above is bwt(T); the left matrix is T read in its rotations.
  bzip2 = BWT + other simple compressors

  24. The big (unconscious) step... [Burrows-Wheeler, 1994]
  Reading the starting positions of the sorted rotations top-to-bottom gives the Suffix Array of T; the last column is bwt(T).
  bzip2 = BWT + other simple compressors
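
A compact sketch of the transform on the slides' running example; it builds the rotation matrix explicitly, which is fine only for illustration (as the slide hints, real constructions go through the suffix array). Function names are mine.

```python
def bwt(t):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    assert t.endswith("#")     # unique end marker, here also smaller than the letters
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

def ibwt(last):
    """Naive inversion, for illustration: repeatedly prepend the last column and re-sort."""
    rows = [""] * len(last)
    for _ in range(len(last)):
        rows = sorted(last[i] + rows[i] for i in range(len(last)))
    return next(row for row in rows if row.endswith("#"))

b = bwt("mississippi#")
print(b)                          # ipssm#pissii
assert ibwt(b) == "mississippi#"  # the transform is reversible
```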

  25. From practice to theory... [Ferragina-Manzini, FOCS '00]
  • FM-index = the BWT is searchable
  • ...or: the Suffix Array is compressible
  • Space = O(|T| H_k) + o(|T|) bits
  • Search(P) = O(p + occ · polylog(|T|)) time
  Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
  [Figure: bwt(T) and the sorted rotations of the running example]
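
To ground the "BWT is searchable" claim, here is a hedged sketch of backward search, the counting step behind the bounds above. The rank/occ queries are computed naively by scanning; a real FM-index answers them from small compressed structures. The literal string below is bwt(mississippi#) from the previous slides; all names are illustrative assumptions.

```python
def count_occurrences(bwt_t, p):
    """Backward search: number of occurrences of p in the text behind bwt_t."""
    # C[c] = number of characters in the text strictly smaller than c
    C, tot = {}, 0
    for c in sorted(set(bwt_t)):
        C[c] = tot
        tot += bwt_t.count(c)

    def occ(c, i):                     # occurrences of c in bwt_t[:i]
        return bwt_t[:i].count(c)      # naive rank; compressed in a real index

    lo, hi = 0, len(bwt_t)             # current range of sorted rotations
    for c in reversed(p):              # extend the match backwards, one char at a time
        if c not in C:
            return 0
        lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("ipssm#pissii", "ssi"))   # 2 occurrences in mississippi#
print(count_occurrences("ipssm#pissii", "iss"))   # 2
```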

  26. Compressed & searchable data formats
  • Texts: FOCS 2000; SODA 2003, 04; SODA 2007; SPIRE 2007; CPM 2008; CPM 2010; ICALP 2010
  • Integer sets: SODA 2002; ...; FOCS 2008; STACS 2009
  • Trees: SODA 2002; SODA 2007; ICALP 2007; SWAT 2008; ICALP 2009; SODA 2010
  • Graphs: DCC 2001; WWW 2004; ISAAC 2007; ESA 2008; FOCS 2009
  • Labeled trees: SODA 2002; FOCS 2005; WWW 2006; SODA 2007; ICDE 2010
  • Functions: ICALP 2003, 04; SODA 2004; ICALP 2008; ESA 2009; LATIN 2010
  • Point sets: SODA 2003; TALG 2007; WADS 2009; SODA 2009
  • Images: DCC 2008

  27. [December 2003] [January 2005]

  28. ACM J. on Experimental Algorithmics, 2009

  29. > 10^3 times faster than Smith-Waterman; > 10^2 times faster than SOAP & Maq
  What about the Web? [Ferragina-Manzini, ACM WSDM 2010]

  30. Where we are nowadays
  Trie ('60)  →  Suffix Tree ('70-'80)  →  2-level indexing ('90)  →  String B-tree (1995)  →  Cache-oblivious data structures (1999)  →  Compressed data structures
  Something is known... yet very preliminary [PODS '08; Navarro, Vitter, ...; Bellazougui et al, this ESA]

  31. What else...
  • Solid-state disks: no mechanical parts; very fast reads, but slow writes & wear leveling [E. Gal, S. Toledo, ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009]
  • Self-adjusting or weighted design: operation times depend on some (un/known) distribution
  • Challenging: no pointers, self-adjusting (performance) vs compression (space)

  32. A bigger challenge: from micro to macro ! IEEE Computer, 2007

  33. Approach #1 (engineering-oriented)
  • News: proper system components + specific algorithms [SSDisks + Atom + Sort]
  • Sanders & Meyer's groups, IEEE Conf. on Green Comp. 2010

  34. Approach #2 (manage resources)
  • Goal: develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability [Susanne Albers, Comm. ACM 2010]
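
As a flavour of this line of work, here is a hedged sketch of the classic competitive power-down rule discussed in that survey: during an idle period, stay active until the energy spent idling equals the cost of a power-down plus wake-up, then switch off; this simple rule is 2-competitive against the optimal offline strategy. All constants and the idle periods below are illustrative assumptions.

```python
# Online power-down (ski-rental style) vs. the clairvoyant optimum.

def energy_online(idle_seconds, active_watts, wakeup_cost_joules):
    threshold = wakeup_cost_joules / active_watts   # break-even idle time
    if idle_seconds <= threshold:
        return active_watts * idle_seconds          # idle ended before powering down
    return active_watts * threshold + wakeup_cost_joules

def energy_offline(idle_seconds, active_watts, wakeup_cost_joules):
    # clairvoyant optimum: power down iff the idle period is long enough
    return min(active_watts * idle_seconds, wakeup_cost_joules)

for idle in (0.5, 2.0, 10.0):                       # seconds of idleness
    on = energy_online(idle, active_watts=1.0, wakeup_cost_joules=3.0)
    off = energy_offline(idle, active_watts=1.0, wakeup_cost_joules=3.0)
    print(f"idle={idle:>4}s  online={on:.1f}J  optimal={off:.1f}J")  # ratio never exceeds 2
```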

  35. Approach #3 (models and algorithms)
  "Algorithmics offers benefits that extend far beyond TCS into the design of systems." [IEEE Computer, 2009]
  Workshop in IEEE Conf. on Green Comp. 2010

  36. Sometimes energy is a primary resource!

  37. Energy-aware algorithms + data structures?
  • Memory-level impacts: locality pays off
  • I/Os and compression are obviously important, BUT here there is a new twist

  38. Battery life!! MIPS per Watt?
  Approach it in a principled way. Who cares whether your application:
  • is y% slower than optimal, but is more energy efficient?
  • occupies x% more space than optimal, but decompression is faster?

  39. Battery life!! MIPS per Watt?
  Idea: multi-objective optimization in data-structure design
  Stay tuned: an Algorithm Library for Mobile Phones

  40. Hbase (Hadoop), BigTable (2006), Cosmos, HyperTable, Cassandra
  Real-time search, Q&A, social search, knowledge search

  41. Many ingredients
  • Items are graphs, vectors, strings, …
  • Their number and size are VERY large
  • They involve many resources to be optimized:
    • Time (speed/patience)
    • Space (#disks/management costs)
    • Bandwidth (speed/€)
    • Energy (€)
  Multi-objective optimization in data-structure design! (A toy sketch follows below.)
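
A minimal sketch of what "multi-objective optimization in data-structure design" could mean operationally: among candidate configurations, keep only the Pareto-optimal ones, i.e. those not beaten on every resource by some other candidate. The (time, space, energy) tuples below are made-up assumptions, not measured data.

```python
# Pareto-front selection over candidate data-structure configurations.

def pareto_front(candidates):
    """candidates: dict name -> tuple of costs (lower is better)."""
    front = {}
    for name, costs in candidates.items():
        dominated = any(other != costs and all(o <= c for o, c in zip(other, costs))
                        for other in candidates.values())
        if not dominated:
            front[name] = costs
    return front

candidates = {
    "plain trie":        (1.0, 10.0, 5.0),   # (time, space, energy), arbitrary units
    "front-coded trie":  (1.3,  4.0, 4.0),
    "compressed index":  (2.0,  2.0, 3.5),
    "strictly worse":    (2.5, 11.0, 6.0),
}
print(sorted(pareto_front(candidates)))   # 'strictly worse' is dominated and dropped
```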

  42. That’s all ! • Look at my paper in the proceedings
