1 / 28

Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies

Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies. George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown). Two observations. Receiptdate. Shipdate. Boston. 71° 05'W.

lula
Télécharger la présentation

Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Correlation Maps:A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)

  2. Two observations

  3. Receiptdate Shipdate Boston 71° 05'W Civic Hybrid Honda 02116 2007 MA Geographic 1. Correlations abound Attributes tend to encode related info (these are soft functional dependencies) {shipdate, receiptdate} {manufacturer, model, year} {zip code, city, state, long/latitude}

  4. 2. Secondary indexes are often useless for range and aggregation queries

  5. SELECT * FROM lineitem WHERE orderdate=‘2009-08-26’ Clusteredaccess pattern Unclusteredaccess pattern Sorted by orderdate (clustered index on orderdate) One seek Many seeks Sorted by order_id (secondary index on orderdate) How can we improve the access pattern of a secondary index?

  6. Our contribution:Exploiting correlationsto improvesecondary index performance

  7. lineitem access pattern SELECT * FROM lineitem WHERE orderdate = 2007-01-03 Clustered by shipdate (correlated) Clustered by primary key (uncorrelated)

  8. Correlation determines index performance Poorly Correlated Different sort orders Very Correlated

  9. Our system: 1. Cost model with correlations 2. Correlation maps 3. Multi-attribute keys 4. Evaluation

  10. c_per_u: average number of clustered attribute values per unclustered attribute value 1. Cost model with correlations 3 c_per_u 2 lookups 10ms 3 levels SELECT * FROM lineitem WHERE receiptdate IN (i, j) 3 pages per shipdate 1ms 20s receiptdate (unclustered) shipdate (clustered) i j

  11. Correlation Map Clustered B+Tree Unclustered Attribute: City Clustered Attribute: State 2. Correlation Maps CREATE TABLE Salaries( State string PRIMARY_KEY, City string, Salary integer); SELECT * FROM Salaries WHERE city=`Boston’;

  12. CMs: Usage • Populated using initial scan of the table • Insertions/deletions: keep a co-occurrence count for each (u, c) pair • Physically stored as a B+Tree in the DB

  13. CMs: Compression • CMs typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table) • Achieves compression by mapping values→ values, not values→tuples • Possible to build many CMs; dedicated CM per query • Improve performance by reducing buffer pool pressure

  14. 3. Multi-attribute keys • Combined attributes may predict the clustered key better than either attr alone • (longitude, latitude) → zip_code • Challenges: • Finding these is non-trivial • Combining attributes leads to many-valued keys leads to large CMs

  15. Clustered Unclustered Clustered Unclustered CM Advisor • The CM Advisor considers all possible attribute combinations for clustered and unclustered keys given a training set of queries • Buckets: collapse a range of key values into one • Bucketing clustered keys • Leads to longer sequential disk reads • Boston:MA versus Boston:MA,MI • Bucketing unclustered keys • Merging two unclustered buckets may increase disk seeks • Boston:MA versus Boise,Boston:ID,MA

  16. 4. Experimental evaluation SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH) SELECT … WHERE City IN (Boston, Springf)

  17. Benefit of correlation SELECT * FROM lineitem WHERE shipdate IN (2009-01-03, …)

  18. eBay category data • Hierarchies of products in categories • antiques→architectural→hardware→locks & keys • 24,000 categories up to 6 levels deep • Clustered by catID • Correlation: catID → price • Generated unique ItemIDs for 43 million rows (3.5GB)

  19. Maintenance costs: CM vs B+Tree Each B+Tree: 1.5GB Each CM: 300K Index updates fit in memory

  20. Mixed workload performance(5 indexes each) Selects slow down inserts even more due to buffer pool pressure! Total B+Tree size: 7.7GB Total CM size: 1.4MB

  21. SDSS Skyserver data • Celestial objects and their optical properties • PhotoObj: right ascension (ra), declination (dec) • Clustered by fieldID • Correlation: (ra, dec) → fieldID • Initial data: 200k tuples • Copied ra and dec windows 10x to produce 20M tuples, 3GB

  22. Multi-attributeindex performance BTree(ra,dec) CM(ra) CM(ra,dec) CM(dec) SELECT COUNT(*)FROM PhotoObjWHERE 193.1 < ra < 194.5AND 1.41 < dec < 1.55AND 23 < g+rho < 25 Correlation: (ra, dec) →fieldID BTree(ra,dec) CM(dec) CM(ra) CM(ra,dec)

  23. Related ideas • BHUNT/CORDS • Similar measure of correlation for query opt. • Doesn’t discuss indexing, no cost model • ADC Clustering • Proposes reclustering, but no cost model/designer • Microsoft SQL Server: datetime clustering • Limited to datetime types • Index compression (Prefix B+Tree, delta encoding, …) • Compression rates in the range of 2x

  24. Summary • Correlations between attributes arise naturally in a variety of applications • Correlations determine the cost of secondary index lookups • We presented a correlation-aware cost model and advisor to decide when to build CMs • Multi-attribute CMs capture more correlations; bucketing keeps them tiny • Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically 10-1000x smaller than secondary B+Trees

  25. Model accuracy SELECT Avg(Price) FROM Ebay WHERE Category=X

  26. Isolated CM performance vs.secondary B+Tree Slightly slower on isolated query; CM must filter unmatching tuples CM: 900KB B+Tree: 860MB

  27. Bucketing • Random-sample synopsis from table • Try unclustered bucket sizes: 2², 2³, … • Output candidates grouped by size, ordered by c_per_u Acceptable performance Smaller size

More Related