1 / 52

Beyond SQL: Structured Data Retrieval by Ranking

Beyond SQL: Structured Data Retrieval by Ranking. Chengkai Li October 25, 2006 CS511 guest lecture. About Myself. Advisor: Kevin Chang (UIUC) Research Interests Structured Data Retrieval and Ranking Query Processing Web Information Integration XML Query Processing Misc.

adamdaniel
Télécharger la présentation

Beyond SQL: Structured Data Retrieval by Ranking

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Beyond SQL: Structured Data Retrieval by Ranking Chengkai Li October 25, 2006 CS511 guest lecture

  2. About Myself • Advisor: Kevin Chang (UIUC) • Research Interests • Structured Data Retrieval and Ranking Query Processing • Web Information Integration • XML Query Processing • Misc. • http://www.ews.uiuc.edu/~cli • cli [at] uiuc [dot] edu

  3. Boolean Semantics of SQL: Success and Barrier • Example: SELECT * FROM Homes H WHERE 400K<price<600K AND #bedroom = 4 • Problems: • Query Semantics: • True or False, nothing in the middle • What do we want?: similarity, relevance, preference, etc. • Query Results Presentation: • Too many answers (“information overload” problem): hard to digest and analyze; or too few answers: no enough options to consider • What do we want?: a ranked list

  4. “Google” on Databases:From Boolean Query to Fuzzy Retrieval

  5. Fuzzy Retrieval by Ranking • Ranking (Top-k) Queries • Top k results by a ranking function with multiple criteria • SELECT * FROM Homes H Order By 0.3*big(H.size)+0.7*cheap(H.price) Limit 5

  6. Beyond Boolean Semantics: Ranking Useful and important in many real-world database applications: • E-Commerce Find the best hotel deals by price, distance, etc. • Multimedia Databases Find the most similar images by color, shape, texture, etc. • Text Retrieval and Document Management Find the most relevant records/documents/pages. • OLAP, Decision Support Find the top profitable customers.

  7. Application 1: E-commerce • The home search example • More details later

  8. Application 2: Decision Support • What are the top 5 areas to advertise a new insurance product? SELECT zipcode, AVG(income*w1+age*w2+credit*w3) as score FROM customer WHERE occupation=‘student’ GROUP BY zipcode ORDER BY score LIMIT 5

  9. Application 3: Multimedia Databases • Find the most similar images by color, shape, texture, etc. • Online demo: http://amazon.ece.utexas.edu/~qasim/research.htm

  10. Application 4: Keyword Queries • Text data coexist with structured data • IR-style database queries [slides courtesy of Vagelis Hristidis]

  11. Example – Complaints Database Schema

  12. Example - Complaints Database Data Complaints Customers Products

  13. Example – Keyword Query [Maxtor Netvista] Complaints Customers Products

  14. Keyword Query Semantics (definition of “document” in databases) • Keywords are: • in same tuple • in same relation • in tuples connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • IR-style score of result tree

  15. Example – Keyword Query [Maxtor Netvista] Complaints Customers Products Results: (1) c3, (2) p2c3, (3) p1c1

  16. Great Interest in Database Community • Top-k Queries • Top-k to range query transformation: [ChaudhuriG99] • Indexing: [ChangBC+00], [TsaparasPK+03] • View: [HristidisKP01], [YiYY+03], [DasGK+06] • Query operators and algorithms: [CareyK97], [NatsevCS+01], [IlyasAE03], [IlyasSA+04], [ZhangHC+06] • Ranking algebra: [ChaudhuriRW05] [LiCI+05] • Ranking aggregate queries: [LiCI06] • Other Types of Queries • Keyword queries: [BhalotiaHN+02], [AgrawalCD02], [HristidisP02] • IR-style TFIDF ranking: [AgrawalCD+03] • Ranking of attributes: [DasHK+06] • Preference queries: [AgrawalW00], [Kießling02] • Skyline queries: [BorzsonyiSK01] • DBRank07

  17. Theme Life Cycles: Bursting Topics(Current Bursting) [slide courtesy of Chengixang Zhai]

  18. RankSQL: a RDBMS with Efficient Support of Ranking Queries • Algebraic Foundation and Optimization Framework SPJ queries (SELECT … FROM … WHERE … ORDER BY …) • Ad-Hoc Ranking Aggregate Queries top k groups instead of tuples (SELECT … FROM … WHERE … GROUP BY… ORDER BY …)

  19. Example: Trip Planning Suggest a hotel to stay and a museum to visit: Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area Order By cheap(h.price) + close(h.addr, “BWI airport”) + related(m.collection,“dinosaur”) Limit 5 membership dimension: Boolean predicates, Boolean function B order dimension: ranking predicates, scoring function R

  20. 5 results R sort cheap+close+related B h.area=m.area σh.star=3 scan(m) scan(h) Processing Ranking Queries in Traditional RDBMS Select * From Hotel h, Museum m Where B Order By R Limit 5

  21. 5 results R B Problems of Traditional Approach • NaïveMaterialize-then-Sort scheme • Overkill total order of all results; only 5 top results are requested. • Very inefficient • Scan large base tables • Join large intermediate results • Evaluate every ranking on every tuple • Full sorting

  22. Therefore the problem is: Unlike Boolean constructs, ranking is second class. Ranking is processed as a Monolithic component (R), always after the Boolean component (B).

  23. σh.star=3AND h.area=m.area How did we make Boolean “first class”? Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter

  24. h.area=m.area σh.star=3 scan(m) scan(h) Splitting and Interleaving σh.star=3AND h.area=m.area Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter (2) B is split into joins and selections, which interleave with each other.

  25. Relational Algebra Enables Such Splitting and Interleaving • Algebra: Arabic word, “bring together broken parts”. • Operators: to break apart the Boolean condition (splitting) • selection, projection, join, etc. • Algebraic Laws: to bring them together (interleaving) • pushdown selection: σc(R ΧS) = σc(R) Χ S • split selection: Σc1Λc2 (R) = σc1(σc2 (R)) • Relational algebra is the foundation of query optimization

  26. Rank-Relational Algebra By: • Splitting ranking function • Interleaving ranking with Boolean operations We achieve: • Support ranking as a first-class query type • Integrate rankingwith traditional Boolean constructs • Enable efficient query plans that were unavailable

  27. 5 results 5 results R μclose sort cheap+close+related rank-joinh.area=m.area B μrelated σh.star=3 h.area=m.area σh.star=3 rank-scan(h,cheap) scan(m) scan(m) scan(h) Ranking Query Plan split and interleave: reduction of intermediate results, thus processing cost materialize-then-sort: naïve, overkill

  28. Possibly orders of magnitude improvement Implementation in PostgreSQL plan1: traditional materialize-then-sort plan plan2-4: new ranking query plans Observations: an extended plan space with plans of various costs.

  29. μp3 μp2 Scanp1(H) Example • Ranking function split, predicates evaluated incrementally. • Iterator structure. • Output order: how “promising” according to the evaluated predicates. Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 GetNext() Hp1

  30. hotel upper-bound μp3 hotel upper-bound μp2 hotel upper-bound Scanp1(H) Example Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 Hp1

  31. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  32. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  33. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example h2 2.9 Select * From Hotel H Order By p1+p2+p3 Limit 1

  34. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  35. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  36. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  37. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  38. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  39. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  40. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  41. Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

  42. In contrast: materialize-then-sort Sortp1+p2+p3 Select * From Hotel H Order By p1+p2+p3 Limit 1 Scan(H)

  43. Data Model: of Rank-Relation • Two Logical Properties • Membershipof the tuples: evaluated Boolean predicates • Orderamong the tuples: evaluated ranking predciates rank-relation Membership(B): h.area=m.area, h.star=3 Order(R): close(h.addr, “BWI airport”) related(m.collection,“dinosaur”) cheap(h.price) B Membership(B): h.star=3 Order(R): cheap(h.price) A rank-relation

  44. Ranking Principle:upper-bound determines the order F=cheap + close + related • Processing in the “promising” order, avoiding unnecessary processing. B tuple h1 tuple h2 • Without further processing h1, we cannot output any result; A

  45. Ranking Operators To achieve splitting and interleaving: • New operator: • μ: evaluate ranking predicates piece by piece. • Extended operators: • rank-selection • rank-join • rank-scan • rank-union, rank-intersection.

  46. parser, rewriter operatorsandalgebraic laws Logical Query Plan optimizer Plan enumerator cost model implementation of operators Physical Query Plan Results executor Impact of Rank-Relational Algebra Ranking Query

  47. Skyline Queries • A real estate example dist P4 P3 Skyline on price & dist P1 P5 P2 price • skyline contains points that are not dominated by any other point. [slide courtesy of Yidong Yuan]

  48. Preference Queries • Semantically more flexible than top-k queries • Instead of enforcing total order, it allows partial order. • Efficient processing is more challenging

  49. Motivation - Preferences SELECT * FROM used_cars PREFERRING (make = ‘Honda’ ELSE make = ‘Ford’) AND price AROUND 40000 SELECT * FROM used_cars WHERE (make = ‘Honda’ OR make = ‘Ford’) AND price = 40000

  50. Preferences Semantics: Strict Partial Order • Preference: strict partial order A > B : “I like A better than B” • Example Car Make: {Honda, Ford, Toyota} Honda > Ford > {Toyota, Pontiac}

More Related