520 likes | 688 Vues
Beyond SQL: Structured Data Retrieval by Ranking. Chengkai Li October 25, 2006 CS511 guest lecture. About Myself. Advisor: Kevin Chang (UIUC) Research Interests Structured Data Retrieval and Ranking Query Processing Web Information Integration XML Query Processing Misc.
 
                
                E N D
Beyond SQL: Structured Data Retrieval by Ranking Chengkai Li October 25, 2006 CS511 guest lecture
About Myself • Advisor: Kevin Chang (UIUC) • Research Interests • Structured Data Retrieval and Ranking Query Processing • Web Information Integration • XML Query Processing • Misc. • http://www.ews.uiuc.edu/~cli • cli [at] uiuc [dot] edu
Boolean Semantics of SQL: Success and Barrier • Example: SELECT * FROM Homes H WHERE 400K<price<600K AND #bedroom = 4 • Problems: • Query Semantics: • True or False, nothing in the middle • What do we want?: similarity, relevance, preference, etc. • Query Results Presentation: • Too many answers (“information overload” problem): hard to digest and analyze; or too few answers: no enough options to consider • What do we want?: a ranked list
Fuzzy Retrieval by Ranking • Ranking (Top-k) Queries • Top k results by a ranking function with multiple criteria • SELECT * FROM Homes H Order By 0.3*big(H.size)+0.7*cheap(H.price) Limit 5
Beyond Boolean Semantics: Ranking Useful and important in many real-world database applications: • E-Commerce Find the best hotel deals by price, distance, etc. • Multimedia Databases Find the most similar images by color, shape, texture, etc. • Text Retrieval and Document Management Find the most relevant records/documents/pages. • OLAP, Decision Support Find the top profitable customers.
Application 1: E-commerce • The home search example • More details later
Application 2: Decision Support • What are the top 5 areas to advertise a new insurance product? SELECT zipcode, AVG(income*w1+age*w2+credit*w3) as score FROM customer WHERE occupation=‘student’ GROUP BY zipcode ORDER BY score LIMIT 5
Application 3: Multimedia Databases • Find the most similar images by color, shape, texture, etc. • Online demo: http://amazon.ece.utexas.edu/~qasim/research.htm
Application 4: Keyword Queries • Text data coexist with structured data • IR-style database queries [slides courtesy of Vagelis Hristidis]
Example - Complaints Database Data Complaints Customers Products
Example – Keyword Query [Maxtor Netvista] Complaints Customers Products
Keyword Query Semantics (definition of “document” in databases) • Keywords are: • in same tuple • in same relation • in tuples connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • IR-style score of result tree
Example – Keyword Query [Maxtor Netvista] Complaints Customers Products Results: (1) c3, (2) p2c3, (3) p1c1
Great Interest in Database Community • Top-k Queries • Top-k to range query transformation: [ChaudhuriG99] • Indexing: [ChangBC+00], [TsaparasPK+03] • View: [HristidisKP01], [YiYY+03], [DasGK+06] • Query operators and algorithms: [CareyK97], [NatsevCS+01], [IlyasAE03], [IlyasSA+04], [ZhangHC+06] • Ranking algebra: [ChaudhuriRW05] [LiCI+05] • Ranking aggregate queries: [LiCI06] • Other Types of Queries • Keyword queries: [BhalotiaHN+02], [AgrawalCD02], [HristidisP02] • IR-style TFIDF ranking: [AgrawalCD+03] • Ranking of attributes: [DasHK+06] • Preference queries: [AgrawalW00], [Kießling02] • Skyline queries: [BorzsonyiSK01] • DBRank07
Theme Life Cycles: Bursting Topics(Current Bursting) [slide courtesy of Chengixang Zhai]
RankSQL: a RDBMS with Efficient Support of Ranking Queries • Algebraic Foundation and Optimization Framework SPJ queries (SELECT … FROM … WHERE … ORDER BY …) • Ad-Hoc Ranking Aggregate Queries top k groups instead of tuples (SELECT … FROM … WHERE … GROUP BY… ORDER BY …)
Example: Trip Planning Suggest a hotel to stay and a museum to visit: Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area Order By cheap(h.price) + close(h.addr, “BWI airport”) + related(m.collection,“dinosaur”) Limit 5 membership dimension: Boolean predicates, Boolean function B order dimension: ranking predicates, scoring function R
5 results R sort cheap+close+related B h.area=m.area σh.star=3 scan(m) scan(h) Processing Ranking Queries in Traditional RDBMS Select * From Hotel h, Museum m Where B Order By R Limit 5
5 results R B Problems of Traditional Approach • NaïveMaterialize-then-Sort scheme • Overkill total order of all results; only 5 top results are requested. • Very inefficient • Scan large base tables • Join large intermediate results • Evaluate every ranking on every tuple • Full sorting
Therefore the problem is: Unlike Boolean constructs, ranking is second class. Ranking is processed as a Monolithic component (R), always after the Boolean component (B).
σh.star=3AND h.area=m.area How did we make Boolean “first class”? Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter
h.area=m.area σh.star=3 scan(m) scan(h) Splitting and Interleaving σh.star=3AND h.area=m.area Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter (2) B is split into joins and selections, which interleave with each other.
Relational Algebra Enables Such Splitting and Interleaving • Algebra: Arabic word, “bring together broken parts”. • Operators: to break apart the Boolean condition (splitting) • selection, projection, join, etc. • Algebraic Laws: to bring them together (interleaving) • pushdown selection: σc(R ΧS) = σc(R) Χ S • split selection: Σc1Λc2 (R) = σc1(σc2 (R)) • Relational algebra is the foundation of query optimization
Rank-Relational Algebra By: • Splitting ranking function • Interleaving ranking with Boolean operations We achieve: • Support ranking as a first-class query type • Integrate rankingwith traditional Boolean constructs • Enable efficient query plans that were unavailable
5 results 5 results R μclose sort cheap+close+related rank-joinh.area=m.area B μrelated σh.star=3 h.area=m.area σh.star=3 rank-scan(h,cheap) scan(m) scan(m) scan(h) Ranking Query Plan split and interleave: reduction of intermediate results, thus processing cost materialize-then-sort: naïve, overkill
Possibly orders of magnitude improvement Implementation in PostgreSQL plan1: traditional materialize-then-sort plan plan2-4: new ranking query plans Observations: an extended plan space with plans of various costs.
μp3 μp2 Scanp1(H) Example • Ranking function split, predicates evaluated incrementally. • Iterator structure. • Output order: how “promising” according to the evaluated predicates. Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 GetNext() Hp1
hotel upper-bound μp3 hotel upper-bound μp2 hotel upper-bound Scanp1(H) Example Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 Hp1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example h2 2.9 Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1
In contrast: materialize-then-sort Sortp1+p2+p3 Select * From Hotel H Order By p1+p2+p3 Limit 1 Scan(H)
Data Model: of Rank-Relation • Two Logical Properties • Membershipof the tuples: evaluated Boolean predicates • Orderamong the tuples: evaluated ranking predciates rank-relation Membership(B): h.area=m.area, h.star=3 Order(R): close(h.addr, “BWI airport”) related(m.collection,“dinosaur”) cheap(h.price) B Membership(B): h.star=3 Order(R): cheap(h.price) A rank-relation
Ranking Principle:upper-bound determines the order F=cheap + close + related • Processing in the “promising” order, avoiding unnecessary processing. B tuple h1 tuple h2 • Without further processing h1, we cannot output any result; A
Ranking Operators To achieve splitting and interleaving: • New operator: • μ: evaluate ranking predicates piece by piece. • Extended operators: • rank-selection • rank-join • rank-scan • rank-union, rank-intersection.
parser, rewriter operatorsandalgebraic laws Logical Query Plan optimizer Plan enumerator cost model implementation of operators Physical Query Plan Results executor Impact of Rank-Relational Algebra Ranking Query
Skyline Queries • A real estate example dist P4 P3 Skyline on price & dist P1 P5 P2 price • skyline contains points that are not dominated by any other point. [slide courtesy of Yidong Yuan]
Preference Queries • Semantically more flexible than top-k queries • Instead of enforcing total order, it allows partial order. • Efficient processing is more challenging
Motivation - Preferences SELECT * FROM used_cars PREFERRING (make = ‘Honda’ ELSE make = ‘Ford’) AND price AROUND 40000 SELECT * FROM used_cars WHERE (make = ‘Honda’ OR make = ‘Ford’) AND price = 40000
Preferences Semantics: Strict Partial Order • Preference: strict partial order A > B : “I like A better than B” • Example Car Make: {Honda, Ford, Toyota} Honda > Ford > {Toyota, Pontiac}