Query-Based Data Pricing
Paraschos Koutris, Prasang Upadhyaya, Magdalena Balazinska, Bill Howe, Dan Suciu
University of Washington
PODS 2012

Motivation
• Data is increasingly sold and bought on the web
• Websites that sell data:
  • AggData [www.aggdata.com]
  • Xignite (financial data) [www.xignite.com]
  • Gnip (social media) [www.gnip.com]
• Data marketplace services:
  • Windows Azure Marketplace (100+ datasets) [datamarket.azure.com]
  • Infochimps (15,000 datasets) [www.infochimps.com]
Query-based pricing customized for buyers

Current Pricing (1)
• A fixed price for the whole dataset or for a specific set of views
• Example: CustomLists
  • USA Business Database for $399
  • Email addresses for $299
  • Businesses in WA for $199
• Limitations (queries that cannot be bought directly):
  • Restaurants in WA?
  • Businesses in cities with population > 100,000?

Current Pricing (2)
• API subscriptions (Azure Marketplace, Infochimps)
  • Allow queries over the data
  • Pay by number of transactions (a transaction is one page of results)

Issues With Pricing
• Buyers today need to buy a superset of the data they are interested in
• Sellers can't easily anticipate all the queries that buyers might ask
• Solution: we need a more flexible pricing scheme, parameterized by queries

Outline
• The Pricing Framework
• The Pricing Formula
• The Complexity of Pricing
• Dichotomy and Algorithms for Selections

The Pricing Framework
• The seller defines price points (view-price pairs): S = {(V_1, p_1), (V_2, p_2), …}
• A buyer can buy any query Q
• The system will compute price_D^S(Q)
[Figure: Buyer, Seller, and the Pricing System + Database D. The seller supplies (V_1, p_1), (V_2, p_2), …; the buyer asks for Q(D) and is charged price_D^S(Q).]

Instance-Based Determinacy
Definition. V = V_1, …, V_k determine Q given D, denoted D ⊢ V ↠ Q, if:
  for all D', if V(D) = V(D'), then Q(D) = Q(D')
Intuitively, "V_1, …, V_k determine Q" means that Q(D) can be answered from V_1(D), …, V_k(D) alone, without accessing the database instance D

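The definition can be checked directly on toy inputs by enumerating every candidate instance D'. The sketch below is only an illustration under assumptions that are not in the talk: a single hypothetical relation S(shape, color), a tiny finite domain, and the helper names `all_instances`, `determines`, and `select`. The enumeration is exponential in the number of candidate tuples, so this is not a practical algorithm.

```python
from itertools import combinations

def all_instances(universe):
    """Yield every database instance, i.e. every subset of the candidate tuples."""
    items = list(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def determines(views, query, db, universe):
    """Return True if D |- V ->> Q: every D' that agrees with D on all views
    also agrees with D on the query (checked by exhaustive enumeration)."""
    view_answers = [v(db) for v in views]
    query_answer = query(db)
    for other in all_instances(universe):
        if [v(other) for v in views] == view_answers and query(other) != query_answer:
            return False
    return True

def select(column, value):
    """Selection view: all tuples whose given column equals value."""
    return lambda d: frozenset(t for t in d if t[column] == value)

# Hypothetical toy data (not the database shown in the slides).
shapes, colors = ["dragon", "crane"], ["red", "blue"]
universe = [(s, c) for s in shapes for c in colors]
db = frozenset({("dragon", "red"), ("crane", "blue")})

full = lambda d: d                      # Q(x, y) :- S(x, y)
by_shape = [select(0, s) for s in shapes]

print(determines(by_shape, full, db, universe))                # True
print(determines([select(0, "dragon")], full, db, universe))   # False
```

The selections on every shape together pin down the whole relation, while the selection on "dragon" alone leaves the crane tuples unknown, so it does not determine the full query.
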
Arbitrage-Free
Axiom 1. Given D, the pricing function price_D(Q) is arbitrage-free if for all views V_1, …, V_k and every query Q such that D ⊢ V_1, …, V_k ↠ Q:
  price_D(Q) ≤ price_D(V_1) + … + price_D(V_k)
Why: suppose V determines Q and price_D(Q) > price_D(V). Then a buyer can
• buy V(D) for price_D(V)
• compute Q(D) from V(D)
• and so obtain the answer to Q for price_D(V), which is less than price_D(Q)

Discount-Free
Axiom 2. The pricing function price_D(Q) should offer no discounts beyond the explicit price points defined by the seller.
• The intuition is that the price points represent discounts that the seller offers relative to the price of the whole database
• A pricing function is discount-free if it is maximal: no other arbitrage-free pricing function that respects the price points assigns a higher price to any query

Example: Origami Database
[Figure: database relation S with its price points]
• Price points: get all dragon origami for $2; get all red origami for $3
• What is the price of the entire database? Q(x,y,z) :- S(x,y,z)
Exhausts the active domain
• V_1, V_2, V_3, V_4 determine Q: price(Q) ≤ $8
• W_1, W_2, W_3 determine Q: price(Q) ≤ $9
• price(Q) = $8

Example: Origami Database
[Figure: relations R, S, T with price points p(σ_color) = $50, p(σ_shape) = $99, p(σ_shape) = $2, p(σ_color) = $5]
• What is the price of the full join? Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v)

The Query Pricing Formula
• Given:
  • Price points S = {(V_1, p_1), …, (V_k, p_k)}
  • Database instance D
  • Query Q
• Compute: price_D^S(Q)
• Properties: (a) arbitrage-free, (b) discount-free, (c) price_D^S(V_i) = p_i
  • If such a pricing function exists, we say that the price points are consistent
• Method:
  • Consider all subsets of V = {V_1, …, V_k} that determine Q
  • Let C be a subset with the minimum total price Σ_{V_i ∈ C} p_i
  • Define p_D(Q) = Σ_{V_i ∈ C} p_i
Theorem.
(a) The price points are consistent iff p_D(V_i) = p_i for every price point i = 1, …, k
(b) price_D^S(Q) = p_D(Q) is the unique arbitrage-free, discount-free pricing function that agrees with the price points

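The method above (cheapest subset of priced views that determines Q) can be sketched as a brute-force search. This is a toy illustration, not the system's implementation: the relation S(shape, color), the two-value domains, and the helpers `subsets`, `determines`, `select`, and `price` are all assumptions. The $2 and $3 prices echo the dragon and red price points of the origami example; the prices for "crane" and "blue" are made up.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a tuple."""
    items = list(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def determines(views, query, db, universe):
    """Instance-based determinacy by exhaustive enumeration of all instances D'."""
    seen, answer = [v(db) for v in views], query(db)
    return all(query(frozenset(o)) == answer
               for o in subsets(universe)
               if [v(frozenset(o)) for v in views] == seen)

def select(column, value):
    """Selection view: all tuples whose given column equals value."""
    return lambda d: frozenset(t for t in d if t[column] == value)

def price(price_points, query, db, universe):
    """p_D(Q): minimum total price over subsets of views that determine Q.
    Returns None if no subset of the priced views determines the query."""
    best = None
    for chosen in subsets(price_points):
        cost = sum(p for _, p in chosen)
        if best is not None and cost >= best:
            continue  # cannot improve on the best subset found so far
        if determines([v for v, _ in chosen], query, db, universe):
            best = cost
    return best

# Hypothetical toy instance and price points.
shapes, colors = ["dragon", "crane"], ["red", "blue"]
universe = [(s, c) for s in shapes for c in colors]
db = frozenset({("dragon", "red"), ("crane", "blue")})
points = [(select(0, s), 2) for s in shapes] + [(select(1, c), 3) for c in colors]

full = lambda d: d
red_dragons = lambda d: frozenset(t for t in d if t == ("dragon", "red"))

print(price(points, full, db, universe))         # 4
print(price(points, red_dragons, db, universe))  # 2
```

Here the whole relation costs $4 (both shape selections), and the red-dragon query costs $2 because the dragon selection alone determines it. Each view also prices at its own price point, which is the consistency condition of the theorem for this toy input.
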
Discussion
• If the result of Q1 is always a subset of Q2, should Q1 be priced less than Q2? No! Example:
  • V(x,y) :- Fortune500(x,y)
  • Q(x,y) :- Fortune500(x,y), StrongBuyRec(x)
  • price(Q) >> price(V)
• We ignore computation costs in our framework
  • Cost of computing query Q
  • Q(D) = f(V(D)), but f can be hard to compute

Determinacy
Definition. [Instance-dependent] V determines Q given D, denoted D ⊢ V ↠ Q, if:
  for all D', if V(D') = V(D), then Q(D) = Q(D')
Definition. [Instance-independent] [Nash, Segoufin, Vianu '07] V determines Q, denoted V ↠ Q, if:
  for all D, D', if V(D) = V(D'), then Q(D) = Q(D')
V ↠ Q
  iff there exists a function f such that Q(D) = f(V(D)) for all D
  iff for every D, we have that D ⊢ V ↠ Q

Complexity Of Determinacy
Open question: is the bound on the combined complexity tight?

Complexity Of Pricing
Corollary. Deciding whether price_D^S(Q) ≤ k is:
• Combined complexity [input S, D]: Σ_2^p
• Data complexity [input D]: coNP-hard
Proposition. Pricing is at least as hard as determinacy
How do we deal with the hardness of computation?

Restricting Price Points to Selections
• A seller can specify only the prices of selection queries of the form σ_{R.X=a}: prices on columns
• The domain of each column is finite and known to buyers and sellers
• Price points on selections are how prices are set in most cases today

Dichotomy Theorem
Theorem. Assuming selection views only, for any conjunctive query Q without self-joins, one of the following holds (data complexity):
(a) computing price_Q^S(D) is in PTIME
(b) checking whether price_Q^S(D) ≤ k is NP-complete
• PTIME:
  • Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v) [Chains]
  • Q(x_1,…,x_k) :- R_1(x_1,x_2), …, R_k(x_k,x_1) [Cycles]
• NP-complete:
  • Q(x) :- R(x,y) [Projections]
  • Q(x,y,z) :- R(x,y,z), S(x), T(y), U(z)

Algorithm For PTIME Cases
• The algorithm uses a reduction to maximum flow
• Edges of finite capacity represent price points
• A set of edges of finite cost is a cut iff they determine the query
• Example:
  • Chain query Q(x,y) :- R(x), S(x,y), T(y)
  • Dom(X) = {a1, a2, a3, a4}
  • Dom(Y) = {b1, b2, b3}

Flow Graph
[Figure: flow graph for the chain query, with layers labeled R, S, T over the nodes a1 to a4 and b1 to b3]
A set of edges of finite cost is a cut iff they determine the query

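Once such a flow graph is built, the price is the value of a minimum cut, which any max-flow routine can compute. The sketch below shows only that generic step, using a plain Edmonds-Karp implementation. The graph in it is hypothetical and is not the construction from the talk: the node names, capacities, and the function `min_cut` are assumptions, with finite capacities standing in for price points and infinite capacities for edges that may never be cut. It assumes every source-to-sink path contains at least one finite edge.

```python
from collections import defaultdict, deque

INF = float("inf")

def min_cut(edges, source, sink):
    """Edmonds-Karp max flow. Returns (cut value, list of saturated cut edges)."""
    residual = defaultdict(lambda: defaultdict(float))
    for (u, v), cap in edges.items():
        residual[u][v] += cap
        residual[v][u] += 0.0          # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break                      # no augmenting path left: flow is maximum
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        total += push
    # Nodes still reachable from the source form the source side of a min cut.
    reachable = set(parent)
    cut = [(u, v) for (u, v) in edges if u in reachable and v not in reachable]
    return total, cut

# Hypothetical graph: finite capacities play the role of price points.
edges = {
    ("s", "a1"): 5, ("s", "a2"): 2,
    ("a1", "b1"): INF, ("a2", "b1"): INF, ("a2", "b2"): INF,
    ("b1", "t"): 3, ("b2", "t"): 3,
}
value, cut = min_cut(edges, "s", "t")
print(value, sorted(cut))   # 5.0 [('b1', 't'), ('s', 'a2')]
```

In this toy graph the cheapest cut mixes one edge from each side (total 5) rather than cutting a whole layer, which is the kind of combination of price points a min-cut search is meant to find.
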
Conclusions
• Summary:
  • The seller sets prices on some views, while the system computes the price of any query
  • Interesting application of query determinacy
  • Complexity: dichotomy for CQs without self-joins
• Future Work:
  • Pricing in the presence of updates
  • How do we handle pricing for intractable queries?
  • Connection between pricing and privacy