
Querying Big Data by Accessing Small Data

Querying Big Data by Accessing Small Data. Wenfei Fan University of Edinburgh & Beihang University Floris Geerts University of Antwerp Yang Cao University of Edinburgh & Beihang University Ting Deng Beihang University Ping Lu Beihang University. Challenges introduced by big data.





Presentation Transcript


  1. Querying Big Data by Accessing Small Data • Wenfei Fan (University of Edinburgh & Beihang University) • Floris Geerts (University of Antwerp) • Yang Cao (University of Edinburgh & Beihang University) • Ting Deng (Beihang University) • Ping Lu (Beihang University)

  2. Challenges introduced by big data • Traditional computational complexity theory of the past 50 years: • The ugly: PSPACE-hard, EXPTIME-hard, …, undecidable • The bad: NP-hard (intractable) • The good: polynomial-time computable (PTIME) • What happens when it comes to big data? Using an SSD with a throughput of 6 GB/s, a linear scan of a data set D would take • 1.9 days when D is of 1 PB (10^15 B) • 5.28 years when D is of 1 EB (10^18 B) • O(n) time is already beyond reach on big data in practice! • Can we still answer queries on big data with limited resources?
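
The scan times on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes that "6G/s" means 6 × 10^9 bytes per second and that a year has 365.25 days; the function and constant names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length


def scan_seconds(size_bytes, throughput_bytes_per_s=6e9):
    """Time to read size_bytes once at a fixed sequential throughput."""
    return size_bytes / throughput_bytes_per_s


# 1 PB = 10^15 bytes, 1 EB = 10^18 bytes
pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"1 PB: {pb_days:.1f} days, 1 EB: {eb_years:.2f} years")
```

Under these assumptions the numbers agree with the 1.9 days and 5.28 years quoted on the slide.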

  3. Bounded evaluability • Input: A class L of queries • Question: Can we find, for any query Q ∈ L and any (possibly big) dataset D, a fraction DQ of D such that • Q(D) = Q(DQ), and • DQ can be identified in time determined by Q? [Figure: evaluating Q on D versus on the fraction DQ] • Scales with D no matter how big D grows • Making the cost of computing Q(D) independent of |D|!

  4. Graph Search (Facebook) • Find me restaurants in New York my friends have been to in 2014 • 1.38 billion person tuples, and over 140 billion friend tuples • select rid from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy) where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014 • Data semantics in constraints: • Facebook: at most 5000 friends per person • Each year has at most 366 days • Each person dines at most once per day • pid is a key for relation person • Build an index from pid1 to pid2 for friend(pid1, pid2) • Boundedly evaluable with indices under constraints?

  5. Bounded query evaluation • Find me restaurants in New York my friends have been to in 2014 • Q(rid) = ∃ p, p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014) • A query plan under the constraints + indices: • Fetch 5000 pid's for friends of p0 (5000 friends per person) • For each pid, check whether she lives in NYC (5000 person tuples) • For pid's living in NYC, find restaurants where they dined in 2014 (5000 * 366 tuples at most) • Accessing 5000 + 5000 + 5000 * 366 tuples in total, in contrast to 1.38 billion person tuples and over 140 billion friend tuples

  6. Overview • Formalization of bounded query plans and queries • The complexity of deciding bounded evaluability for CQ (SPJ), UCQ, FO+ (SPJU), FO • Effective syntax for boundedly evaluable queries • Approximate query answering with bounded evaluability: • Bounded envelopes • Bounded query specialization • Previous work: bounded query plans are not properly defined. We only know that bounded evaluability is • undecidable for FO [PODS 2014] • in PTIME for CQ with very restricted query plans [VLDB 2014]

  7. Boundedly evaluable queries: formulation

  8. Access constraints to capture data semantics • Combining cardinality constraints and indices • On a relation schema R: X → (Y, N) • X, Y: sets of attributes of R • for any X-value, there exist at most N distinct Y values • Index on X for Y: given an X value, find relevant Y values • Examples • friend(pid1, pid2): pid1 → (pid2, 5000), 5000 friends per person • dine(pid, rid, dd, mm, yy): pid, yy → (rid, 366), each year has at most 366 days and each person dines at most once per day • person(pid, name, city): pid → (city, 1), pid is a key for person • Discovery: functional dependencies, simple aggregate queries • Access schema: a set of access constraints
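
The cardinality half of an access constraint X → (Y, N) can be checked with a single grouping pass. The sketch below assumes relations are lists of dictionaries; `satisfies` and the sample data are hypothetical names used only for illustration.

```python
from collections import defaultdict


def satisfies(rows, x_cols, y_cols, n):
    """True if every X-value is associated with at most n distinct Y-values."""
    groups = defaultdict(set)
    for row in rows:
        x_val = tuple(row[c] for c in x_cols)
        groups[x_val].add(tuple(row[c] for c in y_cols))
    return all(len(y_vals) <= n for y_vals in groups.values())


# Hypothetical toy instance of person(pid, name, city).
person = [
    {"pid": 1, "name": "Ann", "city": "NYC"},
    {"pid": 2, "name": "Bob", "city": "LA"},
]

key_ok = satisfies(person, ["pid"], ["city"], 1)        # pid -> (city, 1) holds

# Adding a second city for pid 1 violates the constraint.
person_bad = person + [{"pid": 1, "name": "Ann", "city": "Boston"}]
key_violated = not satisfies(person_bad, ["pid"], ["city"], 1)

print(key_ok, key_violated)
```

The `groups` dictionary built during the check is essentially the index that the constraint also assumes: given an X-value, it returns the relevant Y-values.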

  9. Bounded plans for query Q • In the presence of access schema A • ξ(Q, R): T1 = δ1, …, Tn = δn, where δi is one of • { a }: a constant in query Q • Fetch(X ∈ Tj, R, Y): via access constraint R: X → (Y', N) with Y ⊆ X ∪ Y', j < i; fetches data by making use of indices in A • πY(Tj), σC(Tj), ρ(Tj): projection, selection, renaming • Tj × Tk, Tj ∪ Tk, Tj − Tk: Cartesian product, union, set difference, for j < i, k < i • The length of ξ(Q, R) is bounded by an exponential in |R|, |Q| and |A|; plans beyond exponential length are not very practical • Independent of the size of instances D of R
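
One way to see why such a plan is independent of |D| is to propagate an upper bound on the size of each Ti through the plan. The encoding below is an assumption made for illustration (tuples of operator name and 0-based step indices), not the paper's formalism, and the bounds are simple worst-case estimates.

```python
def size_bounds(plan):
    """Upper bound on |Ti| for each step, using only the constants N of A."""
    bounds = []
    for op, *args in plan:
        if op == "const":                      # { a }: a single constant
            bound = 1
        elif op == "fetch":                    # at most N tuples per X-value in Tj
            j, n = args
            bound = bounds[j] * n
        elif op in ("project", "select", "rename"):
            (j,) = args
            bound = bounds[j]
        elif op == "product":
            j, k = args
            bound = bounds[j] * bounds[k]
        elif op == "union":
            j, k = args
            bound = bounds[j] + bounds[k]
        elif op == "difference":               # Tj - Tk is no larger than Tj
            j, k = args
            bound = bounds[j]
        else:
            raise ValueError(f"unknown operator: {op}")
        bounds.append(bound)
    return bounds


# Hypothetical encoding of the Graph Search plan from slide 5.
graph_search_plan = [
    ("const",),              # T0 = {p0}
    ("fetch", 0, 5000),      # T1: friends of p0 via pid1 -> (pid2, 5000)
    ("fetch", 1, 1),         # T2: city of each friend via pid -> (city, 1)
    ("select", 2),           # T3: keep city = NYC
    ("const",),              # T4 = {2014}
    ("product", 3, 4),       # T5: (pid, yy) pairs
    ("fetch", 5, 366),       # T6: restaurants via pid, yy -> (rid, 366)
    ("project", 6),          # T7: rid
]

bounds = size_bounds(graph_search_plan)
print(bounds)
```

Every bound is computed from the plan and the constants in the access schema alone, so no step depends on how many tuples the underlying relations hold.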

  10. Boundedly evaluable queries Q • Q has a bounded query plan ξ(Q, R) under an access schema A • CQ: only { a }, Fetch(X ∈ Tj, R, Y), πY(Tj), σC(Tj), ρ(Tj), Tj × Tk • UCQ: ∪ at the end only • FO+: { a }, Fetch, π, σ, ρ, ×, ∪ • FO: { a }, Fetch, π, σ, ρ, ×, ∪, − • Coping with big data

  11. Deciding bounded evaluability

  12. The bounded evaluability problem (BEP(L)) • Input: A relational schema R, an access schema A, and a query Q in a query language L • Question: Is Q boundedly evaluable under A, i.e., does Q have a bounded query plan under A? • Undecidable for FO [PODS 2014] • Is BEP decidable for CQ? UCQ? FO+? • If so, what is the complexity? • The bounded evaluability analysis is nontrivial

  13. Example of boundedly evaluable queries • Schema: R(A, B, C) • Access schema A: R(∅ → C, 1), R(AB → C, N) • A CQ query: Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1) • Is Q boundedly evaluable? Yes, Q is A-equivalent to Q'(x, x) = R(1, 1, x), which is boundedly evaluable: • x = y = z3 • ∃ z1, z2 (R(1, 1, x) ∧ R(z1, z2, y)) is entailed by R(1, 1, x) • With indices in A, • “nontrivial” variables are fetchable; • combinations are indexed • We need to reason about A-equivalence and “nontrivial” variables

  14. The complexity of BEP • BEP is EXPSPACE-complete for CQ, UCQ and FO+ • good news: decidable • bad news: too expensive to be practical • Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions • Upper bound: a characterization based on A-equivalence and “nontrivial” variables for boundedly evaluable queries • Can we make practical use of bounded evaluability?

  15. Effective syntax for boundedly evaluable queries

  16. An effective syntax for bounded CQ • A syntactic characterization of boundedly evaluable CQ: a form of queries covered by an access schema A • A CQ is boundedly evaluable under A iff it is A-equivalent to a CQ covered by A • All CQ queries covered by A are boundedly evaluable under A • It is in PTIME in |Q|, |A| and |R| to syntactically check whether a CQ is covered by A • A CQ Q is covered by A if • all free variables and variables that participate in “selection / join” of Q are accessible via indices in A • the combination of such variables in each atom R(x) is indexed by a single access constraint

  17. More on covered queries • Schema: R(A, B, C) • Access schema A: R(∅ → C, 1), R(AB → C, N) • Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1): covered • A query Q in FO+ is covered by A if for each CQ-subquery Qi • either Qi is covered by A, • or for each A-instance θ(Ti) of Qi, there exists a CQ-subquery Qj of Q such that Qi(θ(Ti)) ⊆ Qj(θ(Ti)) and Qj is covered • It is Π^p_2-complete to decide whether a query in FO+ is covered

  18. Bounded envelopes

  19. Bounded envelopes • What can we do if query Q in L is not boundedly evaluable under A? Approximate query answering • We find QL and QU in the same language L such that • QL and QU are boundedly evaluable under A • for all instances D that satisfy A • QL(D) ⊆ Q(D) ⊆ QU(D), and • |Q(D) − QL(D)| ≤ NL and |QU(D) − Q(D)| ≤ NU, where NL and NU are constants, so QL(D) and QU(D) are not too far from Q(D) • QL and QU: lower and upper envelopes of Q • S. Chaudhuri and P. G. Kolaitis. Can datalog be approximated? JCSS 55(2), 1997

  20. Example bounded envelopes • Schema: R(A, B) • Access schema A: R(A → B, N) • Q(x) = ∃ y, z, w (R(w, x) ∧ R(y, w) ∧ R(x, z) ∧ w = 1) is not boundedly evaluable • Bounded envelopes • Upper (relaxation): QU(x) = ∃ y, z (R(1, x) ∧ R(x, z)) • Lower (expansion): QL(x) = ∃ y, z (R(1, x) ∧ R(y, 1) ∧ R(x, y) ∧ R(x, z)) • Q(x, y) = ∃ w (R(w, x) ∧ R(y, w) ∧ w = 1): bounded envelopes may not exist
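
The containment QL(D) ⊆ Q(D) ⊆ QU(D) for this example can be sanity-checked on small instances. The sketch below evaluates the three queries by brute force over a set of pairs, so it says nothing about bounded evaluation itself; the function names and both toy instances are assumptions made for illustration.

```python
def q(r):
    """Q(x): R(1, x), some R(y, 1), and some R(x, z)."""
    has_edge_into_1 = any(b == 1 for (_, b) in r)
    return {x for (a, x) in r
            if a == 1 and has_edge_into_1 and any(s == x for (s, _) in r)}


def q_upper(r):
    """QU(x): drops the atom R(y, 1) (relaxation)."""
    return {x for (a, x) in r if a == 1 and any(s == x for (s, _) in r)}


def q_lower(r):
    """QL(x): adds the atom R(x, y), so y is reachable from x (expansion)."""
    return {x for (a, x) in r
            if a == 1 and any((y, 1) in r for (s, y) in r if s == x)}


# Hypothetical toy instances of R(A, B), given as sets of (A, B) pairs.
r1 = {(1, 2), (3, 1), (2, 3), (1, 4), (4, 5)}
r2 = {(1, 2), (2, 3)}

for r in (r1, r2):
    assert q_lower(r) <= q(r) <= q_upper(r)

print(q_lower(r1), q(r1), q_upper(r1))
print(q_lower(r2), q(r2), q_upper(r2))
```

On `r1` the lower envelope misses the answer 4, and on `r2` the upper envelope returns 2 although Q is empty, which illustrates that envelopes approximate Q from below and above rather than reproduce it.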

  21. The bounded envelope problems • UEP(L): • Input: A relational schema R, an access schema A, and a query Q in a query language L • Question: Does Q have a bounded upper envelope under A? • Similarly LEP(L) for lower envelopes • We consider covered envelopes when Q is in CQ, UCQ or FO+ • Complexity bounds • For CQ, UEP and LEP are NP-complete • For UCQ, UEP is Π^p_2-complete and LEP is NP-complete • For FO+, UEP is Π^p_2-complete and LEP is DP-complete • For FO, UEP and LEP are undecidable

  22. Bounded specialized queries

  23. Bounded query specialization • Access schema A, and query Q with a set X of parameters (variables) • Q(x = c): Q ∧ x = c, where x ⊆ X and the valuation c is a constant tuple • boundedly evaluable under A for all valuations c • Consider covered queries when Q is in CQ, UCQ or FO+ • Find me restaurants in New York my friends have been to in 2014: Q(p, rid) = ∃ p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014), for all valuations p0 • Instantiate a minimum set of parameters and make Q bounded

  24. The bounded specialization problem (QSP(L)) • Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k • Question: Does Q have a bounded specialization Q(x = c) with |x| ≤ k? • Complexity bounds • NP-complete for CQ • Π^p_2-complete for UCQ and FO+ • undecidable for FO

  25. Summing up

  26. Bounded evaluability of queries • Challenges: querying big data is cost-prohibitive • Bounded evaluability allows us to make big data small • However, the bounded evaluability analysis is expensive • Decidability and complexity • Nonetheless, we can make practical use of bounded evaluability • Effective syntax: covered queries for CQ, UCQ and FO+ • Approximate query answering: • Bounded envelopes with a constant bound • Bounded specialization for parameterized queries • An approach to effectively querying big data
