
Ten Thousand SQLs


Presentation Transcript


  1. Ten Thousand SQLs Kalmesh Nyamagoudar 2010MCS3494

  2. CONTENTS • CN Generation • Example • Definitions • Algorithm • Sequential Algorithm • CLP : Naïve • CLP : New • OLP • DLP • Performance Studies • CN Evaluation

  3. BANKS Model • Answers are modeled as Steiner trees over the data graph • Diagram: Paper1 and Paper2 linked to Author1 and Author2

  4. DISCOVER Model • Joining network of tuple sets: a tree over tuple sets of AUTHOR, WRITES, PAPER, CITE (diagram) • Joining network of tuples: concrete trees such as Author1 - Paper1 - Author2 and Author1 - Paper2 - Author2 (diagram)

  5. Background : DISCOVER • Database • n relations • Each relation has a set of attributes • Schema graph GS : • Directed graph that captures the primary-foreign key relationships in the database schema • One node for each relation • One edge for each primary-foreign key relationship • Assumption : no self loops or parallel edges • GU : undirected version of GS (used later)
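As a concrete illustration of the schema-graph definition above, a minimal Python sketch (my own, not from the paper; the TPC-H-style relation names are just examples):

```python
from collections import defaultdict

def build_schema_graph(fk_edges):
    """fk_edges: list of (referencing, referenced) relation-name pairs,
    one per primary-foreign key relationship."""
    g = defaultdict(set)
    for src, dst in fk_edges:
        if src == dst:
            raise ValueError("self loops are assumed absent")
        g[src].add(dst)
        g.setdefault(dst, set())  # make sure every relation is a node
    return dict(g)

def undirected(g):
    """GU: the undirected version of the schema graph GS."""
    u = defaultdict(set)
    for src, dsts in g.items():
        u.setdefault(src, set())
        for dst in dsts:
            u[src].add(dst)
            u[dst].add(src)
    return dict(u)

# TPC-H-style fragment: CUSTOMER -> NATION, ORDERS -> CUSTOMER, LINEITEM -> ORDERS
gs = build_schema_graph([("CUSTOMER", "NATION"),
                         ("ORDERS", "CUSTOMER"),
                         ("LINEITEM", "ORDERS")])
gu = undirected(gs)
```

The undirected version GU is what the later CN-generation step walks when it attaches adjacent tuple sets.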

  6. Background : DISCOVER Schema Graph (TPC-H)

  7. Background : DISCOVER Example Data Source : Discover[3]

  8. Background : DISCOVER Query: "Smith, Miller" Source : Discover[3]

  9. Background : DISCOVER Query: "Smith, Miller" • Result of size 2 : O1 - C1 - O2 Source : Discover[3]

  10. Background : DISCOVER Query: "Smith, Miller" • Joining networks of tuples by size : • Size 2 : O1 - C1 - O2 • Size 4 : O1 - C1 - N1 - C2 - O3 Source : Discover[3]

  11. Background : DISCOVER Source : Discover[2] Joining Network Of Tuple Sets

  12. Background : DISCOVER • Final answer : joining network of tuples (JNT) • Tree of tuples • For each pair of adjacent tuples ti ∈ Ri and tj ∈ Rj, there is an edge (Ri, Rj) in the schema graph and ti joins with tj (ti ⋈ tj ∈ Ri ⋈ Rj) • Keyword query • Given : a set of keywords • Result : the set of all joining networks of tuples that are both : • Total : every keyword is contained in at least one tuple of the joining network • Minimal : no tuple can be removed such that the remaining network is still a total JNT • Ordered by : size of the MTJNTs • All such joining networks of tuples are Minimal Total Joining Networks of Tuples (MTJNTs)
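The Total and Minimal conditions can be sketched as a check over a small network (an illustrative Python sketch with an assumed representation: tuples as ids, each carrying the set of query keywords it contains; only leaves are candidates for removal, since removing an interior tuple would disconnect the tree):

```python
def is_total(contains, keywords):
    """Total: every keyword appears in at least one tuple."""
    covered = set()
    for kws in contains.values():
        covered |= kws
    return set(keywords) <= covered

def is_mtjnt(edges, contains, keywords):
    """contains: {tuple id: set of keywords it contains};
    edges: tree edges between tuple ids."""
    if not is_total(contains, keywords):
        return False
    degree = {t: 0 for t in contains}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    leaves = [t for t, d in degree.items() if d <= 1]
    for leaf in leaves:
        rest = {t: k for t, k in contains.items() if t != leaf}
        if is_total(rest, keywords):
            return False  # a leaf is redundant, so the network is not minimal
    return True

kw = {"Smith", "Miller"}
contains = {"O1": {"Smith"}, "C1": set(), "O2": {"Miller"}}
edges = [("O1", "C1"), ("C1", "O2")]  # O1 - C1 - O2 from the example above
```

On the slide's example, O1 - C1 - O2 passes: both keywords are covered, and dropping either leaf loses a keyword.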

  13. Background : DISCOVER • Joining network of tuple sets • Tree of tuple sets • For each pair of adjacent tuple sets, there is an edge between their relations in the schema graph • Candidate network (CN) • Given : a set of keywords • A joining network of tuple sets such that there is an instance I of the database for which it produces an MTJNT, and no tuple that maps to a free tuple set contains any keyword • 2 steps : • candidate network (CN) generation • CN evaluation

  14. Background : DISCOVER • Candidate Networks Generation • Complete : every possible MTJNT is produced by a candidate network output by the algorithm • Minimal : does not produce any redundant candidate networks • Example : • ORDERS^Smith ⋈ CUSTOMER{} ⋈ ORDERS^Miller • ORDERS^Smith ⋈ CUSTOMER{} ⋈ ORDERS^Miller ⋈ CUSTOMER{} • ORDERS^Smith ⋈ CUSTOMER{} ⋈ ORDERS{} • ORDERS^Smith ⋈ LINEITEM{} ⋈ ORDERS^Miller • Tmax : maximum number of tuple sets in a CN
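A heavily simplified sketch of breadth-first CN generation (my own illustration, not DISCOVER's actual algorithm: isomorphic duplicates are not eliminated, only singleton keyword sets are used, and the only pruning rule kept is that a free tuple set may not be a leaf):

```python
from itertools import count

def generate_cns(adj, keywords, tmax):
    """adj: undirected schema graph {relation: set of relations}.
    Nodes of a partial network are (id, relation, kwset); an empty
    kwset marks a free tuple set."""
    ids = count()
    cns, queue = [], []
    kwsets = [frozenset()] + [frozenset({k}) for k in keywords]
    for rel in adj:                       # seed with keyword tuple sets
        for kw in keywords:
            queue.append(([(next(ids), rel, frozenset({kw}))], []))
    while queue:
        nodes, edges = queue.pop(0)
        covered = frozenset().union(*(k for _, _, k in nodes))
        deg = {i: 0 for i, _, _ in nodes}
        for a, b in edges:
            deg[a] += 1
            deg[b] += 1
        free_leaf = any(deg[i] <= 1 and not k for i, _, k in nodes)
        if covered == frozenset(keywords) and not free_leaf:
            cns.append((nodes, edges))    # accept this candidate network
        if len(nodes) < tmax:             # expand by one adjacent tuple set
            for i, rel, _ in nodes:
                for nrel in adj[rel]:
                    for ks in kwsets:
                        nid = next(ids)
                        queue.append((nodes + [(nid, nrel, ks)],
                                      edges + [(i, nid)]))
    return cns
```

With a toy two-relation schema O - C and keywords {smith, miller}, the output contains the shape of the slide's first example, O^Smith - C{} - O^Miller.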

  15. CN Generation Source : Discover[2]

  16. CN Generation Source : Discover[2]

  17. CN Generation Source : Discover[2]

  18. CN Evaluation : • Large number of CNs to be evaluated • CNs are usually tightly coupled with each other • Reuse the common sub-expressions • Example : • Dataset : DBLP • No. of tables : 4 • Max no. of tuples/result : 11 • CN join operations without sharing : 539,052 • CN join operations with sharing : 53,008 • Probability that any two CNs share computational cost : 59.64%
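The sub-expression reuse idea can be sketched by memoizing join plans (illustrative Python; the nested-tuple plan encoding and the toy join on last/first attributes are assumptions, not the paper's implementation):

```python
def evaluate(plan, tables, cache, stats):
    """plan: a tuple-set name (leaf) or ("join", left, right).
    Rows are tuples, joined on last/first attribute as a toy stand-in
    for the real foreign-key join."""
    if isinstance(plan, str):
        return tables[plan]
    if plan in cache:                 # common sub-expression: reuse it
        stats["reused"] += 1
        return cache[plan]
    _, left, right = plan
    lrows = evaluate(left, tables, cache, stats)
    rrows = evaluate(right, tables, cache, stats)
    out = [l + r for l in lrows for r in rrows if l[-1] == r[0]]
    stats["joins"] += 1
    cache[plan] = out
    return out

tables = {"A": [(1, 2)], "B": [(2, 3)], "C": [(3, 4)], "D": [(3, 5)]}
cn1 = ("join", ("join", "A", "B"), "C")
cn2 = ("join", ("join", "A", "B"), "D")   # shares the A ⋈ B sub-expression
cache, stats = {}, {"joins": 0, "reused": 0}
r1 = evaluate(cn1, tables, cache, stats)
r2 = evaluate(cn2, tables, cache, stats)
```

Evaluating both CNs performs only three joins instead of four, because A ⋈ B is computed once and reused, which is exactly the effect the DBLP numbers above quantify at scale.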

  19. Sequential Algorithm : Example • Dataset : DBLP • Schema diagram over AUTHOR, WRITE, PAPER, CITE (attributes include TID, AID, PID, PID1, PID2, NAME) Source : TTS[1]

  20. Sequential Algorithm : Example • Keywords entered and the CNs generated for them (diagram over the DBLP schema : AUTHOR, WRITE, PAPER, CITE) Source : TTS[1]

  21. CN Evaluation : state-of-the-art sequential algorithm • Greedy algorithm : • In each iteration, build the intermediate result that maximizes a score combining • the number of occurrences of the intermediate result in the CNs, and • the estimated number of tuples of the intermediate result • Weighting the two factors appropriately gives better results
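A sketch of the greedy choice (the exponents a and b are illustrative weights for the two factors, not values from the paper, and the candidate names are made up):

```python
def pick_next(candidates, a=1.0, b=1.0):
    """candidates: (name, occurrences-in-CNs, estimated-tuple-count).
    Prefer frequent, small intermediate results: frequency**a / size**b."""
    return max(candidates, key=lambda c: c[1] ** a / max(c[2], 1) ** b)

# hypothetical intermediate results over the DBLP schema
cands = [("A⋈W", 10, 500), ("W⋈P", 25, 400), ("P⋈C", 5, 50)]
```

With a = b = 1, the small P⋈C result (score 0.1) wins over the more frequent but larger W⋈P (score 0.0625); tuning a and b shifts that trade-off.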

  22. Sequential Algorithm : Execution Graph Source : TTS[1] • Example CN (figure) • Total Cost : 2199

  23. Sequential Algorithm : Execution Graph • GE : a DAG • Each node v ∈ V(GE) : an operator (e.g. ⋈ or σ) • (u, v) ∈ E(GE) iff the output of u is part of the input of v • Levels • σ operators : level 1 • A node v is in level l iff • ∃ u such that (u, v) ∈ E(GE) and u is in level l - 1, and • ∀ u such that (u, v) ∈ E(GE), the level of u is no larger than l - 1 • lmax : maximum level of GE • Evaluation • Evaluated in a bottom-up (level-by-level) fashion • No parallelism involved • For keyword queries with a large number of keywords or with highly selective keywords, the processing time is slow
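Computing the levels of GE as defined above can be sketched as follows (assumed representation: each operator maps to the list of its input operators; the node names are hypothetical selects sA/sW/sP and joins AW/AWP):

```python
def levels(inputs):
    """inputs: {operator: list of input operators}. Operators with no
    inputs (the selects) sit at level 1; every other operator is one
    level above its deepest input."""
    memo = {}
    def level(v):
        if v not in memo:
            ins = inputs.get(v, [])
            memo[v] = 1 if not ins else 1 + max(level(u) for u in ins)
        return memo[v]
    return {v: level(v) for v in inputs}

ge = {"sA": [], "sW": [], "sP": [],
      "AW": ["sA", "sW"], "AWP": ["AW", "sP"]}
# levels(ge) -> {"sA": 1, "sW": 1, "sP": 1, "AW": 2, "AWP": 3}
```

lmax is then just the maximum value of the returned mapping; the later operator-level parallelism processes one level (phase) at a time.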

  24. New Solution • Use of multi-core architectures • Why not existing parallel multi-query processing? • Large number of queries • Large sharing between queries • Large intermediate results • What do we need on multi-core architectures? • CNs in the same core share the most computational cost • CNs in different cores share the least computational cost • Handle high workload skew • Handle errors caused by estimation adaptively

  25. CN Level Parallelism : Straightforward Approach • Largest-first rule : assign the largest remaining CN to the partition with the least workload Source : TTS[1] Final Cost : max(cost of each core) = 1949

  26. CLP : Straightforward Approach • Select the core : O(n) • Add the CN to the partition Source : TTS[1] • Problem : does not consider sub-expression sharing
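The straightforward largest-first rule can be sketched with a min-heap of core loads (illustrative Python; the CN names and costs are made up):

```python
import heapq

def largest_first(cn_costs, n_cores):
    """Assign CNs in decreasing cost order, always to the currently
    least-loaded core. No sub-expression sharing is considered."""
    heap = [(0, core, []) for core in range(n_cores)]
    heapq.heapify(heap)
    for cn, cost in sorted(cn_costs.items(), key=lambda x: -x[1]):
        load, core, cns = heapq.heappop(heap)   # least-loaded core
        heapq.heappush(heap, (load + cost, core, cns + [cn]))
    return {core: (load, cns) for load, core, cns in heap}

parts = largest_first({"c1": 8, "c2": 7, "c3": 3, "c4": 2}, 2)
```

Here the final cost is the maximum load over cores (both cores end at 10), mirroring the "max(cost of each core)" measure on the previous slide.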

  27. CLP : Sharing-Aware CN Partitioning • Which CN to distribute first? • The one with the largest not-shared (extra) cost • To which partition? • The one with maximum sharing, if that does not destroy the workload balance • Total cost of a partition = cost after sharing sub-expressions among all CNs in that partition
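A simplified sketch of the sharing-aware rule (my own approximation: a CN is modeled as a set of sub-expressions with costs, extra costs are recomputed directly instead of being maintained in the paper's heap, and the balance check uses an assumed slack factor):

```python
def sharing_aware(cns, costs, n_cores, slack=1.5):
    """cns: {name: set of sub-expression ids}; costs: {sub-expr: cost}.
    slack bounds how far a core's load may exceed the average."""
    parts = [set() for _ in range(n_cores)]   # sub-expressions per core
    loads = [0.0] * n_cores
    assign, pending = {}, dict(cns)

    def extra(subs, i):                       # not-shared cost on core i
        return sum(costs[s] for s in subs - parts[i])

    while pending:
        # distribute the CN with the largest extra cost first
        name, subs = max(pending.items(),
                         key=lambda kv: min(extra(kv[1], i)
                                            for i in range(n_cores)))
        # prefer the core with maximum sharing (minimum extra cost) ...
        order = sorted(range(n_cores), key=lambda i: extra(subs, i))
        total = sum(loads) + extra(subs, order[0])
        target = order[0]
        for i in order:                       # ... unless it breaks balance
            if loads[i] + extra(subs, i) <= slack * total / n_cores:
                target = i
                break
        loads[target] += extra(subs, target)
        parts[target] |= subs
        assign[name] = target
        del pending[name]
    return assign, loads
```

In a toy run, a CN that would share a sub-expression with an already loaded core gets diverted to a lighter core when the shared placement would exceed the slack, which is the "if it does not destroy the workload balancing" proviso above.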

  28-32. CLP : Sharing-Aware CN Partitioning : worked example (animation). A MaxHeap orders the CNs by extra cost; CNs over the tuple sets A, P, W, C are popped one by one and merged into the execution graph of core 3, sharing sub-expressions as they are added (operator costs in the diagrams range from 5 to 510).

  33. CLP : Sharing-Aware CN Partitioning Source : TTS[1] • Total Cost = 957 • 43.5% of the sequential cost

  34. CLP : Sharing-Aware CN Partitioning Source : TTS[1] • Initialization • Select the core • Update cost of overlapping CNs : O(|E(GE)| · n) overall • Redundant work is done by multiple cores • Wrong estimation : accumulated cost

  35. CLP: Error Accumulation Source : TTS[1]

  36. Operator Level Parallelism • Each CN may be processed across different cores, but each operation must be processed on one core • Nodes in the same phase are processed in parallel • When is a node processed? • A node is processed in phase i if it is in level i of GE • Which operation to distribute first? • The one with the largest cost • To which partition? • The one with minimum cost, if that does not destroy the workload balance • Sharing between CNs and phases : shared memory
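Phase-by-phase operator scheduling can be sketched as follows (illustrative Python; the balance condition is simplified to "least-loaded core first", and the operator names and costs are made up):

```python
import heapq

def schedule_phases(op_levels, op_costs, n_cores):
    """op_levels: {operator: level in GE}. Operators in the same level
    form one phase; within a phase, distribute largest-cost first to
    the least-loaded core. Returns {operator: (phase, core)}."""
    phases = {}
    for op, lvl in op_levels.items():
        phases.setdefault(lvl, []).append(op)
    plan = {}
    for lvl in sorted(phases):                    # one phase per level
        heap = [(0.0, c) for c in range(n_cores)]  # (load, core)
        heapq.heapify(heap)
        for op in sorted(phases[lvl], key=lambda o: -op_costs[o]):
            load, core = heapq.heappop(heap)
            plan[op] = (lvl, core)
            heapq.heappush(heap, (load + op_costs[op], core))
    return plan

plan = schedule_phases({"s1": 1, "s2": 1, "s3": 1, "j1": 2, "j2": 2},
                       {"s1": 4, "s2": 3, "s3": 2, "j1": 10, "j2": 8}, 2)
```

Because all inputs of a phase-i operator were produced in phases < i, cores never wait on each other within a phase, and sharing happens through the common memory holding earlier phases' results.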

  37. Operator Level Parallelism Source : TTS[1] • Final Cost = 737 • 33.5% of the sequential cost

  38. OLP : Overcoming Error Accumulation • Before each phase, re-estimate the cost of each operation • Cost of a select operation • Cost of a join operation • The needed per-edge statistics can be pre-computed and saved beforehand for each edge

  39. OLP : Overcoming Accumulated Cost (diagram; costs 685, 685, 643 shown) Source : TTS[1]

  40. Operator Level Parallelism Source : TTS[1] • Execution Time • A join operation is much more costly than the others => it becomes the dominant cost when processing

  41. Data Level Parallelism • Each operation in GE can be performed on multiple cores • Uses operator level parallelism if there is no workload skew • Partitions data adaptively each time workload skew happens • Which node to partition? • The most costly node, if it is dominant • When to merge the sub-results? • At the final phase

  42. Data Level Parallelism (diagram : execution graph split across Core 1, Core 2 and Core 3) Source : TTS[1]

  43. Data Level Parallelism Source : TTS[1] • Procedure : • Select the child node to be partitioned • Divide the tuples of the child node • Make copies of the selected child node and all its parent nodes • Add the corresponding edges • Re-estimate • Lemma : in each phase, only a bounded number of partition operations is performed, so each node has a bounded number of copies
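The partition-and-merge step can be sketched on a single dominant join (illustrative Python; the round-robin split and the toy equi-join on the first attribute are assumptions, and the per-core execution is simulated sequentially):

```python
def partitioned_join(left, right, n_cores):
    """Split the child node's tuples into n_cores slices; each slice
    would be joined against `right` on its own core; the sub-results
    are merged afterwards."""
    chunks = [left[i::n_cores] for i in range(n_cores)]  # divide tuples
    sub_results = [[l + r for l in chunk for r in right if l[0] == r[0]]
                   for chunk in chunks]                  # one core each
    return [row for sub in sub_results for row in sub]   # final merge

left = [(1, "a"), (2, "b"), (1, "c")]   # tuples of the costly child node
right = [(1, "x"), (2, "y")]
```

Since every left tuple lands in exactly one slice, the merged output equals the unpartitioned join, which is why the merge can safely be deferred to the final phase.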

  44. Performance Studies • Implemented in … • System configuration : … • LINEAR : processing time of the state-of-the-art sequential algorithm divided by the number of cores

  45. Performance Studies • Default values (IMDB) : • … : 3 • … ranges from 2 to 6 with a default value of 4 • … ranges from 4 to 7 with a default value of 5 Source : TTS[1]

  46. Vary (IMDB) Source : TTS[1]

  47. Vary (IMDB) Source : TTS[1]

  48. Vary (IMDB) Source : TTS[1]

  49. References • [1] Lu Qin, Jeffrey Xu Yu, Lijun Chang. Ten Thousand SQLs: Parallel Keyword Queries Computing. Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010, Singapore. • [2] Vagelis Hristidis, Yannis Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong. • [3] [PPT] DISCOVER: Keyword Search in Relational Databases.

  50. Thank You
