
Rapid Development of Data Generators Using Meta Generators in PDGF

Middleware Systems Research Group (MSRG.ORG). Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen. DBTest 2013, June 24, New York City.


Presentation Transcript


  1. MIDDLEWARE SYSTEMS RESEARCH GROUP Rapid Development of Data Generators Using Meta Generators in PDGF MSRG.ORG Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24, New York City

  2. DBMS Benchmarking is Increasingly Complex • Data volumes are skyrocketing • Enterprise data warehouses double in size every three years • Many enterprise data warehouses are petabyte-sized • Systems are becoming increasingly complex • Large number of processor cores • Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware) • Multi-node systems (the sky is the limit) • Large memory • Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems • How to challenge these systems?

  3. Benchmarks are increasingly complex • More tables, columns • More relationships, dependencies, data types, … • How to build these benchmarks? • Parallel Data Generation Framework to the rescue!

  4. Parallel Data Generation Framework • Generic data generation framework • Relational model • Schema specified in configuration file • Post-processing stage for alternative representations • Repeatable computation • Based on XORSHIFT random number generators • Hierarchical seeding strategy
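The repeatable-computation idea on this slide can be illustrated with a minimal Python sketch. It is not PDGF's implementation: the function names (`xorshift64`, `derive_seed`, `cell_value`), the mixing constant, and the number of scrambling rounds are assumptions made for illustration. The sketch uses Marsaglia's 64-bit xorshift step and derives a seed per table, then per column, then per row, so that any single cell can be recomputed without generating the cells before it.

```python
MASK64 = (1 << 64) - 1


def xorshift64(state: int) -> int:
    """One step of Marsaglia's 64-bit xorshift (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent: int, index: int) -> int:
    """Derive a child seed from a parent seed and the child's index.

    The mixing constant is an arbitrary odd number chosen for this sketch;
    it is not taken from PDGF.
    """
    seed = (parent ^ ((index + 1) * 0x9E3779B97F4A7C15)) & MASK64
    if seed == 0:
        seed = 1  # xorshift must never be seeded with zero
    for _ in range(3):
        seed = xorshift64(seed)
    return seed


def cell_value(project_seed: int, table_id: int, column_id: int,
               row: int, lo: int, hi: int) -> int:
    """Compute the value of one cell directly from its coordinates."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    row_seed = derive_seed(column_seed, row)
    # Map the random number into the closed interval [lo, hi].
    return lo + xorshift64(row_seed) % (hi - lo + 1)


# The same coordinates always yield the same value, on any node or thread.
age = cell_value(project_seed=42, table_id=0, column_id=1, row=1000, lo=0, hi=120)
```

Because every value depends only on the project seed and the cell's coordinates, workers can generate disjoint row ranges in any order and still produce identical data on every run.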

  5. Repeatable Data Generation • Data generation based on random numbers • More specifically parallel random number generation • Generation of numbers within range (e.g., age) • What if we want NULL values? • Repeat that logic in every generator?

  6. PDGF Architecture • Controller: initialization • Meta Scheduler: inter-node scheduling • Scheduler: inter-thread scheduling • Worker: block-wise data generation • Update Black Box: coordination of data updates • Seeding System: random sequence adaptation • Generators: value generation • Output System: data formatting • To generate data for a schema, the user defines: • Schema XML file • Defines relational schema • Generation XML file • Defines output format (CSV, XML, merging tables)
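The two scheduling levels named on this slide can be pictured as a nested partitioning of a table's rows: first across nodes, then across the threads of each node. The sketch below is only an illustration under that assumption. The slide does not say how PDGF sizes or hands out its blocks, and `split_range` and `plan` are hypothetical names.

```python
def split_range(start: int, stop: int, parts: int) -> list[tuple[int, int]]:
    """Split the half-open range [start, stop) into contiguous, near-equal parts."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` parts each take one additional row.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def plan(rows: int, nodes: int, threads_per_node: int) -> list[list[tuple[int, int]]]:
    """Return, for each node, the row range assigned to each of its threads."""
    return [split_range(lo, hi, threads_per_node)
            for lo, hi in split_range(0, rows, nodes)]


# Ten rows over two nodes with two threads each.
assignment = plan(rows=10, nodes=2, threads_per_node=2)
```

A static split like this only works because each value can be computed independently of its neighbours, so no worker needs state from another.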

  7. Configuring PDGF • Schema configuration • Data model • Relational model • Tables, fields • Properties • Table size, characters, … • Generators • Base generators • Meta generators • Update definition • Insert, update, delete • Generated as change data capture <table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true"> <gen_IdGenerator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_PrePostfixGenerator> <gen_PaddingGenerator> <gen_OtherFieldValueGenerator> <reference field="S_SUPPKEY" /> </gen_OtherFieldValueGenerator> <character>0</character> <padToLeft>true</padToLeft> <size>9</size> </gen_PaddingGenerator> <prefix>Supplier </prefix> </gen_PrePostfixGenerator> </field> [..]

  8. Base Generators in PDGF • DictList generator • Random line from file • Long generator • Random long in interval • Others • StaticValue • Double • Date • String • Text • … <table name="users"> <size>10000</size> <fields> <field name="name"> <type>java.sql.types.VARCHAR</type> <size>100</size> <gen_DictList> <file>dicts/names.dict</file> </gen_DictList> </field> <field name="age"> <type>java.sql.types.NUMERIC</type> <gen_LongGenerator> <min>0</min> <max>120</max> </gen_LongGenerator> </field> </fields> </table>

  9. Null Generator • Add NULL logic to every generator? • Could easily be implemented in a superclass • Adds to the configuration file • Reduces performance (every time) • Higher-order generator: NullGenerator • Only used if added to the schema • Can be added to any generator <field name="age"> <type>java.sql.types.NUMERIC</type> <gen_NullGenerator> <probability>0.05</probability> <gen_LongGenerator> <min>0</min> <max>120</max> </gen_LongGenerator> </gen_NullGenerator> </field>
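The wrapping idea behind the NullGenerator can be sketched in a few lines of Python. This is an illustration, not PDGF's Java code: the class and method names mirror the slide's terminology but the `next_value` interface is an assumption, and Python's `random.Random` stands in for PDGF's seeded random number stream.

```python
import random


class LongGenerator:
    """Base generator: a random integer in the closed interval [lo, hi]."""

    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi

    def next_value(self, rng: random.Random):
        return rng.randint(self.lo, self.hi)


class NullGenerator:
    """Meta generator: returns None with the given probability,
    otherwise delegates to the wrapped generator."""

    def __init__(self, probability: float, inner):
        self.probability = probability
        self.inner = inner

    def next_value(self, rng: random.Random):
        if rng.random() < self.probability:
            return None  # the wrapped generator is not invoked at all
        return self.inner.next_value(rng)


# Equivalent of the "age" field above: 5% NULLs, otherwise 0..120.
age_generator = NullGenerator(0.05, LongGenerator(0, 120))
rng = random.Random(2013)
sample = [age_generator.next_value(rng) for _ in range(10)]
```

The base generator knows nothing about NULLs, so fields that never need them pay no cost, and the same wrapper works around any other generator.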

  10. Meta Generators • Control flow and post-processing generators • Null generator controls flow • Post-processing • FormattedNumberGenerator • PaddingGenerator • UpperLowerCaseGenerator • PrePostfixGenerator • FormulaGenerator • Flow control • ProbabilityGenerator • SequentialGenerator • IfGenerator • SwitchGenerator • ReferenceGenerator

  11. Post-Processing Example • Phone number for users • Tens of representations • PhoneNumberGenerator was too inflexible • Formatted long number • Long numbers between 10010001 and 9999999999 • Number formatting: (%d%d%d) %d%d%d-%d%d%d%d <field name="phonenumber"> <type>java.sql.types.VARCHAR</type> <size>30</size> <generator name="FormattedNumberGenerator"> <generator name="LongGenerator"> <min>10010001</min> <max>9999999999</max> </generator> <format>(%d%d%d) %d%d%d-%d%d%d%d</format> </generator> </field>
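One plausible reading of the format string is that each `%d` consumes one digit of the generated number. The Python sketch below implements that reading. The function name `format_digits` is hypothetical, and the zero-padding of short numbers is an assumption, since the slide does not say how numbers with fewer digits than placeholders are handled.

```python
def format_digits(number: int, pattern: str) -> str:
    """Fill each "%d" in `pattern` with one digit of `number`, left to right.

    Numbers with fewer digits than placeholders are left-padded with zeros
    (an assumption of this sketch).
    """
    slots = pattern.count("%d")
    digits = str(number).rjust(slots, "0")
    if len(digits) > slots:
        raise ValueError("number has more digits than the pattern has slots")

    remaining = iter(digits)
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("%d", i):
            out.append(next(remaining))  # consume one digit per placeholder
            i += 2
        else:
            out.append(pattern[i])       # copy literal characters unchanged
            i += 1
    return "".join(out)


phone = format_digits(4165551234, "(%d%d%d) %d%d%d-%d%d%d%d")
# phone == "(416) 555-1234"
```

Supporting another phone representation then only requires a different pattern string, not a new generator.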

  12. Flow Control Example • More elaborate name field • Name male or female • 50% chance • All upper case • Padded to 100 characters • Sequential generator • Probability generator • DictList generator • UpperLowerCase generator • Padding generator <field name="name"> <type>java.sql.types.VARCHAR</type> <size>100</size> <generator name="SequentialGenerator"> <generator name="ProbabilityGenerator"> <probability value="0.5"> <generator name="DictList"> <file>dicts/female.dict</file> </generator> </probability> <probability value="0.5"> <generator name="DictList"> <file>dicts/male.dict</file> </generator> </probability> </generator> <generator name="UpperLowerCaseGenerator"> <mode>uppercase</mode> </generator> <generator name="PaddingGenerator"> <character> </character> <padToLeft>true</padToLeft> </generator> </generator> </field>
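The nesting in this configuration can be mirrored by a small Python sketch in which a sequential generator runs its children in order and passes each result to the next one. All class interfaces here are assumptions for illustration. In-memory lists replace the dictionary files, and `padToLeft` is interpreted as adding the pad characters on the left, which the slide does not state explicitly.

```python
import random


class DictList:
    """Base generator: a random entry from a dictionary (here an in-memory list)."""
    def __init__(self, entries):
        self.entries = list(entries)

    def generate(self, rng, value=None):
        return rng.choice(self.entries)


class ProbabilityGenerator:
    """Flow control: pick one branch according to its probability."""
    def __init__(self, branches):  # list of (probability, generator) pairs
        self.branches = branches

    def generate(self, rng, value=None):
        r, cumulative = rng.random(), 0.0
        for probability, generator in self.branches:
            cumulative += probability
            if r < cumulative:
                return generator.generate(rng, value)
        return self.branches[-1][1].generate(rng, value)  # guard against rounding


class UpperLowerCaseGenerator:
    """Post-processing: change the case of the incoming value."""
    def __init__(self, mode):
        self.mode = mode

    def generate(self, rng, value):
        return value.upper() if self.mode == "uppercase" else value.lower()


class PaddingGenerator:
    """Post-processing: pad the incoming value to a fixed size."""
    def __init__(self, character, size, pad_to_left):
        self.character, self.size, self.pad_to_left = character, size, pad_to_left

    def generate(self, rng, value):
        if self.pad_to_left:
            return value.rjust(self.size, self.character)
        return value.ljust(self.size, self.character)


class SequentialGenerator:
    """Flow control: run the stages in order, feeding each result to the next."""
    def __init__(self, stages):
        self.stages = stages

    def generate(self, rng, value=None):
        for stage in self.stages:
            value = stage.generate(rng, value)
        return value


name_generator = SequentialGenerator([
    ProbabilityGenerator([
        (0.5, DictList(["Alice", "Mary"])),  # stands in for dicts/female.dict
        (0.5, DictList(["Bob", "John"])),    # stands in for dicts/male.dict
    ]),
    UpperLowerCaseGenerator("uppercase"),
    PaddingGenerator(" ", 100, pad_to_left=True),
])
name = name_generator.generate(random.Random(12))
```

Each class does one small job, and the field's behavior comes entirely from how the pieces are nested, which is what lets the configuration file replace custom code.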

  13. Core Performance • Test environment: single core laptop, no I/O • Base time for framework ~ 55 ns (Base Time) • Seeding, method invocation, setting a value • Computation time for generator 50+ ns (Gen Time) • Cache update if referenced ~ 50 ns (Cache Update) • Cache lookup if intra row reference ~ 50 ns (Cache Lookup) • Sub-generator invocation ~ 50 ns

  14. Performance Basic Generators • Basic generators without formatting • 120ns – 510ns

  15. Performance Formatted Values • Basic Generators with formatting • Usually > 1000ns

  16. Performance Meta Generators • Meta generator overhead: • Base overhead ~ 50 ns • Generator overhead starts from 50 ns • Sub generator invocation ~ 50ns • Often negligible due to lazy formatting
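As a rough worked example, the component costs reported on the performance slides can be added up for a nested generator. The additive model below is an assumption of this sketch, not something the slides state. It uses the lower-bound figures (~55 ns framework base time, 50 ns per generator, ~50 ns per sub-generator invocation), so the result is a floor rather than a measurement. The function names are hypothetical.

```python
BASE_NS = 55            # framework base time per value
SUB_INVOCATION_NS = 50  # cost of invoking one sub-generator


def generator_cost_ns(own_ns, children=()):
    """Lower-bound cost of a generator: its own computation plus, for each
    child, one sub-generator invocation and the child's own cost."""
    return own_ns + sum(SUB_INVOCATION_NS + child for child in children)


def field_cost_ns(generator_ns):
    """Lower-bound cost of producing one field value."""
    return BASE_NS + generator_ns


long_generator = generator_cost_ns(50)
null_over_long = generator_cost_ns(50, [long_generator])

total_ns = field_cost_ns(null_over_long)  # 55 + 50 + 50 + 50 = 205
values_per_second = 1e9 / total_ns        # single core, no I/O
```

Under these assumptions, wrapping a LongGenerator in a NullGenerator adds about 100 ns per value, which is small next to the more than 1000 ns reported for formatted values.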

  17. Use Cases • TPC-H / SSB • 8 tables, 61 columns (first non-trivial example) • Without meta generators: 26 custom field value generators (FVGs) • 2h editing: 10 custom FVGs • 1 day reimplementation: 0 custom FVGs, i.e. no coding • SSB variations • Skews on dimension attributes, fact measures, references • TPC-DI (in progress) • 20 tables, 200 columns • 19 custom FVGs (mainly for performance in corner cases) • 56x NullGenerator • 32x ProbabilityGenerator • 3000 lines of config (XML import for multiple files)

  18. Conclusion & Future Work • Meta generators • Improve usability and expressiveness • Speed up schema definition • Remove necessity for coding • Enlarged configuration files • Used in TPC benchmark(s) • Performance overhead is small, often negligible • Future work • GUI and SQL export • SQL import and data extraction

  19. Thanks • Questions? • Contact: tilmann.rabl@utoronto.ca • Download and try PDGF: • http://www.paralleldatageneration.org • Some big data info in our BigBench presentation • Tuesday, 4pm, Industry 3
