1 / 26

Hash Objects – Why Use Them?

Hash Objects – Why Use Them?. Carolyn Cunnison SAS Technical Training Specialist. Agenda. What are a Hash objects? When should I use them? Some sample code. What are HASH objects?. Hash object can be thought of as rows of keys and data loaded into memory. Keys. Data. Data.

duncan
Télécharger la présentation

Hash Objects – Why Use Them?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hash Objects – Why Use Them? Carolyn Cunnison SAS Technical Training Specialist

  2. Agenda • What are a Hash objects? • When should I use them? • Some sample code.

  3. What are HASH objects? • Hash object can be thought of as rows of keys and data loaded into memory. Keys Data Data ...

  4. Advantages of Hash Objects • Values can be hard-coded or loaded from a SAS data set. • Keys and data can be a mixture of character and numeric. • Provides in-memory data storage and retrieval. • Does not require that data be sorted. • Is sized dynamically.

  5. When to Use Hash Objects (1) Joining tables • I cut my processing time by 90% using hash tables - You can do it too!; Jennifer K. Warner-Freeman • http://www.nesug.info/Proceedings/nesug07/bb/bb16.pdf • Jennifer took an existing Proc SQL join which took between 2 and 4 hours to run. When she rewrote the program to use Hash tables, the program ran in 11 minutes.

  6. When to Use Hash Objects (2) Summary-less summarization • Hash-Crash and Beyond; Paul Dortman et al • http://www2.sas.com/proceedings/forum2008/037-2008.pdf • Compared PROC SUMMARY with NWAY option to Hash Object proc summary data = input nway ; class k1 k2 ; var num ; output out = summ_sum (drop = _:) sum = sum ; • The Hash Object did “the job more than twice as fast at the same time utilizing ⅓the memory”

  7. When to Use Hash Objects (3) Dynamically output to multiple files • Paul Dortman paper (continued) • Use a Hash table instead of the following: Data out1 out2; Set tablein;… If id = 1 then output out1; Else if id = 2 then output out2;

  8. When to Use Hash Objects (4) Removing data extremes • Knowledge Base Sample 25990 • http://support.sas.com/kb/25/990.html • Removes top and bottom 10% of data values.

  9. When to Use Hash Objects(5) Perform data sampling without Proc Surveyselect • Better Hashing in SAS9.2; Robert Ray and Jason Secosky • http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf • Select observations from a table without replacement. • Perform sampling and data manipulation in one step.

  10. Terminology Partial list of methods:

  11. Business Scenario • You need to read orion.product_list and then look up information in the orion.supplier table. • Structure of orion.product_list • Structure of orion.supplier

  12. Loading Data from a SAS Data Set • data supplier_info; • drop rc; • length Supplier_Name $40 Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address','Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address,Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; p305d02

  13. Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; . . . D ...

  14. Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; . . . D ...

  15. Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; . . . D ...

  16. Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; True . . . D ...

  17. Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; Implicit OUTPUT; Implicit RETURN; . . . D ...

  18. Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; Continue until EOF . . . D ...

  19. Results proc print data=supplier_info(obs=10); var Product_ID Supplier_ID Supplier_Name Supplier_Address Country; title "Product Information"; run; Partial PROC PRINT Output Product Information Obs Product_ID Supplier_ID Supplier_Name Supplier_Address Country 1 210200100009 3298 A Team Sports 2687 Julie Ann Ct US 2 210200100017 3298 A Team Sports 2687 Julie Ann Ct US 3 210200200022 6153 Nautlius SportsWear Inc 56 Bagwell Ave US 4 210200200023 6153 Nautlius SportsWear Inc 56 Bagwell Ave US 5 210200300006 1303 Eclipse Inc 1218 Carriole Ct US 6 210200300007 1303 Eclipse Inc 1218 Carriole Ct US 7 210200300052 1303 Eclipse Inc 1218 Carriole Ct US 8 210200400020 1303 Eclipse Inc 1218 Carriole Ct US 9 210200400070 1303 Eclipse Inc 1218 Carriole Ct US 10 210200500002 772 AllSeasons Outdoor Clothing 553 Cliffview Dr US

  20. Could I do the same thing with a MERGE ? • Yes. But …… • Would have to sort both tables. • Reading from disk is slower than reading from memory.

  21. What about data size ? • Scalability of Table Lookup Techniques,Rick Langston • http://support.sas.com/resources/papers/proceedings09/037-2009.pdf • Compared Hash table, Sort/Merge, Indexing, Proc SQL and Proc Format as table lookup techniques. • Hash object processing was successful up to around 1,900,000 rows and then ran out of memory.

  22. Did you know that…. • PROC SQL sometimes uses hashing to join tables. • Possible processing methods are: sqxjsl - Step Loop Join (Cartesian product)sqxjm - Merge Joinsqxjndx- Index Joinsqxjhsh- Hash Join • To view the method used: Procsql _method;

  23. The HITER object • The HITER object must point to a HASH object. • Read the HITER using the following methods.

  24. Conclusion • Hash and Hiter objects are very flexible. • Data has to fit into memory. • Results will depend on your data, your environment, and what you are trying to do. • You have to benchmark.

  25. Want to know more? • SAS Programming III: Advanced Techniques and Efficiencies https://support.sas.com/edu/schedules.html?ctry=ca&id=279 • Also available as Live Web course.

  26. Questions?

More Related