This guide explores effective methodologies for SAS programming when dealing with large datasets, emphasizing efficiency in processing speed, CPU usage, and memory management. Key strategies include utilizing reusable code, avoiding unnecessary data handling, and employing data compression techniques for optimal performance. The document outlines principles for dataset management, combining datasets, and best practices for coding that enhance maintainability and usability. With practical examples, it offers insights into leveraging SAS capabilities to ensure ease of use and advanced data manipulation.
Efficient SAS programming with Large Data
Aidan McDermott, Computing Group, March 2007
Axes of Efficiency
• processing speed: CPU, real time
• storage: disk, memory, …
• user: functionality, interface to other systems, ease of use, learning
• user development: methodologies, reusable code, facilitating extension and rewriting, maintenance
General (and obvious) principles
• Avoid doing the job if possible.
• Keep only the data you need to perform a particular task (use DROP, KEEP, WHERE, and subsetting IF statements), as in the sketch below.
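A minimal sketch of these options (the library, dataset, and variable names here are hypothetical):

data work.subset;
  /* keep= and where= discard unneeded columns and rows as the data are read */
  set master.admissions(keep=id admit los where=(admit >= '01JAN2005'd));
  if los > 0;                     /* subsetting IF: drop bad records early */
run;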
General (and obvious) principles • Often efficient methods were written to perform the required task – use them.
General (and obvious) principles
• Often efficient methods were written to perform other tasks: use them with caution.
• Write data-driven code.
  • It's easier to maintain data than to update code.
• Use LENGTH statements to limit the size of variables in a dataset to no more than is needed.
  • You don't always know what the size should be, and you don't always produce your own data.
• Use formatted data rather than the data itself (see the sketch below).
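A sketch of the LENGTH and FORMAT points (the format and all names are hypothetical):

proc format;
  value agefmt low-<18 = 'child'  18-<65 = 'adult'  65-high = 'senior';
run;

data work.admits2;
  length flag 3;                  /* 3 bytes is the smallest numeric length; safe for small integers only */
  set work.admits;
  flag = (los > 30);
  format age agefmt.;             /* group ages via a format instead of storing a new variable */
run;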
Compressing Datasets
• Compress datasets with a general-purpose utility such as compress, gzip, winzip, or pkzip, and decompress before running each SAS job.
  • This delays execution, and you need to keep track of data and program dependencies.
• Or use a general-purpose compression utility and decompress within SAS for sequential access, as sketched below.
  • This is system dependent (it needs a named pipe), and the data must be stored and read sequentially.
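A Unix-flavored sketch of the second approach (the path and utility are illustrative, and the details are system dependent):

filename gz pipe 'gunzip -c /data/admits.txt.gz';

data work.admits;
  infile gz truncover;            /* sequential read only: no random access */
  input id admit :date9. los;
run;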
SAS internal Compression
• Allows random access to the data and is very effective under the right circumstances; in some cases, though, it doesn't reduce the size of the data by much.
• "There is a trade-off between data size and CPU time."
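A sketch using the COMPRESS= dataset option (the names are illustrative); the same setting can be made globally with OPTIONS COMPRESS=YES;:

data work.big(compress=yes);      /* yes = character (RLE); compress=binary also exists */
  set work.source;
run;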
Example: indata is a large dataset and you want to produce a version of indata with all its variables but no observations.
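Two standard idioms do this; the second reads no data at all, because the variable attributes are copied during the compile phase:

data outdata;
  set indata(obs=0);              /* read zero observations */
run;

data outdata;
  if 0 then set indata;           /* compile phase copies the attributes */
  stop;                           /* prevent an endless execute phase */
run;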
The data step is a two-stage process:
• compile phase
• execute phase
data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;   /* implicit output at the end of each iteration */

PDV: the compile phase builds the program data vector; the execute phase then loads each observation into it, computes discharge, and performs the implicit output at the end of each iteration.
General principles
• Use BY processing whenever you can.
• Example: given observations of ozone by region, siteid, and date, calculate the mean and maximum ozone value for each region, siteid, and date.
General principles
• Easy; one way is sketched below.
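A sketch with PROC MEANS (the dataset and variable names are assumptions; the data must already be sorted by the BY variables):

proc means data=ozone noprint;
  by region siteid date;
  var ozone;
  output out=daily mean=mean_ozone max=max_ozone;
run;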
General principles
• Suppose there are multiple monitors at each site and you still need to calculate the daily mean.
• Do you combine multiple observations onto one line and then compute the statistics?
• Suppose you want the 10% trimmed mean?
• Suppose you want the second maximum?
• Do you use arrays to sort the data?
• Do you write your own function?
One array-based possibility is sketched below.
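One array-based sketch, not from the original slides: it assumes the data are sorted by region, siteid, and date, that no group has more than 200 readings, and that CALL SORTN is available (SAS 9.2 or later); all names are hypothetical.

data stats(keep=region siteid date trimmean secondmax);
  array vals{200} _temporary_;
  set monitors;
  by region siteid date;
  if first.date then do;
    n = 0;
    do i = 1 to dim(vals); vals{i} = .; end;   /* clear the group buffer */
  end;
  n + 1;
  vals{n} = ozone;
  if last.date then do;
    call sortn(of vals{*});                    /* ascending: missings sort first */
    k = floor(0.1 * n);                        /* values to trim from each tail */
    total = 0;
    do i = dim(vals) - n + 1 + k to dim(vals) - k;
      total = total + vals{i};                 /* sum the untrimmed values */
    end;
    trimmean = total / (n - 2*k);
    if n >= 2 then secondmax = vals{dim(vals) - 1};
    else secondmax = .;
    output;
  end;
run;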