Statistical Software Programming

STAT 6360 – Statistical Software Programming

Sorting, Printing, Summarizing Data

Now that we can input data and do some manipulations, let’s learn some simple PROCs.

Using PROCs – Some Commonalities

PROCs do a wide variety of different things and syntax varies, but there are some statements and options that are available in most or all PROCs.

• The PROC statement and data= option:
  • All PROCs start with a PROC statement. More importantly, nearly every PROC statement accepts a data= option, which tells the PROC which dataset to operate on.
  • data= is optional, but get in the habit of using it every time. If omitted, the PROC uses the most recently created dataset.
  • It is sometimes useful to add SAS dataset options that modify the dataset as it is brought into the PROC.
  • Syntax: PROC XXXX data=dsname(options);
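As a minimal sketch of the data= option with dataset options attached, the step below assumes the sample dataset sashelp.class that ships with SAS is available (it is not part of the lecture examples). The keep= and obs= dataset options limit what the PROC sees without changing the stored dataset.

```sas
/* Dataset options in parentheses modify the data as it is read into the PROC. */
/* keep= restricts the variables; obs= stops after the first 5 observations.   */
proc print data=sashelp.class(keep=name sex height obs=5);
run;
```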

• BY statement.
  • Syntax: BY variable_list;
  • If the dataset is sorted by the variables in variable_list, then the BY statement tells SAS to repeat the action of the PROC for each combination of values of those variables.
  • If dataset htdata has data on boys and girls and is sorted by gender, the following code computes summary statistics on height separately by gender:

      proc means data=htdata;
        by gender;
        var height;
      run;

  • The one PROC for which BY is required is PROC SORT, where it tells the procedure the sorting scheme to use.
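BY processing needs sorted input, so in practice the two steps usually travel together. The sketch below uses sashelp.class (assumed available; its variable sex plays the role of gender in htdata) and a hypothetical output name class_sorted.

```sas
/* Step 1: sort by the BY variable; out= keeps the original dataset untouched. */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

/* Step 2: the PROC is repeated once per value of sex. */
proc means data=class_sorted;
  by sex;
  var height;
run;
```

If the sort step is skipped and the data are not already in order, SAS stops the BY step with an error about the dataset not being sorted.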

• LABEL statement.
  • Syntax: LABEL var1='label1' var2='label2' …;
  • This statement associates a label with a variable. If used, the label is shown in place of the variable name in the output (PROC PRINT does this only when its LABEL option is specified).
  • This statement is valid in both data steps and PROCs. In a data step the label is permanently associated with the variable. In a PROC the label applies only within the PROC.
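A short sketch of a LABEL statement inside a PROC, again assuming sashelp.class is available. The label text is made up for illustration; because the statement sits in the PROC, the labels last only for this step.

```sas
proc means data=sashelp.class;
  var height weight;
  /* Temporary labels: they appear in this PROC's output table only. */
  label height='Height (illustrative label)'
        weight='Weight (illustrative label)';
run;
```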

• WHERE statement.
  • Syntax: WHERE condition;
  • This statement applies in any PROC that operates on a dataset.
  • It limits the application of the PROC to the subset of the dataset that satisfies the condition.
  • E.g., the following code computes summary statistics on height only for females:

      proc means data=htdata;
        where gender='F';
        var height;
      run;

  • The WHERE statement is very convenient!
  • There is also a where= dataset option that can be used to bring a subset of a dataset into the PROC instead of the whole dataset. The net effect is the same.
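The two ways of subsetting can be put side by side. This sketch assumes sashelp.class, where the variable sex takes the values 'F' and 'M'; both steps should summarize the same observations.

```sas
/* WHERE statement inside the PROC */
proc means data=sashelp.class;
  where sex='F';
  var height;
run;

/* where= dataset option on the data= dataset; same subset, same results */
proc means data=sashelp.class(where=(sex='F'));
  var height;
run;
```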

PROC SORT

• Sorting our data is useful for many reasons.
  • Helpful to understand patterns in the data when examining them directly.
  • Necessary for many PROCs and plotting routines.
  • Necessary for BY group processing within a PROC.
  • Necessary for assigning ranks, creating indices.
• Basic syntax of PROC SORT:

      proc sort data=dsname1 out=dsname2;
        by variable_list;
      run;

• The out= option asks SAS to put the sorted dataset in a new dataset with a name specified by the user (dsname2). If this option is omitted, the sorted dataset overwrites the original dataset (dsname1).
• By default, PROC SORT sorts in ascending order. This can be changed by preceding any variable on the variable_list with the keyword descending.
• See Example #1 (MAPData example) in Lec4Examps.sas.
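To make the descending keyword concrete, here is a small sketch on sashelp.class (assumed available; class_by_age is a hypothetical output name). The keyword applies only to the variable that immediately follows it.

```sas
/* Ascending by sex, then oldest to youngest within each sex. */
proc sort data=sashelp.class out=class_by_age;
  by sex descending age;
run;

proc print data=class_by_age;
run;
```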

PROC PRINT

• We’ve used PROC PRINT several times now. But here are some additional things to know:
• Options on the PROC statement:
  • NOOBS option suppresses printing of the observation index.
  • LABEL option tells SAS to print variable labels, not variable names, as column headings.
  • N option asks for the number of obs per BY group and overall to be printed. Also allows labels to be specified for those values (via N='label1' 'label2').
• Other statements:
  • SUM statement generates totals for each BY group and overall for specified numeric variables.
  • FORMAT statement specifies formats to be used when printing selected variables.
  • ID statement specifies a variable to use in place of the observation number (OBS), which is, by default, printed in the left-most column of output.
• See documentation for other options of less importance.
• See Example #2 (STAT Enrollment Example) in Lec4Examps.sas.
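The options and statements above can be combined in one step. This is only a sketch on sashelp.class (assumed available), not the enrollment example from Lec4Examps.sas; the dataset name class_sorted and the label and N= strings are made up.

```sas
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

proc print data=class_sorted label
           n='Students in group: ' 'Students overall: ';
  by sex;                  /* one listing per BY group            */
  id name;                 /* name replaces the Obs column        */
  var age height weight;
  sum weight;              /* totals per BY group and overall     */
  format height weight 5.1;
  label height='Height (illustrative label)';
run;
```

Because an ID statement is used, NOOBS is not needed here; without ID, adding NOOBS to the PROC statement would drop the Obs column.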

Formats and PROC FORMAT

• SAS formats are to printing and outputting data as informats are to inputting data.
• They allow values of numeric, character, date and time variables to be printed or output to a file in modified (e.g., “pretty”) form without changing the values themselves.
• Most formats are either character, numeric, or date & time formats, with the general forms:
  • Character: $formatw.
  • Numeric: formatw.d
  • Date & time: formatw.
• Essentially the same general syntax as informats.
• Many analogous formats & informats are identical.
• It is crucial to use date and time formats for these values to be easily understood.

FORMAT Statement:

Syntax: FORMAT var_list1 format1. var_list2 format2. ...;

• If used in a PROC, temporarily associates variables with formats.
• If used in a data step, permanently associates variables with formats.
• Formats will be used
  • when values of a formatted variable are printed by PROCs (e.g., PROCs PRINT, FREQ, any time a BY statement is used);
  • when viewing formatted variables in VIEWTABLE;
  • when outputting values of a formatted variable with a PUT statement.
• Formats vs. Labels:
  • Formats modify how SAS prints the values of a variable that has been given a format.
  • Labels modify how SAS prints the name of the variable itself, not its values.
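A small sketch of a temporary FORMAT statement in a PROC. The dataset visits and its three rows are made up purely for illustration; date9. and dollar10.2 are standard SAS formats.

```sas
/* Hypothetical data: visit_date is stored as a SAS date (days since 01JAN1960). */
data visits;
  input id visit_date :date9. cost;
  datalines;
1 05JAN2012 1234.5
2 17FEB2012 87.25
3 30MAR2012 410
;
run;

/* No format: visit_date prints as a raw number of days. */
proc print data=visits;
run;

/* Formats applied for this PROC only; the stored values are unchanged. */
proc print data=visits;
  format visit_date date9. cost dollar10.2;
run;
```

Moving the FORMAT statement into the data step would instead attach the formats to the variables permanently.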

PUT Statement: Analogous to the INPUT statement, with similar syntax.

Syntax: flexible. Essentially the same as the syntax of INPUT.

• Writes to the log window by default. If a FILE statement is used in the same data step, writes to a file.
• FILE is analogous to INFILE, with similar syntax.
• I use PUT very rarely.
• See Example #3 (skiing example) in Lec4Examps.sas.
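For reference, a sketch of PUT with and without FILE, using sashelp.class (assumed available). The file name class_report.txt is a hypothetical relative path; where it lands, and whether it is writable, depends on the SAS session.

```sas
/* No FILE statement: each observation is written to the log. */
data _null_;
  set sashelp.class;
  put name $10. age 3. height 6.1;
run;

/* With FILE: the same lines go to an external file instead. */
data _null_;
  set sashelp.class;
  file 'class_report.txt';   /* hypothetical path */
  put name $10. age 3. height 6.1;
run;
```

data _null_ is used so that no output dataset is created; the step exists only for its PUT side effect.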

PROC FORMAT: Allows you to create custom formats.

• Syntax:

      PROC FORMAT;
        VALUE $chformatname value_list1='text1' value_list2='text2' ... ;
        VALUE numformatname range1='text1' range2='text2' ... ;
      RUN;

• Can define character formats (1st VALUE statement above) or numeric formats (2nd VALUE statement).
• Each possible value of a variable can be given a distinct text label, or values can be grouped together via ranges or lists and each group given a distinct text label.
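A minimal sketch of both kinds of VALUE statement, applied to sashelp.class (assumed available). The format names $sexfmt and agegrp and their label text are made up for illustration.

```sas
proc format;
  /* Character format: one label per distinct value. */
  value $sexfmt 'F'='Female'
                'M'='Male';
  /* Numeric format: ranges grouped under one label each. */
  value agegrp  low-12 ='12 and under'
                13-14  ='13 to 14'
                15-high='15 and over';
run;

/* Defining a format does nothing until it is assigned with FORMAT. */
proc print data=sashelp.class;
  var name sex age;
  format sex $sexfmt. age agegrp.;
run;
```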

• See Example #4 (Enrollment Revisited) in Lec4Examps.sas.
• Here we create formats to be used for term, which is a character variable taking values like '201202'. Such a value means semester 02, which is spring semester, in the year 2012.
• The new format $semester. gives more meaningful labels to each distinct value of term.
• The new format $AY. groups terms together into academic years, and then gives each group a meaningful label.
• PROC FORMAT creates these new formats and makes them available to be assigned to any variable for which we think they are appropriate.
• It is not until the FORMAT statement is executed, though, that they are assigned to the variable term.
• Note that if a format is used to group the values of a variable, then as long as that format is in force, the formatted values (i.e., the groups) are used, not the values of the variable(!), for BY processing, for contingency tables in PROC FREQ, and for forming factor levels in a CLASS statement (PROCs ANOVA, GLM, etc.).
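The grouping effect is easy to see on a small dataset. The sketch below does not reproduce the $semester. or $AY. formats from Example #4; it assumes sashelp.class and a made-up numeric format agegrp, and shows that PROC FREQ and a CLASS statement work with the formatted groups rather than the individual ages.

```sas
proc format;
  value agegrp low-12 ='12 and under'
               13-14  ='13 to 14'
               15-high='15 and over';
run;

/* One row per age group, not one row per distinct age. */
proc freq data=sashelp.class;
  tables age;
  format age agegrp.;
run;

/* CLASS levels are formed from the formatted values as well. */
proc means data=sashelp.class mean;
  class age;
  var height;
  format age agegrp.;
run;
```

Removing the FORMAT statements returns both PROCs to one level per distinct age.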

PROC MEANS

Uses:
• Computing summary/descriptive statistics on numeric variables over the entire dataset or separately within BY groups.
• Writing those summary statistics to an output dataset for further manipulation and analysis.
• Implementing some simple inferential techniques (confidence interval for the mean, one-sample t-test).

Syntax:

      PROC MEANS data=dsname <stat_keywords> <options>;
        BY by_var_list;
        VAR var_list;
      RUN;

• Computes summary statistics for each variable on var_list over the whole dataset, or if a BY statement is present, separately within each group defined by the by_var_list.
• Which summary statistics it computes is determined by the list of stat_keywords.
• The CLASS statement is similar to BY, but the data need not be sorted first.
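A sketch of PROC MEANS with explicit statistic keywords and a CLASS statement, run on sashelp.class (assumed available). Because CLASS is used rather than BY, no PROC SORT step is needed first.

```sas
/* n, mean, std, median are statistic keywords; maxdec= limits printed decimals. */
proc means data=sashelp.class n mean std median maxdec=2;
  class sex;
  var height weight;
run;
```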

Statistics Keywords – Many to choose from:

      CLM                  CSS                  CV
      KURTOSIS             LCLM / UCLM          MAX
      MEAN                 MIN                  N
      NMISS                RANGE                SKEWNESS
      STDDEV or STD        STDERR (of mean)     SUM
      SUMWGT               USS                  VAR
      MEDIAN or P50        P1/P5/P10/P90/P95/P99
      Q1 or P25 / Q3 or P75                     QRANGE
      PROBT                T

• Options:
  • NOPRINT – suppresses output; useful if you want to write summary stats to a new dataset, but don’t want the results in the output window.
  • ALPHA= – sets the significance level α for confidence limits on the mean (confidence level 100(1−α)%) if you request them with the CLM, LCLM, or UCLM keywords.
• If no Statistics Keywords are specified, PROC MEANS produces the mean, standard deviation, min, max, and sample size (N).
• Example 5 in Lec4Examps.sas is a simple example with default results.
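To illustrate how ALPHA= interacts with the confidence-limit keywords, this sketch requests 90% limits on mean height in sashelp.class (assumed available).

```sas
/* clm requests two-sided confidence limits for the mean;   */
/* alpha=0.10 makes them 90% limits (the default is 0.05).  */
proc means data=sashelp.class n mean stderr clm alpha=0.10;
  var height;
run;
```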

OUTPUT Statement: Used to write summary statistics to a dataset for further analysis. Especially useful with BY groups.

Syntax: output out=dsname keywrd1(varlist1)=namlist1 keywrd2(varlist2)=namlist2 ...;

E.g. (see Example #6 in Lec4Examps.sas):

      proc means data=sasdata.cars noprint;
        by type;
        var city_mpg hwy_mpg passengers;
        output out=car_summaries
               mean(city_mpg hwy_mpg)=city_mn hwy_mn
               range(city_mpg hwy_mpg)=city_rg hwy_rg
               mode(passengers)=pass_md;
      run;

• Here, no printed output is generated, but a new dataset, car_summaries, is made with means and ranges of city_mpg & hwy_mpg, and the mode of passengers. These are computed for each car type (compact, midsize, etc.).
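A self-contained variant of the same pattern, sketched on sashelp.cars (assumed available, with variables type, mpg_city, and mpg_highway) rather than the course's sasdata.cars. The names cars_sorted and mpg_summary are hypothetical.

```sas
/* BY processing requires the data to be sorted by type first. */
proc sort data=sashelp.cars out=cars_sorted;
  by type;
run;

proc means data=cars_sorted noprint;
  by type;
  var mpg_city mpg_highway;
  output out=mpg_summary
         mean(mpg_city mpg_highway)=city_mn hwy_mn
         range(mpg_city mpg_highway)=city_rg hwy_rg;
run;

/* One row per type; _TYPE_ and _FREQ_ are added automatically. */
proc print data=mpg_summary;
run;
```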

FEV1 Example – A Paired t-test

To assess respiratory health, the volume and speed of air flow when breathing is measured with a spirometer. One important measure is FEV1 (forced expiratory volume in 1 second). N=15 subjects who work in a job exposed to smoke are measured with a spirometer in the morning (Pre-shift) and evening (Post-shift). We’d like to test whether mean FEV1 differs after a day at work.

• This can be done with a paired t-test in PROC MEANS.
• Complication: To get FEV1, subjects are asked to try (blow into a tube) several times. Once 3 “good blows” are obtained, the max value is taken as the official measurement of FEV1.
• So to get each subject’s Pre and Post measurement, we first need to find the max of several trials. We’ll also find and analyze the mean of the trials, to see how that summary measure behaves.
• See Example #7 in Lec4Examps.sas.
• Conclusion: Using a two-sided alternative, we reject the null hypothesis of no Pre vs. Post difference at α=.05 if we use the max FEV1 value to summarize the trials. FEV1 tends to decrease from Pre- to Post-shift. Mean diff = -10.53, 95% CI: (-20.11, -0.96).
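The general shape of this analysis can be sketched as follows. This is not the code from Example #7: the dataset layout (one row per subject with three Pre and three Post trials), the variable names, and all data values are made up for illustration. The paired t-test is carried out as a one-sample t-test on the Post minus Pre differences.

```sas
/* Hypothetical data: subject id, 3 pre-shift trials, 3 post-shift trials. */
data fev;
  input subject pre1-pre3 post1-post3;
  /* Summarize the trials two ways: max and mean. */
  diff_max  = max(of post1-post3)  - max(of pre1-pre3);
  diff_mean = mean(of post1-post3) - mean(of pre1-pre3);
  datalines;
1 402 415 410 398 405 401
2 350 362 358 349 351 340
3 478 470 481 466 472 469
4 391 388 395 392 397 390
5 430 441 436 421 428 425
;
run;

/* T and PROBT test H0: mean difference = 0 (two-sided); */
/* CLM gives the matching confidence interval.            */
proc means data=fev n mean stderr t probt clm alpha=0.05;
  var diff_max diff_mean;
run;
```

Computing the difference in a data step first is what turns the one-sample test in PROC MEANS into a paired comparison.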

PROC FREQ

Uses:
• Counting distinct values of variables or combinations of variables and summarizing those frequency and joint frequency distributions in contingency tables. Most useful for discrete or categorical data.
• Computing summary statistics (e.g., measures of association) and inferential statistics (e.g., hypothesis tests) on categorical data.
• Writing frequency distributions to output datasets for further analysis.

Syntax:

      PROC FREQ data=dsname <options>;
        TABLES table_spec1 table_spec2 / <options> <out=outdsname>;
      RUN;

• More than one contingency table can be requested in a single TABLES statement.
• More than one TABLES statement can appear in a single PROC FREQ.
• table_spec denotes the contingency table to create. A K-way table is specified as var1*var2*…*varK.
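A short sketch of the syntax in use, on sashelp.cars (assumed available, with variables origin, type, and drivetrain). The output dataset name type_by_drive is hypothetical.

```sas
proc freq data=sashelp.cars;
  /* One-way frequency table */
  tables origin;
  /* Two-way table with a chi-square test; out= saves the cell counts */
  tables type*drivetrain / chisq out=type_by_drive;
run;

proc print data=type_by_drive;
run;
```

When a TABLES statement requests several tables, out= stores only the last one, so it is often clearest to give each saved table its own TABLES statement.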

• See Example #8 in Lec4Examps.sas – Freq Distributions for C.R. Car Data.
• The 1st TABLES statement requests two 2-way contingency tables:

      tables domestic*man_trans type*passengers / chisq;

• Reading the 1st table (domestic by man_trans):
  • Key to the 4 entries in each cell: the 1st entry is the raw frequency, the 2nd is the overall percentage, the 3rd is the row percentage, and the 4th is the column percentage.
  • Marginal distribution for domestic: 48 out of the 93 cars, or 51.61%, were domestic.
  • Marginal distribution for man_trans: manual transmission is available on 61 of 93 models.
  • Cell for domestic cars with manual transmission available (frequency 22):
    • 22 is 23.66% of 93: 23.66% of all the cars are domestic with manual transmission available.
    • 22 is 45.83% of 48: 45.83% of domestic car models offer manual transmissions.
    • 22 is 36.07% of 61: 36.07% of cars that offer manual transmissions are domestic.
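The three percentages in a cell are just the cell count divided by three different totals. As a quick check of the arithmetic above, this data step recomputes them from the counts quoted in the table (22, 48, 61, 93); the variable names are made up.

```sas
data _null_;
  cell      = 22;   /* domestic cars with manual transmission available */
  row_total = 48;   /* all domestic cars                                */
  col_total = 61;   /* all cars with manual transmission available      */
  n         = 93;   /* all cars                                         */

  percent = 100 * cell / n;           /* overall percentage, about 23.66 */
  row_pct = 100 * cell / row_total;   /* row percentage, about 45.83     */
  col_pct = 100 * cell / col_total;   /* column percentage, about 36.07  */

  put percent= 6.2 row_pct= 6.2 col_pct= 6.2;   /* written to the log */
run;
```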

PROC STANDARD

Uses:
• Centers and/or rescales variables so that they have a given mean and SD.
• Standardized variables are useful for many purposes (e.g., in multiple regression).
• Can be used to replace missing values with a computed mean or specified value.

Syntax:

      PROC STANDARD data=indata out=outdata <mean=mn> <std=sd>;
        BY by_vars;
        VAR var_list;
      RUN;

• Creates a new dataset (outdata) that is a copy of indata with each of the variables on var_list replaced by standardized versions of themselves.
• If the BY statement is omitted, the standardized variables have sample mean mn and sample SD sd when computed over the entire dataset.
• If the BY statement is used, the standardized variables have sample mean mn and sample SD sd when computed separately within each BY group.
• REPLACE option replaces missing values with a computed mean or given value.
• By default, PROC STANDARD produces no printed output.
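A sketch of standardizing to mean 0 and SD 1 (z-scores) on sashelp.class (assumed available), followed by a check with PROC MEANS. The output name class_std is hypothetical.

```sas
/* height and weight are replaced by their standardized versions in class_std. */
proc standard data=sashelp.class out=class_std mean=0 std=1;
  var height weight;
run;

/* Check: means should be 0 and SDs 1, up to rounding error. */
proc means data=class_std mean std maxdec=4;
  var height weight;
run;
```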

PROC RANK

Uses:
• Computes ranks for specified variables and creates a new dataset that includes all variables in the input dataset plus new variables containing the ranks.
• Ranks are useful for many nonparametric procedures.

Syntax:

      PROC RANK data=indata out=outdata <DESCENDING> <TIES=tie_meth>;
        BY by_vars;
        VAR var_list;
        RANKS rank_vars;
      RUN;

• Creates a new dataset (outdata) that is a copy of indata with ranks of each variable on var_list in new variables with names given by rank_vars.
• If a BY statement is included, variables are ranked within each BY group; otherwise they are ranked over the entire dataset.
• DESCENDING option ranks from high (rank 1) to low (rank N). Default is low to high.
• TIES option controls how ties are ranked.
• PROC RANK produces no printed output.
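A sketch of PROC RANK on sashelp.class (assumed available). The names class_ranked and height_rank are hypothetical; ties=low is one of the available tie-handling methods and gives tied values the smallest of their ranks.

```sas
/* Rank height from tallest (rank 1) downward. */
proc rank data=sashelp.class out=class_ranked descending ties=low;
  var height;
  ranks height_rank;   /* new variable holding the ranks */
run;

/* PROC RANK prints nothing itself, so list the result. */
proc sort data=class_ranked;
  by height_rank;
run;

proc print data=class_ranked;
  var name height height_rank;
run;
```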