280 likes | 381 Vues
E-Compare is a SAS tool designed for thorough data review, identification of changes, and comparison of datasets. Learn the goals, basics of Proc.Compare, and step-by-step instructions for implementing E-Compare in your data management process.
E N D
Axio Research E-Compare A Tool for Data ReviewBill Coar
Motivation • Consider the case when programming with near final data • Begin running some standard validation checks • Identify problem records and request changes • Desire to know all changes are made, and no unexpected changes occurred
Motivation • Consider the case where you receive accumulating data throughout the life of a project • In each iteration, some data has already been reviewed and queried • For subsequent reviews • Wish to know the requested changes were made • Only review data that is new Goal is to develop a tool using SAS to assist in these areas of data review
Outline • Identify the goals of the tool (E-compare) • Introduction and steps of E-compare • Look at some examples • Extension to comparing many datasets • Final remarks
Goals • Based on needs of data management group and clinical scientists • Identify new records • Identify which records were changed • Review new values versus old values • Identify records that did not change • Identify records that were deleted
Proc Compare • Compares (two) datasets (based on key variables) • Base versus compare • Identify attributes that differ • Identify variables\records in one but not the other • Allows for variable names to differ but values be compared • Can set tolerances for defining what is really “different” • Many other procedure options to assist
Basics of Proc Compare Proc compare base=basedatacompare=compddatalistvarlistobs; id key variables; var var1 var2 var3; with ovar1 ovar2 ovar3; Run; In preparing for this presentation, I found the TRANSPOSE option that might help!
Proc Compare • Pros • Displays a lot of relevant information • Fairly straightforward • Cons • Not always easy to read • Amount of text that gets displayed for differences • Non-SAS users seem to be intimidated by it
Introduction to E-compare • Idea originated from talking with data managers and clinical scientists • Different group with different needs • Many not comfortable working within SAS • Excel • Review listings • Desire for repeatability • Extend to many datasets • D-compare
Introduction to E-compare • Parameters: • Base data, compare data, key variables, variables to compare (optional), output data, debugging indicator • Assumes the same data structure, and that the key variables exist • Uniqueness identified by key variables • Output is a SAS dataset with essentially the same structure as the input datasets • One additional flag to identify the results of the compare
Steps in E-Compare • Sorting and creating working copies of input datasets • Check for uniqueness based on key variables • First. and last. on the last key variable • Check both the base and compare datasets • If there are records with duplicate key variables • Print a message in the output and log • Goto the end of the macro to stop execution %goto NOEXEC; . . %NOEXEC: %mend;
Steps in E-Compare • Merge on key variables, create 3 datasets • NEW records (zz_newrecs) • DELETED records (zz_delrecs) • Records in BOTH datasets needed to identify differences (zz_both) • Perform proc compare • ID key variables • Default compares all variables • Obtain the output dataset using OUT= and OUTNOEQUAL options
Steps in E-Compare Straight-forward merge… data zz_newrecszz_delrecszz_both; merge zz_comp(in=a keep=&keyvar) zz_base(in=b keep=&keyvar); by &keyvar; if a and ^b then output zz_newrecs; if b and ^a then output zz_delrecs; if (a and b) then output zz_both; run;
Steps in E-Compare Straight-forward proc compare proc compare base=zz_base compare=zz_comp out=zz_coutnoprintoutnoequal; id &keyvar; %if &compvar ne ALL %then %do; var &compvar; %end; run;
Steps in E-Compare • If a record changed, it is in the output data (zz_cout) from proc compare due to the OUTNOEQUAL option • Merge various datasets on key variables • Identify records that did not change • Remerge ZZ_COUT with ZZ_BOTH to obtain records that did not change • For records that did change • Remerge ZZ_COUT with ZZ_BASE to obtain old values • Remerge ZZ_COUT with ZZ_COMP to obtain new values
Steps in E-Compare • Set 5 datasets together and define flags using the in= option • 1 - No change • 2 - Change from • 3 - Change to • 4 - New record • 5 - Deleted record • Clean up work space by deleting interim data, unless • DEBUG option is specified to be TRUE
Steps in E-Compare Basic set statement… data &out; set zz_nodiff(in=a) zz_diffbase(in=b) zz_diffcomp(in=c) zz_newcomp(in=e) zz_delbase(in=f) ; by &keyvar; length zz_compflg $15; if a then zz_compflg='1 - No Change'; else if b then zz_compflg='2 - Change From'; else if c then zz_compflg='3 - Change To'; else if d then zz_compflg=‘4 - Rec Added'; else if e then zz_compflg=‘5 - Rec Deleted'; label zz_compflg='Per record comparison'; Run;
Steps in E-Compare Some cleaning up of the work space… %if &debug=F %then %do; proc datasets library=work nodetailsnolist; delete zz_: / memtype=data; quit; %end;
Steps in E-Compare • Note about DEBUG • If macro does not execute because of non-uniqueness in key variables, set DEBUG=TRUE • This does not delete the working datasets • Allows one to identify the problem records using a viewtable
E-compare • What E-compare does not do: • Does not identify the variable that changed • Does not indicate if the attributes of a variable change • Does not actually generate a report • Generation of a report can be added, but… • This component was considered in extending E-compare to all corresponding datasets in two libraries allowing for a single output • Proc report or export to Excel • This part is defined by the needs of the users
E-compare Example Output • Creation of RTF via Proc Report and ODS • Creation of Excel file via SAS Access to PC File formats or ODBC Consider repeating E-compare on all datasets in two libraries
Schematic of D-compare with Excel Output • Use proc contents output to obtain information about datasets in each • Identify mismatches (in one library but not the other) • Subset using a list of datasets to exclude • Obtain a list of datasets for looping
Schematic of D-compare with Excel Output • Check if the Excel file exists (may need to delete) • For each iteration, identify key variables from a proc format and %sysfunc • For each iteration, perform E-compare • For each iteration, update the Excel file • Select records to include • SAS\Access to PC File Formats • SAS\Access to ODBC %let kvars=%sysfunc(putc(&&MEM&I,$fmtname.));
D-compare with Excel Output • Proc export • Requires SAS\Access to PC File Formats • Specify the SHEET to have the name of the dataset being compared • Appends to the excel file if it exists proc export data=zz_fnloutfile="&OUTFILE" DBMS=excel; sheet="&&MEM&I."; run;
D-compare with Excel Output • Export using a data step and ODBC • Requires SAS\Access to ODBC • libname prior to iteration through each dataset • Data step to append within each iteration LIBNAME _lbxlsodbc NOprompt= "dsn=Excel Files; Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; dbq=&OUTFILE"; DATA _lbxls.&&MEM&I; SET zz_fnl; run;
E-compare Example Output • Creation of Excel file via SAS Access to PC File formats or ODBC
Conclusions • E-compare is just a different way of looking at Proc Compare results • Provides the ability to monitor data as changes are applied to the central database • Reports can be printed or saved to assist in documentation • Strict data structures allow for simplification across studies
Conclusions Any Question?