1 / 23

Preventing data loss when combining SAS datasets.

Preventing data loss when combining SAS datasets. John Cantrell Univ. North Carolina Chapel Hill, NC. SAS has powerful tools for dataset combination. Proc SQL SAS data step. SAS data step. Datastep methods use SET, MERGE, and BY statements. The workhorses:. Matched merge & concatenation.

baruch
Télécharger la présentation

Preventing data loss when combining SAS datasets.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Preventing data loss when combining SAS datasets. John Cantrell Univ. North Carolina Chapel Hill, NC

  2. SAS has powerful tools for dataset combination • Proc SQL • SAS data step

  3. SAS data step • Datastep methods use SET, MERGE, and BY statements

  4. The workhorses: • Matched merge & concatenation. • Data twothree; merge two three;run; • Forgot BY statement, yields 1 to 1 merge instead of matched merge

  5. LOG: errors, warnings & notes • Unintentional 1 to 1 merge caught with “options mergenoby=error;” • Error messages in the LOG help to find mistakes • Not all problems yield error messages: the note, “repeats of BY values” > many-to-many merge

  6. LOG does not flag all mistakes • A serious problem can occur without anymessage in the LOG • For example, data loss by truncation due to unequal variable lengths

  7. Example: Concatenation • Concatenation of 2 datasets named two and three, each input dataset contains: • A numeric variable named id. • A character variable named var.

  8. Example: input datasets:

  9. Example: variable “var” • two:var and three:var arenot the same • two:var has a length of $2, all values begin with the character “2” • three:var has a length of $3, all values begin with the character “3”

  10. Example: the datastep • data twothree; set two three;run; • Note: the output dataset name indicates the order of input datasets in SET statement.

  11. Dataset twothree

  12. Example: data loss • In output: twothree:id=1 var =21 and var=31 • In inputs: two:id=1 var=21 three:id=1 var=31a • Value of var in twothree that came from three lacks the ‘a’, Why?

  13. Explanation: • Variable length in output is determined by variable length in the first (leftmost) input dataset in the SET statement, two:var $2. • 31a does not fit into a character variable of length $2, so the ‘a’ is lost through truncation. • LOG does not mention data loss.

  14. Example: another datastep • What would happen if we reverse the order of the datasets in the SET statement? • data threetwo; set three two;run;

  15. Dataset threetwo

  16. Example: Matched merge • Matched merge: data from last dataset overwrites data from first dataset. • Usually re-name variables with same name to avoid overwriting (except BY variables). • Maybe you want matched values to overwrite, so you do not rename.

  17. Example: Matched merge • data twothree; merge two three; by id;run;

  18. Example: dataset twothree

  19. Length of variable: • Determined by first, or left most, dataset in MERGE statement. • Again, the LOG is silent on data loss through truncation. • What to do?

  20. Solutions: • Complain to SAS • Until SAS corrects, use macro before merge or concatenation to check for unequal character variable lengths • If found, re-size variables

  21. Macro: verifyVariables.sas • Invoke before merge or concatenation. • %verifyVariables(two,three,set);data twothree; set two three;run;

  22. How does it work? • proc contents on two and three, outputs results to datasets • Merges datasets and uses datastep to find like named character variables • It they have different lengths, puts ERROR statement into log.

  23. Macro code and slides: • On website: http://www.unc.edu/~jcantrel/ • Macro in paper. • John Cantrell University of North Carolina Chapel Hill, NC 27599-7411 (919)843-6495 john_cantrell@unc.edu

More Related