Download
chapter 22 reading hierarchical files n.
Skip this Video
Loading SlideShow in 5 Seconds..
Chapter 22 Reading Hierarchical Files PowerPoint Presentation
Download Presentation
Chapter 22 Reading Hierarchical Files

Chapter 22 Reading Hierarchical Files

108 Views Download Presentation
Download Presentation

Chapter 22 Reading Hierarchical Files

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Chapter 22 Reading Hierarchical Files Reading Hierarchical Raw Data Files

  2. Objectives • Read data with mixed record types • Read a hierarchical file and create one observation per detail record. • Read a hierarchical file and create one observation per header record.

  3. Mixed Record Types Not all records have the same format. 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Multiple INPUT statements are needed using conditional statement to control. ...

  4. Desired Output Sales Sale ID Location Date Amount 101 USA 14264 3295.50 3034 EUR 14274 1876.30 101 USA 14274 2938.00 128 USA 14280 2908.74 1345 EUR 14281 3145.60 109 USA 14320 2789.10

  5. The INPUT Statement • Multiple INPUT statements are needed for different formats of the same variable: input SalesID $ Location $; if Location='USA' then input SaleDate : mmddyy10. Amount; else if location='EUR' then input SaleDate : date9. Amount : commax8.;

  6. The INPUT Statement NOTE: 6 records were read from the infile 'sales.dat'. The minimum record length was 24. The maximum record length was 26. NOTE: The data set WORK.SALES has 3 observations and 4 variables. NOTE: This is NOT correct. We suppose have 6 cases (6 observations), not 3. ...

  7. Undesirable Output Sales Sale ID Location Date Amount 101 USA . . 101 USA . . 1345 EUR . . NOTE: This is NOT correct. We have 6 cases (6 observations), not 3. Besides, all the Sale date and Amount are missing! ...

  8. The raw data: 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 The program: input SalesID $ Location $; if Location='USA' then input SaleDate : mmddyy10. Amount; else if location='EUR' then input SaleDate : date9. Amount : commax8.; The output: Sales Sale ID Location Date Amount 101 USA . . 101 USA . . 1345 EUR . . NOTE: Each INPUT statement reads a new case (observation), based on the IF condition.

  9. Use The Single Trailing @ to control reading the same case requiring more than one INPUT statement • The single trailing @ option holds a raw data record in the input buffer until SAS • executes an INPUT statement with no trailing @, or • reaches the bottom of the DATA step. • General form of an INPUT statement with the single trailing @: INPUTvar1 var2 var3 … @;

  10. Processing the Trailing @ Hold record for nextINPUT statement. Load next record. input SalesID $ Location $ @; if location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.;

  11. Compile Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer AMOUNT SALESID LOCATION SALEDATE data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; PDV ...

  12. Execute Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; PDV AMOUNT SALESID LOCATION SALEDATE . . ...

  13. Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; 1 0 1 U S A 1 - 2 0 - 1 9 9 9 3 2 9 5 . 5 0 PDV AMOUNT SALESID LOCATION SALEDATE . . ...

  14. Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; True Hold record. 1 0 1 U S A 1 - 2 0 - 1 9 9 9 3 2 9 5 . 5 0 PDV AMOUNT SALESID LOCATION SALEDATE . . 101 USA ...

  15. Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; 1 0 1 U S A 1 - 2 0 - 1 9 9 9 3 2 9 5 . 5 0 PDV AMOUNT SALESID LOCATION SALEDATE . . 101 USA 14264 3295.50 ...

  16. Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer Write out observation to sales. data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; Implicit output 1 0 1 U S A 1 - 2 0 - 1 9 9 9 3 2 9 5 . 5 0 PDV AMOUNT SALESID LOCATION SALEDATE . . 101 USA 14264 3295.50 ...

  17. Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; Implicit return 1 0 1 U S A 1 - 2 0 - 1 9 9 9 3 2 9 5 . 5 0 PDV AMOUNT SALESID LOCATION SALEDATE . . 101 USA 14264 3295.50 ...

  18. Raw Data File 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10 Input Buffer data sales; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; run; Continue processinguntil end of the raw data file. 1 0 1 U S A 1 - 2 0 - 1 9 9 9 3 2 9 5 . 5 0 PDV AMOUNT SALESID LOCATION SALEDATE . . ...

  19. Mixed Record Types Partial Log NOTE: 6 records were read from the infile 'sales.dat'. The minimum record length was 24. The maximum record length was 26. NOTE: The data set WORK.SALES has 6 observations and 4 variables.

  20. Mixed Record Types proc print data=sales noobs; run; PROC PRINT Output Sales Sale ID Location Date Amount 101 USA 14264 3295.50 3034 EUR 14274 1876.30 101 USA 14274 2938.00 128 USA 14280 2908.74 1345 EUR 14281 3145.60 109 USA 14320 2789.10

  21. Subsetting from a Raw Data File This scenario uses the raw data file from the previous example. 101 USA 1-20-1999 3295.50 3034 EUR 30JAN1999 1876,30 101 USA 1-30-1999 2938.00 128 USA 2-5-1999 2908.74 1345 EUR 6FEB1999 3145,60 109 USA 3-17-1999 2789.10

  22. Desired Output The sales manager wants to see sales for the European branch only. Sales Sale ID Location Date Amount 3034 EUR 14274 1876.30 1345 EUR 14281 3145.60

  23. The Subsetting IF Statement data europe; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='USA' then input SaleDate : mmddyy10. Amount ; else if Location='EUR' then input SaleDate : date9. Amount : commax8.; if Location='EUR'; run; This is okay, but not efficient. It reads the entire data first, then select EUR location.

  24. The Subsetting IF Statement • The subsetting IF should appear as early in the program as possible but after the variables used in the condition are calculated. • In this case, we should read only the EUR cases by adding the IF statement right after reading Location.

  25. The Subsetting IF Statement data europe; length SalesID $ 4 Location $ 3; infile 'raw-data-file'; input SalesID $ Location $ @; if Location='EUR'; input SaleDate : date9. Amount : commax8.; run; Because the program reads only European sales, the INPUT statement for USA sales is not needed.

  26. The Subsetting IF Statement proc print data=europenoobs; run; Sales Sale ID Location Date Amount 3034 EUR 14274 1876.30 1345 EUR 14281 3145.60

  27. Processing Hierarchical Files • Many files are hierarchical in structure, consisting of • a header record • one or more related detail records. • Typically, each record contains a field that identifies whether it is a header record or a detail record. Header Detail Detail Header Header Detail Header Detail Detail

  28. Processing Hierarchical Files You can read a hierarchical file into a SAS data set by creating one observation per detail record and storing the header information as part of each observation. SAS Data Set Detail Variables Detail 1 Detail 2 Detail 3 Detail 1 Detail 1 Detail 2 Hierarchical File Header Variables Header 1 Header 1 Header 1 Header 2 Header 3 Header 3 Header 1 Detail 1 Detail 2 Detail 3 Header 2 Detail 1 Header 3 Detail 1 Detail 2

  29. Processing Hierarchical Files You can also create one observation per header record and store the information from detail records in summary variables. Hierarchical File SAS Data Set Header 1 Detail 1 Detail 2 Detail 3 Header 2 Detail 1 Header 3 Detail 1 Detail 2 Header Variables Header 1 Header 2 Header 3 Summary Variables Summary 1Summary 2Summary 3

  30. Creating One Observation Per Detail E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S The raw data file dependentshas a header record containing the name of the employee and a detail recordfor each dependenton the employee’s health insurance. E: Employee, D: Dependent C: Child, S: Spouse Each data value is separated by :

  31. Desired Output Personnel wants a list of all the dependents and the name of the associated employee. EmpLName EmpFName DepName Relation Adams Susan Michael C Adams Susan Lindsay C Porter David Susan S Lewis Dorian D. Richard C Nicholls James Roberta C Slaydon Marla John S

  32. A Hierarchical File E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S • Not all the records are the same. • The fields are separated by colons. • There is a field indicating whether the record is a header or a detail record.

  33. How to Read the Hierarchical Data input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else input DepName $ Relation $;

  34. How to Output Only the Dependents Try the following program. Observe what is wrong with the result. input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end;

  35. Compile Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm= ':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S ...

  36. Execute Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S ...

  37. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S E : A d a m s:S u s a n ...

  38. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S Hold record. E : A d a m s :S u s a n E ...

  39. True Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S E : A d a m s :S u s a n E ...

  40. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S E : A d a m s :S u s a n E Adams Susan ...

  41. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S No implicit output E : A d a m s :S u s a n E Adams Susan ...

  42. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S Implicit return E : A d a m s :S u s a n E Adams Susan ...

  43. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S E : A d a m s :S u s a n Reinitialize PDV. ...

  44. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S E : A d a m s :S u s a n ...

  45. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S D : M i c h ea l :C ...

  46. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S Hold record. D : M i c h ae l :C D ...

  47. False Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S D : M i c h ae l :C D ...

  48. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S D : M i c h ae l :C D Michael C ...

  49. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION Write out observation to dependents. data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S Explicit output D : M i c h ae l :C R D Michael C ...

  50. Input Buffer D TYPE EMPLNAME EMPFNAME DEPNAME RELATION data dependents(drop=Type); length Type $ 1 EmpLNameEmpFName DepName $ 20 Relation $ 1; infile 'raw-data-file' dlm=':'; input Type $ @; if Type='E' then input EmpLName $ EmpFName $; else do; input DepName $ Relation $; output; end; run; E:Adams:Susan D:Michael:C D:Lindsay:C E:Porter:David D:Susan:S E:Lewis:Dorian D. D:Richard:C E:Dansky:Ian E:Nicholls:James D:Roberta:C E:Slaydon:Marla D:John:S Implicit return D : M i c h ae l :C R D Michael C ...