Download
chapter 18 reading free format data n.
Skip this Video
Loading SlideShow in 5 Seconds..
Chapter 18 Reading Free-Format Data PowerPoint Presentation
Download Presentation
Chapter 18 Reading Free-Format Data

Chapter 18 Reading Free-Format Data

222 Views Download Presentation
Download Presentation

Chapter 18 Reading Free-Format Data

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Chapter 18Reading Free-Format Data

  2. Objectives • Read free-format data not recognized in fixed fields. • Read free-format data separated by non-blank delimiters, such as commas. • Read a raw data file with missing data (at the end middle or beginning of a record). • Read character values exceeding 8 characters. • Read nonstandard free-format data. • Read character values containing embedded blanks.

  3. What is FREE-FORMAT data • The data values not arranged in fixed fields. • Data values separated by blanks or some specific delimiters. • Numeric data values that are not in standard format. Issues that need special attention when reading free-format data: • How to handle missing data in free-format data set? • The danger of incorrect variable length. • How to handle data values with quotation marks? • Informats used in Formatted Input are not the same when reading free-format data values.

  4. List Input with the Default Delimiter(Blank is the Default Delimiter) • The data is not in fixed columns. • The fields are separated by spaces. • There is one nonstandard field. 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520

  5. LIST INPUT and its variations To read a free-format data, the simplest INPUT is by using LIST INPUT. The general Syntax: INPUT variable <$> ; • Variable is the variable name to be read. • $ specifies character variable. NOTE: • The list input style signals to the SAS System that fields are separated by delimiters. • SAS then reads from non-delimiter to delimiter instead of from a specific location on the raw data record.

  6. IMPORTANT CONDITIONS for LIST Input: • All fields must be separated by at least one blank. • Fields must be read sequentially from left to right • Can not skip or re-read fields. • Missing data for character variable must be specified using user-defined missing (can not use blank as missing, since Blank is the delimiter. • Missing data for numeric must be specified using ‘. ‘ Or other user-defined missing (can not use blank for numeric missing).

  7. blanks commas tab characters Delimiters Common delimiters are A space (blank) is the default delimiter.

  8. 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 Input Data involving Date, Time • The second field is a date. How does SAS store dates?

  9. Standard Data • The term standard data refers to character and numeric data that SAS recognizes automatically. • Some examples of standard numeric data include • 35469.93 • 3E5 (exponential notation) • -46859. • Standard character data is any character you can type on your keyboard. Standard character values are always left-justified by SAS.

  10. Nonstandard Data • The term nonstandard data refers to character and numeric data that SAS does not recognize automatically. • Examples of nonstandard numeric data include • 12/12/2012 • 29FEB2000 • 4,242 • $89,000.

  11. Informats • To read in nonstandard data, you must apply an informat. • General form of an informat: • Informats are instructions that specify how SAS reads raw data. <$>INFORMAT-NAME<w>.<d>

  12. Informats Examples of informats are • COMMAw. reads numeric data ($4,242) and strips out selected nonnumeric characters, such as dollar signs and commas, dashes, blanks. • MMDDYYw. reads dates in the form 12/31/2012. • DATEw. reads dates in the form 29Feb2000.

  13. Reading Free-Format data with Delimiters • By default, free-format data values are separated by BLANKS. SAS reads a data value until it reaches the next blank. • Blank is not the only delimiter to separate data values. SAS allows user-specified delimiters, as long as it is not part of the data values. For example, one can use / , % ; and so on as delimiter to create the external free-format data set. • The option DLM = ‘ ‘ in the INFILE statement is needed to inform the SAS INPUT statement the delimiters used. Ex: INFILE ‘path-to-the-file’ DLM = ‘,’ ; informs the INPUT statement to read data value until comma ( , ) is reached.

  14. Example The following is an airplane data set consisting of ID, date_inservice, # of passenger capacity and # of cargo capacity LA50001,4feb1989,132, 530 PHIL50002, 11nov1989, 152 ,540 NEWYORK50003 ,22oct1991, 90, 530 CHICAGO50004, 4feb1993 ,172 ,550 DETROIT50005 ,24jun1993, 170 ,510 DALLAS50006, 20dec1994, 180, 520 The data values are separated by comma and space. How does SAS read this data set?

  15. Reading a Delimited Raw Data File data airplanes; infile'raw-data-file‘ DLM = ‘, ’; input ID $ InServicedate9. PassCapCargoCap; run;

  16. Exercise • Write a SAS program to read the following data.Variables are: Location, date # of passengers # of cargos for the flight LA50001,4feb1989,132, 530 PHIL50002,11nov1989, 152 ,540 NEWYORK50003,22oct1991 , 90, 530 CHICAGO50004,4feb1993 , 172 ,550 DETROIT50005,24jun1993, 170 ,510 DALLAS50006,20dec1994 , 180, 520 • Print the data. Save the program as c18_freeform1 to the SASEx folder in your c-drive. • Observe the results. You should notice that some data values for Location are not complete. • What is the cause of incomplete data values? • How to solve this problem?

  17. Answer data airplane; infiledatalinesdlm=', ' ; input Loc $ date date9. npasncargo; datalines; LA50001,4feb1989,132, 530 PHIL50002,11nov1989, 152 ,540 NEWYORK50003,22oct1991 , 90, 530 CHICAGO50004,4feb1993 , 172 ,550 DETROIT50005,24jun1993, 170 ,510 DALLAS50006,20dec1994 , 180, 520 ; run; proc print; format date date9. ; run;

  18. Results Obs Loc date npasncargo 1 LA50001 04FEB1989 132 530 2 PHIL5000 11NOV1989 152 540 3 NEWYORK5 22OCT1991 90 530 4 CHICAGO5 04FEB1993 172 550 5 DETROIT5 24JUN1993 170 510 6 DALLAS50 20DEC1994 180 520 What is wrong with this result? NOTE: The some of the LOC’s are not complete. NOTE: It is 8 characters. But, some of the ID’s are more than 8.

  19. Lengths of Variables read using free-format • When you use list input, the default length for character and numeric variables is 8 bytes. • You can set the length of character variables with a LENGTH statement or with an informat. • General form of a LENGTH statement: LENGTHvariable-name <$> length-specification ...;

  20. Setting the Length of a Variable data airplanes; length ID $ 15.; infile'raw-data-file‘ DLM = ‘ , ‘; input LOC $ InServicedate9. PassCapCargoCap; run;

  21. Exercise Open the program c18_freeform1, revise the program to make the data values for Location are complete.

  22. Answer data airplane; Length Loc $ 15.; infiledatalinesdlm=', ' ; input Loc $ date date9. npasncargo; datalines; LA50001,4feb1989,132, 530 PHIL50002,11nov1989, 152 ,540 NEWYORK50003,22oct1991 , 90, 530 CHICAGO50004,4feb1993 , 172 ,550 DETROIT50005,24jun1993, 170 ,510 DALLAS50006,20dec1994 , 180, 520 ; run; proc print; format date date9. ; run;

  23. Correct Results Obs LOC date npasncargo 1 LA50001 04FEB1989 132 530 2 PHIL50002 11NOV1989 152 540 3 NEWYORK50003 22OCT1991 90 530 4 CHICAGO50004 04FEB1993 172 550 5 DETROIT50005 24JUN1993 170 510 6 DALLAS50006 20DEC1994 180 520

  24. Compile Input Buffer ID $ 5 data airplanes; length ID $ 5.’; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 PDV ...

  25. Compile Input Buffer CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 PDV ...

  26. Execute data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 Input Buffer PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . ...

  27. data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 Input Buffer 5 0 0 0 1 4 f e b 1 9 8 9 1 3 2 5 3 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . ...

  28. data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 Input Buffer 5 0 0 0 1 4f eb 19 8 9 1 3 2 5 3 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 530 10627 132 50001 ...

  29. Write out observation to airplanes. data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 Implicit output Input Buffer 5 0 0 0 1 4f eb 19 8 9 1 3 2 5 3 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 530 10627 132 50001 ...

  30. data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InServicedate9. PassCapCargoCap; run; Raw Data File 50001 4feb1989 132 530 50002 11nov1989 152 540 50003 22oct1991 90 530 50004 4feb1993 172 550 50005 24jun1993 170 510 50006 20dec1994 180 520 Implicit return Input Buffer 5 0 0 0 1 4f eb 19 8 9 1 3 2 5 3 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 530 10627 132 50001 ...

  31. Using the DLM= Option in the INFILE statement • The DLM= option sets a character or characters that SAS recognizes as a delimiter in the raw data file. • General form of the INFILE statement with the DLM= option: • Any character you can type on your keyboard can be a delimiter. You can also use hexadecimal characters. INFILE 'raw-data-file' DLM='delimiter(s)';

  32. Reading Missing Values There are two situations may occur when reading a free-format data involving missing data: • Missing values at the END of a record • Missing values at the BEGINNING or MIDDLE of a record

  33. Missing Data at the End of a Record 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520

  34. Missing Data at the End of a Row • By default, when there is missing data at the end of a row, SAS will continue to read the missing data value from the next record: 1. SAS loads the next record to finish the observation. 2. A note is written to the log 3. SAS loads a new record at the top of the DATA stepand continues processing.

  35. Execute Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . ...

  36. Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 5 0 0 0 1 , 4 f e b 1 9 89, 1 3 2 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . ...

  37. Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 No data 5 0 0 0 1 , 4 f e b 1 9 89, 1 3 2 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 10627 132 50001 ...

  38. Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 5 0 002 , 1 1 n ov19 8 9,1 5 2 , 5 40 SAS loadsnext record. PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 50001 10627 132 50002 ...

  39. Input Buffer Write out observation to airplanes. data airplanes; length ID $ 5; infile'raw-data-file' dlm=','; input ID $ InService : date9. PassCapCargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 Implicit output 5 0 002 , 1 1 n o v1 9 8 9,1 5 2 , 5 4 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 50002 10627 132 50001

  40. Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 Implicit return 5 0 002 , 1 1 n o v1 9 8 9,1 5 2 , 5 4 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . 50002 10627 132 50001 ...

  41. Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 5 0 0 0 3 , 2 2 o c t1 9 9 1 ,9 0 , 5 3 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . ...

  42. Input Buffer data airplanes3; length ID $ 5; infile 'raw-data-file' dlm=','; input ID $ InService : date9. PassCap CargoCap; run; Raw Data File 50001 , 4feb1989,132 50002, 11nov1989,152, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172 50005, 24jun1993, 170, 510 50006, 20dec1994, 180, 520 Continue processinguntil end of the raw data file. 5 0 0 0 3 , 2 2 o c t1 9 9 1 ,9 0 , 5 3 0 PDV CARGOCAP ID INSERVICE PASSCAP N $ N N 8 5 8 8 . . . ...

  43. Partial Log NOTE: 6 records were read from the infile 'aircraft3.dat'. The minimum record length was 19. The maximum record length was 26. NOTE: SAS went to a new line when INPUT statement reached past the end of a line. NOTE: The data set WORK.AIRPLANES3 has 4 observations and 4 variables.

  44. Missing Data at the End of the Row proc print data=airplanes3 noobs; run; PROC PRINT Output In Pass Cargo ID Service Cap Cap 50001 10627 132 50002 50003 11617 90 530 50004 12088 172 50005 50006 12772 180 520

  45. Use the MISSOVEROption in INFILE statement to handle missing at the end of a record • The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached. • General form of the INFILE statement with the MISSOVER option: • If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing. INFILE 'raw-data-file'MISSOVER;

  46. Using the MISSOVER Option data airplanes; length ID $ 5; infile 'raw-data-file' dlm=','missover; input ID $ InService : date9. PassCapCargoCap; run;

  47. Using the MISSOVER Option Partial SAS Log NOTE: 6 records were read from the infile 'aircraft3.dat'. The minimum record length was 19. The maximum record length was 26. NOTE: The data set WORK.AIRPLANES3 has 6 observations and 4 variables.

  48. Using the MISSOVER Option proc print data=airplanes noobs; run; PROC PRINT Output In Pass Cargo ID Service Cap Cap 50001 10627 132 . 50002 10907 152 540 50003 11617 90 530 50004 12088 172 . 50005 12228 170 510 50006 12772 180 520

  49. Missing Values at the beginning or Middle of a record There are situations where missing values occur in the beginning of a record or middle of a record. Since multiple delimiters , such as ,, is treated as a delimiter, simply using DLM = ‘,’ will not able to take care of these situations here.

  50. Missing Values without Placeholders • There is missing data represented by two consecutive delimiters. 50001 , 4feb1989,, 530 50002, 11nov1989,132, 540 50003, 22oct1991,90, 530 50004, 4feb1993,172, 550 50005, 24jun1993,, 510 50006, 20dec1994, 180, 520