300 likes | 445 Vues
Lesson 2. Topic - Reading in data Programs 1 and 2 in course notes Chapter 2 (Little SAS Book). Raw Data. Read in Data Process Data (Create new variables) Output Data (Create SAS Dataset). Data Step. Analyze Data Using Statistical Procedures. PROCs. Raw Data Sources.
E N D
Lesson 2 • Topic - Reading in data • Programs 1 and 2 in course notes • Chapter 2 (Little SAS Book)
Raw Data Read in Data Process Data (Create new variables) Output Data (Create SAS Dataset) Data Step Analyze Data Using Statistical Procedures PROCs
Raw Data Sources • You type it in the SAS program • Text file • Spreadsheet (Excel) • Database (Access, Oracle) • SAS dataset
Data in Text Files Text files are simple character files that you can create or view in a text editor like Notepad. They can also be created as “dumps” from spreadsheet files like excel. • Delimited data – variables are separated by a special character (e.g. a comma) • Fixed position – data is organized into columns
Data delimited with spaces: C 84 138 93 143 D 89 150 91 140 A 78 116 100 162 A . . 86 155 C 81 145 86 140 • Note: Missing data is identified with a period.
Data delimited with commas C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,.,.,86,155 C,81,145,86,140 • Note: Missing data is identified with a period.
Data delimited by commas(.csv file) C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,,,86,155 C,81,145,86,140 • Note: Missing data is identified by multiple commas.
Column Data C084138093143 D089150091140 A078116100162 A 086155 C081145086140 • Note: Missing data values are blank.
INFILE and INPUT Statements When you write a SAS program to read in raw data, you’ll use two key statements: • The INFILE statement tells SAS where to find the data and how it is organized. • The INPUT statement tells SAS which variables to read-in
Program 1 * List Directed Input: Reading data values separated by spaces; DATA bp; INFILE DATALINES; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C 84 138 93 143 D 89 150 91 140 A 78 116 100 162 A . . 86 155 C 81 145 86 140 ; RUN ; TITLE'Data Separated by Spaces'; PROCPRINTDATA=bp; RUN; Obs clinic dbp6 sbp6 dbpbl sbpbl 1 C 84 138 93 143 2 D 89 150 91 140 3 A 78 116 100 162 4 A . . 86 155 5 C 81 145 86 140
PARTIAL SASLOG 1 DATA bp; 2 INFILE DATALINES; 3 INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; 4 DATALINES; NOTE: The data set WORK.BP has 5 observations and 5 variables. NOTE: DATA statement used: real time 0.39 seconds cpu time 0.03 seconds
* List Directed Input: Reading data values separated by commas; DATA bp; INFILE DATALINES DLM = ',' ; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,.,.,86,155 C,81,145,86,140 ; RUN ; TITLE'Data separated by a comma'; PROCPRINTDATA=bp; RUN;
* List Directed Input: Reading .csv files DATA bp; INFILE DATALINES DLM = ','DSD ; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,,,86,155 C,81,145,86,140 ; TITLE 'Reading in Data using the DSD Option'; PROCPRINTDATA=bp; RUN; Consecutive commas indicate missing data
* List Directed Input: Reading data values separated by tabs (.txt files); DATA bp; INFILE DATALINES DLM = '09'xDSD; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C 84 138 93 143 D 89 150 91 140 A 78 116 100 162 A 86 155 C 81 145 86 140 ; TITLE'Reading in Data separated by a tab'; PROCPRINTDATA=bp; RUN;
* Column Input: Data in fixed columns. DATA bp; INFILE DATALINES ; INPUT clinic $ 1-1 dbp6 2-4 sbp6 5-7 dbpbl 8-10 sbpbl 11-13 ; DATALINES; C084138093143 D089150091140 A078116100162 A 086155 C081145086140 ; Title'Reading in Data using Column Input'; PROCPRINTDATA=bp; Note: missing data is blank
* Reading data using Pointers and Informats DATA bp; INFILE DATALINES ; INPUT @1 clinic $1. @2 dbp6 3. @5 sbp6 3. @8 dbpbl 3. @11 sbpbl 3. ; DATALINES; C084138093143 D089150091140 A078116100162 A 086155 C081145086140 ; Title'Reading in Data using Point/Informats'; PROCPRINTDATA=bp; Informats must end with a period.
Program 2 * Reading data from an external file DATA bp; INFILE‘C:\SAS_Files\bp.csv'DSD FIRSTOBS = 2; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ; TITLE'Reading in Data from an External File'; PROCPRINTDATA=bp; clinic,dbp6,sbp6,dbpbl,sbpbl C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,,,86,155 C,81,145,86,140 Content of bp.csv
PARTIAL SAS LOG 7 DATA bp; 8 INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS=2 ; 9 INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ; NOTE: The infile 'C:\SAS_Files\bp.csv' is: File Name=C:\SAS_Files\bp.csv, RECFM=V,LRECL=256 NOTE: 5 records were read from the infile 'C:\SAS_Files\bp.csv'. The minimum record length was 10. The maximum record length was 16. NOTE: The data set WORK.BP has 5 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.10 seconds cpu time 0.01 seconds
*Reading data from an external file using a FILENAME statement; FILENAME bpdata ‘C:\SAS_Files\bp.csv'; DATA bp; INFILE bpdata DSD FIRSTOBS = 2; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ; TITLE'Reading in Data Using FILENAME'; PROCPRINTDATA=bp;
* Using PROC IMPORT to read in data ; * Can skip data step; * Can also try IMPORT Wizard ; PROCIMPORT DATAFILE=‘C:\SAS_Files\bp.csv' OUT = bp DBMS = csv REPLACE ; GETNAMES = yes; GUESSINGROWS = 9999; TITLE'Reading in Data Using PROC IMPORT'; PROCPRINTDATA=bp; PROCCONTENTS DATA=bp; Uses first row for variable names
* PC SAS can read excel files directly; PROCIMPORT DATAFILE=‘C:\SAS_Files\bp.xls' OUT = bp DBMS = excel REPLACE ; GETNAMES = yes; TITLE'Reading in Data from excel'; PROCPRINTDATA=bp; PROCCONTENTS; Uses first row for variable names
The CONTENTS Procedure Data Set Name WORK.BP Observations 5 Member Type DATA Variables 5 Alphabetic List of Variables and Attributes # Variable Type Len Format Informat 1 Clinic Char 1 $1. $1. 2 DBP6 Num 8 BEST12. BEST32. 4 DBPBL Num 8 BEST12. BEST32. 3 SBP6 Num 8 BEST12. BEST32. 5 SBPBL Num 8 BEST12. BEST32.
SOME INFILE OPTIONS • OBS - limits number of observations read • FIRSTOBS - start reading from this obs. • MISSOVER and PAD - used to read in data with short records • TERMSTR= used to read files from different OS. • LRECL= needed when you have data with long records (> 256 characters)
Problem when reading past default logical record length; DATA temp; INFILE‘\...\tomhs.data' OBS=6 ; INPUT @260 jntpain 2. ; TITLE'Data not read in correctly because variable is past LRECL '; PROCPRINT; NOTE: Invalid data for jntpain in line 2 NOTE: SAS went to a new line when INPUT statement reached past the end of a line Obs jntpain 1 . 2 . 3 .
*Add LRECL option to fix problem ; DATA temp; INFILE‘\…\tomhs.data' OBS=6 LRECL=500; INPUT @260 jntpain 2. ; TITLE'Data read in correctly using LRECL option'; PROCPRINT; Obs jntpain 1 1 2 1 3 1 4 1 5 1 6 2
Reading Special Data • 04/11/1982 Date • 59,365 Comma in number • 086-59-9054 Long (>8) characters Informat • 04/11/1982 mmddyy10. • 59,365 comma6. • 086-59-9054 $11.
* Reading special data with fixed position data; DATA info; INFILE DATALINES; INPUT @1 ssn $11. @13 taxdate mmddyy10. @25 income comma6. ; DATALINES; 086-59-9054 04/12/2001 59,365 405-65-0987 03/15/2002 26,925 212-44-9054 04/15/2003 44,999 ; TITLE'Variables with Special Formats'; PROCPRINTDATA=info; FORMAT taxdate mmddyy10.; Obs ssn taxdate income 1 086-59-9054 04/12/2001 59365 2 405-65-0987 03/15/2002 26925 3 212-44-9054 04/15/2003 44999
* Reading special data with list input using colon modifier; DATA info; INFILE DATALINES DLM=‘;’; INPUT ssn : $11. taxdate : mmddyy10. income : comma6. ; DATALINES; 086-59-9054;04/12/2001;59,365 405-65-0987;03/15/2002;26,925 212-44-9054;04/15/2003;44,999 ; TITLE'Variables with Special Formats'; PROCPRINTDATA=info; FORMAT taxdate mmddyy10.; Obs ssn taxdate income 1 086-59-9054 04/12/2001 59365 2 405-65-0987 03/15/2002 26925 3 212-44-9054 04/15/2003 44999
Summary of Ways of Reading in Data You may not have a choice - data may come to you in a certain format • List input - data is separated by a delimiter; must read in all variables. • Column input - data is in fixed columns;must know where each variable starts and ends; can read in selected variables • Pointers and Informats - alternative to column input; most flexible; must be used for special data • PROC IMPORT
Exercise 2 • See exercise 2 in course notes