Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

PhUSE 2011 Comparing dataset metadata Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

Comparing dataset metadata • AGENDA / CONTENTS • Comparing dataset data and metadata • PROC COMPARE • macro %CrossRef • Dataset and variable attributes • Example results (in dataset) • Dataset attributes • Variable attributes • Application of macro %CrossRef • Some technical information • Future features

Comparing dataset metadata • A. Comparing dataset data and metadata • PROC COMPARE • data oriented (attributes: NOVALUES option) • only 2 datasets (or variables in one) at a time • cumbersome output (summary: OUT= dataset) • may be tuned as desired, yet limited to pairs • SAS macro %CrossRef • structure oriented: dataset & variable attributes • any number of specified datasets (from 1) • tabular summarisation (in result dataset only) • columns: dataset names; rows: attributes • user specification of desired attributes

Comparing dataset metadata • B. Dataset and variable attributes • Dataset attributes • MemName, MemLabel and LibName • Creation and Modification date and time • Number of variables and physical observations • Variable attributes • Name (common name in first attribute column) • Label: as value in above Name attribute record if no label then text: "-no label-" if no corresponding variable: empty • optional variable’s Type and Length (combined) • optional variable’s Informat and Format

Comparing dataset metadata • C.Example results (in dataset) 1/2 • Dataset attributes • attributedatasetdatasetdataset • column123

Comparing dataset metadata C.Example results (in dataset) 2/2 Variable attributes attributedatasetdatasetdataset column123 6

Comparing dataset metadata D. Application of macro %CrossRef not with entirely different datasetsbut with a (limited) number of rather similar datasets to view differences master datasets and subsets of them different versions of datasets same datasets with different names similar datasets with different data Goal: to see whether more datasets could be combined into one dataset (or ignored if the data are identical) 7

Comparing dataset metadata • E. Some technical information • all fields are type character of length $256, first, attribute field has $36 • internally SAS name literal variable names are applied • OPTIONS VALIDVARNAME=ANY is set, and reset to the original state at the end of the macro • variable names starting with an asterisk (*) or ending with an exclamation mark (!) and one digit. Avoid such names in your datasets and limit your variable name length to maximally 30 • WORK dataset names start with __

Comparing dataset metadata • F. Future features 1/2 • comparing all datasets in one or more libraries using a wildcard (LibName.*) • optional aggregated data for both numerical and character variables • (non-deleted) logical number of observations • number of non-missing values • number of missing values • frequency distribution of a limited number of distinct (formatted) values (categories) • minimum and maximum (formatted) value(first and last non-missing character value)

Comparing dataset metadata F. Future features 2/2 optional aggregated, univariate data for (mainly) numerical variables mean value median value (also approximate middle, non-missing, sorted, character value) (formatted) mode value (also most occurring non-missing character value) standard deviation various percentiles and more, e.g. distribution information and the statistics that PROC COMPARE can generate 10

Comparing dataset metadata • QUESTIONS • & • ANSWERS • SASquestions@ocs-consulting.com • Jim.Groeneveld@OCS-Consulting.com • http://jim.groeneveld.eu.tf

Q&A: Comparing dataset metadata SAS name literal A name expressed as a string within quotes, followed by the letter N. Applicable to variable names, statement labels and imported variable and table names from DBMS tables (e.g. Excel). Advantage: more compatibility. Example: 'This @#$name'n = 'a SAS name literal'; More information in:SAS Language Reference: Concepts. 12

Q&A: Comparing dataset metadata • Straightforward inventory of metadata • save results of PROC CONTENTS (or of the CONTENTS statement of PROC DATASETS for one or more libraries) to datasets, • if desired keep the most important variables LibName, MemName, Name, Label, Type, Length, Format, FormatL, FormatD, Informat, InformL and InformD; • concatenate all metadata datasets (SET); • if desired sort by variable NAME. This generates all dataset and variable information in subsequent records.

Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

Presentation Transcript

The National Geo-data Repository of The Netherlands - DINO

Avian Influenza in the Netherlands

LINH

Towards an integrated supervisor: experiences from The Netherlands

Systems Biology: Today and Tomorrow; the WTEC visit to The Netherlands

(The Future of) INIS Liaison Activities in the Netherlands

Securing Your Swiss Cheese Environment

MPI Netherlands Chapter Board retreat 2009 - 2010

Advising on the Construction of Psychological and Educational Tests

Welcome To The Denison Consulting Certification Workshop

OPIM 5984 ANALYTICAL CONSULTING in FINANCIAL SERVICES

South African Association for Consulting Engineers Management Information Survey July – December 2005

Systems Biology: Today and Tomorrow; the WTEC visit to The Netherlands

Business Opportunities Seminar Consulting Services United States (August 2014)

Schlumberger Business Consulting

IMETER INTRODUCTION

Philip and Elizabeth

THE NETHERLANDS AFTER 1945: TWO SILENT REVOLUTIONS AND ONE VIOLENT ONE?

An ICT Journey: From Coventry to Netherlands (ICT in Education 世界的潮流與趨勢 )

Five Best Apps For Financial Advisors In 2015