
End-to-End Management of the Statistical Process An Initiative by ABS


Presentation Transcript


  1. End-to-End Management of the Statistical Process: An Initiative by ABS
     Bryan Fitzpatrick, Rapanea Consulting Limited and Australian Bureau of Statistics
     Work Session on Statistical Metadata (METIS), March 2010, Geneva

  2. The Objectives
     • Business transformation aimed at
       • reducing cost
       • improving effectiveness and ability to respond
     • A holistic approach to managing and improving the entire statistical life-cycle
     • International collaboration
       • ABS does not want to go it alone
       • the aim is a shared approach
         • sharing of ideas, interfaces, tools
         • but with acceptance of national differences
     • Build on recent progress in the international statistical community
       • standards (SDMX, DDI), GSBPM
       • the aim is to make them work in practice
     • A new program: IMTP
       • Information Management Transformation Program

  3. End-to-End Management of the Statistical Process
     • Metadata is always the key to better approaches and process improvements
       • it has been in all previous ABS improvement programs
     • ABS has a long history of trying to manage metadata (with modest successes)
     • Metadata means all the information we use in and around the processes and the data
       • to improve things we need to understand it, rationalise it, share it, and use it to automate and drive processes and to make the outputs more integrated and usable
     • Previous improvement programs have generally been much more limited
       • focused on a few areas in a few projects
       • narrow metadata focus

  4. SDMX and DDI
     • They are useful standards
       • but they are not the focus of ABS interest in this exercise
       • the focus is optimising the statistical processes and improving the results from those processes
       • but we need to describe and manage all aspects of the statistical process, and that is their target domain
     • They are international standards
       • sponsored and used by the community ABS is part of, for purposes that are relevant to IMTP
     • To discuss the issues internally and with other organisations we need models
       • SDMX and DDI are in use, relevant, and fit for purpose
     • IMTP aims to apply these standards (along with some others: ISO 11179, ISO 19115) and make them work
       • build on recent work in the international statistical community

  5. IMTP and Metadata Management
     • Metadata management will be a major part of IMTP
       • storing it, rationalising it, making it available for sharing and easy use, presenting it in different ways
       • and integrating with existing stores such as the Input Data Warehouse, Data Element Repository, and ABS Information Warehouse
     • We talk of a “Metadata Bus” and “Metadata Services”
       • some technical jargon
       • it means the metadata is easily available to all systems running in the ABS environment
     • We are still figuring out precisely what we mean and how it should look
       • we need to get “use cases” – examples of what business areas and their systems need to do with the metadata
       • but the services will deliver various sorts of metadata in XML formats
         • conforming to schemas from DDI and SDMX
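The "Metadata Services" idea above can be sketched as a function that renders a stored codelist as an XML fragment. This is a hypothetical illustration of the pattern, not the actual ABS service interface or the real SDMX-ML schema: the element names, function name, and codelist are all invented.

```python
# Hypothetical sketch of a "metadata service" response: a codelist
# rendered as a simple SDMX-flavoured XML fragment. Element and
# function names are illustrative, not the real SDMX-ML schema.
import xml.etree.ElementTree as ET

def codelist_to_xml(codelist_id, codes):
    """Render a {code_value: label} mapping as an XML codelist fragment."""
    root = ET.Element("Codelist", id=codelist_id)
    for value, label in codes.items():
        code = ET.SubElement(root, "Code", value=value)
        ET.SubElement(code, "Description").text = label
    return ET.tostring(root, encoding="unicode")

print(codelist_to_xml("CL_SEX", {"M": "Male", "F": "Female"}))
```

The point of the sketch is that any consuming system gets the same schema-conformant XML from the bus, rather than each system holding its own embedded copy of the codelist.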

  6. IMTP and Metadata Management
     • IMTP focus will be on metadata that is “actionable”
       • meaning we want it in a form that both people and systems can use
       • that can be easily stored and passed around
       • that can be used easily to generate whatever format is required in any particular case
         • including web pages, PDFs, manuals, and other human-readable forms
       • SDMX and DDI both represent the metadata in XML
     • Major focus on metadata management
       • versioned and maintained as in SDMX and DDI
       • “confrontation” across collections and processes
       • the aim is consistent, standard metadata across the organisation
       • and consistent with international use wherever sensible
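"Actionable" in the sense above can be illustrated with a minimal sketch: a single metadata record that systems can use directly to generate more than one human-readable form. The record structure and function names here are invented for illustration, not ABS designs.

```python
# One "actionable" metadata record (a codelist) driving two different
# human-readable presentations. Structure and names are hypothetical.
codelist = {"id": "CL_SEX", "codes": {"M": "Male", "F": "Female"}}

def as_html(cl):
    """Generate an HTML table (e.g. for a web page) from the codelist."""
    rows = "".join(f"<tr><td>{v}</td><td>{lbl}</td></tr>"
                   for v, lbl in cl["codes"].items())
    return f'<table id="{cl["id"]}">{rows}</table>'

def as_text(cl):
    """Generate a plain-text listing (e.g. for a manual) from the same record."""
    return "\n".join(f"{v}: {lbl}" for v, lbl in cl["codes"].items())

print(as_html(codelist))
print(as_text(codelist))
```

Because both renderings come from the one stored record, the web page and the manual cannot drift apart, which is the practical payoff of keeping the metadata actionable rather than documentary.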

  7. What sorts of metadata?
     • Current ABS metadata management has many shortcomings
       • much metadata is in corporate stores
         • but in too many stores, and often documentary rather than actionable
         • often not used to drive systems even where it is available and actionable
           • the systems predated the stores
       • but much metadata is still embedded in individual systems
       • there are cases of good, managed, shared approaches
         • but often narrowly focused, eg around dissemination
     • End-to-end management of the process requires a comprehensive, consistent approach
       • questions, question controls, interviewer instructions
       • coding, editing and derivation metadata
       • data relationship metadata
       • table structures
       • classification evolution and history
       • alternative hierarchies in geography and other classifications
       • …

  8. SDMX and DDI
     • SDMX comes from the international agencies (OECD, IMF, Eurostat, UNSD, World Bank, ECB, BIS)
       • they get aggregate statistical tables from many countries, regularly, over time
       • they want to automate and manage the process
       • they need standard agreed definitions and classifications, standard agreed table structures, and standard agreed formats for both data and metadata
     • They commissioned SDMX in 2002
       • started a project, gathered use cases, employed consultants
       • produced a standard and presented it to large numbers of international statistical forums
       • started to use it, and to pressure NSOs to use it
     • SDMX is pretty good
       • excellent for managing dissemination of statistical data
       • very good tools for very impressive web sites based on data organised in the SDMX model
       • also some good frameworks for managing evolution of classifications
       • a framework for discussing agreements on concepts and classifications
         • Metadata Common Vocabulary, Cross-Domain Concepts, Domain-specific Concepts

  9. SDMX and DDI
     • DDI (Data Documentation Initiative) comes from the data archive organisations across many countries
       • trying to capture and store survey data for future use
       • and to document it so future users can understand it and make sense of it
       • mostly social science collections from researchers
       • funding organisations are requiring such data to be preserved for further use
     • Mostly they had to grab data and try to salvage metadata after the event
       • but DDI now aims to capture all metadata “at source”
     • Early versions were narrowly focused on an individual data set
       • grew out of their documentation processes
     • The latest version (DDI V3) is much more extensive and better organised
       • common analysis/designer support with SDMX
       • an end-to-end model compatible with the Generic Statistical Business Process Model (GSBPM)

  10. DDI Metadata
      • DDI has
        • Survey-level metadata
          • Citation, Abstract, Purpose, Coverage, Analysis Unit, Embargo, …
        • Data collection metadata
          • Methodology, Sampling, Collection strategy
          • Questions, Control constructs, and Interviewer Instructions organised into schemes
        • Processing metadata
          • Coding, Editing, Derivation, Weighting
        • Conceptual metadata
          • Concepts organised into schemes, including ISO 11179 links
          • Universes organised into schemes
          • Geography structures and locations organised into schemes

  11. DDI Metadata
      • DDI has (cont.)
        • Logical metadata
          • Categories organised into schemes
            • categories are labels and descriptions for question responses, eg Male, Unemployed, Plumber, Australia, …
          • Codes organised into schemes and linked to Categories
            • codes are representations for Categories, eg “M” for Male, “Aus” for Australia
          • Variables organised into schemes
            • variables are the places where we hold the codes that correspond to a response to a question
        • Data relationship metadata
          • eg, how Persons are linked to Households and Dwellings
        • NCube schemes
          • descriptions for tables
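The Category, Code, and Variable relationships on this slide can be sketched as a tiny data model. These classes are a hypothetical illustration of the chain, not the actual DDI V3 schema or any ABS implementation.

```python
# Hypothetical sketch of the DDI Category -> Code -> Variable chain.
from dataclasses import dataclass, field

@dataclass
class Category:
    """A label/description for a question response, e.g. 'Male'."""
    label: str

@dataclass
class Code:
    """A representation of a Category, e.g. 'M' for Male."""
    value: str
    category: Category

@dataclass
class Variable:
    """The place where we hold the codes corresponding to a response."""
    name: str
    codes: list = field(default_factory=list)

male, female = Category("Male"), Category("Female")
sex = Variable("SEX", [Code("M", male), Code("F", female)])

# decoding a stored value back to its category label
lookup = {c.value: c.category.label for c in sex.codes}
print(lookup["M"])
```

Separating Category from Code is the point of the DDI design: the same labels can be reused by a different coding scheme (say, numeric codes) without redefining the categories.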

  12. DDI Metadata
      • DDI has (cont.)
        • Physical metadata
          • record structures and layouts
        • File instance metadata
          • specific data files linked to their record structures
        • Archive metadata
          • archival formats, locations, retention times, etc
        • Places for other stuff not elsewhere described
          • Notes, Other Material
        • References to “Agencies” which own artefacts, but no explicit structure to describe them
        • Inheritance and links embedded in most schemes
          • but these need to be ferreted out, and are not necessarily easily usable

  13. SDMX Metadata
      • SDMX has
        • Organisations organised into schemes
          • Organisations own and manage artefacts, and provide or receive things
        • Concepts organised into schemes
        • Codelists, including classifications
          • a Codelist combines DDI Categories and Codes
        • Data Structure Definitions (Key Families)
          • a DSD describes a conceptual multi-dimensional cube used in a Data Flow and referenced in Datasets
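The "conceptual multi-dimensional cube" of a DSD can be sketched minimally: a set of named dimensions, each carrying a codelist, with an observation key selecting one cell of the cube. All identifiers below (DSD_POP, the dimension and code names) are invented for illustration.

```python
# Minimal sketch of a Data Structure Definition as a cube: named
# dimensions with a codelist each, plus a check that an observation
# key addresses a valid cell. All identifiers are hypothetical.
dsd = {
    "id": "DSD_POP",
    "dimensions": {
        "SEX": {"M", "F"},
        "AGE": {"Y0_14", "Y15_64", "Y65P"},
    },
}

def valid_key(dsd, key):
    """True if the key picks one code from every dimension, in order."""
    dims = list(dsd["dimensions"].values())
    return len(key) == len(dims) and all(k in d for k, d in zip(key, dims))

print(valid_key(dsd, ("M", "Y15_64")))  # a real cell of the cube
print(valid_key(dsd, ("M", "WRONG")))   # not a cell of the cube
```

This is why a DSD is reusable across Data Flows and Datasets: the cube definition, not any particular dataset, is what fixes which observation keys are meaningful.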

  14. SDMX Metadata
      • SDMX has
        • Data Flows
          • described by a DSD, linked to registered data sets, and categorised
        • Categories organised into schemes
          • not the same as a DDI Category
          • provide a basis for indexing and searching data
        • Hierarchical Codelists
          • a misnomer – maps relationships amongst inter-related classifications
          • explicit, actionable representations of relationships
        • Process metadata
          • a Process has steps with descriptions, transition rules, computation information, inputs, and outputs
          • all actionable, linked to other SDMX artefacts or to external sources
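An "explicit, actionable representation of relationships" between inter-related classifications might look like this in miniature, in the spirit of an SDMX Hierarchical Codelist. The scheme names and codes are invented; a real Hierarchical Codelist is an XML artefact, not a Python dict.

```python
# Hypothetical mapping between two related classifications: each code
# in one scheme is explicitly linked to a code in the other, so the
# relationship is machine-usable rather than documentary.
mapping = {
    ("CL_LOCAL", "A011"): ("CL_INTL", "0111"),
    ("CL_LOCAL", "A012"): ("CL_INTL", "0112"),
}

def translate(scheme, code):
    """Follow the actionable link from one classification to another."""
    return mapping.get((scheme, code))

print(translate("CL_LOCAL", "A011"))
```

Because the link is stored as data, a processing system can reclassify outputs automatically instead of relying on a correspondence table written up in a document.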

  15. SDMX Metadata
      • SDMX has
        • Structure Sets
          • additional linking of related DSDs and Flows
        • Reporting Taxonomies
          • information about assembling reports or publications
        • Reference Metadata, Metadata Structure Definitions, and Metadata Flows
          • additional, probably useful, options for attaching metadata to data
        • Annotations almost everywhere
          • good options for managed, actionable extensions

  16. What sorts of metadata?
      • What are we interested in?
        • Concepts
          • probably organised into schemes
          • what are the use cases?
        • Classifications
          • broken up into Categories and Codes, DDI-style?
          • with links to related classifications, SDMX Hierarchical Codelist-style?
          • what are the use cases?
        • Questions and related metadata
          • just how should it look?
          • a DDI package, but precisely what is useful?
          • what are the use cases?

  17. What sorts of metadata?
      • What are we interested in?
        • Survey-level metadata?
          • what are the use cases?
        • Structure Definitions
          • almost certainly, but we need use cases
        • Variable, Relationship, and Record Structure metadata
          • maybe, but we need use cases
        • Processing metadata
          • almost certainly, but we need use cases
          • SDMX Process and/or DDI artefacts

  18. What are the next steps?
      • Basically we need use cases
        • How do we see our metadata being used?
        • What are we trying to support?
        • What can we get from our pilot programs?
          • we need to do our own abstraction from that
      • We can then start to define a provisional set of services
        • with parameters and schemas
      • We can then think about existing sources and demonstration systems
      • We can then think about repositories and stores

  19. Timeframe and Process
      • We are at the start of the process
        • a project team that is still forming
        • several “satellite” projects
          • small, sometimes significant, projects attempting to apply the ideas
          • and provide use cases for design
      • Have had substantial training and discussion around application of DDI and SDMX
        • international experts providing training
        • significant numbers of ABS staff involved
        • more to come later this month
      • Not a “big bang” new implementation
        • rather a framework and environment for all new developments
        • with some retro-fitting to existing systems
        • some direct development of key components

  20. International Collaboration
      • A definite part of the project
        • most national agencies are feeling financial pressures and struggling to build everything themselves
      • Need to discuss how collaboration might proceed
        • some discussions have been held amongst heads of NSOs, with more planned
        • agreed standards are an important enabler
        • need participation of NSOs in the evolution of the standards
        • what are the barriers to collaboration, and how might we manage it?
        • probably do not want too large a group of collaborators at the start
      • ABS (and others) will continue to report to international forums and meetings
        • managerial and technical
        • an important part of fostering the collaboration
        • and finding out what others are doing
        • and getting feedback on our ideas

  21. Questions?
      • BryanMFitzpatrick@Yahoo.CO.UK
