1 / 44

New approaches to data modelling

New approaches to data modelling. Prepared and presented by Francesco Rizzo (ISTAT, Italy) and David Barraclough (OECD). Content. Why and how SDMX Disseminating multi-dimensional statistical tables Statistical datasets and data cube model Metadata in SDMX SDMX Information Model

Télécharger la présentation

New approaches to data modelling

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.


Presentation Transcript

  1. New approaches to data modelling Prepared and presented by Francesco Rizzo (ISTAT, Italy) and David Barraclough (OECD)

  2. Content • Why and how SDMX • Disseminating multi-dimensional statistical tables • Statistical datasets and data cube model • Metadata in SDMX • SDMX Information Model • SDMX main artefacts • SDMX Data Modelling • Dissemination capabilities and business constraints • Data modelling principles • SDMX data modelling main steps • Modelling Unit information

  3. Why and how SDMX • Modernising the Dissemination business process: • Using statistical standards: GSBPM, SDMX • SDMX improves quality and efficiencies in the exchange and dissemination of data and metadata: • harmonization and coherence of data (semantic interoperability); • preservation of meaning; • open formats (XML, JSON, CSV) rather than a proprietary • SDMX IM for Linked Open Data vocabulary (RDF Data Cube). More in general Open (statistical) Data • Software free and open source easily to integrate (component-based) within a Statistical Information System

  4. Why and how SDMX • SDMX use cases: • Reporting (to the International Organisations) • “push” - reporter pushes data to collector • “pull” - collector pulls data from reporter • Metadata management • Data warehousing • SDMX as a model for the structure of a data warehouse or metadata repository • for extraction, transformation, and load of data • Dissemination (GUI browsing and machine2machine) • to drive website presentation of data and metadata • as a queryable data source • for standardized file downloads • pull method is most useful, can cater for different queries • Validate data using XML schemes and VTL

  5. Disseminating multi-dimensional statistical tables

  6. Statistical datasets and data cube model • A statistical data set comprises a collection of observations made at some points across some logical space, • The collection can be characterized by a set of dimensions that define what the observation applies to (e.g. time, area, gender) along with metadata describing what has been measured (e.g. economic activity, population), how it was measured and how the observations are expressed (e.g. units, multipliers, status) • We can think of the statistical data set as a multi-dimensional space, or hyper-cube, indexed by those dimensions (This space is commonly referred to as a cube for short) • A cube is organized according to a set of dimensions, attributes and measures • Data can be “sliced” by performing a query on the dimension items in the cube.

  7. Example of statistical data cube Tourism activity Time Country B020 ES 2004

  8. Time series representation ES FR IT AT Country

  9. Data Reference Metadata grouped into grouped into Metadata set Data set described by described by Structural Metadata DSD MSD Metadata in SDMX • Structural metadata: • Metadata act as identifiers and descriptors of the data (e.g. variables, code-lists, dataset, data structure definitions, etc.) • Reference metadata: • Conceptual metadata (e.g. definition of income) • Methodological and processing metadata (e.g. description of data processing) • Quality metadata (e.g. Availability)

  10. High level schematic of the SDMX Information Model Categorisation

  11. SDMX main artefacts (1/2) • Concept Scheme a list of Concepts, for a specific statistical domain, or for other purposes such as cross-domain Concepts • a Concept provides the semantics (and possibly also default representations) for Dimensions (Age, Reference area, Sector, Time), Data Attributes (Observation status, Unit multiplier) and Metadata Attributes • Code List alist of Codes from which some statistical Concepts (coded Concepts) take their values. A Statistical classification can be represented as a Code List • Data Structure Definition a DSD describes the structure of an organised collection of data (using Concepts defined in Concept Schemes and Codelists) identifying: • which Concepts are used as Dimensions and their representation (coded or free text) • which Concepts are used as Attributes and their representation (coded or free text), attachment level (Dataset, Dimensions group, Observation), obligatoriness • which Concepts are used as Measures (at least one measure must be identified)

  12. SDMX main artefacts (2/2) • Data Flow filtered (constrained) sub-cube defined on a DSD. It is used to present simplified perspectives/views of the data to users • Constraint limits Dimensions and Attributes in a Dataflow to a set of Codes • Category Scheme container for Categories. Categories are used to classify any other structure in SDMX (e.g. Data Flows). Very often a Category Scheme in dissemination database GUI is identified by a Theme(Topic)-Tree • Categorisation a link between a Category in a Category Scheme to any other Identifiable structure in SDMX (e.g. a category with a Data Flow) • Metadata Structure Definition a MSD is used to define a template for reporting and disseminating Reference Metadata • Metadataflow it is like a Dataflow but for reference metadata

  13. Data Structure Definition, Dataflow, Dataset DSD Dataflow Dataset Constraints on dimensions describes Constraints on dimensions describes

  14. Basic characteristics of SDMX artefacts (v2.1)

  15. SDMX Data Modelling

  16. Skills of the Data Modeler • Business analysis skills to discuss and interpret the needs of users • Multidimensional modelling to understand statistical hypercubes, need to slice data • Metadata management to store, retrieve, version structural metadata • Dissemination/reporting experience to understand dataflows • (advantageous) Knowledge of the statistical domain

  17. Important resources for data modelling • Modelling Statistical Domains in SDMX • Guidelines for SDMX Data Structure Definitions • Checklist for SDMX Design Projects Home

  18. Step 1: Agree on the exchange needs (GSBPM: Specify Needs) • Data sources (e.g. Excel, Pdf, SAS/STAT/SPPS files, Legacy database, etc.) • Expected statistical visualisations (tables, graphs, full downloads, etc.) • Use case(s) of the data to be modelled, e.g. reporting, dissemination • Tables to disseminate • Available tools • Governance and maintenance constraints of the metadata and data • Any constraints on remodeling the data and metadata

  19. Step 2: Plan the modeling project • Ensure the team has the knowledge • Define a roadmap • Identify the SDMX modelling tools and methodology to use • Method and tools used to collaborate with statisticians/modellers/IT • Issue log • Project plan with scope and above details

  20. Step 3: SDMX data modelling main steps (GSBPM: Design) • Describe the Data Flows in the exchange. This is an examplefrom Global Labour Statistics

  21. Step 3: SDMX data modelling main steps (GSBPM: Design) • Create the Data flow model (previous slide) • Create a Concept Scheme that satisfies the exchange needs • Decide the roles of the concepts • For each Data Flow, specify which concepts will be used and the subset of allowed codes • Optimise the model, removing or merging concepts that are rarely used in the Data Flows • Create DSDs by assigning Data Flows that have similar dimensionality • Create the Code Lists for the Concepts • Categorise the Data Flows. Allows navigation in registries, browsers

  22. Some data modelling principles • Reuseof existing common structures (The Global SDMX Registry is the primary location to search for “global” and “cross-domain” SDMX artefacts) • For visualization, minimisethe numbers of Concepts displayed • It is preferred to show at most 5 Dimensions in a table • Merging Concepts may be done for short breakdown codelists, but this may compromise filtering and simple processing of the information • In .Stat, hiding concepts can be achieved using Dataflows and Constraints without merging concepts . Single-fixed concepts are hidden • Optimise the statistical domain model, taking in consideration the requirements of the Domain Manager • the minimal number of Concepts used in a DSD, • the minimum data sparseness in Data Flows, and • the minimum number of DSDs required • Use the SDMX concept-oriented guidelines! (COG)! • Especially the Guidelines for the creation and Management of SDMX Code Lists

  23. Step 4: Creation of SDMX artefacts (GSBPM: Build) • Create the SDMX artefacts from the previous step • Several tools automatically generate the SDMX-ML • Upload the artefacts to the structural metadata repository • In .Stat, use the Data Lifecycle Manager to store the artefacts in the .Stat datawarehouse • Create guidelines, especially if the project replaces a non-SDMX reporting framework or if it is new • Perform internal and external tests

  24. Example of identifying which tables must be disseminated 1) Domain manager requirements 2) I.Stat Data modelling 3) I.Stat database

  25. Industrial production index

  26. Index of production in construction

  27. Industrial new orders index

  28. Industrial turnover index

  29. An example of finding statistical concepts that describe the tables

  30. Concepts and statistical tables matrix

  31. An example of statistical concept coding and their code lists

  32. A Data Structure Definition and related artefacts

  33. A DSD matrix example showing Data Flows and Constraints

  34. Content Constraints

  35. An example of Categorizing the Data Flows

  36. Demo of content in Data Lifecycle Manager and Data Explorer

  37. Modelling Unit Information

  38. Issues with Indicators (or Variable, Subject, etc.) Attempts have been made to define what the difference might be between these • Often the name/code is not used uniformly (indicator, measure, variable, subject, transaction; IND, INDICATOR, MEASURE, VAR, VARIABLE, etc.) In many instances, the compound indicator is used to denote “that one dimension into which we crammed all the information that we could not allocate into one dimension”. Some ‘indicators’ • Adolescents (15-year-olds) who report regularly engaging in vigorous physical activity outside of school (%) • Children (11-, 13- and 15-year-olds) who are overweight or obese (%) • Average disposable household income, 0-17 year-olds, 2010 USD PPP A ‘measure’ • Out-of-work income maintenance and support Useful, orthogonal information is often difficult to separate, map, and compare While the concept names/IDs are different, it makes mapping and transcoding very difficult! For the modeler, difficult to know how to construct the compound name. No standard, deterministic way to do this

  39. Modelling the “Unit” information The unit informationmay be: • a compound “Statistical indicator”, or • decomposed into Unit characteristic concepts Compound indicators can be improved by: • removing the concept synonyms, and • harmonising the indicator composition to make mapping more consistent Now we describe the OECD’s method on how to decompose unit information using a methodological process. Even for compound indicators, it is good practice to decompose the information into separate concepts anyway, then merge back • Allows to standardize the composition of the indicator

  40. The Mountain example Object: Chimborazo Property: height Value: 6263 Unit of measure: meters CC BY-SA: David Torres Costales

  41. OECD’s Unit Decomposition Method • 3 mandatory unit concepts for all data: Measure, Unit of Measure, Unit multiplier • Measure: The entity+property being measured, e.g. Chimborazo height • Unit of Measure: Standard for measurement of the same kind of quantity • Use standard units • Typical UoMs: • Persons, Years, Metres, Tonnes, etc. • <Currency> • <Unit> per <unit>, e.g. National currency per US Dollar • Percentage points. Used for differences in percentages • Percentage of <denominator> • Avoid usingPercentage, Index or Rate, Share, Ratio alone. E.g. percentage of what? • Percentage should always be Percentage of something • Convert rates, share, ratio to percentages. For standardisation but also to remove synonyms • Unit multiplier: Exponent in base 10 used for calculating the actual value. Apply it to the observation value • Additional concepts that further qualify Unit of Measure • Base period: When the intrinsic value changes over time. Used in conjunction with • Prices: Current or Constant price. Needs Base period • Seasonal adjustment: Used to remove seasonal and calendar influences from a series • Time transformation: Coding for time-related operations on series: growth rates, moving averages, etc: Guidelines on coding time transformations in SDMX

  42. Decomposition Examples Proportion (%) of children (aged 0-14) that live in households where all adults are in employment (working) Replaced by: • Measure: Children • UoM: Percentage of children • Unit multiplier: Units • Labour force status provider: All in employment • Age: 0 to 14 years • Living arrangement: Total Proportion (%) of children (aged 0-17) living with two parents Replaced by: • Measure: Children • UoM: Percentage of children • Unit multiplier: Units • Age: 0 to 17 years • Living arrangement: 2 parents

  43. Decomposition Examples Total public social expenditure on families as a % of GDP Replaced by: • Measure: Public social expenditure on families • UoM: Percentage of GDP • Unit multiplier: Units • Benefit type: Total Public social expenditure on cash benefits for families as a % of GDP Replaced by: • Measure: Public social expenditure on families • UoM: Percentage of GDP • Unit multiplier: Units • Benefit type: Cash benefits Public social expenditure on services and in-kind benefits for families as a % of GDP Replaced by: • Measure: Public social expenditure on families • UoM: Percentage of GDP • Unit multiplier: Units • Benefit type: Services and in-kind benefits

  44. Demo of SDMX Matrix Generator Open source Excel-based tool used to model domains/frameworks and generate SDMX-ML Available on Github: https://github.com/OECDSTD/sdmx-matrix-generator

More Related