280 likes | 423 Vues
ESSnet on Datawarehousing - the business register. Pieter Vlag – Statistics Netherlands. Outline of the presentation. DataWareHouse and importance population frame relationship population frame - business register - (default) target population, statistical units
E N D
ESSnetonDatawarehousing- the business register Pieter Vlag – StatisticsNetherlands
Outline of the presentation • DataWareHouse and importance population frame • relationship population frame - business register • - (default) target population, statistical units • other crucial datasources: “backbones” • - turnover + employment • datalinking : the statistical unit base • conflicting information between datasources • - when correcting in statistical DWH • - when correcting in backbones • - when feedback to business register ESSnet DWH – business register 1
Definition of a statistical Datawarehouse (according to the FPA) The broad definition of a data warehouse to be used in this ESSnet is therefore: ‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source.’ ESSnet DWH – business register 2
A DataWarehouse: the general idea • As stagingarea is “core business” forNSIs, • term statistical DWH is usedforstagingarea + WareHouse ESSnet DWH – business register 3
The statistical DataWarehouse: architecture and layers ESSnet DWH – business register 4
The statistical DataWareHouse:processing steps the GSBPM model 5.1a: link data 5.1b:integrate data seepresentationFursova input DWH / int. data process Calculateaggregates ESSnet DWH – business register 5
GSBPM -step 7 Output 3 Output 1 Output 2 6 Integrated data 5.7 p.analyse p.analyse 5.2 - 5.6 Processing (integrationlayer) Linking 5.1 4 datasource 1 datasource 2 datasource 3 Titel van de presentatie
A datawarehouse without population frame Datasource I: Survey 1 Datasource I: Admin data Datasource I: BIG DATA Datasource I: Survey 2 • different sourcescover different enterprises -> informationabout ? • timing of availabilitysourcesdiffers -> when complete desc. available ? ESSnet DWH – business register 7
A Datawarehouse with a population frame Datasource 4: survey 2 Population. Datasource 1: admin data 1 Datasource 3: survey1 Datasource 2: BIG DATA ADVANTAGE: the coverage of DWH is known (e.g. whichenterprises are included in a DWH) ESSnet DWH – business register 8
Units and target population • The population should be known for the • datawarehouse; e.g. “about which enterprises info” • its preparation phase ; e.g. when linking data sources • Challenges are: • units may differ between the data sources • - decision: which unit used for linking • what is the reference population • - decision: how is the default target population defined ESSnet DWH – business register 9
Proposals • Only statistical unit (=enterprise) is used • - for data-linking • - in processing phase of the statistical - DWH • - justification: most obvious, ESSnet on Consistency, maintenance • Default target population : • all enterprises with economic activity in reference period (e.g. year) • - justification: SBS-regulation • - widest definition of enterprises from which flexible outputs for subpopulations can be derived • - term default is used: as subpopulations do have a target population, too ESSnet DWH – business register 10
GSBPM -step 7 6 flexible output for different populations, and units Integrated data 5.7 Weightingto flexible pop. 5.2 - 5.6 Processing on stat. unit + default target populationonly Linked data 5.1 flexibledatasourceswith different populations and units 4 Titel van de presentatie
Population frame and the Business Register • Determination of the default target population in SDWH in 2 steps: • the population frame, i.e. a list of enterprises with a certain kind of activity during a period. • confirmationwhich enterprises of the list really performed economic activities during a period • The business register provides information for the population frame. • Therefore, the statistical Business Register is an indirect datasource for the statistical-DWH ESSnet DWH – business register 12
Informationneededfromstat.business register • Recommended information for the population frame : • the frame reference year • the statistical enterprises unit, including national ID and EGR ID • the name and address of the enterprise • the national identification number (ID) of the enterprise • the date in population (mm/yr) • the date out of population (mm/yr) • the NACE-code • the institutional sector code • a size class ESSnet DWH – business register 13
Otherbackbones • ESSnetAdminData: VAT and socialsecurityadmin • almost complete forquarter and annual • canbeusedforhigh-qualityestimatesforturnover + employmentrespectively. • ESSnet DWH: VAT and socialsecurity data are crucial • to confirm the activity status of enterprises • implictly to determine the default target population • to integrate data suitable for flexible outputs • measurement errors are reduced of sample survey (or data about subpopulation) if weighting to pop.numbers + VAT-turnover + employment • Proposal: to include these admin data as backbones in a stat-DWH ESSnet DWH – business register 14
Backbones in a statistical-DWH Access layer GSBPM 7-9: disseminate Integrationlayer Int. + Analyses layer GSBPM 6: analyse / “DATAWAREHOUSE” GSBPM 5.7-5.8: calculateaggregates GSBPM 5.2-5.6: “process” Check processing GSBPM 5.1: link & integrate data 1 data 2 Sourcelayer Pop-frame VAT empl. Backbones are crucialfordata-linking and data-integration; ->need to bechecked/cleanedbysource in the sourcelayer SBR
Observed: admin data incorporated in BR GSBPM 7-9: disseminate Integrationlayer Int. + Analyses layer GSBPM 6: analyse Whenchoosingthisoption, • important part oflinkingprocessoutside the S-DWH • unless S-DWH integral part of S-DWH (maintenance ?) GSBPM 5.7-5.8: calculateaggregates GSBPM 5.2-5.6: “process” Check processing GSBPM 5.1: link & integrate data 1 data 2 Pop-frame Sourcelayer VAT SBR empl.
Determiningdefault target population • Ifstatistical-DWH covers annualstatisticsonly • relatively straightforward • - derive population frame from business register at the end of reference year t • determine active or non-active as soon as VAT and/or employment data become available • If STS included in statistical-DWH more complicated: • - updating necessary ! ESSnet DWH – business register 17
Updating population ESSnet DWH – business register 18
The largestenterprises output 2 output 1 output 3 GSBPM 5.7-5.8: calculateaggregates “DATAWAREHOUSE” If a team within a NSI produces consistent microdata forlargestenterprises -> considerthissource as backbone GSBPM 5.2-5.6: “process” Check processing GSBPM 5.1: link & integrate Pop-frame VAT empl. L.E. data 1 data 2 SBR
Units: idealsituation • enterprise has a unique ID • enterprise group has a unique ID • enterprise and enterprise group • correspond with statistical definitions • are used in all data sources • In practice more complex situations do exist • (especially when using more admin data) ESSnet DWH – business register 20
GSBPM -step Keyquestion: how to manage these different in- and output units and theirrelationships to the statistical unit 7 6 Flexible output for different populations, and units Integrated data 5.7 processing onone unit + onepopulationonly 5.2 - 5.6 Linked data 5.1 4 flexibledatasourceswith different population and units Titel van de presentatie
INPUT IN S-DWH processing OUTPUT Legal unit KAU “Accountìng” unit LKAU “VAT-unit” Local unit ENTERPRISE GROUP ENTERPRISE (=statistical unit) Enterprise “othertax” units Enterprisegroup enterprise other units ESSnet DWH – business register
The unit base • Someremarks: • Complexity of unit base depends on • - scope of statistical-DWH • national legislation (practices) with respect to enterprise units • Unit base closely related to Business Register. Main motivation to place this base outside the Business registers • - more flexible in case of new in- and outputs • - more transparent in case of linking errors ESSnet DWH – business register 23
Position of Business Register in stat -DWH output 2 output 1 output 3 GSBPM 5.7-5.8: calculateaggregates “DATAWAREHOUSE” GSBPM 5.2-5.6: “process” Check processing GSBPM 5.1: link & integrate tax Pop-frame BIG DATA L.E. survey other units SBR VAT empl.
Feedback to Business Register • In case of conflictinginformationbetweendatasources and conclusion is influentialerror in backbones (and indirectly SBR) • When incorporating corrections in statistical DWH ? • When incorporating corrections in backbones ? • When incorporating corrections in SBR? ESSnet DWH – business register 25
Correction of information output 2 output 1 output 3 “DATAWAREHOUSE” GSBPM 5.7-5.8: calculateaggregates In SDWH: corrections at 5.6 GSBPM 5.1-5.6: “process” Check processing GSBPM 5.1: link & integrate other survey units Pop-frame VAT empl. L.E. In backbonesthemselves: timing most important revisions SBR In SBR: after end of year (forconsistency) – exception major impact
Conclusions • Requirementsforstatistical-DWH • Population well defined • Use of one unit in processing • Backbonesdesiredfor • populations, VAT-turnover, admin data employment, largeenterprises • Business Register is indirect input forstatistical DWH • population frame, unit base, survey • Timing of correctionserrors (backboneinformation) • in DWH: beforeweighting • in backbone: whenrevising • in Business Register: end of year ESSnet DWH – business register 27