
Benchmark database: inhomogeneous data, surrogate data and synthetic data



  1. Benchmark database: inhomogeneous data, surrogate data and synthetic data. Victor Venema

  2. Content • Introduction to the benchmark dataset • Some results • Some questions about the exercise • Questions about future work • Analysing and publishing the results

  3. Benchmark dataset • Real (inhomogeneous) climate records • Most realistic case • Investigate whether the various homogenization algorithms (HAs) find the same breaks • Synthetic data • For example, Gaussian white noise • Insert known inhomogeneities • Test performance • Surrogate data • Empirical distribution and correlations • Insert known inhomogeneities • Compare to synthetic data: test of assumptions

  4. Creation of the benchmark – Outline of the talk • Start with homogeneous data • Multiple surrogate and synthetic realisations • Mask surrogate records • Add a global trend • Insert inhomogeneities in the station time series • Publish on the web • Homogenisation by COST participants and third parties • Analyse the results and publish

  5. 1) Start with homogeneous data • Monthly mean temperature and precipitation • Later also daily data (WG4), maybe other variables (pressure, wind) • Homogeneous, no missing data • Longer surrogates are based on multiple copies of the input data • Generated networks are 100 years (a) long

  6. 2) Multiple surrogate realisations • Temporal correlations • Station cross-correlations • Empirical distribution function • Annual cycle removed beforehand, added back at the end • Number of stations: 5, 9 or 15 • Cross-correlations vary as much as possible
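
A note on method: the slide does not name the surrogate algorithm. One common way to generate series that keep the empirical distribution and (approximately) the temporal correlation structure is the iterative amplitude-adjusted Fourier transform (IAAFT). The single-station Python sketch below only illustrates that idea; it is not necessarily the multivariate procedure used for the benchmark, which also has to reproduce the station cross-correlations. Function and parameter names are mine.

import numpy as np

def iaaft_surrogate(x, n_iter=100, rng=None):
    # Iteratively impose the target power spectrum (temporal correlations)
    # and the target empirical distribution of the input series x.
    # Single-station sketch only; in the benchmark the annual cycle is
    # removed before this step and added back afterwards.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    target_amp = np.abs(np.fft.rfft(x))   # target Fourier amplitudes
    sorted_x = np.sort(x)                 # target distribution
    s = rng.permutation(x)                # random shuffle as a start
    for _ in range(n_iter):
        phases = np.angle(np.fft.rfft(s))                             # keep current phases
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=len(x))  # impose spectrum
        ranks = np.argsort(np.argsort(s))                             # rank of each value
        s = sorted_x[ranks]                                           # impose distribution
    return s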

  7. 5) Insert inhomogeneities in stations • Independent breaks • Determined at random for every station and time • 5 breaks per 100 years • Slightly different perturbations for each calendar month • Temperature • Additive • Size: Gaussian distribution, σ = 0.8 °C • Rain • Multiplicative • Size: Gaussian distribution, mean ⟨x⟩ = 1, σ = 10 %
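
As a rough illustration of these settings, the sketch below inserts independent break perturbations into one monthly station series. Drawing the number of breaks from a Poisson distribution with a mean of 5 per 100 years, and omitting the "slightly different perturbation per calendar month" detail, are my assumptions; names and structure are mine, not the benchmark code.

import numpy as np

def insert_breaks(series, years, variable="temperature", rng=None):
    # Insert independent break inhomogeneities into one monthly series.
    # Temperature breaks are additive with sigma = 0.8 degC; precipitation
    # breaks are multiplicative with mean 1 and sigma = 10 %.
    rng = np.random.default_rng(rng)
    x = np.array(series, dtype=float)
    n_breaks = rng.poisson(5 * years / 100)          # ~5 breaks per 100 years (assumption: Poisson)
    positions = np.sort(rng.integers(1, len(x), size=n_breaks))
    for pos in positions:
        if variable == "temperature":
            x[pos:] += rng.normal(0.0, 0.8)          # additive shift in degC
        else:
            x[pos:] *= rng.normal(1.0, 0.10)         # multiplicative factor
    return x, positions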

  8. Example: break perturbations for one station

  9. Example: break perturbations for a whole network

  10. 5) Insert inhomogeneities in stations • Correlated break in the network • One such break in 10 % of the networks • Affects 30 % of the stations simultaneously • Position chosen at random • At least 10 % of the data points on either side
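
A minimal sketch of this rule, assuming (since the slide does not say) that the correlated break is additive with the same sigma = 0.8 °C as the independent temperature breaks and that all affected stations share one shift:

import numpy as np

def insert_correlated_break(network, rng=None):
    # network: 2-D array with shape (n_stations, n_months).
    # Only 10 % of the networks get a correlated break; it affects 30 %
    # of the stations simultaneously, and the break date leaves at least
    # 10 % of the data points on either side.
    rng = np.random.default_rng(rng)
    x = np.array(network, dtype=float)
    if rng.random() >= 0.10:
        return x                                   # 90 % of networks: unchanged
    n_stations, n_months = x.shape
    n_affected = max(1, round(0.3 * n_stations))
    affected = rng.choice(n_stations, size=n_affected, replace=False)
    pos = rng.integers(int(0.1 * n_months), int(0.9 * n_months))
    x[affected, pos:] += rng.normal(0.0, 0.8)      # shared shift (assumption)
    return x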

  11. Example: a correlated break

  12. 5) Insert inhomogeneities in stations • Outliers • Size • Temperature: below the 1st or above the 99th percentile • Rain: below the 0.1st or above the 99.9th percentile • Frequency • 50 % of the networks: 1 % of the values • 50 % of the networks: 3 % of the values
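
A sketch of one way to implement this: the affected months are pushed beyond the chosen percentile. How far beyond the percentile the values land is not stated on the slide, so that part is an assumption of mine.

import numpy as np

def insert_outliers(series, variable="temperature", frequency=0.01, rng=None):
    # Replace a fraction of the monthly values by outliers (frequency is
    # 0.01 in half of the networks and 0.03 in the other half).
    rng = np.random.default_rng(rng)
    x = np.array(series, dtype=float)
    if variable == "temperature":
        p_lo, p_hi = np.percentile(x, [1, 99])
    else:
        p_lo, p_hi = np.percentile(x, [0.1, 99.9])
    spread = p_hi - p_lo
    idx = rng.choice(len(x), size=rng.binomial(len(x), frequency), replace=False)
    for i in idx:
        if rng.random() < 0.5:
            x[i] = p_lo - rng.uniform(0.0, 0.5) * spread   # low outlier (assumed magnitude)
        else:
            x[i] = p_hi + rng.uniform(0.0, 0.5) * spread   # high outlier (assumed magnitude)
    if variable != "temperature":
        x[idx] = np.maximum(x[idx], 0.0)                   # no negative precipitation
    return x, idx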

  13. Example: outlier perturbations for one station

  14. Example: outliers in a whole network

  15. 5) Insert inhomogeneities in stations • Local trends (only temperature) • Linear increase or decrease in one station • Duration: between 30 and 60 years • Maximum size: Gaussian distribution, σ = 0.8 °C • Frequency: once in 10 % of the stations
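
A sketch of the local-trend rule. Whether the series keeps the perturbed level after the trend ends is not on the slide; keeping it is an assumption of this sketch, as are the names.

import numpy as np

def insert_local_trend(series, rng=None):
    # Maybe add one local linear trend to a monthly temperature series:
    # 10 % of the stations get such a trend, its duration is 30-60 years
    # and its final size is Gaussian with sigma = 0.8 degC (up or down).
    rng = np.random.default_rng(rng)
    x = np.array(series, dtype=float)
    if rng.random() >= 0.10:
        return x                                    # 90 % of stations: unchanged
    duration = int(rng.integers(30, 61)) * 12       # 30-60 years, in months
    duration = min(duration, len(x))
    start = int(rng.integers(0, len(x) - duration + 1))
    final_size = rng.normal(0.0, 0.8)               # final offset in degC
    x[start:start + duration] += np.linspace(0.0, final_size, duration)
    x[start + duration:] += final_size              # assumption: level kept afterwards
    return x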

  16. Example: local trends

  17. 6) Published on the web • Inhomogeneous data are published on the COST-HOME homepage • Everyone is welcome to download and homogenise the data • http://www.meteo.uni-bonn.de/venema/themes/homogenisation

  18. 7) Homogenisation by participants • Return the homogenised data • Should be in the COST-HOME file format (next slide) • For real data, including quality flags • Return a break detection file • BREAK • OUTLI • BEGTR • ENDTR • Multiple breaks at one date are possible
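
The exact column layout of the COST-HOME detection file is shown on the following slides as images and is not reproduced in this transcript. Purely as an illustration of handling the four annotation types listed above, here is a minimal reader for a hypothetical tab-delimited file with columns station, year, month and type; the real format may well differ.

import csv

def read_detections(path):
    # Hypothetical layout: one detection per line, tab-delimited, with a
    # station identifier, year, month and annotation type.  Only meant to
    # illustrate tallying the four annotation types from the slide.
    counts = {"BREAK": 0, "OUTLI": 0, "BEGTR": 0, "ENDTR": 0}
    detections = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            station, year, month, kind = row[0], int(row[1]), int(row[2]), row[3]
            detections.append((station, year, month, kind))
            if kind in counts:
                counts[kind] += 1
    return detections, counts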

  19. Typical errors • The file format needs to be perfect! • Forgetting the station file that describes which stations belong to the homogenised network • Changing the file names in this station file to those of the homogeneous data files ► • (Forgetting to return the files with the quality flags) • The sizes of the breaks are not in the break file • Please keep the directory structure of the benchmark as it is, also for partial contributions • The only difference is the main directory • All files are tab-delimited ASCII files

  20. COST-HOME file format – network file

  21. Typical errors • The file format needs to be perfect! • Forgetting the station file that describes which stations belong to the homogenised network • Changing the file names in this station file to those of the homogeneous data files • (Forgetting to return the files with the quality flags) • The sizes of the breaks are not in the break file ► • Please keep the directory structure of the benchmark as it is, also for partial contributions • The only difference is the main directory • All files are tab-delimited ASCII files

  22. Detected breaks file

  23. Typical errors – see discussion • The file format needs to be perfect! • Forgetting the station file that describes which stations belong to the homogenised network • Changing the file names in this station file to those of the homogeneous data files • (Forgetting to return the files with the quality flags) • The sizes of the breaks are not in the break file • Please keep the directory structure of the benchmark as it is, also for partial contributions • The only difference is the main directory • All files are tab-delimited ASCII files ►

  24. COST-HOME file format – monthly data

  25. Contributions

  26. No. homogenised networks - algorithm

  27. No. homogenised networks – input data

  28. Mean no. outliers per station

  29. Mean no. breaks per station

  30. Homogenising the exercise • Tab-delimited files: also allow space-delimited? • Mixture of strings and numbers • Data quality files only for the real-data section • Do we want to use the diurnal temperature range (DTR)? • Not useful for surrogate and synthetic data! • If we do, everyone should do it • Leave the end or the beginning of the series uncorrected? • Compute statistics independent of the absolute level? • Is filling in missing values part of the exercise? • Human quality control or raw algorithm output? • Homogenise all networks and periods, or only the homogenisable ones?

  31. Contributions – who is missing?

  32. Analysing the results • What measures define a well-homogenised dataset? • Real data vs. data with known truth • Ensemble mean for real data? • Breaks • Position, hit rate • Size distribution • Detection probability as a function of break size • The data itself • Root mean square error (RMSE) • RMSE (without outliers) • RMSE (bias corrected) • Uncertainty in the network mean trend • How to study which components are best?
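
The series-based measures listed above are simple to write down. The sketch below (my naming, not the official analysis code) compares one homogenised series to the known truth: plain RMSE, RMSE after removing the mean bias, and the error in the fitted linear trend as a simple stand-in for the trend uncertainty.

import numpy as np

def evaluation_scores(homogenised, truth):
    # Skill measures for one station against the known truth.  The exact
    # definitions used in the final benchmark analysis may differ.
    h = np.asarray(homogenised, dtype=float)
    t = np.asarray(truth, dtype=float)
    err = h - t
    rmse = np.sqrt(np.mean(err ** 2))
    rmse_bias_corrected = np.sqrt(np.mean((err - err.mean()) ** 2))
    steps = np.arange(len(t))
    trend_error = np.polyfit(steps, h, 1)[0] - np.polyfit(steps, t, 1)[0]
    return {"rmse": rmse,
            "rmse_bias_corrected": rmse_bias_corrected,
            "trend_error": trend_error}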

  33. Deadline(s) • Agreed on 09/2009, i.e. September of this year • Multiple deadlines • For example: synthetic data, real data, surrogate data • After a deadline the truth can be revealed • After a deadline the other contributions can be revealed(?) • Start analysing the results earlier • For example: May, July, September • Bologna, 25 – 26 May; EGU, 19 – 24 April

  34. Articles • Overview of the COST Action & benchmark with very basic analysis results • Performance difference between synthetic (Gaussian, white noise) and surrogate data • How to deal with multiple contributions per algorithm? • Do we have references to all algorithms? • What should the other articles be about? • Analysing the results: which components are best? • Who will organise and coordinate this? • Not everyone should do the same analysis • How to subdivide the work? • After the deadline: sensitivity analysis
