540 likes | 558 Vues
Between and beyond: Irregular series, interpolation, variograms, and smoothing. Nicholas J. Cox. Mind the gap!. Repeated reminder, London Underground. Executive summary. A new program mipolate for several kinds of interpolation is now available.
E N D
Between and beyond: Irregular series, interpolation, variograms, and smoothing Nicholas J. Cox
Mind the gap! Repeated reminder, London Underground.
Executive summary A new program mipolate for several kinds of interpolation is now available. It can be downloaded from SSC (3 September 2015). Variograms are useful for examining dependence structure in time and spatial series. Work is in progress on a new program vgram for variograms.
Irregular series Irregular series are series in which non-missing values are not all equally spaced. Special case: Values would be equally spaced (every day, every year, …), but there are some gaps with missing values, for human or inhuman reasons. General case: Values are just at known times or points with no necessary rules about spacing. Irregular series often seem to invite interpolation.
Luke Howard (1772 – 1864) Best remembered for his nomenclature for clouds (cumulus, stratus, cirrus and so forth). Here we use as sandbox some of his temperature data from Plaistow, near London, in 1807.
Howard, Luke. 1818. The Climate of London, Deduced from Meteorological Observations, Made at Different Places in the Neighbourhood of the Metropolis. Volume I. London: W. Phillips, etc.
Series of events N.B. We are not talking here about series of events, or realisations of point processes. In such series occurrences are typically irregularly spaced, but the gaps are inherent in the process, not a failing of our data. Examples range from eruptions to elections.
Interpolation Interpolation is the art of reading between the lines. Historically, it is a deterministic process, often a matter of going beyond printed tables of functions (logarithmic, trigonometric, and so forth). In principle, we should worry about the statistical properties of interpolation. It is local prediction. In practice, imputation now appears better known among statistical researchers.
Interpolation in (official) Stata The ipolate command for linear interpolation (and extrapolation) was added in Stata 3.1 (1993). The Mata functionsspline3() and spline3eval() were added in Stata 9.0 (2005).
User-written programs on SSC Programs (NJC) have been available from SSC for cubic interpolation: cipolate (2002) cubic spline interpolation: csipolate (2009) piecewise cubic Hermite interpolation: pchipolate (2012) nearest neighbour interpolation: nnipolate (2012) A combined and extended program mipolate is now available too.
Two dimensions too Note also bipolate (Joseph Canner, SSC) (2014). By default it uses quintic polynomials. Other available methods include thin plate splines and Shepard’s method. Note also twoway contour.
mipolate generalises ipolate Interpolation is of yvar with respect to specified xvar. Prior tsset or xtset is not assumed. Regular spacing is not assumed. Multiple values of yvar at the same xvar are averaged first. Groupwise operations using by: are supported.
Linear and cubic Linear interpolation just uses previous and following known values (only). This is done by ipolate, and also mipolate by default. Cubic interpolation is another classic method, using two previous and two following known values (only). This is done by mipolate, cubic. The default of mipolate with either method (as with ipolate) is not to extrapolate.
Un peu d’histoire Cubic interpolation, as a particular kind of polynomial interpolation, is often attributed to Joseph-Louis Lagrange (1736–1813) but was proposed earlier by Edward Waring (1735?–1798). In fact there is a long history of work with contributions by many outstanding mathematicians, not least Isaac Newton (1643–1727) and Leonhard Euler (1707–1783).
Cubic splines As before, we are using cubic polynomials locally, but they are constrained to join smoothly. The syntax is mipolate, spline. This is merely a wrapper for the official Mata functions. As before, the default of mipolate with this option is not to extrapolate.
Linear extrapolation As with ipolate linear extrapolation is available as an option in mipolate to fill in missings at the end of series. What your teachers told you is true: extrapolation is dangerous. “Don’t point that straight line: It can go off anywhere.” (Allude here to Mark Twain on the Mississippi.)
Piecewise cubic Hermite interpolation This method also uses piecewise cubics joining smoothly. The syntax is mipolate, pchip. The interpolant is shape-preserving and cannot overshoot locally. Sections in which yvar is increasing, decreasing or constant with xvar remain so after interpolation. Hence local maxima and minima also remain so. This interpolation method also extrapolates.
Inverse distance weighting Interpolation can use a weighted average of known values, the weights being inverse powers of distance d from unknown value. If I don’t know the value at 42, 41 and 43 are distance 1 away, 40 and 44 distance 2, and so on. For weights d-p,limiting case p = 0makes all weights equal, and so the interpolant is the overall mean, while p very large means that only the very nearest values have effect.
Other methods mipolate adds forward, backward, nearest neighbour and groupwise interpolation: Use the previous, next or the nearest known value. Or extend the single non-missing value in a group to all others. Using the last known value is often dubious statistically, but it is a very common request in data management. The other methods are provided partly for completeness. There is small print (option choices) about how to break ties when two values are equally near.
mipolate summary Nine methods: linear cubic (cubic) spline pchip idw forward backward nearest groupwise Linear extrapolation? yes yes yes no no no no no no
Simple messages There are many interpolation methods to choose from. They will often disagree, even for simple-looking instances. Disagreement gives a handle on uncertainty. In a real problem, simulate missings and test how well known values are estimated. What makes most sense in your problem will reflect its dependence structure.
We turn from a project that is done to one that is very much in progress.
Variograms Variograms (more properly semivariograms) are plots of (mean) half difference between values squared versus separation, distance or lag. By a tempting abuse of terminology, we often use the same name for the underlying relationship as a function.
First known use of term ‘variogram’ Geoffrey H. Jowett (1922– ) in 1955: The comparison of means of sets of observations from sections of independent stochastic series. Journal of the Royal Statistical Society. Series B (Methodological) 17: 208–227.
Spatial and time series Variograms are central to one approach to spatial statistics, in this context often known as geostatistics. Georges Matheron (1930–2000) is most often mentioned here. But variograms can be very useful for time series too.
Time series too Variograms are prominent in these texts on time series and longitudinal data: Diggle, P.J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press. Diggle, P.J., Heagerty, P.J., Liang, K-Y. and Zeger, S.L. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.
User-written programs Programs (NJC) are available from SSC for variograms in one dimension: variog (2005) variograms in two dimensions: variog2 (2005) A combined and extended program vgram is under development.
Generality of variograms So, variograms are – without undue strain – defined for time series and for spatial series, whether regular or irregular, as they just depend on separation being measured. Plotting the mean for each distinct separation is a common, but not compulsory, convention.
Variograms vgram air vgram air, recast(connected) xla(0(12)72)
Comparison at different lags We are plotting mean squared differences between values compared at lags 1, 2, 3, … In this example, we have monthly data, so are comparing values 1, 2, 3, … months apart. Many readers may be familiar with the same idea for calculating autocorrelation and cross-correlation. The variogram – like the raw data plot – hints at a structure of trend plus seasonality.
Variograms of residuals, not data Here, as elsewhere, it is a good idea to work with residuals, rather than the original data. Time series modellers could have a happy time arguing which model was best for the airline data, but we just use a Poisson regression on time and look at its residuals. On the versatility and virtuosity of Poisson regression, check out Gould, William. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/
Sometimes, structure is this simple Poisson regression Residuals from Poisson
A little more formally The semivariogram γ(h) for response z is given by 2 γ(h) = A{ [z(i) − z(i + h)]2 } where A{} denotes averaging over pairs of values at lag h. As emphasised, using a mean is a convention. The fuller picture (literally!) is a plot of [z(i) − z(i + h)]2 versus h. This is often known as a variogram cloud. I borrow the notation A() from Whittle, P. 1970. Probability. Harmondsworth: Penguin.
Where does the 2 come from? The units of the semivariogram are those of the response squared. Adding the variance to the graph as a reference line underlines the connection. A non-standard formula for the variance is, for any i, j, (1/2) E{ (zi − zj)2 } .
Back to vgram vgram (not yet public) is already quite general. We take possibilities one by one. • With just one argument, the response, it checks for a tsset or xtset time variable and uses it to define separations if found. Note that panel data are supported for free. • With just one argument otherwise, the order of the observations is taken to define position in time or space.
With two arguments, the second variable is taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Equal and unequal spacing can thus both be accommodated. • With three arguments, the second and third variables are taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Distance is calculated from coordinates using Pythagoras’ theorem.
Why not just use autocorrelation? Variograms are defined for a wider class of processes. Autocorrelation functions require weak stationarity; variograms are defined for processes with stationary increments. Variograms are more flexible in the face of irregular spacing. The very wide use of autocorrelation reflects custom and familiarity as well as intrinsic merit.
A further example We look at rainfalls for 8 May 1986 (a single day) for 467 stations in Switzerland.
How much information ? Optionally the semivariogram results can be saved in vgram to new variables. Keeping track of the number of pairs used at each lag is important. Here we exploit the feature that spikeplot can show frequencies on a square root scale.
To do list variogram clouds robust estimators more flexible binning spherical distances too direction as well as lag model fitting (valid functional forms) use for interpolation (and smoothing) (kriging, Gaussian process regression)