An Introduction to Functional Data Analysis
Jim Ramsay, McGill University
Overview
We’ll use three case studies to see what is meant by functional data, and to consider some important issues in the analysis of functional data:
• Human growth data
• US nondurable goods manufacturing index
• Thirty years of Montreal weather
We need repeated and regular access to subjects for up to 20 years.
• Height changes over the day, and must be measured at a fixed time.
• Height is measured in the supine position in infancy, and as standing height thereafter. The change involves an adjustment of about 1 cm.
• Measurement error is about 0.5 cm in later years, but is rather larger in infancy.
• Measurements are not taken at equally spaced points in time.
Challenges to functional modeling
• We want smooth curves that fit the data as well as is reasonable.
• We will want to look at velocity and acceleration, so we want to differentiate twice and still be smooth.
• In principle the curves should be monotone; i.e., have a positive derivative.
The monotonicity problem
The tibia of a newborn measured daily shows us that over the short term growth takes place in spurts. This baby’s tibia grows as fast as 2 mm/day! How can we fit a smooth monotone function?
Weighted sums of basis functions
• We need a flexible method for constructing curves to fit the data.
• We begin with a set of basic functional building blocks φk(t), called basis functions.
• Our fitting function x(t) is a weighted sum of these:
x(t) = c1 φ1(t) + c2 φ2(t) + … + cK φK(t) = Σk ck φk(t)
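The weighted-sum idea can be sketched outside the Matlab toolbox as well. Here is a minimal Python/NumPy version; the interval, knots, and coefficients are all illustrative choices, not values from the growth data:

```python
import numpy as np
from scipy.interpolate import BSpline

# Illustrative order-4 (cubic) B-spline basis on [1, 18]
t = np.linspace(1, 18, 200)
order = 4
interior = np.linspace(3, 16, 8)                   # hypothetical interior knots
# Full knot sequence: boundary knots repeated 'order' times
knots = np.concatenate(([1.0] * order, interior, [18.0] * order))
nbasis = len(knots) - order                        # 12 basis functions here

# Column k of Phi holds phi_k(t) evaluated on the grid
Phi = np.column_stack([
    BSpline(knots, np.eye(nbasis)[k], order - 1)(t) for k in range(nbasis)
])

# x(t) = sum_k c_k * phi_k(t): a weighted sum of the basis functions
c = np.random.default_rng(0).normal(size=nbasis)   # illustrative coefficients
x = Phi @ c
```

By linearity of the spline in its coefficients, `Phi @ c` agrees with evaluating a single `BSpline` built directly from `c`.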
B-splines for growth data
• Order 4 splines look smooth, but their second derivatives are rough.
• We use order 6 B-splines because we want to differentiate the result at least twice.
• We place a knot at each of the 31 ages.
• The total number of basis functions = order + number of interior knots: 6 + 29 = 35 in this case.
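A quick check of the basis-count arithmetic: knots at all 31 ages means 29 interior knots once the two endpoints are excluded, and order 6 adds six more degrees of freedom:

```python
# Basis count for B-splines: order + number of interior knots
order = 6
n_knots = 31                 # one knot per measurement age, endpoints included
n_interior = n_knots - 2     # knots strictly inside the interval [1, 18]
nbasis = order + n_interior
print(nbasis)                # 35
```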
Isn’t using 35 basis functions to fit 31 observations a problem?
• Yes. We will fit each observation exactly.
• This will ignore the fact that the measurement error is typically about 0.5 cm.
• But we’ll fix this up later, when we look at roughness penalties.
Okay, let’s see what happens
These two Matlab commands define the basis and fit the data:
hgtbasis = create_bspline_basis([1,18], 35, 6, age);
hgtfd = data2fd(hgtfmat, age, hgtbasis);
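create_bspline_basis and data2fd belong to Ramsay's Matlab FDA toolbox. A rough NumPy analogue of the fitting step, ordinary least squares on the basis design matrix, can be sketched with synthetic growth-like data (the ages, the exponential curve, and the thinned knots are all assumptions for illustration):

```python
import numpy as np
from scipy.interpolate import BSpline

age = np.linspace(1, 18, 31)                          # hypothetical measurement ages
height = 70 + 100 * (1 - np.exp(-0.25 * (age - 1)))   # synthetic growth-like data (cm)

order = 4                                  # cubic here, and thinned knots,
interior = age[1:-1][::3]                  # so the LS problem is well-posed
knots = np.concatenate(([age[0]] * order, interior, [age[-1]] * order))
nbasis = len(knots) - order

# Design matrix Phi[j, k] = phi_k(age_j); solve min_c ||height - Phi c||^2
Phi = np.column_stack([
    BSpline(knots, np.eye(nbasis)[k], order - 1)(age) for k in range(nbasis)
])
c, *_ = np.linalg.lstsq(Phi, height, rcond=None)
fit = Phi @ c
```

With fewer basis functions than observations this fit is a compromise rather than an interpolant, which is exactly the distinction the slide above is making.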
Why we need to smooth Noise in the data has a huge impact on derivative estimates.
Please let me smooth the data!
This command sets up 12 B-spline basis functions defined by equally spaced knots. This gives us about the right amount of fitting power given the error level.
hgtbasis = create_bspline_basis([1,18], 12, 6);
These velocities are much better.
• They go negative on the right, though.
Let’s see some accelerations
• These acceleration curves are too unstable at the ends.
• We need something better.
A measure of roughness
• What do we mean by “smooth”?
• A function that is smooth has limited curvature.
• Curvature depends on the second derivative. A straight line is completely smooth.
Total curvature
We can measure the roughness of a function x(t) by integrating its squared second derivative, written D2x(t):
PEN2(x) = ∫ [D2x(t)]^2 dt
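This roughness measure is easy to approximate numerically: difference the function m times and integrate the square. A sketch (finite differences on a uniform grid, with illustrative test functions), written so the same routine also covers the fourth-derivative penalty used below:

```python
import numpy as np

def roughness(x, t, m=2):
    """Approximate PEN_m(x) = integral of [D^m x(t)]^2 dt on a uniform grid."""
    dt = t[1] - t[0]
    d = x.copy()
    for _ in range(m):
        d = np.gradient(d, dt)     # repeated finite-difference derivative
    return np.trapz(d ** 2, t)

t = np.linspace(0, 1, 401)
line = 2.0 + 3.0 * t               # a straight line is completely smooth
wiggle = np.sin(4 * np.pi * t)     # a curved function has positive roughness

print(roughness(line, t))          # ~0: zero curvature
print(roughness(wiggle, t))        # large: lots of curvature
```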
Total curvature of acceleration
Since we want acceleration to be smooth, we measure roughness at the level of acceleration:
PEN4(x) = ∫ [D4x(t)]^2 dt
The penalized least squares criterion
We strike a compromise between fitting the data and keeping the fit smooth:
F(c) = Σj [yj − x(tj)]^2 + λ PEN(x)
How does this control roughness?
• Smoothing parameter λ controls roughness.
• When λ = 0, only fitting the data matters.
• But as λ increases, we place more and more emphasis on penalizing roughness.
• As λ → ∞, only roughness matters, and functions having zero roughness are used.
We can either smooth at the data fitting step, or smooth a rough function.
This Matlab command smooths the fit to the data obtained using knots at the ages. The size of the fourth derivative is penalized, which keeps the estimated acceleration smooth.
lambda = 0.01;
hgtfd = smooth_fd(hgtfd, lambda, 4);
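smooth_fd is a toolbox function; as a stand-in for the penalized criterion, a discrete Whittaker-style smoother penalizes squared second differences of the fitted values directly. This is not the toolbox's basis-space computation, just a minimal sketch of the same trade-off, with synthetic data:

```python
import numpy as np

def whittaker_smooth(y, lam):
    """Minimize ||y - x||^2 + lam * ||D2 x||^2 over grid values x."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=100)   # noisy synthetic signal

rough = whittaker_smooth(y, 0.0)     # lam = 0: reproduces the data exactly
smooth = whittaker_smooth(y, 100.0)  # larger lam: strictly less rough than y
```

The two calls illustrate the λ limits from the previous slide: λ = 0 returns the data untouched, and increasing λ forces the solution toward zero roughness.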
Accelerations using a roughness penalty These accelerations are much less variable at the extremes.
How did you choose λ?
• We smooth just enough to obtain tolerable roughness in the estimated curves (accelerations in this case), but not so much as to lose interesting variation.
• There are data-driven methods for choosing λ, but they offer only a reasonable place to begin exploring.
• This inevitably involves judgment.
What about monotonicity?
• The growth curves should be monotonic.
• The velocities should be non-negative.
• It’s hard to prevent linear combinations of anything from breaking rules like monotonicity.
• We need an indirect approach to constructing a monotonic model.
A differential equation for monotonicity
Any strictly monotonic function x(t) must satisfy a simple linear differential equation:
D2x(t) = w(t) Dx(t)
The reason is simple: because of strict monotonicity, the first derivative Dx(t) will never be 0, and the function w(t) is therefore simply D2x(t)/Dx(t).
The solution of the differential equation
Consequently, any strictly monotonic function x(t) must be expressible in the form
x(t) = β0 + β1 ∫_0^t exp[ ∫_0^u w(v) dv ] du
This suggests that we transform the monotone smoothing problem into one of estimating the function w(t) and the constants β0 and β1.
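The point of this representation is that w(t) is completely unconstrained: the exponential of its integral is always positive, so the outer integral is strictly increasing whatever w is. A numerical sketch with an arbitrary illustrative w (trapezoidal integration on a grid; β0 and β1 are hypothetical constants):

```python
import numpy as np

# x(t) = beta0 + beta1 * integral_0^t exp( integral_0^u w(v) dv ) du
t = np.linspace(0, 1, 500)
w = np.sin(5 * t) - 0.5          # any function at all: no sign constraint on w
dt = t[1] - t[0]

# Inner integral W(u) = integral_0^u w(v) dv via cumulative trapezoid rule
W = np.concatenate(([0.0], np.cumsum(0.5 * (w[1:] + w[:-1]) * dt)))
inner = np.exp(W)                # always > 0, whatever w is
# Outer integral of the positive integrand, again by cumulative trapezoid
outer = np.concatenate(([0.0], np.cumsum(0.5 * (inner[1:] + inner[:-1]) * dt)))

beta0, beta1 = 70.0, 100.0       # illustrative constants; beta1 > 0
x = beta0 + beta1 * outer        # strictly increasing by construction
```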
What we have learned from the growth data
• We can control smoothness either by using a restricted number of basis functions, or by imposing a roughness penalty.
• Roughness penalty methods generally work better than simple basis expansions.
• Differential equations can play a useful role in defining constrained functions.
Nondurable goods last less than two years: food, clothing, cigarettes, alcohol, but not personal computers!
The nondurable goods manufacturing index is an indicator of the economics of everyday life. The index has been published monthly by the US Federal Reserve Board since 1919. It complements the durable goods manufacturing index.
What we want to do
• Look at important events.
• Examine the overall trend in the index.
• Have a look at the annual or seasonal behavior of the index.
• Understand how the seasonal behavior changes over the years and with specific events.
Events and Trends
• Short term:
  • 1929 stock market crash
  • 1937 restriction of money supply
  • 1974 end of Vietnam war, OPEC oil crisis
• Medium term:
  • Depression
  • World War II
  • Unusually rapid growth, 1960–1974
  • Unusually slow growth, 1990 to present
• Long term: increase of 1.5% per year
The evolution of seasonal trend
• We focus on the years 1948 to 1999.
• We estimate long- and medium-term trend by spline smoothing, but with knots too far apart to capture seasonal trend.
• We subtract this smooth trend to leave only the seasonal trend.
Smoothing the data
We want to represent the data yj by a smooth curve x(t). The curve should have at least two smooth derivatives. We use spline smoothing, penalizing the size of the 4th derivative.
A function Pspline in S-PLUS is available by ftp from ego.psych.mcgill.ca/pub/ramsay/FDAfuns
Seasonal Trend
• Typically three peaks per year.
• The largest is in the fall, peaking at the beginning of October.
• The low point is mid-December.
Phase-Plane Plots
• Looking at seasonal trend itself does not reveal as much as looking at the interplay between:
  • Velocity, the first derivative, reflecting kinetic energy in the system.
  • Acceleration, the second derivative, reflecting potential energy.
• The phase-plane diagram plots acceleration against velocity.
• For purely sinusoidal trend, the plot would be an ellipse.
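The ellipse claim for a sinusoid can be verified directly: for x(t) = sin(ωt), the phase-plane point (Dx, D2x) = (ω cos ωt, −ω² sin ωt) satisfies (v/ω)² + (a/ω²)² = 1 at every t. A quick numerical confirmation (ω chosen to give one cycle per year):

```python
import numpy as np

omega = 2 * np.pi                        # one cycle per unit of time (a "year")
t = np.linspace(0, 1, 1000)
v = omega * np.cos(omega * t)            # velocity Dx of x(t) = sin(omega t)
a = -omega ** 2 * np.sin(omega * t)      # acceleration D2x

# Every (v, a) point lies on the ellipse (v/omega)^2 + (a/omega^2)^2 = 1
on_ellipse = (v / omega) ** 2 + (a / omega ** 2) ** 2
print(np.allclose(on_ellipse, 1.0))      # True
```

Departures from a pure ellipse in the real index are exactly what make the phase-plane plot informative: energy exchange that is not simple harmonic.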
Phase-plane plot for 1964
There are three large loops separated by two small loops or cusps:
• Spring cycle: mid-January into April
• Summer cycle: May through August
• Fall cycle: October through December
1929 through 1931
• The stock market crash shows up as a large negative surge in velocity.
• Subsequent years nearly lose the fall production cycle, as people tighten their belts and spend less at Christmas.