Modeling Data Counts over Geographic Areas

SPATIAL MODELS FOR DATA REPORTED AS COUNTS OVER GEOGRAPHIC AREAS Gary Simon, 28 APRIL 2006

With special thanks to: Frank LoPresti, Academic Computing Services, GIS Group; Kevin Tun, Stern I.T. Group

Here’s an interesting obscure formula. Consider a set of points: Point 1: $(x_1, y_1)$, Point 2: $(x_2, y_2)$, …, Point n: $(x_n, y_n)$.

Connect the points in order. Draw a line from point 1 to point 2, then from point 2 to point 3, …, from point n − 1 to point n. Finally draw a line from point n back to point 1. Assume that none of the segments cross, so that this is a polygon.

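The area of the resulting polygon is given by $\text{Area} = \pm \tfrac{1}{2} \sum_{i=1}^{n} (x_i y_{i+1} - x_{i+1} y_i)$, where point n + 1 is taken to be point 1. The + occurs when the perimeter is drawn counter-clockwise, the − when drawn clockwise.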
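As a minimal sketch of the polygon area formula, the following Python function computes the signed sum, so a positive result indicates counter-clockwise ordering and a negative result clockwise ordering. The function name and the test square are illustrative choices, not part of the original slides.

```python
def signed_polygon_area(points):
    """Signed area of a simple polygon from its vertices in drawing order.

    Positive when the vertices run counter-clockwise, negative when clockwise.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap from point n back to point 1
        total += x1 * y2 - x2 * y1
    return total / 2.0


# Illustrative 2-by-2 square, listed counter-clockwise.
square_ccw = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(signed_polygon_area(square_ccw))        # 4.0
print(signed_polygon_area(square_ccw[::-1]))  # -4.0
```

Taking the absolute value gives the area regardless of the drawing direction.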

The data: K regions; counts $z_1, z_2, \ldots, z_K$ with total count $z_+$; populations $P_1, P_2, \ldots, P_K$ with total population $P_+$.

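The obvious null hypothesis of uniformity is tested by $G^2 = 2 \sum_{k=1}^{K} z_k \log(z_k / e_k)$, where $e_k = z_+ P_k / P_+$ is the expected count for region k under uniformity.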
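The uniformity test can be sketched in a few lines of Python. This assumes the usual likelihood-ratio form of $G^2$, with expected counts proportional to population; the three-region counts below are made-up toy numbers, not data from the talk.

```python
import math


def uniform_expected(counts, populations):
    """Expected counts under uniformity: the total count split in proportion to population."""
    z_total = sum(counts)
    p_total = sum(populations)
    return [z_total * p / p_total for p in populations]


def g2_statistic(counts, expected):
    """Likelihood-ratio statistic 2 * sum z_k * log(z_k / e_k); zero counts contribute 0."""
    return 2.0 * sum(z * math.log(z / e) for z, e in zip(counts, expected) if z > 0)


# Hypothetical toy data: three regions with equal populations.
toy_counts = [10, 20, 30]
toy_populations = [1000, 1000, 1000]
toy_expected = uniform_expected(toy_counts, toy_populations)
toy_g2 = g2_statistic(toy_counts, toy_expected)
print(toy_expected, toy_g2)
```

The statistic is referred to a chi-square distribution on K − 1 degrees of freedom, which is 2 for the toy data.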

Uniformity is often rejected. What should be the alternative to uniformity? Techniques like kriging assess covariance structure and not the structure of the expected counts.

There are also techniques that measure spatial association (Cliff and Ord, 1973, 1981) with Moran's I and with Geary's c, and these also relate to covariance notions. Spatial association can also be given angular interpretations (Simon, 1997).

Cliff, A.D. and Ord, J.K. (1973) Spatial Autocorrelation, London: Pion.
Cliff, A.D. and Ord, J.K. (1981) Spatial Processes: Models and Applications, London: Pion.
Simon, Gary (1997) An Angular Version of Spatial Correlations, with Exact Significance Tests, Geographical Analysis, vol. 29, no. 3, pp. 267-278.

Let’s form a model for the “spatial force” and give this model a central location or hot spot. Denote this location as $s = (s_x, s_y)$. Here $s_x$ and $s_y$ are parameters to be estimated.

Let f(z) be the spatial force at location $z = (z_x, z_y)$. Then let f(z) = [equation not preserved].

With this form, $f(s) = c$. At any z with $\|z - s\| = \alpha$, $f(z) = c/2$. Thus α is a “half-strength” distance.

In this form, the only role of c is to assure the condition [equation not preserved].

This can be generalized to mix uniform and hot-spot features: f(z) = [equation not preserved]. The parameter ω assesses the strength of the hot spot relative to uniformity. A negative ω indicates a protective effect.
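To make the roles of s, α, and ω concrete, here is a hedged Python sketch. The slide's actual formula for f(z) is not available here, so the kernel $1 / (1 + \|z - s\|^2 / \alpha^2)$ and the mixture $c\,(1 + \omega \cdot \text{kernel})$ are assumptions chosen only because they reproduce the stated properties: full strength at s, half strength at distance α, and a uniform component that remains far from the hot spot. The function names are hypothetical.

```python
def hot_spot_kernel(z, s, alpha):
    """ASSUMED kernel: 1 at the hot spot s, 1/2 at distance alpha, decaying to 0 far away."""
    d2 = (z[0] - s[0]) ** 2 + (z[1] - s[1]) ** 2
    return 1.0 / (1.0 + d2 / alpha ** 2)


def spatial_force(z, s, alpha, omega, c=1.0):
    """ASSUMED mixture of a uniform part (1) and a hot-spot part (omega * kernel)."""
    return c * (1.0 + omega * hot_spot_kernel(z, s, alpha))


# Parameter values reported later in the talk, in (x, y) miles.
s_hat = (375.8877, 300.6793)
alpha_hat = 13.4375
omega_hat = 2.325

at_center = spatial_force(s_hat, s_hat, alpha_hat, omega_hat)
at_alpha = spatial_force((s_hat[0] + alpha_hat, s_hat[1]), s_hat, alpha_hat, omega_hat)
print(at_center, at_alpha)
```

Under these assumptions a negative ω pulls the force below the uniform level near s, which matches the protective-effect reading.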

The maximum likelihood expected counts $\{ e_k \}$ will be used in the test statistic $G^2 = 2 \sum_{k=1}^{K} z_k \log(z_k / e_k)$.

The value of $e_k$ will be computed as $P_k$ × “average” force on county k, scaled so that $\sum_{k=1}^{K} e_k = z_+$.

Consider cancer rates in Florida. “Age-Adjusted Death Rates for Florida, 1998 – 2002.” http://www.stateofflorida.com

Florida has 67 counties. There were 38,814 cases in a population of 15,982,378. The rate is 2.43 per 1,000. The $G^2$ statistic is 2,816.27 on 66 degrees of freedom. The cancer rates are not uniform.

The maximum likelihood fit occurred at parameter values $s_x$ = 375.8877, $s_y$ = 300.6793, α = 13.4375, ω = 2.325.

This fit has $G^2$ = 2,246.93 on 67 − 1 − 4 = 62 degrees of freedom. This is still an inadequate fit, but the reduction in $G^2$ is 569.34 with four degrees of freedom.
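The arithmetic behind these figures can be checked with a short Python sketch. Referring the reduction in $G^2$ to a chi-square distribution on four degrees of freedom is the standard nested-model assumption and is stated here as such; for even degrees of freedom the chi-square tail probability has a closed form, so no statistics library is needed.

```python
import math

cases, population = 38_814, 15_982_378
rate_per_1000 = 1000 * cases / population

counties, fitted_parameters = 67, 4
df_uniform = counties - 1                      # 66
df_hot_spot = df_uniform - fitted_parameters   # 62

g2_uniform, g2_hot_spot = 2816.27, 2246.93
reduction = g2_uniform - g2_hot_spot


def chi2_tail_4df(x):
    """Upper tail probability of a chi-square variable with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


print(round(rate_per_1000, 2), df_hot_spot, round(reduction, 2))
print(chi2_tail_4df(reduction))
```

The tail probability is vanishingly small, so the four hot-spot parameters improve on uniformity by a wide margin even though the remaining $G^2$ is still large for 62 degrees of freedom.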

The fitted values are these: The hot spot is at (82.56° W longitude, 28.80° N latitude), in Citrus County.

Map information comes in (longitude, latitude) form that needs to be converted to (x, y) form in (say) miles.

Each degree of latitude has the same mile equivalent: one degree of latitude cuts off the same arc length at all latitudes. [Figure: globe showing the North Pole and the equatorial plane.]

However, a degree of longitude represents a small distance near the poles and a large distance near the equator. [Figure: globe showing 30° N latitude and the equator.]

Problem: Find the length of one degree of longitude at latitude θ. Solution: Form a spherical triangle with one corner at the north pole, an angle of one degree at the north pole, and two sides of 90° − θ each.

In a spherical triangle, the sides also have angle measure. [Figure: globe showing 30° N latitude and the equator.]

We can use the law of sines for spherical triangles: $\frac{\sin A}{\sin a} = \frac{\sin B}{\sin b} = \frac{\sin C}{\sin c}$, where A, B, C are the angles and a, b, c are the sides.
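One way to carry out this computation is sketched below in Python. It bisects the isosceles spherical triangle into two right triangles and applies the law of sines to one of them, which gives sin(c/2) = sin(90° − θ) · sin(0.5°) for the unknown side c. The Earth radius of 3,959 miles and the function names are assumptions for illustration; the slides do not state which radius was used.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # assumed mean radius, not taken from the slides


def degree_of_latitude_miles(radius=EARTH_RADIUS_MILES):
    """Arc length of one degree of latitude; the same at every latitude on a sphere."""
    return math.radians(1.0) * radius


def degree_of_longitude_miles(lat_deg, radius=EARTH_RADIUS_MILES):
    """Great-circle length of one degree of longitude at latitude lat_deg.

    Law of sines in the right spherical triangle formed by bisecting the
    polar triangle: sin(c / 2) / sin(0.5 deg) = sin(90 deg - lat) / sin(90 deg).
    """
    colatitude = math.radians(90.0 - lat_deg)
    half_side = math.asin(math.sin(colatitude) * math.sin(math.radians(0.5)))
    return 2.0 * half_side * radius


for lat in (0.0, 30.0, 60.0):
    print(lat, degree_of_longitude_miles(lat))
```

The result is very close to the latitude degree length multiplied by cos θ, so a degree of longitude at 30° N is roughly 87% of its length at the equator.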

The value $E(z_k) = e_k$ is computed as $P_k$ × “average” force on county k. This average force could be $f(c_k)$, where $c_k$ is the center of the county.

Instead we will use $\frac{1}{\text{Area}(\Re)} \iint_{\Re} f(h)\,dh$, where ℜ denotes the county and h is the two-dimensional variable of integration.

The value of Area(ℜ) can be obtained from outside sources. The challenge comes in finding $\iint_{\Re} f(h)\,dh$. This can be difficult even for simple figures, and ℜ is not simple.

Finding $\iint_{\Re} f(h)\,dh$ requires some organized description of ∂ℜ, the boundary of ℜ. Fortunately, such descriptions are available from mapping programs.

Consider this geographical region:

Mapping program MapInfo will export an MIF file giving coordinates of (longitude, latitude) points on the boundary. The leading number is the count of points that follow, and the last point repeats the first to close the boundary. The file has this layout:

    26
    -75 40.1288
    -75.0154 40.1378
    -75.1094 40.0454
    . . .
    -75 40.0294
    -74.9755 40.0485
    -74.9893 40.1259
    -75 40.1288
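Reading such a coordinate block is straightforward. The Python sketch below handles only the count-plus-pairs layout shown above, not the full MIF format with its headers and multiple regions. The function names are hypothetical, and the four-point sample is a shortened illustration built from the first few coordinates, not the real 26-point boundary.

```python
def parse_boundary_block(text):
    """Parse 'count, then count (longitude, latitude) pairs' into a list of tuples."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(tok) for tok in tokens[1:1 + 2 * count]]
    if len(values) != 2 * count:
        raise ValueError("fewer coordinates than the leading count promises")
    return [(values[i], values[i + 1]) for i in range(0, 2 * count, 2)]


def drop_closing_point(points):
    """Remove the final point when it merely repeats the first one."""
    if len(points) > 1 and points[0] == points[-1]:
        return points[:-1]
    return points


# Shortened illustrative block (not the full boundary from the slide).
sample_block = """4
-75 40.1288
-75.0154 40.1378
-75.1094 40.0454
-75 40.1288
"""

ring = parse_boundary_block(sample_block)
vertices = drop_closing_point(ring)
print(len(ring), len(vertices), vertices[0])
```

Dropping the repeated closing point leaves one entry per distinct vertex, which is convenient when the boundary is later treated as a cyclic list of segments.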

A graph of these points:

With the boundary so identified, county ℜ is a polygon, so the task of finding $\iint_{\Re} f(h)\,dh$ is equivalent to integrating over that polygon. The mathematics can be done with Green’s theorem.

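Green’s theorem for connected region ℜ and for scalar functions P and Q of two variables is $\oint_{\partial\Re} (P\,dx + Q\,dy) = \iint_{\Re} \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dx\,dy$, with the boundary traversed counter-clockwise.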

The boundary ∂ℜ needs to be parameterized as a function of a single variable, say t. This is possible when the boundary is made up of simple curves or, as in the MapInfo story, straight lines.

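The line connecting $(x_k, y_k)$ to $(x_{k+1}, y_{k+1})$ is parameterized as $x(t) = x_k + t\,(x_{k+1} - x_k)$, $y(t) = y_k + t\,(y_{k+1} - y_k)$ for $0 \le t \le 1$. Note that dy means $\frac{dy}{dt}\,dt = (y_{k+1} - y_k)\,dt$.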

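In the statement of Green’s theorem, $\oint_{\partial\Re} (P\,dx + Q\,dy) = \iint_{\Re} \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dx\,dy$, let’s use $\frac{\partial Q}{\partial x} = 1$ and $\frac{\partial P}{\partial y} = 0$ so that $\frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} = 1$.

Green’s theorem is now $\oint_{\partial\Re} (P\,dx + Q\,dy) = \iint_{\Re} 1\,dx\,dy = \text{Area}(\Re)$.

This solves as P(x, y) = 0 and Q(x, y) = x, and then $\text{Area}(\Re) = \oint_{\partial\Re} x\,dy$.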

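With the boundary ∂ℜ given as a polygon, the calculation is routine. The segment from $(x_k, y_k)$ to $(x_{k+1}, y_{k+1})$ contributes $\tfrac{1}{2}(x_k + x_{k+1})(y_{k+1} - y_k)$ to $\oint x\,dy$. The consequence is $\text{Area}(\Re) = \pm \tfrac{1}{2} \sum_{k=1}^{m} (x_k + x_{k+1})(y_{k+1} - y_k)$, where m is the number of boundary points of region ℜ and point m + 1 is taken to be point 1. The sum is positive for counter-clockwise ordering and negative for clockwise ordering.

This calculation finds the area of region ℜ and, as a side benefit, discovers whether the point ordering was clockwise or counter-clockwise.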

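We need also the integrated force function $\iint_{\Re} f(x, y)\,dx\,dy$.

Match to Green’s theorem $\oint_{\partial\Re} (P\,dx + Q\,dy) = \iint_{\Re} \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dx\,dy$ with P(x, y) ≡ 0 and $\frac{\partial Q}{\partial x} = f(x, y)$.

This means that we need to be able to find $Q(x, y) = \int f(x, y)\,dx$. The solution is Q(x, y) = [equation not preserved].

Then $\iint_{\Re} f(x, y)\,dx\,dy = \oint_{\partial\Re} Q(x, y)\,dy = \sum_{k=1}^{m} \int_{\text{segment } k} Q(x, y)\,dy$.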

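Let $(x_1, y_1)$, $(x_2, y_2)$, …, $(x_m, y_m)$ be the boundary points of ℜ. Then $\int_{\text{segment } k} Q\,dy = (y_{k+1} - y_k) \int_0^1 Q\big(x_k + t\,(x_{k+1} - x_k),\; y_k + t\,(y_{k+1} - y_k)\big)\,dt$. Segment k connects point k to point k + 1. (The last segment goes back to point 1.)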
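The segment-by-segment boundary integral can be sketched numerically in Python. This version approximates each one-dimensional integral with composite Simpson's rule rather than a closed form, which is a substitution made here for illustration; the function name, the step count, and the test triangle are assumptions. Any Q with ∂Q/∂x = f can be passed in, so Q(x, y) = x recovers the area, and Q(x, y) = x²/2 integrates f(x, y) = x over the region.

```python
def closed_integral_q_dy(points, q, steps=200):
    """Approximate the closed line integral of q(x, y) dy around a polygon.

    points: vertices in order (the closing segment back to the first is implied).
    steps:  Simpson subintervals per segment (rounded up to an even number).
    """
    if steps % 2:
        steps += 1
    h = 1.0 / steps
    m = len(points)
    total = 0.0
    for k in range(m):
        xk, yk = points[k]
        xn, yn = points[(k + 1) % m]  # last segment returns to point 1
        dy = yn - yk
        if dy == 0.0:
            continue  # horizontal segments contribute nothing to a dy integral
        acc = 0.0
        for j in range(steps + 1):
            t = j * h
            weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
            acc += weight * q(xk + t * (xn - xk), yk + t * dy)
        total += dy * acc * h / 3.0
    return total


# Illustrative right triangle, listed counter-clockwise.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
area = closed_integral_q_dy(triangle, lambda x, y: x)
integral_of_x = closed_integral_q_dy(triangle, lambda x, y: x * x / 2.0)
print(area, integral_of_x)
```

Dividing the integrated force by the area gives the average force over the region, which is the quantity multiplied by the population $P_k$ to form the expected count.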
