Statistical Analysis of Geographical Information

Dr. Marina Gavrilova

Topics • Introduction • Distribution Descriptors: One Variable • Relationship Descriptors: Two Variables • Point Pattern Descriptors • Point Pattern Analyzers • Autocorrelation

Introduction: quantitative measures to describe data. Statistics classification • Classified by function: • Descriptive statistics • Inferential statistics • Classified by area of application: • Classical statistics: sociology, political science, medicine, and engineering. • Spatial statistics: based on classical statistics and extended to spatially referenced data. • Geostatistics: a branch of spatial statistics that originated in geoscience.

Random and Systematic Processes • When a certain phenomenon occurs, is it the result of a random process or a systematic process? • Soil Example: • Hypothesis: the soil fertility of a farm is low. • To test the hypothesis, gather more data about the soil. • Collect a sample of soil for further examination instead of examining the entire population. • Observation: each examined location; Sample size: the number of observations selected.

Features of spatial data (1) A region can be partitioned in many ways based on the given criteria. In the USA, for example: state boundaries, census geography. The Modifiable Areal Unit Problem (MAUP) includes: • Scale effect: analyzing data at multiple levels of spatial resolution yields inconsistent results. • Zoning effect: analyzing data derived from different zonal systems with a similar number of areal units yields inconsistent results.

Features of spatial data (2) Spatial autocorrelation represents the nature of geography and, consequently, will almost always be present in spatial data. Tobler's “First Law of Geography”: “All things are related to each other, but closer things are related more.” Butterfly Effect: a butterfly flapping its wings in China may cause a hurricane landfall in the US due to the spatial propagation of air disturbances.

Distribution descriptors: one variable

Measures of central tendency • Mode: the value that occurs most frequently in a set of data, also called the modal value. If two or more categories have the highest frequency, the data are bimodal or multimodal. • Median: the middle value after all values are sorted in ascending or descending order. • Mean or Average: for n observations, each with an observed value $x_i$, the simple arithmetic mean is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Measures of central tendency • Grouped or weighted mean: if data values are grouped into classes, then all data within each group are represented by one value, the overall value for that class. A mean derived from the grouped data is called a grouped mean or a weighted mean. • If $x_i$ is the midpoint of the i-th class (k classes altogether) and $f_i$ is the number of data values in that class (its frequency), the weighted mean is $\bar{x}_w = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$.
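The measures of central tendency above can be sketched with Python's standard library. This is a minimal illustration, not part of the original slides: the data values, the class midpoints, and the helper name `weighted_mean` are all hypothetical.

```python
from statistics import mean, median, multimode

# Hypothetical raw observations (illustrative values only).
values = [2, 3, 3, 5, 7, 7, 8]

modes = multimode(values)   # every most-frequent value; two values means bimodal
middle = median(values)     # middle value after sorting
average = mean(values)      # simple arithmetic mean: sum(x_i) / n


def weighted_mean(midpoints, frequencies):
    """Grouped (weighted) mean: sum(f_i * x_i) / sum(f_i)."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(midpoints, frequencies)) / total


# Hypothetical classes 0-10, 10-20, 20-30, represented by their midpoints.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

print(modes, middle, average)
print(weighted_mean(midpoints, frequencies))
```

Here the sample data are bimodal (3 and 7 each occur twice), which is why `multimode` is used rather than a function that returns a single mode.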

Measures of dispersion (1) While the mean is a good measure of the central tendency of a set of data, it captures no information about how the values are concentrated or scattered around the mean. • Range, Minimum, Maximum, and Percentiles: • Range = Maximum - Minimum • Percentiles are the data values below which certain percentages of the data fall. Data sets Xa and Xb have the same median, 7, but different 25th percentiles (3 for Xa and -5 for Xb): Xa = 1, 3, 5, 7, 9, 11, 13; Xb = -11, -5, 1, 7, 13, 19, 25.
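The Xa and Xb example can be checked with a short sketch. It assumes the "exclusive" percentile convention that `statistics.quantiles` uses by default; other percentile conventions interpolate differently and can give slightly different quartiles. The helper name `describe` is hypothetical.

```python
from statistics import quantiles

xa = [1, 3, 5, 7, 9, 11, 13]
xb = [-11, -5, 1, 7, 13, 19, 25]


def describe(data):
    """Return (minimum, maximum, range, quartiles) for a list of numbers."""
    lo, hi = min(data), max(data)
    # quantiles(..., n=4) returns the 25th, 50th and 75th percentiles,
    # computed here with the module's default "exclusive" method.
    q25, q50, q75 = quantiles(data, n=4)
    return lo, hi, hi - lo, (q25, q50, q75)


for name, data in (("Xa", xa), ("Xb", xb)):
    print(name, describe(data))
```

Both series share the median 7, but Xb has three times the range of Xa, which the mean and median alone would not reveal.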

Measures of dispersion (2) • Mean Deviation: unlike the dispersion measures discussed so far, which use one or a few data values in the series, the mean deviation takes all data values into account. It is calculated by summing the absolute differences between the individual data values and the mean and then dividing this sum by the number of observations: $MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$. Absolute values are needed because the signed deviations from the mean always sum to zero.

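Measures of dispersion (3) • Variance and Standard Deviation: another way to avoid the offsets caused by adding positive and negative deviations from the mean together is to square all deviations from the mean before summing them. The population variance is the mean of the squared deviations, $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$, and the standard deviation is its square root, $\sigma = \sqrt{\sigma^2}$. The sample versions divide by $n - 1$ instead of $n$.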
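A minimal sketch of the mean deviation, variance, and standard deviation, reusing the Xa and Xb series from the percentile example. It assumes the population forms (dividing by n) and absolute deviations for the mean deviation; the function names are hypothetical.

```python
from math import sqrt


def mean_deviation(data):
    """Average absolute deviation from the mean."""
    m = sum(data) / len(data)
    return sum(abs(x - m) for x in data) / len(data)


def population_variance(data):
    """Average squared deviation from the mean (divides by n, not n - 1)."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)


xa = [1, 3, 5, 7, 9, 11, 13]
xb = [-11, -5, 1, 7, 13, 19, 25]

for name, data in (("Xa", xa), ("Xb", xb)):
    var = population_variance(data)
    print(name, mean_deviation(data), var, sqrt(var))
```

Both series have mean 7, yet the standard deviation of Xb (12) is three times that of Xa (4), matching the threefold stretch of the values around the mean.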

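Measures of dispersion (4) • Weighted Variance and Weighted Standard Deviation: $\sigma_w^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{k} f_i}$ and $\sigma_w = \sqrt{\sigma_w^2}$, where $f_i$ is the frequency for the i-th group or class, $x_i$ is the midpoint value in the i-th group, $\bar{x}_w$ is the weighted mean, and k is the number of groups.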
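The weighted variance can be sketched as follows. This assumes the form that divides by the total frequency (some texts divide by the total frequency minus one); the class midpoints, frequencies, and function name are hypothetical.

```python
from math import sqrt


def weighted_variance(midpoints, frequencies):
    """Weighted variance of grouped data: sum(f_i * (x_i - mean_w)^2) / sum(f_i)."""
    total = sum(frequencies)
    w_mean = sum(f * x for x, f in zip(midpoints, frequencies)) / total
    return sum(f * (x - w_mean) ** 2 for x, f in zip(midpoints, frequencies)) / total


# Hypothetical classes represented by their midpoints and frequencies.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

w_var = weighted_variance(midpoints, frequencies)
print(w_var, sqrt(w_var))
```

For these values the weighted mean is 16, so the weighted squared deviations are 2(121) + 5(1) + 3(81) = 490, giving a weighted variance of 49 and a weighted standard deviation of 7.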

Relationship Descriptors: Two Variables

One Variable • The mean and its variations address the issue of location: where the observations are distributed along the continuous value line. The median and mode also address this central tendency issue. Variance, standard deviation, and percentiles address the issue of dispersion. Skewness deals with directional bias. Kurtosis addresses the issue of concentration. All these measures focus on the distribution of the values of one variable at a time.

Relationship Descriptors • The mean and standard deviation cannot quantitatively measure the relationships between different distributions. • Correlation statistically measures the direction and strength of the relationship between two sets of data or two variables for a number of observations. Regression measures the dependence of one variable on another.

Correlation Analysis (1) • Education is traditionally regarded as an asset. It enriches a person’s life in many ways. We usually believe that education and income are somewhat related and change in the same direction. If we recognize the value of education in eventually achieving a higher income, it would be nice to know how strong this relationship is, that is, how these aspects of life are related or correlated.

Correlation Analysis (2) • Each relationship has two important aspects: the direction and the strength of the relationship. Between two related variables, the relationship is typically measured as correlation, a statistical measure indicating how values in one variable are related to values in the other variable. • Positive or direct correlation • Negative or inverse correlation
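The slides do not say which correlation coefficient is used. A minimal sketch, assuming Pearson's product-moment coefficient (a common choice), shows how direction and strength are captured in a single number between -1 and 1. The education and income figures are hypothetical, as is the function name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical observations: years of education and income (in thousands).
education = [10, 12, 14, 16, 18]
income = [25, 32, 41, 48, 60]

print(pearson_r(education, income))   # positive: the two rise together
print(pearson_r([1, 2, 3], [6, 4, 2]))  # negative: one falls as the other rises
```

The sign gives the direction (positive or inverse) and the magnitude gives the strength, with values near 0 indicating little linear relationship.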

Trend Analysis • Trend analysis is a technique for measuring a trend, while correlation is a statistical measure of the relationship between two variables. • Trend analysis addresses the dependence of one variable on another. • Going beyond the strength and direction of the relationship, trend analysis allows us to model the relationship and to estimate the likely value of one variable based on the value of another variable. • Models that are constructed with this technique are known as regression models.

Simple Linear Regression Model • Simple linear regression model or bivariate regression model: using a straight line to model the relationship between two variables. Here is an example: a regression between median household income and median house value for 51 states (the 50 states plus the District of Columbia).
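A straight-line model of this kind can be fitted by ordinary least squares. The sketch below is an assumption about the fitting method (the slides do not state it), and the income and house-value figures are hypothetical, not the state data from the example.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Hypothetical data: median household income and median house value (thousands).
income = [40, 45, 50, 55, 60]
house_value = [120, 140, 155, 180, 195]

intercept, slope = fit_line(income, house_value)
# Estimate the likely house value for an income of 52 (thousand).
print(intercept + slope * 52)
```

The fitted slope is the estimated change in house value per unit change in income, which is what lets the model go beyond correlation and produce estimates.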

Regression model • Some phenomena may be modeled reasonably well by regression, and others may not. • The simple regression model assumes a linear relationship between the variables. If the relationship is not linear, or if the two variables have a weak relationship or none, then the model will perform poorly. Under either circumstance, we may have committed a model specification error. • A multivariate regression model can accommodate multiple independent variables.

Point Pattern descriptors and analyzers

Point Pattern • Point Pattern Descriptors • Central Tendency • Dispersion and Orientation • Point Pattern Analyzers • Quadrat Analysis • Nearest-Neighbor Analysis • Spatial Autocorrelation of Points • K-Function

The Nature of Point Features Point pattern descriptors cover: • The methods for determining the overall patterns of a given set of points. • Measures used to describe the magnitude of spatial dispersion of a given set of points. • How the direction bias of a set of points can be extracted statistically.

Central Tendency of Point Distributions • A set of point descriptors provides descriptive information on the distribution of a set of points. • Measures of central tendency (mean centers, weighted mean centers, and median centers) provide a good summary of how a set of points is distributed in geographic space. • To describe the spatial dispersion characteristics of a set of points, the measures of standard distance and standard ellipse will be discussed. These measures indicate the spatial variation and orientation of a point distribution.

Mean Center The mean center, or spatial mean, is a central or average location of a set of points. For n points, $x_{mc} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $y_{mc} = \frac{1}{n}\sum_{i=1}^{n} y_i$, where $x_{mc}$ and $y_{mc}$ are the coordinates of the mean center, $x_i$ and $y_i$ are the coordinates of point i, and n is the number of points.

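Weighted Mean Center The weighted mean center of a distribution of points can be found by multiplying the x- and y-coordinates of each point by the weight assigned to each observation or location and dividing the sums by the total weight: $x_{wmc} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$ and $y_{wmc} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$. • $w_i$ is the weight at point i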
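Both centers can be sketched in a few lines. The coordinates and weights below are hypothetical and assume planar (projected) coordinates; the function names are illustrative.

```python
def mean_center(points):
    """Average x and average y of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def weighted_mean_center(points, weights):
    """Weight-averaged x and y: sum(w_i * x_i) / sum(w_i), same for y."""
    total = sum(weights)
    xw = sum(w * x for (x, _), w in zip(points, weights)) / total
    yw = sum(w * y for (_, y), w in zip(points, weights)) / total
    return (xw, yw)


# Hypothetical locations at the corners of a 4 x 4 square.
points = [(0, 0), (4, 0), (4, 4), (0, 4)]
# Hypothetical weights (for example, population); the last point dominates.
weights = [1, 1, 1, 5]

print(mean_center(points))
print(weighted_mean_center(points, weights))
```

The unweighted center sits in the middle of the square at (2, 2), while the heavy weight on (0, 4) pulls the weighted center toward that corner, to (1, 3).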

Dispersion and Orientation of Point Distributions • Two sets of points may occupy the same geographic space and may be interrelated. • For example, one set of points represents the locations of forest fires and the other the locations of camping cabins in a wildlife region. They may have the same overall location, but forest fires have a more dispersed spatial pattern than cabins. • In addition to spatial central tendency, it may be interesting to evaluate the magnitude of dispersion of locations and the orientation of the spatial distribution.

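Standard Distance The standard distance is the spatial equivalent of the standard deviation in classical statistics (the population standard deviation, $\sigma$, or the sample standard deviation, $S$). It can be computed as: $SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_{mc})^2 + \sum_{i=1}^{n} (y_i - y_{mc})^2}{n}}$, where $(x_{mc}, y_{mc})$ is the mean center.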

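Weighted Standard Distance Points in a distribution may have different attribute values that reflect the relative importance of different point observations: $SD_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - x_{wmc})^2 + \sum_{i=1}^{n} w_i (y_i - y_{wmc})^2}{\sum_{i=1}^{n} w_i}}$. • $w_i$ is the weight for point i, and • $(x_{wmc}, y_{wmc})$ is the weighted spatial mean.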
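A minimal sketch of the standard distance and its weighted version. It assumes the population form, dividing by n (or by the total weight); some software applies a small-sample correction instead. The coordinates, weights, and function names are hypothetical.

```python
from math import sqrt


def standard_distance(points):
    """Root of the mean squared distance of points from their mean center."""
    n = len(points)
    xmc = sum(x for x, _ in points) / n
    ymc = sum(y for _, y in points) / n
    return sqrt(sum((x - xmc) ** 2 + (y - ymc) ** 2 for x, y in points) / n)


def weighted_standard_distance(points, weights):
    """Same idea, with squared distances weighted and measured from the weighted center."""
    total = sum(weights)
    xw = sum(w * x for (x, _), w in zip(points, weights)) / total
    yw = sum(w * y for (_, y), w in zip(points, weights)) / total
    ssq = sum(w * ((x - xw) ** 2 + (y - yw) ** 2) for (x, y), w in zip(points, weights))
    return sqrt(ssq / total)


# Hypothetical locations and weights.
points = [(0, 0), (4, 0), (4, 4), (0, 4)]
weights = [1, 1, 1, 5]

print(standard_distance(points))
print(weighted_standard_distance(points, weights))
```

The result is a single radius; drawn around the (weighted) mean center, it gives the standard distance circle used to visualize spread.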

Standard Deviational Ellipses • The standard distance circle is a very effective visualization tool for showing the spatial spread of a set of point locations. • A logical extension of the standard distance circle is the standard deviational ellipse. It can capture the directional bias in a point distribution. Three components are needed to describe it: • An angle of rotation • Deviation along the major axis • Deviation along the minor axis
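The three components can be sketched from the covariance of the coordinates. Note the assumptions: this uses the principal axes of the 2 x 2 covariance matrix (population form), which is one common way to obtain the ellipse and not necessarily the formulas on the original slide, and the angle is measured counterclockwise from the x-axis, whereas GIS packages often report it clockwise from north. The function name and sample points are hypothetical.

```python
from math import atan2, degrees, sqrt


def standard_deviational_ellipse(points):
    """Return (rotation in degrees, deviation along major axis, deviation along minor axis)."""
    n = len(points)
    xmc = sum(x for x, _ in points) / n
    ymc = sum(y for _, y in points) / n
    # Population variances and covariance of the coordinates.
    sxx = sum((x - xmc) ** 2 for x, _ in points) / n
    syy = sum((y - ymc) ** 2 for _, y in points) / n
    sxy = sum((x - xmc) * (y - ymc) for x, y in points) / n
    # Orientation of the major axis, counterclockwise from the x-axis.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    # Eigenvalues of the covariance matrix are the variances along the two axes.
    half_trace = (sxx + syy) / 2
    spread = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    major = sqrt(half_trace + spread)
    minor = sqrt(max(half_trace - spread, 0.0))  # guard against tiny negative rounding
    return degrees(theta), major, minor


# Hypothetical points lying exactly on the line y = x.
diagonal = [(-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2)]
print(standard_deviational_ellipse(diagonal))
```

For points on the line y = x the ellipse collapses to a segment rotated 45 degrees, with all the deviation along the major axis and none along the minor axis.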

Elements defining a standard deviational ellipse

Standard deviational ellipses for men-only and women-only shelters

Point Pattern Analyzers • To fully understand the various states and dynamics of a particular geographic phenomenon, an analyst must be able to detect spatial patterns in the point distributions and to track the changes in point patterns at different times.

Point Pattern Analyzers • Quadrat Analysis allows analysts to determine whether a point distribution is similar to a random pattern using a spatial sampling framework. • Nearest Neighbor Analysis compares the average distance between nearest neighbors in a set of points to that of a theoretical pattern. • Spatial autocorrelation coefficients measure how similar neighboring points are. • K-function analysis can identify and evaluate the clustering of points at different spatial scales, or extents.

Quadrat Analysis • Quadrat Analysis evaluates a point distribution by examining how its density changes over space. • The density measured by Quadrat Analysis is then compared with the density of a theoretically constructed random pattern to see whether the point distribution in question is more clustered or more dispersed than the random pattern.

General Concept in Quadrat Analysis (1) A regular square grid is laid over the study area, and a number of points fall in some of the squares. • The squares are referred to as quadrats, which are essentially sampling units in spatial statistical jargon. • A circle is the most geometrically compact shape; however, circles cannot cover the entire geographic space unless they overlap. • In an extremely clustered point pattern, all or most of the points fall inside one or a few squares only. In an extremely dispersed pattern, referred to as a uniform pattern or a triangular lattice, all squares contain a similar number of points.

Observed pattern of Ohio cities and hypothetical clustered and dispersed patterns

General Concept in Quadrat Analysis (2) • Statistically, Quadrat Analysis will achieve a fair evaluation of the density across the study area if it applies a large enough number of randomly generated quadrats. • An optimal quadrat size can be calculated as $2A/r$, where A is the area of the study area and r is the number of points in the distribution (so a square quadrat has side length $\sqrt{2A/r}$). • Once the quadrat size for a point distribution is determined, Quadrat Analysis can proceed to establish the frequency distribution of the number of points for all quadrats.
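The counting step can be sketched as below. This assumes systematic (regular grid) square quadrats over a rectangular study area with its origin at (0, 0), and treats the size 2A/r as the quadrat area; the function names and point coordinates are hypothetical. When the side length does not divide the study area evenly, the edge quadrats in this sketch extend past the boundary.

```python
from collections import Counter
from math import ceil, sqrt


def quadrat_counts(points, width, height):
    """Count points per square quadrat on a regular grid over [0, width] x [0, height]."""
    r = len(points)
    side = sqrt(2 * width * height / r)  # quadrat area = 2A / r
    cols = max(1, ceil(width / side))
    rows = max(1, ceil(height / side))
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        # Points on the far boundary are assigned to the last row or column.
        col = min(int(x // side), cols - 1)
        row = min(int(y // side), rows - 1)
        counts[row][col] += 1
    return side, counts


def frequency_distribution(counts):
    """Map 'points per quadrat' to the number of quadrats with that count."""
    return Counter(c for row in counts for c in row)


# Hypothetical dispersed pattern: 8 points in a 4 x 4 study area.
dispersed = [(0.5, 0.5), (1.5, 1.5), (2.5, 0.5), (3.5, 1.5),
             (0.5, 2.5), (1.5, 3.5), (2.5, 2.5), (3.5, 3.5)]
# Hypothetical clustered pattern: all 8 points in one corner.
clustered = [(0.2 * i + 0.1, 0.2 * i + 0.1) for i in range(8)]

for pattern in (dispersed, clustered):
    side, counts = quadrat_counts(pattern, 4, 4)
    print(side, counts, dict(frequency_distribution(counts)))
```

With A = 16 and r = 8 the quadrat area is 4, so the grid is 2 x 2. The dispersed pattern puts two points in every quadrat, while the clustered pattern puts all eight in one and leaves three empty.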

Examples of systematic and random quadrats

Comparing Observed and Expected Patterns Besides using the Kolmogorov-Smirnov (K-S) statistic to test whether the observed pattern differs from a random pattern, one may perform the Variance-Mean Ratio Test, which takes advantage of a specific statistical property of the Poisson distribution: its variance equals its mean.
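The ratio itself is simple to compute from the quadrat counts. The sketch below assumes the sample variance (dividing by the number of quadrats minus one) and covers only the ratio, not the accompanying significance test; the counts and function name are hypothetical.

```python
from statistics import mean, variance


def variance_mean_ratio(counts):
    """Ratio of the sample variance to the mean of the per-quadrat point counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("no points in any quadrat; the ratio is undefined")
    return variance(counts) / m


# Hypothetical per-quadrat counts for four quadrats holding eight points.
dispersed_counts = [2, 2, 2, 2]
clustered_counts = [8, 0, 0, 0]

print(variance_mean_ratio(dispersed_counts))
print(variance_mean_ratio(clustered_counts))
```

For a random (Poisson) pattern the ratio is expected to be close to 1. Values well above 1 suggest clustering, and values well below 1 suggest a dispersed pattern.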

Ordered Neighbor Analysis • Quadrat Analysis is useful in comparing an observed point pattern to a random or theoretically known distribution. However, it has certain limitations. • The analysis captures information on the points within each quadrat, but no information on the relationship between points in different quadrats is used. As a result, Quadrat Analysis may be insufficient to distinguish between certain point patterns, such as those in the following figures.

Spatial Configurations Visually, the two patterns are different. Using Quadrat Analysis, however, the two patterns yield the same result.

Nearest Neighbor Statistic • The Nearest Neighbor Statistic is derived from the average distance between points and each of their nearest neighbors. • The second-order neighbor statistic uses the distance to the second nearest neighbors. Higher-order neighbors can be defined in similar ways. • Ordered statistics can evaluate the pattern at different spatial scales.

Quadrat Analysis and Nearest Neighbor Analysis • While both Quadrat Analysis and Nearest Neighbor Analysis test point distributions, they utilize different spatial concepts. • Quadrat Analysis tests a point distribution with the points-per-area concept, using quadrats as sampling units. • Nearest Neighbor Analysis uses the concept of area per point. • Both methods are similar in the sense that the observed pattern is compared with some known distribution (a random pattern).

Nearest Neighbor statistics How Nearest Neighbor Analysis works. • In a homogeneous region, the most uniform pattern formed by a set of points occurs when the region is partitioned into a set of identical hexagons with a point at the center of each. The distance between neighboring points will be $1.075\sqrt{A/n}$, where A is the area of the region and n is the number of points.

R statistic or R scale • The R statistic is the ratio of the observed average distance between nearest neighbors of a point distribution to the expected average nearest neighbor distance: $R = r_{obs} / r_{exp}$. It is also called the nearest neighbor statistic. • $r_{obs}$ is the observed average distance between nearest neighbors and $r_{exp}$ is the expected average distance between nearest neighbors as determined by the theoretical pattern.

Calculation of the observed nearest neighbor distance: $d_1 = d_{13}$, $d_2 = d_{23}$, $d_3 = d_{32}$, $d_4 = d_{43}$ (for point 1, the nearest neighbor is point 3).

Cities in Ohio By selecting the seven largest cities in Ohio, we can compute their nearest neighbor distances and the observed average nearest neighbor distance, $r_{obs} = 51.82$ miles.
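The R statistic can be sketched as follows. It assumes that the theoretical pattern is a random one, for which the expected average nearest neighbor distance is $0.5\sqrt{A/n}$, and it ignores edge effects. The coordinates are hypothetical (not the Ohio cities), and the `order` parameter is an illustrative way to obtain second- and higher-order neighbor distances.

```python
from math import dist, sqrt


def mean_neighbor_distance(points, order=1):
    """Average distance from each point to its order-th nearest neighbor."""
    if len(points) <= order:
        raise ValueError("need more points than the neighbor order")
    total = 0.0
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        total += distances[order - 1]
    return total / len(points)


def r_statistic(points, area):
    """Observed mean nearest neighbor distance over the value expected for a random pattern."""
    n = len(points)
    r_obs = mean_neighbor_distance(points, order=1)
    r_exp = 0.5 * sqrt(area / n)  # assumption: random pattern as the reference
    return r_obs / r_exp


# Hypothetical evenly spaced points in a 2 x 2 study area (A = 4).
grid = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]

print(mean_neighbor_distance(grid))           # first-order neighbors
print(mean_neighbor_distance(grid, order=3))  # third-order neighbors
print(r_statistic(grid, area=4))
```

Values of R below 1 indicate clustering, values near 1 a random pattern, and values above 1 a dispersed pattern; the evenly spaced grid here gives R = 2.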

Higher-order neighbor statistics • Nearest Neighbor Analysis has been extended to accommodate the second, third, and other higher-order neighbor definitions. When two points are not immediate nearest neighbors but rather the second nearest neighbors, the way distances are computed between them will need to be adjusted accordingly.
