Univariate & Bivariate Geo-statistical Analysis

Univariate & Bivariate Geo-statistical Analysis • Mirza Muhammad Waqar • Contact: mirza.waqar@ist.edu.pk, +92-21-34650765-79 EXT: 2257 • RG712 Course: Special Topics in Remote Sensing & GIS

What is Statistics About? • Statistics is the science of collecting, organizing, analyzing and interpreting data in order to make decisions • Statistics is the science of data-based decision making in the case of uncertainty

Statistical Analysis • [Figure: the statistical cycle: Problem, Plan, Data, Analysis, Conclusion]

Problem • "I wonder if there are differences between..." • What information will you need to answer the question? • Identify two or more sub-groups of the population to compare. • What variables are likely to show differences?

Plan • If collecting data, you will need to plan a survey or questionnaire. • Using available data sets is recommended. • If using a data set, decide which sub-groups of data are needed and choose from the available variables (choose carefully so you can answer the problem).

Data • Collect data by conducting a survey or questionnaire, OR take a sample (at least 30 values) from a large data set, for example Census data. • Clean the data set before continuing.

Analysis • Analyze the data to find similarities and differences. • You will need measures of central tendency (mean, median, mode) AND measures of spread (range, interquartile range, standard deviation). • Use technology to calculate the statistics: a calculator or EXCEL.

Conclusion • Remember that you are analysing and comparing data from a SAMPLE of a population. • Is there a difference between the subgroups? • Comparisons made from a Box-and-Whisker graph • Comparisons based on measures of central tendency • Comparisons based on measures of spread

Role of Statistics in GIS • To describe and summarize spatial data. • To make generalizations concerning complex spatial patterns. • To use samples of geographic data to infer characteristics for a larger set of geographic data. • To determine if the magnitude or frequency of some phenomenon differs from one location to another. • To learn whether an actual spatial pattern matches some expected pattern.

What is Geostatistics? • Applies the theories of statistical inference to geographic phenomena. • Methods of geostatistics are used in petroleum geology, hydrogeology, hydrology, meteorology, oceanography, geochemistry • A way of describing the spatial continuity as an essential feature of natural phenomena. • Recognized to have emerged in the early 1980’s as a hybrid of mathematics, statistics, and mining engineering.

Some Useful Definitions • Data – information coming from observations, counts, measurements or responses. • The data you will be analyzing will almost always be a sample from a population. • Population – the collection of all outcomes, responses, measurements or counts that are of interest. • Sample – a subset of a population. • We will almost always be dealing with samples and hoping to make inferences about the population.

Some Useful Definitions • Parameter – a numerical description of a characteristic of the population. • Statistic – a numerical description of a characteristic of the sample. • We will often wish to make inferences about parameters based on statistics.

Some Useful Definitions • Descriptive Statistics – relate to organizing, summarizing and displaying data. • Inferential Statistics – relate to using a sample to draw conclusions about a population. • Inferential statistics involves drawing a conclusion from some data.

Inferences vs. Descriptive • Consider: • Average height of females and males: 90 cm and 100 cm respectively. • Descriptive statistics: the values themselves. • Inference: males are (in general) taller than females.

Descriptive Statistics • 3 categories of descriptive statistics in geostatistics • Univariate Descriptive Statistics • Used to describe and summarize a single variable • Bivariate Descriptive Statistics • Used to describe the relationship between two variables • Spatial Descriptive Statistics • Describe data in terms of space and time

Univariate Description • Describes and summarizes a single variable • Graphical methods • Histogram • Cumulative Frequency • Numerical methods fall into three categories • Measurement of location • Measurement of spread • Measurement of shape

Univariate Description • Measurement of location • Measurement of center location • Mean • Median • Mode • Measurement of other parts • Quantile • Quartile • Percentile • Measurement of spread (variability) • Variance • Standard Deviation • Interquartile range • Measurement of shape (symmetry & tail length) • Coefficient of skewness • Coefficient of variation

Frequency Table and Histogram • Histogram – a bar graph that plots the frequency distribution of a dataset. • The horizontal scale represents the classes (bins). • The vertical scale measures the frequencies of the classes. • Consecutive bars must touch.

Ideal Histogram for Image Analysis • [Figure: histogram of Band A against frequency (f), with regions labelled Vegetation, Urban Area, Soil and Water]

Actual Histogram from Image Analysis • [Figure: histogram of Band A against frequency (f), with regions labelled Vegetation, Urban Area, Soil and Water]

Histogram from Image Analysis • A very informative tool for analysis. • The histogram indicates the contrast of a satellite image. • The wider the range of brightness values (BVs), the higher the contrast. • [Figures: low-contrast histogram; high-contrast histogram]

Histogram from Image Analysis • The histogram can also identify the largest land cover in a satellite image. • A rough quantification of land covers can be made using the histogram. • This rough quantification is a starting point for a more accurate quantification. • Using the histogram, the BV range of a particular land cover can be identified.

Frequency Table • To develop a histogram a frequency table is used. • Frequency table: records how often observed values fall within certain intervals or classes.

Constructing a Frequency Distribution • Decide on the number of classes to include in the frequency distribution. • Find the class width as follows: • Determine the range of the data • Divide the range by the number of classes and round up to the next convenient number • Find the class limits: • Start with the lowest value as the lower limit of the first class, add the class width to this to obtain the lower limit for the second class, etc. • Place a mark in the row for the class corresponding to each data point • Count the number of marks in each class.
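The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own implementation: the function name `frequency_table` is hypothetical, "next convenient number" is assumed to mean rounding the class width up to a whole number, and the sample values are borrowed from the percentile example later in the deck.

```python
import math

def frequency_table(data, num_classes):
    """Return a list of (lower_limit, upper_limit, count) rows."""
    lowest, highest = min(data), max(data)
    data_range = highest - lowest
    # Assumption: "round up to the next convenient number" means the
    # next whole number.
    width = math.ceil(data_range / num_classes)
    # Widen by one if the maximum would fall on the open upper edge
    # of the last class and therefore be left uncounted.
    if lowest + width * num_classes <= highest:
        width += 1
    table = []
    for k in range(num_classes):
        lower = lowest + k * width
        upper = lower + width
        # Each class covers lower <= x < upper.
        count = sum(1 for x in data if lower <= x < upper)
        table.append((lower, upper, count))
    return table

values = [14, 12, 19, 23, 5, 13, 28, 17]
for lower, upper, count in frequency_table(values, 4):
    print(f"{lower:>3} to under {upper:<3} {count}")
```

Using half-open classes (lower limit included, upper limit excluded) keeps every value in exactly one class, which is what makes the counts sum to the number of data points.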

Frequency Table

Cumulative Frequency Table and Histogram • Cumulative frequency of a class is the sum of the frequency of that class and all previous classes. • The cumulative frequency for the last class is always n.
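As a small sketch of the definition above, the cumulative frequencies are a running sum of the class frequencies. The frequencies used here are hypothetical values chosen only for illustration.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed from the first class to the last.
frequencies = [1, 3, 2, 2]

# Running sum: each entry adds its own class to all previous classes.
cumulative = list(accumulate(frequencies))
n = sum(frequencies)

print(cumulative)           # [1, 4, 6, 8]
print(cumulative[-1] == n)  # True: the last class always reaches n
```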

Cumulative Frequency Tables

Cumulative Histogram

Measure of Location • Provides information about where the various parts of the data lie. • The center of the data can be found by the • Mean • Median • Mode • The locations of other parts of the data are given by the quantiles.

Mean Median Mode • Mean – the average of all the data points in the distribution • Unique and unbiased • Based on every data point in the dataset • Can be sensitive to outlying observations • Median – the middle value in an ordered array of numbers. • Unaffected by extremely large and extremely small values. • Mode – the most frequently occurring value in a dataset. • Unlike the mean and median, the mode is not always uniquely defined. • Bimodal – two values share the highest number of occurrences in the data • Multimodal – three or more values share the highest number of occurrences
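The three measures can be computed with Python's standard `statistics` module. The brightness values below are hypothetical, chosen so that the data are bimodal and contain one outlying observation.

```python
import statistics

# Hypothetical brightness values; 40 is an outlying observation.
values = [12, 15, 15, 18, 20, 20, 40]

mean = statistics.mean(values)        # uses every data point
median = statistics.median(values)    # middle value of the ordered array
modes = statistics.multimode(values)  # every most-frequent value

print(mean, median, modes)  # 20 18 [15, 20]
```

The outlier pulls the mean (20) above the median (18), and `multimode` returns two values because the data are bimodal.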

Univariate Statistics for Image Analysis • The histogram of a satellite image is generally not unimodal. • The number of modes indicates how many land covers exist in the satellite image. • The histogram alone cannot be used to make decisions about transition zones.

Univariate Statistics for Image Analysis • [Figure: two histograms of Band A against frequency (f), each with regions labelled Vegetation, Urban Area, Soil and Water]

Which Measure is Best? • There is no clear answer to this question. • The mean can be influenced by outliers, while the mode may not be a particularly typical central value. • Statistical inference based on the median and the mode is difficult.

Percentiles • Divide a group of data into 100 parts. • At least n% of the data lie below the nth percentile, and at most (100-n)% of the data lie above the nth percentile. • Example – the 90th percentile indicates that at least 90% of the data lie below it, and at most 10% of the data lie above it. • The median and the 50th percentile have the same value.

Percentiles (i): Computational Procedure • Organize the data into an ascending ordered array. • Calculate the percentile location $i = \frac{P}{100}\,n$, where P is the percentile of interest and n is the number of data points. • Determine the percentile's location and its value. • If i is a whole number, the percentile is the average of the values at the i and (i+1) positions. • If i is not a whole number, the percentile is at the whole-number part of the (i+1) position in the ordered array.

Percentiles: Example • Raw Data: 14, 12, 19, 23, 5, 13, 28, 17 • Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28 • Location of the 30th percentile: $i = \frac{30}{100}(8) = 2.4$ • The location index, i, is not a whole number; i+1 = 2.4+1 = 3.4; the whole-number portion is 3; the 30th percentile is at the 3rd location of the array; the 30th percentile is 13.
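The procedure and example above can be checked with a short Python sketch. The function name `percentile` is hypothetical, and the location index is assumed to be i = (P/100)·n, which is the value that gives 2.4 for the 30th percentile of 8 data points. Note that this location-index rule is only one of several percentile conventions, so spreadsheet or library functions may return slightly different values.

```python
from fractions import Fraction
from math import floor

def percentile(data, p):
    """Return the p-th percentile (0 < p < 100) by the location-index rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    n = len(ordered)
    # Location index i = (p / 100) * n, kept exact so the
    # whole-number test is not affected by floating-point rounding.
    i = Fraction(p) * n / 100
    if i.denominator == 1:
        # Whole number: average the values at positions i and i + 1 (1-based).
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    # Not a whole number: take the value at the whole-number part
    # of i + 1 (1-based), i.e. 0-based index floor(i).
    return ordered[floor(i)]

raw = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(raw, 30))  # 13
print(percentile(raw, 50))  # 15.5, the same value as the median
```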

Quartiles

Formulae in EXCEL • Calculating Mean: AVERAGE(data) • Calculating Median: MEDIAN(data) • Calculating Mode: MODE(data) • Calculating Minimum: MIN(data) • Calculating Maximum: MAX(data) • Calculating Quartile: QUARTILE(data, quart) • Calculating Percentile: PERCENTILE(array, k)

Measure of Spread/Variation • Measures of variability describe the spread or dispersion of a dataset. • Common measures of variability • Range • Interquartile Range • Mean Absolute Deviation • Variance • Standard Deviation • Coefficient of Variation

Range • The difference between the largest and the smallest values in a set of data • Simple to compute • Ignores all data points except the two extremes • Range = Maximum – Minimum • The range tells us about the spread of the data. • The range can be misleading when outliers exist in the data.

Interquartile Range • Range of values between the first and third quartiles • Less influenced by extremes • Interquartile Range = Q3 – Q1
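A minimal sketch of both measures, using Python's standard library. The data are the values from the percentile example. Quartile conventions differ between textbooks and software; the `method="inclusive"` option of `statistics.quantiles` is one choice among several, so the quartile values here are illustrative rather than definitive.

```python
import statistics

values = [14, 12, 19, 23, 5, 13, 28, 17]

# Range uses only the two extremes.
data_range = max(values) - min(values)

# Quartiles under one common convention (linear interpolation,
# treating the data as including its own minimum and maximum).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)   # 23
print(q1, q3, iqr)  # 12.75 20.0 7.25
```

Replacing 28 with a much larger outlier would inflate the range directly, while the interquartile range would change far less.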

Deviation, Variance and Standard Deviation • The deviation of a data entry x in a population data set is the difference between x and the population mean µ, i.e. Deviation of x = x - µ • The sum of the deviations over all entries is zero.

Mean Absolute Deviation • Average of the absolute deviations from the mean: $\text{M.A.D.} = \frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$ • Worked example result: M.A.D. = 4.8

Variance • The population variance is the sum of the squared deviations over all entries, divided by the number of entries N: Population Variance $= \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

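Population Variance • Average of the squared deviations from the arithmetic mean: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ • Worked example result: $\sigma^2 = 26.0$ • Sample Variance: $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$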
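The deviation, mean absolute deviation and variance definitions can be verified numerically. The data set behind the slides' results is not given, so the values 5, 9, 16, 17, 18 below are an assumption: they were chosen because they reproduce the quoted results of 4.8 and 26.0.

```python
import statistics

# Assumed illustrative data (not given in the slides).
values = [5, 9, 16, 17, 18]
N = len(values)
mu = sum(values) / N  # population mean

deviations = [x - mu for x in values]
mad = sum(abs(d) for d in deviations) / N               # mean absolute deviation
pop_variance = sum(d ** 2 for d in deviations) / N      # divide by N
sample_variance = sum(d ** 2 for d in deviations) / (N - 1)  # divide by n - 1

print(sum(deviations))   # 0.0: deviations always sum to zero
print(mad)               # 4.8
print(pop_variance)      # 26.0
print(sample_variance)   # 32.5

# Cross-check against the standard library.
assert pop_variance == statistics.pvariance(values)
assert sample_variance == statistics.variance(values)
```

The sample variance divides by n - 1 rather than N, which is why it is larger than the population variance for the same values.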

Variance for Image Analysis • Variance is used for comparative analysis. • Comparing the variance of all bands shows which band has the greatest dispersion.

Variance for Image Analysis • The lower the variance, the higher the homogeneity of the data. • Outliers can distort the variance.
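A band-by-band comparison might look like the following sketch. The band names and brightness values are hypothetical, and each band is reduced to a short list of pixel values to keep the example small; a real image would supply one value per pixel.

```python
import statistics

# Hypothetical brightness values sampled from three bands.
bands = {
    "Band 1": [50, 52, 51, 49, 50],
    "Band 2": [20, 80, 45, 110, 60],
    "Band 3": [30, 35, 40, 33, 37],
}

# Population variance of each band.
variances = {name: statistics.pvariance(bvs) for name, bvs in bands.items()}

most_dispersed = max(variances, key=variances.get)
most_homogeneous = min(variances, key=variances.get)

for name, var in variances.items():
    print(f"{name}: variance = {var:.2f}")
print("Most dispersion:", most_dispersed)
print("Most homogeneous:", most_homogeneous)
```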

Standard Deviation • The population standard deviation is the square root of the population variance, i.e. $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

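Standard Deviation • Square root of the variance: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ • Standard Deviation of a Sample: $S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$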
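Both standard deviations are available in Python's standard `statistics` module. The data below are an assumed illustrative set (not given in the slides) whose population variance is 26.0, matching the variance result quoted earlier in the deck.

```python
import math
import statistics

# Assumed illustrative data (not given in the slides).
values = [5, 9, 16, 17, 18]

pop_sd = statistics.pstdev(values)    # divides by N
sample_sd = statistics.stdev(values)  # divides by n - 1

# Each is the square root of the corresponding variance.
assert math.isclose(pop_sd, math.sqrt(statistics.pvariance(values)))
assert math.isclose(sample_sd, math.sqrt(statistics.variance(values)))

print(round(pop_sd, 3), round(sample_sd, 3))
```

Unlike the variance, the standard deviation is expressed in the same units as the data, which makes it easier to interpret alongside the mean.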

Empirical Rules • Data are normally distributed (or approximately normally distributed)
