Understanding Data Distributions and Visualization Techniques

PPAL 6200 Research Methods and Info Systems Class 3: Jan 17-18, 2012

Class Outline • Some Key Terms and Thinking about “measurement” • Describing Data “Distributions” • Break • Describing Data with Statistics • Break • A Very Special Distribution: The Normal Distribution

Some Key Concepts (unless noted, the source is Moore) • Data • Numbers with a context (xxiv). The context, including how data is collected, can alter results. • Variable • An empirical property that can take on two or more values (Frankfort-Nachmias & Nachmias 1996:50). Don’t get suckered in by small and rapid changes; look at the big picture (xxvii) • Case • An individual, event or other thing for which we have data • Measurement • The assignment of numbers to objects, events or variables according to rules (ibid: 156-157)

Levels of Measurement • Nominal, Ordinal, Interval, Ratio • Validity • Are you measuring what you thought you are measuring? • Reliability • Are you measuring it accurately? • Spuriousness • Is there something else involved? Beware the lurking variable (xxvii) • Statistics • The science of learning from data (xxiv)

The Book Title Says It All… • This is a class in the “basic practice of statistics”, with a little bit of practical advice thrown in regarding management of information systems • Inside the front cover of the book is a wonderful set of flow-through figures that show how one can go about statistical thinking in a disciplined manner, along with three four-step plans to guide your work

Describing Data Distributions with Graphs • As the introductory sections of the book noted, you really cannot go wrong to begin your work by visualizing the individual variables that comprise your data (and on occasion plotting them against another variable such as time). • The distribution tells you what values a variable takes and how often it does so

Ways we can Visualize and Explore Data • Exploratory analysis is not meant to allow us to reach any deep conclusions; it is meant to help us better understand the data set and the relationships within it • We want to look both for an overall pattern (consistencies) and for deviations from it (often called outliers) • Tables • Tables are effective tools for visualizing data, provided that we have neither too many variables nor too many cases • At a certain point we need to graphically depict our data to make it understandable as a snapshot

Which Graph? • The graphic depictions we employ are dependent on: • The type of data we have • Level of Measurement • Whether Stationary or Chronological

Some Common Graphs • Pie Chart (good for showing percentages when few categories of a nominal or ordinal variable)

Percentage of Students Picking a Given Major

Bar Charts are equally useful for nominal and ordinal variables but have the benefit of allowing more flexibility

Foreign Born Population of US States by Percentage

Histograms • Histograms can be confusing as they look like Bar Graphs sometimes. In fact you can make them by carefully specifying a Bar Graph. However they are really quite different. • They are meant for use with Interval and Ratio data where there is a lot of variability among cases because there are so many possible values for the data

Therefore we have to “group the data” to a certain extent to allow us to represent it • What a histogram shows is the percentage of cases that have a score within the groups represented by the bars
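The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the slides: the function name `bin_percentages`, the class width, and the data are all invented for the example.

```python
from collections import Counter

def bin_percentages(values, start, width):
    """Group values into equal-width classes and return the share of cases in each.

    Returns (lower_edge, upper_edge, percent) tuples. Each class includes its
    lower edge and excludes its upper edge. Assumes every value is >= start.
    """
    counts = Counter(int((v - start) // width) for v in values)
    n = len(values)
    result = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        result.append((lower, lower + width, 100 * counts.get(k, 0) / n))
    return result

# Hypothetical data: ten made-up interval-level scores.
scores = [1.2, 2.5, 3.1, 3.8, 4.4, 5.0, 6.7, 8.9, 12.3, 14.1]

# Print a crude text histogram: one '#' per 5 percentage points.
for lower, upper, pct in bin_percentages(scores, start=0, width=5):
    print(f"{lower:>3} to {upper:<3} {pct:5.1f}%  " + "#" * round(pct / 5))
```

With a class width of 5 the made-up scores fall 50%, 30% and 20% into the three classes; calling `bin_percentages(scores, start=0, width=10)` on the same scores gives 80% and 20%, so the choice of grouping changes the picture.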

You will notice that this graph looks a bit different from the one in the book. • This is because the scaling that my software used is a bit different from that used by the person who did the examples in the book.

This brings up a good point • Be careful how you manipulate data, as you will see in the next section of the talk. These two graphs portray the same information, but one will give us a more interesting result.

Describing a Distribution • Once we get to developing histograms we can start to evaluate the shape of our data in a number of interesting ways (Shape, Centre, Spread) • What is the shape of the plot? Is it single peaked or multi-peaked? • Where is the peak? Is it at the centre or off-centre (skewed)? When the tail of a distribution heads off to one side unevenly we say it is skewed to that side (this is confusing) • What about outliers? Any unusually high or low scores?

As you can see below: Regrouping our Data makes one figure more symmetrical

A stemplot is not so elegant • Granted it is not so elegant but it does allow us to figure out what is happening inside of those bars….
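For readers who want to build one themselves, here is a minimal sketch of a stemplot in Python. It assumes non-negative whole numbers, with the tens as the stem and the units as the leaf; the function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines for non-negative integers: tens as stem, units as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem:>2} | " + "".join(str(d) for d in leaves.get(stem, [])))
    return lines

# Hypothetical data: twelve made-up whole-number scores.
data = [12, 15, 21, 23, 23, 27, 30, 34, 38, 41, 45, 59]
print("\n".join(stemplot(data)))
```

Each row is one bar of the histogram turned on its side, but the individual values remain readable.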

Thinking about these Graphs • When we look at these graphs we have to keep in mind the questions we have started asking • Shape • Centre (other than time-series) • Outliers

Remember… • I have posted some tips on how to use Excel to make graphs on the course website, and you can find further advice in the technical manuals posted there.

Using Descriptive Statistics to Explore your Data • We are continuing our exploration of data. • In the last chapter we graphically depicted data • Now we are going to look at how we can describe data using “summary” statistics • We will look at statistics that provide measures of central tendency • We will also look at statistics that provide measures of dispersion

Sometimes Statistics are So Simple… • Sometimes statistics are so simple we have to do something to make them look fancier than they are. Enter “The Mean”. • The mean simply means taking the average of something. • You all know how to do this. You add up the group, then you divide it by the number of items in the group.

But just to make sure you know I know what I am doing I have a formula
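For reference, the usual formula for the mean of N observations, written in standard notation (the symbols x_1, …, x_N for the observations and x̄ for the mean are the conventional ones and are assumed here):

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i
```

For example, the mean of 2, 4 and 9 is (2 + 4 + 9) / 3 = 5.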

We may talk about these formulas but… • Don’t worry, we may talk about the formulas that mathematically describe statistics so you can get a better understanding of how they work. • I might also hand calculate a few to demonstrate this • But no one today hand calculates real data • Neither should you; that is why we have software

The Median • The Median is the mid point of a distribution. Half the observations have values less than the median, half have values more • The formula looks like this • Note the formula gives the location of the median (the observation which has a value equal to the median) not its value
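The location rule, in its usual textbook form (assumed here to match the slide), counts up from the bottom of the ordered list of N observations:

```latex
\text{location of } M = \frac{N + 1}{2}
```

With N = 20 the location is 10.5, so the median is the average of the 10th and 11th ordered observations. With N = 21 the location is 11, so the median is the 11th ordered observation itself.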

Here is where Stem & Leaf Graphs can come in handy (N=20)

Mean and Median: which one? • In general the Mean is more susceptible to distortion by • abnormally large cases, in the language of the book a distribution skewed to the right • or abnormally small cases, in the language of the book a distribution skewed to the left. • For example, one Bill Gates among a thousand people will seriously distort the “Mean” income of this sample. However, it will have little or no impact on the “Median” Income
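The effect of one extreme case can be checked with a short Python sketch. All income figures below are invented for illustration; only the standard-library `statistics` module is used.

```python
from statistics import mean, median

# Hypothetical sample: 999 people earning 50,000 each (figures invented).
incomes = [50_000] * 999

# The same sample plus one extremely high earner, for 1,000 people in total.
with_outlier = incomes + [1_000_000_000]

print("Without outlier:", mean(incomes), median(incomes))
print("With outlier:   ", mean(with_outlier), median(with_outlier))
```

In this made-up sample the mean jumps from 50,000 to 1,049,950 once the single extreme case is added, while the median stays at 50,000.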

Level of Measure Matters Also • You cannot take the mean of a categorical variable (one measured at the nominal or ordinal level). • You can however calculate the median of a variable measured at the ordinal level. • This is a good point to stop and remind you about the stupidity of machines. • Unless the variables are tagged in the data set as to level of measure, your computer really won’t care and will happily chug along calculating even meaningless statistics such as the mean of your categorical variables.

One more • The Mode is the measure of central tendency for nominal data. It is simply the category with the largest number of cases.
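Finding the mode amounts to counting cases per category. A minimal sketch, using made-up survey answers for a nominal variable:

```python
from collections import Counter

# Hypothetical nominal data: the major chosen by six made-up students.
majors = ["Politics", "Economics", "Politics", "History", "Politics", "Economics"]

counts = Counter(majors)
# most_common(1) returns a list holding the single largest (category, count) pair.
mode_category, mode_count = counts.most_common(1)[0]
print(mode_category, mode_count)
```

Note that if two categories tie for the largest count, this sketch reports only one of them, so it is worth looking at the full table of counts as well.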

If all we knew was how well the data clumped together… • Even though the Median is less susceptible to distortion by an abnormally large or small case, it can still provide a very weak description of your data if the observations are widely dispersed. • This is why we are often interested in the Quartiles

Just like the Median only smaller • Quartiles are just like the Median only on a smaller scale. Instead of defining the mid point of the distribution they define the break-point between: • The first quarter and the second quarter • The break between the second quarter and the third quarter (which is the Median by the way) • The break between the third quarter and the fourth quarter

The Five-Number Summary • Moore is very big on the use of the five-number summary to summarily describe data. • Minimum value • Q1 • M • Q3 • Maximum value
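A minimal sketch of the five-number summary in Python. It assumes the common textbook rule that Q1 and Q3 are the medians of the observations below and above the location of the overall median; software packages use slightly different quartile rules, so their output may differ a little. The function name and data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two observations.

    Assumes the quartiles are the medians of the lower and upper halves,
    leaving out the middle observation when the count is odd.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

# Hypothetical data: nine made-up scores.
scores = [3, 7, 8, 5, 12, 14, 21, 13, 18]
print(five_number_summary(scores))
```

For these nine made-up scores the summary is 3, 6, 12, 16, 21.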

Fortunately, all the computer programs we are employing can easily generate both the numerical summary and the accompanying box plots. SPSS can generate all this and more using its “Frequencies” and “Explore” commands. Excel does the job just as nicely. You can graphically depict this with a box plot.

Here is an example of an SPSS Box plot for before tax income for men and women in Ontario from the Survey of Household Spending

Notice on the previous slide how the distance from the first quartile to the median and then to the third quartile is not necessarily symmetrical, and that the whiskers on the box plot are also not symmetrical. This is an indication of skew. • Unlike the example in the book, my whiskers indicate not the max and min values but percentiles.

Here is the five number summary for Men and Women

Spotting outliers • Obviously our box plots provide an excellent way to spot outliers. • A statistic that can also help is the “interquartile range”. This is just the range between quartile one and quartile three. • When an observation lies more than 1.5 times the interquartile range above quartile three or below quartile one, it is often considered to be an outlier.
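The 1.5 × IQR rule can be sketched as follows. The quartiles are assumed to be the medians of the lower and upper halves of the ordered data (other software may use a slightly different rule), and the function name and data are hypothetical.

```python
from statistics import median

def iqr_outliers(values):
    """Return the observations lying more than 1.5 x IQR beyond the quartiles."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])
    # Leave out the middle observation when the count is odd.
    q3 = median(xs[half:] if len(xs) % 2 == 0 else xs[half + 1:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]

# Hypothetical data: nine ordinary scores plus one suspiciously large one.
scores = [3, 5, 7, 8, 12, 13, 14, 18, 21, 60]
print(iqr_outliers(scores))
```

Here Q1 is 7 and Q3 is 18, so the interquartile range is 11 and anything above 34.5 or below -9.5 is flagged; only the score of 60 qualifies. The rule flags candidates for a closer look; it does not by itself say the value is wrong.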

While I used ratio level data… • While I used ratio level data for my example of the five-number summary, it should be noted that there is nothing here (quartiles, Median, maximum, minimum value) that would not work with data measured at the interval or ordinal level

Range • Along with the quartiles (which work when data are measured at least at the ordinal level), we must also remember to look at the “Range”, the distance between the minimum and maximum values. Like the quartiles, it needs values that can be put in order, so it does not apply at the nominal level.

Standard Deviation • The best way to describe Standard Deviation (notation S) is that it is the square root of Variance (notation S²) • So why do you need variance? A bit of math if you look at the formula in your book.

The Formula for S² • Variance is the sum of the squared distances of each observation from the mean, divided by N-1 (N-1 being the degrees of freedom).
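Written out in standard notation (assumed to match the book's formula, with x_i for the observations and x̄ for their mean), the verbal definition above becomes:

```latex
S^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2
```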

The Formula for S² involves a squaring • We have to square these distances because otherwise the positive and negative distances would cancel out (deviations from the mean always sum to zero) and there would appear to be no variance. • The problem with variance is that all that squaring produces numbers that are very large and not too intuitive to read on their own (though you will see later that variance is an important tool and even a building block for other things).

Taking the square root produces a much more usable number (S). • Quite simply, when you know the mean (x̄) and S • You can go up and down a list of numbers and figure out which list is more concentrated about its mean and which is more diffuse

If you want a quick example
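A small illustration with made-up numbers (not the example shown on the original slide): two lists share the same mean, but S shows which is concentrated about it and which is diffuse. The function name `sample_variance` is hypothetical; it divides by N-1, as in the definition of variance given earlier.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared distances from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Two hypothetical lists, both with a mean of 50.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("spread", spread)):
    s2 = sample_variance(data)
    print(f"{name:>6}: variance = {s2:.1f}, S = {sqrt(s2):.2f}")
```

The variances are 2.5 and 1000, so S is roughly 1.58 for the first list and roughly 31.62 for the second, even though both lists have the same mean.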

But once again, keep in mind… • If the mean is susceptible to distortion from extreme values, S is doubly so due to all those squarings • Source for Graphics: Moore 2009
