Exploratory Data Analysis

Presentation Transcript


  1. Exploratory Data Analysis

  2. Height and Weight

  3. Data exploration and Statistical analysis • Data checking, identifying problems and characteristics

  4. Data exploration, categorical / numerical outcomes Data

  5. Analyzing a set of data • Look at the data (initial checks on the data) • Downloading data, formatting, data collection, discrepant data, missing data • Visualize the data (exploratory data analysis) • Descriptive statistics, informative tables, well-constructed figures • Analyse the data (definitive analysis) • Formal statistical analysis • Quantify any interesting results • Report the findings

  6. Types of Variables • Often, the test to use depends on the type of variable at hand • Two main classes of variables: • Categorical • Numerical • Categorical variables are further divided into two sub-classes: • Nominal categorical (example: gender, ethnic groups) • Ordinal categorical (example: size of a car, quality of teaching)

  7. Numerical variables • Distinguish between discrete and continuous numerical variables • Discrete • Integer values (number of male subjects, number of episodes of flu outbreaks) • Continuous • Can take any value within a range (height, weight) • Some continuous variables are treated as discrete (example: age)

  8. Exploratory Data Analysis

  9. EDA • Tabular EDA • Univariate tables, cross-tabulation of categorical variables • Numerical EDA • Location, spread, skewness, covariance and correlation • Graphical EDA • Frequency plots, histograms, boxplots, scatterplots • The precise form of EDA depends on the data at hand.

  10. Tabular EDA • Useful for summarising categorical data. For example, the following table shows the classification of 2,555 students from three schools in a study on the GCSE O-level results in Mathematics: • No. of students by school: Dunman High 6, HCI 408, RI / RGS 1,496, Total 1,910 • Small counts are problematic in categorical data analysis

  11. Tabular EDA • For two categorical variables: for example, the distribution of A, B and other grades between two schools • [Table: grade (A, B, Others) by school (Dunman High, HCI)] • Question: It appears that Dunman High has proportionally more students scoring A/B grades than HCI. Does this mean anything?
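
A cross-tabulation like the one on this slide can be sketched in a few lines of Python. The school totals (6 and 408) follow the previous slide, but the split across grades is invented purely for illustration, as the original table values are not available. The sketch counts each (school, grade) pair and converts the counts to within-school proportions, which is what the question about "proportionally more" A/B grades relies on.

```python
from collections import Counter

# Hypothetical (school, grade) records. The grade split is invented for
# illustration; only the school totals (6 and 408) come from the slides.
records = (
    [("Dunman High", "A")] * 3
    + [("Dunman High", "B")] * 2
    + [("Dunman High", "Others")] * 1
    + [("HCI", "A")] * 120
    + [("HCI", "B")] * 150
    + [("HCI", "Others")] * 138
)

schools = ["Dunman High", "HCI"]
grades = ["A", "B", "Others"]

# Cross-tabulate: count how often each (school, grade) pair occurs.
counts = Counter(records)

# Row totals and within-school proportions.
row_totals = {s: sum(counts[(s, g)] for g in grades) for s in schools}
proportions = {
    s: {g: counts[(s, g)] / row_totals[s] for g in grades} for s in schools
}

for s in schools:
    cells = "  ".join(
        f"{g}: {counts[(s, g)]:>3} ({proportions[s][g]:.0%})" for g in grades
    )
    print(f"{s:<12} n={row_totals[s]:>3}  {cells}")
```

With only 6 students in one row, a single student changes that school's proportions by about 17 percentage points, which is one way to see why small counts are problematic.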

  12. Numerical EDA • Calculating informative numbers which summarise the dataset • Which numbers are useful for describing the age of 1,059 individuals with diabetes? • Location parameters (mean, median, mode) • Spread (range, standard deviation, interquartile range) • Skewness • [Figure: distribution of AGE (axis 20 to 80), with the mean age of 54.6 years marked]

  13. Numerical EDA • Skewness • [Figure: distribution with the median and mean marked]
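
The relationship between mean, median and skewness can be illustrated with a small sketch. The sample below is invented, and the moment coefficient of skewness is one common definition among several; the slides do not say which one they use. In a right-skewed sample the long upper tail pulls the mean above the median and the coefficient is positive.

```python
import statistics

# Hypothetical right-skewed sample (values invented for illustration).
values = [21, 22, 22, 23, 24, 25, 26, 28, 35, 60]

mean = statistics.fmean(values)
median = statistics.median(values)
sd = statistics.pstdev(values)  # population SD, as used in the moment formula

# Moment coefficient of skewness: the average cubed standardised deviation.
# Positive values indicate a longer right tail, negative a longer left tail.
skewness = sum(((x - mean) / sd) ** 3 for x in values) / len(values)

print(f"mean={mean:.1f}  median={median:.1f}  skewness={skewness:.2f}")
```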

  14. Normal distribution • About 68% of the probability lies within 1 standard deviation of the mean • About 95% of the probability lies within 2 standard deviations of the mean • [Figure: normal curve of exam marks for a Mathematics exam, axis 40 to 80]
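
The 68% and 95% figures can be checked directly from the normal cumulative distribution function. The sketch below uses Python's standard-library `statistics.NormalDist`; the mean of 60 and standard deviation of 10 are assumed values chosen to resemble the exam-marks axis, not numbers taken from the slides.

```python
from statistics import NormalDist

# Hypothetical exam-mark distribution: mean 60, SD 10 (assumed values).
marks = NormalDist(mu=60, sigma=10)


def prob_within(dist, k):
    """Probability of falling within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


within_1sd = prob_within(marks, 1)
within_2sd = prob_within(marks, 2)

print(f"within 1 SD (50 to 70): {within_1sd:.3f}")  # about 0.683
print(f"within 2 SD (40 to 80): {within_2sd:.3f}")  # about 0.954
```

The result does not depend on the chosen mean and SD: any normal distribution gives the same two probabilities.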

  15. Numerical EDA • Sample quartiles • Q1: 25th percentile (the value below which 25% of the ranked data fall) • Q2: 50th percentile (also known as the median of the data) • Q3: 75th percentile (the value below which 75% of the ranked data fall) • Example: consider the heights of 1,000 people, ranked from shortest to tallest; Q1, Q2 and Q3 split the ranked heights into four equal parts

  16. Location and spread • When mean is used as the location parameter, the standard deviation is the appropriate measure for spread • When median is used as the location parameter, the corresponding measure for spread is the interquartile range • Interquartile range (IQR) IQR = Q3 – Q1 • Minimum, Maximum of data (seldom used to quantify spread, but more for data QC)
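
The two pairings of location and spread (mean with standard deviation, median with interquartile range) can be computed as follows. The heights are invented, and the quartile rule used here (`method="inclusive"` in `statistics.quantiles`) is only one of several conventions; statistical packages may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical heights in cm (invented for illustration).
heights = [158, 162, 165, 168, 170, 171, 173, 175, 178, 181, 185]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
# "inclusive" is one of the two interpolation conventions the module offers.
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1  # IQR = Q3 - Q1

mean = statistics.fmean(heights)
sd = statistics.stdev(heights)  # sample standard deviation (divides by n - 1)

print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print(f"median = {q2}, IQR = {iqr} (Q1 = {q1}, Q3 = {q3})")
print(f"min = {min(heights)}, max = {max(heights)}  # mainly for data QC")
```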

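  17. Numerical EDA • Numbers can be informative for identifying potential problems with the data • Example: Suppose the heights of 1,496 individuals randomly sampled from the population produce the following summary: • IQR = Q3 – Q1 = 188 – 172 = 16 • Range = Max – Min = 201 – 0 = 201 • A minimum height of 0 is not a plausible measurement, so a summary like this flags a data problem to investigate before any analysis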
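
A simple range check is often enough to surface the kind of problem hinted at by a minimum of 0. The sketch below is only illustrative: the records are invented, the unit is assumed to be centimetres, and the plausibility limits are arbitrary choices that would need to suit the population being studied.

```python
# Hypothetical height records, assumed to be in cm; the 0 mimics the
# suspicious minimum in the summary above.
heights = [172, 180, 0, 188, 176, 201, 169, 183]

# Assumed plausibility limits in cm (illustrative, not from the slides).
LOWER_CM, UPPER_CM = 50, 250

# Keep the position of each suspicious record so it can be traced back.
suspicious = [
    (i, h) for i, h in enumerate(heights) if not (LOWER_CM <= h <= UPPER_CM)
]
plausible = [h for h in heights if LOWER_CM <= h <= UPPER_CM]

print("all records:    min, max, range =",
      min(heights), max(heights), max(heights) - min(heights))
print("suspicious (index, value):", suspicious)
print("plausible only: min, max, range =",
      min(plausible), max(plausible), max(plausible) - min(plausible))
```

Flagged records are better traced back to the source than silently dropped, since a 0 may be a missing-value code or a data-entry error.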

  18. Correlation • Two numerical variables: height and weight • Questions • Is there any relationship between these variables? • If there is, how do we quantify this relationship? • Covariance and correlation measure the degree of association between two numerical variables.

  19. Covariance and Correlation • Covariance is scale-dependent, whereas correlation is unit-free. • Correlation is more intuitive to interpret than covariance. • Example: Covariance for height and weight is 2.4 when assessed using metres and kilograms, but 240,000 when assessed using centimetres and grams. Correlation stays at 0.83 in both scenarios. • Correlation is always bounded between -1 and 1 inclusive. • Useful for investigating relationships between variables (e.g. weight and height)
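
The scale-dependence of covariance is easy to verify numerically. In the sketch below the height and weight values are invented, so the covariance and correlation will not match the 2.4 and 0.83 quoted on the slide; what carries over is the pattern. Converting metres to centimetres (a factor of 100) and kilograms to grams (a factor of 1,000) multiplies the covariance by 100,000 and leaves the correlation unchanged.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical measurements (invented for illustration).
heights_m = [1.55, 1.62, 1.68, 1.74, 1.80, 1.88]
weights_kg = [52.0, 58.0, 63.0, 72.0, 70.0, 85.0]

# The same data expressed in centimetres and grams.
heights_cm = [h * 100 for h in heights_m]
weights_g = [w * 1000 for w in weights_kg]

cov_m_kg = covariance(heights_m, weights_kg)
cov_cm_g = covariance(heights_cm, weights_g)
r_m_kg = correlation(heights_m, weights_kg)
r_cm_g = correlation(heights_cm, weights_g)

print(f"covariance  (m, kg): {cov_m_kg:.4f}")
print(f"covariance  (cm, g): {cov_cm_g:.1f}")
print(f"correlation (m, kg): {r_m_kg:.4f}")
print(f"correlation (cm, g): {r_cm_g:.4f}")
```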

  20. Example

  21. Graphical EDA • Visual summaries of the data • Flagging outliers, spotting obvious relationships, checking the distribution

  22. Boxplots • Univariate boxplot: for 1 numerical variable • Ends of box: Q1 and Q3 • Length of box: IQR • White line: Sample median • Whiskers: extend up to 1.5 times the IQR beyond the box • Lines outside whiskers: Outliers • Circles: Extreme outliers
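
The numbers behind a boxplot can be computed without drawing anything. The sketch below assumes the common Tukey-style convention, in which the whiskers reach the most extreme data points lying within 1.5 times the IQR of the box and anything beyond is flagged as an outlier; plotting software may differ in its quartile rule and in how it separates outliers from extreme outliers. The data values are invented.

```python
import statistics

# Hypothetical sample with one unusually large value (invented data).
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 35]

# Box: Q1 to Q3, with the median inside it.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond each end of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations that are inside the fences.
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Observations outside the fences are drawn individually as outliers.
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"box: Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```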

  23. Boxplots • Multivariate boxplots: for 1 numerical variable across different levels of a categorical variable • Graphical comparison

  24. Scatterplots • Graphical representation for 2 numerical variables

  25. Scatterplots

  26. Scatterplots

  27. Exploratory Data Analysis in RExcel and SPSS

  28. Comparing height of children • Height data for 30 children, from 3 groups • The interest is in comparing the height of children between groups • Useful (and not useful!) data exploration

  29. Comparing height of children • Height data for 30 children, from 3 groups • The interest is in comparing the height of children between groups • Useful (and not useful!) data exploration

  30. Coding numerical variables as factors • Retain numbers as categories, or define new names for the categories • Note the deliberate mistake here! • Always know your variables well!

  31. Stratified analysis by group Click on this to define the variable that contains the grouping information for stratification

  32. Boxplots Choose this to produce separate boxplots for the three groups (stratified analysis)

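  33. [Figure: annotated boxplot, top to bottom] • Maximum • 3rd quartile • 25% of the data between the median and the 3rd quartile • Median • Interquartile range (length of the box) • 25% of the data between the 1st quartile and the median • 1st quartile • Minimum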

  34. An excellent way to observe graphical/preliminary evidence of any differences between the groups! No comments can be made if the boxes overlap. Only when two boxes (or more) do not overlap can we say there is graphical evidence of a difference between the two (or more) groups
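
The overlap rule described on this slide can be applied numerically by comparing the boxes (Q1 to Q3) of each pair of groups. The sketch below uses invented heights for 30 children in 3 groups, with hypothetical group labels A, B and C, and the "inclusive" quartile rule as an assumed convention. It is a graphical heuristic only, not a substitute for a formal test.

```python
import statistics
from itertools import combinations

# Hypothetical heights in cm for 30 children in 3 groups (invented data).
groups = {
    "A": [101, 103, 104, 105, 106, 107, 108, 109, 110, 112],
    "B": [104, 106, 107, 108, 109, 110, 111, 112, 113, 115],
    "C": [118, 120, 121, 122, 123, 124, 125, 126, 128, 130],
}


def box(values):
    """Return (Q1, Q3), the two ends of the box in a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def boxes_overlap(a, b):
    """True if the intervals [Q1, Q3] of two groups share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


boxes = {name: box(values) for name, values in groups.items()}

# Compare every pair of groups.
overlaps = {
    (g1, g2): boxes_overlap(boxes[g1], boxes[g2])
    for g1, g2 in combinations(sorted(boxes), 2)
}

for (g1, g2), overlap in overlaps.items():
    verdict = "boxes overlap, no comment" if overlap else "graphical evidence of a difference"
    print(f"{g1} {boxes[g1]} vs {g2} {boxes[g2]}: {verdict}")
```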

  35. What about SPSS?

  36. Never choose this when plotting a histogram to get a gauge of the distribution of the dataset
