
Advanced Data Analytics


overton

Presentation Transcript


  1. Advanced Data Analytics Lecture 1 - Introduction

  2. Indicative Content • Conducting Statistical Analyses • Factor Analysis and Cluster Analysis • Effect Size, Sample Size, Power • Predictive Analysis • Machine Learning • Simulation • Optimization

  3. Statistics • The science of data, involving collecting, classifying, summarizing, organizing, analyzing, presenting, and interpreting numerical data • Descriptive Statistics: utilizing numerical and graphical methods to look for patterns in a dataset, to summarize information in a dataset, and to present that information in a convenient form • Inferential Statistics: utilizing sample data to make estimates, decisions, predictions, and other generalizations about a larger set of data

  4. Fundamental Elements • An experimental (or observational) unit is an object (e.g. person, thing, transaction, or event) about which we collect data. • A population is a set of units that we are interested in studying. • A variable is a characteristic or property of an individual experimental unit in the population. • A sample is a subset of the units of the population.
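
The terms above can be illustrated with a minimal Python sketch. The population of 1,000 transaction amounts, the seed, and all variable names are made up for illustration and are not from the lecture; only the standard library is used.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: one variable (amount) measured on 1,000 units.
population = [random.gauss(50, 10) for _ in range(1000)]

# A sample is a subset of the units of the population.
sample = random.sample(population, 30)

# Descriptive statistics summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential statistics use the sample to say something about the population.
population_mean = statistics.mean(population)
print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
print(f"population mean (normally unknown) = {population_mean:.2f}")
```

In practice the population mean is not available, which is why the sample mean is used as an estimate of it.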

  5. What you should know by now • Descriptive statistics. • Probability. • z-test. • t-test. • Chi-squared goodness of fit. • Simple linear regression. • Weighted moving averages. • One way ANOVA.

  6. What Next • What are the preconditions for the tests we have looked at? • Using statistical packages to check some of these preconditions. • Looking at more than one dimension / factor with the ANOVA and regression. • Multiple linear regression • Two-way ANOVA • Dimension reduction and classification. • Factor analysis • Power and effect • Some non-parametric alternatives to the tests we have looked at. • Looking at case-studies / real examples. • Focus on practical concepts. • Focus on using a statistical package rather than deep mathematical concepts.

  7. Multivariate • Multivariate analysis consists of a collection of methods that can be used when several measurements are made on each individual or object in one or more samples. • We will refer to the measurements as variables and to the individuals or objects as units (research units, sampling units, or experimental units) or observations. • In practice, multivariate data sets are common, although they are not always analyzed as such. • Historically, the bulk of applications of multivariate techniques have been in the behavioral and biological sciences.

  8. Multivariate Data Examples

  9. Factor Analysis • Factor analysis is a multivariate statistical approach commonly used in psychology and education. • It has many uses, two of which will be briefly noted here: • Firstly, factor analysis reduces a large number of variables into a smaller set of variables (also referred to as factors). • Secondly, it establishes underlying dimensions between measured variables and latent constructs, thereby allowing the formation and refinement of theory.

  10. Non-parametric • The classical methods that you know for dealing with a quantitative response variable have always assumed that the response variable has a normal distribution. • Nonparametric statistics plays an important role in the world of data analysis. • Nonparametric techniques can save the day when you can’t use other methods. • You want to compare two populations, but you have two small random samples which do not have a normal distribution. What can you do? • Many hypothesis-testing situations have nonparametric alternatives to be used when the usual assumptions we make are not met. • In other situations, nonparametric methods offer unique solutions to problems at hand.
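
One nonparametric answer to the two-small-samples question above is a rank-based comparison such as the Mann-Whitney U test. The slide does not name a specific test, so this choice is an assumption, and the data and function names below are hypothetical. This is a sketch of an exact permutation version using only the standard library, not a replacement for a statistical package.

```python
from itertools import combinations

def u_statistic(xs, ys):
    """Count pairs where x beats y; ties count as half."""
    return sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)

def exact_two_sided_p(xs, ys):
    """Exact permutation p-value for U under the null of identical distributions."""
    pooled = xs + ys
    n, total = len(xs), len(xs) * len(ys)
    observed = u_statistic(xs, ys)
    # Distance of U from its null expectation n*m/2.
    observed_dev = abs(observed - total / 2)
    count = extreme = 0
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        count += 1
        if abs(u_statistic(a, b) - total / 2) >= observed_dev:
            extreme += 1
    return observed, extreme / count

# Hypothetical small samples (not from the lecture).
group_a = [1.1, 2.3, 2.9, 3.4, 4.0]
group_b = [4.2, 5.1, 5.8, 6.3, 9.7]
u, p = exact_two_sided_p(group_a, group_b)
print(u, round(p, 4))
```

Here every value in the first group is below every value in the second, so U is 0 and only 2 of the 252 possible splits are as extreme, giving p = 2/252. No normality assumption is needed because only the ordering of the values is used.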

  11. Normal Distribution

  12. The Test Statistic and Critical Values • [Figure: sampling distribution of the test statistic, with labels: Region of Rejection (two), Region of Non-Rejection, Critical Values] • In general we want to hit the ‘region of rejection’ (if we are researchers) as we want to reject the null hypothesis.
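
As a minimal sketch of the critical values and rejection region described above, assuming a two-sided z-test at a significance level of 0.05 (both assumptions; the slide does not fix a test or a level):

```python
from statistics import NormalDist

alpha = 0.05  # significance level, an assumed conventional choice
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value, about 1.96

def in_rejection_region(z, z_crit=z_crit):
    """Two-sided test: reject when z falls beyond either critical value."""
    return abs(z) > z_crit

print(round(z_crit, 3))
print(in_rejection_region(2.4), in_rejection_region(-0.38))
```

A test statistic of 2.4 lies beyond the upper critical value and is rejected; a statistic of -0.38 lies in the region of non-rejection.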

  13. Normality Tests • There are a couple of ways to help you decide whether a population has a normal distribution, based on your sample: • You can graph the data, using a histogram, and see whether it appears to have a bell shape (a mound of data in the middle, trailing down on each side). • You can make a normal probability plot (a Q-Q plot), which compares your data to that of a normal distribution. • If the data follow a normal distribution, your normal probability plot will show a straight line. • If the data do not follow a normal distribution, the normal probability plot will not show a straight line; it may show a curve off to one side or the other, for example.
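
The normal probability plot idea above can be sketched without a plotting library by computing the Q-Q points and measuring how close they are to a straight line. The plotting positions (i - 0.5) / n, the use of a correlation coefficient as a "straightness" score, and the simulated data are all assumptions made for this illustration.

```python
import random
from statistics import NormalDist, mean

def qq_points(data):
    """Return (theoretical normal quantile, ordered data value) pairs."""
    n = len(data)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention, assumed here.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(data)))

def straightness(points):
    """Pearson correlation of the Q-Q points; values near 1 suggest a straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(2)
normal_data = [random.gauss(0, 1) for _ in range(200)]
skewed_data = [random.expovariate(1.0) for _ in range(200)]
r_normal = straightness(qq_points(normal_data))
r_skewed = straightness(qq_points(skewed_data))
print(round(r_normal, 3), round(r_skewed, 3))
```

The normally distributed sample should give a correlation very close to 1, while the skewed sample curves away from the line and scores lower. Plotting the pairs returned by `qq_points` gives the usual Q-Q plot.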

  14. Effect Size & Sample • We sometimes find that even very small differences between the sample statistic and the population parameter can be statistically significant if the sample size is large. • The left side of the graph shows a fairly large difference between the sample mean and population mean, but this difference is not statistically significant with a small sample…
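
The interaction between effect size and sample size described above can be sketched with a one-sample z-test. The population mean of 100, the standard deviation of 15, and both sample sizes are hypothetical values chosen for illustration, and a known standard deviation is assumed.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, pop_mean, sigma, n):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (sample_mean - pop_mean) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: population mean 100, sigma 15.
large_diff_small_n = z_test_p_value(106, 100, 15, n=10)       # difference of 6
small_diff_large_n = z_test_p_value(100.5, 100, 15, n=10000)  # difference of 0.5

print(f"difference 6.0, n=10:    p = {large_diff_small_n:.3f}")
print(f"difference 0.5, n=10000: p = {small_diff_large_n:.4f}")
```

The larger difference is not significant at the 0.05 level with 10 observations, while the much smaller difference is significant with 10,000. This is why a p-value on its own says little about whether a difference matters in practice.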

  15. Power
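
Power is the probability of rejecting the null hypothesis when a given true difference exists. As a rough sketch, the power of a two-sided one-sample z-test can be computed directly from the normal distribution. The test type, the shift of 5, the standard deviation of 15, and the sample sizes are all assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for a true mean shift delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # mean of the z statistic under the alternative
    # Probability that z lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

for n in (10, 50, 200):
    print(n, round(z_test_power(delta=5, sigma=15, n=n), 3))
```

For a fixed true difference, power rises as the sample size grows. With no true difference (delta = 0) the function returns alpha, the Type I error rate.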

  16. Multiple Regression • The idea of regression is to build a model that estimates or predicts one quantitative variable (y) by using at least one other quantitative variable (x). • Simple linear regression uses exactly one x variable to estimate the y variable. • Multiple linear regression, on the other hand, uses more than one x variable to estimate the value of y.
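
A minimal sketch of multiple linear regression with two x variables, fitted by solving the least-squares normal equations in plain Python. The data are hypothetical and generated exactly from y = 1 + 2·x1 + 3·x2 so that the recovered coefficients can be checked by eye; the helper names `solve` and `fit_linear` are illustrative. A statistical package would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_linear(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 7)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_linear(rows, y)
print([round(c, 6) for c in coefficients])
```

Passing rows with a single x value reduces this to simple linear regression, which is the only difference between the two models in this sketch.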

  17. Reporting Results 1. Abstract 2. Introduction a. Background b. Statement of purpose c. Hypotheses 3. Method a. Participants b. Measures c. Procedure d. Statistical plans 4. Results 5. Discussion 6. References

  18. Statistical Packages • R • Python • Excel • We are going to use R & Python (and sometimes Excel) as the main statistical packages for this semester. • There are many other packages available • IBM SPSS • Matlab • Octave

  19. Why R • It’s free. • If you can work with R you will pick up transferable skills and other stats packages won’t be a problem. • It is a good thing to have on your CV at the moment. • There are many excellent resources online. • You can continue learning and using R in your own time. • There is a community of people using R in Dublin who organise talks and examples.

  20. Common Statistical Mistakes • Failure to check the data for errors • Using an incorrect statistical procedure • Failure to check assumptions • Failure to define meaningful significance • Failure to evaluate confidence intervals as well as P-values • Selectively reporting significant results • Failure to check for spurious significant results • Failure to identify a representative sample • Failure to use all of the basic principles of experiments

  21. Mistake #1 - Failure to check the data for errors • Failing to investigate data for data entry or recording errors. • Failing to graph data and calculate basic descriptive statistics before analyzing data. • Example – Wrong decision due to error:

      Test of mu = 26.000 vs mu not = 26.000

      Variable    N    Mean  StDev  SE Mean      T      P
      With       16  25.625  3.964    0.991  -0.38   0.71
      Without    15  24.733  1.792    0.463  -2.74  0.016
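
The SE Mean and T columns in the example output above can be reproduced from the summary statistics alone. This sketch uses only the numbers given on the slide; the function name is illustrative, and the P column is not recomputed because the standard library has no t-distribution.

```python
from math import sqrt

def one_sample_t(mean, sd, n, mu0):
    """Return (standard error, t statistic) from summary statistics."""
    se = sd / sqrt(n)
    return se, (mean - mu0) / se

mu0 = 26.0
for label, n, mean, sd in [("With", 16, 25.625, 3.964), ("Without", 15, 24.733, 1.792)]:
    se, t = one_sample_t(mean, sd, n, mu0)
    print(f"{label:8s} SE = {se:.3f}  t = {t:.2f}")
```

Dropping a single observation shrinks the standard deviation from 3.964 to 1.792, which moves the t statistic from -0.38 to -2.74 and reverses the decision at the 0.05 level.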

  22. Check Your Data • This is not just true for statistics but for data analysis in general, e.g. data mining. • See the PDF of chapter 2 of “Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data” • http://www.springer.com/cda/content/document/cda_downloaddocument/9781848822597-c2.pdf?SGWID=0-0-45-963492-p173866313

  23. Mistake #2 • Using the wrong statistical procedure in analyzing your data. • Includes failing to check that necessary assumptions are met. Mistake #3 • Failing to design your study so that it has high enough power to call meaningful differences “significantly different.” • Includes concluding that the null hypothesis is true. Should be “not enough evidence to say the null is false.”

  24. Mistake #4 • Failing to report a confidence interval as well as the P-value. • P-value tells you if the result is statistically significant. • Confidence interval tells you what the population value might be. Mistake #5 • “Fishing” for significant results. That is, performing several hypothesis tests on a data set, and reporting only those results that are significant. • If α = P(Type I error) = 0.05, and we perform 20 tests on the same data set, we can expect to make 1 Type I error on average when every null hypothesis is true (0.05 × 20 = 1).
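
The arithmetic behind the “fishing” warning above can be made explicit. The expected count follows the slide; the probability of at least one false positive additionally assumes the 20 tests are independent and that every null hypothesis is true. The Bonferroni threshold is a common correction that is not mentioned in the lecture and is included only for comparison.

```python
alpha = 0.05
tests = 20

expected_false_positives = alpha * tests  # 0.05 * 20 = 1
# Assumes the 20 tests are independent and every null hypothesis is true.
p_at_least_one = 1 - (1 - alpha) ** tests
bonferroni_alpha = alpha / tests  # per-test threshold (not from the lecture)

print(expected_false_positives)
print(round(p_at_least_one, 3))
print(bonferroni_alpha)
```

Under these assumptions there is roughly a 64% chance of at least one spurious “significant” result among the 20 tests, which is why reporting only the significant ones is misleading.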

  25. Mistake #6 • Forgetting that a significant result may be “spurious.” • Overstating the results of an observational study. • That is, suggesting that one variable “caused” the differences in the other variable. • As opposed to correctly saying that the two variables are “associated” or “correlated.” Mistake #7 • Using a non-random or unrepresentative sample. • Includes extending the results of an unrepresentative sample to the population. Mistake #8 • Failing to use all of the basic principles of experiments, including randomization, blinding, and controlling.

  26. Summary • What are the preconditions for the tests we have looked at? • Using statistical packages to check some of these preconditions. • Looking at more than one dimension / factor with the ANOVA and regression. • Multiple linear regression • Two-way ANOVA… • Dimension reduction and classification. • Factor analysis… • Power and effect. • Some non-parametric alternatives to the tests we have looked at.
