1 / 28

Data Collection and Sampling

Data Collection and Sampling. Chapter 5. Statistics. Data. Information. Recall. Statistics is a tool for converting data into information :. But… Where then does data come from? How is it gathered? How do we ensure its accurate( 正確) ? Is the data reliable( 可靠) ?

hbrunet
Télécharger la présentation

Data Collection and Sampling

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Collection and Sampling Chapter 5

  2. Statistics Data Information Recall • Statistics is a tool for converting data into information: • But… • Where then does data come from? • How is it gathered? • How do we ensure its accurate(正確)? Is the data reliable(可靠)? • Is it representative(代表性)of the population from which it was drawn? • This chapter explores some of these issues.

  3. 5.1 Methods of Collecting Data • Thereliability and accuracyof the data affect the validity of the results of a statistical analysis. • The reliability and accuracy of the data depend on the method of collection. • Four of the most popular sources of statistical data are: • Published data(公開資料) • Observational studies(觀察) • Experimental studies(實驗) • Surveys(調查)

  4. Published Data • This is often a preferred source of data due to low cost and convenience. • Published data is found as printed material, tapes, disks, and on the Internet. • Data published by the organization that has collected it is called PRIMARY DATA(初級資料). For example: Data published by the US Bureau of Census. • For example: • The Statistical abstracts of the United States, • compiles data from primary sources • Compustat, sells variety of financial data tapescompiled from primary sources • Data published by an organization different than the organization that has collected it is called SECONDARY DATA(次級資料).

  5. Observational and experimental studies • Observational study is one in which measurements representing a variable of interest are observed and recorded, without controlling any factor that might influence their values. • Experimental study is one in which measurements representing a variable of interest are observed and recorded, while controlling factors(控制變數) that might influence their values. • When published data is unavailable, one needs to conduct a study to generate the data.

  6. Surveys • Surveys solicit information from people. e.g. pre-election polls; marketing surveys. • The Response Rate(回收率) (i.e. the proportion of all people selected who complete the survey) is a key survey parameter. • Surveys can be made by means of • personal interview(個人訪談) • telephone interview(電話訪談) • self-administered questionnaire(自我管理調查)

  7. Questionnaire Design(問卷設計) • Key design principles of a good questionnaire: • Keep the questionnaire as short as possible. • Ask short, simple, and clearly worded questions. • Start with demographic questions to help respondents get started comfortably. • Use dichotomous (yes|no) and multiple choice questions. • Use open-ended questions cautiously. • Avoid using leading-questions. • Pretest a questionnaire on a small number of people. • Think about the way you intend to use the collected data when preparing the questionnaire.

  8. 5.2 Sampling(抽樣) • Recall that statistical inference permits us to draw conclusions about a population based on a sample. • Motivation for conducting a sampling procedure: • Costs. (e.g. it’s less expensive to sample 1,000 television viewers than 100 million TV viewers) • Population size. • The possible destructive nature (破壞性)of the sampling process. (e.g. performing a crash test on every automobile produced is impractical). • The sampled population(抽樣母體)and the target population(目標母體)should be similar to one another.

  9. 5.3 Sampling Plans • A sampling plan is just a method or procedure for specifying how a sample will be taken from a population. • We will focus our attention on these three methods: • Simple random sampling(簡單隨機抽樣) • Stratified random sampling(分層隨機抽樣) • Cluster sampling(集群抽樣)

  10. Simple Random Sampling • In simple random sampling all the samples with the same size are equally likely to be chosen. • To conduct random sampling • assign a number to each element of the chosen population (or use already given numbers), • randomly select the sample numbers (members). Use a random numbers table, or a software package.

  11. Simple Random Sampling • Example 5.1 • A government income-tax auditor is responsible for 1,000 tax returns. • The auditor will randomly select 40 returns to audit. • Use Excel’s random number generator to select the returns. • Solution • We generate 50 numbers between 1 and 1000 (we need only 40 numbers, but the extra might be used if duplicate numbers are generated.)

  12. Simple Random Sampling • Example 5.1: A government income tax auditor must choose a sample of 40 of 1,000 returns to audit Extra #’s may be used if duplicate random numbers are generated.

  13. 50 numbers uniformly distributed between 0 and 1 50 random uniformly distributed whole- numbers between 1 and 1000. 50 Random numbers between 0 and 1000, each has a probability of 1/1000 to be selected Simple Random Sampling Round-up X(100) 383 101 597 900 885 959 15 408 864 139 246 . . The auditor should select 40 files numbered 383, 101, ...

  14. Age • under 20 • 20-30 • 31-40 • 41-50 • Sex • Male • Female • Occupation • professional • clerical • blue-collar Stratified Random Sampling • This sampling procedure separates the population into mutually exclusive sets (strata) (互斥的層別), and then draw simple random samples from each stratum.

  15. Stratified Random Sampling • With this procedure we can acquire information about • the whole population • each stratum • the relationships among strata.

  16. If we only have sufficient resources to sample 400 people total, we would draw 100 of them from the low income group… If we are sampling 1000 people, we’d draw 50 of them from the high income group. Stratified Random Sampling • After the population has been stratified, we can use simple random sampling to generate the complete sample. For example, keep the proportion of each stratum in the population.

  17. Cluster Sampling • Cluster sampling is a simple random sample of groups or clusters of elements. • This procedure is useful when • it is difficult and costly to develop a complete list of the population members (making it difficult to develop a simple random sampling procedure. • the population members are widely dispersed geographically. • Cluster sampling may increase sampling error(抽樣誤差), because of probable similarities among cluster members.

  18. Sample Size(樣本數) • Numerical techniques for determining sample sizes will be described later, but suffice it to say that the larger the sample size is, the more accurate we can expect the sample estimates to be.

  19. 5.4 Sampling and Non-Sampling Errors • Two major types of error can arise when a sample of observations is taken from a population: • Sampling error(抽樣誤差)refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample. • Nonsampling errors (非抽樣誤差) are more serious and are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly.

  20. Sampling Error • Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample. • Another way to look at this is: the differences in results for different samples (of the same size) is due to sampling error: • E.g. Two samples of size 10 of 1,000 households. If we happened to get the highest income level data points in our first sample and all the lowest income levels in the second, this delta is due to sampling error. • Increasing the sample size willreduce this type of error.

  21. The sample mean falls here only because certain randomly selected observations were included in the sample. Sampling Errors Population income distribution m ( population mean) Sampling error

  22. Nonsampling Error • Nonsampling errors are more serious and are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly. • Three types of nonsampling errors: • Errors in data acquisition • Nonresponse errors(無反應偏差) • Selection bias(取樣誤差) • Note: increasing the sample size will notreduce this type of error.

  23. Errors in data acquisition • …arises from the recording of incorrect responses, due to: • incorrect measurements being taken because of faulty equipment, • mistakes made during transcription from primary sources, • inaccurate recording of data due to misinterpretation of terms, or • inaccurate responses to questions concerning sensitive issues.

  24. If this observation… …is wrongly recorded here… …then the sample mean is affected Data Acquisition Error Population Sampling error + Data acquisition error Sample

  25. Nonresponse Error • …refers to error (or bias) introduced when responses are not obtained from some members of the sample, i.e. the sample observations that are collected may not be representative of the target population. • As mentioned earlier, the Response Rate (i.e. the proportion of all people selected who complete the survey) is a key survey parameter and helps in the understanding in the validity of the survey and sources of nonresponse error.

  26. Non-Response Error Population No response here... …may lead to biased results here. Sample

  27. Selection Bias • …occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample.

  28. Selection Bias Population When parts of the population cannot be selected... …the sample cannot represent the whole population. Sample

More Related