Producing Data

Producing Data - Introduction
• Statistics is a tool that helps data produce knowledge rather than confusion. As such, it must be concerned with producing data as well as interpreting already available data.
• Exploratory data analysis helps reveal information in data. However, on its own it can rarely provide convincing evidence for its conclusions.
• We may also use data to provide clear answers to specific questions, such as “What is the average human lifetime?”
• This lecture is devoted to developing the skills needed to produce trustworthy data and to judge the quality of data produced by others.
• The techniques for producing data are among the most important ideas in statistics; they are the basis for formal statistical inference.
week5

Collecting data
• Available data are data that were produced in the past for some other purpose but that may help answer a present question.
• Statistical designs for producing data rely on either sampling or experiments.
• A sample survey collects information about a population by selecting and measuring a sample from the population.
• Example: The General Social Survey (GSS) interviews about 3000 adult residents of the US every second year. That is, the GSS selects a sample of adults to represent the larger population of all adults living in the US.
• A census is an attempt to contact every individual in the population.

Observation versus Experiment
• An observational study observes individuals and measures variables of interest but does not attempt to influence the response.
• An experiment imposes a treatment on individuals in order to observe their response.
• An observational study, even one based on a statistical sample, is a poor way to study the effect of a treatment. To see the effect of a treatment we must actually impose the treatment.
• When our goal is to understand cause and effect, experiments are the only source of fully convincing data.

Design of experiments
• The individuals on which the experiment is done are the experimental units.
• A specific experimental condition applied to the units is called a treatment.
• A placebo is a dummy treatment. The response to a dummy treatment is the placebo effect.
• The explanatory variables in an experiment are called factors.
• The values of a factor are called levels.
• Many experiments study the joint effect of several factors. In such an experiment, each treatment is formed by combining a specific value of each of the factors.
• In principle, experiments can give good evidence of causation.

Example
We want to study the effects of aspirin and beta carotene on heart attacks and cancer.
Factors: aspirin (levels: yes, no), beta carotene (levels: yes, no).
Response variables: occurrence of heart attacks and cancer.
Treatments are the factor-level combinations (4 treatments).
The example above is a factorial (two-factor) experiment.
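The four treatments in this example are every combination of one level from each factor. A minimal Python sketch of that enumeration follows; the variable names are illustrative and not from the lecture.

```python
from itertools import product

# Factors and their levels, as listed in the example.
factors = {
    "aspirin": ["yes", "no"],
    "beta carotene": ["yes", "no"],
}

# Each treatment combines one level of every factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)
```

With two factors at two levels each, the list has 2 x 2 = 4 entries, matching the count in the example.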

Bias
• The design of a study is biased if it systematically favors certain outcomes.
• An uncontrolled study of a new medical therapy, for example, is biased in favor of finding the treatment effective because of the placebo effect.
• The group of patients who receive a dummy treatment is called a control group, because it enables us to control the effects of outside variables on the outcome.
• Control is the first basic principle of statistical design of experiments. Comparison of several treatments in the same environment is the simplest form of control.
• Example 3.9 page 180 in IPS.

Randomization
• The design of an experiment first describes the response variable or variables, the factors (explanatory variables), and the layout of the treatments, with comparison as the leading principle.
• The second aspect of design is the rule used to assign experimental units to the treatments. Comparison of the effects of treatments is valid only when all treatments are applied to similar groups of experimental units.
• Systematic differences among the groups of experimental units in a comparative experiment cause bias.
• The use of chance to divide experimental units into groups is called randomization.
• Randomization can be done by the hat method, random number tables, or software.

Example
• A food company assesses the nutritional quality of a new “instant breakfast” product by feeding it to newly weaned male white rats and measuring their weight gain over a 28-day period. A control group of rats receives a standard diet for comparison. This experiment has a single factor (diet) with two levels. 30 rats were used for this experiment.
• The outline of the design is given in the following diagram.
[Diagram: the 30 rats are randomly assigned to two groups of 15; Group 1 receives the new product and Group 2 the standard diet; the weight gains of the two groups are compared.]
• The design in the above figure combines comparison and randomization to arrive at the simplest randomized comparative design.

Principles of experimental design
• Control the effects of lurking variables on the response, most simply by comparing two or more treatments.
• Randomize: use impersonal chance to assign experimental units to treatments.
• Repeat each treatment on many units to reduce chance variation in the results.
Statistical significance
• An observed effect so large that it would rarely occur by chance is called statistically significant.

How to randomize
• The idea of randomization is to assign subjects to treatments by drawing names from a hat. In practice, experimenters use software to carry out randomization. We can randomize without software by using a table of random digits.
• A table of random digits is a list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:
- The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
- The digits in different positions are independent in the sense that the value of one has no influence on the value of any other.

Completely randomized design (CRD)
• When all experimental units are allocated at random among all treatments, the experimental design is completely randomized.
• Example (rats example on slide 8)
- Label each rat with a two-digit number from 01 to 30.
- Start at line 164 in Table B and read two-digit groups, skipping any group that is outside 01 to 30 or that repeats a label already chosen. The first 10 two-digit groups in this line are
11 02 27 91 24 49 52 56 30 78
So the rats labeled 11, 02, 27, 24, 30 go into the experimental group.
- Run your finger across line 164 (and continue to line 165 if needed) until you have chosen 15 rats. They are the rats labeled 11, 02, 27, 24, 30, 17, 22, 21, 01, 13, 23, 16, 28, 20, 08.
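The table-reading procedure above can be sketched in Python. The function name `choose_labels` and its parameters are illustrative assumptions; only the ten two-digit groups quoted from line 164 are taken from the slide, so the sketch stops after five rats rather than all 15.

```python
def choose_labels(two_digit_groups, max_label, how_many):
    """Pick labels from a stream of two-digit groups, skipping unusable ones."""
    chosen = []
    for group in two_digit_groups:
        label = int(group)
        # Skip groups outside 01..max_label and labels that were already chosen.
        if 1 <= label <= max_label and label not in chosen:
            chosen.append(label)
        if len(chosen) == how_many:
            break
    return chosen


# The first ten two-digit groups quoted above from line 164 of Table B.
line_164_start = "11 02 27 91 24 49 52 56 30 78".split()

print(choose_labels(line_164_start, max_label=30, how_many=15))
# prints [11, 2, 27, 24, 30]
```

Supplying the remaining digits of lines 164 and 165 would continue the selection until 15 labels are chosen; the rats not chosen form the control group.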

Cautions about experimentation
• The study of the effects of aspirin and beta carotene on heart attacks and cancer in the example on slide 5 was double-blind: neither the subjects nor the medical personnel who worked with them knew which treatment any subject had received. The double-blind method avoids unconscious bias, e.g. by a doctor who doesn’t think that “just a placebo” can benefit a patient.
• Lack of realism: the subjects, treatments, or setting of an experiment may not realistically duplicate the conditions we really want to study.
• Example 3.16 page 188 in IPS.

Matched pairs designs
• Matched pairs designs compare just two treatments. We choose blocks of two units that are as closely matched as possible. Alternatively, each block in a matched pairs design may consist of just one subject, who gets both treatments one after the other and serves as his or her own control.
• The idea is that matched subjects are more similar than unmatched ones, so that comparing responses within a number of pairs is more efficient than comparing the responses of groups of randomly assigned subjects.
• Randomization remains important: chance decides which member of a matched pair receives the first treatment.
• Example 3.17 page 189.
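The randomization step in a matched pairs design can be sketched as follows. This is a minimal illustration, not the lecture's own procedure: the function name, the subject labels, and the treatment names "A" and "B" are all hypothetical.

```python
import random


def randomize_pairs(pairs, treatments=("A", "B"), seed=None):
    """For each matched pair, let chance decide which unit gets which treatment."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        order = list(treatments)
        rng.shuffle(order)  # with two treatments this is a fair coin toss
        assignment[first] = order[0]
        assignment[second] = order[1]
    return assignment


# Hypothetical matched pairs of subjects.
pairs = [("s1", "s2"), ("s3", "s4"), ("s5", "s6")]
print(randomize_pairs(pairs, seed=1))
```

Every pair always receives both treatments; only the allocation within the pair is random. When each block is a single subject, the same coin toss can decide the order in which that subject receives the two treatments.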

Block design
• A block is a group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a randomized block design (RBD), the random assignment of units to treatments is carried out separately within each block.
• Example 3.18 page 190 in IPS: Progress of a type of cancer differs in women and men. We want to compare 3 therapies.
- Gender is a blocking variable.
- Two randomizations are done, one assigning the female subjects to the treatments and the other assigning the male subjects, as described in the following diagram.
[Diagram: subjects are split into women and men; within each block, subjects are randomly assigned to therapies 1, 2, and 3, and the responses are compared.]
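A randomized block design repeats the random assignment separately inside each block. The sketch below assumes hypothetical subject labels and six subjects per block; neither comes from the IPS example.

```python
import random


def randomized_block_design(units_by_block, treatments, seed=None):
    """Randomly assign units to treatments separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # a separate randomization for this block
        for position, unit in enumerate(shuffled):
            # Deal the shuffled units out to the treatments in turn.
            assignment[unit] = (block, treatments[position % len(treatments)])
    return assignment


# Hypothetical subjects: six women and six men.
units_by_block = {
    "women": [f"W{i}" for i in range(1, 7)],
    "men": [f"M{i}" for i in range(1, 7)],
}
therapies = ["therapy 1", "therapy 2", "therapy 3"]
plan = randomized_block_design(units_by_block, therapies, seed=0)

for unit, (block, therapy) in sorted(plan.items()):
    print(unit, block, therapy)
```

Because the shuffle happens within a block, each therapy is given to the same number of women and the same number of men, so gender cannot be confounded with the therapy.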

Sampling design
• A political scientist wants to know what percent of the voting-age population considers themselves conservatives. He needs to gather information about a large group of individuals.
• Time, cost, and inconvenience forbid contacting every individual.
• We gather information about only part of the group in order to draw conclusions about the whole population.
• We will not, as in an experiment, impose a treatment in order to observe the response.

Population and sample
• The entire group of individuals that we want information about is called the population.
• A sample is a part of the population that we actually examine in order to gather information.
Sample design
• The design of a sample refers to the method used to choose the sample from the population.
• Poor sample design can produce misleading conclusions.

Example
• The ABC network program Nightline asked (in a call-in poll) whether the UN should continue to have its headquarters in the United States. More than 186,000 callers responded (telephone companies charge for these calls), and 67% said “No”.
• People who spend time and money to respond to call-in polls are not representative of the entire adult population. In fact, they tend to be the same people who call radio talk shows.
• People who feel strongly, especially those with strong negative opinions, are more likely to call.
• It is not surprising that a properly designed sample showed that 72% of adults want the UN to stay.

Voluntary response sample
• A voluntary response sample consists of people who choose themselves by responding to a general appeal.
• Voluntary response samples are biased because people with strong opinions, especially negative opinions, are most likely to respond.
• Random selection of a sample eliminates bias by giving all individuals an equal chance to be chosen.

Simple Random Sample
• A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
• How to select an SRS? Use the hat method, random number tables, or software.
• Example 3.24 page 200 in IPS.
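As one way to select an SRS with software, the sketch below uses `random.sample` from Python's standard library, which draws without replacement. The population of 20 numbered individuals and the function name are hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every set of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)


# Hypothetical population: individuals labeled 1 to 20.
population = range(1, 21)
srs = simple_random_sample(population, n=5, seed=7)
print(sorted(srs))
```

The seed is there only to make the sketch reproducible; in practice one would leave it unset or record it alongside the sample.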

Stratified Random Sampling
• To select a stratified random sample, first divide the population into groups of similar individuals, called strata.
• Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
• Example 3.26 page 203 in IPS.
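The two steps above (form strata, then take a separate SRS in each) can be sketched as follows. The strata names, their sizes, and the per-stratum sample sizes are made up for illustration.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Take a separate SRS from each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))  # SRS within this stratum
    return sample


# Hypothetical strata and per-stratum sample sizes.
strata = {
    "urban": [f"U{i}" for i in range(1, 41)],
    "rural": [f"R{i}" for i in range(1, 21)],
}
sizes = {"urban": 4, "rural": 2}

sample = stratified_sample(strata, sizes, seed=3)
print(sample)
```

Unlike an SRS of size 6 from the whole population, this design guarantees that both strata are represented, here in proportion to their sizes.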

Multistage sampling design - Example
• Data on employment and unemployment are gathered by the government’s Current Population Survey, which conducts interviews in about 55,000 households each month.
• It is not practical to maintain a list of all US households from which to select an SRS, and the cost of sending interviewers to the widely scattered households in an SRS would be too high. So a multistage design is used.
• The Current Population Survey sampling design is:
Stage 1. Divide the US into 2007 geographical areas called primary sampling units (PSUs). Select a sample of 754 PSUs.
Stage 2. Divide each PSU selected into smaller areas called “blocks”. Stratify the blocks using ethnic and other information and take a stratified sample of the blocks in each PSU.
Stage 3. Sort the housing units in each block into clusters of 4 nearby units. Interview the households in a random sample of these clusters.
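The staged structure can be illustrated with a toy frame. This is a heavily simplified sketch and not the Current Population Survey procedure: the frame sizes are invented, every stage uses a plain SRS, and the stratification of blocks in Stage 2 is omitted.

```python
import random

rng = random.Random(5)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy frame: 6 PSUs, each with 4 blocks, each with 3 clusters.
frame = {
    f"PSU{p}": {
        f"block{b}": [f"PSU{p}-block{b}-cluster{c}" for c in range(1, 4)]
        for b in range(1, 5)
    }
    for p in range(1, 7)
}

sampled_clusters = []
for psu in rng.sample(sorted(frame), 2):             # stage 1: sample PSUs
    for block in rng.sample(sorted(frame[psu]), 2):  # stage 2: sample blocks in each PSU
        # stage 3: sample one cluster of housing units in each chosen block
        sampled_clusters.extend(rng.sample(frame[psu][block], 1))

print(sampled_clusters)
```

Only the selected PSUs and blocks need a detailed list of what they contain, which is the practical advantage over an SRS of all households.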

Systematic random samples - Example
We want to choose 4 addresses from a list of 100.
• Divide the list into 4 smaller lists, each of 100/4 = 25 addresses.
• Choose one of the first 25 at random (using random number tables) and then choose every 25th address after it.
• E.g. if 13 is the random number selected, the sample consists of the addresses numbered 13, 38, 63, 88.
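The systematic sampling rule above can be sketched in a few lines of Python. The function name and parameters are illustrative, and the sketch assumes the list size is a multiple of the sample size, as in the example.

```python
import random


def systematic_sample(list_size, sample_size, start=None, seed=None):
    """Pick a random start in the first sub-list, then take every step-th item."""
    step = list_size // sample_size
    if start is None:
        # Random start among the first `step` positions (numbered from 1).
        start = random.Random(seed).randint(1, step)
    return [start + k * step for k in range(sample_size)]


# Reproduce the slide's example: 4 addresses from 100, random start 13.
print(systematic_sample(100, 4, start=13))  # prints [13, 38, 63, 88]
```

Only the starting position is random; once it is chosen, the rest of the sample is determined.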

Cautions about sample surveys
• Undercoverage: sample surveys require an accurate and complete list of the population (a sampling frame). Because such lists are rarely available, most samples suffer from some degree of undercoverage, which occurs when some groups in the population are left out of the process of choosing the sample.
• Examples: (i) A sample survey of households will miss homeless people, prison inmates, and students in dormitories. (ii) An opinion poll conducted by telephone will miss the 6% of American households without residential phones.
• Nonresponse occurs when an individual chosen for the sample can’t be contacted or doesn’t cooperate.

Response bias
• The behavior of the respondent or the interviewer can cause response bias in sample results.
• Respondents may lie, especially if asked about illegal or unpopular behavior. The sample then underestimates the occurrence of such behavior in the population.
• Answers to questions that ask the respondent to recall past events are often inaccurate because of faulty memory.
• Wording of questions: confusing or leading questions can introduce a strong bias in a sample survey, and even minor changes in wording can change a survey’s outcome.

Statistical inference - Parameters and statistics
• A parameter is a number that describes the population. It is a fixed number, but in practice we do not know its value.
• A statistic is a number that describes a sample. The value of a statistic is known when we have taken a sample, but it can change from sample to sample.
• We often use a statistic to estimate an unknown parameter.

Sampling distribution
• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
• Example 3.33 page 214 in IPS: We simulate drawing SRSs of size 100 from the population of all adult US residents. Suppose that in fact 60% of the population find shopping frustrating. Then the true value of the parameter we want to estimate is p = 0.6. The following diagrams describe the sampling distribution of the statistic for different sample sizes.
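A simulation like the one in this example can be sketched in Python. The sketch assumes the population is so large that each sampled person can be treated as an independent yes/no draw with probability p; the number of simulated samples (1000) and the seed are arbitrary choices, not values from IPS.

```python
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=None):
    """Simulate the sample proportion for many samples of size n."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_samples):
        # Count how many of the n sampled individuals answer "yes".
        successes = sum(rng.random() < p for _ in range(n))
        proportions.append(successes / n)
    return proportions


p_hats = simulate_sample_proportions(p=0.6, n=100, num_samples=1000, seed=42)
print("mean of sample proportions:", round(statistics.mean(p_hats), 3))
print("std. dev. of sample proportions:", round(statistics.stdev(p_hats), 3))
```

A histogram of `p_hats` approximates the sampling distribution: the values cluster around the true p = 0.6 and vary from sample to sample.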

Bias and Variability
• A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
• The variability of a statistic is described by the spread of its sampling distribution.
• The spread is determined by the sampling design and the sample size n. Statistics from larger probability samples have smaller spreads.
Managing bias and variability
• To reduce bias, use an SRS.
• To reduce the variability of a statistic from an SRS, use larger samples.
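To see how the spread shrinks as n grows, the sketch below uses the standard formula for the standard deviation of a sample proportion from an SRS of a large population, sqrt(p(1 - p)/n). The formula is not stated on the slide, and the sample sizes shown are arbitrary.

```python
import math


def sd_of_sample_proportion(p, n):
    """Standard deviation of the sample proportion for an SRS of size n."""
    return math.sqrt(p * (1 - p) / n)


# Spread of the sample proportion when the true proportion is p = 0.6.
for n in (100, 400, 2500):
    print(n, round(sd_of_sample_proportion(0.6, n), 4))
```

Quadrupling the sample size halves the spread, while the center of the sampling distribution stays at p. A larger sample therefore reduces variability but does nothing about bias.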

Question - Final Dec 2001
• Two drugs, A and B, used in the treatment of glaucoma, were tested for effectiveness on 10 diseased dogs. Drug A was administered to one eye of each dog and drug B to the other eye. Pressure measurements were taken 1 hour later on both eyeballs of each dog. Which of the following statements are true?
(a) This is an example of a matched pairs design.
(b) This is an example of a CRD.
(c) This is an example of an RBD.
• Regarding the above study, which of the following is the most important?
(a) We need to randomize the assignment of dogs to drugs.
(b) We need to randomize the assignment of drugs to eyes.
(c) We need to select the dogs randomly from a bigger population.
(d) We need to stratify the dogs before assigning the drugs.
(e) We need to pair the dogs based on some relevant criteria related to the response.

Question - Dec 2001
• A list of enumeration areas in Ontario is made. From this list we pick every 10th one after a random start. For the selected areas, we obtain maps. For each map we number the blocks from 1 to N (N = number of blocks in that area). Using a random number (RN) table, we select two distinct numbers between 1 and N and include the corresponding blocks in our sample. On each selected block, we start at the northeast corner and walk around the block, selecting every 5th household into our sample (from a random start). The types of sampling methods used here (in no particular order) are
(a) stratified, SRS, systematic
(b) systematic, multistage, stratified
(c) multistage, SRS, systematic
(d) multistage, SRS, stratified
(e) SRS, systematic

Question - Summer 2001 test-2
a) In order to study various aspects of child abuse, 15 ‘Child Welfare Service Areas’ (CWSA) are randomly selected from all those across Canada. From each selected CWSA, 10% of cases are chosen by taking every 10th file from a cabinet.
i) Is this an observational study or an experiment?
ii) Describe the design (in statistical terminology).
b) You want to determine the best colour for attracting cereal leaf beetles to boards on which they will be trapped. You will compare three colours: blue, green, and yellow. The response variable is the count of beetles trapped. You will mount one board on each of 9 poles evenly spaced in a square field, with 3 poles in each row as shown below. You will proceed with a completely randomized experiment in order to compare the colours. Randomly assign colours to poles, and mark on the field sketch the colour assigned to each pole. Indicate exactly how you assigned the colours to the poles.

c) In the cigarette smoking and cancer video, there was one study in which smokers and non-smokers were matched up with respect to 30 different variables, making them ‘as like as possible’ in the words of the speaker. Cancer rates differed substantially between the smokers and non-smokers.
i) Is this an observational study or a randomized block design?
ii) Why does this or why does this not prove that smoking causes cancer?
d) Increasing the sample size is one method for reducing bias. True or false?

Question - Term test Summer 99
Suppose that we want to select a sample of students from the sta220 class (150 students in total).
a) If we assign each student in the class a number from 001 to 150, use an RN table to pick 2 distinct RNs from 001 to 150, and then take the corresponding students, what do we call this type of sample?
b) If we select every 5th student, after ordering the students in some fashion, what do we call this type of sampling design?
c) If we randomly select 4 students from the centre section, then 2 at random from the section on the left side, and finally 2 at random from the section on the right side, what type of sampling design is this?
d) If we randomly select 5 rows in the classroom, then 2 students at random from each selected row, what do we call this type of sampling design?

Question - Term Test summer 2000
For each of the following studies,
i) Indicate whether it is an observational study or a controlled experiment.
ii) If an observational study:
(a) Describe precisely the sampling design utilized. Use appropriate statistical terminology.
(b) Indicate the sources of bias, if any are present.
iii) If an experiment, identify
(a) the experimental unit(s) and the response variable(s).
(b) the factors, treatments, and the number of treatments.

(A) A city has 2000 city blocks in each of 4 geographical areas (NE, NW, SE, SW). Five blocks will be selected at random from each geographical area. For each selected block, 20% of households will be selected by having the interviewer walk around the block and take every 5th household, starting with the house at the northwest corner. When the interviewer arrives at a household, one of the adults present is randomly selected to be interviewed.

(B) In order to investigate the effect of repeated exposure to an advertising message, a number of undergraduate students viewed a 40-minute TV program that included ads for a digital camera. Some of the students saw a 30-second commercial; others saw a 90-second version. The same commercial was repeated either 1, 3, or 5 times during the program. After viewing, all of the subjects answered questions about their recall of the ad, their attitude toward the camera, and their intention to purchase it.
