Hypothesis Testing
Coke vs. Pepsi • Hypothesis: tweets reflect market share (people tweet as much as they drink) • Market share: • 67% vs. 33% • From tweets: • 71% vs. 29% • Did this happen by chance? Or do people tend to talk about Coke more than they drink it?
A simpler hypothesis test • Claim: I can distinguish Coke and Pepsi just by tasting. • How do you verify my claim?
It's like a court judgment • If you want to prove something, you have to assume the opposite, and find evidence that contradicts it. • In a court, you want to prove a defendant guilty. You assume he/she is innocent.
You conducted an experiment… • And have some outcome • 62 out of 100 correct • Assume I cannot distinguish them and was just guessing at random. Is this result possible? • Of course it is possible: if I'm lucky, I can get 100 out of 100. But is the result surprising?
How do we define surprising-ness? • Let's play the random guessing game one million times. If it turns out that in only 4 of the 1 million games someone manages to score 62 or more, then we can say you have to be extremely lucky to do that: the chance is 4 in 1,000,000, or 0.0004%. • A random guesser would fall short of 62 in 99.9996% of games, so we can be very confident that a score of 62 was not just luck • Thus I am actually able to distinguish Coke and Pepsi to some extent.
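The "4 in a million" figure on this slide is a hypothetical outcome used to explain the idea. For the actual game (100 guesses, each right or wrong with probability 0.5), the chance of scoring 62 or more can be computed exactly by counting outcomes. The sketch below is one way to do that; the function name `tail_probability` is made up for illustration and is not part of the original deck. Under the fair-coin assumption it comes out at roughly 1%.

```python
from math import comb


def tail_probability(n_trials: int, min_correct: int) -> float:
    """Exact P(score >= min_correct) if each of n_trials guesses is a fair coin flip.

    Hypothetical helper for illustration: it counts the sequences of
    right/wrong answers with at least min_correct right ones and divides
    by the total number of equally likely sequences, 2 ** n_trials.
    """
    favourable = sum(comb(n_trials, k) for k in range(min_correct, n_trials + 1))
    return favourable / 2 ** n_trials


p = tail_probability(100, 62)
print(f"P(62 or more out of 100 by pure guessing) = {p:.4f}")
```

A smaller probability means a more surprising result, so the same function can be used to compare thresholds such as 57, 54 or 50.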
But we can't play this game that many times… • Or can we? • Open Excel • In cell B1, type = rand() • Can you make B1 say 0 if the random number is less than 0.5 and 1 otherwise? • You just flipped a coin in Excel!
Random Guessing Game in Excel • Flip the coin 100 times, in the same column • In cell B101, count how many heads you got • We've just played the random guessing game one time. • Can you do it 10 times?
Histogram • We want to find out how many times we scored 62 or higher. • It's also interesting to look at how the scores are distributed, i.e. which scores are more likely • A chart of that distribution is called a histogram • Let's create one by hand • Then in Excel
Now do it 50 times! (or more… doesn't have to be exact) • Does the histogram look better? • What about 500 times? Look at the histogram
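For comparison with the Excel chart, here is a minimal text-histogram sketch in Python. It assumes the same coin rule as the Excel exercise (a random number of 0.5 or more counts as 1); the names `play_game` and `score_histogram` are invented for this example, and the seed is fixed only so that reruns print the same picture.

```python
import random
from collections import Counter


def play_game(n_flips: int = 100) -> int:
    """One random guessing game: the number of 1s in n_flips fair coin flips."""
    # Same rule as the Excel cell: 0 if the random number is < 0.5, 1 otherwise.
    return sum(random.random() >= 0.5 for _ in range(n_flips))


def score_histogram(n_games: int) -> Counter:
    """Play n_games games and count how often each score occurred."""
    return Counter(play_game() for _ in range(n_games))


random.seed(0)  # fixed seed, chosen arbitrarily, for repeatable output
hist = score_histogram(500)

# One row per score, one '#' per game that ended with that score.
for score in range(min(hist), max(hist) + 1):
    print(f"{score:3d} | {'#' * hist[score]}")
```

Changing 500 to 50 or 5,000 shows how the shape gets smoother as more games are played.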
How probable is a score of 62? • You can calculate it from the histogram • Let's play the game in Python as many times as we want! • Here are the steps: • Flip a coin 100 times, and record the number of heads (I'll show you how to flip coins in Python) • Do it 1,000 times. Record all the scores (numbers of heads) • Find out how many of them are 62 or higher. What's the percentage? • Now calculate this percentage for 2,000 games, 5,000 games, 10,000 games and 50,000 games. What about a score of 57 or higher? 54? 50? • Aha, maybe you want to write a function…
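The steps above can be sketched as a pair of small functions. This is one possible solution, not the one presented in the session: `play_game` and `fraction_at_least` are hypothetical names, the coin is simulated with `random.random()`, and the seed is an arbitrary choice to make reruns repeatable.

```python
import random


def play_game(n_flips: int = 100) -> int:
    """Flip a fair coin n_flips times and return the number of heads."""
    return sum(random.random() >= 0.5 for _ in range(n_flips))


def fraction_at_least(threshold: int, n_games: int) -> float:
    """Play n_games games and return the fraction with a score >= threshold."""
    scores = [play_game() for _ in range(n_games)]
    hits = sum(score >= threshold for score in scores)
    return hits / n_games


random.seed(42)  # arbitrary seed so the printed numbers are repeatable

# The largest run simulates 5 million flips, so this can take a few seconds.
for n_games in (1_000, 2_000, 5_000, 10_000, 50_000):
    share = fraction_at_least(62, n_games)
    print(f"{n_games:>6} games: {share:.2%} scored 62 or higher")
```

Calling `fraction_at_least(57, 10_000)`, `fraction_at_least(54, 10_000)` and `fraction_at_least(50, 10_000)` answers the remaining questions; the estimates should settle down as the number of games grows.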