LING / C SC 439/539 Statistical Natural Language Processing

Presentation Transcript

  1. LING / C SC 439/539 Statistical Natural Language Processing Lecture 3 1/16/2013

  2. Recommended reading • Manning & Schütze chapter 2, Mathematical Foundations • 2.1.1-2.1.7 (probability) • 2.2.1 (entropy)

  3. Office Hours • M 12-2 by appointment • T 1-2 • Th 1-3 by appointment • F 1-3

  4. Outline • Topic modeling • Probability: joint and conditional probability • Apply conditional probability to topic modeling • Probability: expected value, entropy • Apply entropy to topic modeling • Programming assignment #1 • Optional material: codes, tf-idf

  5. Topic modeling • Determine the topic of a document • Determine the topic(s) of words • Applications: • Information retrieval: • Given a user query, return relevant documents • Document classification: • You have a collection of documents whose topics are known. Given a new document, assign it to a topic.

  6. Fisher corpus • http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2004S13 • http://papers.ldc.upenn.edu/LREC2004/LREC2004_Fisher_Paper.pdf • Telephone conversations between pairs of American English speakers • Each pair is asked to talk about a specific topic • 16,000 conversations, 2,000 hours, 100,000,000 words • Audio files, transcribed into text

  7. List of topics in Fisher corpus Affirmative Action Airport Security Arms Inspections in Iraq Bioterrorism Censorship Comedy Computer games Corporate Conduct in the US Current Events Drug testing Education Family Family Values Food Foreign Relations Friends Health and Fitness Hobbies Holidays Hypothetical Situations. An Anonymous Benefactor Hypothetical Situations. One Million Dollars to leave the US Hypothetical Situations. Opening your own business Hypothetical Situations. Perjury Hypothetical Situations. Time Travel Illness Issues in the Middle East Life Partners Minimum Wage Movies Outdoor Activities Personal Habits Pets Professional Sports on TV Reality TV September 11 Smoking Strikes by Professional Athletes Televised Criminal Trials Terrorism US Public Schools

  8. Example conversation. Topic: airport security A (( my name is sharon )) B my name is deborah A (( hello )) B [noise] A hello B [noise] hello is this sharon A yes this is deborah B hi my name is yeah it is [laughter] did you hear the question A [laughter] A no B [laughter] i was gonna say i think it said something about um B something about terrorism in the airport but i couldn't get the A yeah i i think it i think it asked if the new security measures are um B wh- A are B mhm A gonna reduce terrorism i guess [sigh] B oh was that it i wasn't quite sure what it said [noise] but um what do you think about it

  9. Example conversation. Topic: airport security A i don't know you know i haven't flown since after september eleventh or since september eleventh happened so B oh i see A um i haven't i mean but i've heard from other people that um B mhm A up until recently it was um A it took a really long time to get on planes and most people thought it was kind of B (( mm )) A um A kind of a pain to to wait through it all B okay well i flew october right after a month after um the nine eleven incident A [noise] uh-huh B and i flew from connecticut to pittsburgh pennsylvania and i was not delayed one bit with the security not one bit A really B so my although when i was coming when i was returning i was on a shuttle headed to the uh parking parking garage and i heard B one of the other passengers on the shuttle said that they were quite inconvenient um because they were th- they select you at random to go through your luggage

  10. Example conversation. Topic: drug testing A (( [mn] hello )) B (( hey how you doing )) A what's going on my name is danny B hey danny i'm frank where you calling from A (( [mn] )) A uh studio city california B oh okay i'm in uh B in orange county A orange county B yeah A all right um so we're supposed to talk about drug testing i guess B yeah have you ever had that done A no i have not B no what they do now yeah they A (( how about yourself )) B i've had to apply for some jobs B where they require B um different kinds of drug tests now one of the most recent

  11. B [mn] that i've heard of is they uh take hair samples B and then they uh analyze the hair and determine B whether you have used any kind of drugs or pot or whatever and even alcohol 'cause it stays in the hair longer yeah and um A oh really B i've had that so they had to shave my arms they weren't gonna touch my legs i said you ain't touching my legs [laughter] A uh-huh yeah i guess A so they can do it to B or they can take A they can do it through hair samples oh that that's interesting B yeah they get it through hair samples now A you know like uh you know what the the span is as far as like how long that lasts like you i know with urine it's like you have to wait like a month or or whatnot but i've never heard it even done with like hair samples like B and uh it A how how long does it does it take like uh to clear out of your blood stream i guess B (( can )) B it'll stay in your hair for about a year

  12. Probability distribution, most-frequent words (a bit different from distribution for written text) i 0.04381 you 0.03122 and 0.02867 the 0.02541 to 0.01977 know 0.01972 a 0.01960 yeah 0.01945 that 0.01789 it 0.01541 like 0.01413 of 0.01360 )) 0.01267 (( 0.01267 in 0.01096 so 0.01032 they 0.01006 it's 0.00995 but 0.00950 um 0.00940 [laughter] 0.00933 [noise] 0.00813 have 0.00801 is 0.00768 was 0.00747 oh 0.00746 just 0.00743 don't 0.00719 uh 0.00699 right 0.00670 that's 0.00649 think 0.00639 do 0.00634 i'm 0.00607 what 0.00588 my 0.00587 for 0.00553 or 0.00532 we 0.00518 well 0.00511 on 0.00511 not 0.00495 be 0.00473 really 0.00467 if 0.00458 mean 0.00456 are 0.00455 mhm 0.00427
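
A minimal Python sketch of how such a unigram distribution could be estimated from tokenized transcript text; the token list below is a made-up toy example, not data from the corpus:

    from collections import Counter

    def unigram_distribution(tokens):
        # Count each token, then divide by the total number of tokens.
        counts = Counter(tokens)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    # Made-up fragment of transcript-like text, for illustration only.
    tokens = "i you know i think you know yeah".split()
    for word, p in sorted(unigram_distribution(tokens).items(), key=lambda kv: -kv[1]):
        print(word, round(p, 5))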

  13. Topic modeling • Determine the topic of a document • Determine the topic(s) of words • Find out what words are strongly associated with a topic • Today: look at conditional probability and entropy • Applications: • Information retrieval: • Given a user query, return relevant documents • Document classification: • You have a collection of documents whose topics are known. Given a new document, assign it to a topic.

  14. Outline • Topic modeling • Probability: joint and conditional probability • Apply conditional probability to topic modeling • Probability: expected value, entropy • Apply entropy to topic modeling • Programming assignment #1 • Optional material: codes, tf-idf

  15. Multiple random variables • Complex data can be described as a combination of values of multiple random variables • Example: 2 random variables • COLOR ∈ { blue, red } • SHAPE ∈ { square, circle } • Frequency of events: • count(COLOR=blue, SHAPE=square) = 1 • count(COLOR=red, SHAPE=square) = 2 • count(COLOR=red, SHAPE=circle) = 3 • count(COLOR=blue, SHAPE=circle) = 2

  16. Probability dist. over events that are combinations of random variables p(COLOR=blue, SHAPE=square) = 1 / 8 p(COLOR=red, SHAPE=square) = 2 / 8 p(COLOR=red, SHAPE=circle) = 3 / 8 p(COLOR=blue, SHAPE=circle) = 2 / 8 Sum = 8 / 8 = 1.0 Joint probability distribution
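
A minimal Python sketch of this computation, using the event counts from the previous slide (the code itself is illustrative, not part of the slides):

    from collections import Counter

    # Event counts from the slide: (COLOR, SHAPE) pairs.
    counts = Counter({
        ("blue", "square"): 1,
        ("red", "square"): 2,
        ("red", "circle"): 3,
        ("blue", "circle"): 2,
    })

    total = sum(counts.values())                     # 8
    joint = {event: c / total for event, c in counts.items()}

    print(joint[("red", "circle")])                  # 0.375 (= 3/8)
    print(sum(joint.values()))                       # 1.0, so it is a valid distribution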

  17. May omit name of random variable if it's understood • Joint probability distribution p: • p( blue, square ) = 1 / 8 = .125 • p( red, square ) = 2 / 8 = .250 • p( red, circle ) = 3 / 8 = .375 • p( blue, circle ) = 2 / 8 = .250 • Sum = 8 / 8 = 1.0

  18. Conditional probability • Example: • You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • Read as “probability of pink given puppy” • In conditional probability: • the probability calculation is restricted to a subset of events in the joint distribution • that subset is determined by the values of the random variables being conditioned on

  19. Conditional probability • Sample space for probability calculation is restricted to particular events in the joint distribution • p( SHAPE = square | COLOR = red ) = 2 / 5 • p( SHAPE = circle | COLOR = red ) = 3 / 5 • p( COLOR = blue | SHAPE = square ) = 1 / 3 • p( COLOR = red | SHAPE = square ) = 2 / 3 • p( COLOR = blue | SHAPE = circle ) = 2 / 5

  20. Compare to unconditional probability • Unconditional probability: sample space for probability calculation is unrestricted • p( SHAPE = square ) = 3 / 8 • = p( SHAPE = square | COLOR=blue or COLOR=red ) = 3 / 8 • p( SHAPE = circle ) = 5 / 8 • p( COLOR = blue ) = 3 / 8 • p( COLOR = red ) = 5 / 8

  21. Marginal (unconditional) probability • Probability for a subset of the random variable(s), ignoring other random variable(s) • If you know only the joint distribution, you can calculate the marginal probability of a random variable • Sum over values of all other random variables: p(X = x) = Σy p(X = x, Y = y)

  22. Marginal probability: example • p(COLOR=blue) = ? • Calculate by counting blue objects: 3/8 • Calculate through marginal probability: p(COLOR=blue) = p(COLOR=blue,SHAPE=circle) + p(COLOR=blue,SHAPE=square) = 2/8 + 1/8 = 3/8
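
The same marginalization as a small Python sketch (the joint probabilities are the ones from the earlier slides):

    # Joint distribution from the earlier slides, repeated so the sketch is self-contained.
    joint = {
        ("blue", "square"): 1/8, ("red", "square"): 2/8,
        ("red", "circle"): 3/8,  ("blue", "circle"): 2/8,
    }

    # p(COLOR=blue): sum the joint probability over all values of SHAPE.
    p_blue = sum(p for (color, shape), p in joint.items() if color == "blue")
    print(p_blue)   # 0.375 (= 3/8), matching the direct count of blue objects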

  23. Why it's called "marginal probability": margins of the joint prob. table • Sum probs. in each row and column to get marginal probs:

                   SHAPE=square       SHAPE=circle       marginal
    COLOR=blue         1/8                2/8            p(COLOR=blue) = 3/8
    COLOR=red          2/8                3/8            p(COLOR=red)  = 5/8
    marginal      p(SHAPE=square)    p(SHAPE=circle)     total probability
                       = 3/8              = 5/8          p(COLOR, SHAPE) = 8/8 = 1

  24. Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(B|A) = p(A, B) / p(A) • Probability of events of B, restricted to events of A • For numerator, only consider events that occur in both A and B • (Venn diagram: overlapping circles for events A and B, with intersection A&B)
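
An illustrative Python sketch of this quotient, using the joint distribution from the earlier slides (the helper function names are made up for the example):

    joint = {
        ("blue", "square"): 1/8, ("red", "square"): 2/8,
        ("red", "circle"): 3/8,  ("blue", "circle"): 2/8,
    }

    def p_color(color):
        # Marginal probability of COLOR, summing the joint over SHAPE.
        return sum(p for (c, s), p in joint.items() if c == color)

    def p_shape_given_color(shape, color):
        # p(SHAPE=shape | COLOR=color) = p(COLOR=color, SHAPE=shape) / p(COLOR=color)
        return joint[(color, shape)] / p_color(color)

    print(p_shape_given_color("square", "red"))   # 0.4 (= 2/5), matching the earlier slide
    print(p_shape_given_color("circle", "red"))   # 0.6 (= 3/5)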

  25. Some problems • You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • I have two children. What is the probability that both are girls? • I have two children. At least one of them is a girl. What is the probability that both are girls?

  26. How to solve • What are the random variables and their values? • What probability am I asking for? • What is the sample space? • Consider reduction in sample space due to conditioning • What are the counts or probabilities of events? • What is the answer?

  27. You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • What are the random variables and their values? COLOR ∈ { pink, blue } ANIMAL ∈ { kitty, puppy } • What probability am I asking for? p(pink | puppy) • What is the sample space? • Before conditioning: the set of all possible events = { pink kitty, pink puppy, blue kitty, blue puppy } • After conditioning: { pink puppy, blue puppy } • What are the counts or probabilities of events? 4 pink puppies, 2 blue puppies (6 puppies total) • What is the answer? 4 / 6

  28. I have two children. What is the probability that both are girls? • What are the random variables and their values? CHILD1 ∈ { girl, boy } CHILD2 ∈ { girl, boy } • What probability am I asking for? p(CHILD1=girl, CHILD2=girl) = p(girl, girl) • What is the sample space? No conditioning: { <girl, girl>, <girl, boy>, <boy, girl>, <boy, boy> } • What are the counts or probabilities of events? • Each event is equally likely, so 1/4 • What is the answer? 1 / 4

  29. I have two children. At least one of them is a girl. What is the probability that both are girls? • What are the random variables and their values? C1 ∈ { girl, boy } C2 ∈ { girl, boy } • What probability am I asking for? p(C1=girl, C2=girl | C1=girl or C2=girl or (C1=girl & C2=girl) ) • What is the sample space? • Before conditioning: { <girl, girl>, <girl, boy>, <boy, girl>, <boy, boy> } • After conditioning: { <girl, girl>, <girl, boy>, <boy, girl> } • What are the counts or probabilities of events? All are equally likely, so 1/3 • What is the answer? 1/3
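
A small Python sketch that checks this answer by enumerating the sample space and conditioning by counting (assuming, as the slide does, that all four outcomes are equally likely):

    from itertools import product

    # All equally likely outcomes for two children.
    sample_space = list(product(["girl", "boy"], repeat=2))

    # Condition on the event "at least one child is a girl".
    at_least_one_girl = [outcome for outcome in sample_space if "girl" in outcome]

    # Probability that both are girls, within the restricted sample space.
    both_girls = [outcome for outcome in at_least_one_girl if outcome == ("girl", "girl")]
    print(len(both_girls) / len(at_least_one_girl))   # 0.333... (= 1/3)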

  30. Outline • Topic modeling • Probability: joint and conditional probability • Apply conditional probability to topic modeling • Probability: expected value, entropy • Apply entropy to topic modeling • Programming assignment #1 • Optional material: codes, tf-idf

  31. List of topics in Fisher corpus Affirmative Action Airport Security Arms Inspections in Iraq Bioterrorism Censorship Comedy Computer games Corporate Conduct in the US Current Events Drug testing Education Family Family Values Food Foreign Relations Friends Health and Fitness Hobbies Holidays Hypothetical Situations. An Anonymous Benefactor Hypothetical Situations. One Million Dollars to leave the US Hypothetical Situations. Opening your own business Hypothetical Situations. Perjury Hypothetical Situations. Time Travel Illness Issues in the Middle East Life Partners Minimum Wage Movies Outdoor Activities Personal Habits Pets Professional Sports on TV Reality TV September 11 Smoking Strikes by Professional Athletes Televised Criminal Trials Terrorism US Public Schools

  32. Highest p(topic|word) for guns, handsome, and game
    guns: Airp:0.23322 Sept:0.11131 Pets:0.10247 US_P:0.10071 Arms:0.06714 Biot:0.05300 Cens:0.04240 Fore:0.03534 Terr:0.03357 Issu:0.02473 Smok:0.01943 Mini:0.01943 Comp:0.01767 Come:0.01767 Prof:0.01237 Life:0.01060 Hobb:0.01060 Corp:0.01060 HS.Perj:0.00883 HS.One_:0.00883 Tele:0.00707 Curr:0.00707 Drug:0.00530 Real:0.00353 Outd:0.00353 Illn:0.00353 HS.Open:0.00353 HS.An_A:0.00353
    handsome: Life:0.31579 Real:0.26316 HS.Time:0.07895 HS.An_A:0.07895 Pets:0.05263 Come:0.05263 Stri:0.02632 Pers:0.02632 Holi:0.02632 Curr:0.02632 Biot:0.02632 Airp:0.02632
    game: Prof:0.32156 Comp:0.25089 Stri:0.13381 Real:0.06653 Educ:0.02500 Outd:0.02236 Hobb:0.01635 Come:0.01353 Life:0.01165 Corp:0.00827 HS.Open:0.00733 Curr:0.00733 HS.Time:0.00714 Sept:0.00695 Arms:0.00677 Mini:0.00658 HS.An_A:0.00658 Fami:0.00620 Heal:0.00601 Food:0.00601 Movi:0.00564 Holi:0.00564 Pets:0.00507 US_P:0.00470 Biot:0.00413 Airp:0.00413 Pers:0.00357 Smok:0.00338

  33. Highest p(word|topic) for Airport Security Airport_Security i:0.04172 you:0.03337 the:0.02798 and:0.02647 know:0.02352 to:0.02144 that:0.01999 a:0.01824 yeah:0.01684 it:0.01622 they:0.01598 of:0.01292 like:0.01271 )):0.01156 ((:0.01156 in:0.01060 but:0.00998 so:0.00968 um:0.00924 it's:0.00903 have:0.00813 [noise]:0.00808 was:0.00788 think:0.00778 just:0.00771 don't:0.00750 right:0.00742 uh:0.00738 is:0.00722 [laughter]:0.00708 do:0.00655 oh:0.00629 on:0.00623 mean:0.00589 what:0.00558 if:0.00553 that's:0.00540 or:0.00539 not:0.00518 be:0.00515 i'm:0.00509 people:0.00506 we:0.00505 mhm:0.00490 for:0.00461 well:0.00453 are:0.00441 my:0.00433 really:0.00427 there:0.00414 security:0.00412 about:0.00409 with:0.00398 they're:0.00384 this:0.00382 all:0.00377 because:0.00345 no:0.00331 get:0.00330 at:0.00324 going:0.00309 would:0.00302 ah:0.00302 go:0.00301 had:0.00298 more:0.00277 one:0.00273 me:0.00271 can:0.00271 as:0.00270 airport:0.00269 then:0.00266 when:0.00263 something:0.00259 now:0.00258 how:0.00258 okay:0.00255 out:0.00242 from:0.00241 were:0.00235 through:0.00235 up:0.00234 he:0.00233 some:0.00226

  34. Not what we want… • Find topic-specific words • Conditional probability heuristic: • Highest p(word|topic) includes words that are highly frequent in general spoken English, and not specific to the topic • We want words that are topic-specific • Words that have a very skewed distribution, and do not appear in many different topics • Next: look at entropy

  35. Outline • Topic modeling • Probability: joint and conditional probability • Apply conditional probability to topic modeling • Probability: expected value, entropy • Apply entropy to topic modeling • Programming assignment #1 • Optional material: codes, tf-idf

  36. Expected value • Roll the die, get these results: p( X = roll 1) = 3 / 20 p( X = roll 4) = 2 / 20 p( X = roll 2) = 2 / 20 p( X = roll 5) = 1 / 20 p( X = roll 3) = 4 / 20 p( X = roll 6) = 8 / 20 • On average, if I roll the die, how many dots will there be? • Answer is not ( 1 + 2 + 3 + 4 + 5 + 6 ) / 6 = 3.5 • Need to consider the probability of each event

  37. Expected value of a random variable • The expected value of a random variable X is a weighted sum of the values of X: E[X] = Σx p(x) * x • i.e., for each event x in the sample space for the random variable X, multiply the probability of the event by the value of the event, and sum these • The expected value is not necessarily equal to one of the events in the sample space.

  38. Expected value: example • The expected value of a random variable X is a weighted sum of the values of X. • Example: the average number of dots that I rolled Suppose: p( X = roll 1) = 3 / 15 p( X = roll 4) = 2 / 15 p( X = roll 2) = 2 / 15 p( X = roll 5) = 1 / 15 p( X = roll 3) = 4 / 15 p( X = roll 6) = 3 / 15 • E[X] = (3/15)*1 + (2/15)*2 + (4/15)*3 + (2/15)*4 + (1/15)*5 + (3/15)*6 = 3.33
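
The same calculation as a short Python sketch (using the distribution from this slide):

    # E[X] = sum over events x of p(x) * x
    p = {1: 3/15, 2: 2/15, 3: 4/15, 4: 2/15, 5: 1/15, 6: 3/15}
    expected_value = sum(prob * x for x, prob in p.items())
    print(round(expected_value, 2))   # 3.33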

  39. Information theory • http://en.wikipedia.org/wiki/Information_theory • Applications: NLP, learning, signal processing, data compression, networks, etc. • Quantify the “information” in a probability distribution • Quantify the average uncertainty in a random variable • Demonstrate through 3 examples: • # of guesses it takes to guess a number or horse • Uncertainty about a probability distribution • Average # of bits in an optimal code (see optional section)

  40. 1. Guess my number • I’m thinking of a number between 1 and 64 • You make a guess • I will tell you “higher” or “lower”

  41. Best strategy: binary search • Cuts search space in half in each iteration • Suppose there are N values initially. • Make a guess at middle point of interval. • If incorrect, search space is now of size N/2 • If next guess incorrect, search space is size N/4 • If next guess incorrect, search space is size N/8 • etc. • Keep cutting in half until only one possible value • Maximum number of guesses: log2 N

  42. Binary search: at each iteration, range to be searched is cut in half • Input range: length N • Height of tree: log2 N • e.g. since 2^4 = 16, log2 16 = 4 • In worst case, binary search requires log2 N iterations • (Tree diagram: the range shrinks from N to N/2, N/4, N/8, N/16, ... at successive levels)

  43. Maximum number of guesses • N = 1 0 guesses • N = 2 1 guess • N = 4 2 guesses • N = 8 3 guesses • N = 16 4 guesses • N = 32 5 guesses • N = 64 6 guesses • …
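
A small Python sketch of the idealized halving model from the previous slide; it reproduces the counts in this table (the function name is made up):

    import math

    def max_guesses(n):
        # Each guess halves the search space; once only one value remains,
        # no further guess is needed.
        guesses = 0
        while n > 1:
            n //= 2
            guesses += 1
        return guesses

    for n in [1, 2, 4, 8, 16, 32, 64]:
        print(n, max_guesses(n), int(math.log2(n)))   # the last two columns agree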

  44. Fewer guesses if I give you additional information • I'm thinking of a number between 1 and 64 • Takes at most log2 64 = 6 guesses • Initially, the probability of each number is 1/64 • I'm thinking of an even number between 1 and 64 • Will now take at most log2 32 = 5 guesses • Uncertainty has been reduced • the probabilities are: • Odd numbers: probability is 0 • Even numbers: probability is 1/32

  45. 2. Uncertainty about a random variable • Example: compare these probability distributions over five events 1. [ .2, .2, .2, .2, .2 ] 2. [ .1, .1, .1, .35, .35 ] 3. [ 0, 0, 1.0, 0, 0 ] • For which distributions are you more/less uncertain about what event(s) are likely to occur? • How do you quantify your degree of uncertainty?

  46. Entropy of a probability distribution • Entropy is a measure of the average uncertainty in a random variable. • The entropy of a random variable X with probability mass function p(x) is: H(X) = - Σx p(x) log2 p(x) • which is the weighted sum of -log2 p(x), the number of bits needed to specify each event x • bit: binary digit, a number with two values
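
A minimal Python sketch of this formula, applied to the three example distributions over five events from the previous slide:

    import math

    def entropy(dist):
        # H(X) = -sum over x of p(x) * log2 p(x); zero-probability events contribute 0.
        return -sum(p * math.log2(p) for p in dist if p > 0)

    print(entropy([0.2, 0.2, 0.2, 0.2, 0.2]))     # ~2.32 = log2(5): maximum uncertainty
    print(entropy([0.1, 0.1, 0.1, 0.35, 0.35]))   # ~2.06: somewhat less uncertain
    print(entropy([0.0, 0.0, 1.0, 0.0, 0.0]))     # 0.0: outcome fully certain (may print -0.0)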

  47. Properties of entropy • Minimum value: H(X) = 0 • Outcome is fully certain • For example, if one event has 100% probability, and all others have 0% probability • Maximum value: H(X) = log2 |sample space| • Occurs when all events are equiprobable • No knowledge about which events are more/less likely • Maximizes uncertainty

  48. Entropy for a single variable. Example: toss a weighted coin, let p be p(heads). Entropy is maximized when p = .5 (can't predict how coin lands)
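
An illustrative Python sketch of the weighted-coin case, assuming the standard binary entropy formula H(p) = -p log2 p - (1-p) log2 (1-p):

    import math

    def coin_entropy(p):
        # Entropy of a weighted coin with p(heads) = p; take 0 * log2(0) = 0.
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
        print(p, round(coin_entropy(p), 3))
    # Peaks at 1.0 bit when p = 0.5 and falls to 0 when the outcome is certain.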

  49. -log2 function for probabilities • Let x be an event. Assume 0.0 < p(x) <= 1.0 • -∞ < log2 p(x) <= 0.0 • ∞ > -log2 p(x) >= 0.0, since -log2 p(x) = log2(p(x)^(-1)) = log2(1 / p(x)) • If x has high probability, -log2 p(x) is low • If x has low probability, -log2 p(x) is high • (Plot: blue curve is log2, red curve is -log2)

  50. Entropy as an expected value • Pointwise entropy of a single event x: -log2 p(x) = # of bits to specify that event • Entropy is the expected value (weighted average) of pointwise entropy over all events: H(X) = Σx p(x) * (-log2 p(x)) = -Σx p(x) log2 p(x) • = average # of bits to specify an event in the prob. dist.
