
LING 408/508: Computational Techniques for Linguists

LING 408/508: Computational Techniques for Linguists. Lecture 18 10/3/2012. Outline. Applications of dictionaries Phone book Word frequencies and Zipf’s Law References, mutability, and dictionary values. Structure of dictionary for phone book problem. 2 people named “Sarah Connor”

kaya




Presentation Transcript


  1. LING 408/508: Computational Techniques for Linguists Lecture 18 10/3/2012

  2. Outline • Applications of dictionaries • Phone book • Word frequencies and Zipf’s Law • References, mutability, and dictionary values

  3. Structure of dictionary for phone book problem
     • 2 people named “Sarah Connor”
     • “Kyle Reese” and “Arnold Schwarz” have the same phone number
     Key: (string, string) tuple     Value: list of integers
     ('Mitt','Romney')               [9997777]
     ('Sarah','Connor')              [1234567, 1010101]
     ('Kyle','Reese')                [7654321]
     ('Arnold','Schwarz')            [7654321]
     ('Arnold','Ventura')            [2233444]
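A minimal sketch of this structure as a Python dictionary literal, using the entries listed on the slide. The name `pb` follows the later slides; the looked-up name ('John', 'Connor') is a hypothetical absent entry added only for illustration.

```python
# Phone book from the slide: (first, last) tuple -> list of integer phone numbers
pb = {
    ('Mitt', 'Romney'): [9997777],
    ('Sarah', 'Connor'): [1234567, 1010101],
    ('Kyle', 'Reese'): [7654321],
    ('Arnold', 'Schwarz'): [7654321],
    ('Arnold', 'Ventura'): [2233444],
}

# Tuples are immutable, so they can serve as dictionary keys.
sarah_numbers = pb[('Sarah', 'Connor')]

# get() with a default avoids a KeyError for a name that is not listed
# (hypothetical name, not on the slide).
missing = pb.get(('John', 'Connor'), [])
```

Storing a list as the value lets one name carry several numbers, which is how the two people named “Sarah Connor” fit under a single key.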

  4. Other queries (DON’T LOOK AHEAD!!! SPOILERS FOLLOW)
     # pb is a dictionary
     # keys: (string, string)
     # values: list of integers
     1. How many names are listed in the phone book?
     2. What are the distinct first names listed in the phone book?
     3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.
     4. How many phone numbers are there in the phone book?
     5. How many phone numbers are there such that at least 2 people share that phone number?

  5. Other queries
     # pb is a dictionary
     # keys: (string, string)
     # values: list of integers
     1. How many names are listed in the phone book?
        len(pb.keys())

  6. Other queries
     # pb is a dictionary
     # keys: (string, string)
     # values: list of integers
     2. What are the distinct first names listed in the phone book?
        set([first for (first, last) in pb.keys()])

  7. Other queries
     # pb is a dictionary
     # keys: (string, string)
     # values: list of integers
     3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.
     Example:
     pb = {('Bob', 'Barker'):[1234567],
           ('Bobby', 'Barker'):[1234567],
           ('Sue', 'Parker'):[3333333, 4444444]}

  8. 3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.
     # each (name, number) pair is counted as one person
     ppl = []
     for (name, ph_nums) in pb.items():
         for num in ph_nums:
             ppl.append((name, num))
     num_distinct_ppl = len(ppl)
     # >>> ppl (order of entries may differ)
     # [(('Sue', 'Parker'), 3333333),
     #  (('Sue', 'Parker'), 4444444),
     #  (('Bob', 'Barker'), 1234567),
     #  (('Bobby', 'Barker'), 1234567)]

  9. Other queries
     # pb is a dictionary
     # keys: (string, string)
     # values: list of integers
     4. How many phone numbers are there in the phone book?
        all_nums = set()
        for nums in pb.values():
            all_nums.update(nums)
        num_phone_nums = len(all_nums)

  10. Other queries
      # pb is a dictionary
      # keys: (string, string)
      # values: list of integers
      5. How many phone numbers are there such that at least 2 people share that phone number?
      • Need to get all numbers with >= 2 names
      • Construct a dictionary mapping a number to a list of names

  11. # key: number
      # value: list of names
      num_to_names = {}
      for (name, numlist) in pb.items():
          for num in numlist:
              names = num_to_names.get(num, [])
              names.append(name)
              num_to_names[num] = names  # necessary
      num_shared = 0
      for (num, names) in num_to_names.items():
          if len(names) >= 2:
              num_shared += 1
      print('quantity of phone numbers that are shared:')
      print(num_shared)

  12. Using a list comprehension
      num_shared = 0
      for (num, names) in num_to_names.items():
          if len(names) >= 2:
              num_shared += 1
      # x refers to (num, names) tuples
      shared = len([x for x in num_to_names.items() if len(x[1]) >= 2])
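As a check on the five queries, here is a sketch that runs all of them against the phone book from slide 3. It swaps in `collections.defaultdict` for the `get`-then-assign pattern used on the slides; that substitution, and the result variable names, are choices made for this illustration and are not part of the lecture code.

```python
from collections import defaultdict

# Phone book entries as listed on slide 3
pb = {
    ('Mitt', 'Romney'): [9997777],
    ('Sarah', 'Connor'): [1234567, 1010101],
    ('Kyle', 'Reese'): [7654321],
    ('Arnold', 'Schwarz'): [7654321],
    ('Arnold', 'Ventura'): [2233444],
}

# 1. Number of names listed
num_names = len(pb)

# 2. Distinct first names (set comprehension over the keys)
first_names = {first for (first, last) in pb}

# 3. Distinct people, counting each (name, number) pair as one person
num_people = sum(len(nums) for nums in pb.values())

# 4. Distinct phone numbers
all_nums = set()
for nums in pb.values():
    all_nums.update(nums)

# 5. Numbers shared by at least 2 names; defaultdict(list) creates and
#    stores the empty list on first access, so no reassignment is needed
num_to_names = defaultdict(list)
for name, nums in pb.items():
    for num in nums:
        num_to_names[num].append(name)
num_shared = sum(1 for names in num_to_names.values() if len(names) >= 2)
```

Here only 7654321 is shared (by “Kyle Reese” and “Arnold Schwarz”), so the last query counts a single number.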

  13. Outline • Applications of dictionaries • Phone book • Word frequencies and Zipf’s Law • References, mutability, and dictionary values

  14. Frequencies of words in a corpus: types and tokens
      • Brown corpus of English:
        • 1160743 word tokens
        • 49680 word types
      • Type: a distinct word
        • “with”
      • Token: an individual occurrence of a word
        • “with” occurs 7270 times
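The type/token distinction can be sketched on a toy sentence. The sentence below is made up for illustration and is not drawn from the Brown corpus; whitespace splitting stands in for real tokenization.

```python
# Hypothetical toy text; tokens are obtained by splitting on whitespace
text = "the cat sat on the mat with the hat"

tokens = text.split()   # every occurrence of a word is a token
types = set(tokens)     # each distinct word is a type

num_tokens = len(tokens)
num_types = len(types)

# Token frequency of a single type
the_count = tokens.count('the')
```

This toy text has 9 tokens but only 7 types, because “the” accounts for 3 of the tokens.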

  15. Word frequency
      • Write a program that reads a text file and writes an output file listing each word type and its token frequency, in order of decreasing frequency.
      • Output format:
          1000 hello
          800 fish
          40 lion
      • Plot the frequency distribution.

  16. # dictionary to store word frequencies
      # maps a string to an integer
      w_to_freq = {}
      # read file one line at a time,
      # break each line into individual tokens,
      # and count the tokens
      # (the with statement closes the file when the loop ends)
      with open('C:/brown-corpus.txt', 'r') as corpus:
          for line in corpus:
              tokens = line.split()
              for tok in tokens:
                  w_to_freq[tok] = w_to_freq.get(tok, 0) + 1

  17. # w_to_freq.items() yields (string, int) tuples
      # convert to list of (int, string) tuples
      # so that we can sort by frequency
      freqs_words = []
      for (w, freq) in w_to_freq.items():
          freqs_words.append((freq, w))
      # sort by decreasing frequency
      freqs_words.sort(reverse=True)
      # write output file
      with open('C:/output.txt', 'w') as of:
          for (freq, w) in freqs_words:
              of.write('{:8d}\t{:s}\n'.format(freq, w))

  18. Using a list comprehension
      # freqs_words = []
      # for (w, freq) in w_to_freq.items():
      #     freqs_words.append((freq, w))
      # same, using a list comprehension
      freqs_words = [(f, w) for (w, f) in w_to_freq.items()]
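For comparison, the counting step can also be sketched with `collections.Counter` from the standard library. This is an alternative to the plain-dictionary approach on the slides, not the lecture's own code, and the two input lines are hypothetical stand-ins for a corpus file.

```python
from collections import Counter

# Hypothetical stand-in for the lines of a corpus file
lines = ["the fish saw the lion", "the lion saw a fish"]

# Counter is a dict subclass; update() adds one count per token
w_to_freq = Counter()
for line in lines:
    w_to_freq.update(line.split())

# Same (frequency, word) list as on the slides, sorted by decreasing frequency;
# ties are broken by the word itself, in reverse alphabetical order
freqs_words = [(f, w) for (w, f) in w_to_freq.items()]
freqs_words.sort(reverse=True)

# most_common(n) returns the n highest-count (word, count) pairs
top = w_to_freq.most_common(1)
```

Because `Counter` behaves like a dictionary, the sorting and output code from the slides works on it unchanged.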

  19. Frequency and rank
      • Sort words by decreasing frequency
      • Rank = order in sorted list
        • Rank 1: most-frequent word
        • Rank 2: second most-frequent word
        • etc.
      • Plot word frequencies by rank
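A small sketch of attaching ranks to an already sorted list, using the three entries from the output-format example on slide 15. The name `ranked` is introduced here for illustration.

```python
# (frequency, word) pairs, already sorted by decreasing frequency
freqs_words = [(1000, 'hello'), (800, 'fish'), (40, 'lion')]

# enumerate(..., start=1) numbers the entries so the most frequent word gets rank 1
ranked = [(rank, w, freq)
          for rank, (freq, w) in enumerate(freqs_words, start=1)]
```

When plotting a list of counts with `plt.plot(counts)`, the x-axis position plays the same role, except that it starts at 0 instead of 1.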

  20. Plotting • Download and install matplotlib • http://matplotlib.sourceforge.net/

  21. import matplotlib.pyplot as plt
      # some code omitted
      # read word frequencies from a corpus
      #
      # counts is a list of integers in
      # decreasing order
      counts = [freq for (freq, w) in freqs_words]
      plt.plot(counts)
      plt.xlabel('word rank')
      plt.ylabel('word frequency')
      plt.title('Word frequency vs. rank')
      plt.show()

  22. Plot of word frequencies, linear scale (out-of-date figure: forgot to include labels and titles)
      [Figure: frequency (in 10,000s) plotted against rank (in 10,000s)]

  23. Plot of word frequencies, zoom in

  24. import matplotlib.pyplot as plt
      # some code omitted
      # read word frequencies from a corpus
      # counts is a list of integers in
      # decreasing order
      plt.plot(counts)
      plt.xscale('log')  # logarithmic scales
      plt.yscale('log')  # for x- and y- axes
      plt.xlabel('word rank')
      plt.ylabel('word frequency')
      plt.title('Word frequency vs. rank')
      plt.show()

  25. Plot of word frequencies, log-log scale
      ($\log_{10} 10 = 1$, $\log_{10} 100 = 2$, $\log_{10} 1000 = 3$)

  26. Plot of word frequencies, log-log scale
      ($\log_{10} 10 = 1$, $\log_{10} 100 = 2$, $\log_{10} 1000 = 3$)
      • ~10 types with freq. > 10,000
      • ~100 types with 1,000 < freq. < 10,000
      • ~1,000 types with 100 < freq. < 1,000
      • ~10,000 types with 10 < freq. < 100
      • 10,000s of types with 1 < freq. < 10

  27. Word frequency distributions in language • There are a few common words • A large, but not huge number of medium frequency words • Very, very many low frequency words

  28. Word frequencies exemplify Zipf’s law
      • Power law distribution
      • The frequency F of a word w is inversely proportional to the rank R of w: $F \propto 1/R$, i.e., $F \times R = k$ for some constant k
      • Example: the 50th most common word type should occur three times as frequently as the 150th most common word type
        freq. at rank 50: $\propto 1/50$
        freq. at rank 150: $\propto 1/150$
        $(1/50) / (1/150) = 3$
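The arithmetic in the example can be sketched in a few lines. The function name `zipf_frequency` and the constant `k = 60000.0` are hypothetical and chosen only to make the numbers easy to read; the ratio does not depend on k.

```python
def zipf_frequency(rank, k):
    """Frequency predicted by an idealized Zipf's law: F = k / R."""
    return k / rank

k = 60000.0  # hypothetical constant, not estimated from any corpus

f50 = zipf_frequency(50, k)
f150 = zipf_frequency(150, k)

# k cancels in the ratio: (k/50) / (k/150) = 150/50 = 3
ratio = f50 / f150

# F x R recovers the same constant at every rank
products = [zipf_frequency(r, k) * r for r in (1, 50, 150)]
```

Real corpora only approximate this relationship, so measured ratios will deviate from 3.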

  29. Near-linear relationship between freq. and rank in log-log scale
      ($\log_{10} 10 = 1$, $\log_{10} 100 = 2$, $\log_{10} 1000 = 3$)
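Taking logarithms of F = k / R gives log F = log k - log R, a straight line with slope -1 in log-log coordinates. The sketch below estimates that slope by ordinary least squares in pure Python. The helper `loglog_slope` and the idealized counts are assumptions made for illustration; on a real corpus the estimate would only be close to -1.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log10(frequency) against log10(rank).

    counts must be positive and sorted in decreasing order; rank starts at 1.
    """
    xs = [math.log10(r) for r in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Idealized Zipfian counts F = k / R with a hypothetical k = 1000
ideal_counts = [1000.0 / r for r in range(1, 101)]
slope = loglog_slope(ideal_counts)
```

With the `counts` list from the plotting slides in place of `ideal_counts`, the same function gives a rough numerical check on how linear the log-log plot is.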

  30. What kind of words are frequent and infrequent?

  31. Most-frequent words: function words (perform grammatical functions)

  32. Least-frequent words: content words (express meaning)

  33. Outline • Applications of dictionaries • Phone book • Word frequencies and Zipf’s Law • References, mutability, and dictionary values

  34. References and mutability in the context of dictionary values
      • In the code below, Anna has a room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?
      >>> d = {'Anna': 104, 'Jack':[303, 304]}
      >>> anna_room = d['Anna']
      >>> anna_room = 105
      >>> d['Anna'] = anna_room    # 1
      >>> jack_rooms = d['Jack']
      >>> jack_rooms.append(305)
      >>> d['Jack'] = jack_rooms   # 2
      >>> d
      {'Anna': 105, 'Jack':[303, 304, 305]}

  35. References and mutability in the context of dictionary values
      • In the code below, Anna has a room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?
      >>> d = {'Anna': 104, 'Jack':[303, 304]}
      >>> anna_room = d['Anna']
      >>> anna_room = 105
      >>> # d['Anna'] = anna_room    # 1
      >>> jack_rooms = d['Jack']
      >>> jack_rooms.append(305)
      >>> # d['Jack'] = jack_rooms   # 2
      >>> d
      {'Anna': 104, 'Jack':[303, 304, 305]}
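The difference between the two cases can be made explicit with the `is` operator, which tests whether two names refer to the same object. This sketch reuses the dictionary from the slide; `same_list` and `jack_copy` are names introduced for illustration.

```python
d = {'Anna': 104, 'Jack': [303, 304]}

# Assigning to anna_room rebinds the name; the dictionary entry is untouched
anna_room = d['Anna']
anna_room = 105

# jack_rooms and d['Jack'] refer to the same list object
jack_rooms = d['Jack']
same_list = jack_rooms is d['Jack']

# append() mutates that shared list in place, so d sees the change
jack_rooms.append(305)

# A copy is a separate object; changes to it do not reach the dictionary
jack_copy = list(d['Jack'])
jack_copy.append(306)
```

Integers are immutable, so the only way to change Anna's room is to assign a new value to `d['Anna']`; lists are mutable, so Jack's rooms can change without any assignment to `d['Jack']`.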

  36. (simplifying representation of keys)
      d = {'Anna': 104, 'Jack':[303, 304]}
      [Diagram]
      d
        Key: 'Anna'  Ref: <address1>
        Key: 'Jack'  Ref: <address2>
      <address1>  Type: Integer  Data: 104
      <address2>  Type: List     Data: [303, 304]

  37. d = {'Anna': 104, 'Jack':[303, 304]}
      anna_room = d['Anna']
      anna_room = 105
      d['Anna'] = anna_room   # 1
      [Diagram, before -> after]
      d
        Key: 'Anna'  Ref: <address1> -> <address3>
        Key: 'Jack'  Ref: <address2>
      Name: anna_room  Ref: <address1> -> <address3>
      <address1>  Type: Integer  Data: 104
      <address2>  Type: List     Data: [303, 304]
      <address3>  Type: Integer  Data: 105

  38. d = {'Anna': 104, 'Jack':[303, 304]}
      jack_rooms = d['Jack']
      jack_rooms.append(305)
      # d['Jack'] = jack_rooms   # NOT NEEDED
      [Diagram, before -> after]
      d
        Key: 'Anna'  Ref: <address1>
        Key: 'Jack'  Ref: <address2>
      Name: jack_rooms  Ref: <address2>
      <address1>  Type: Integer  Data: 104
      <address2>  Type: List     Data: [303, 304] -> [303, 304, 305]

  39. But suppose you begin with an empty dictionary.
      • The assignment is necessary because, when a key is first encountered, get() returns a new empty list that is bound only to the variable rooms. That list is not yet stored as the value for the key name.
      • On later encounters of a key already in the dictionary, executing the last line does not change the structure of the dictionary.
      a = [('Jack', 303), ('Jack', 304)]
      d = {}
      for (name, room) in a:
          rooms = d.get(name, [])
          rooms.append(room)
          d[name] = rooms  # necessary
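One way to avoid the explicit assignment is `dict.setdefault`, which stores the default under the key the first time the key is seen and returns the stored list on every call. This is an alternative sketch and does not come from the lecture; the ('Anna', 104) pair is added to the slide's data only to show a second key.

```python
# Room assignments; the 'Anna' entry is an added, hypothetical example
a = [('Jack', 303), ('Jack', 304), ('Anna', 104)]

d = {}
for (name, room) in a:
    # setdefault inserts [] for a new key, then returns the list held in d,
    # so append() always mutates the list the dictionary refers to
    d.setdefault(name, []).append(room)
```

Unlike `d.get(name, [])`, the list returned by `setdefault` is already a dictionary value, so nothing is lost if the assignment step is left out.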
