
Programming for Linguists





  1. Programming for Linguists: An Introduction to Python, 08/12/2011

  2. Ex 1) Write a script that reads 5 words typed in by a user and tells the user which word is the shortest and which is the longest

  3. Ex 1)
      def word_length( ):
          count = 5
          list1 = [ ]
          while count > 0:
              s = raw_input("Please enter a word ")
              list1.append(s)
              count = count - 1
          longest = list1[0]
          shortest = list1[0]
          for word in list1:
              if len(word) > len(longest):
                  longest = word
              elif len(word) < len(shortest):
                  shortest = word
          print shortest, "is the shortest word."
          print longest, "is the longest word."

  4. Ex 2) Write a function that takes a sentence as an argument and calculates the average word length of the words in that sentence

  5. Ex 2)
      def awl(sent):
          wlist = [ ]
          sentence = sent.split( )
          for word in sentence:
              wlist.append(len(word))
          mean = sum(wlist) / float(len(wlist))
          print "The average word length is", mean

      awl("this is a test sentence")

  6. Ex 3) Take a short text of about 5 sentences. Write a script that splits the text into sentences (tip: use the punctuation as boundaries) and calculates the average sentence length, the average word length and the standard deviation for both values

  7. Ex 3)
      import re

      def mean(values):
          return sum(values) / float(len(values))

      def SD(values):
          devs = [ ]
          for item in values:
              std = (item - mean(values)) ** 2
              devs.append(std)
          return (sum(devs) / float(len(devs))) ** 0.5

  8.  def statistics(sent):
          asl = [ ]
          awl = [ ]
          sentences = re.split(r'[.!?]', sent)
          for sentence in sentences[:-1]:
              sentence = re.sub(r'\W+', ' ', sentence)
              tokens = sentence.split( )
              asl.append(len(tokens))
              for token in tokens:
                  awl.append(len(token))
          print mean(asl), SD(asl)
          print mean(awl), SD(awl)

      statistics("sentences")

  9. Dictionaries
  • Like a list, but more general
  • In a list the index has to be an integer, e.g. words[4]
  • In a dictionary the index can be almost any type
  • A dictionary is like a mapping between 2 sets: keys and values
  • function: dict( )
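  A minimal sketch (Python 2, as in these slides; the names are illustrative) showing that dictionary keys need not be integers:

      pos = dict( )                  # same as pos = { }
      pos["colorless"] = "ADJ"       # a string used as a key
      pos[("green", "ideas")] = 2    # a tuple can be a key as well
      print pos["colorless"]         # prints: ADJ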

  10. To create an empty list: list1 = [ ]
  • To create an empty dictionary: dictionary = { }
  • For example, a dictionary containing English and Spanish words:
      eng2sp = { }
      eng2sp['one'] = 'uno'
      print eng2sp
      {'one': 'uno'}

  11. In this case both the keys and the values are of the string type
  • As with lists, you can create dictionaries yourself, e.g.
      eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
      print eng2sp
  • Note: in general, the order of items in a dictionary is unpredictable

  12. You can use the keys to look up the corresponding values, e.g.
      print eng2sp['two']
  • The key 'two' always maps to the value 'dos', so the order of the items does not matter
  • If the key is not in the dictionary, you get an error message, e.g.
      print eng2sp['ten']
      KeyError: 'ten'

  13. The len( ) function returns the number of key-value pairs:
      len(eng2sp)
  • The in operator tells you whether something appears as a key in the dictionary:
      'one' in eng2sp
      True
  • BUT:
      'uno' in eng2sp
      False

  14. To see whether something appears as a value in a dictionary, you can use the values( ) function, which returns the values as a list, and then use the in operator, e.g.
      'uno' in eng2sp.values( )
      True

  15. A Dictionary as a Set of Counters
  • Suppose you want to count the number of times each letter occurs in a string. You could:
  • create 26 variables, traverse the string and, for each letter, add 1 to the corresponding counter
  • create a dictionary with letters as keys and counters as the corresponding values

  16. def frequencies(sent):
          freq_dict = { }
          for let in sent:
              if let not in freq_dict:
                  freq_dict[let] = 1
              else:
                  freq_dict[let] += 1
          return freq_dict

      frequencies("abracadabra")

  17. The first line of the function creates an empty dictionary
  • The for loop traverses the string
  • Each time through the loop, if the letter is not in the dictionary, we create a new key with the initial value 1
  • If the letter is already in the dictionary, we add 1 to its corresponding value
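  For instance, calling the function on the example string yields one count per letter (key order may differ from run to run):

      print frequencies("abracadabra")
      {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1}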

  18. Write a function that uses a dictionary to count the word frequencies in a sentence instead of the letter frequencies

  19. def words(sent):
          word_freq = { }
          wordlist = sent.split( )
          for word in wordlist:
              if word not in word_freq:
                  word_freq[word] = 1
              else:
                  word_freq[word] += 1
          return word_freq

      words("this is is a a test sentence")

  20. Reverse Lookup
  • Given a dictionary "word_freq" and a key "is", finding the corresponding value is: word_freq["is"]
  • This operation is called a lookup
  • What if you know the value and want to look up the corresponding key?

  21. Previous example:
      def words(sent):
          word_freq = { }
          wordlist = sent.split( )
          for word in wordlist:
              if word not in word_freq:
                  word_freq[word] = 1
              else:
                  word_freq[word] += 1
          return word_freq

      w_fr = words("this is is a a test sentence")

  22. Write a function which takes as arguments the dictionary w_fr and a number nr (the number of times a word occurs in the sentence) and returns a list of the words that occur nr times, or reports "There are no words in the sentence that occur nr times."

  23. def reverse_lookup(w_fr, nr):
          list1 = [ ]
          for word in w_fr:
              if w_fr[word] == nr:
                  list1.append(word)
          if len(list1) > 0:
              return list1
          else:
              print "There are no words in the sentence that occur", nr, "times."
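  Trying it out on the dictionary w_fr from slide 21 (list order may vary):

      print reverse_lookup(w_fr, 2)    # ['is', 'a']
      reverse_lookup(w_fr, 5)          # prints the "no words" message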

  24. Sorting a Dictionary According to its Values
  • First you need to import itemgetter: from operator import itemgetter
  • To go over each item in a dictionary you can use .iteritems( )
  • To sort the dictionary according to the values, you need to use key = itemgetter(1)
  • To sort it in decreasing order: reverse = True

  25. from operator import itemgetter

      def words(s):
          w_fr = { }
          wordlist = s.split( )
          for word in wordlist:
              if word not in w_fr:
                  w_fr[word] = 1
              else:
                  w_fr[word] += 1
          h = sorted(w_fr.iteritems( ), key = itemgetter(1), reverse = True)
          return h
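  Calling it on the earlier example sentence returns a list of (word, frequency) pairs, most frequent first (ties may come out in any order):

      print words("this is is a a test sentence")
      [('is', 2), ('a', 2), ('this', 1), ('test', 1), ('sentence', 1)]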

  26. Inverting Dictionaries
  • It could be useful to invert a dictionary: keys and values switch places
      def invert_dict(d):
          inv = { }
          for key in d:
              value = d[key]
              if value not in inv:
                  inv[value] = [key]
              else:
                  inv[value].append(key)
          return inv
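  For example, inverting a small frequency dictionary groups the words by their counts (order within the lists may vary):

      print invert_dict({'this': 1, 'is': 2, 'a': 2, 'test': 1})
      {1: ['this', 'test'], 2: ['is', 'a']}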

  27. But: lists can be values, but never keys! Keys must be hashable, and lists are mutable, so they cannot be hashed.
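  A quick illustration of the difference, using a tuple (which is immutable) next to a list:

      d = { }
      d[('a', 'list')] = 1     # fine: tuples can be keys
      d[['a', 'list']] = 1     # TypeError: unhashable type: 'list'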

  28. Getting Started with NLTK
  • In IDLE:
      import nltk
      nltk.download( )

  29. Searching Texts
  • Start your script with importing all texts in NLTK:
      from nltk.book import *
  • text1: Moby Dick by Herman Melville 1851
  • text2: Sense and Sensibility by Jane Austen 1811
  • text3: The Book of Genesis
  • text4: Inaugural Address Corpus
  • text5: Chat Corpus
  • text6: Monty Python and the Holy Grail
  • text7: Wall Street Journal
  • text8: Personals Corpus
  • text9: The Man Who Was Thursday by G. K. Chesterton 1908

  30. Any time you want to find out about these texts, just enter their names at the Python prompt:
      >>> text1
      <Text: Moby Dick by Herman Melville 1851>
  • A concordance view shows every occurrence of a given word, together with some context, e.g. "monstrous" in Moby Dick:
      text1.concordance("monstrous")

  31. Try looking up the context of "lol" in the chat corpus (text5)
  • If you have a corpus that contains texts that are spread over time, you can look up how some words are used differently over time, e.g. the Inaugural Address Corpus (dates back to 1789): words like "nation", "terror", "God"…

  32. You can also examine what other words appear in a similar context, e.g.
      text1.similar("monstrous")
  • common_contexts( ) allows you to examine the contexts that are shared by two or more words, e.g.
      text1.common_contexts(["very", "monstrous"])

  33. You can also determine the location of a word in the text
  • This positional information can be displayed using a dispersion plot
  • Each stripe represents an instance of a word, and each row represents the entire text, e.g.
      text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

  34. Counting Tokens
  • To count the number of tokens (words + punctuation marks), just use the len( ) function, e.g. len(text5)
  • To count the number of unique tokens, you first have to make a set, e.g. set(text5)

  35. If you want them sorted alphabetically, try this:
      sorted(set(text5))
  • Note: in Python all capitalized words precede lowercase words (you can use .lower( ) first to avoid this)
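  A sketch of that normalization step, lowercasing every token before making the set:

      tokens = sorted(set(token.lower( ) for token in text5))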

  36. Now you can calculate the lexical diversity of a text, e.g. the chat corpus (text5):
      45010 tokens
      6066 unique tokens or types
  • The lexical diversity = nr of types / nr of tokens
  • Use the Python functions to calculate the lexical diversity of text5

  37. len(set(text5))/float(len(text5))
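  The same calculation wrapped as a reusable helper (a sketch; the function name is our own):

      def lexical_diversity(text):
          # number of types divided by number of tokens
          return len(set(text)) / float(len(text))

      print lexical_diversity(text5)    # about 0.13 for the chat corpus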

  38. Frequency Distributions
  • To find the n most frequent tokens: FreqDist( ), e.g.
      fdist = FreqDist(text1)
      fdist["have"]
      760
      all_tokens = fdist.keys( )
      all_tokens[:50]
  • The function .keys( ) combined with FreqDist( ) also gives you a list of all the unique tokens in the text

  39. Frequency distributions can be informative, BUT the most frequent words usually are function words (the, of, and, …)
  • What proportion of the text is taken up with such words? Cumulative frequency plot:
      fdist.plot(50, cumulative=True)

  40. If frequent tokens do not give enough information, what about infrequent tokens?
  • Hapaxes = tokens which occur only once:
      fdist.hapaxes( )
  • Without their context, you do not get much information either

  41. Fine-grained Selection of Tokens
  • Extract tokens of a certain minimum length:
      tokens = set(text1)
      long_tokens = [ ]
      for token in tokens:
          if len(token) >= 15:
              long_tokens.append(token)
  • OR:
      long_tokens = list(token for token in tokens if len(token) >= 15)

  42. BUT: very long words are often hapaxes
  • You can also extract frequently occurring long words of a certain length:
      words = set(text1)
      fdist = FreqDist(text1)
      freq_long_words = list(word for word in words if len(word) >= 7 and fdist[word] >= 7)

  43. Collocations and Bigrams
  • A collocation is a sequence of words that occur together unusually often, e.g. "red wine" is a collocation, "yellow wine" is not
  • Collocations are essentially just frequent bigrams (word pairs), but you can find bigrams that occur more often than is to be expected based on the frequency of the individual words:
      text8.collocations( )
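  Under the hood, collocations start from plain bigrams; a sketch of extracting and counting bigrams yourself with nltk.bigrams (the toy sentence is just an illustration):

      import nltk

      tokens = "the quick fox and the quick dog".split( )
      pairs = list(nltk.bigrams(tokens))   # [('the', 'quick'), ('quick', 'fox'), …]
      fdist = nltk.FreqDist(pairs)
      print fdist.max( )                   # ('the', 'quick'), the most frequent pair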

  44. Some Functions for NLTK's Frequency Distributions
  • fdist = FreqDist(samples)  create a frequency distribution
  • fdist["word"]  count of "word"
  • fdist.freq("word")  relative frequency of "word"
  • fdist.N( )  total number of samples
  • fdist.keys( )  the samples sorted in order of decreasing frequency
  • for sample in fdist:  iterates over the samples in order of decreasing frequency

  45. fdist.max( )  sample with the greatest count
  • fdist.plot( )  graphical plot of the frequency distribution
  • fdist.plot(cumulative=True)  cumulative plot of the frequency distribution
  • fdist1 < fdist2  tests if the samples in fdist1 occur less frequently than in fdist2
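  A short demonstration of these methods on a toy distribution (expected values shown in the comments):

      from nltk import FreqDist

      fdist = FreqDist("abracadabra")   # counts the letters of the string
      print fdist.N( )                  # 11, the total number of samples
      print fdist.max( )                # 'a', the sample with the greatest count
      print fdist['a']                  # 5
      print fdist.freq('a')             # 5/11.0, about 0.45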

  46. Accessing Corpora
  • NLTK also contains entire corpora, e.g.:
  • Brown Corpus
  • NPS Chat
  • Gutenberg Corpus
  • …
  • A complete list can be found on http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

  47. Each of these corpora contains dozens of individual texts
  • To see which files are e.g. in the Gutenberg corpus in NLTK:
      nltk.corpus.gutenberg.fileids( )
  • Do not forget the dot notation nltk.corpus. This tells Python the location of the corpus

  48. You can use the dot notation to work with a corpus from NLTK, or you can import a corpus at the beginning of your script:
      from nltk.corpus import gutenberg
  • After that you just have to use the name of the corpus and the dot notation before a function:
      gutenberg.fileids( )

  49. If you want to examine a particular text, e.g. Shakespeare's Hamlet, you can use the .words( ) function:
      hamlet = gutenberg.words("shakespeare-hamlet.txt")
  • Note that "shakespeare-hamlet.txt" is the file name that is to be found using the previous .fileids( ) function
  • You can use some of the previously mentioned functions (corpus methods) on this text, e.g.
      fdist_hamlet = FreqDist(hamlet)

  50. Some Corpus Methods in NLTK
  • brown.raw( )  raw data from the corpus file(s)
  • brown.categories( )  fileids( ) grouped per predefined category
  • brown.words( )  a list of words and punctuation tokens
  • brown.sents( )  words( ) grouped into sentences
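  Putting a few of these methods together, a sketch (along the lines of the NLTK book) that prints the average word and sentence length of each Gutenberg file:

      from nltk.corpus import gutenberg

      for fileid in gutenberg.fileids( ):
          n_chars = len(gutenberg.raw(fileid))
          n_words = len(gutenberg.words(fileid))
          n_sents = len(gutenberg.sents(fileid))
          # integer division in Python 2: average word length, average sentence length
          print fileid, n_chars / n_words, n_words / n_sents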
