Programming for Linguists

Programming for Linguists: An Introduction to Python (13/12/2012)

Dictionaries
• Like a list, but more general
• In a list the index has to be an integer, e.g. words[4]
• In a dictionary the index can be almost any type
• A dictionary is like a mapping between two sets: keys and values

• To create an empty list (avoid calling it list, which would hide Python's built-in list function):
    my_list = []
• To create an empty dictionary:
    dictionary = {}
• E.g. a dictionary containing English and Spanish words:
    >>> eng2sp = {}
    >>> eng2sp['one'] = 'uno'
    >>> print eng2sp
    {'one': 'uno'}

• In this case both the keys and the values are of the string type
• As with lists, you can create dictionaries yourself, e.g.
    eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
    print eng2sp
• Note: in general, the order of items in a dictionary is unpredictable

• You can use the keys to look up the corresponding values, e.g.
    >>> print eng2sp['two']
    dos
• The key 'two' always maps to the value 'dos', so the order of the items does not matter
• If the key is not in the dictionary you get an error message, e.g.
    >>> print eng2sp['ten']
    KeyError: 'ten'

• The len() function returns the number of key-value pairs:
    len(eng2sp)
• The in operator tells you whether something appears as a key in the dictionary:
    >>> 'one' in eng2sp
    True
• BUT
    >>> 'uno' in eng2sp
    False

• To see whether something appears as a value in a dictionary, you can use the values() method, which returns the values as a list, and then use the in operator, e.g.
    >>> 'uno' in eng2sp.values()
    True
• Lists can be values, but never keys!
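The difference between searching keys and searching values can be tried out in a few lines. This is a minimal sketch that reuses the eng2sp dictionary from the slides; the variable names key_hit, value_as_key, value_hit and list_key_allowed are only illustrative. It is written so that it should run under Python 3 as well as the Python 2 used in this course.

```python
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

key_hit = 'one' in eng2sp             # True: 'one' is a key
value_as_key = 'uno' in eng2sp        # False: 'uno' is a value, not a key
value_hit = 'uno' in eng2sp.values()  # True: the values are searched explicitly

# Lists are mutable, so Python refuses to use them as dictionary keys.
try:
    eng2sp[['four', 'vier']] = 'cuatro'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```

After running this, list_key_allowed is False and eng2sp is unchanged, because the assignment with a list as key raises a TypeError.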

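Default dictionary
• Try this:
    words = ['een', 'twee', 'drie']
    frequencyDict = {}
    for w in words:
        frequencyDict[w] += 1
• This raises a KeyError: frequencyDict[w] += 1 needs an existing value to add to, but the key is not in the dictionary yet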

• Possible solution:
    for w in words:
        if w in frequencyDict:
            frequencyDict[w] += 1
        else:
            frequencyDict[w] = 1

• The easy solution:
    >>> from collections import defaultdict
    >>> frequencyDict = defaultdict(int)
    >>> for w in words:
    ...     frequencyDict[w] += 1
• You can use int, float, str, … in the defaultdict
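To see what the different types do inside a defaultdict, the sketch below may help. Each type is called without arguments to produce the starting value for a missing key. The variable names are illustrative, and the use of list as a fourth factory goes slightly beyond the types named on the slide.

```python
from collections import defaultdict

# The factory is called with no arguments to create the default value.
counts = defaultdict(int)     # int()   gives 0
weights = defaultdict(float)  # float() gives 0.0
labels = defaultdict(str)     # str()   gives ''

counts['een'] += 1
weights['een'] += 0.5
labels['een'] += 'num'

# list works as a factory too: a missing key starts as an empty list.
by_length = defaultdict(list)
for w in ['een', 'twee', 'drie']:
    by_length[len(w)].append(w)
```

Here by_length groups the words by their length, so by_length[4] holds 'twee' and 'drie'.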

A Dictionary as a Set of Counters
• Suppose you want to count the number of times each letter occurs in a string. You could:
  • create 26 variables, traverse the string and, for each letter, add 1 to the corresponding counter
  • create a dictionary with letters as keys and counters as the corresponding values

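    from collections import defaultdict
    def frequencies(sent):
        freq_dict = defaultdict(int)    # missing letters start at 0
        for let in sent:
            freq_dict[let] += 1
        return freq_dict
    dictA = frequencies("abracadabra")
    list_keys = dictA.keys()
    list_values = dictA.values()
    z_value = dictA['z']    # 'z' does not occur in the string, so this is 0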

• The first line of the function creates an empty default dictionary
• The for loop traverses the string
• Each time through the loop, if the letter is not yet in the dictionary, a new key is created with the default value 0, which is immediately increased to 1
• If the letter is already in the dictionary, we add 1 to its corresponding value
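One consequence of using a defaultdict is worth checking by hand: looking up a missing key with square brackets stores that key. The sketch below repeats the frequencies function and compares the size of the dictionary before and after asking for 'z'. The names size_before, size_after and safe_value are illustrative.

```python
from collections import defaultdict

def frequencies(sent):
    """Count how often each character occurs in sent."""
    freq_dict = defaultdict(int)
    for let in sent:
        freq_dict[let] += 1
    return freq_dict

dictA = frequencies("abracadabra")
size_before = len(dictA)        # 5 distinct letters: a, b, r, c, d
z_value = dictA['z']            # 0, and 'z' is now stored as a key
size_after = len(dictA)         # 6
safe_value = dictA.get('q', 0)  # 0, without adding 'q' to the dictionary
```

If you only want to inspect a count without changing the dictionary, the get() method shown on the last line is the safer choice.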

Write a function that uses a dictionary to count the word frequencies in a sentence instead of the letter frequencies

    from collections import defaultdict
    def words(sent):
        word_freq = defaultdict(int)
        wordlist = sent.split()
        for word in wordlist:
            word_freq[word] += 1
        return word_freq
    words("this is is a a test sentence")

Dictionary Lookup
• Given a dictionary "word_freq" and a key "is", finding the corresponding value:
    word_freq["is"]
• This operation is called a lookup
• What if you know the value and want to look up the corresponding key?
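One possible answer to the last question is to go through all keys and collect those whose value matches. The function name reverse_lookup is an assumption made for this sketch, not something built into Python. Because several keys can share one value, it returns a list.

```python
def reverse_lookup(d, value):
    """Return a list of all keys in d that map to value."""
    return [key for key in d if d[key] == value]

word_freq = {'this': 1, 'is': 2, 'a': 2, 'test': 1, 'sentence': 1}
keys_for_2 = sorted(reverse_lookup(word_freq, 2))
keys_for_9 = reverse_lookup(word_freq, 9)
```

Note that a normal lookup goes straight to the key, whereas this reverse lookup has to inspect every item, so it becomes slow for large dictionaries.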

Sorting a Dictionary According to its Values
• First you need to import itemgetter:
    from operator import itemgetter
• To go over each item in a dictionary you can use .iteritems(), which yields (key, value) pairs
• To sort the dictionary according to the values, you need to use key = itemgetter(1), which picks the second element (the value) of each pair
• To sort it in decreasing order: reverse = True

    >>> from collections import defaultdict
    >>> from operator import itemgetter
    >>> def getValues(sent):
    ...     w_fr = defaultdict(int)
    ...     wordlist = sent.split()
    ...     for word in wordlist:
    ...         w_fr[word] += 1
    ...     byVals = sorted(w_fr.iteritems(), key=itemgetter(1), reverse=True)
    ...     return byVals
    ...
    >>> getValues('this is a a a sentence')

Write a function that takes a sentence as an argument and returns all words that occur only once in the sentence.

    from collections import defaultdict
    def getHapax(sent):
        words = sent.split()
        freqs = defaultdict(int)
        for w in words:
            freqs[w] += 1
        hapaxlist = []
        for item in freqs:
            value = freqs[item]
            if value == 1:
                hapaxlist.append(item)
        return hapaxlist

Getting Started with NLTK
• In IDLE:
    import nltk
    nltk.download()

Searching Texts
• Start your script by importing all texts in NLTK:
    from nltk.book import *
• text1: Moby Dick by Herman Melville 1851
• text2: Sense and Sensibility by Jane Austen 1811
• text3: The Book of Genesis
• text4: Inaugural Address Corpus
• text5: Chat Corpus
• text6: Monty Python and the Holy Grail
• text7: Wall Street Journal
• text8: Personals Corpus
• text9: The Man Who Was Thursday by G . K . Chesterton 1908

• Any time you want to find out about these texts, just enter their names at the Python prompt:
    >>> text1
    <Text: Moby Dick by Herman Melville 1851>
• A concordance view shows every occurrence of a given word, together with some context, e.g. "monstrous" in Moby Dick:
    text1.concordance("monstrous")
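To get a feeling for what a concordance does, here is a rough sketch in plain Python. It is not NLTK's implementation: the function name concordance, the width parameter and the case-insensitive matching are assumptions made for illustration, and the token list is a made-up example.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of word with up to `width` tokens of context on both sides."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [tok] + right))
    return lines

tokens = "the monstrous whale and the most monstrous size of it".split()
hits = concordance(tokens, "monstrous", width=2)
for line in hits:
    print(line)
```

Each line of the result shows one occurrence of the word with its neighbours, which is the same idea as the display produced by text1.concordance().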

• Try looking up the context of "lol" in the chat corpus (text5)
• If you have a corpus that contains texts that are spread over time, you can look up how some words are used differently over time, e.g. the Inaugural Address Corpus (dates back to 1789): words like "nation", "terror", "God", …

• You can also examine what other words appear in a similar context, e.g.
    text1.similar("monstrous")
• common_contexts() allows you to examine the contexts that are shared by two or more words, e.g.
    text1.common_contexts(["very", "monstrous"])

• You can also determine the location of a word in the text
• This positional information can be displayed using a dispersion plot
• Each stripe represents an instance of a word, and each row represents the entire text, e.g.
    text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
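The positional information behind such a plot is simply a list of token offsets. The sketch below collects them in plain Python; the function name positions and the short token list are illustrative assumptions, and no plotting is done.

```python
def positions(tokens, word):
    """Return the offsets (counted from 0) at which word occurs in tokens."""
    return [i for i, tok in enumerate(tokens) if tok == word]

tokens = "we the citizens value freedom and the citizens defend freedom".split()
citizens_at = positions(tokens, "citizens")
freedom_at = positions(tokens, "freedom")
democracy_at = positions(tokens, "democracy")
```

Every offset in such a list corresponds to one stripe in the row for that word; a word that never occurs gives an empty list and therefore an empty row.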

Counting Tokens
• To count the number of tokens (words + punctuation marks), just use the len() function, e.g.
    len(text5)
• To count the number of unique tokens, you first have to make a set, e.g. set(text5), and then take its length:
    len(set(text5))

• If you want them sorted alphabetically, try this:
    sorted(set(text5))
• Note: Python sorts all capitalized words before all lowercase words (you can use .lower() first to avoid this)
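The effect of capital letters on sorting is easy to see with a small made-up token list. This sketch compares a plain sort with a sort after lowercasing; the variable names are illustrative.

```python
tokens = ['Zebra', 'apple', 'Apple', 'banana', 'apple']

# Plain sort: every capitalized word comes before every lowercase word.
case_sensitive = sorted(set(tokens))

# Lowercase first, then build the set and sort.
lowered = sorted(set(t.lower() for t in tokens))

print(case_sensitive)
print(lowered)
```

Lowercasing also merges 'Apple' and 'apple' into one type, so the second list is shorter than the first.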

• Now you can calculate the lexical diversity of a text, e.g. the chat corpus (text5):
    45010 tokens
    6066 unique tokens or types
• The lexical diversity = number of types / number of tokens
• Use the Python functions to calculate the lexical diversity of text5

len(set(text5))/float(len(text5))
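The same calculation can be wrapped in a small function so that it works for any list of tokens. The function name lexical_diversity is chosen for this sketch. The example sentence is the one used earlier in these slides, and the last line redoes the chat corpus calculation with the two numbers given above.

```python
def lexical_diversity(tokens):
    """Number of types divided by number of tokens."""
    # float() avoids integer division in Python 2.
    return len(set(tokens)) / float(len(tokens))

tokens = "this is is a a test sentence".split()
diversity = lexical_diversity(tokens)  # 5 types / 7 tokens

chat_ratio = 6066 / float(45010)       # numbers for text5 from the slide
print(round(chat_ratio, 3))
```

For the chat corpus this comes to roughly 0.135, i.e. about 13.5% of the tokens are distinct types.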

Frequency Distributions
• To find the n most frequent tokens: FreqDist(), e.g.
    >>> fdist = FreqDist(text1)
    >>> fdist["have"]
    760
    >>> all_tokens = fdist.keys()
    >>> all_tokens[:50]
• The function .keys() combined with FreqDist() also gives you a list of all the unique tokens in the text

• Frequency distributions can be informative, BUT the most frequent words usually are function words (the, of, and, …)
• What proportion of the text is taken up with such words? Cumulative frequency plot:
    fdist.plot(50, cumulative=True)

• If frequent tokens do not give enough information, what about infrequent tokens?
• Hapaxes = tokens which occur only once:
    fdist.hapaxes()
• Without their context, you do not get much information either

Fine-grained Selection of Tokens
• Extract tokens of a certain minimum length:
    tokens = set(text1)
    long_tokens = []
    for token in tokens:
        if len(token) >= 15:
            long_tokens.append(token)
    # OR shorter:
    long_tokens = list(token for token in tokens if len(token) >= 15)

• BUT: very long words are often hapaxes
• You can also extract frequently occurring long words of a certain length:
    words = set(text1)
    fdist = FreqDist(text1)
    # short version
    freq_long_words = list(word for word in words if len(word) >= 7 and fdist[word] >= 7)

Collocations and Bigrams
• A collocation is a sequence of words that occur together unusually often, e.g. "red wine" is a collocation, "yellow wine" is not
• Collocations are essentially just frequent bigrams (word pairs), but you can find bigrams that occur more often than is to be expected based on the frequency of the individual words:
    text8.collocations()
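The idea of "more often than expected" can be made concrete with a small sketch. It counts bigrams and compares their observed frequency with the frequency expected if the two words were independent, using pointwise mutual information (PMI). This is one common scoring method chosen for illustration; the scoring that text8.collocations() uses internally may differ. The function name bigram_pmi, the min_count parameter and the toy sentence are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score bigrams that occur at least min_count times by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # pairs of neighbouring tokens
    n = float(len(tokens))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count >= min_count:
            # Expected probability if w1 and w2 were independent.
            expected = (unigrams[w1] / n) * (unigrams[w2] / n)
            # Observed probability among the n - 1 bigram positions.
            observed = count / (n - 1)
            scores[(w1, w2)] = math.log(observed / expected, 2)
    return scores

tokens = "red wine is red wine and the wine is the best".split()
scores = bigram_pmi(tokens)
```

A positive score means the pair occurs more often than the individual word frequencies would predict. On a text this short the scores are not meaningful; the point is only to show the calculation.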

Some Functions for NLTK's Frequency Distributions
    fdist = FreqDist(samples)    creates a frequency distribution from the given samples
    fdist["word"]                count of "word" (number of occurrences)
    fdist.freq("word")           relative frequency of "word" (count divided by fdist.N())
    fdist.N()                    total number of samples
    fdist.keys()                 the samples sorted in order of decreasing frequency
    for sample in fdist:         iterates over the samples in order of decreasing frequency

    fdist.max()                    sample with the greatest count
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 < fdist2                tests if the samples in fdist1 occur less frequently than in fdist2
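The difference between a count and a relative frequency can be tried without NLTK. The sketch below uses Counter from Python's standard library as a stand-in for FreqDist, so the calls are not the NLTK ones; the comments name the FreqDist function each line corresponds to. The token list is made up.

```python
from collections import Counter

tokens = "the whale the sea the ship a whale".split()
fdist = Counter(tokens)                     # stand-in for FreqDist(tokens)

count_the = fdist['the']                    # count, like fdist['the']
total = sum(fdist.values())                 # total number of samples, like fdist.N()
rel_freq_the = count_the / float(total)     # proportion, like fdist.freq('the')
most_frequent = fdist.most_common(1)[0][0]  # sample with the greatest count, like fdist.max()
```

Here 'the' has a count of 3 out of 8 tokens, so its relative frequency is 0.375.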

Accessing Corpora
• NLTK also contains entire corpora, e.g.:
  • Brown Corpus
  • NPS Chat
  • Gutenberg Corpus
  • …
• A complete list can be found on http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

• Each of these corpora contains dozens of individual texts
• To see which files are in e.g. the Gutenberg corpus in NLTK:
    nltk.corpus.gutenberg.fileids()
• Do not forget the dot notation nltk.corpus. This tells Python the location of the corpus

• You can use the dot notation to work with a corpus from NLTK, or you can import a corpus at the beginning of your script:
    from nltk.corpus import gutenberg
• After that you just have to use the name of the corpus and the dot notation before a function:
    gutenberg.fileids()

• If you want to examine a particular text, e.g. Shakespeare's Hamlet, you can use the .words() function:
    hamlet = gutenberg.words("shakespeare-hamlet.txt")
• Note that "shakespeare-hamlet.txt" is the file name found using the previous .fileids() function
• You can use some of the previously mentioned functions on this text, e.g.
    fdist_hamlet = FreqDist(hamlet)

Some Corpus Methods in NLTK
    brown.raw()           raw data from the corpus file(s)
    brown.categories()    fileids() grouped per predefined categories
    brown.words()         a list of words and punctuation tokens
    brown.sents()         words() grouped into sentences

    brown.tagged_words()       a list of (word, tag) pairs
    brown.tagged_sents()       tagged_words() grouped into sentences
    treebank.parsed_sents()    a list of parse trees

    def statistics(corpus):
        for fileid in corpus.fileids():
            nr_chars = len(corpus.raw(fileid))
            nr_words = len(corpus.words(fileid))
            nr_sents = len(corpus.sents(fileid))
            nr_vocab = len(set([word.lower() for word in corpus.words(fileid)]))
            # In Python 2, / between two integers rounds down to a whole number.
            # nr_words/nr_vocab is tokens per type, the inverse of the types/tokens ratio defined earlier.
            print fileid, "average word length: ", nr_chars/nr_words, "average sentence length: ", nr_words/nr_sents, "lexical diversity: ", nr_words/nr_vocab

• Some corpora contain several subcategories, e.g. the Brown Corpus contains "news", "religion", …
• You can optionally specify these particular categories or files from a corpus, e.g.:
    from nltk.corpus import brown
    brown.categories()
    brown.words(categories='news')
    brown.words(fileids=['cg22'])
    brown.sents(categories=['news', 'editorial', 'reviews'])

• Some linguistic research: comparing genres in the Brown corpus in their usage of modal verbs
    import nltk
    from nltk.corpus import brown
    cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories=genre))
    # Type the call above as a single line: do not press Enter before the for clauses!

    genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
    modal_verbs = ['can', 'could', 'may', 'might', 'must', 'will']
    cfd.tabulate(conditions=genres, samples=modal_verbs)

                        can  could    may  might   must   will
    news                 93     86     66     38     50    389
    religion             82     59     78     12     54     71
    hobbies             268     58    131     22     83    264
    science_fiction      16     49      4     12      8     16
    romance              74    193     11     51     45     43
    humor                16     30      8      8      9     13
• A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition"
• The condition is usually the category of the text (news, religion, …)
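The idea of "one frequency distribution per condition" can be imitated with the dictionaries from the first part of this session. The sketch below is a stand-in built from defaultdict and Counter, not NLTK's ConditionalFreqDist, and the (genre, word) pairs are made up rather than taken from the Brown Corpus.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs, in the same shape as (genre, word).
pairs = [('news', 'will'), ('news', 'can'), ('news', 'will'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'can')]

# One Counter per condition: the outer key is the genre, the inner key the word.
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

news_will = cfd['news']['will']
romance_could = cfd['romance']['could']
news_could = cfd['news']['could']  # a Counter gives 0 for a word it has not seen
```

Each row of the table above corresponds to one inner counter, and each column to one word looked up in it.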

Loading Your Own Text or Corpus
• Make sure that the texts/files of your corpus are in plain-text format (convert them, do not just change the file extensions from e.g. .docx to .txt)
• Make a folder with the name of your corpus which contains all the text files
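Once the folder exists, its files can be read with Python's standard library, as in the rough sketch below. NLTK also provides a PlaintextCorpusReader class for this purpose; the version here deliberately avoids NLTK and splits on whitespace only, so punctuation stays attached to the words. The function name load_corpus, the file name sample.txt and the temporary folder are assumptions for the demonstration.

```python
import glob
import os
import tempfile

def load_corpus(folder):
    """Map each .txt file name in folder to its list of whitespace-separated tokens."""
    corpus = {}
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with open(path) as handle:
            corpus[os.path.basename(path)] = handle.read().split()
    return corpus

# Demonstration with a temporary folder that holds one small text file.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'sample.txt'), 'w') as handle:
    handle.write('this is a test')

corpus = load_corpus(folder)
print(sorted(corpus.keys()))
```

In your own script you would replace the temporary folder with the path to your corpus folder.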
