
Programming for Linguists



  1. Programming for Linguists An Introduction to Python 22/12/2011

  2. Feedback • Ex. 1) Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of “men”, “women”, and “people” in each document. What has happened to the usage of these words over time?

  3. import nltk
     from nltk.corpus import state_union
     cfd = nltk.ConditionalFreqDist((fileid, word)
         for fileid in state_union.fileids( )
         for word in state_union.words(fileids = fileid))
     fileids = state_union.fileids( )
     search_words = ["men", "women", "people"]
     cfd.tabulate(conditions = fileids, samples = search_words)

  4. Ex 2) According to Strunk and White's Elements of Style, the word “however”, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in 5 NLTK texts.

  5. import nltk
     from nltk.book import *
     texts = [text1, text2, text3, text4, text5]
     for text in texts:
         # concordance( ) prints its own output, so no print is needed
         text.concordance("however")

  6. Ex 3) Create a corpus of your own of minimum 10 files containing text fragments. You can take texts of your own, from the internet,… Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool and plot the 10 most frequent words.

  7. import nltk
     from nltk.corpus import PlaintextCorpusReader
     corpus_root = "/Users/claudia/my_corpus"
     #corpus_root = "C:\Users\..."
     my_corpus = PlaintextCorpusReader(corpus_root, '.*')
     words = my_corpus.words( )
     cfd = nltk.ConditionalFreqDist((fileid, word)
         for fileid in my_corpus.fileids( )
         for word in my_corpus.words(fileid))

  8. import re
     fileids = my_corpus.fileids( )
     modals = ['can', 'could', 'may', 'might', 'must', 'will']
     cfd.tabulate(conditions = fileids, samples = modals)
     fd = nltk.FreqDist(words)
     # iterate over a copy of the keys, so tokens can be deleted safely
     for t in list(fd.keys( )):
         if re.match(r'[^a-zA-Z0-9]+', t):
             del fd[t]
     # plot the 10 most frequent remaining words
     fd.plot(10)

  9. Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending with ‘ing’ and sort it on its values (decreasingly). • Ex 2) Write the raw text of the text in the previous exercise to an output file.

  10. import nltk
      import re
      from urllib import urlopen
      url = "website"
      htmltext = urlopen(url).read( )
      rawtext = nltk.clean_html(htmltext)
      rawtext2 = rawtext.lower( )
      tokens = nltk.wordpunct_tokenize(rawtext2)
      my_text = nltk.Text(tokens)
      wordlist_ing = [w for w in tokens if re.search(r'^.*ing$', w)]

  11. freq_dict = { }
      for word in wordlist_ing:
          if word not in freq_dict:
              freq_dict[word] = 1
          else:
              freq_dict[word] = freq_dict[word] + 1
      from operator import itemgetter
      sorted_wordlist_ing = sorted(freq_dict.iteritems(), key = itemgetter(1), reverse = True)

  12. Ex 2)
      output_file = open("dir/output.txt", "w")
      output_file.write(str(rawtext2) + "\n")
      output_file.close( )

  13. Ex 3)Write a script that performs the same classification task as we saw today using word bigrams as features instead of single words.
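      (A solution for this exercise was not shown on the slides. As a minimal sketch of the feature-extraction half, a helper could turn each pair of adjacent words into a present feature; `bigram_features` and the feature labels are assumptions for illustration, not part of the course material.)

      ```python
      # Hypothetical helper: represent a text by its word bigrams instead of
      # single words, producing the kind of feature dictionary the NLTK
      # classifiers expect.
      def bigram_features(words):
          features = { }
          for w1, w2 in zip(words, words[1:]):
              # mark every pair of adjacent words as a present feature
              features["bigram(%s,%s)" % (w1, w2)] = True
          return features

      print(bigram_features(["the", "cat", "sat"]))
      ```

      These dictionaries would then replace the single-word feature dictionaries in the labelled (features, category) pairs passed to nltk.NaiveBayesClassifier.train, leaving the rest of the classification script unchanged.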

  14. Some Mentioned Issues • Loading your own corpus in NLTK with no subcategories:
      import nltk
      from nltk.corpus import PlaintextCorpusReader
      loc = "/Users/claudia/my_corpus"  #Mac
      loc = r"C:\Users\claudia\my_corpus"  #Windows 7 (raw string, so the backslashes are kept literally)
      my_corpus = PlaintextCorpusReader(loc, ".*")

  15. Loading your own corpus in NLTK with subcategories:
      import nltk
      from nltk.corpus import CategorizedPlaintextCorpusReader
      loc = "/Users/claudia/my_corpus"  #Mac
      loc = r"C:\Users\claudia\my_corpus"  #Windows 7
      my_corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern = r'(cat1|cat2)/.*')

  16. Dispersion Plot • determine the location of a word in the text: how many words from the beginning it appears
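      (The offsets behind such a plot can be computed directly. A small illustration; `word_offsets` is a hypothetical helper written for this note, not an NLTK function.)

      ```python
      # For each occurrence of a word, record how many words from the
      # beginning of the text it appears — the positions a dispersion
      # plot draws as ticks.
      def word_offsets(tokens, target):
          return [i for i, token in enumerate(tokens) if token == target]

      tokens = "to be or not to be".split()
      print(word_offsets(tokens, "be"))  # → [1, 5]
      ```

      NLTK draws these offsets for several words at once, e.g. text4.dispersion_plot(["men", "women", "people"]).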

  17. Exercises • Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase. You can get a list of all punctuation marks by:
      import string
      print string.punctuation

  18. import nltk, string
      def strip(filepath):
          f = open(filepath, 'r')
          text = f.read( )
          tokens = nltk.wordpunct_tokenize(text)
          # lowercase first, then filter; building new lists avoids
          # removing items from a list while looping over it
          tokens = [token.lower( ) for token in tokens]
          tokens = [token for token in tokens if token not in string.punctuation]
          return tokens

  19. If you want to analyse a text, but filter out a stop list first (e.g. containing “the”, “and”,…), you need to make 2 dictionaries: 1 with all words from your text and 1 with all words from the stop list. Then you need to subtract the 2nd from the 1st. Write a function subtract(d1, d2) which takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. You can set the values to None.

  20. def subtract(d1, d2):
          d3 = { }
          for key in d1.keys( ):
              if key not in d2:
                  d3[key] = None
          return d3

  21. Let’s try it out:
      import nltk
      from nltk.book import *
      from nltk.corpus import stopwords
      d1 = { }
      for word in text7:
          d1[word] = None

  22. wordlist = stopwords.words("english")
      d2 = { }
      for word in wordlist:
          d2[word] = None
      rest_dict = subtract(d1, d2)
      wordlist_min_stopwords = rest_dict.keys( )

  23. Questions?

  24. Evaluation Assignment • Deadline = 23/01/2012 • Conversation in the week of 23/01/12 • If you need any explanation about the content of the assignment, feel free to e-mail me

  25. Further Reading • Since this was only a short introduction to programming in Python, if you want to expand your programming skills further: see chapters 15 – 18 about object-oriented programming

  26. Think Python: How to Think Like a Computer Scientist • NLTK book • Official Python documentation: http://www.python.org/doc/ • There is a newer version of Python available (Python 3), but it is not (yet) compatible with NLTK

  27. Our research group: CLiPS: Computational Linguistics and Psycholinguistics Research Center http://www.clips.ua.ac.be/ • Our projects: http://www.clips.ua.ac.be/projects

  28. Happy holidays and success with your exams
