
Programming for Linguists



  1. Programming for Linguists An Introduction to Python, 15/12/2011

  2. Tuples • A sequence of values • They are similar to lists: • the values can be any type • they are indexed by integers • Syntactically, a tuple is a comma-separated list of values: t = 'a', 'b', 'c', 'd', 'e'

  3. Although it is not necessary, it is common to enclose tuples in parentheses: t = ('a', 'b', 'c', 'd', 'e') • To create a tuple with a single element, you have to include a final comma: t1 = 'a', after which type(t1) confirms it is a tuple

  4. Note: a value in parentheses is not a tuple! t2 = ('a') makes type(t2) a string • With no argument, the tuple( ) function creates a new, empty tuple: t = tuple( )
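The points from the last few slides can be checked in a quick sketch (written in modern Python 3 syntax; the slides themselves use Python 2):

```python
# The comma, not the parentheses, is what makes a tuple
t = 'a', 'b', 'c', 'd', 'e'   # a tuple without parentheses
t1 = 'a',                     # single-element tuple: final comma required
t2 = ('a')                    # parentheses alone: this is just a string
empty = tuple()               # tuple() with no argument: an empty tuple
```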

  5. If the argument is a sequence (string, list or tuple), the result is a tuple with the elements of the sequence: t = tuple('lupins'), then print t • Most list operators also work on tuples: print t[0], print t[1:3]

  6. BUT if you try to modify one of the elements of the tuple, you get an error message: t[0] = 'A' • You can't modify the elements of a tuple: a tuple is immutable!

  7. You can replace one tuple with another: t = ('A',) + t[1:], then print t
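A small sketch (Python 3 syntax) of immutability and tuple replacement:

```python
t = tuple('lupins')           # a tuple built from a string
try:
    t[0] = 'A'                # item assignment on a tuple fails
except TypeError:
    modified = False          # tuples are immutable
t = ('A',) + t[1:]            # so build a new tuple instead
```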

  8. Tuple Assignment • It is often useful to swap the values of two variables, e.g. swap a with b: temp = a, then a = b, then b = temp

  9. More elegant with a tuple assignment: a, b = b, a • The number of variables on the left and the number of values on the right have to be the same! a, b = 1, 2, 3 gives ValueError: too many values to unpack
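Both points as a runnable sketch (Python 3 syntax):

```python
a, b = 1, 2
a, b = b, a                   # swap via tuple assignment, no temp variable
try:
    x, y = 1, 2, 3            # counts on both sides must match
except ValueError:
    unpack_failed = True      # "too many values to unpack"
```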

  10. For example: split an email address into a user name and a domain: address = 'joske@ua.ac.be', then username, domain = address.split('@'), then print username and print domain • The return value from split('@') is a list with two elements • The first element is assigned to username, the second to domain
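The same example as a self-contained sketch (Python 3 syntax):

```python
address = 'joske@ua.ac.be'
username, domain = address.split('@')   # split('@') gives a two-element list
```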

  11. Tuples as Return Values • Strictly speaking, a function can only return one value • If the value is a tuple, the effect is the same as returning multiple values

  12. For example: def min_max(t): return min(t), max(t) • max( ) and min( ) are built-in functions that find the largest and smallest elements of a sequence • min_max(t) computes both and returns a tuple of two values
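A runnable version of the function with a usage example (the input list here is illustrative):

```python
def min_max(t):
    # one return value, but that value is a tuple packing both results
    return min(t), max(t)

lo, hi = min_max([3, 1, 4, 1, 5])   # the caller can unpack it directly
```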

  13. Dictionaries and Tuples • The .items( ) method on dictionaries we saw last week actually returns a list of tuples, e.g. >>> d = {'a': 0, 'b': 1, 'c': 2} >>> d.items( ) [('a', 0), ('c', 2), ('b', 1)]

  14. This way you can easily access both keys and values separately:
d = {'a': 0, 'b': 1, 'c': 2}
for letter, number in d.items( ):
    print letter
    print number
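The same loop as a sketch in Python 3 syntax (where .items( ) returns a view rather than a list, but unpacking in a for loop works the same way):

```python
d = {'a': 0, 'b': 1, 'c': 2}
pairs = []
for letter, number in d.items():    # each item is a (key, value) tuple
    pairs.append((letter, number))
```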

  15. Example: sorting a list of words by their word length:
def sort_by_length(words):
    list1 = [ ]
    for word in words:
        list1.append((len(word), word))
    list1.sort(reverse=True)
    ordered_list = [ ]
    for length, word in list1:
        ordered_list.append(word)
    return ordered_list
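The same function in Python 3 syntax, with a small usage example (the input words are invented for illustration):

```python
def sort_by_length(words):
    # pair each word with its length, so sorting compares lengths first
    list1 = []
    for word in words:
        list1.append((len(word), word))
    list1.sort(reverse=True)          # longest words first
    ordered_list = []
    for length, word in list1:
        ordered_list.append(word)     # keep only the words
    return ordered_list

result = sort_by_length(['to', 'linguistics', 'python'])
```

Note that words of equal length end up in reverse alphabetical order, because the tuple comparison falls back on the word itself when the lengths tie.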

  16. NLTK and the Internet • A lot of text on the web is in the form of HTML documents • To access them, you first need to specify the correct location: url = "http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html" • Then use the urlopen( ) function: from urllib import urlopen, then htmltext = urlopen(url).read( )

  17. NLTK provides a function nltk.clean_html( ), which takes an HTML string and returns raw text, e.g. rawtext = nltk.clean_html(htmltext) • In order to use other NLTK methods, you can then tokenize the raw text: tokens = nltk.wordpunct_tokenize(rawtext)
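Note that nltk.clean_html( ) was removed from later NLTK releases, which direct users to a dedicated HTML library such as BeautifulSoup instead. A rough stand-in using only the standard library, shown here as a sketch rather than a robust HTML cleaner:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    # keeps only the text found between tags
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(htmltext):
    stripper = TagStripper()
    stripper.feed(htmltext)
    return ''.join(stripper.chunks)

rawtext = strip_tags('<p>Natural <b>language</b> processing</p>')
```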

  18. NLTK's WordPunctTokenizer takes raw text as an argument and returns a list of tokens (words + punctuation marks) • If you want to use the functions we used on the texts from nltk.book on your own texts, use the nltk.Text( ) function: my_text = nltk.Text(tokens), then my_text.collocations( )

  19. Note: if you are used to working with characters in a particular local encoding (ë, è, …), you need to include the string '# -*- coding: <coding> -*-' as the first or second line of your script, e.g. # -*- coding: utf-8 -*-

  20. Writing Results to a File • It is often useful to write output to files • First you have to open/create a file for your output: output_file = open('(path)/output.txt', 'w') to (over)write, or output_file = open('(path)/output.txt', 'a') to append

  21. Now you have to write your output to the file you just opened:
list1 = [1, 2, 3]
output_file.write(str(list1) + "\n")
• When you write non-text data to a file, you must convert it to a string first • Do not forget to close the file when you are done: output_file.close( )
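The open/write/close steps as a self-contained sketch (Python 3 syntax, writing to the temporary directory instead of the placeholder path):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'output.txt')  # stand-in path
output_file = open(path, 'w')           # 'w' (over)writes; 'a' would append
numbers = [1, 2, 3]
output_file.write(str(numbers) + '\n')  # non-text data must become a string
output_file.close()                     # closing flushes the data to disk

contents = open(path).read()
```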

  22. NLTK and Automatic Text Classification • Classification is the computational task of choosing the correct class label for a given input text, e.g. • deciding whether an email is spam or not • deciding what the topic of a news article is (e.g. sports, politics, financial, …) • authorship attribution

  23. Framework (1) • Gather a training corpus: • in which a categorization is possible using metadata, e.g. • information about the author(s): name, age, gender, location • information about the texts' genre: sports, humor, romance, scientific

  24. Framework (2) • Gather a training corpus: • for which you need to add the metadata yourself, e.g. • annotation of content-specific information: add sentiment labels to utterances • annotation of linguistic features: add POS tags to text • Result: a dataset with predefined categories

  25. Framework (3) • Pre-processing of the dataset, e.g. tokenization, removing stop words • Feature selection: which features of the text could be informative for your classification task, e.g. • lexical features: words, word bigrams, … • character features: n-grams • syntactic features: POS tags • semantic features: role labels • others: readability scores, type/token ratio, word length, sentence length, …

  26. Framework (4) • Divide your dataset into a training set and a test set (usually 90% vs 10%) • Feature selection metrics: • based on frequencies: most frequent features • based on frequency distributions per category: most informative features • in NLTK: chi-square, Student's t test, pointwise mutual information, likelihood ratio, Poisson-Stirling, Jaccard index, information gain • use them only on the training data! (to avoid overfitting)
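A minimal sketch of the 90/10 split and a purely frequency-based feature selection computed on the training part only; the toy dataset and all names here are invented for illustration:

```python
from collections import Counter

# toy stand-in dataset: (tokenized document, category) pairs
dataset = [(['good', 'movie'], 'pos'), (['bad', 'movie'], 'neg')] * 50

split = int(len(dataset) * 0.9)     # the usual 90% / 10% division
train_set = dataset[:split]
test_set = dataset[split:]

# most-frequent-features selection, using the training data only
freqs = Counter(w for doc, cat in train_set for w in doc)
selected = [w for w, n in freqs.most_common(2)]
```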

  27. Framework (5) • For document classification: each document in the dataset is represented by a separate instance containing the features extracted from the training data • The format of your instances depends on the classifier you want to use • Select your classifier, in NLTK: Naive Bayes, Decision Tree, Maximum Entropy, or the link to Weka

  28. Framework (6) • Train the classifier using the training instances you created in the previous step • Test your trained model on previously unseen data: the test set • Evaluate your classifier's performance: accuracy, precision, recall and F-scores, confusion matrix • Perform error analysis

  29. A Case Study • Classification task: classifying movie reviews into positive and negative reviews 1. Import the corpus: from nltk.corpus import movie_reviews 2. Create a list of categorized documents:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories( )
             for fileid in movie_reviews.fileids(category)]

  30. print documents[:2] 3. Shuffle your list of documents randomly: from random import shuffle, then shuffle(documents) 4. Divide your data into a training and a test set: train_docs = documents[:1800], test_docs = documents[1800:] 5. We only consider word unigram features here, so make a dictionary of all (normalized) words from the training data

  31.
train_words = { }
for (wordlist, cat) in train_docs:
    for w in wordlist:
        w = w.lower( )
        if w not in train_words:
            train_words[w] = 1
        else:
            train_words[w] += 1
print len(train_words)
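The same counting loop can be written with collections.Counter (Python 3 syntax); train_docs below is a tiny invented stand-in for the real movie-review documents:

```python
from collections import Counter

# hypothetical miniature training set of (word list, category) pairs
train_docs = [(['Great', 'film'], 'pos'), (['great', 'plot'], 'pos')]

train_words = Counter()
for wordlist, cat in train_docs:
    for w in wordlist:
        train_words[w.lower()] += 1   # normalize case while counting
```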

  32. 6. Define a feature extraction function:
def extract_features(wordlist):
    document_words = set(wordlist)
    features = { }
    for word in document_words:
        word = word.lower( )
        if word in train_words:
            features[word] = (word in document_words)
    return features
print extract_features(movie_reviews.words('pos/cv957_8737.txt'))

  33. 7. Use your feature extraction function to extract all features from your training and test set:
train_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in train_docs]
test_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in test_docs]

  34. 8. Train e.g. NLTK's Naive Bayes classifier on the training set:
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_feats)
predicted_labels = classifier.batch_classify([fs for (fs, cat) in test_feats])
9. Evaluate the model on the test set:
print nltk.classify.accuracy(classifier, test_feats)
classifier.show_most_informative_features(20)

  35. For Next Week • Feedback on the past exercises • Some extra exercises • If you have additional questions or problems, please e-mail me by Wednesday • The evaluation assignment will be announced

  36. Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in 'ing' and sort it by its values in decreasing order. • Ex 2) Write the raw text of the text in the previous exercise to an output file.

  37. Ex 3) Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words.

  38. Thank you
