Explore the use of machine learning for text analysis, including term weights, linguistic tools, and document retrieval. Learn how to evaluate document retrieval and compute term weights for relevance.
Machine Learning in PracticeLecture 12 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Plan for the Day • Announcements • Assignment 5 handed out – Due next Thursday • Note: Readings for next two lectures on Blackboard in Readings folder • See syllabus for specifics • Feedback on Quiz 4 • Homework 4 Issues • Midterm assigned Thursday, Oct 21!!! • More about Text • Term Weights • Start Linguistic Tools
Assignment 5 • 2 examples, but there are many more
TA Office Hours • Possibly moving to Wednesdays at 3 • Note that there will be a special TA session before the midterm for you to ask questions
How are we doing on pace and level of detail?
Feedback on Quiz 4 • Nice job overall!!! • I could tell you read carefully! • Note that part-of-speech refers to grammatical categories like noun, verb, etc. • Named entity extractors locate noun phrases that refer to people, organizations, countries, etc. • Some people skipped the why and how parts of questions • Some people over-estimated the contribution of POS tagging
Error Analysis • If I sort by different features, I can see whether rows of a particular color end up in a specific region • If I want to know which features to do this with, I can start with the most predictive features • Another option would be to use machine learning to predict which cell an instance would end up in within the confusion matrix
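As a rough sketch of that last option (not from the lecture; the labels and names below are purely illustrative), each instance can be tagged with its confusion-matrix cell, and those tags can then be treated as a class to predict:

```python
# Sketch: label each instance with its confusion-matrix cell (actual -> predicted)
# so you can sort by feature values and look for error patterns.
from collections import Counter

def confusion_cells(y_true, y_pred):
    """Return one 'actual->predicted' cell label per instance."""
    return [f"{a}->{p}" for a, p in zip(y_true, y_pred)]

y_true = ["pos", "pos", "neg", "neg", "pos"]   # illustrative gold labels
y_pred = ["pos", "neg", "neg", "pos", "pos"]   # illustrative predictions

cells = confusion_cells(y_true, y_pred)
print(Counter(cells))   # e.g. Counter({'pos->pos': 2, 'pos->neg': 1, ...})

# These cell labels can themselves be used as the class attribute for a
# second learner, to see which features predict where the errors land.
```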
Other Suggestions • Use RemoveMisclassified • an unsupervised instance filter • Separates correctly classified instances from incorrectly classified instances • Works in a similar way to the RemoveFolds filter • Only need to use it twice rather than 20 times for 10-fold cross validation • Doesn’t give you as much information
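An analogous effect can be sketched outside Weka. The snippet below uses scikit-learn rather than the Weka filter itself (so the API is an assumption of this sketch, not how RemoveMisclassified is configured): cross-validated predictions split the data into correctly and incorrectly classified instances in one pass.

```python
# Analogue of the RemoveMisclassified idea, sketched with scikit-learn.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.random((100, 5))            # illustrative feature matrix
y = rng.integers(0, 2, 100)         # illustrative binary labels

preds = cross_val_predict(GaussianNB(), X, y, cv=10)
correct = X[preds == y]             # the instances the filter would keep
misclassified = X[preds != y]       # the instances the filter would remove
print(len(correct), len(misclassified))
```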
Computing Confidence Intervals • A 90% confidence interval corresponds to z=1.65 • 5% chance that a data point falls to the right of the rightmost edge of the interval • f = proportion of successes • N = number of trials • Interval ≈ f ± z·√(f(1−f)/N) • Example: f=75%, N=1000, c=90% -> [0.727, 0.773]
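A minimal sketch of that calculation, assuming the simple normal-approximation interval f ± z·√(f(1−f)/N):

```python
# Normal-approximation confidence interval for a success proportion.
from math import sqrt

def confidence_interval(f, n, z=1.65):
    """f = observed proportion of successes, n = number of trials, z = 1.65 for c = 90%."""
    margin = z * sqrt(f * (1 - f) / n)
    return f - margin, f + margin

low, high = confidence_interval(0.75, 1000, z=1.65)
print(round(low, 3), round(high, 3))   # 0.727 0.773
```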
Document Retrieval / Inverted Index • [Figure: inverted index table, with one row per stemmed token listing its document frequency, total frequency, and per-document entries of Doc#, Freq, and word positions] • Easy to find all documents that have terms in common with your query • Stemming allows you to retrieve morphological variants (run, runs, running, runner) • Word positions allow you to specify that you want two terms to appear within N words of each other
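A toy inverted index with word positions might be sketched as follows (the documents and whitespace tokenization are illustrative only):

```python
# Minimal inverted index: token -> document id -> list of word positions.
from collections import defaultdict

docs = {
    1: "the runner runs along the river bank",
    2: "the bank approved the loan",
}

index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        index[token][doc_id].append(pos)

print(dict(index["bank"]))   # {1: [6], 2: [1]}
# Document frequency of a term is simply len(index[term]); the stored
# positions make "term A within N words of term B" queries possible.
```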
Evaluating Document Retrieval • Use standard measures: precision, recall, f-measure • Retrieving all documents that share words with your query will both over- and under-generate • If you have the whole web to select from, then under-generating is much less of a concern than over-generating • Does this apply to your task?
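For a single query, the standard measures fall out of the sets of retrieved and relevant documents; the document ids below are illustrative:

```python
# Precision, recall, and F-measure for one query.
retrieved = {1, 2, 3, 5, 8}    # documents the system returned
relevant = {2, 3, 4, 8}        # documents judged relevant

tp = len(retrieved & relevant)             # relevant documents actually retrieved
precision = tp / len(retrieved)            # how much of what we returned was relevant
recall = tp / len(relevant)                # how much of the relevant set we found
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f_measure, 2))   # 0.6 0.75 0.67
```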
Common Vocabulary is Not Enough • You’ll get documents that mention other senses of the term you mean • River bank versus financial institution • Word sense disambiguation is an active area of computational linguistics! • You won’t get documents that discuss other related terms
Common Vocabulary is Not Enough • You’ll get documents that mention a term but are not about that term • Partly get around this by sorting by relevance • Term weights approximate a measure of relevance • Cosine similarity between Query vector and Document vector computes relevance – then sort documents by relevance
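A minimal cosine-similarity sketch over dense term vectors (the vector values are illustrative and not tied to any particular weighting scheme):

```python
# Rank documents by cosine similarity to a query vector.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1, 1, 0, 0]     # one dimension per vocabulary term
doc_a = [3, 2, 0, 1]
doc_b = [0, 1, 4, 2]

ranked = sorted([("doc_a", cosine(query, doc_a)), ("doc_b", cosine(query, doc_b))],
                key=lambda x: x[1], reverse=True)
print(ranked)   # doc_a comes out first: it shares more weight with the query
```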
Computing Term Weights • A common vector representation for text is to have one attribute per word in your vocabulary • Notice that Weka gives you other options
Why is it important to think about term weights? • If term frequency or salience matters for your task, you might lose too much information if you just consider whether a term ever occurred or not • On the other hand, if term frequency doesn’t matter for your task, a simpler representation will most likely work better • Term weights are important for information retrieval because in large documents, just knowing a term occurs at least once does not tell you whether that document is “about” that term
Basics of Computing Term Weights • Assume occurrence of each term is independent so that attributes are orthogonal • Obviously this isn’t true! But it’s a useful simplifying assumption • Term weight functions have two basic components • Term frequency: How many times did that term occur in the current document • Document frequency: How many times did that term occur across documents (or how many documents did that term occur in)
Basics of Computing Term Weights • Inverse document frequency: a measure of the rarity of a term • idft = log(N/nt), where t is the term, N is the number of documents, and nt is the number of documents in which that term occurs at least once • Inverse document frequency is 0 when a term occurs in all documents • It reaches its maximum, log(N), for terms that occur in only one document
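A small sketch of the idf computation, using made-up document-frequency counts:

```python
# idf_t = log(N / n_t): 0 for a term in every document, largest for rare terms.
from math import log

N = 1000                                          # documents in the collection
doc_freq = {"the": 1000, "bank": 120, "astrolabe": 3}   # illustrative counts

idf = {t: log(N / n) for t, n in doc_freq.items()}
print(idf["the"])                   # 0.0  -- occurs in every document
print(round(idf["astrolabe"], 2))   # 5.81 -- the rare term gets the largest weight
```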
TF.IDF – Term Frequency × Inverse Document Frequency • A scheme that combines term frequency with inverse document frequency • wt,d = tft,d × idft • Weka also gives you the option of normalizing for document length • Terms are more likely to occur in longer documents just by chance • You can then compute the cosine similarity between the vector representation for a text and that of a query
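A sketch that puts the two components together, with an optional length normalization (dividing counts by document length is one simple choice; Weka's own normalization option may differ in detail):

```python
# tf-idf weights for one document, given collection-level document frequencies.
from math import log

def tf_idf(doc_tokens, doc_freq, n_docs, normalize=True):
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    length = len(doc_tokens)
    weights = {}
    for t, tf in counts.items():
        tf = tf / length if normalize else tf          # length normalization
        weights[t] = tf * log(n_docs / doc_freq.get(t, 1))
    return weights

doc_freq = {"the": 1000, "by": 900, "river": 200, "bank": 120}   # illustrative
weights = tf_idf("the bank by the river bank".split(), doc_freq, n_docs=1000)
print(weights)   # "bank" gets the highest weight: frequent here, rarer overall
```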
Computing Term Weights • Notice how to set options for different types of term weights
Trying Different Term Weights • Predicting Class1, 72 instances • Note that the number of levels for each separate feature is identical for Word Count, Term Frequency, and TF.IDF.
Trying Different Term Weights • What is different is the relative weight of the different features. • Whether this matters depends on the learning method
Basic Anatomy: Layers of Linguistic Analysis • Phonology: The sound structure of language • Basic sounds, syllables, rhythm, intonation • Morphology: The building blocks of words • Inflection: tense, number, gender • Derivation: building words from other words, transforming part of speech • Syntax: Structural and functional relationships between spans of text within a sentence • Phrase and clause structure • Semantics: Literal meaning, propositional content • Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness) • Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
Sentence Segmentation • Breaking a text into sentences is a first step for processing • Why is this not trivial? • In speech there is no punctuation • In text, punctuation may be missing • Punctuation may be ambiguous (e.g., periods in abbreviations) • Alternative approaches • Rule based/regular expressions • Statistical models
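A minimal rule-based splitter, sketched as a single regular expression, shows both the idea and exactly why it is not trivial:

```python
# Split after ., !, or ? when followed by whitespace and a capital letter.
# The output shows the ambiguity problem: the period in "Dr." triggers a bogus split.
import re

text = "Dr. Smith arrived at 3 p.m. He gave the lecture. Was it recorded?"
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(sentences)
# ['Dr.', 'Smith arrived at 3 p.m.', 'He gave the lecture.', 'Was it recorded?']
```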
Tokenization • Segment a data stream into meaningful units • Each unit is called a token • Simple rule: a token is any sequence of characters separated by white space • Leaves punctuation attached to words • But stripping out punctuation would break up large numbers like 5,235,064 • What about words like “school bus”?
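A sketch contrasting the simple whitespace rule with a regular-expression tokenizer that peels punctuation off words but keeps comma-separated numbers intact (the pattern is illustrative, not a standard tokenizer):

```python
import re

text = "The school bus cost $5,235,064, believe it or not."

whitespace_tokens = text.split()
# ['The', 'school', 'bus', 'cost', '$5,235,064,', 'believe', 'it', 'or', 'not.']

# Try comma-grouped numbers first, then word characters, then single punctuation marks.
pattern = r"\d{1,3}(?:,\d{3})+|\w+|[^\w\s]"
regex_tokens = re.findall(pattern, text)
print(regex_tokens)
# ['The', 'school', 'bus', 'cost', '$', '5,235,064', ',', 'believe', 'it', 'or', 'not', '.']

# "school bus" still comes out as two tokens; treating it as one unit needs
# a lexicon or collocation model, not just character-level rules.
```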
Automatic Segmentation • Run a sliding window 3 symbols wide across the text • Some features from outside the window are also used for prediction • Each position is classified as a boundary or not • A boundary falls between the 1st and 2nd position of the window
Automatic Segmentation • Features used for prediction: • 3 symbols • Punctuation • Whether there have been at least two capitalized words since the last boundary • Whether there have been at least three non-capitalized words since the last boundary • Whether we have seen fewer than half the number of symbols as the average segment length since the last boundary • Whether we have seen fewer than half the average number of symbols between punctuations since the last punctuation mark
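A simplified sketch of the windowed representation (the function name and reduced feature set are illustrative, not the exact feature set listed above): slide a 3-symbol window over the text and emit one feature vector per position, to be labeled boundary / not-boundary by a classifier such as a decision tree.

```python
# Turn one window position into a feature dictionary for boundary classification.
def window_features(symbols, i, last_boundary, avg_len):
    window = symbols[i:i + 3]
    return {
        "sym1": window[0],
        "sym2": window[1] if len(window) > 1 else "",
        "sym3": window[2] if len(window) > 2 else "",
        "sym1_is_punct": window[0] in ".!?;",
        # fewer than half the average segment length seen since the last boundary
        "short_segment": (i - last_boundary) < avg_len / 2,
    }

symbols = list("End. Next")
print(window_features(symbols, 3, last_boundary=0, avg_len=12))
# {'sym1': '.', 'sym2': ' ', 'sym3': 'N', 'sym1_is_punct': True, 'short_segment': True}
```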
Automatic Segmentation • Model trained with decision tree learning algorithm • Percent accuracy: 96% • Agreement: .44 Kappa • Precision: .59 • Recall: .37 • We assign 66% as many boundaries as the gold standard
Stemmers and Taggers • Stemmers are simple morphological analyzers • Strip a word down to the root • Run, runner, running, runs: all the same root • Next week we will use the Porter stemmer, which just chops endings off • Taggers assign syntactic categories to tokens • Words assigned potential POS tags in the lexicon • Context also plays a role
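A hedged sketch using NLTK's Porter stemmer and part-of-speech tagger (this assumes nltk is installed and its tokenizer/tagger data packages have already been fetched with nltk.download()):

```python
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["run", "runner", "running", "runs"]])
# ['run', 'runner', 'run', 'run'] -- the stemmer is heuristic, not a full morphological analyzer

tokens = nltk.word_tokenize("The runner runs along the river bank")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('runner', 'NN'), ('runs', 'VBZ'), ('along', 'IN'), ...]
```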
Wrap-Up • Feature space design affects classification accuracy • We examined two main ways to manipulate the feature space representation of text • One way is through alternative types of term weights • Also using linguistic tools to identify features of texts beyond just the words that make them up • Part-of-speech taggers can be customized with different tag sets • Next time we’ll talk about the tag set you will use in the assignment • We will also talk about parsers • You can use parsers to create features for classification