Explore the use of machine learning for text analysis, including term weights, linguistic tools, and document retrieval. Learn how to evaluate document retrieval and compute term weights for relevance.
Machine Learning in PracticeLecture 12 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Plan for the Day • Announcements • Assignment 5 handed out – Due next Thursday • Note: Readings for next two lectures on Blackboard in Readings folder • See syllabus for specifics • Feedback on Quiz 4 • Homework 4 Issues • Midterm assigned Thursday, Oct 21!!! • More about Text • Term Weights • Start Linguistic Tools
Assignment 5 • 2 examples, but there are many more
TA Office Hours • Possibly moving to Wednesdays at 3 • Note that there will be a special TA session before the midterm for you to ask questions
How are we doing on pace and level of detail?
Feedback on Quiz 4 • Nice job overall!!! • I could tell you read carefully! • Note that part-of-speech refers to grammatical categories like noun, verb, etc. • Named entity extractors locate noun phrases that refer to people, organizations, countries, etc. • Some people skipped the why and how parts of questions • Some people over-estimated the contribution of POS tagging
Error Analysis • If I sort by different features, I can see whether rows of a particular color end up in a specific region • If I want to know which features to do this with, I can start with the most predictive features • Another option would be to use machine learning to predict which cell an instance would end up in within the confusion matrix
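As a rough sketch of that last option (not from the lecture; the labels and names below are purely illustrative), each instance can be tagged with its confusion-matrix cell, and those tags can then be treated as a class to predict:

```python
# Sketch: label each instance with its confusion-matrix cell (actual -> predicted)
# so you can sort by feature values and look for error patterns.
from collections import Counter

def confusion_cells(y_true, y_pred):
    """Return one 'actual->predicted' cell label per instance."""
    return [f"{a}->{p}" for a, p in zip(y_true, y_pred)]

y_true = ["pos", "pos", "neg", "neg", "pos"]   # illustrative gold labels
y_pred = ["pos", "neg", "neg", "pos", "pos"]   # illustrative predictions

cells = confusion_cells(y_true, y_pred)
print(Counter(cells))   # e.g. Counter({'pos->pos': 2, 'pos->neg': 1, ...})

# These cell labels can themselves be used as the class attribute for a
# second learner, to see which features predict where the errors land.
```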
Other Suggestions • Use RemoveMisclassified • an unsupervised instance filter • Separates correctly classified instances from incorrectly classified instances • Works in a similar way to the RemoveFolds filter • Only need to use it twice rather than 20 times for 10-fold cross validation • Doesn’t give you as much information
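An analogous effect can be sketched outside Weka. The snippet below uses scikit-learn rather than the Weka filter itself (so the API is an assumption of this sketch, not how RemoveMisclassified is configured): cross-validated predictions split the data into correctly and incorrectly classified instances in one pass.

```python
# Analogue of the RemoveMisclassified idea, sketched with scikit-learn.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.random((100, 5))            # illustrative feature matrix
y = rng.integers(0, 2, 100)         # illustrative binary labels

preds = cross_val_predict(GaussianNB(), X, y, cv=10)
correct = X[preds == y]             # the instances the filter would keep
misclassified = X[preds != y]       # the instances the filter would remove
print(len(correct), len(misclassified))
```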
Computing Confidence Intervals • A 90% confidence interval corresponds to z=1.65 • 5% chance that a data point falls to the right of the rightmost edge of the interval • f = proportion of successes • N = number of trials • Interval ≈ f ± z·√(f(1−f)/N) • Example: f=75%, N=1000, c=90% -> [0.727, 0.773]
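A minimal sketch of that calculation, assuming the simple normal-approximation interval f ± z·√(f(1−f)/N):

```python
# Normal-approximation confidence interval for a success proportion.
from math import sqrt

def confidence_interval(f, n, z=1.65):
    """f = observed proportion of successes, n = number of trials, z = 1.65 for c = 90%."""
    margin = z * sqrt(f * (1 - f) / n)
    return f - margin, f + margin

low, high = confidence_interval(0.75, 1000, z=1.65)
print(round(low, 3), round(high, 3))   # 0.727 0.773
```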
Document Retrieval / Inverted Index • [Figure: inverted index table, with one row per stemmed token listing its document frequency, total frequency, and per-document entries of Doc#, Freq, and word positions] • Easy to find all documents that have terms in common with your query • Stemming allows you to retrieve morphological variants (run, runs, running, runner) • Word positions allow you to specify that you want two terms to appear within N words of each other
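A toy inverted index with word positions might be sketched as follows (the documents and whitespace tokenization are illustrative only):

```python
# Minimal inverted index: token -> document id -> list of word positions.
from collections import defaultdict

docs = {
    1: "the runner runs along the river bank",
    2: "the bank approved the loan",
}

index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        index[token][doc_id].append(pos)

print(dict(index["bank"]))   # {1: [6], 2: [1]}
# Document frequency of a term is simply len(index[term]); the stored
# positions make "term A within N words of term B" queries possible.
```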
Evaluating Document Retrieval • Use standard measures: precision, recall, f-measure • Retrieving all documents that share words with your query will both over- and under-generate • If you have the whole web to select from, then under-generating is much less of a concern than over-generating • Does this apply to your task?
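For a single query, the standard measures fall out of the sets of retrieved and relevant documents; the document ids below are illustrative:

```python
# Precision, recall, and F-measure for one query.
retrieved = {1, 2, 3, 5, 8}    # documents the system returned
relevant = {2, 3, 4, 8}        # documents judged relevant

tp = len(retrieved & relevant)             # relevant documents actually retrieved
precision = tp / len(retrieved)            # how much of what we returned was relevant
recall = tp / len(relevant)                # how much of the relevant set we found
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f_measure, 2))   # 0.6 0.75 0.67
```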
Common Vocabulary is Not Enough • You’ll get documents that mention other senses of the term you mean • River bank versus financial institution • Word sense disambiguation is an active area of computational linguistics! • You won’t get documents that discuss other related terms
Common Vocabulary is Not Enough • You’ll get documents that mention a term but are not about that term • Partly get around this by sorting by relevance • Term weights approximate a measure of relevance • Cosine similarity between Query vector and Document vector computes relevance – then sort documents by relevance
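A minimal cosine-similarity sketch over dense term vectors (the vector values are illustrative and not tied to any particular weighting scheme):

```python
# Rank documents by cosine similarity to a query vector.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1, 1, 0, 0]     # one dimension per vocabulary term
doc_a = [3, 2, 0, 1]
doc_b = [0, 1, 4, 2]

ranked = sorted([("doc_a", cosine(query, doc_a)), ("doc_b", cosine(query, doc_b))],
                key=lambda x: x[1], reverse=True)
print(ranked)   # doc_a comes out first: it shares more weight with the query
```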
Computing Term Weights • A common vector representation for text is to have one attribute per word in your vocabulary • Notice that Weka gives you other options
Why is it important to think about term weights? • If term frequency or salience matters for your task, you might lose too much information if you just consider whether a term ever occurred or not • On the other hand, if term frequency doesn’t matter for your task, a simpler representation will most likely work better • Term weights are important for information retrieval because in large documents, just knowing a term occurs at least once does not tell you whether that document is “about” that term
Basics of Computing Term Weights • Assume occurrence of each term is independent so that attributes are orthogonal • Obviously this isn’t true! But it’s a useful simplifying assumption • Term weight functions have two basic components • Term frequency: How many times did that term occur in the current document • Document frequency: How many times did that term occur across documents (or how many documents did that term occur in)
Basics of Computing Term Weights • Inverse document frequency: a measure of the rarity of a term • idft = log(N/nt), where t is the term, N is the number of documents, and nt is the number of documents in which that term occurs at least once • Inverse document frequency is 0 when a term occurs in all documents • It reaches its maximum, log(N), for terms that occur in only one document
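A small sketch of the idf computation, using made-up document-frequency counts:

```python
# idf_t = log(N / n_t): 0 for a term in every document, largest for rare terms.
from math import log

N = 1000                                          # documents in the collection
doc_freq = {"the": 1000, "bank": 120, "astrolabe": 3}   # illustrative counts

idf = {t: log(N / n) for t, n in doc_freq.items()}
print(idf["the"])                   # 0.0  -- occurs in every document
print(round(idf["astrolabe"], 2))   # 5.81 -- the rare term gets the largest weight
```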
TF.IDF – Term Frequency × Inverse Document Frequency • A scheme that combines term frequency with inverse document frequency • wt,d = tft,d × idft • Weka also gives you the option of normalizing for document length • Terms are more likely to occur in longer documents just by chance • You can then compute the cosine similarity between the vector representation for a text and that of a query
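A sketch that puts the two components together, with an optional length normalization (dividing counts by document length is one simple choice; Weka's own normalization option may differ in detail):

```python
# tf-idf weights for one document, given collection-level document frequencies.
from math import log

def tf_idf(doc_tokens, doc_freq, n_docs, normalize=True):
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    length = len(doc_tokens)
    weights = {}
    for t, tf in counts.items():
        tf = tf / length if normalize else tf          # length normalization
        weights[t] = tf * log(n_docs / doc_freq.get(t, 1))
    return weights

doc_freq = {"the": 1000, "by": 900, "river": 200, "bank": 120}   # illustrative
weights = tf_idf("the bank by the river bank".split(), doc_freq, n_docs=1000)
print(weights)   # "bank" gets the highest weight: frequent here, rarer overall
```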
Computing Term Weights • Notice how to set options for different types of term weights
Trying Different Term Weights • Predicting Class1, 72 instances • Note that the number of levels for each separate feature is identical for Word Count, Term Frequency, and TF.IDF.
Trying Different Term Weights • What is different is the relative weight of the different features. • Whether this matters depends on the learning method
Basic Anatomy: Layers of Linguistic Analysis • Phonology: The sound structure of language • Basic sounds, syllables, rhythm, intonation • Morphology: The building blocks of words • Inflection: tense, number, gender • Derivation: building words from other words, transforming part of speech • Syntax: Structural and functional relationships between spans of text within a sentence • Phrase and clause structure • Semantics: Literal meaning, propositional content • Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness) • Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
Sentence Segmentation • Breaking a text into sentences is a first step for processing • Why is this not trivial? • In speech there is no punctuation • In text, punctuation may be missing • Punctuation may be ambiguous (e.g., periods in abbreviations) • Alternative approaches • Rule based/regular expressions • Statistical models
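A minimal rule-based splitter, sketched as a single regular expression, shows both the idea and exactly why it is not trivial:

```python
# Split after ., !, or ? when followed by whitespace and a capital letter.
# The output shows the ambiguity problem: the period in "Dr." triggers a bogus split.
import re

text = "Dr. Smith arrived at 3 p.m. He gave the lecture. Was it recorded?"
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(sentences)
# ['Dr.', 'Smith arrived at 3 p.m.', 'He gave the lecture.', 'Was it recorded?']
```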
Tokenization • Segment a data stream into meaningful units • Each unit is called a token • Simple rule: a token is any sequence of characters separated by white space • Leaves punctuation attached to words • But stripping out punctuation would break up large numbers like 5,235,064 • What about words like “school bus”?
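A sketch contrasting the simple whitespace rule with a regular-expression tokenizer that peels punctuation off words but keeps comma-separated numbers intact (the pattern is illustrative, not a standard tokenizer):

```python
import re

text = "The school bus cost $5,235,064, believe it or not."

whitespace_tokens = text.split()
# ['The', 'school', 'bus', 'cost', '$5,235,064,', 'believe', 'it', 'or', 'not.']

# Try comma-grouped numbers first, then word characters, then single punctuation marks.
pattern = r"\d{1,3}(?:,\d{3})+|\w+|[^\w\s]"
regex_tokens = re.findall(pattern, text)
print(regex_tokens)
# ['The', 'school', 'bus', 'cost', '$', '5,235,064', ',', 'believe', 'it', 'or', 'not', '.']

# "school bus" still comes out as two tokens; treating it as one unit needs
# a lexicon or collocation model, not just character-level rules.
```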
Automatic Segmentation • Run a sliding window 3 symbols wide across the text • Some features from outside the window are also used for prediction • Each position is classified as a boundary or not • A boundary falls between the 1st and 2nd position of the window
Automatic Segmentation • Features used for prediction: • 3 symbols • Punctuation • Whether there have been at least two capitalized words since the last boundary • Whether there have been at least three non-capitalized words since the last boundary • Whether we have seen fewer than half the number of symbols as the average segment length since the last boundary • Whether we have seen fewer than half the average number of symbols between punctuations since the last punctuation mark
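A simplified sketch of the windowed representation (the function name and reduced feature set are illustrative, not the exact feature set listed above): slide a 3-symbol window over the text and emit one feature vector per position, to be labeled boundary / not-boundary by a classifier such as a decision tree.

```python
# Turn one window position into a feature dictionary for boundary classification.
def window_features(symbols, i, last_boundary, avg_len):
    window = symbols[i:i + 3]
    return {
        "sym1": window[0],
        "sym2": window[1] if len(window) > 1 else "",
        "sym3": window[2] if len(window) > 2 else "",
        "sym1_is_punct": window[0] in ".!?;",
        # fewer than half the average segment length seen since the last boundary
        "short_segment": (i - last_boundary) < avg_len / 2,
    }

symbols = list("End. Next")
print(window_features(symbols, 3, last_boundary=0, avg_len=12))
# {'sym1': '.', 'sym2': ' ', 'sym3': 'N', 'sym1_is_punct': True, 'short_segment': True}
```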
Automatic Segmentation • Model trained with decision tree learning algorithm • Percent accuracy: 96% • Agreement: .44 Kappa • Precision: .59 • Recall: .37 • We assign 66% as many boundaries as the gold standard
Stemmers and Taggers • Stemmers are simple morphological analyzers • Strip a word down to the root • Run, runner, running, runs: all the same root • Next week we will use the Porter stemmer, which just chops endings off • Taggers assign syntactic categories to tokens • Words assigned potential POS tags in the lexicon • Context also plays a role
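A hedged sketch using NLTK's Porter stemmer and part-of-speech tagger (this assumes nltk is installed and its tokenizer/tagger data packages have already been fetched with nltk.download()):

```python
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["run", "runner", "running", "runs"]])
# ['run', 'runner', 'run', 'run'] -- the stemmer is heuristic, not a full morphological analyzer

tokens = nltk.word_tokenize("The runner runs along the river bank")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('runner', 'NN'), ('runs', 'VBZ'), ('along', 'IN'), ...]
```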
Wrap-Up • Feature space design affects classification accuracy • We examined two main ways to manipulate the feature space representation of text • One way is through alternative types of term weights • Also using linguistic tools to identify features of texts beyond just the words that make them up • Part-of-speech taggers can be customized with different tag sets • Next time we’ll talk about the tag set you will use in the assignment • We will also talk about parsers • You can use parsers to create features for classification