
When Bad is Good: Effect of Ebonics on Computational Language Processing


Presentation Transcript


  1. When Bad is Good: Effect of Ebonics on Computational Language Processing Tamsin Maxwell, 5 November 2007

  2. Acknowledgment • IP notice - some background slides thanks to: • Dan Jurafsky, Jim Martin, Bonnie Dorr (Linguist180, 2007) • Mark Wasson, LexisNexis (2002) • Theresa Wilson (PhD work 2006) • Janyce Wiebe, U. Pittsburgh (2006)

  3. Overview Part 1 • Comp Ling and ‘free text’ • Ebonics in lyrics • Comp Ling approach to language • Example - part of speech (POS) tagging • Application - sentiment analysis Part 2 • Methodology • Results • The broader picture

  4. Comp Ling and ‘free text’ • Just like people, Comp Ling relies on having lots of examples to learn from • Most advances based on large text collections • Brown corpus: 1 million words, 15 genres (1961) • GigaWord corpus (over a billion words) … Google’s trillion-word web corpus • Training data is generally ‘clean’ but real language is messy • Can we successfully use techniques developed on clean data to understand ‘free text’?

  5. The trouble with text • We’re dealing with language! • Text/speech may lack structure that traditional processing and mining techniques can exploit • Information within text may be implicit • Ambiguity • Disfluency • Error, sarcasm, invention, creativity…. • Contrast with spreadsheets, databases, etc. • Well-defined structure

  6. Back to basics • Mark Wasson (LexisNexis): • “Not enough focus on the data” • Collection • Cleansing • Scale • Completeness, including non-traditional sources • Structure • “Too much focus on algorithms”

  7. What is ebonics? • Ebonics, or AAVE (African American Vernacular English), differs from ‘regular’ English • Pronunciation: “tief” - steal / “dem” - them or those • Grammar: “he come in here” - he comes in here • Vocabulary: “kitchen” - kinky hair at nape of neck • Slang: “yard axe” - preacher of little ability • Many ebonics words, phrases and pronunciations have become common parlance • ain't, gimme, bro, lovin, wanna • Green, Lisa J. (2002). “African American English: A Linguistic Introduction”, Cambridge University Press

  8. Source of ebonics • Rooted in African tradition but very much American • Southern rural during slavery • Slang 1900-1960: “sinner-man” - black musician • Street culture, rap and hip-hop • Working class

  9. Lyrics data • 2392 popular song lyrics from 2002 • 496 artists, mostly pop/rock and some rap/hip-hop • From user-upload public websites, so very messy! • Misspelling (e.g. dancning) • Phonetic spelling (e.g. cit-aa-aa-aa-aaay) • Annotation (e.g. feat xxx, [Dre]) • Abbreviation (e.g. chorus x2) • Too much punctuation or none e.g. no line breaks • Foreign language

  10. Ebonics lexicon • Approximately 15% of all tokens are Ebonics • Phonetic translations (pronunciation) • Slang • Abbreviations • Also includes: • Spelling errors • Unintended word conjunctions • Named entities (people, places, brands)

  11. Lexicon examples

  12. Lexicon examples

  13. Ebonics frequency • [Figure: word count by dictionary word number, with ebonics shown in red and ‘regular’ English in blue]

  14. Ebonics in lyrics • Red in the curve suggests many frequent and differentiating words are ebonics terms • These terms have largely replaced their regular English counterparts • Cos, coz, cuz and cause for because • Wanna for want to • Red streaks in the tail suggest certain artists have very high proportions of ‘ebonics’ • Others may use only very common terms

  15. Comp Ling ABC • Computer programmes have three basic tasks: • Check conditions (if x=y, then…) • Perform tasks in order (procedure) • Repeat tasks (iterate) • This underlies Comp Ling analysis e.g. • Break text into chunks (words, clauses, sentences) • Check if chunks meet specified conditions • Represent conditions with numbers (e.g. counts, binary) • Repeat for different conditions • Use maths to find the most likely solution to all conditions
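
The loop above is easiest to see in code. Below is a minimal Python sketch (not from the original slides) of that pattern: break text into chunks, check a condition on each chunk, represent the results as counts, and pick the most likely outcome. The ‘informal word’ condition and word list are invented purely for illustration.

from collections import Counter

# Minimal sketch of the Comp Ling loop on the slide above:
# break text into chunks, test a condition, count, pick the best answer.
def most_likely_label(text, condition):
    counts = Counter()
    for token in text.split():          # 1. break text into chunks (tokens)
        label = condition(token)        # 2. check a specified condition
        counts[label] += 1              # 3. represent the result as a count
    return counts.most_common(1)[0]     # 4. pick the most likely outcome

# Toy condition for illustration: is the token an 'informal' spelling?
informal = {"wanna", "gimme", "cuz", "lovin"}
print(most_likely_label("gimme all your lovin", lambda t: t in informal))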

  16. An example: POS tagging

  17. An example • Break text into chunks (words, clauses, sentences) • Check if chunks meet specified conditions • Represent results with numbers (e.g. counts, binary) • Repeat for different conditions • Use maths to find the most likely solution to all conditions

  18. Tokenise the data • Each line of lyrics tokenised into unique tokens (words) • Punctuation split off by default • Where are you? → Where / are / you / ? • It’s about time → It / ‘s / about / time • Very tricky: how to reattach punctuation? • U.S.A. → U / . / S / . / A / . • Good lookin’ → Good / lookin / ’ • Get / in / ‘ / cuz / we / ’re / ready / to / go → • “Get in because” ? • “Getting because” ?
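
As a hedged illustration of the tokenisation step (the regular expression below is an assumption for this sketch, not the tokeniser actually used in the study), note how a naive rule handles ’s and ’re cleanly but strands the apostrophe in lookin’:

import re

# Naive tokeniser: keep common clitics ('s, 're, ...) attached to an apostrophe,
# otherwise split words and punctuation apart. Illustrative only.
TOKEN_RE = re.compile(r"'(?:s|re|ve|ll|d|m)\b|\w+|[^\w\s]")

def tokenise(line):
    return TOKEN_RE.findall(line)

print(tokenise("It's about time"))               # ['It', "'s", 'about', 'time']
print(tokenise("Good lookin' cuz we're ready"))  # the trailing apostrophe is split off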

  19. An example • Break text into chunks (words, clauses, sentences) • Check if chunks meet specified conditions • Represent results with numbers (e.g. counts, binary) • Repeat for different conditions • Use maths to find the most likely solution to all conditions

  20. Focus the problem • Convert language into metadata • In this example, convert tokens into POS tags • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • Tags constrained by conditions: • Condition: what tags are possible? • Condition: what is the preceding/following word?

  21. Most frequent tag • Baseline statistical tagger • Create a dictionary with each possible tag for a word • Take a tagged corpus • Count the number of times each tag occurs for that word • Pick the most frequent tag independent of surrounding words (“unigram”) • Around 90% accuracy on news text
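
A hedged sketch of such a baseline unigram tagger, assuming a tiny hand-tagged corpus in place of something like the Brown corpus (the data and the noun default are illustrative only):

from collections import Counter, defaultdict

# Toy tagged corpus; a real baseline would be trained on e.g. the Brown corpus.
tagged_corpus = [("the", "DT"), ("back", "JJ"), ("door", "NN"),
                 ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
                 ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")]

# Count how often each tag occurs for each word form.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word.lower()][tag] += 1

def most_frequent_tag(word, default="NN"):
    counts = tag_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

print(most_frequent_tag("back"))     # the tag seen most often for "back" in the toy data
print(most_frequent_tag("ebonics"))  # unseen word falls back to the default NN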

  22. An example • Break text into chunks (words, clauses, sentences) • Check if chunks meet specified conditions • Represent results with numbers (e.g. counts, binary) • Repeat for different conditions • Use maths to find the most likely solution to all conditions

  23. Most frequent tag • Which POS is more likely in a corpus (1,273,000 tokens)? • Counts for “race”: NN 400, VB 600, total 1000 • P(NN|race) = P(race&NN) / P(race) by the definition of conditional probability • P(race) ≈ 1000/1,273,000 = .0008 • P(race&NN) ≈ 400/1,273,000 = .0003 • P(race&VB) ≈ 600/1,273,000 = .0005 • And so we obtain: • P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .375 • P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .625

  24. An example • Break text into chunks (words, clauses, sentences) • Check if chunks meet specified conditions • Represent results with numbers (e.g. counts, binary) • Repeat for different conditions • Use maths to find the most likely solution to all conditions

  25. POS tagging as sequence classification • We are given a sentence (a “sequence of observations”) • Secretariat is expected to race tomorrow • What is the best sequence of tags which corresponds to this sequence of observations? • Probabilistic view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observations

  26. Disambiguating “race”

  27. An example • Break text into chunks (words, clauses, sentences) • Check if chunks meet specified conditions • Represent results with numbers (e.g. counts, binary) • Repeat for different conditions • Use maths to find the most likely solution to all conditions

  28. Hidden Markov Models • We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest • This equation is guaranteed to give us the best tag sequence • The hat ^ means “our estimate of the best one” • argmax_x f(x) means “the x such that f(x) is maximized”
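
The equation itself is not reproduced in this transcript. The standard bigram-HMM formulation the bullet points describe (following the Jurafsky/Martin material credited at the start) is:

\[
\hat{t}_{1}^{\,n} \;=\; \operatorname*{argmax}_{t_{1}^{\,n}} P(t_{1}^{\,n} \mid w_{1}^{\,n})
\;\approx\; \operatorname*{argmax}_{t_{1}^{\,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\]

The approximation applies Bayes’ rule, drops the constant denominator P(w1…wn), and assumes each word depends only on its own tag and each tag only on the previous tag; these are exactly the P(word|tag) and P(tag|previous tag) terms used on the next slide.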

  29. The solution • P(NN|TO) = .00047 • P(VB|TO) = .83 • P(race|NN) = .00057 • P(race|VB) = .00012 • P(NR|VB) = .0027 • P(NR|NN) = .0012 • P(VB|TO)P(NR|VB)P(race|VB) = .00000027 • P(NN|TO)P(NR|NN)P(race|NN)=.00000000032 • So we (correctly) choose the verb reading

  30. But hang on a minute… • The most-frequent-tag approach has a problem! • What about words that don’t appear in the training set? • E.g. Colgate, goose-pimple, headbanger • New words are added to (newspaper) language at 20+ per month • Plus many proper names … • “Out of vocabulary” (OOV) words increase tagging error rates

  31. Handling unknown words • Method 1: assume they are nouns • Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set. • Method 3: Use morphological information, e.g., words ending with –ed tend to be tagged VBN.
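
A hedged sketch combining Method 1 and Method 3 (the suffix list is an illustrative assumption, not a complete rule set):

def guess_tag_for_unknown(word):
    # Method 3: use morphological clues from the word's ending.
    if word.endswith("ed"):
        return "VBN"              # e.g. "condemned"
    if word.endswith("ing") or word.endswith("in'"):
        return "VBG"              # gerunds, including lyric spellings like "bouncin'"
    if word.endswith("ly"):
        return "RB"               # adverbs
    if word[:1].isupper():
        return "NNP"              # likely a proper name, e.g. "Colgate"
    return "NN"                   # Method 1: default everything else to noun

for w in ["headbanger", "Colgate", "bouncin'", "condemned", "wonderfully"]:
    print(w, guess_tag_for_unknown(w))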

  32. The bottom line • POS tagging is widely considered a solved problem • Statistical taggers achieve about 97% per-word accuracy - the same as inter-annotator agreement Ongoing issues • Incorrect tags affect subsequent tagging decisions - more likely to find strings of errors • OOV words (not seen in the training corpus) cause most errors • Statistical taggers are state-of-the-art but may not be as portable to new domains as rule-based taggers

  33. What is POS tagging good for? • First step for a vast number of Comp Ling tasks • Speech synthesis: • How to pronounce “lead”? • INsult vs. inSULT • OBject vs. obJECT • OVERflow vs. overFLOW • Word prediction in speech recognition, etc. • Possessive pronouns (my, your, her) likely to be followed by nouns • Personal pronouns (I, you, he) likely to be followed by verbs

  34. What is POS tagging good for? • Parsing: need to know POS in order to parse • [Figure: two alternative parse trees for “The representative put chairs on the table” - one tagging it DT JJ NN VBZ IN DT NN, the other DT NN VBD NNS IN DT NN]

  35. What is POS tagging good for? • Chunking - dividing text into noun and verb groups • E.g. rules that permit optional adverbials and adjectives in passive verb groups • Was probably sent or was sent away • Require that the tag for was be VBN or VBD (since POS taggers cannot reliably distinguish the two) • Sentiment analysis - deciding when a word reflects emotion • “I like the way you walk” • “Beat me like Betty Crocker cake mix”
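
A hedged sketch of the kind of tag-pattern rule described for passive verb groups (the regular expression is invented for illustration; real chunkers use richer grammars):

import re

# Passive verb group: was/were (VBD or VBN, since taggers confuse the two),
# an optional adverb, a past participle, and an optional trailing adverb.
PASSIVE_VG = re.compile(r"\b(VBD|VBN)( RB)? VBN( RB)?\b")

def has_passive_verb_group(tagged):
    tag_string = " ".join(tag for _, tag in tagged)
    return bool(PASSIVE_VG.search(tag_string))

print(has_passive_verb_group([("was", "VBD"), ("probably", "RB"), ("sent", "VBN")]))
print(has_passive_verb_group([("was", "VBN"), ("sent", "VBN"), ("away", "RB")]))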

  36. Sentiment detection • Opinion mining • Opinions, evaluations, emotions, speculations are private states • Find relevant words, phrases, patterns that can be used to express subjectivity • Determine the polarity of subjective expressions • Expressions directed towards an object • Emotion detection • All text reflects personal emotion • May be undirected

  37. Polarity • Focus on positive/negative and strong/weak emotions, evaluations, stances → sentiment analysis • I’m ecstatic the Steelers won! • She’s against the bill.

  38. Lexicon • Prior polarity (out of context): abhor: negative, acrimony: negative, … beautiful: positive, cheers: positive, … horrid: negative, … woe: negative, wonderfully: positive • Prior vs. contextual polarity: “Cheers to Timothy Whitfield for the wonderfully horrid visuals.”

  39. Shifting polarity • Negation - flip or intensify • Not good • Not only good but amazing • Diminishers - flip • Little truth • Lack of understanding • Word sense - shift • The old school was condemned in April • The election was condemned for being rigged
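
A hedged sketch of contextual polarity shifting on negation and diminishers (word lists and scores are invented for illustration, and intensification such as “not only good but amazing” is not handled):

# Toy prior-polarity lexicon and shifting words, for illustration only.
PRIOR = {"good": 1, "amazing": 2, "truth": 1, "understanding": 1}
SHIFTERS = {"not", "little", "lack"}   # negation and diminishers flip polarity

def contextual_polarity(phrase):
    score, flip = 0, False
    for w in phrase.lower().split():
        if w in SHIFTERS:
            flip = True
        elif w in PRIOR:
            score += -PRIOR[w] if flip else PRIOR[w]
            flip = False
    return score

print(contextual_polarity("not good"))      # prior +1 flipped to -1
print(contextual_polarity("little truth"))  # diminisher flips to -1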

  40. [Diagram: corpus + lexicon → Step 1: neutral or polar? (all instances) → Step 2: contextual polarity? (polar instances)] • What happens if we take a simple approach? Turney (2002)* • Classified reviews thumbs up/down • Sum of polarities for all words with affective meaning in a text • 74% accuracy (66-84% across various domains) *Turney, P.D. (2002). Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 417-424.
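
A hedged sketch of the “sum of word polarities” idea described on this slide (the lexicon is a toy; Turney’s actual method derived phrase polarities automatically rather than from a hand-built word list):

# Toy prior-polarity lexicon; illustrative only.
LEXICON = {"cheers": 1, "beautiful": 1, "wonderfully": 1,
           "abhor": -1, "acrimony": -1, "horrid": -1, "woe": -1}

def thumbs(review):
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in review.split())
    return "thumbs up" if score > 0 else "thumbs down"

print(thumbs("Cheers to Timothy Whitfield for the wonderfully horrid visuals."))

The prior-polarity sum calls this sarcastic review “thumbs up”, missing the contextual negativity of “wonderfully horrid” - exactly the gap the two-step (neutral-or-polar, then contextual polarity) pipeline is meant to close.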

  41. Sentiment Lexicon • Developed by Wilson et al (2005) • First two dimensions of Charles Osgood’s Theory of Semantic Differentiation • Evaluation, or prior polarity (positive/negative/both/neutral) • Potency, or reliability (strong/weak subjective) • Does not include activity (passive/active) • Over 8,000 words • Both manually and automatically identified • Positive/negative words from General Inquirer and Hatzivassiloglou and McKeown (1997) T. Wilson, J. Wiebe and P. Hoffmann (2005), “Recognizing contextual polarity in phrase-level sentiment analysis”, HLT '05, pp. 347-354. C.E. Osgood, G.J. Suci and P.H. Tannenbaum (1957), “The Measurement of Meaning”, University of Illinois Press

  42. Word entries • Adjectives [1] (7.5% of all text) • Positive: honest, important, mature, large, patient • Negative: harmful, hypocritical, inefficient, insecure • Subjective: curious, peculiar, odd, likely, probable • Verbs [2] • Positive: praise, love • Negative: blame, criticize • Subjective: predict • Nouns [2] • Positive: pleasure, enjoyment • Negative: pain, criticism • Subjective: prediction • Obscenities • Five slang words added to avoid obvious error [1] Hatzivassiloglou & McKeown 1997, Wiebe 2000, Kamps & Marx 2002, Andreevskaia & Bergler 2006 [2] Turney & Littman 2003, Esuli & Sebastiani 2006

  43. Word features • [Diagram repeated from slide 40, adding feature groups: word features, modification features, structure features, sentence features, document feature] • Example word features for “terrifies”: • Word token: terrifies • Word part-of-speech: VB • Context: that terrifies me • Lemma: terrify • Prior polarity: negative • Reliability: strongsubj

  44. Where does ebonics fit in? • Standardising ebonics terms in ‘free’ language may improve accuracy of language processing tools e.g. POS taggers • Also benefit subsequent analysis e.g. sentiment detection • Possible application to: • User-generated web content (e.g. in chat rooms) • Email • Speech • Creative writing

  45. Methodology - ebonics • Check every token against the CELEX dictionary • If not found, check the suffix • Attempt correction and re-check CELEX (no errors) • If still not found, write to ebonics lexicon • Write dictionary translation by hand • Include stress pattern for rhythm analysis • Keep number of tokens and POS the same if possible • in / in’ → ing (runnin becomes running) • a / a’ (word length > 2) → er (brotha becomes brother) • er → re (center becomes centre) • or → our (color becomes colour)
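
A hedged sketch of the suffix-normalisation rules listed above (rule order, the length check and the handling of unlisted words are assumptions; in the study these rules apply only to tokens not found in CELEX):

import re

# Suffix rewrites from the slide; applied only to tokens missing from CELEX.
RULES = [
    (re.compile(r"in'?$"), "ing"),   # runnin / runnin' -> running
    (re.compile(r"a'?$"), "er"),     # brotha -> brother (only if word length > 2)
    (re.compile(r"er$"), "re"),      # center -> centre
    (re.compile(r"or$"), "our"),     # color -> colour
]

def normalise(token):
    for pattern, repl in RULES:
        if pattern.search(token):
            if repl == "er" and len(token) <= 2:
                continue             # the a -> er rule requires word length > 2
            return pattern.sub(repl, token)
    return token

for t in ["runnin", "brotha", "center", "color"]:
    print(t, "->", normalise(t))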

  46. Methodology - features • Extract features: POS, language, sentiment, repetition • POS tags • Number of words for each UPenn tag • Number of words in tag categories {NN, VB, JJ, RB, PRP} • Language • Number of regular abbreviations (e.g. don't) • Number of frontal contractions (e.g. 'cause) • Number of modal, passive and active verbs • Average, minimum and maximum line lengths • Language formality = % formal words (NN, JJ, IN, DT) versus informal words (PRP, RB, UH, VB) • Word variety = type to token ratio
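
A hedged sketch of two of the features above, language formality and word variety (the POS tags here are supplied by hand in place of a tagger’s output; the tag sets follow the slide):

# Formality = % formal words (NN, JJ, IN, DT) versus informal words (PRP, RB, UH, VB).
FORMAL = {"NN", "JJ", "IN", "DT"}
INFORMAL = {"PRP", "RB", "UH", "VB"}

def formality(tagged_tokens):
    formal = sum(1 for _, tag in tagged_tokens if tag in FORMAL)
    informal = sum(1 for _, tag in tagged_tokens if tag in INFORMAL)
    return formal / max(formal + informal, 1)

def word_variety(tokens):
    return len(set(tokens)) / len(tokens)   # type-to-token ratio

tagged = [("gimme", "VB"), ("all", "DT"), ("your", "PRP$"), ("lovin", "NN"),
          ("gimme", "VB"), ("all", "DT"), ("your", "PRP$"), ("lovin", "NN")]
print(formality(tagged))                      # about 0.67: four formal vs two informal tokens
print(word_variety([w for w, _ in tagged]))   # 0.5: the repeated line halves the type-to-token ratio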

  47. Methodology - features • Extract features: POS, language, sentiment, repetition • Sentiment • Number of words strong/weak positive • Number of words strong/weak negative • Repetition • Number of phrase repetitions • Phrase length • Number of lines repeated in full • Number of unique (non-repeated) lines • Combination • Combined features from above categories, edited to remove correlations

  48. Methodology - corpus size • Subsequent analyses use neural networks that prefer more data • But ebonics lexicon is labor intensive • Want to observe impact of reduced data set • Standardise ebonics for approximately half the data (1037 songs out of 2392)

  49. Ebonics and POS • Stability within major tag categories • Nouns shift most: NN to NNP / NNPS to NNP • High frequency of named entities • “be my Yoko Ono”, “served like Smuckers jam” • Some shift from NN to VBG/VBZ tags • “what's happ'ning” • Increase in comparative adjectives • “a doper flow” • Ebonics does not appear to be important for POS tagging unless text is from a traditional ebonics source

  50. Results
