CS4705: Corpus Linguistics and Machine Learning Techniques
Review • What have we covered so far? • Words (stems and affixes, roots and templates, …) • Ngrams (simple word sequences) • POS tags (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)
Some Additional Things We Could Find • Named Entities • Persons • Company Names • Locations • Dates
What useful things can we do with this knowledge? • Find sentence boundaries, abbreviations • Find Named Entities (person names, company names, telephone numbers, addresses,…) • Find topic boundaries and classify articles into topics • Identify a document’s author and their opinion on the topic, pro or con • Answer simple questions (factoids) • Do simple summarization/compression
But first, we need corpora… • Online collections of text and speech • Some examples • Brown Corpus • Wall Street Journal and AP News • ATIS, Broadcast News • TDT • Switchboard, Call Home • TRAINS, FM Radio, BDC Corpus • Hansard parallel corpus of French and English • And many private research collections
Next, we pose a question…the dependent variable • Binary questions: • Is this word followed by a sentence boundary or not? • A topic boundary? • Does this word begin a person name? End one? • Should this word or sentence be included in a summary? • Classification: • Is this document about medical issues? Politics? Religion? Sports? … • Predicting continuous variables: • How loud or high should this utterance be produced?
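The binary framing above can be made concrete. A minimal sketch, assuming tokenized text with punctuation split into separate tokens (the function name and label scheme are illustrative, not from the slides):

```python
def boundary_labels(tokens):
    """Toy dependent variable for the binary question above:
    label each token 1 if it is followed by a sentence boundary, else 0."""
    enders = {".", "?", "!"}
    return [1 if i + 1 < len(tokens) and tokens[i + 1] in enders else 0
            for i in range(len(tokens))]

# Each token gets a yes/no answer to "is this word followed by a boundary?"
labels = boundary_labels(["It", "rained", "."])
```

Classification and regression differ only in the type of the dependent variable (a category label or a real number) attached to each instance.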
Finding a suitable corpus and preparing it for analysis • Which corpora can answer my question? • Do I need to get them labeled to do so? • Dividing the corpus into training and test corpora • To develop a model, we need a training corpus • overly narrow corpus: doesn’t generalize • overly general corpus: doesn’t reflect the task or domain • To demonstrate how general our model is, we need a test corpus to evaluate the model • Development test set vs. held-out test set • To evaluate our model we must choose an evaluation metric • Accuracy • Precision, recall, F-measure,… • Cross validation
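The metrics and the cross-validation idea above can be sketched in a few lines. The counts in the usage example are made up for illustration:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def kfold_indices(n_items, k):
    """Index splits for k-fold cross validation:
    each fold serves exactly once as the test set."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
p, r, f = prf(40, 10, 20)
```

F-measure is the harmonic mean of precision and recall, so a system cannot score well by maximizing one at the expense of the other.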
Then we build the model… • Identify the dependent variable: what do we want to predict or classify? • Does this word begin a person name? Is this word within a person name? • Is this document about sports? The weather? International news? ??? • Identify the independent variables: what features might help to predict the dependent variable? • What is this word’s POS? What is the POS of the word before it? After it? • Is this word capitalized? Is it followed by a ‘.’? • Does ‘hockey’ appear in this document? • How far is this word from the beginning of its sentence? • Extract the values of each variable from the corpus by some automatic means
An Example: SCANMail (Finding Caller Names in Voicemail) • Motivated by interviews, surveys and usage logs of heavy users: • Hard to scan new msgs to find those you need to deal with quickly • Hard to find msg you want in archive • Hard to locate information you want in any msg • How could we help?
SCANMail Architecture [diagram: messages flow from the Caller through SCANMail to the Subscriber]
Corpus Collection • Recordings collected from 138 AT&T Labs employees’ mailboxes • 100 hours; 10K msgs; 2500 speakers • Gender balanced; 12% non-native speakers • Mean message duration 36.4 secs, median 30.0 secs • Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telephone numbers) • Also recognized using an ASR engine
Transcription and Bracketing [ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2 actually offered to take J home with her and then would she
would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
SCANMail Demo http://www.avatarweb.com/scanmail/ Audix extension: demo Audix password: (null)
Information Extraction (Martin Jansche and Steve Abney) • Goals: extract key information from msgs to present in headers • Approach: • Supervised learning from transcripts (phone #’s, caller self-ids) • Combine Machine Learning techniques with simpler alternatives, e.g. hand-crafted rules • Two-stage approaches
Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)
Telephone Number Identification • Rules convert all numbers to standard digit format • Predict start of phone number with rules • This step over-generates • Prune with decision-tree classifier • Best features: • Position in msg • Lexical cues • Length of digit string • Performance: • .94 F on human-labeled transcripts • .95 F on ASR transcripts
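The over-generate-then-prune pattern above can be sketched with a permissive regular expression for stage one and a digit-length filter for stage two. The length filter is a hand-rolled stand-in for the decision-tree classifier in the actual system, and the accepted lengths (7 or 10 digits) are an assumption for illustration:

```python
import re

# Stage 1 (deliberately over-generates): any run of 7-11 digits,
# optionally separated by spaces, dashes, or dots, is a candidate.
CANDIDATE = re.compile(r"(?:\d[\s\-.]?){7,11}")

def candidates(text):
    """Propose phone-number candidates; precision is sacrificed here."""
    return [m.group().strip() for m in CANDIDATE.finditer(text)]

def prune(cands):
    """Stage 2: keep candidates whose digit count looks like a phone
    number (7 or 10 digits -- an illustrative rule, standing in for
    the decision-tree classifier described above)."""
    return [c for c in cands if len(re.sub(r"\D", "", c)) in (7, 10)]

hits = prune(candidates("call me back at 973 360 8000 thanks"))
```

Splitting recall-oriented candidate generation from precision-oriented pruning lets each stage stay simple, which is the point of the two-stage design.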
Caller Self-Identifications • Predict start of id with classifier • 97% of ids begin 1-7 words into msg • Then predict length of phrase • Majority are only 2-4 words long • Avoid risk of relying on correct speech recognition for names • Best cues to end of phrase are a few common words • ‘I’, ‘could’, ‘please’ • No actual names: they over-fit the data • Performance • .71 F on human-labeled • .70 F on ASR
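The length-prediction step above can be approximated with the cue words the slides mention. This is a rule-based sketch of the idea, not the actual classifier; the start index is assumed to come from the first-stage predictor:

```python
# Common words that tend to immediately follow a self-id, per the slides.
# Actual names are deliberately excluded: they over-fit the data.
CUE_WORDS = {"i", "could", "please"}

def id_end(tokens, start, max_len=4):
    """Given a predicted start index for the self-id, extend the phrase
    until a cue word or the length cap (most ids are only 2-4 words).
    A hand-rolled stand-in for the length classifier described above."""
    end = start
    while end < len(tokens) and end - start < max_len:
        if tokens[end].lower() in CUE_WORDS:
            break
        end += 1
    return end

tokens = "hi it's me could you call back".split()
# tokens[1:id_end(tokens, 1)] recovers the hypothesized self-id phrase
```

Keying the end-of-phrase decision to a handful of frequent function words also sidesteps ASR errors on names, which is why performance degrades so little from human transcripts (.71 F) to ASR output (.70 F).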