URL: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/
CS506/606: Text Normalization
Richard Sproat, Steven Bedrick
TA: Emily Tucker-Prud’hommeaux
Fall 2011
Introduction
Course Outline
• This course will consist of a combination of
  • a (few) lectures,
  • discussion of papers from the literature,
  • a lab component in which the class, as a team, will build a set of modules for text normalization using the Thrax open-source finite-state grammar toolkit.
• Most classes will combine discussion of the reading with discussion of progress on the project.
Text Normalization
• Conversion of text that includes ‘non-standard’ words like numbers, abbreviations, misspellings . . . into normal words.
  • Abbreviation expansion (including novel abbreviations)
  • Expansion of numbers into ‘number names’
  • Correction of misspellings
  • Disambiguation in cases where there is ambiguity
Where is normalization needed?
• Very little in cases like this:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’ So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
Where is normalization needed?
• A lot in cases like this: [figure: example of text needing heavy normalization]
Humans are pretty good at this: can you read this?
f u cn rd ths thn u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
How about this?
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
Or this?
Goccdrnia to a hscheearcr at Emabrigdc Yinervtisu, it teosn’d rttaem in tahw rredo the stteerl in a drow are, the ylno tprmoetni gihnt is taht the trisf and tsal rtteel be at the tghir eclap. The tser can be a lotat ssem and you can litls daer it touthiw morbelp. Siht is ecuseab the nuamh dnim seod not daer yrvee rtetel by fstlei, but the drow as a elohw.
Two components of text normalization
• Given a string of characters in a text, what is the (reasonable) set of possible actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
An illustration
• He has 123 goats.
• Lotus 123 for Windows
• I live at 123 King Avenue.
Two components of text normalization
• A component that gives you the set of possibilities:
  • 123 = one hundred (and) twenty three
  • 123 = one twenty three
  • 123 = one two three
• A component that tells you which one(s) are appropriate to a particular context.
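
The two components can be sketched in a few lines of Python. This is only a toy illustration, not the course's Thrax implementation: the function names `candidates` and `choose`, the cue lists `STREET_WORDS` and `PRODUCT_WORDS`, and the three-digit restriction are all assumptions made for the example.

```python
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen sixteen "
         "seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

# Hypothetical context cues; a real system would learn or curate these.
STREET_WORDS = {"avenue", "street", "road"}
PRODUCT_WORDS = {"lotus"}


def below_100(n):
    """Name a number in 0..99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, unit = divmod(n, 10)
    return TENS[tens] if unit == 0 else f"{TENS[tens]} {ONES[unit]}"


def below_1000(n):
    """Name a number in 0..999 as a cardinal (without 'and')."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return below_100(rest)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {below_100(rest)}"


def candidates(digits):
    """Component 1: the set of readings considered for a 3-digit string."""
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError("this sketch only handles three-digit strings")
    return {
        "cardinal": below_1000(int(digits)),
        "address": f"{ONES[int(digits[0])]} {below_100(int(digits[1:]))}",
        "digits": " ".join(ONES[int(d)] for d in digits),
    }


def choose(tokens, i):
    """Component 2: pick one reading of tokens[i] from its neighbours."""
    readings = candidates(tokens[i])
    following = {t.lower().strip(".") for t in tokens[i + 1:i + 3]}
    if following & STREET_WORDS:
        return readings["address"]
    if i > 0 and tokens[i - 1].lower() in PRODUCT_WORDS:
        return readings["digits"]
    return readings["cardinal"]
```

With these toy rules, `choose("I live at 123 King Avenue.".split(), 3)` selects the address-style reading, while a sentence with no cue word falls back to the cardinal reading. The hand-written rules stand in for whatever disambiguation model is actually used.
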
A concrete example of finite-state methods in text normalization: digit to number name translation
• Factor digit string:
  • 123 → 1 · 10² + 2 · 10¹ + 3
• Translate factors into number names:
  • 10² → hundred
  • 2 · 10¹ → twenty
  • 1 · 10¹ + 3 → thirteen
• Languages vary on how extensive these lexicons are. Some (e.g. Chinese) have very regular (hence very simple) number name systems; others (e.g. Urdu/Hindi) have a large set of number names with a name for almost every number from 1 to 100.
• Each of these steps can be accomplished with FSTs.
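
The two steps can be mimicked with ordinary Python functions. This is a minimal sketch for numbers below 1000, using plain functions rather than FSTs; the names `factor` and `name_factors` and the (digit, power) pair representation are assumptions made for illustration.

```python
UNITS = ["", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def factor(digits):
    """Step 1: '123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3."""
    width = len(digits)
    return [(int(d), width - 1 - i) for i, d in enumerate(digits)]


def name_factors(factors):
    """Step 2: translate the factors of a number below 1000 into names."""
    coeff = {power: digit for digit, power in factors}
    hundreds, tens, units = (coeff.get(p, 0) for p in (2, 1, 0))
    words = []
    if hundreds:
        words += [UNITS[hundreds], "hundred"]
    if tens == 1:
        # 1 * 10^1 + u has its own lexical entry (ten ... nineteen).
        words.append(TEENS[units])
    else:
        if tens:
            words.append(TENS[tens])
        if units:
            words.append(UNITS[units])
    return " ".join(words) or "zero"
```

For example, `name_factors(factor("123"))` yields "one hundred twenty three". The small lookup lists play the role of the lexicon mentioned above; a language such as Urdu/Hindi would need a much larger table for 1 to 100.
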
Urdu (Hindi) Number Names
Digit string factoring transducer (fragment)
Germanic “decade flop”
• 24 → vier und zwanzig
70’s
Digit-string to number name translation: German
• Factor digit string:
  • 123 → 1 · 10² + 2 · 10¹ + 3
• Flip decades and units:
  • 2 · 10¹ + 3 → 3 + 2 · 10¹
• Translate factors into number names:
  • 10² → hundert
  • 2 · 10¹ → zwanzig
  • 1 · 10¹ + 3 → dreizehn
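
The German pipeline can be sketched the same way, with the flip as an explicit middle step. This is an illustrative sketch for numbers below 1000, written as plain Python rather than as FSTs; the names `factor`, `flip`, `to_words` and `german_name` are assumptions, and the output is written as one word with "einhundert" for 1 · 10².

```python
UNITS = ["", "ein", "zwei", "drei", "vier", "fünf",
         "sechs", "sieben", "acht", "neun"]
TEENS = ["zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
         "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig",
        "sechzig", "siebzig", "achtzig", "neunzig"]


def factor(digits):
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3."""
    width = len(digits)
    return [(int(d), width - 1 - i) for i, d in enumerate(digits)]


def flip(factors):
    """Decade flop: put the units factor before the tens factor.

    Applies only when tens >= 2 and units >= 1; the teens and the
    round decades are single lexical items and are left alone.
    """
    out = list(factors)
    if len(out) >= 2:
        (tens, _), (units, _) = out[-2], out[-1]
        if tens >= 2 and units >= 1:
            out[-2], out[-1] = out[-1], out[-2]
    return out


def to_words(factors):
    """Translate (possibly flipped) factors into a German number name."""
    coeff = {power: digit for digit, power in factors}
    flipped = [power for _, power in factors][-2:] == [0, 1]
    hundreds, tens, units = (coeff.get(p, 0) for p in (2, 1, 0))
    words = []
    if hundreds:
        words.append(UNITS[hundreds] + "hundert")
    if tens == 1:
        words.append(TEENS[units])
    elif flipped:
        words += [UNITS[units], "und", TENS[tens]]
    elif tens:
        words.append(TENS[tens])
    elif units:
        # A bare final 1 is "eins"; before "und" or "hundert" it is "ein".
        words.append("eins" if units == 1 else UNITS[units])
    return "".join(words) or "null"


def german_name(digits):
    return to_words(flip(factor(digits)))
```

Here `flip(factor("123"))` reorders the factors to 1 · 10² + 3 + 2 · 10¹, and `german_name("123")` then reads them off left to right, inserting "und" between the unit and the decade.
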
German number grammar (fragment)
Concrete example from English
• Consider a machine that maps between digit strings and their reading as number names in English.
• 30,294,005,179,018,903.56 → thirty quadrillion, two hundred and ninety four trillion, five billion, one hundred seventy nine million, eighteen thousand, nine hundred three, point five six
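
For comparison, the same mapping can be approximated procedurally by splitting the integer part into groups of three digits and attaching a scale word to each group. This is a rough sketch, not the transducer from the slide: `read_group` and `read_number` are hypothetical names, the sketch never inserts "and", it stops at quadrillions, and it puts no comma before "point".

```python
ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def read_group(n):
    """Name a number in 1..999."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, unit = divmod(rest, 10)
        words.append(TENS[tens])
        if unit:
            words.append(ONES[unit])
    elif rest:
        words.append(ONES[rest])
    return " ".join(words)


def read_number(text):
    """Read a digit string with optional thousands commas and decimals."""
    integer, _, fraction = text.replace(",", "").partition(".")
    # Split the integer part into three-digit groups, least significant first.
    groups = []
    while integer:
        groups.append(int(integer[-3:]))
        integer = integer[:-3]
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    parts = [f"{read_group(g)} {SCALES[i]}".strip()
             for i, g in enumerate(groups) if g]
    result = ", ".join(reversed(parts)) or "zero"
    if fraction:
        # Digits after the decimal point are read one at a time.
        result += " point " + " ".join(ONES[int(d)] for d in fraction)
    return result
```

Calling `read_number("30,294,005,179,018,903.56")` produces a reading with the same group structure as the one above. Groups that are all zeros are skipped, which is why no scale word is read for an empty group.
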
566 states and 1492 arcs
NSW (non-standard word) Classification
Introduction to Thrax
• The OpenGrm Thrax tools compile grammars expressed as regular expressions and context-dependent rewrite rules into weighted finite-state transducers. They use functionality in the OpenFst library to create, access and manipulate the compiled grammars. Thrax is named after Dionysius Thrax (Διονύσιος ὁ Θρᾷξ) (170 BC – 90 BC), the reputed first Greek grammarian.
• http://www.openfst.org/twiki/bin/view/GRM/Thrax
Reading Assignment
• Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001.