
A Random Text Model for the Generation of Statistical Language Invariants






Presentation Transcript


  1. A Random Text Model for the Generation of Statistical Language Invariants Chris Biemann, University of Leipzig, Germany HLT-NAACL 2007, Rochester, NY, USA Monday, April 23, 2007

  2. Outline • Previous random text models • Large-scale measures for text • A novel random text model • Comparison to natural language text

  3. Necessary property: Zipf's Law • Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word at rank r and its rank is given by f(r) ~ r^(-z), where z is the exponent of the power law, corresponding to the slope of the curve in a log-log plot. For word frequencies in NL, z ≈ 1 • Zipf-Mandelbrot: f(r) ~ (r+c1)^(-(1+c2)): approximates lower frequencies for very high ranks
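
A minimal sketch (not part of the presentation; function and variable names are mine) of how the Zipf exponent z can be estimated from a corpus: fit a straight line to the rank-frequency curve in log-log space, whose negative slope is z.

```python
from collections import Counter
import math

def zipf_exponent(tokens):
    """Estimate z in f(r) ~ r^(-z) by a least-squares fit in log-log space."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den  # slope of log f vs. log r is -z

# e.g.: tokens = open("corpus.txt").read().split(); print(zipf_exponent(tokens))
```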

  4. Previous Random Text Models B. B. Mandelbrot (1953) • Sometimes called the “monkey at the typewriter” • With a probability w, a word separator is generated at each step, • with probability (1-w)/N, a letter from an alphabet of size N is generated H. A. Simon (1955) • No alphabet of single letters • At each time step, a previously unseen new word is added to the stream with a probability α, whereas with probability (1-α) the next word is chosen amongst the words at previous positions. • This yields a frequency distribution that follows a power law with exponent z = (1-α). • Modified by Zanette and Montemurro (2002): sublinear growth for higher exponents; Zipf-Mandelbrot law via a maximum probability threshold
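
As a point of reference, a minimal sketch of the Mandelbrot “monkey at the typewriter” model described above (parameter and function names are mine): at every step a word separator is emitted with probability w, otherwise one of N equiprobable letters.

```python
import random
import string

def mandelbrot_text(n_tokens, w=0.2, alphabet=string.ascii_uppercase):
    """Generate n_tokens words by the monkey-at-the-typewriter process."""
    words, current = [], []
    while len(words) < n_tokens:
        if random.random() < w:          # separator with probability w
            if current:                  # skip empty words from repeated separators
                words.append("".join(current))
                current = []
        else:                            # otherwise a uniformly random letter
            current.append(random.choice(alphabet))
    return words

print(" ".join(mandelbrot_text(20)))
```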

  5. Critique of Previous Models • Mandelbrot: All words of the same length are equiprobable, as all letters are equiprobable. Ferrer i Cancho and Solé (2002): initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from? • Simon: No concept of “letter” at all. • Both: • no concept of sentence • no word order restrictions: Simon = bag of words, Mandelbrot does not take the generated stream into account at all

  6. Large-scale Measures for Text • Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with z ≈ 1; the frequency spectrum (probability of frequencies) should follow a power law with z ≈ 2 (Pareto distribution) • Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004) • Sentence length: should also be distributed as in NL, following the same gamma distribution • Significant neighbour-based co-occurrence graph: should be similar to NL in terms of degree distribution and connectivity

  7. A Novel Random Text Model Two parts: • Word Generator • Sentence Generator Both follow the principle of beaten tracks: • memorize what has been generated before • generate with higher probability what has been generated more often before Inspired by Small World network generation, especially (Kumar et al. 1999).

  8. Word Generator • Initialisation: • Letter graph of N letters. • Every vertex is connected to itself with weight 1. • Choice: • When generating a word, the generator chooses a letter x according to its probability P(x), computed as the normalized weight sum of its outgoing edges: P(x) = W(x) / Σ_y W(y), with W(x) = Σ_z weight(x→z). • Parameter: • At every position, the word ends with a probability w ∈ (0,1) or generates a next letter according to the letter production probability given above. • Update: • For every letter bigram, the weight of the directed edge between the preceding and the current letter in the letter graph is increased by one. • Effect: self-reinforcement of letter probabilities: • the more often a letter is generated, the higher its weight sum in subsequent steps, • leading to an increased generation probability.
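
A minimal sketch of the word generator as read from this slide (not the author's code; class and parameter names are mine). Every letter is drawn in proportion to the normalized weight sum of its outgoing edges, and after each word the letter-bigram edges are reinforced by one.

```python
import random
import string

class WordGenerator:
    def __init__(self, alphabet=string.ascii_uppercase, w=0.4):
        self.w = w                                   # word-end probability
        # letter graph: weight[x][y] = weight of directed edge x -> y;
        # initially every letter is connected to itself with weight 1
        self.weight = {x: {x: 1} for x in alphabet}

    def _draw_letter(self):
        # P(x) is the normalized weight sum of x's outgoing edges
        sums = {x: sum(out.values()) for x, out in self.weight.items()}
        r = random.uniform(0, sum(sums.values()))
        for x, s in sums.items():
            r -= s
            if r <= 0:
                return x
        return x

    def word(self):
        # at least one letter per word in this sketch; after each letter the
        # word ends with probability w, i.e. continues with probability (1 - w)
        letters = [self._draw_letter()]
        while random.random() >= self.w:
            letters.append(self._draw_letter())
        for a, b in zip(letters, letters[1:]):       # reinforce the beaten tracks
            self.weight[a][b] = self.weight[a].get(b, 0) + 1
        return "".join(letters)

gen = WordGenerator()
print(" ".join(gen.word() for _ in range(10)))
```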

  9. Word Generator Example The small numbers next to the edges are edge weights. The probabilities of the letters for the next step are P(A)=0.4, P(B)=0.4, P(C)=0.2

  10. Measures on the Word Generator • The word generator fulfills these measures much better than the Mandelbrot model. • For the other measures, we need something extra...

  11. Sentence Generator I • Initialisation: • The word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS. • Word graph (directed): • vertices correspond to words • edge weights correspond to the number of times two words were generated in sequence. • Generation: • A random walk on the directed edges starts at the BOS vertex. • With probability (1-s), i.e. when no new word is generated, an existing edge is followed from the current vertex to the next vertex; • the probability of choosing endpoint X among the endpoints of all outgoing edges of the current vertex C is given by P(X | C) = weight(C→X) / Σ_Y weight(C→Y), where Y ranges over the endpoints of C's outgoing edges.

  12. Sentence Generator II • Parameter: • With probability s ∈ (0,1), a new word is generated by the word generator model; • the next word is then chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as successor of a newly generated word N is given by P(E) = W_in(E) / Σ_V W_in(V), where W_in(V) is the sum of the weights of V's incoming edges. • Update: • For each sequence of two words generated, the weight of the directed edge between them is increased by 1
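
A minimal sketch of the sentence generator from slides 11 and 12, under my reading of the choice and update rules (not the author's code). It reuses the WordGenerator sketch from slide 8; the BOS/EOS handling and the indegree rule for successors of fresh words follow the description above.

```python
import random

BOS, EOS = "<BOS>", "<EOS>"

class SentenceGenerator:
    def __init__(self, word_generator, s=0.08):
        self.word_generator = word_generator
        self.s = s                              # new-word probability
        self.out = {BOS: {EOS: 1}, EOS: {}}     # out[v][x]: weight of edge v -> x
        self.indeg = {BOS: 0, EOS: 1}           # weighted indegree of every vertex

    def _draw(self, weights):
        """Pick a key of `weights` with probability proportional to its value."""
        r = random.uniform(0, sum(weights.values()))
        for v, w in weights.items():
            r -= w
            if r <= 0:
                return v
        return v

    def sentence(self):
        tokens, current = [], BOS
        while True:
            if random.random() < self.s:
                # with probability s: splice in a word from the word generator
                nxt = self.word_generator.word()
                self.out.setdefault(nxt, {})
                self.indeg.setdefault(nxt, 0)
            elif self.out[current]:
                # otherwise follow an existing outgoing edge, weighted by its count
                nxt = self._draw(self.out[current])
            else:
                # current is a fresh word without outgoing edges yet: choose its
                # successor in proportion to weighted indegree (slide 12)
                nxt = self._draw(self.indeg)
            # update: reinforce the edge between the two generated tokens
            self.out[current][nxt] = self.out[current].get(nxt, 0) + 1
            self.indeg[nxt] = self.indeg.get(nxt, 0) + 1
            if nxt == EOS:
                return tokens               # may be empty; empty sentences are dropped
            tokens.append(nxt)
            current = nxt

# usage, assuming the WordGenerator sketch from slide 8 is in scope:
# gen = SentenceGenerator(WordGenerator(w=0.4), s=0.08)
# print(" ".join(gen.sentence()))
```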

  13. Sentence Generator Example • In the last step, the second CA was generated as a new word from the word generator. • The generation of empty sentences happens frequently. These are omitted in the output.

  14. Comparison to Natural Language • Corpus for comparison: the first 1 million words of the BNC, spoken English. • 26 letters, uppercase, punctuation removed → same setting in the word generator • 125,395 sentences → set s=0.08, remove the first 50K sentences • Average sentence length: 7.975 words • Average word length: 3.502 letters → w=0.4 • Example from the preprocessed corpus: OOH OOH ERM WOULD LIKE A CUP OF THIS ER MM SORRY NOW THAT S NO NO I DID NT I KNEW THESE PEWS WERE HARD OOH I DID NT REALISE THEY WERE THAT BAD I FEEL SORRY FOR MY POOR CONGREGATION
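
A hypothetical end-to-end run tying the earlier sketches together with the parameter settings from this slide (w = 0.4, s = 0.08, first 50K sentences discarded); exact numbers will vary between runs and will differ from the BNC figures.

```python
# assumes the WordGenerator, SentenceGenerator and zipf_exponent sketches
# from the earlier code blocks are in scope
word_gen = WordGenerator(w=0.4)
sent_gen = SentenceGenerator(word_gen, s=0.08)
sentences = [s for s in (sent_gen.sentence() for _ in range(125395)) if s]  # drop empty sentences
sentences = sentences[50000:]                    # discard the first 50K sentences
tokens = [t for sent in sentences for t in sent]
print("avg. sentence length:", len(tokens) / len(sentences))
print("avg. word length:", sum(map(len, tokens)) / len(tokens))
print("Zipf exponent z:", zipf_exponent(tokens))
```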

  15. Word Frequency • Zipf-Mandelbrot distribution • Smooth curve • Similar to English

  16. Word Length • More 1-letter words in the sentence generator • Longer words in the sentence generator • Curve is similar • Gamma distribution here: f(x) ~ x^1.5 · 0.45^x

  17. Sentence Length • Longer sentences in English • More 2-word sentences in English • Curve is similar

  18. Neighbor-based Co-occurrence Graph • Min. co-occurrence frequency = 2, min. log-likelihood ratio = 3.84 • The NB-graph is a small world • Qualitatively, English and the sentence generator are similar • The word generator shows far fewer co-occurrences • About a factor of 2 difference in clustering coefficient and number of vertices
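
A minimal sketch (mine, not from the paper) of the significance filter named on this slide: neighbouring word pairs enter the graph only if they co-occur at least twice and their log-likelihood ratio exceeds 3.84. The LLR here is Dunning's test on bigram counts; the exact counting scheme in the paper may differ.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    def term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0
    return 2 * (term(k11, row1 * col1 / n) + term(k12, row1 * col2 / n) +
                term(k21, row2 * col1 / n) + term(k22, row2 * col2 / n))

def neighbour_graph(sentences, min_freq=2, min_llr=3.84):
    """Edges between words that follow each other significantly often."""
    left, right, pair = Counter(), Counter(), Counter()
    n = 0
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            left[a] += 1
            right[b] += 1
            pair[(a, b)] += 1
            n += 1
    edges = []
    for (a, b), k11 in pair.items():
        if k11 < min_freq:
            continue
        k12 = left[a] - k11           # a followed by something other than b
        k21 = right[b] - k11          # b preceded by something other than a
        k22 = n - k11 - k12 - k21     # all remaining bigrams
        if llr(k11, k12, k21, k22) >= min_llr:
            edges.append((a, b))
    return edges

# e.g.: neighbour_graph(sentences, min_freq=2, min_llr=3.84)
```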

  19. Formation of Sentences • The word graph grows and, at every time step, contains the full vocabulary generated so far. • Random walks starting from BOS always end in EOS. • Sentence length slowly increases: the random walk has more possibilities before finally arriving at the EOS vertex. • Sentence length is influenced by both parameters of the model: • the word-end probability w in the word generator • the new-word probability s in the sentence generator.

  20. Conclusion Novel random text model • obeys Zipf's law • obeys the word length distribution • obeys the sentence length distribution • shows similar neighbour-based co-occurrence data First model that: • produces a smooth lexical spectrum without initial letter probabilities • incorporates the notion of a sentence • models word order restrictions

  21. Sentence generator at work Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF . XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U . G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U . RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF . R . Z U . Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC . G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY . FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ. OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN . TA KV XJP O EGV J HQY KMQ U .

  22. Questions? Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte dank we u trew wel wwd muchas werwe ewr gracias werwe rew merci mille werew re ew ee ew grazie d fsd ffs df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm
