Evidence Reinforcement for “a disponer de los”  “to leave the”

[Figure: “a disponer de los” .95]
Yuval Marton, Dissertation Defense

Complexity
• Suffix array (Manber and Myers 1993)
  • All suffixes of the text in lexicographic order
• Phrase pattern matching implementation (Lopez 2007)
  • Find all occurrences of phrase phr in text T: O(|phr| + log|T|)
    • Can be done in optimal O(m), where m = |phr| (Abouelhoda et al. 2004)
  • Find all occurrences of gapped phrase LXR in text T: O(|L|+|R|+log|T| + c(L)c(R))
    • Can be improved with a stratified tree: O(|L|+|R|+log|T| + (|L|+|R|)loglog|T|) (van Emde Boas et al. 1977)
    • Or with a hash table: O(|L|+|R|+log|T| + c(L)+c(R)) (Lopez 2007)
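The suffix-array lookup on this slide can be illustrated with a minimal Python sketch. It is not the implementation of Lopez (2007): the function names are hypothetical, construction is a naive sort, and the plain binary search costs O(|phr| log|T|) rather than the O(|phr| + log|T|) bound quoted above, which needs additional longest-common-prefix information.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction (sorting whole suffixes); fine for a toy text only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, sa, phrase):
    """Return all start positions of `phrase` in `tokens`, in text order."""
    m = len(phrase)

    # Lower bound: first suffix whose length-m prefix is >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid

    # All matches are contiguous in the suffix array.
    return sorted(sa[start:lo])


tokens = "a b a b c a b".split()
sa = build_suffix_array(tokens)
print(find_occurrences(tokens, sa, ["a", "b"]))
```

Because every suffix starting with the phrase sits in one contiguous block of the array, two binary searches are enough to delimit all c(phr) occurrences.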

Complexity II
• Find all paraphrases of phr:
  • Find all occurrences of phr: O(|phr| + log|T|)
  • Find all adequate contexts L_R: O(c(phr))
    • Sample phr if too frequent: c(phr) > st = 10,000
  • Find all occurrences of each LXR: O(|L|+|R|+log|T| + c(L)+c(R))
    • Choose L, R long enough so that max(c(L),c(R)) < mcc = 2000. In practice, it will be rare that max(|L|,|R|) > 5
    • There are up to c(phr) such L_R contexts, each occurring c(LXR) times < max(c(L),c(R)). Total occurrences of contexts: c(phr)·max(c(L),c(R))
  • Compare all X, phr
  • Total: O(|phr| + log|T| + c(phr) + c(phr)[|L|+|R|+log|T| + c(L)+c(R)] + c(phr)·max(c(L),c(R)))
  • With limits: O(|phr| + st[log|T| + mcc])
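The candidate-collection steps above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the engine described in the slides: it scans the text linearly instead of using a suffix array, uses fixed-length contexts (`ctx_len`) rather than growing L and R until max(c(L),c(R)) < mcc, truncates instead of sampling when c(phr) > st, and leaves out the final "compare all X, phr" similarity step. The names `find_positions`, `paraphrase_candidates`, `ctx_len` and `max_gap` are hypothetical.

```python
from collections import Counter


def find_positions(tokens, phrase):
    """Linear scan standing in for the suffix-array lookup."""
    m = len(phrase)
    return [i for i in range(len(tokens) - m + 1) if tokens[i:i + m] == phrase]


def paraphrase_candidates(tokens, phrase, ctx_len=1, max_gap=3, st=10_000):
    """Count phrases X that fill some context L_R in which `phrase` occurs."""
    m = len(phrase)
    # Step 1: occurrences of phr (truncated at st; the slides sample instead).
    occurrences = find_positions(tokens, phrase)[:st]

    # Step 2: collect the contexts L_R around each occurrence.
    contexts = set()
    for i in occurrences:
        if i - ctx_len >= 0 and i + m + ctx_len <= len(tokens):
            left = tuple(tokens[i - ctx_len:i])
            right = tuple(tokens[i + m:i + m + ctx_len])
            contexts.add((left, right))

    # Step 3: for each context, find every gapped match L X R.
    counts = Counter()
    for left, right in contexts:
        for l_pos in find_positions(tokens, list(left)):
            x_start = l_pos + len(left)
            for gap in range(1, max_gap + 1):
                r_pos = x_start + gap
                if tokens[r_pos:r_pos + len(right)] == list(right):
                    x = tuple(tokens[x_start:r_pos])
                    if list(x) != phrase:
                        counts[x] += 1
    return counts


text = "we want to leave the room . we want to exit the room .".split()
print(paraphrase_candidates(text, ["leave"]))
```

In the toy text, the context "to _ the" found around "leave" is also filled by "exit", which therefore becomes a candidate to be scored in the comparison step.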

Comparison with Pivoting Method
• Resource availability:
  • Pivoting uses parallel text, which is limited.
  • Distributional paraphrasing uses monolingual text, which is abundant.
• Top candidate challenges:
  • Pivoting: function words (due to “promiscuity” + frequency?); translational shift (due to double translation step)
  • Distributional: antonyms, co-hypernyms
• Semantic similarity measure:
  • Pivoting: probability p(f1|e) p(e|f2)
  • Distributional: vector similarity. Can plug in any similarity measure
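The two similarity measures contrasted above can be placed side by side in a small Python sketch. It assumes that the pivot product p(f1|e) p(e|f2) is summed over pivot phrases e, as in the usual pivoting formulation, and it uses cosine over sparse context-count vectors as one possible pluggable measure. All probabilities, counts and function names below are made up for illustration.

```python
import math


def pivot_score(f1, f2, p_f_given_e, p_e_given_f):
    """Sum over pivots e of p(f1|e) * p(e|f2) (summation is an assumption)."""
    total = 0.0
    for e, p_e in p_e_given_f.get(f2, {}).items():
        total += p_f_given_e.get(e, {}).get(f1, 0.0) * p_e
    return total


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def distributional_score(ctx1, ctx2, similarity=cosine):
    """Any similarity function over context vectors can be plugged in."""
    return similarity(ctx1, ctx2)


# Toy translation tables (invented numbers).
p_f_given_e = {"e1": {"f1": 0.5}, "e2": {"f1": 0.25}}
p_e_given_f = {"f2": {"e1": 0.5, "e2": 0.5}}
print(pivot_score("f1", "f2", p_f_given_e, p_e_given_f))

# Toy context-count vectors (invented numbers).
print(distributional_score({"a": 1, "b": 1}, {"a": 1}))
```

The pivot score needs translation tables estimated from parallel text, whereas the distributional score needs only context counts from monolingual text, which mirrors the resource contrast on this slide.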

Unified Model (Soft Semantic Constraints)
• Semantic distance of word e in sense s from word e′ in sense s′: cos(e_s, e′_s′) = [formula missing]
• where the formula is written in terms of fSense(e, w_i) and fWord(e, w_i)
• Labels on the terms of the formula: ≈ cos sense-proportional; cross-terms; cross-terms; ≈ cos pure corpus-based

Future Challenges
• Quality:
  • Use antonym detection to penalize antonymous candidates, e.g., Mohammad, Dorr and Dunne (2009)
  • How to benefit from POS and syntactic info, e.g., Callison-Burch (2008)
  • How to benefit from semantic info / WSD, e.g., Erk & Pado (2008)
• Scaling:
  • How? (Hadoop + Map/Reduce?)
  • Find out why a larger monolingual training set doesn’t always improve the paraphrase contribution to SMT
• Beat the pivoting method (Callison-Burch et al. (2006))
• Get gains on larger SMT sets (for resource-rich languages!)

Main Contributions
• Showing the advantage of fine-grained linguistic soft constraints in SMT, relative to “pure” corpus-based baseline and coarse-grained soft constraints.
• Showing the advantage of fine-grained linguistic soft constraints in lexical semantics and paraphrase generation, relative to “pure” corpus-based baseline, hard constraints and coarse-grained soft constraints.
• Showing statistically significant gains in state-of-the-art end-to-end phrase-based SMT systems of both syntactic and semantic (paraphrastic) contributions.
• Introducing a novel paraphrase generation engine, using a monolingual corpus-based distributional approach, independent of parallel texts (a limited resource).
• Introducing a novel evidence reinforcement component for scoring translation rules in paraphrase-augmented translation models.
• Tunable (task-specific optimization) unified linear statistical NLP model.
