130 likes | 251 Vues
This paper discusses a novel approach to learning Stochastic Context-Free Grammars (SCFGs) using the Inside-Outside algorithm applied to partially bracketed corpora. It aims to enhance grammar inference by leveraging constituent boundary information and improving training efficiency. The methodology is evaluated through experiments on artificial languages, including palindrome recognition, and real-world data from the ATIS corpus, demonstrating significant improvements in convergence and bracketing accuracy. The results highlight the advantages of using partially bracketed text for more efficient grammar learning.
E N D
Inside-outside reestimation from partially bracketed corpora F. Pereira and Y. Schabes ACL 30, 1992 CS730b 김병창 NLP Lab. 1998. 10. 29
Contents • Motivation • Partially Bracketed Text • Grammar Reestimation • The Inside-Outside Algorithm • The Extended Algorithm • Complexity • Experimental Evaluation • Inferring the Palindrome Language • Experiments on the ATIS Corpus • Conclusions and Further Work NLP Lab., POSTECH
Motivation I • Very simple method for learning SCFGs [Charniak] • Generate all possible SCFG rules • Assign some initial probabilities • Run the training algorithm on a sample text raw text • remove those rules with zero probabilities • Difficulties in using SCFGs • Time complexity - O(n3|w|3) • n : the number of non-terminals w : training sentence • cf. O(s2|w|) : training an HMM with s states • Bad convergence properties • The larger number of non-terminals, the worse. • Inferred only by chance NLP Lab., POSTECH
Motivation II • Extension of the Inside-Outside algorithm • Inferring grammars from a partially parsed corpus • Advantages • constituent boundary information in grammar • reduced number of iteration for training • better time complexity NLP Lab., POSTECH
Partially Bracketed Text • Example • (((VB(DT NNS(IN((NN)(NN CD)))))).) • (((List (the fares(for((flight)(number 891)))))).) • Notations • Corpus C = { c | c = ( w, B) }, w : string, B : bracketing of w • w=w1w2wiwi+1 wj w|w| • (i,j) delimits iwj • consistent : no overlapping in a bracketing • compatible : union of two bracketing is consistent • valid : a span is compatible with a bracketing • span in derivation 01m=w • if j=m, span of wi in j is (i-1,i) • if j<m, j=A, j+1=X1Xk, span A in j is (i1,jk) NLP Lab., POSTECH
Grammar Reestimation • Using reestimation algorithm • parameter estimates for a SCFG derived by other means • grammar inferring from scratch • Grammar inferring • Given set N of Non-terminals, set of terminals • n=|N|, t=|| • N={A1,,An}, ={b1,,bt} • CNF SCFG over N, : n3+nt probabilities • Bp,q,r on ApAqAr: n3 • Up,m on Apbm: nt • Meaning of rule probabilities : intuition of context freeness NLP Lab., POSTECH
S i 1 s-1 s t t+1 T The Inside-Outside Algorithm • Definition of inner (e) and outer (f) probabilities i Inner probability S i Outer probability Special thanks to ohwoog NLP Lab., POSTECH
The Extended Algorithm • Compatible function • Extended algorithm • Table 1. 참조 • Inside probabilities : (1), (2) ; (2)에 compatible function 사용. • Outside probabilities : (3), (4); (4)에 compatible function 사용. • Parameter reestimation : (5), (6) ; original algorithm과 같음. • Stopping criterior • When the cross entropy estimate becomes negligible. NLP Lab., POSTECH
Complexity • Complexity of original algorithm : O(|w|3) for each sentence • computation of inside probability, computation of outside probability and rule probability reestimation : 각각 O(|w|3) for each sentence • Complexity of extended algorithm : O(|w|) at best case • In the case of full binary bracketing B of a string w • O(|w|) spans in B • Only one split point for each (i,k) • Each valid span must be a member of B. • Preprocessing • Enumerating valid spans and split points NLP Lab., POSTECH
Experimental Evaluation • Two experiments • Artificial Language ; Palindrome • Natural Language ; Penn Treebank • Evaluation • Bracketing accuracy • proportion of phrases that are compatible NLP Lab., POSTECH
Inferring the Palindrome Language • L={wwR|w{a,b}*} • Initial grammar : 135 rules ( =53+5*2 ) • Training with 100 sentences • Inferred grammar : correct palindrome language grammar • Bracketing accuracy : above 90% (100% in several cases) • In the unbracketing training : 15% - 69% NLP Lab., POSTECH
Experiments on the ATIS Corpus • ATIS(Air Travel Information System) corpus ; 770 sentences (7812 words) • 700 training set, 70 test set (901 words) • Initial grammar : 4095 rules ( =153+15*48) • 15 nonterminals, 48 terminal symbols for POS tags • Bracketing accuracy : 90.36% after 75 iteration • In the unbracketing training : 37.35% • In the case (A) • (Delta flight number) : not compatible • (the cheapest) : linguistically wrong ; lack of information • 16 incompatibles in GR • In the case (B) • fully compatible • 9 incompatibles in GR NLP Lab., POSTECH
Conclusions and Further Work • The use of partially bracketed corpus can • reduce the number of iterations for convergence • find good solution • infer grammars specifying linguistically reasonable constituent boundaries • reduce time complexity (linear in the best case) • More Extensions • determination of sensitivity to the • initial probability assignments • training corpus • lack or misplacement of brackets. • larger terminal vocabularies NLP Lab., POSTECH