Generation
Aims of this talk • Discuss MRS and LKB generation • Describe larger research programme: modular generation • Mention some interactions with other work in progress: • RMRS • SEM-I
Outline of talk • Towards modular generation • Why MRS? • MRS and chart generation • Data-driven techniques • SEM-I and documentation
Modular architecture • Language independent component → Meaning representation → Language dependent realization → string or speech output
Desiderata for a portable realization module • Application independent • Any well-formed input should be accepted • No grammar-specific/conventional information should be essential in the input • Output should be idiomatic
Architecture (preview) • [Diagram components: External LF; SEM-I; Internal LF; specialization modules; Chart generator; control modules; String]
Why MRS? • Flat structures • independence of syntax: conventional LFs partially mirror tree structure • manipulation of individual components: can ignore scope structure etc • lexicalised generation • composition by accumulation of EPs: robust composition • Underspecification
An excursion: Robust MRS • Deep Thought: integration of deep and shallow processing via compatible semantics • All components construct RMRSs • Principled way of building robustness into deep processing • Requirements for consistency etc help human users too
Extreme flattening of deep output • [Figure: two scoped tree readings built from every, some, cat, dog1 and chase over the variables x, y and e] • Flattened RMRS: lb1:every_q(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat_n(x), lb5:dog_n_1(y), lb4:some_q(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase_v(e), ARG1(lb3,x), ARG2(lb3,y), h9 qeq lb2, h8 qeq lb5
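The flat RMRS on this slide can be held as plain records, which is what makes it easy to produce and consume piecemeal. A minimal Python sketch, assuming hypothetical `EP`, `Arg` and `Qeq` record types (illustrative names only, not the LKB's or any RMRS tool's actual data structures):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EP:
    label: str  # e.g. "lb1"
    pred: str   # e.g. "every_q"
    var: str    # the variable shown in parentheses on the slide

@dataclass(frozen=True)
class Arg:
    role: str   # e.g. "RSTR", "ARG1"
    label: str  # label of the EP the argument belongs to
    value: str  # a variable or a hole

@dataclass(frozen=True)
class Qeq:
    hole: str
    label: str

# The RMRS from the slide, transcribed fact by fact.
eps = [EP("lb1", "every_q", "x"), EP("lb2", "cat_n", "x"),
       EP("lb5", "dog_n_1", "y"), EP("lb4", "some_q", "y"),
       EP("lb3", "chase_v", "e")]
args = [Arg("RSTR", "lb1", "h9"), Arg("BODY", "lb1", "h6"),
        Arg("RSTR", "lb4", "h8"), Arg("BODY", "lb4", "h7"),
        Arg("ARG1", "lb3", "x"), Arg("ARG2", "lb3", "y")]
qeqs = [Qeq("h9", "lb2"), Qeq("h8", "lb5")]

def args_of(label):
    """Collect the separately stated arguments of one EP."""
    return {a.role: a.value for a in args if a.label == label}

print(args_of("lb3"))
```

Because arguments are separate facts, a shallow component that only knows `chase_v(e)` and its ARG1 can simply omit the ARG2 record.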
Extreme Underspecification • Factorize deep representation to minimal units • Only represent what you know • Robust MRS: • Separate relations • Separate arguments • Explicit equalities • Conventions for predicate names and sense distinctions • Hierarchy of sorts on variables
Chart generation with the LKB • Determine lexical signs from MRS • Determine possible rules contributing EPs (‘construction semantics’: compound rule etc) • Instantiate signs (lexical and rule) according to variable equivalences • Apply lexical rules • Instantiate chart • Generate by parsing without string position • Check output against input
Lexical lookup for generation • _like_v_1(e,x,y) – return lexical entry for sense 1 of verb like • temp_loc_rel(e,x,y) – returns multiple lexical entries • multiple relations in one lexical entry: e.g., who, where • entries with null semantics: heuristics
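The lookup step can be pictured as an index from predicates to lexical entries. A minimal sketch, assuming a hypothetical in-memory `LEXICON` list; the entry names and the two predicates given for "who" are invented for illustration and are not taken from the ERG:

```python
# Each entry: (entry name, predicates the entry contributes).
LEXICON = [
    ("like_v1", ["_like_v_1"]),
    ("in_temp", ["temp_loc_rel"]),
    ("on_temp", ["temp_loc_rel"]),
    ("at_temp", ["temp_loc_rel"]),
    ("who", ["person_rel", "which_q_rel"]),  # hypothetical multi-relation entry
    ("expletive_it", []),                    # hypothetical null-semantics entry
]

def lookup(input_preds):
    """Return entries whose predicates are all present in the input.

    Entries with no predicates are skipped here: nothing in the input
    can trigger them, so they need separate heuristics.
    """
    present = set(input_preds)
    return [name for name, preds in LEXICON
            if preds and set(preds) <= present]

print(lookup(["_like_v_1", "temp_loc_rel"]))
```

A single predicate such as `temp_loc_rel` fans out to several entries, while a multi-relation entry is only retrieved when every one of its relations is in the input.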
Instantiation of entries • _like_v_1(e,x,y) & named(x,“Kim”) & named(y,“Sandy”) • find locations corresponding to ‘x’ in all FSs • replace every ‘x’ with a constant • repeat for ‘y’ etc • Also for rules contributing construction semantics • ‘Skolemization’ (misleading name ...)
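The variable-to-constant replacement can be sketched on EPs alone. This is a simplification: the slide describes replacing variables inside feature structures, whereas the sketch below works on tuples, and the convention that a variable is one lowercase letter plus optional digits is an assumption made only for this example.

```python
import re
from itertools import count

def is_variable(arg):
    # Assumed convention: "e", "x", "y1" are variables; "Kim" is a constant.
    return re.fullmatch(r"[a-z]\d*", arg) is not None

def skolemize(eps):
    """Replace each distinct variable with one fresh constant everywhere."""
    fresh = count(1)
    constants = {}
    result = []
    for pred, *args in eps:
        new_args = []
        for arg in args:
            if is_variable(arg):
                if arg not in constants:
                    constants[arg] = f"sk{next(fresh)}"
                arg = constants[arg]
            new_args.append(arg)
        result.append((pred, *new_args))
    return result, constants

eps = [("_like_v_1", "e", "x", "y"), ("named", "x", "Kim"), ("named", "y", "Sandy")]
instantiated, mapping = skolemize(eps)
print(instantiated)
```

After this step two signs can only combine if they share the same constants, which is how the input's variable equivalences constrain the chart.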
Lexical rule application • Lexical rules that contribute EPs only used if EP is in input • Inflectional rules will only apply if variable has the correct sort • Lexical rule application does morphological generation (e.g., liked, bought)
Chart generation proper • Possible lexical signs added to a chart structure • Currently no indexing of chart edges • chart generation can use semantic indices, but current results suggest this doesn’t help • Rules applied as for chart parsing: edges checked for compatibility with input semantics (bag of EPs)
Root conditions • Complete structures must consume all the EPs in the input MRS • Should check for compatibility of scopes • precise qeq matching is (probably) too strict • requiring exactly the same scopes is (probably) unrealistic and too slow
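Chart generation over a bag of EPs, together with the first root condition, can be sketched in a few lines. This is a toy, not the LKB algorithm: the `Edge` class, the two hard-coded rules in `combine`, and the use of plain strings for EPs are all assumptions made for illustration, and scope checking is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    cat: str        # syntactic category
    words: tuple    # surface words covered so far
    eps: frozenset  # input EPs this edge accounts for
    index: str = "" # semantic index (used by NPs)
    subj: str = ""  # variable of the still-missing subject
    obj: str = ""   # variable of the still-missing object

def combine(left, right):
    """Two toy rules: VP -> V NP and S -> NP VP."""
    if left.eps & right.eps:  # each input EP may be consumed only once
        return None
    eps = left.eps | right.eps
    words = left.words + right.words
    if left.cat == "V" and right.cat == "NP" and right.index == left.obj:
        return Edge("VP", words, eps, subj=left.subj)
    if left.cat == "NP" and right.cat == "VP" and left.index == right.subj:
        return Edge("S", words, eps)
    return None

def generate(lexical_edges, input_eps):
    chart = set(lexical_edges)
    agenda = list(lexical_edges)
    while agenda:
        edge = agenda.pop()
        for other in list(chart):
            for new in (combine(edge, other), combine(other, edge)):
                if new is not None and new not in chart:
                    chart.add(new)
                    agenda.append(new)
    # Root condition: a complete structure must consume every input EP.
    return sorted(" ".join(e.words) for e in chart
                  if e.cat == "S" and e.eps == input_eps)

like, kim, sandy = "_like_v_1(e,x,y)", "named(x,Kim)", "named(y,Sandy)"
lexical = [
    Edge("NP", ("Kim",), frozenset({kim}), index="x"),
    Edge("NP", ("Sandy",), frozenset({sandy}), index="y"),
    Edge("V", ("likes",), frozenset({like}), subj="x", obj="y"),
]
print(generate(lexical, frozenset({like, kim, sandy})))  # prints ['Kim likes Sandy']
```

Edges carry no string positions, so the only things blocking a combination are the rules, the shared variables and the requirement that no EP is used twice.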
Generation failures due to MRS issues • Well-formedness check prior to input to generator (optional) • Lexical lookup failure: predicate doesn’t match entry, wrong arity, wrong variable types • Unwanted instantiations of variables • Missing EPs in input: syntax (e.g., no noun), lexical selection • Too many EPs in input: e.g., two verbs and no coordination
Improving generation via corpus-based techniques • CONTROL: e.g. intersective modifier order: • Logical representation does not determine order • wet(x) & weather(x) & cold(x) • UNDERSPECIFIED INPUT: e.g., • Determiners: none/a/the/ • Prepositions: in/on/at
Constraining generation for idiomatic output • Intersective modifier order: e.g., adjectives, prepositional phrases • Logical representation does not determine order • wet(x) & weather(x) & cold(x)
Adjective ordering • Constraints / preferences • big red car • * red big car • cold wet weather • wet cold weather (OK, but dispreferred) • Difficult to encode in symbolic grammar
Corpus-derived adjective ordering • ngrams perform poorly • Thater: direct evidence plus clustering • positional probability • Malouf (2000): memory-based learning plus positional probability: 92% on BNC
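The positional-probability idea can be illustrated with a toy model: estimate, for each adjective, how often it comes first in an adjective pair, and order new adjectives by that estimate. The pair list below is invented for the example, and the code is a simplified sketch in the spirit of the approaches named above, not a reimplementation of Thater's or Malouf's systems.

```python
from collections import Counter

def train(pairs):
    """Estimate P(adjective is first) from observed (first, second) pairs."""
    first, total = Counter(), Counter()
    for a, b in pairs:
        first[a] += 1
        total[a] += 1
        total[b] += 1
    return {adj: first[adj] / total[adj] for adj in total}

def order(adjectives, p_first):
    """Put adjectives with a higher first-position probability earlier.

    Unseen adjectives get 0.5, i.e. no preference either way.
    """
    return sorted(adjectives, key=lambda a: -p_first.get(a, 0.5))

# Invented toy observations, not corpus counts.
pairs = [("big", "red"), ("big", "old"), ("old", "red"),
         ("cold", "wet"), ("cold", "wet"), ("wet", "cold")]
p_first = train(pairs)
print(order(["red", "big"], p_first))
print(order(["wet", "cold"], p_first))
```

A preference model like this ranks "cold wet weather" above "wet cold weather" without ruling the second out, which matches the dispreferred-but-OK judgment that is hard to state in a symbolic grammar.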
Underspecified input to generation • Target: We bought a car on Friday • Accept: pron(x) & a_quant(y,h1,h2) & car(y) & buy(e_past,x,y) & on(e,z) & named(z,Friday) • and: pron(x) & general_q(y,h1,h2) & car(y) & buy(e_past,x,y) & temp_loc(e,z) & named(z,Friday) • And maybe: pron(x_1pl) & car(y) & buy(e_past,x,y) & temp_loc(e,z) & named(z,Friday)
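One way to read the underspecified inputs above is that a general predicate stands for a set of more specific ones, any of which the generator may try. A minimal sketch, assuming a hypothetical `HIERARCHY` table; `the_quant` is an invented name, and the mapping of `temp_loc` to in/on/at follows the preposition choices discussed in this talk rather than any actual grammar file:

```python
from itertools import product

# Hypothetical hierarchy: general predicate -> more specific predicates.
HIERARCHY = {
    "general_q": ["a_quant", "the_quant"],
    "temp_loc": ["in", "on", "at"],
}

def specializations(pred):
    """A predicate that is already specific stands only for itself."""
    return HIERARCHY.get(pred, [pred])

def expand(preds):
    """Enumerate every fully specified variant of an underspecified input."""
    return [list(combo)
            for combo in product(*(specializations(p) for p in preds))]

candidates = expand(["pron", "general_q", "car", "buy", "temp_loc", "named"])
print(len(candidates))
```

Enumerating every variant is the brute-force option; the statistical modules for determiners and prepositions exist to pick a likely candidate instead of trying them all.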
Guess the determiner • We went climbing in _ Andes • _ president of _ United States • I tore _ pyjamas • I tore _ duvet • George doesn’t like _ vegetables • We bought _ new car yesterday
Determining determiners • Determiners are partly conventionalized, often predictable from local context • Applications: translation from Japanese etc, speech prosthesis • More ‘meaning-rich’ determiners assumed to be specified in the input • Minnen et al: 85% on WSJ (using TiMBL)
Preposition guessing • Choice between temporal in/on/at • in the morning • in July • on Wednesday • on Wednesday morning • at three o’clock • at New Year • ERG uses hand-coded rules and lexical categories • Machine learning approach gives very high precision and recall on WSJ, good results on balanced corpus (Lin Mei, 2004, Cambridge MPhil thesis)
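The hand-coded route (rules over lexical categories) can be sketched directly from the examples on this slide. The category names, the `PRIORITY` ordering and the substring matching are assumptions made for illustration; they are not the ERG's actual rules or categories.

```python
# Hypothetical lexical categories for the temporal nouns on the slide.
CATEGORY = {
    "morning": "part_of_day",
    "July": "month",
    "Wednesday": "weekday",
    "three o'clock": "clock_time",
    "New Year": "festival",
}
PREPOSITION = {
    "part_of_day": "in",
    "month": "in",
    "weekday": "on",
    "clock_time": "at",
    "festival": "at",
}
# When a phrase contains several categories, the earliest one listed wins,
# so "Wednesday morning" follows the weekday rule.
PRIORITY = ["weekday", "clock_time", "festival", "month", "part_of_day"]

def temporal_preposition(phrase):
    """Pick in/on/at for a temporal noun phrase, or None if unknown."""
    found = {cat for word, cat in CATEGORY.items() if word in phrase}
    for cat in PRIORITY:
        if cat in found:
            return PREPOSITION[cat]
    return None

for np in ["the morning", "July", "Wednesday", "Wednesday morning",
           "three o'clock", "New Year"]:
    print(temporal_preposition(np), np)
```

The need for an explicit priority list is one reason hand-coded rules become awkward, and part of the motivation for learning the choice from corpus data instead.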
SEM-I: semantic interface • Meta-level: manually specified ‘grammar’ relations (constructions and closed-class) • Object-level: linked to lexical database for deep grammars • Definitional: e.g. lemma+POS+sense • Linked test suites, examples, documentation
SEM-I development • SEM-I eventually forms the ‘API’: stable, changes negotiated. • SEM-I vs Verbmobil SEMDB • Technical limitations of SEMDB • Too painful! • ‘Munging’ rules: external vs internal • SEM-I development must be incremental
Role of SEM-I in architecture • Offline • Definition of ‘correct’ (R)MRS for developers • Documentation • Checking of test-suites • Online • In unifier/selector: reject invalid RMRSs • Patching up input to generation
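The online role (rejecting invalid RMRSs) amounts to checking each predicate and each stated argument against the SEM-I. A minimal sketch, assuming the SEM-I can be viewed as a hypothetical table from predicate to licensed roles; the table contents reuse the predicates from the flattening example and are not the real SEM-I:

```python
# Hypothetical SEM-I fragment: predicate -> roles it may take.
SEM_I = {
    "every_q": {"RSTR", "BODY"},
    "some_q": {"RSTR", "BODY"},
    "cat_n": set(),
    "dog_n_1": set(),
    "chase_v": {"ARG1", "ARG2"},
}

def check(eps, args):
    """Validate an RMRS; return a list of problems (empty means accepted).

    eps:  {label: predicate}
    args: [(role, label, value)]
    Missing arguments are not errors: an RMRS may leave them unstated.
    """
    errors = []
    for label, pred in eps.items():
        if pred not in SEM_I:
            errors.append(f"{label}: unknown predicate {pred}")
    for role, label, _value in args:
        pred = eps.get(label)
        if pred is None:
            errors.append(f"{role}: no EP with label {label}")
        elif pred in SEM_I and role not in SEM_I[pred]:
            errors.append(f"{label}: {pred} does not take {role}")
    return errors

eps = {"lb2": "cat_n", "lb3": "chase_v"}
print(check(eps, [("ARG1", "lb3", "x"), ("ARG3", "lb3", "z")]))
```

The same table could drive the offline uses as well, for example checking a test-suite's stored RMRSs in bulk.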
Goal: semi-automated documentation • [Diagram components: [incr tsdb()] and semantic test-suite; Lex DB; ERG; Documentation strings; Object-level SEM-I; Auto-generate examples; Meta-level SEM-I] • Documentation is semi-automatic: examples autogenerated on demand, appendix autogenerated
Robust generation • SEM-I an important preliminary • check whether generator input is semantically compatible with grammars • Eventually: hierarchy of relations outside grammars, allowing underspecification • ‘fill-in’ of underspecified RMRS • exploit work on determiner guessing etc
Architecture (again) • [Diagram components: External LF; SEM-I; Internal LF; specialization modules; Chart generator; control modules; String]
Interface • External representation • public, documented • reasonably stable • Internal representation • syntax/semantics interface • convenient for analysis • External/Internal conversion via SEM-I
Guaranteed generation? • Given a well-formed input MRS/RMRS, with elementary predications found in SEM-I (and dependencies) • Can we generate a string? with input fix up? negotiation? • Semantically bleached lexical items: which, one, piece, do, make • Defective paradigms, negative polarity, anti-collocations etc?
Next stages • SEM-I development • Documentation and test suite integration • Generation from RMRSs produced by shallower parser (or deep/shallow combination) • Partially fixed text in generation (cogeneration) • Further statistical modules: e.g., locational prepositions, other modifiers • More underspecification • Gradually increase flexibility of interface to generation