This research project focuses on creating an elicitation corpus that covers various semantic categories and constructions, with the aim of studying how languages form different constructions and important semantic distinctions. The corpus will be used for learning translation rules and navigating through language features.
Designing an Elicitation Corpus with Semantic Representations • Simon Fung • Advisor: Lori Levin • August 2006
Overview • motivation • elicitation corpus • constraints • issue: definiteness • status
Corpus example Was there an apple? Wasn't there an apple? Will there be an apple? Won't there be an apple? There was an apple. There was not an apple. There will be an apple. There will not be an apple. ...
What is an elicitation corpus for? • Make a small parallel corpus that can be used for learning translation rules
Motivation • how do languages form various constructions (e.g. relative clauses)? • The student who I saw • <chinese>
what semantic distinctions are important in different languages? • He is falling. <chinese> <french> • They are falling. <chinese> <french> • I ate an apple. <chinese> <french> • I ate apples. <chinese> <french>
Elicitation Corpus • sentences covering various semantic categories/constructions • e.g. number, gender, relative clauses • to be translated into language under study
The MILE (MInor Language Elicitation) Corpus • 10,000-20,000 words • translations done by one person • 7 languages per year for next 5 years • E.g., Thai, Bengali, Punjabi • May have a lot of speakers, but fewer electronic resources
Constraints • maximize range of semantic categories and constructions • minimize corpus size
Constraints • different languages complex in different areas • only one corpus, for this project • ultimate goal: dynamically navigate through features
Method • create semantic representations first (instead of starting with English) • write English sentences based on them • translate sentences into various languages
Example: feature structure
srcsent: Who will break windows?
context: "Who" refers to two men; spoken to a co-worker;
((actor ((np-function fn-actor)
         (np-general-type interrogative-type)
         (np-person person-third)
         (np-number num-dual)
         (np-biological-gender bio-gender-male)
         (np-animacy anim-human)
         (np-pronoun-antecedent antecedent-n/a)
         (np-specificity specificity-neutral)
         (np-identifiability identifiability-neutral)
         (np-distance distance-neutral)
         (np-pronoun-exclusivity inclusivity-n/a)))
 (undergoer ((np-person person-third)
             (np-identifiability unidentifiable)
             (np-number num-pl)
             (np-specificity non-specific)
             (np-animacy anim-inanimate)
             (np-biological-gender bio-gender-n/a)
             (np-function fn-undergoer)
             (np-general-type common-noun-type)
             (np-pronoun-exclusivity inclusivity-n/a)
             (np-pronoun-antecedent antecedent-n/a)
             (np-distance distance-neutral)))
 (c-polarity polarity-positive)
 (c-v-absolute-tense future)
 (c-general-type open-question)
 (c-question-gap gap-actor)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)
 (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a)
 (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a)
 (c-causation-directness directness-n/a)
 (c-source source-neutral)
 (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral)
 (c-solidarity solidarity-neutral)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none)
 (c-function fn-main-clause)
 (c-minor-type minor-n/a)
 (c-copula-type copula-n/a)
 (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a))
Example: feature structure (abbreviated)
srcsent: Who will break windows?
context: "Who" refers to two men; spoken to a co-worker;
((ACTOR ((NP-FUNCTION FN-ACTOR) (NP-GENERAL-TYPE INTERROGATIVE-TYPE)))
 (UNDERGOER ((NP-PERSON PERSON-THIRD) (NP-IDENTIFIABILITY UNIDENTIFIABLE) (NP-NUMBER NUM-PL) (NP-SPECIFICITY NON-SPECIFIC)))
 (C-POLARITY POLARITY-POSITIVE)
 (C-V-ABSOLUTE-TENSE FUTURE))
Each pair has the form (feature-name value): for example, NP-NUMBER is a feature name and NUM-PL is its value.
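To make the (feature-name value) notation concrete, here is a minimal Python sketch that reads the abbreviated feature structure into nested dictionaries. The helper names (tokenize, parse, to_dict) are hypothetical and are not part of the project's tooling; the sketch only assumes the parenthesized format shown on the slide.

```python
import re

def tokenize(text):
    # Parentheses are their own tokens; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens):
    """Parse one parenthesized expression from the front of the token list."""
    token = tokens.pop(0)
    if token != "(":
        return token  # an atom: a feature name or a value
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return items

def to_dict(pairs):
    """Turn a list of [name, value] pairs into a dict, recursing into nested structures."""
    result = {}
    for name, value in pairs:
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result

# The abbreviated feature structure from the slide.
FS = """((ACTOR ((NP-FUNCTION FN-ACTOR) (NP-GENERAL-TYPE INTERROGATIVE-TYPE)))
 (UNDERGOER ((NP-PERSON PERSON-THIRD) (NP-IDENTIFIABILITY UNIDENTIFIABLE)
             (NP-NUMBER NUM-PL) (NP-SPECIFICITY NON-SPECIFIC)))
 (C-POLARITY POLARITY-POSITIVE) (C-V-ABSOLUTE-TENSE FUTURE))"""

features = to_dict(parse(tokenize(FS)))
print(features["UNDERGOER"]["NP-NUMBER"])
```

Once parsed this way, a sentence's features can be looked up by name (for instance, features["C-V-ABSOLUTE-TENSE"]) rather than read off the English text.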
Using semantic representation • Advantage: • specify grammatical features more precisely than text
1. Naturalness • naturalness of sentences vs. holding lexical items constant • minimal pairs ideal (A tree fell/The tree fell) • but also want natural sentences • natural → easier to translate → fewer mistakes • e.g. She hurt herself. It ____ itself. • sentences are hand-written • vs. using natural language generators (GenKit)
2. Restrictions • need to find restrictions on combinations of features • some combinations invalid/unnatural • e.g. inclusive and third-person male
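One way to picture such restrictions is as a list of feature-value combinations that must not co-occur. The Python sketch below is an illustration only, not the project's actual mechanism: the value name "inclusive" is hypothetical (the slides show only "inclusivity-n/a" for np-pronoun-exclusivity), and the function names are made up for this example.

```python
# Each restriction is a set of feature values that should not appear together.
# "inclusive" is a hypothetical value name; "person-third" appears in the slides.
RESTRICTIONS = [
    {"np-pronoun-exclusivity": "inclusive", "np-person": "person-third"},
]

def violates(feature_structure, restriction):
    """True if the feature structure contains every value named in the restriction."""
    return all(
        feature_structure.get(name) == value
        for name, value in restriction.items()
    )

def is_valid(feature_structure):
    """True if the feature structure violates none of the known restrictions."""
    return not any(violates(feature_structure, r) for r in RESTRICTIONS)

bad = {"np-person": "person-third", "np-pronoun-exclusivity": "inclusive"}
ok = {"np-person": "person-third", "np-pronoun-exclusivity": "inclusivity-n/a"}
print(is_valid(bad), is_valid(ok))
```

Keeping the restrictions as data rather than code means new invalid combinations can be added as they are discovered, without touching the checking logic.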
3. Definition of values • use language-independent semantic categories • e.g. specificity, identifiability • writers need to agree on definitions of values • Intercoder agreement (informal experiment) • each coder needs to be consistent • writers agreed on English forms to use
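The slides describe the intercoder agreement check only as an informal experiment. If one wanted to quantify agreement between two coders, Cohen's kappa is a common choice; the sketch below is an assumption about how that could be done, not a description of what the project did. The labels are invented toy data using the np-specificity values.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the coders chose the same value.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of matching if each coder labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[v] * freq_b[v] for v in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented toy labels, for illustration only.
coder_a = ["specific", "non-specific", "specific", "specificity-neutral"]
coder_b = ["specific", "non-specific", "non-specific", "specificity-neutral"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

A kappa near 1 suggests the coders share a definition of the values; a low kappa suggests the definitions need to be tightened before more sentences are written.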
avoid language-specific grammatical features • Suppose you want to know about definiteness in the minor language • You think that you can find out about definiteness by checking how they translate “the”, “a” and null determiner • You get the following data from French • <French data here> • It’s a mess • Sometimes “the” is translated as “le/la” • Sometimes “le/la” occur where English has a null determiner • etc. • You don’t know why it’s a mess
Avoid language specific… • Have to break it down by function: • Indefinite quantity • Generic • Predicate nominal • A definite noun phrase • Etc.
Definiteness • example of a problem in design of features and values • how to define definiteness while avoiding English definiteness categories?
Criteria for definiteness Lyons (1999): • uniqueness • familiarity • identifiability • specificity • inclusiveness
Definiteness You and I are in a room. I say “The chair is on fire!”
Criteria for definiteness Why did I say “the chair”? • identifiability • I know that you know what chair I’m talking about • specificity • a chair you can single out among chairs you’re imagining
Grammatical feature: specificity “John wants to marry a Norwegian.” Feature: np-specificity • Values: • specific • John wants to marry a (specific) Norwegian. • non-specific • John wants to marry some Norwegian. • specificity-neutral • She is a Norwegian.
Grammatical feature: specificity • Turkish direct objects: • Ali bir kitap okudu. (gloss: Ali one book read) “Ali read a book.” • Ali bir kitab-ı okudu. (gloss: Ali one book-acc read) “Ali read a (specific) book.”
Definiteness: corner cases • e.g. Who will be the manager? • Not about a specific manager, but it is about a specific role • e.g. She is a teacher. • identifiability-neutral, specificity-neutral • no article here in French • e.g. A dog has four legs. • identifiability-generic, specificity-neutral
Criteria for definiteness chose the most important criteria: • identifiability • specificity
Layout of Corpus 1. Clause types, negation, and formality 2. Discourse setting/Speaker-hearer features 3. Basic NP features 4. Verbal Tense and Aspect 5. Evidentiality and Modality 6. Causatives 7. Comparatives 8. Modifiers 9. Conjunctions 10. Clause-combining
Layout of Corpus • combine feature values systematically • why combine? • some features interact • e.g. Will the woman be happy? (interrogative, future tense) • what to combine? • some features known to interact • e.g. person, number (I am, we are, he is)
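Systematic combination can be sketched as a Cartesian product over the values of the features chosen to interact. The Python example below is a minimal illustration for person and number. Only "person-third", "num-dual" and "num-pl" appear in the slides; the other value names, and the function name all_combinations, are assumptions made for this example.

```python
from itertools import product

# Features known to interact, with their candidate values.
# "person-first", "person-second" and "num-sg" are assumed value names.
FEATURES = {
    "np-person": ["person-first", "person-second", "person-third"],
    "np-number": ["num-sg", "num-dual", "num-pl"],
}

def all_combinations(features):
    """Yield one dict per combination of feature values."""
    names = list(features)
    for values in product(*(features[name] for name in names)):
        yield dict(zip(names, values))

combos = list(all_combinations(FEATURES))
print(len(combos))
for combo in combos[:3]:
    print(combo)
```

The number of combinations is the product of the value counts, so it grows quickly as features are added. That is one reason to combine only features that are known to interact, in line with the goal of keeping the corpus small.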
Status • delivered (as of two weeks ago)!
e.g. definite feature corresponding to English “the” • definiteness in other languages different • definiteness → familiarity, uniqueness, identifiability, etc.
Steps in corpus creation • Define features and values • tricky to define meanings • e.g. semantics of definiteness • Uniqueness: the president • Familiarity: • Identifiability: • Specificity: • languages have subtly different categories • e.g. definiteness