Spanish FrameNet Project
Autonomous University of Barcelona
Marc Ortega
Spanish FrameNet Project
• Spanish FrameNet (SFN) is a research project sponsored by the Department of Education of Spain (Grant No. TSI2005-01200) from December 2005 to December 2006.
• A new grant proposal has been submitted to the Spanish Department of Education for the period 2007-2009.
• SFN is developed at the Autonomous University of Barcelona (Spain) and the International Computer Science Institute (Berkeley, CA) in cooperation with the FrameNet Project.
• PI: Carlos Subirats; System Analyst: Marc Ortega; 2 linguists
SFN Goals
• The Spanish FrameNet Project is creating an online lexical resource for Spanish, based on frame semantics and supported by corpus evidence.
• SFN will be available to the public by July 2007.
• SFN will contain roughly 1,000 or more lexical items (verbs, predicative nouns, adjectives, adverbs, prepositions, and entities) representative of a wide range of semantic domains.
• The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses.
Frame Semantics
• Spanish FrameNet (SFN) uses FrameNet (FN) frames, modifying them where necessary to fit Spanish.
• Some SFN frames are the same as in English FN (with Spanish examples).
• Some SFN frames share the name of an English FN frame but differ from it (slightly different definition, different frame elements (FEs), or different core sets).
• To adapt FN to Spanish, we defined some new frames (in the same format as FN frames), and some FN frames are not used. Examples:
  • Cause_to_halt
  • Change_emotional_state
  • Collapse
  • Inventing
  • Motion_backwards, Motion_interruption, Motion_manner, Motion_medium, Motion_up_downwards
  • Return
  • Social_interaction
  • Think_up
Current Project Status
• Frames Defined: 92
• Lexical Units: 624
• Annotated: 413
• Subcorporated: 130
• Created but without subcorporation: 23
Spanish FrameNet Corpus and Tools
• Spanish FrameNet uses a 350-million-word corpus.
• It includes both European and New World Spanish (40% and 60%, respectively).
• The SFN Corpus was developed by the SFN research team, since no large public-domain Spanish corpora are available.
• The SFN Corpus is lemmatized and tagged with a set of in-house tools.
• FNDesktop
• Web Reports
• Sato Tool
The SFN tagging and chunking system
• The SFN Corpus is tagged and lemmatized using:
  • An electronic dictionary of Spanish with 600,000 forms, expanded from a dictionary of 93,000 lemmas:
    • 66,000 single-word lexical units, such as unir (unite), inmoralidad (immorality), allí (there), etc.
    • 26,000 multi-word lexical units (MWLUs), such as muerte cerebral (brain death), which are automatically expanded into 55,000 inflected MWLU forms
  • A corpus tagger that converts plain text to deterministic finite-state automata (DFSAs)
  • 2,000 finite-state transducers (FSTs) for multi-word verbs
  • Transducers for the heads of verbal phrases (compound verbal tenses)
The SFN tagging and chunking system
• The POS tagging process produces two corpus formats:
  • Automata Corpus
  • IMS-CWB (Institut für Maschinelle Sprachverarbeitung Corpus Workbench)
Automata Corpus
• Lexical tagging (part of speech, lemma)
• Word ambiguities are represented in deterministic finite-state automata (DFSAs) as alternative transitions between two consecutive states
• Allows efficient word disambiguation
• Allows extended lexical tagging using automata transduction
  • Compound verbal form tagging
  • Multi-word verb recognition
• Very high processing rates
• Human access is almost impossible
[Figure: DFSA of the sentence Al habérselo propuesto a tiempo]
[Figure: FST for compound verb form tagging]
[Figure: Transduced DFSA of the sentence Al habérselo propuesto a tiempo]
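The idea of storing each ambiguity as alternative transitions between two consecutive states can be illustrated with a minimal sketch. The toy lexicon, the tag names, and the function names below are assumptions made for illustration; they are not the SFN dictionary or the format of the SFN tools.

```python
from itertools import product

# Hypothetical toy lexicon: each form maps to its possible (lemma, tag) readings.
# The readings are simplified and are not taken from the SFN dictionary.
LEXICON = {
    "la": [("el", "DET"), ("la", "PRON")],
    "casa": [("casa", "N"), ("casar", "V")],
    "blanca": [("blanco", "A")],
}

def build_sentence_automaton(tokens):
    """Return one transition (state, state + 1, form, lemma, tag) per reading."""
    transitions = []
    for state, form in enumerate(tokens):
        for lemma, tag in LEXICON[form.lower()]:
            transitions.append((state, state + 1, form, lemma, tag))
    return transitions

def tag_paths(transitions, n_tokens):
    """Enumerate every tag sequence accepted by the sentence automaton."""
    per_state = [[t for t in transitions if t[0] == s] for s in range(n_tokens)]
    return [tuple(t[4] for t in combo) for combo in product(*per_state)]

tokens = ["la", "casa", "blanca"]
automaton = build_sentence_automaton(tokens)
paths = tag_paths(automaton, len(tokens))
for path in paths:
    print(" ".join(path))
```

Two ambiguous words with two readings each give five transitions but four complete paths, so the automaton stays compact while the number of readings it encodes grows multiplicatively. Disambiguation then amounts to deleting transitions.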
CWB Corpus
• Lexical tagging (part of speech, lemma)
• Text DFSAs are disambiguated and converted to XML format
• Unambiguous corpus
• Allows human access to corpus contents
• Allows human corpus search
• Corpus contents are encoded and indexed for efficient corpus search
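As a rough illustration of the conversion step, the sketch below writes a disambiguated sentence in the verticalized layout that the Corpus Workbench commonly takes as input: one token per line with tab-separated attributes, and XML tags on their own lines for structural units. The attribute order (word, POS, lemma), the `text` and `s` tag names, and the tag values are assumptions; the actual SFN export format is not described here.

```python
from xml.sax.saxutils import escape

def to_vertical(sentences):
    """Render sentences as verticalized text with XML structural tags.

    sentences: list of sentences, each a list of (form, pos, lemma) triples.
    """
    lines = ["<text>"]
    for sentence in sentences:
        lines.append("<s>")
        for form, pos, lemma in sentence:
            # Escape XML special characters so tokens cannot be read as tags.
            lines.append("\t".join(escape(value) for value in (form, pos, lemma)))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines)

# Hypothetical, already disambiguated sentence (one reading per token).
sentence = [
    ("Le", "PRON", "le"),
    ("hacían", "V", "hacer"),
    ("siempre", "ADV", "siempre"),
    ("el", "DET", "el"),
    ("vacío", "N", "vacío"),
]
vertical = to_vertical([sentence])
print(vertical)
```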
Multi-word verb recognition
• Inflectional morphological properties are kept
• The adverb siempre is detected between the core verb and the idiom
[Figure: DFSA of the sentence Le hacían siempre el vacío en la empresa before the transduction]
[Figure: Output DFSA of the sentence after the intersection and transduction]
[Figure: Subsequential FST that detects the multi-word verb hacer el vacío]
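The behaviour described for hacer el vacío can be approximated with a simple scan over tagged tokens. This is a plain list-matching sketch, not a subsequential FST and not the SFN implementation; the tag names, the `max_gap` parameter, and the function name are hypothetical.

```python
def find_multiword_verb(tokens, verb_lemma, idiom_forms, max_gap=2):
    """Locate a multi-word verb, allowing adverbs between verb and idiom.

    tokens: list of (form, lemma, tag) triples.
    Returns the verb index, the indices of intervening adverbs, and the
    (start, end) span of the idiom, or None if there is no match.
    """
    for i, (_, lemma, tag) in enumerate(tokens):
        if lemma != verb_lemma or tag != "V":
            continue
        j = i + 1
        adverbs = []
        # Skip up to max_gap adverbs that follow the inflected verb.
        while j < len(tokens) and tokens[j][2] == "ADV" and len(adverbs) < max_gap:
            adverbs.append(j)
            j += 1
        forms = [form.lower() for form, _, _ in tokens[j:j + len(idiom_forms)]]
        if forms == list(idiom_forms):
            return {"verb": i, "adverbs": adverbs, "idiom": (j, j + len(idiom_forms))}
    return None

tokens = [
    ("Le", "le", "PRON"), ("hacían", "hacer", "V"), ("siempre", "siempre", "ADV"),
    ("el", "el", "DET"), ("vacío", "vacío", "N"), ("en", "en", "PREP"),
    ("la", "el", "DET"), ("empresa", "empresa", "N"),
]
match = find_multiword_verb(tokens, "hacer", ("el", "vacío"))
print(match)
```

Because the match is made on the lemma hacer while the token keeps its inflected form hacían, the inflectional information of the verb survives recognition.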
Subcorporation Process
• The internal tools GramCreator and XQS are used to create subcorporation grammars

# Request: solicitud
# N-de-GN-de
# <PALABRA>* = 4
{
<%NPRED%> ( <APRED> + <PALABRA>* )
<de.PREP>
( (<PRON> + ( ( <E> + <PREDET> ) ( <E> + <DET> + <APOS> ) ( <E> + <APRED> + <VPRED:PP> ) )) <N> + (<NPROP> ( <E> + <NPROP> )) )
<de.PREP>
}

Solicitud grammar example: the syntactic structure N-de-GN-de is detected
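To give a feel for what such a grammar detects, the sketch below matches a much simplified N-de-GN-de pattern with an ordinary regular expression over "lemma.TAG" tokens. It is only loosely modeled on the Solicitud grammar: it does not reproduce GramCreator or XQS syntax or semantics, and the tag names and the token encoding are assumptions.

```python
import re

def encode(tokens):
    """Encode (form, lemma, tag) triples as a space-separated 'lemma.TAG' string."""
    return " ".join(f"{lemma}.{tag}" for _, lemma, tag in tokens)

# Simplified pattern: solicitud, optional adjective, "de", a short noun group
# (optional determiner, optional adjective, noun or proper noun), then "de".
N_DE_GN_DE = re.compile(
    r"(?:^| )solicitud\.N"
    r"(?: \S+\.A)?"
    r" de\.PREP"
    r"(?: \S+\.DET)?"
    r"(?: \S+\.A)?"
    r" \S+\.(?:N|NPROP)"
    r" de\.PREP(?= |$)"
)

matching = [
    ("la", "el", "DET"), ("solicitud", "solicitud", "N"), ("de", "de", "PREP"),
    ("asilo", "asilo", "N"), ("de", "de", "PREP"), ("los", "el", "DET"),
    ("refugiados", "refugiado", "N"),
]
non_matching = matching[:4]  # "la solicitud de asilo": no second "de"

print(bool(N_DE_GN_DE.search(encode(matching))))
print(bool(N_DE_GN_DE.search(encode(non_matching))))
```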
Subcorporation Process
• Each grammar (regular expression) is converted to a finite-state transducer (FST)
• Each LU's subcorpus is transduced with the set of grammar FSTs to produce a set of subcorpora
• The transduction process is very fast (100 transductions per second)
• The subcorporation set is converted to XML and imported into FNDesktop
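The overall flow (a set of grammars applied to one LU's sentences, one subcorpus per grammar, XML export) might be sketched as follows. Compiled regular expressions over tag strings stand in for the grammar FSTs, and the XML element and attribute names are hypothetical; the schema that FNDesktop actually imports is not given here. In this sketch a sentence may fall into more than one subcorpus, which is also an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Stand-ins for the grammar FSTs: patterns over space-separated tag strings.
GRAMMARS = {
    "N-de-GN-de": re.compile(r"(?<!\S)N PREP:de (?:DET )?N PREP:de(?!\S)"),
    "N-de-GN": re.compile(r"(?<!\S)N PREP:de (?:DET )?N(?!\S)"),
}

def subcorporate(sentences, grammars):
    """Group sentences by the grammars they match.

    sentences: list of (text, tag_string) pairs for a single lexical unit.
    Returns a dict mapping each grammar name to the matching sentence texts.
    """
    subcorpora = {name: [] for name in grammars}
    for text, tags in sentences:
        for name, pattern in grammars.items():
            if pattern.search(tags):
                subcorpora[name].append(text)
    return subcorpora

def to_xml(lu_name, subcorpora):
    """Serialize the subcorpora of one lexical unit as an XML string."""
    root = ET.Element("lexunit", name=lu_name)
    for name, texts in subcorpora.items():
        sub = ET.SubElement(root, "subcorpus", name=name)
        for text in texts:
            ET.SubElement(sub, "sentence").text = text
    return ET.tostring(root, encoding="unicode")

sentences = [
    ("la solicitud de asilo de los refugiados", "DET N PREP:de N PREP:de DET N"),
    ("la solicitud de ayuda", "DET N PREP:de N"),
    ("presentaron una solicitud", "V DET N"),
]
subcorpora = subcorporate(sentences, GRAMMARS)
xml_text = to_xml("solicitud.n", subcorpora)
print(xml_text)
```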
Subcorporation Process
[Figure: N-de-GN-de structure detection]
Annotation Tool
• SFN uses the FN annotation tool (FNDesktop) to add semantic annotation to the LU subcorporation sets
• The FNClassifier has been adapted to Spanish: it has new rules adapted to the Spanish tags and to Spanish local syntactic contexts