
Methodological provisions in the construction of idiom resources

Methodological provisions in the construction of idiom resources. Collocations and idioms 2006: Linguistic, computational and psycholinguistic perspectives Berlin, Nov. 3, 2006. Eric Laporte Institut Gaspard-Monge Université de Marne-la-Vallée France http://www-igm.univ-mlv.fr/~laporte/.


Presentation Transcript


  1. Methodological provisions in the construction of idiom resources. Collocations and idioms 2006: Linguistic, computational and psycholinguistic perspectives, Berlin, Nov. 3, 2006. Eric Laporte, Institut Gaspard-Monge, Université de Marne-la-Vallée, France. http://www-igm.univ-mlv.fr/~laporte/

  2. Why construct resources describing idioms? Defining objectives of quality: accuracy; coverage. Data- or computer-based provisions: corpus attestations; statistical analyses; golden-standard-based evaluation. Human-based provisions: objective; introspective.

  3. Why construct resources describing idioms? Linguistic interest: idioms make up a large part of languages. Computer applications: text analysis for information retrieval, information extraction, translation...; text generation.

  4. What kinds of idioms? Verbal: make ends meet. Adverbial: in the long term. Nominal: American coffee. Adjectival: rough and ready. Prepositional phrase: in a hurry. Not support verb constructions: make a decision.

  5. 1. Defining objectives of quality. Goals and methodological provisions must be adapted to each other. Provisions depend on goals: provisions are responses to goal-specific risks. Example: objective: know idioms in 1st century AD Latin; provision: gather 1st century AD Latin text. More ambition, more methodological provisions. Compatibility between objectives and provisions. Example: objective: know idioms in 1st century AD Latin; provision: human control over acceptability of idioms. Trade-off between ambition and provisions.

  6. Defining objectives of quality. General objective of quality: conformity with linguistic reality; inclusion of all relevant information. Realistic goals: already attained for some languages.

  7. Selected objectives of quality: accuracy. Complementarity with grammar: The salmons swim up the river (grammar); John drank up his beer (grammar); Mike gave up the piano (idiom resource). Compositional: grammar. Non-compositional: idiom resources. In fact, idioms also require a grammar. Formalization of description: conventional dictionaries, second-language grammars... are interesting but not formalized enough for computer exploitation.

  8. Selected objectives of quality: consistency between intended and actual coverage. Independence from authors' idiolects: au petit bonheur la chance (Fr.); au petit bonheur de la chance (my idiolect), discovered through paper reviewing. Geographical limits: Luc amuse le temps (Québec); *Luc amuse le temps (France). Limits with respect to language plays. Inclusion of variants. Recall or completeness vs. silence or undergeneration. Precision vs. noise or overgeneration.

  9. Intended and actual coverage. Completeness vs. undergeneration. Examples of undergeneration: neglecting variants (in the long term; in the very long term); considering an idiom as compositional, i.e. taken into account by grammar (pomme de terre, from a recent conversation with a linguist).

  10. Intended and actual coverage. Precision vs. overgeneration. Inclusion of obsolete idioms (out-of-date dictionaries): It rains cats and dogs (?). Admission of unacceptable variants: John is on the verge of giving up again; *John is on a new verge of giving up. Checking lemmas, not inflected forms: Il faut voir les choses en face (idiomatic meaning); Il faut voir la chose en face (no idiomatic meaning).

  11. Intended and actual coverage. The linguistic notion underlying over- and undergeneration is obviously that of constraints. Example: co-reference of possessives. Leurs hôtes préviennent leurs désirs (not necessarily co-referent to subject); Leurs hôtes reprennent leurs esprits (co-referent to subject; cf. lose one's temper).

  12. Intended and actual coverage. Goals with respect to language plays. Examples: creative reworking of lexicalised metaphors. John spilled the beans (lexicalised); John spilled the beans of their relationship (lexicalised); John spilled coffee on the bed and the beans of their relationship with it (creative); La direction a jeté le bébé avec l'eau du bain (lexicalised); La direction a jeté le bébé de la qualité avec l'eau du bain de la formation (creative).

  13. Intended and actual coverage. A realistic goal. Include: fully lexicalised forms; limits of variation of fully lexicalised forms. Exclude: creative reworking. A basis for future studies about creative reworking.

  14. Intended and actual coverage. Syntactic variants: Someone spilled the beans (idiomatic); The beans were spilled (idiomatic); The beans spilled (not idiomatic). A realistic goal: describe idiomatic variants of idioms; link all variants of each idiom (e.g. Freckleton 1985, Machonis 1985). A common overgeneralization: a frequent base form, infrequent variants. Luc n'a pas été gâté par la nature (more frequent); La nature n'a pas gâté Luc (less frequent, active).

  15. Other objectives of quality. Less relevant: psychological plausibility of description; etymology; ...

  16. 2. Data- and computer-based provisions. Corpus linguistics: a reaction to biased introspective linguistics (normativity; idiolect generalization; tendency to disregard contexts; reliance on incomplete conventional dictionaries; necessity of updates). Convergence with computational linguistics: automatization of corpus linguistics.

  17. Corpus attestations. Attestations give information about existence and frequency of idioms (example: the 'Collocations in the German Language' project). Balanced corpora. Annotated corpora. The web as corpus (example: the BFQS project). Recognising the limits of language plays: context (headlines, advertisement...); requires intuition also.

  18. Corpus attestations. Concordancers: most corpus linguists use concordancers without lexicons. Unitex, an open-source generator of lemmatized concordances from raw corpora (http://igm.univ-mlv.fr/~unitex), contains lexicons produced through introspective approaches.
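
To make the notion of a concordance concrete, here is a minimal keyword-in-context sketch in Python. It is not how Unitex works: it matches surface strings only, with no lexicon and no lemmatization, which is the lexicon-free kind of concordancer the slide contrasts with Unitex. The function name `kwic` and the sample sentence are assumptions made for illustration.

```python
import re

def kwic(text, phrase, width=30):
    """Return one keyword-in-context line per occurrence of `phrase`.

    Surface matching only: inflected or modified variants of an idiom
    ("spills the beans", "the beans were spilled") are NOT found, which is
    why a lemmatized concordancer needs a lexicon.
    """
    lines = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

if __name__ == "__main__":
    sample = "He spilled the beans. Then she spilled the beans too."
    for line in kwic(sample, "spilled the beans"):
        print(line)
```

A query for "spilled the beans" would miss "The beans were spilled", the passive variant that slide 14 treats as idiomatic, so this sketch illustrates the undergeneration risk as well as the technique.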

  19. Corpus attestations. Results: conventional dictionaries (e.g. COBUILD) for human users. Problem: no attestations of unacceptability; petite cuillère 'tea spoon' absent from a large Canadian corpus of French texts; corpus-dependent information about frequency can be in contradiction with real language use (Garrigues 1993).

  20. Statistical analysis. Can be seen as a methodological provision against subjectivity. For many researchers, other motivations: more fun ('Manual construction of resources is tedious'), better salaries?... Example: statistical attraction as a sign of frozenness; similarity of contexts as a sign of semantic proximity. More efficient on technical terms than on verbal idioms. Human revision: required (methodological provisions: human-based, part 3).
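
One common way to quantify "statistical attraction" between adjacent words is pointwise mutual information (PMI). The slide does not name a measure, so PMI is an assumption here, as are the function name `pmi_scores` and the toy token list. The sketch only shows the principle and is not a tuned extractor.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by relative frequency in `tokens`.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = "in the long term we plan in the long term and the plan is long".split()
    ranked = sorted(pmi_scores(toy).items(), key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked:
        print(pair, round(score, 2))
```

On such small data, pairs that occur only once can outrank a recurring pair like "long term", because PMI rewards rarity. A ranked list of this kind therefore still needs human revision before it can serve as an idiom resource.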

  21. Statistical analysis. Problems. Quality of results of automatic analysis of natural language: shallow parsing; small tagsets; incomplete data about sense distinctions. Infrequent idioms are a challenge (e.g. variations, constraints). Detection of properties: semantic properties, creative reworking of idioms.

  22. Statistical analysis. Results. Lists only: properties (variants, constraints) still largely out of reach. Usually not made available. Terminological lists placed on the market.

  23. Golden-standard evaluation. Evaluation of an idiom extractor: manual annotation of a sub-corpus (golden standard); comparison with results of automatic extraction. Problems: golden standards for idioms are small and rare; little communication about methodological problems in building them (human-based provisions).
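
The comparison between an extractor's output and a golden standard can be sketched as set arithmetic. The sketch below uses the terms of slide 8: precision is lowered by noise (overgeneration), recall by silence (undergeneration). The function name `precision_recall` and both toy lists are assumptions made for illustration; they are not data from any real extractor.

```python
def precision_recall(extracted, gold):
    """Compare extracted idioms with a manually annotated golden standard.

    Returns (precision, recall, noise, silence), where
      noise   = items extracted but absent from the golden standard,
      silence = items in the golden standard that were not extracted.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = extracted & gold
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall, extracted - gold, gold - extracted

if __name__ == "__main__":
    # Toy lists built from examples used in the slides.
    gold = {"spill the beans", "make ends meet", "in the long term"}
    extracted = {"spill the beans", "in the long term", "drink up"}
    p, r, noise, silence = precision_recall(extracted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
    print("noise:", noise)
    print("silence:", silence)
```

Exact string matching is a simplification: if the golden standard links variants of each idiom, the comparison should be made on idiom entries, not on surface forms.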

  24. Lexicon-Grammar of idioms as golden standard. A manually constructed Lexicon-Grammar of French idioms. Authors: Maurice Gross, Laurence Danlos. 10,000 entries. Made available online in 2006: http://infolingu.univ-mlv.fr/english. Users: use it as a golden standard; do not be scared by so much information, you can use only the lists if you prefer. Users and descriptive linguists: constructive criticism is welcome.

  25. Lexicon-Grammar of idioms

  26. 3. Human-based provisions. Objective: psycholinguistic experiments. Introspective: avoid preconceptions; native linguists; mutual control; time limitation; readability of resources; formal criteria; differential semantic judgment.

  27. Psycholinguistic experiments. A reaction to biased introspective linguistics: separate informant from scientist; control age, sex, origin, number... of informants. Examples: recognising idioms as such; paraphrasing idioms.

  28. Psycholinguistic experiments. Drawbacks. Typical time required by an experiment on 20 forms: 2 months. Extrapolated velocity of construction of resources: 40 lexical entries/year (counting 3 forms/entry). Usually, the idioms need to be known beforehand. Not applicable for comprehensive resources.
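
The extrapolation on this slide can be checked with a few lines of arithmetic. The figures 20 forms, 2 months and 3 forms per entry come from the slide; applying the resulting rate to a 10,000-entry resource (the size quoted for the Lexicon-Grammar of French idioms on slide 24) is an illustrative assumption, as are the function names.

```python
def entries_per_year(forms_per_experiment, months_per_experiment, forms_per_entry):
    """Extrapolate a yearly rate of lexical entries from one experiment."""
    forms_per_year = forms_per_experiment * 12 / months_per_experiment
    return forms_per_year / forms_per_entry

def years_needed(total_entries, rate_per_year):
    """Years required to cover `total_entries` at a constant yearly rate."""
    return total_entries / rate_per_year

if __name__ == "__main__":
    rate = entries_per_year(20, 2, 3)   # 120 forms/year, i.e. 40 entries/year
    print(rate)
    print(years_needed(10_000, rate))   # illustrative target of 10,000 entries
```

At 40 entries per year, a resource of that size would take on the order of centuries, which is the sense in which the method is not applicable to comprehensive resources.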

  29. Human-based provisions: introspective. Specific solutions to the biases of introspective linguistics. Methodology and actual description simultaneously. European tradition: lexis/grammar interaction; description of idioms, 1980-now (http://infolingu.univ-mlv.fr/english). American tradition: Wordnet.

  30. Avoid preconceptions (1/2). Preconception 1: 'Manual construction of language resources is too difficult'; 'Manual construction of resources is error-prone'. Frequently read in (peer-reviewed) computer scientists' papers. The quality of manually constructed resources depends on the background, skills, training and effort of authors (cf. software). Dysfunctioning of scientific democracy in a case of multi-disciplinarity. At stake: the future of the institutions around the world that train people to construct high-quality language resources.

  31. Avoid preconceptions (2/2). Preconception 2: 'Descriptive linguistics is not difficult enough to be interesting'; 'Descriptive linguistics does not require much skill'; 'Making lists is not the point'. In fact, results of descriptive linguistics are basic information for theoretical linguistics and for computer applications.

  32. Native linguists. Native linguists are much better than non-native ones at: taking into account sense distinctions; inserting idioms in relevant sentences (this ensures that context is taken into account); taking into account semantic properties. Example: La défense a cité un témoin (témoin can have co-referents); Le patron a chié une pendule, not an elegant phrase (pendule cannot have co-referents).

  33. Native linguists. Drawbacks: results depend on skill, training and effort of the linguist; not applicable to languages without native speakers with higher education; not applicable to extinct languages.

  34. Mutual control. An idiom resource should be built by a team. Examples: Gross' Lexicon-Grammar of French verbal idioms: most idioms were listed during the meetings of construction of the Lexicon-Grammar of French verbs (5 linguists). The Belgium/France/Québec/Switzerland (BFQS) project: differences between idioms in these 4 varieties of French (4 to 6 linguists).

  35. The BFQS project

  36. The BFQS project. 1. Make a separate list for each variety. 2. Compare lists. Comparison requires meetings. If an idiom is not in the F list, the F author can have missed it. If an idiom in the B list is not understood by the F author, it is considered evidence that it does not belong to the F variety. Intermediate case: passively understood, not actively used. If an idiom in the B list is understood by the F author, compare interpretations: they can be different.
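
The comparison procedure described on this slide can be sketched as a small triage over two lists. This is not the BFQS project's tooling: the function name `triage`, the judgment labels 'unknown', 'passive' and 'active', and the placeholder idiom names are all assumptions made for illustration. The judgments themselves still come from a native linguist, typically during a meeting.

```python
def triage(b_list, f_list, f_judgment):
    """Classify idioms of the B list that are absent from the F list.

    `f_judgment` maps an idiom to the F author's judgment:
      'unknown' - not understood,
      'passive' - understood but not actively used,
      'active'  - understood and usable.
    Idioms without a judgment are queued for discussion.
    """
    report = {}
    for idiom in sorted(set(b_list) - set(f_list)):
        status = f_judgment.get(idiom)
        if status == "unknown":
            report[idiom] = "evidence that it does not belong to F"
        elif status == "passive":
            report[idiom] = "intermediate case"
        elif status == "active":
            report[idiom] = "missed by F author; compare interpretations"
        else:
            report[idiom] = "to be discussed at a meeting"
    return report

if __name__ == "__main__":
    # Placeholder entries, not real lexical data.
    b = ["idiom A", "idiom B", "idiom C", "idiom D", "idiom E"]
    f = ["idiom A"]
    judgments = {"idiom B": "unknown", "idiom C": "passive", "idiom D": "active"}
    for idiom, verdict in triage(b, f, judgments).items():
        print(idiom, "->", verdict)
```

The set difference only finds candidates. An idiom present in both lists can still differ in interpretation between varieties, which a comparison of list membership alone cannot detect.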

  37. Mutual control. Drawback. Cost: several years of weekly or monthly meetings. The grant for the BFQS project will cover only a part of publication costs.

  38. Readability of description. Goal: facilitate critical reviewing, update of resources. Example: table representation (rows: lexical items; columns: structure and properties); open-source software: HOOP (Sastre 2006). Density of representation: number of lexical items on the same screen or page; number of properties on the same screen or page. Metalanguage should not invade the description (which is the case with feature structures).

  39. Readability of description. Drawbacks: readable formats are usually not directly exploitable in computer applications; compilation processes are required. Cf. source code vs. executable code; lemma lexicon vs. inflected-form lexicon.
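
The relation between a readable table and a compiled resource can be sketched as follows. The table layout (semicolon-separated, one row per lexical item, `+`/`-` cells for properties) is an assumed simplification and not the actual Lexicon-Grammar or HOOP file format. The two rows and their passive values follow the acceptability judgments given on slide 43; the structure string in the second row is a hypothetical notation.

```python
import csv
import io

# Readable source: rows are lexical items, columns are structure and properties.
TABLE = """entry;structure;passive
mettre par écrit;N0 mettre par écrit N2;+
entendre de cette oreille;N0 ne pas entendre N1 de cette oreille;-
"""

def compile_table(text):
    """Compile the readable table into a dictionary usable by a program.

    Each '+' in a property column licenses the corresponding variant.
    """
    lexicon = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        variants = ["active"]
        if row["passive"] == "+":
            variants.append("passive")
        lexicon[row["entry"]] = {
            "structure": row["structure"],
            "variants": variants,
        }
    return lexicon

if __name__ == "__main__":
    for entry, info in compile_table(TABLE).items():
        print(entry, info["variants"])
```

The table stays the maintained source, in the same way that source code is maintained and executables are regenerated: linguists review and update the rows, and the compiled form is rebuilt whenever they change.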

  40. Time limitation. The description of a lexical item is normally limited to a few minutes. Regularities --> classification --> similar items are described in sequence --> efficiency. For properties, description by property is more efficient than description by entry. Even so, manual description of all idioms of a language takes several years.

  41. Formal criteria. Formal criteria based on acceptability of sentences. Example: co-referent of possessives. Leurs hôtes préviennent leurs désirs; Leurs hôtes préviennent nos désirs; Leurs hôtes reprennent leurs esprits; *Leurs hôtes reprennent nos esprits. Identifying such a constraint is immediate for a linguist trained in distributional analysis.

  42. Formal criteria. Complementarity between idiom resource and grammar is obtained through distributional analysis. Luc a couché par écrit ses instructions; Luc a mis par écrit ses instructions; *Luc a placé par écrit ses instructions; *Luc a couché par imprimé ses instructions; Luc a couché par écrit ses demandes; ?*Ses instructions sont par écrit. --> 2 expressions: N0 coucher par écrit N2; N0 mettre par écrit N2.

  43. Formal criteria. Limits of variation are obtained through systematic tests. Luc met cela par écrit; Cela est mis par écrit par Luc; Luc n'entend pas cela de cette oreille; *Cela n'est pas entendu de cette oreille par Luc.

  44. Differential semantic judgment. Comparison of variants: distributional analysis; taking into account connotations, implications. Recognising the limits of language plays: intuition (cf. acceptability judgment, lexicalization, institutionalization); requires a corpus also (context: headlines, advertisement...).

  45. Results. Theoretical results (M. Gross 1982, P. Freckleton 1985, P. Machonis 1985). 'Free' grammar vs. idiom grammar: idiom grammar accounts for variants; idiom grammar is close to free grammar (same structures, same transformations). Idiom entries are more numerous than simple entries. Support verb constructions vs. idioms.

  46. Results. Idioms with free determiner, including indefinite determiner: (1) La défense a cité (un + ce) témoin; Ce numéro a été le clou de (un + le) spectacle de 2002. Distributionally frozen. The noun with the free determiner can have co-referents, even without language plays. More in core than in periphery. The noun has to be attached both to a simple entry and to the idiom entry. 1700 examples like (1), mostly technical.

  47. Conclusion. Different backgrounds, different approaches: backgrounds are so different that much synergy between researchers is missed. A result can have very different users: theoretical; practical (computer applications, construction of further resources). Distinct approaches can converge to an objective.

  48. Conclusion. Synergy between corpus approaches and introspective approaches: introspective approaches produce dense, informative resources; resources are useful to corpus exploration; corpus exploration is an aid to introspective approaches. Excessive methodological provisions: throwing away the baby of idiom description with the bath water of introspective linguistics.

  49. Bibliographical references. Freckleton, Peter. 1985. Sentence idioms in English, Working Papers in Linguistics, University of Melbourne, pp. 153-168 + appendix (196 p.). Gross, Maurice. 1982. Une classification des phrases "figées" du français, Revue Québécoise de Linguistique 11.2, pp. 151-185, Montréal: UQAM. Machonis, Peter A. 1985. Transformations of verb phrase idioms: passivization, particle movement, dative shift, American Speech 60:4, pp. 291-308. Sastre Martinez, Javier M. 2006. Computer Tools for the Management of Lexicon-Grammar Databases, poster, Proceedings of the 13th Conference on natural language processing, TALN 2006, Leuven, 10-13 April 2006, UCL, Presses Universitaires de Louvain, pp. 600-608.
