
WP 2: Semi-automatic metadata generation driven by Language Technology Resources


Presentation Transcript


  1. WP 2: Semi-automatic metadata generation driven by Language Technology Resources Lothar Lemnitzer Project review, Utrecht, 1 Feb 2007

  2. Our Background • Experience in corpus annotation and information extraction from texts • Experience in grammar development • Experience in statistical modelling • Experience in eLearning

  3. WP2 Dependencies • WP 1: collection and preparation of LOs • WP 3: the results of WP 2 are input to this WP • WP 4: integration of the tools • WP 5: evaluation and validation

  4. M12 Deliverables Language Resources: • > 200 000 running words per language • With structural and linguistic annotation • > 1000 manually selected keywords • > 300 manually selected definitions • Local grammars for definitory contexts

  5. M12 Deliverables Documentation: • Guidelines for linguistic annotation • Guidelines for keyword annotation • Guidelines for the annotation of definitions • (Guidelines for evaluation)

  6. M12 Deliverables Tools: • Prototype Keyword Extractor • Prototype Glossary Candidate Detector

  7. System architecture (diagram): user documents (PDF, DOC, HTML, SCORM, XML) and the repository feed two converters (CONVERTOR 1 for SCORM and pseudo-structured HTML documents, CONVERTOR 2 for SCORM and pseudo-structured documents into basic XML); a linguistic processor (lemmatizer, POS tagger, partial parser) with language-specific lexicons (BG, CZ, DT, EN, GE, MT, PL, PT, RO) produces linguistically annotated XML, metadata (keywords) and a glossary, which feed cross-lingual retrieval over an ontology and connect to the LMS user profile.

  8. Linguistically annotated learning objects • Structural annotation: par, s, chunk, tok • Linguistic annotation: base, ctag, msd attributes (→ Example 1) • Specific annotation: marked term, defining text

  9. Part of the DTD

    <!ELEMENT markedTerm (chunk | tok | markedTerm)+ >
    <!ATTLIST markedTerm
        %a.ana;
        kw      (y|n)  "n"
        dt      (y|n)  "n"
        status  CDATA  #IMPLIED
        comment CDATA  #IMPLIED >

    <!ELEMENT definingText (chunk | tok | markedTerm)+ >
    <!ATTLIST definingText
        id       ID     #IMPLIED
        xml:lang CDATA  #IMPLIED
        lang     CDATA  #IMPLIED
        rend     CDATA  #IMPLIED
        type     CDATA  #IMPLIED
        wsd      CDATA  #IMPLIED
        def      IDREF  #IMPLIED
        continue CDATA  #IMPLIED
        part     CDATA  #IMPLIED
        status   CDATA  #IMPLIED
        comment  CDATA  #IMPLIED >

  10. Linguistically annotated learning objects Use: • Linguistically annotated texts are the input to the extraction tools • Marked terms and defining texts serve as training material and/or as a gold standard for evaluation
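As a rough illustration of the first point: a tool consuming this annotation only needs to walk the tok elements and read their base, ctag and msd attributes. A minimal Python sketch, assuming the element and attribute names from the DTD fragment above (the file name and the concrete ctag values are placeholders):

    # Read an annotated LO and pull out lemma / POS information per token.
    import xml.etree.ElementTree as ET

    tree = ET.parse("learning_object.xml")   # hypothetical input file
    for tok in tree.iter("tok"):
        lemma = tok.get("base")              # lemma of the token
        pos = tok.get("ctag")                # coarse POS category
        msd = tok.get("msd")                 # morphosyntactic description
        if pos in ("N", "NP"):               # e.g. keep nouns / proper nouns
            print(lemma, pos, msd)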

  11. Characteristics of keywords • Good keywords have a typical, non-random distribution within and across LOs • Keywords tend to appear more often at certain places in texts (headings etc.) • Keywords are often highlighted / emphasised by authors

  12. Distributional features of keywords We use the following metrics to measure “keywordiness” by distribution: • Term frequency / inverse document frequency (tf*idf) • Residual inverse document frequency (RIDF) • An adjusted version of RIDF (adjusted by term frequency) to model the inter-text distribution of keywords • Term burstiness to model the intra-text distribution of keywords
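To make the first two metrics concrete, here is a minimal sketch assuming each LO has been reduced to a bag of lemmas. RIDF follows the usual Church & Gale formulation: observed IDF minus the IDF expected under a Poisson model; the variable names are our own.

    import math

    def tf_idf(tf, df, n_docs):
        # tf: frequency in one LO; df: number of LOs containing the term
        return tf * math.log(n_docs / df)

    def ridf(cf, df, n_docs):
        # cf: total frequency in the collection; Poisson rate = cf / n_docs
        idf = -math.log2(df / n_docs)
        expected_idf = -math.log2(1 - math.exp(-cf / n_docs))
        return idf - expected_idf

Terms whose observed document frequency is much lower than the Poisson model predicts (high RIDF) are concentrated in a few LOs, which is exactly the non-random distribution good keywords show.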

  13. Structural and layout features of keywords We will use: • Knowledge of text structure to identify salient regions (e.g., headings) • Layout features of texts to identify emphasised words (→ Example 2) Words with such features are weighted higher
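A minimal sketch of this weighting idea; the weight values are illustrative assumptions, not project settings:

    HEADING_WEIGHT = 2.0     # assumed boost for words in headings
    EMPHASIS_WEIGHT = 1.5    # assumed boost for emphasised words

    def weighted_score(base_score, in_heading, emphasised):
        weight = 1.0
        if in_heading:
            weight *= HEADING_WEIGHT
        if emphasised:
            weight *= EMPHASIS_WEIGHT
        return base_score * weight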

  14. Complex keywords • Complex, multi-word keywords are relevant; there are differences between languages • The keyword extractor allows the extraction of n-grams of any length • Evaluation showed that including bigrams and even trigrams improves the results; with larger n-grams the performance begins to drop • The maximum keyword length can be specified as a parameter
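A sketch of the n-gram candidate generation, where max_len plays the role of the maximum-keyword-length parameter mentioned above:

    def candidate_ngrams(tokens, max_len=3):
        # Yield every contiguous n-gram up to max_len tokens long.
        for i in range(len(tokens)):
            for n in range(1, max_len + 1):
                if i + n <= len(tokens):
                    yield tuple(tokens[i:i + n])

    # candidate_ngrams(["style", "of", "learning"], max_len=2) yields
    # ('style',), ('style', 'of'), ('of',), ('of', 'learning'), ('learning',)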

  15. Complex keywords

  16. Language settings for the keyword extractor • The selection of single keywords is restricted to a few ctag categories and / or msd values (nouns, proper nouns, unknown words, and some verbs for most languages) • Multiword patterns are restricted with respect to the position of function words (“style of learning” is acceptable; “of learning behaviours” is not)
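A sketch of this filtering step, using the English example from the slide; the tag names and the stop list are assumptions:

    ALLOWED_CTAGS = {"N", "NP", "UNK", "V"}    # nouns, proper nouns, unknown words, some verbs
    FUNCTION_WORDS = {"of", "the", "a", "in"}  # illustrative function-word list

    def keep_candidate(ngram, ctags):
        # Single keywords: restricted to the allowed ctag categories.
        if len(ngram) == 1:
            return ctags[0] in ALLOWED_CTAGS
        # Multiword patterns: no function word at the edges, so
        # "style of learning" passes while "of learning behaviours" fails.
        return ngram[0] not in FUNCTION_WORDS and ngram[-1] not in FUNCTION_WORDS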

  17. Output of the Keyword Extractor A list, ordered by “keywordiness” value, whose elements contain • The normalized form of the keyword • (Statistical figures) • A list of attested forms of the keyword (→ Example 3)
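The shape of one entry in that list might look as follows (the field names are hypothetical; Example 3 shows the real format):

    keyword_record = {
        "normalized": "learning style",                     # normalized form
        "score": 7.42,                                      # "keywordiness" value
        "attested": ["learning style", "learning styles"],  # forms found in the LOs
    }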

  18. Evaluation strategy We will proceed in three steps: • Manually assigned keywords will be used to measure precision and recall of the keyword extractor • Human annotators will judge and rate the results from the extractor • The same document(s) will be annotated by several test persons in order to estimate inter-annotator agreement on this task
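For the first step, precision and recall reduce to set comparisons between the extracted and the manually assigned keywords; a minimal sketch:

    def precision_recall(extracted, gold):
        extracted, gold = set(extracted), set(gold)
        true_positives = len(extracted & gold)
        precision = true_positives / len(extracted) if extracted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        return precision, recall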

  19. Summary With the keyword extractor, • We combine several established statistical metrics with qualitative and linguistic features • We place special emphasis on multiword keywords • We evaluate the impact of these features on the performance of the tool for eight languages • We integrate the tool into an eLearning application • We have a prototype user interface for the tool

  20. Identification of definitory contexts • Empirical approach based on the linguistic annotation of the LOs • Workflow: • Definitory contexts are searched for and marked in the LOs • Recurrent patterns are characterized quantitatively and qualitatively (→ Example 4) • Local grammars are drafted on the basis of these recurrent patterns • The extraction of definitory contexts is performed by lxtransduce (University of Edinburgh, LTG)

  21. Characteristics of local grammars • Grammar rules match and wrap subtrees of the XML document tree • A grammar rule refers to subrules which match substructures • Rules can refer to lexical lists to constrain categories further • The defined term should be identified and marked

  22. Example Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters. (“A bold letter is a letter that is printed darker than the other letters.”)

  23.

    <rule name="simple_NP">
      <seq>
        <and>
          <ref name="art"/>
          <ref name="cap"/>
        </and>
        <ref name="adj" mult="*"/>
        <ref name="noun" mult="+"/>
      </seq>
    </rule>

  Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

  24.

    <query match="tok[@ctag='V' and @base='zijn' and @msd[starts-with(.,'hulpofkopp')]]"/>

  Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

  25.

    <rule name="noun_phrase">
      <seq>
        <ref name="art" mult="?"/>
        <ref name="adj" mult="*"/>
        <ref name="noun" mult="+"/>
      </seq>
    </rule>

  Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

  26.

    <rule name="is_are_def">
      <seq>
        <ref name="simple_NP"/>
        <query match="tok[@ctag='V' and @base='zijn' and @msd[starts-with(.,'hulpofkopp')]]"/>
        <ref name="noun_phrase"/>
        <ref name="tok_or_chunk" mult="*"/>
      </seq>
    </rule>

  Een vette letter is een letter die zwarter wordt afgedrukt dan de andere letters.

  27. Output of the Glossary Candidate Detector • An ordered list of glossary candidates • The defined term is marked • (Larger context: one preceding and one following sentence) (→ Example 10)

  28. Evaluation Strategy We will proceed in two steps: • Manually marked definitory contexts will be used to measure precision and recall of the glossary candidate detector • Human annotators will judge the results from the glossary candidate detector and rate their quality / completeness

  29. Results

                  Precision   Recall
    Own LOs       21.5 %      34.9 %
    Verbs only    34.1 %      29.0 %
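For comparison, the harmonic mean (F1) of these figures can be derived as follows; the F1 values are our computation, not part of the reported results:

    def f1(p, r):
        return 2 * p * r / (p + r)

    f1(0.215, 0.349)   # ≈ 0.27 for "Own LOs"
    f1(0.341, 0.290)   # ≈ 0.31 for "Verbs only"

The verbs-only grammar trades recall for precision and comes out slightly ahead on F1.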

  30. Evaluation Questions to be answered by a user-centered evaluation: • Is there a preference for higher recall or for higher precision? • Do users benefit from seeing a larger context?

  31. Integration of functionalities (architecture diagram): a development server (CVS) holds the code and data for the content portal, the KW/DC tools and the ontology; a migration tool pushes nightly updates to a Java web server (Tomcat, with Axis web services hosting the KW/DC/ontology Java classes and data) and to the ILIAS server (application logic, nuSOAP web services, LOs). The functionalities are used through SOAP and evaluated within ILIAS; third-party tools and a servlet/JSP user interface access the functionalities directly.
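A minimal sketch of what “use functionalities through SOAP” could look like from a client; the WSDL URL, service name and parameters are hypothetical, since the real Axis/nuSOAP endpoints are project-internal:

    from zeep import Client   # generic SOAP client library

    # Hypothetical WSDL location on the Tomcat/Axis server
    client = Client("http://tomcat-host:8080/axis/services/KeywordService?wsdl")
    keywords = client.service.extractKeywords(documentId="LO-123", maxKeywords=10)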

  32. User Interface Prototypes

  33. User Interface Prototypes
