
Automatic term extraction from domain corpora

Piek Vossen, Irion Technologies / Vrije Universiteit Amsterdam. Guest lecture (Gastcollege), Corpus-based Methods, Universiteit Nijmegen, 26 November 2007. Overview: corpus versus domain-based text collections; customer case; term extraction; demo.


Presentation Transcript


  1. Automatic term extraction from domain corpora
     Piek Vossen, Irion Technologies / Vrije Universiteit Amsterdam
     Gastcollege (guest lecture), Corpus-based Methods, Universiteit Nijmegen, 26 November 2007

  2. Overview
     • Corpus versus Domain-based text collections
     • Customer-case
     • Term-extraction
     • Demo

  3. Corpus versus Domain-based text collections
     • Corpus to study linguistic phenomena:
       • INL corpus: NRC Handelsblad
       • Corpus geschreven Nederlands
       • British National Corpus
       • Brown corpus -> SemCor
     • Domain corpora:
       • portals
       • Wikipedia
     • Customer corpora:
       • web sites
       • manuals

  4. Customer-case
     • Connect suppliers and buyers, and create traffic and advertising
     • B2B: companies with specialized products and services
       • terminology driven
       • branch (industry sector) driven
     • C2B: consumers looking for products and services
       • general language terminology: -> folksonomy
       • bottom-up

  5. • Product name in ontology of 150,000 products: "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)
     • Product name on company website: "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We specialize in: pumps and pump components such as valves)
     • User query searching for products or services: "vlinderkleppen voor een hoge drukpomp" (butterfly valves for a high-pressure pump)
     • Subscription for product names
     • Companies in database, 1.5 million websites


  8. Term-extraction
     • morpho-syntactic analysis
     • statistical analysis
     • conceptual analysis
     • contextual analysis

  9. Term-extraction: morpho-syntactic analysis
     • Tokenization, tagging and NP-chunking:
       • “een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)
     • Term candidates:
       • Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
       • Word combinations including syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
       • Head of compounds: aanvaller (attacker).
     • Term is a concept:
       • Normalized form (plural-singular variants, synonyms)
       • Hypernym based on the syntactic head
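
The candidate generation on this slide can be sketched roughly as follows. This is only an illustration, not the tool used in the lecture: the `Token` class, the tag names (`DET`, `ADJ`, `NOUN`, `ADP`) and the chunking rule are assumptions, and the sketch covers only NP heads and modifier-plus-head combinations. Combinations across a preposition (kaart voor vleugelaanvaller) and compound splitting (aanvaller) are left out.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str  # coarse part-of-speech tag; the tag set is an assumption


def np_chunks(tokens):
    """Yield simple NP chunks: runs of DET/ADJ/NOUN, restarted at each DET."""
    chunk = []
    for tok in tokens:
        if tok.pos not in {"DET", "ADJ", "NOUN"}:
            if chunk:
                yield chunk
            chunk = []
            continue
        if tok.pos == "DET" and chunk:
            yield chunk
            chunk = []
        chunk.append(tok)
    if chunk:
        yield chunk


def candidates(chunk):
    """Return the head noun plus every modifier+head combination."""
    content = [t for t in chunk if t.pos != "DET"]
    if not content or content[-1].pos != "NOUN":
        return []
    out = [content[-1].text]  # syntactic head: last noun of the chunk
    for start in range(len(content) - 2, -1, -1):
        out.append(" ".join(t.text for t in content[start:]))
    return out


sentence = [
    Token("een", "DET"), Token("gele", "ADJ"), Token("kaart", "NOUN"),
    Token("voor", "ADP"), Token("de", "DET"), Token("vleugelaanvaller", "NOUN"),
]
for chunk in np_chunks(sentence):
    print(candidates(chunk))
```

For the slide's example sentence this yields kaart and gele kaart from the first chunk and vleugelaanvaller from the second.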


  11. Term-extraction: statistical analysis
      • Reference corpus based on 500 websites of a diverse range of companies
      • Salience = normFreq * normRef
      • normFreq = normalized frequency of terms on the website
        • normFreq = nTermFrequencynWords / nPages
      • normRef = normalized number of websites on which the term occurs in the reference corpus
        • multiwords: normRef = 1-((nWebsitesnWords) / (referenceCorpusSize))
        • singlewords: normRef = 1-((nWebsites) / (referenceCorpusSize))
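
A minimal sketch of the single-word salience computation, under stated assumptions: `norm_freq` is passed in already normalized, because the way nTermFrequency and nWords combine is not legible in the transcript, and the multiword variant is omitted for the same reason. The function names are illustrative, and the default of 500 is the reference corpus size given on the slide.

```python
def norm_ref_single(n_websites: int, reference_corpus_size: int = 500) -> float:
    """normRef for single words: 1 - nWebsites / referenceCorpusSize."""
    if reference_corpus_size <= 0:
        raise ValueError("reference corpus size must be positive")
    return 1.0 - n_websites / reference_corpus_size


def salience(norm_freq: float, n_websites: int,
             reference_corpus_size: int = 500) -> float:
    """Salience = normFreq * normRef (single-word case only)."""
    return norm_freq * norm_ref_single(n_websites, reference_corpus_size)


# A term found on few reference websites keeps most of its frequency score;
# a term found on every reference website drops to zero.
print(salience(0.2, 25))
print(salience(0.2, 500))
```

The effect is that terms common across the whole reference corpus are suppressed, while terms frequent on one site but rare elsewhere rank high.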


  13. Term-extraction: conceptual analysis
      • Structural properties of the term hierarchy
      • Poor hierarchies:
        • many tops
        • few levels
        • diverse branches
      • Each branch is a concept:
        • number of descendants and levels
        • cumulated frequency of descendants
      • Branch profiling:
        • Domain classification of the hierarchy
        • Domain classification of each branch
        • Minimal overlap in domain
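
The structural properties listed here (tops, levels, descendants, cumulated frequency) could be computed along these lines. This is a sketch under assumptions: the hierarchy is represented as a term-to-parent mapping, and the terms and frequencies in the example (including "rode kaart") are made up for illustration.

```python
from collections import defaultdict


def build_children(parent_of):
    """Invert a term -> parent mapping into parent -> list of children."""
    children = defaultdict(list)
    for term, parent in parent_of.items():
        if parent is not None:
            children[parent].append(term)
    return children


def branch_stats(term, children, freq):
    """Return (descendants, levels below term, cumulated descendant frequency)."""
    n_desc, levels, cum_freq = 0, 0, 0
    for child in children.get(term, []):
        d, l, c = branch_stats(child, children, freq)
        n_desc += 1 + d
        levels = max(levels, 1 + l)
        cum_freq += freq.get(child, 0) + c
    return n_desc, levels, cum_freq


# Hypothetical mini-hierarchy: each term points to its syntactic head.
parent_of = {
    "kaart": None,
    "gele kaart": "kaart",
    "rode kaart": "kaart",
    "aanvaller": None,
    "vleugelaanvaller": "aanvaller",
}
freq = {"kaart": 5, "gele kaart": 3, "rode kaart": 2,
        "aanvaller": 1, "vleugelaanvaller": 4}

children = build_children(parent_of)
tops = [t for t, p in parent_of.items() if p is None]
for top in tops:
    print(top, branch_stats(top, children, freq))
```

A hierarchy with many tops and shallow branches, in the slide's terms a poor hierarchy, would show up as a long `tops` list with low level counts.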


  15. Wordnet: Domain information
      • Domains: Clothing, Sport, Finance, Culture, Music, Ball sports, Winter sports
      • Vocabularies of languages, with the concept each word sense points to:
        • bank 1: rec: 12345 - financial institute
        • bank 2: rec: 54321 - river side
        • violin 1: rec: 9876 - small string instrument
        • violin 2, violist: rec: 65438 - musician playing a violin
        • string 1: rec: 35576 - string of an instrument
        • string 2: rec: 29551 - underwear
      • Further concepts: rec: 42654 - musician; rec: 25876 - string instrument
      • Relations between concepts: type-of, part-of

  16. Term-extraction: contextual analysis
      • Anything can be a product or service: there are no intrinsic properties to define products
      • Contextual features:
        • context patterns for products
        • product pages
        • special marking in HTML

  17. Term-extraction: contextual analysis
      • Context patterns for products:
        • 144 patterns in English and 288 patterns in German
        • [we supply] [we deliver] [we provide] [our products are] [we are one of the leading, producers on the market for] [we are, leading, producers on the market for] [is one of the leading, producers on the market for] [is, leading, producer on the market for] [we develop, products for] [we design, products for] [we produce, products for] [Our most common products]
      • Each term is scored for a product context in terms of the strength of the pattern and the distance
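
One way such a context score might look is sketched below. The slide only says that the score depends on pattern strength and distance; the strength values, the token-based distance, and the formula strength / (1 + distance) are all assumptions chosen for illustration, and only three of the simple patterns are included.

```python
# Pattern strengths are hypothetical; the slides do not give actual values.
PATTERNS = {
    "we supply": 1.0,
    "we deliver": 1.0,
    "our products are": 0.8,
}


def find_all(tokens, phrase):
    """Return every start index at which the token list `phrase` occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def context_score(text, term):
    """Best strength / (1 + distance) over all patterns preceding the term."""
    tokens = text.lower().split()
    term_starts = find_all(tokens, term.lower().split())
    best = 0.0
    for pattern, strength in PATTERNS.items():
        p_tokens = pattern.split()
        for p_start in find_all(tokens, p_tokens):
            p_end = p_start + len(p_tokens)
            for t_start in term_starts:
                if t_start >= p_end:  # term must follow the pattern
                    distance = t_start - p_end
                    best = max(best, strength / (1 + distance))
    return best


text = "We supply butterfly valves and pumps"
print(context_score(text, "butterfly valves"))  # directly after the pattern
print(context_score(text, "pumps"))             # three tokens further away
```

Terms that directly follow a strong pattern score highest, and the score decays as the term moves away from the pattern.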

  18. Term-extraction: contextual analysis
      • Product pages:
        • landing page: index.html
        • html files with product names: product, service, solution
        • html files referred to by these pages
        • html files referred to by menus with such names
      • Special marking in HTML:
        • meta keywords
        • headings and titles
        • menus
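
The "special marking in HTML" features can be collected with Python's standard `html.parser` module, roughly as below. The class name and the sample page are illustrative assumptions. Menus are not handled, since there is no single standard markup for them.

```python
from html.parser import HTMLParser


class MarkedTextParser(HTMLParser):
    """Collect meta keywords, page titles and heading texts from a page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.titles = []
        self.headings = []
        self._open = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "title" or tag in self.HEADINGS:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._open is None:
            return
        target = self.titles if self._open == "title" else self.headings
        target.append(text)


page = (
    "<html><head><title>Pumps and valves</title>"
    '<meta name="keywords" content="pumps, valves, butterfly valves">'
    "</head><body><h1>Our products</h1><p>Plain body text.</p></body></html>"
)
parser = MarkedTextParser()
parser.feed(page)
print(parser.keywords, parser.titles, parser.headings)
```

Terms found in these positions could then receive a higher contextual weight than terms that appear only in running text.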

  19. Product terms with feature bundles

  20. <class>
        <name><![CDATA[arabica-kaffee gemahlene]]></name>
        <id>48</id>
        <pos>1</pos>
        <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
        <parent_form><![CDATA[Gemahlener]]></parent_form>
        <documents>1</documents>
        <frequency>1</frequency>
        <salience>0.0523</salience>
        <connectivity>10</connectivity>
        <modifiers>
          <modifier>arabica</modifier>
          <modifier>kaffee</modifier>
          <modifier>arabica-kaffee</modifier>
        </modifiers>
        <profileMatch>-1</profileMatch>
        <profile/>
        <termSource><![CDATA[#product]]></termSource>
        <cumfrequency_parent>1</cumfrequency_parent>
        <cumdocuments_parent>1</cumdocuments_parent>
        <siblings>1</siblings>
        <features>
          <feature>
            <featureName>RIGHT</featureName>
            <featureValue>kaffee</featureValue>
            <featureScore>1.0</featureScore>
          </feature>
        </features>
      </class>
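
A record like this can be read with Python's standard `xml.etree.ElementTree`. The sketch below uses a shortened copy of the record and extracts a few fields. The function name and the choice of fields are illustrative, not part of the original tool.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the term record shown on the slide.
RECORD = """<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <salience>0.0523</salience>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>"""


def read_term(xml_text):
    """Parse one <class> record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "id": int(root.findtext("id")),
        "salience": float(root.findtext("salience")),
        "modifiers": [m.text for m in root.iter("modifier")],
        "features": [
            (f.findtext("featureName"),
             f.findtext("featureValue"),
             float(f.findtext("featureScore")))
            for f in root.iter("feature")
        ],
    }


term = read_term(RECORD)
print(term["name"], term["salience"], term["features"])
```

The CDATA sections are resolved by the parser, so `name` comes back as ordinary text.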

  21. Evaluation of French product extraction

  22. Evaluation of French product extraction

  23. Evaluation of French product extraction
