Text mining tool for ontology engineering based on use of product taxonomy and web directory

Text mining tool for ontology engineering based on use of product taxonomy and web directory. Jan Nemrava and Vojtech Svatek, Department of Information and Knowledge Engineering, VSE Praha. Current state: IE and ontology learning are frequently discussed issues in the field of the Semantic Web.

Presentation Transcript


  1. Text mining tool for ontology engineering based on use of product taxonomy and web directory
     Jan Nemrava and Vojtech Svatek
     Department of Information and Knowledge Engineering, VSE Praha

  2. Current state
     • IE and ontology learning are frequently discussed issues in the field of the Semantic Web.
     • Semi-automatic and automatic methods for ontology-based information extraction are needed.
     • The Web is a great source of unstructured text.
     DATESO 2005

  3. Task is …
     • Collect specific words (verbs in our case) that usually occur together with a particular product category, as support for ontology designers.
     • Build small, specialized ontologies, each concerning one product category and describing its frequent relations in common text.
     • Make use of fulltext search engines and the DMOZ directory for retrieving information.
     • Make use of the UNSPSC (United Nations Standard Products and Services Code) product catalogue.

  4. Web directories are rarely valid taxonomies.
     • It is easy to see that subheadings are often not specializations of their headings.
     • Some of them are not even concepts (names of entities) but properties that implicitly restrict the extension of a preceding concept in the hierarchy. Consider for example .../Industries/Construction and Maintenance/Materials and Supplies/Masonry_and_Stone/Natural Stone/International Sources/Mexico.

  5. Proposal of method …
     • Obtain so-called "indicator verbs" that characterize a particular term (a product category in our case) in UNSPSC.
     • Particular terms will then be generalized, so that verbs indicative of the upper levels of these terms can be mined as well.
     • Join the UNSPSC taxonomy and its list of products with the content of company websites to gain valuable information about verbs that usually occur in one sentence with some product category from the taxonomy.
     • Use hand-classified web directories containing relevant web sites.
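The generalization step on slide 5 can be illustrated with a small sketch. It relies on the fact that UNSPSC codes are eight digits read as four two-digit levels (segment, family, class, commodity), so a parent code can be formed by zeroing the trailing pairs. The function names, the verb counts and the two codes below are hypothetical placeholders and were not checked against the catalogue; this is not the authors' implementation.

```python
from collections import Counter

def parent_code(code, level):
    """Keep the first `level` two-digit pairs of a code and zero the rest."""
    keep = 2 * level
    return code[:keep] + "0" * (len(code) - keep)

def roll_up(verbs_by_code, level):
    """Sum verb counts of all codes that share the same ancestor at `level`."""
    rolled = {}
    for code, counts in verbs_by_code.items():
        rolled.setdefault(parent_code(code, level), Counter()).update(counts)
    return rolled

# Hypothetical commodity-level verb counts (codes are illustrative only).
verbs_by_code = {
    "24101601": Counter({"lift": 3, "move": 1}),
    "24101602": Counter({"lift": 2, "handle": 4}),
}

# Level 3 corresponds to the class level, one step above commodities.
class_level = roll_up(verbs_by_code, level=3)
```

Verbs that stay frequent after the roll-up are candidates for characterizing the upper-level term rather than a single product.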

  6. Task sequence decomposition
     • Manually select a UNSPSC product and the corresponding product category from the DMOZ Business branch:
       • Search in directory heading names
       • Search in web site descriptions
       • Use fulltext search
     • 1) Input: URL of a DMOZ directory containing companies that manufacture the desired product.
          Output: List of company URLs.
     • 2) Input: URL of a company website.
          Output: List of web pages containing the target term.
     • 3) Input: Web page containing the term.
          Output: File with extracted sentences containing the term.
     • 4) Input: Sentences with the term.
          Output: Tagged sentences.
     • 5) Input: Verbs.
          Output: Lemmatized, grouped and saved verbs.
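Steps 3 to 5 of the sequence above can be sketched roughly as follows. This is a minimal illustration, not the tool described in the talk: the sentence splitter is naive, and the small `VERB_LEMMAS` dictionary is a hypothetical stand-in for the part-of-speech tagger and lemmatizer that a real run would need. The sample text is invented.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: inflected verb form -> lemma.
# It replaces a real POS tagger and lemmatizer for illustration only.
VERB_LEMMAS = {
    "lift": "lift", "lifts": "lift", "lifted": "lift", "lifting": "lift",
    "move": "move", "moves": "move", "moved": "move", "moving": "move",
    "is": "be", "are": "be", "was": "be",
}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentences_with_term(text, term):
    """Step 3: keep only the sentences that mention the target term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in split_sentences(text) if pattern.search(s)]

def count_verbs(sentences):
    """Steps 4-5: look up verbs in the lexicon, lemmatize and count them."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[A-Za-z]+", sentence.lower()):
            lemma = VERB_LEMMAS.get(token)
            if lemma is not None:
                counts[lemma] += 1
    return counts

# Invented page text for the target term "conveyor".
page_text = (
    "Our conveyor moves pallets quickly. "
    "We are a family company. "
    "The conveyor is lifting and moving boxes."
)
verb_counts = count_verbs(sentences_with_term(page_text, "conveyor"))
```

The second sentence is dropped because it does not contain the term, so its verb does not contribute to the counts for this product category.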

  7. Experiment
     • Handling equipment branch / UNSPSC product with the corresponding DMOZ category
     • The goal is to find verbs that are:
       • common to most products
       • characteristic of one branch of products
       • specific to a small group of products, or even to only one product
     • 7 product categories; 303 verbs collected, occurring 7300 times on web sites.
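One simple way to approximate the three verb groups named on slide 7 is to count in how many product categories each verb occurs. The sketch below assumes that heuristic; the thresholds `common_share` and `specific_max`, the category names and the verb lists are all hypothetical, and counting categories is only a rough proxy for "one branch of products".

```python
def classify_verbs(verbs_by_category, common_share=0.8, specific_max=1):
    """Group verbs by the number of categories in which they occur."""
    n_categories = len(verbs_by_category)
    spread = {}
    for verbs in verbs_by_category.values():
        for verb in set(verbs):  # count each verb once per category
            spread[verb] = spread.get(verb, 0) + 1

    groups = {"common": [], "branch": [], "specific": []}
    for verb, n in sorted(spread.items()):
        if n / n_categories >= common_share:
            groups["common"].append(verb)      # occurs almost everywhere
        elif n <= specific_max:
            groups["specific"].append(verb)    # occurs in very few categories
        else:
            groups["branch"].append(verb)      # shared by several categories
    return groups

# Invented example data with four categories.
verbs_by_category = {
    "conveyors": ["be", "move", "handle"],
    "hoists": ["be", "lift", "handle"],
    "forklifts": ["be", "lift", "handle", "stack"],
    "cranes": ["be", "lift"],
}
groups = classify_verbs(verbs_by_category)
```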

  8. Experiment

  9. Experiments
     • Some verbs are obviously entirely neutral and do not characterize the products at all (be, have, provide and use).
     • Some are connected with manufacturing (design, require, offer, make, contact, manufacture, develop, supply).
     • Some describe activities of manipulating material (handle, lift, install and move).

  10. Experiments

  11. Normalization
     • Fij = fij * (Vtj / V)
     • Croft's normalization moderates the effect of high-frequency verbs
     • cf = K + (1 - K) * fij / mij
     • TF/IDF
     • wij = fij * log2(N / n)
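The three formulas on slide 11 can be written out directly. The slide does not define its symbols, so the readings in the comments are assumptions: fij as the frequency of verb i in category j, mij as the maximum verb frequency in that category, K as a constant between 0 and 1, N as the number of categories and n as the number of categories containing the verb. Vtj and V are passed through as given, since the slide leaves their meaning open. All numbers are invented.

```python
import math

def scaled_frequency(f_ij, vt_j, v):
    """Fij = fij * (Vtj / V); Vtj and V are used exactly as on the slide."""
    return f_ij * (vt_j / v)

def croft_normalized(f_ij, m_ij, k=0.5):
    """cf = K + (1 - K) * fij / mij, assuming mij is the maximum frequency."""
    return k + (1 - k) * f_ij / m_ij

def tf_idf(f_ij, n_total, n_with_verb):
    """wij = fij * log2(N / n), assuming N and n count categories."""
    return f_ij * math.log2(n_total / n_with_verb)

# Invented values for illustration.
example_scaled = scaled_frequency(10, vt_j=50, v=200)
example_croft = croft_normalized(5, m_ij=10, k=0.5)
example_rare = tf_idf(3, n_total=8, n_with_verb=2)
example_everywhere = tf_idf(3, n_total=7, n_with_verb=7)
```

Under this reading, a verb that occurs in every category gets a TF/IDF weight of zero, whatever its raw frequency, because log2(N / n) is zero when n equals N.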

  12. Remaining problems …
     • Automate the assignment of UNSPSC categories to DMOZ categories.
     • Some UNSPSC categories have no appropriate DMOZ category, which leads to few or no web sites.
     • Some categories are less informative.

  13. Thank you!
