
Link Distribution on Wikipedia



Presentation Transcript


  1. Link Distribution on Wikipedia [0422] KwangHee Park

  2. Table of contents • Introduction • Similarity between documents • Error cases • Modify word bag • Conclusion

  3. Introduction • Why focus on links • When someone creates a new article in Wikipedia, they mostly just link to a source in another language or to similar and related articles. After that, the article is written by others • Assumption • The link terms in a Wikipedia article are the key terms that represent the specific characteristics of that article

  4. Introduction • The problem we want to solve is • To analyze the latent distribution of a set of target documents by topic modeling

  5. Topic modeling – our approach • Target • Document = Wikipedia article • Terms = linked terms in the document • Modeling method • LDA • Modeling tool • LingPipe API
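
To make "terms = linked terms in the document" concrete, here is a minimal Python sketch of how a bag of link terms might be collected from one article. It assumes the article is available as raw wikitext with `[[Target]]` or `[[Target|label]]` links; the function name `linked_terms`, the regular expression, and the rule for skipping namespaced links are illustrative assumptions, not the pipeline used in the slides (which names LingPipe as the modeling tool).

```python
import re
from collections import Counter

# Matches [[Target]], [[Target#Section]] and [[Target|label]];
# only the link target is captured, the section and label are ignored.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def linked_terms(wikitext):
    """Return the link targets of one article as a bag of terms (term -> count)."""
    targets = (m.group(1).strip() for m in WIKILINK.finditer(wikitext))
    # Assumption: namespaced links such as File: or Category: are not content terms.
    return Counter(t for t in targets if t and ":" not in t)


sample = "[[Cancer]] is a group of [[disease]]s; see [[Cancer|cancers]] and [[File:X.png]]."
bag = linked_terms(sample)
```

The resulting counts per article can then serve as the document-term input to an LDA implementation.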

  6. Advantage of linked terms • No extra preprocessing needed • Boundary detection • Stopword removal • Word stemming • Include more semantics • Correlation between term and document • Ex) cancer as a term, cancer as a document

  7. Preliminary problem • How well do the link terms in a document represent the specific characteristics of that document • Link evaluation • Calculate similarity between documents

  8. Link evaluation • Similarity-based evaluation • Calculate similarity between documents • Sim_d{doc1,doc2} • Calculate similarity between terms • Sim_t{term1,term2} • Compare the two similarities

  9. Similarity between documents • Sim_d • Similarity between documents • Significantly affected by the input term set • Data set • 1536 documents • Disease domain: 208 • Settlement domain: 1328 • Kullback-Leibler divergence • p, q = topic distributions of the two documents
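
The Kullback-Leibler divergence on slide 9 has the standard definition D_KL(p || q) = Σ_i p_i log(p_i / q_i). The Python sketch below computes it for two topic distributions. The `eps` floor for zero entries in `q` and the symmetric variant are assumptions added for illustration; the slides do not say whether the divergence was smoothed or symmetrized.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    p and q are topic distributions of equal length. Entries of q are
    floored at eps (an assumption) so that a zero in q does not divide by zero.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of topics")
    # Terms with p_i == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of both directions, since D_KL itself is not symmetric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Note that this is a divergence, so a smaller value means the two documents have more similar topic distributions; identical distributions give 0.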

  10. Example – reasonable

  11. Example – not good

  12. Error analysis • Length problem: overestimated topic proportion • If a document contains only a few link terms, the topic proportions of that document tend to be overestimated • Ex) 1950, 1960, Papua New Guinea, cannibalism
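
One way to see the length problem on slide 12 is through the usual smoothed estimate of a document's topic proportions under a symmetric Dirichlet prior, θ_k = (n_k + α) / (N + Kα), where n_k is the number of the document's terms assigned to topic k. The sketch below is only an illustration under that assumption; the value of `alpha` and the example counts are made up, and the slides do not state which estimator was used.

```python
def topic_proportions(topic_counts, alpha=0.1):
    """Smoothed topic proportions: (n_k + alpha) / (N + K * alpha).

    topic_counts[k] is the number of terms in the document assigned to
    topic k; alpha is a symmetric Dirichlet hyperparameter (assumed value).
    """
    total = sum(topic_counts)
    k = len(topic_counts)
    return [(n + alpha) / (total + k * alpha) for n in topic_counts]


# Hypothetical documents: one with two link terms, one with fifty.
short_doc = topic_proportions([2, 0, 0, 0])
long_doc = topic_proportions([20, 12, 10, 8])
```

With only two link terms, both falling in the same topic, that topic takes almost all of the probability mass, whereas the longer document spreads its mass across several topics.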

  13. Error analysis • Some documents' link terms do not describe the document itself • Ex) dates, countries, etc.

  14. Demo website • For disease domain : • http://semanticweb.kaist.ac.kr/research/tmodel/ • For settlement domain : • http://semanticweb.kaist.ac.kr/research/tmodel/sindex.php • For disease + settlement domain : • http://semanticweb.kaist.ac.kr/research/tmodel/dsindex.php

  15. Modify word bag • Include non-link terms • Exclude noise terms • Weighted score for duplicate terms • Include incoming links
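
The four modifications on slide 15 could be combined roughly as in the Python sketch below. Everything here is hypothetical: the function name `build_word_bag`, its parameters, the weight of 2.0 for link terms, and the reading of "weighted score for duplicate terms" as accumulating weight over repeated occurrences are all assumptions, since the slides give no details.

```python
from collections import Counter


def build_word_bag(link_terms, non_link_terms=(), incoming_links=(),
                   noise_terms=frozenset(), link_weight=2.0):
    """Build a weighted bag of terms for one article (illustrative only)."""
    bag = Counter()
    for term in link_terms:        # outgoing links; repeats accumulate weight
        bag[term] += link_weight
    for term in non_link_terms:    # plain-text terms count once per occurrence
        bag[term] += 1.0
    for term in incoming_links:    # titles of articles that link to this one
        bag[term] += link_weight
    for term in noise_terms:       # e.g. dates and country names
        bag.pop(term, None)
    return bag


bag = build_word_bag(
    ["cancer", "1950", "cancer"],
    non_link_terms=["tumor"],
    incoming_links=["oncology"],
    noise_terms={"1950"},
)
```

Keeping the weights as parameters makes it easy to compare variants, for example link-only bags against bags that also include non-link terms.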

  16. Conclusion • Topic modeling with link distribution in Wikipedia • Need to measure how well the link distribution represents each article's characteristics • After that, analyze the topic distribution in a variety of ways • Expect that the topic distribution can be applied in many applications

  17. Thank you
