
Progress Report on Multilingual Synchronization






Presentation Transcript


  1. Progress Report on Multilingual Synchronization: Mining Web Link Structure for Building a Multilingual Synthesized Association Network

  2. Main Goal of Work • Multilingual Synchronization focusing on Wikipedia • Motivation • Monolingual sources are often incomplete, with values missing for some entities • Two linked articles in two different languages can contain different amounts of information • Wikipedia spanned over 270 languages in 2011

  3. Main Goal of Work • Multilingual Synchronization focusing on Wikipedia • Proposed solution • Multilingual or cross-lingual synthesized data can offer more precise and detailed information, reflecting the different intentions and backgrounds of different community members (diagram: Korean, Spanish, Chinese, English, French, and other editions feed into the synthesis)

  4. Sub Goal of Work • Building a co-occurrence network (COCNET) using multilingual hyperlinks • Process • Build several monolingual COCNETs • Unify the COCNETs using mapping resources to generate M-COCNET • A bilingual dictionary provides translations relating nodes between graphs • Compute the synthesized relatedness to discover related terms from multiple graphs • Contributions • Supplementing new/hidden words (links) from different resources (diagram: per-language relatedness calculation, translation, and union of graphs A, B, C)
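The unification step above might be sketched as follows, assuming each COCNET is a dict from sorted term pairs to relatedness scores and the bilingual dictionary is a plain mapping; all names and weights here are illustrative, not the actual system:

```python
def unify(en_net, ko_net, ko_to_en):
    """Merge a Korean COCNET into an English one, translating node
    labels with a bilingual dictionary before combining edges."""
    merged = dict(en_net)
    for (u, v), w in ko_net.items():
        pair = tuple(sorted((ko_to_en.get(u, u), ko_to_en.get(v, v))))
        # Keep the stronger relatedness when both graphs have the edge;
        # otherwise the Korean edge supplements a link missing in English.
        merged[pair] = max(merged.get(pair, 0.0), w)
    return merged

en = {("Baldness", "Hair"): 0.4}
ko = {("탈모증", "털"): 0.7, ("탈모증", "Hamilton-Norwood_scale"): 0.5}
m_cocnet = unify(en, ko, {"탈모증": "Baldness", "털": "Hair"})
```

The `max` rule for duplicate edges is one simple choice; the actual synthesized relatedness measure is introduced later in the deck.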

  5. Motivation of Sub Work: Why We Need This Work? • Previous work • Infobox Synchronization: 명탐정코난 (Detective Conan)

  6. Motivation of Sub Work: Example of cross-lingual supplement (English graph: Baldness, Hair)

  7. Motivation of Sub Work: Example of cross-lingual supplement (English: Baldness, Hair; Korean: 탈모증 [baldness], 털 [hair])

  8. Motivation of Sub Work: Example of cross-lingual supplement (the Korean edge increases the strength of the corresponding English edge)

  9. Motivation of Sub Work: Example of cross-lingual supplement (the Korean graph contributes the node Hamilton-Norwood_scale)

  10. Motivation of Sub Work: Example of cross-lingual supplement (link completion or keyword recommendation: Hamilton-Norwood_scale is supplied to the English graph)

  11. Why We Need This Work? • Infobox Synchronization: 명탐정코난 (Detective Conan) • Need to evaluate newly added values • Compute relatedness between Title and Value • e.g., (“명탐정 코난” [Detective Conan], “희극” [comedy]), (“명탐정 코난”, “소년만화” [shōnen manga])

  12. Background: What is Co-occurrence Network?

  13. Graph & Network • Graph • A graph G is a pair (V,E), where V is a finite set of vertices or nodes, and E is a set of edges, each an unordered pair {u,v} of distinct nodes • Digraph (directed graph) • A digraph is a pair (V,A), where V is a finite set of nodes, and A is a set of arcs, i.e., ordered pairs (u,v) ∈ V × V with u ≠ v • Network • A node-weighted or edge-weighted graph, or both
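These definitions translate directly into data structures; a minimal illustrative sketch (not part of the slides):

```python
# A graph: vertices plus unordered edges (frozensets of two distinct nodes).
graph = {"V": {"a", "b", "c"},
         "E": {frozenset({"a", "b"}), frozenset({"b", "c"})}}

# A digraph: vertices plus arcs, i.e. ordered pairs (u, v) with u != v.
digraph = {"V": {"a", "b", "c"}, "A": {("a", "b"), ("b", "c")}}

# An edge-weighted network: the same edges, each carrying a strength.
network = {frozenset({"a", "b"}): 0.8, frozenset({"b", "c"}): 0.3}
```

Using `frozenset` for undirected edges makes {u,v} and {v,u} compare equal, matching the "unordered pair" in the definition.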

  14. Construction of a Co-occurrence Network • Co-occurrence • Two terms are said to co-occur when they frequently appear close to each other within texts • The collective interconnection of terms based on their paired presence • Network • Can be generated by connecting pairs of terms using a set of criteria defining co-occurrence • Co-occurrence Network • Nodes are the terms • Co-occurrence relations are the edges, each with its own strength
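A COCNET of this kind can be built in a few lines. In this toy sketch each "document" is its list of terms and the edge strength is simply the pair count; the real system uses the relatedness measures introduced later:

```python
from collections import Counter
from itertools import combinations

def build_cocnet(documents):
    """Nodes are terms; an edge connects two terms that appear in the
    same document, weighted by how many documents they share."""
    edges = Counter()
    for terms in documents:
        # Sorting gives each unordered pair one canonical key.
        for u, v in combinations(sorted(set(terms)), 2):
            edges[(u, v)] += 1
    return dict(edges)

docs = [["iPhone", "Apple", "iPad"], ["iPhone", "iPad"], ["Apple", "iPod"]]
cocnet = build_cocnet(docs)  # cocnet[("iPad", "iPhone")] == 2
```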

  15. “Term”: Our Definition • Definitions • Term: a hyperlinked (internal link) word of a document • Document • Terms = {tablet computers, Apple Inc., smartphones, laptop, operating system, iPod Touch, iPhone, modification, online store}

  16. “M-COCNET”: Our Definition • M-COCNET • Multilingual synthesized Co-OCcurrence NETwork • Nodes: titles or terms • Edges: link presences between nodes • Edge weight: the strength of each edge, computed by the proposed measure

  17. Proposed System to M-COCNET

  18. Workflow of Proposed System • Building monolingual COCNETs (diagram: Wikipedia articles → link extraction → relatedness computation → COCNET over iPhone, Apple, iPad)

  19. Workflow of Proposed System • Building monolingual COCNETs • Building M-COCNET (diagram: the monolingual COCNETs are synthesized using a bilingual dictionary)

  20. Workflow of Proposed System • Building monolingual COCNETs • 40k pages are selected (5-lingual clique)

  21. Workflow of Proposed System • Building monolingual COCNETs (diagram as in slide 18)

  22. Workflow of Proposed System • Building monolingual COCNETs (diagram as in slide 18)

  23. (1) TF-IDF-based Relatedness Measure • TF-IDF in Wikipedia • Associations between words are extracted by identifying the important hyperlinks in a page by means of TF-IDF • A page corresponds to a concept (word) • Hyperlinks clearly represent semantic associations to other concepts • tf(l,d) is the number of appearances of hyperlink l in article d • df(l) is the number of articles containing the hyperlink l, and N is the number of articles
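With tf(l,d), df(l), and N defined as above, the weight presumably follows the standard TF-IDF form; the slide does not give the exact variant, so the log-scaled version below is an assumption:

```python
import math

def tfidf(tf_ld, df_l, n_articles):
    """tf(l,d) * log(N / df(l)): hyperlinks frequent in this article
    but rare across Wikipedia receive high weight."""
    return tf_ld * math.log(n_articles / df_l)

# A hyperlink appearing 3 times, present in 10 of 40,000 articles:
w = tfidf(3, 10, 40000)
```

Note that a link present in every article gets weight 0, which is what the df(l) factor is for: ubiquitous links carry no discriminative association.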

  24. Analysis of the TF-IDF-based Method • Shortcoming of df(l) • df(l) is the number of articles containing the hyperlink l, and N is the number of articles • Note: our dataset is only a small subset of the whole Wikipedia • Some <title, link> pairs receive high scores even though df is high

  25. (2) Neighbor-based Relatedness Measure • A supplementary method is added that computes relatedness using the graph structure • Jaccard’s coefficient • A commonly used similarity metric in IR • Measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has • Here we take the “features” to be neighbors (in-links + out-links) • Previous work used in-links and out-links separately

  26. (2) Neighbor-based Relatedness Measure • A supplementary method is added that computes relatedness using the graph structure • Jaccard’s coefficient • A commonly used similarity metric in IR • Measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has • Here we take the “features” to be neighbors (in-links + out-links) • Previous work used in-links and out-links separately (diagram: iPhone and iPad share the neighbors Apple, Mac, iPod, and Mobile, so JC(iPhone, iPad) = 1)
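Jaccard's coefficient over combined neighbor sets is straightforward; this sketch reproduces the slide's JC(iPhone, iPad) = 1 example:

```python
def jaccard(neighbors_x, neighbors_y):
    """|N(x) & N(y)| / |N(x) | N(y)| over neighbors (in-links + out-links)."""
    union = neighbors_x | neighbors_y
    return len(neighbors_x & neighbors_y) / len(union) if union else 0.0

# Both pages share the same four neighbors in the toy graph:
n_iphone = {"Apple", "Mac", "iPod", "Mobile"}
n_ipad = {"Apple", "Mac", "iPod", "Mobile"}
jc = jaccard(n_iphone, n_ipad)  # 1.0
```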

  27. Result Analysis of M-COCNET

  28. Multilingual Synthesis Impact? Co-occurrence Pairs • Unique pairs in common pages (40k pages) • Pairs in common pages

  29. Multilingual Synthesis Impact? Overlap Check in Multilingual & Cross-lingual • Overlap check in multilingual • Overlap check in cross-lingual

  30. Result Sample of M-COCNET • Associative terms with iPhone in COCNET

  31. Result Sample of M-COCNET • Associative terms with iPhone in COCNET • Language-unique terms are in blue!

  32. Result Sample of M-COCNET • Associative terms with iPhone in COCNET • Terms overlapping across languages in the result are in blue! • M-COCNET http://swrc.kaist.ac.kr/msync/index.php/Researchpage

  33. Evaluation • Using Measures of Semantic Relatedness (MSR) • PMI [1] • Normalized Google Distance [2] • How-to: evaluation data • We try to find out which target words are associated with cue words • e.g., when the cue is ‘iPad’ • Targets: "Apple Inc.", "AppStore", "IPhone", "Apple A4", "Safari (web browser)", "Tablet personal computer" were chosen from our system's top-N results • Distracters: "People's Republic of China", "Category:Australia", "Berlin", "Japan", "Le Monde" were chosen from our system's bottom-N results • We took a random sample of 50 cue-targets-distracters test cases to evaluate (1 case per template)

  34. Evaluation • How-to: method • Computing M(cue, word) • where word ∈ {targets, distracters}, N = 5

  35. Evaluation • How-to: method • Computing M(cue, word), where word ∈ {targets, distracters}, N = 5 • (diagram: the M values are ordered; a test case scores 1 if the targets rank above the distracters, and 0 otherwise)
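The scoring on this slide might be implemented as below. It assumes, from the diagram, that a test case scores 1 only when the N targets occupy the top N ranks under the measure M, and 0 otherwise:

```python
def score_test_case(msr, cue, targets, distracters):
    """Rank all candidates by M(cue, word); score 1 if the top-N slots
    are exactly the targets, else 0."""
    words = list(targets) + list(distracters)
    ranked = sorted(words, key=lambda w: msr(cue, w), reverse=True)
    return 1 if set(ranked[:len(targets)]) == set(targets) else 0

# Toy measure: a lookup table standing in for PMI or NGD.
scores = {"Apple Inc.": 0.9, "IPhone": 0.8, "Berlin": 0.1, "Japan": 0.2}
result = score_test_case(lambda cue, w: scores[w], "iPad",
                         ["Apple Inc.", "IPhone"], ["Berlin", "Japan"])  # 1
```

The average of these 0/1 scores over the 50 test cases gives figures like the 0.756 (PMI) and 0.736 (NGD) reported on the following slides.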

  36. PMI: Avg. Score: 0.756

  37. NGD: Avg. Score: 0.736

  38. Conclusion • Progress • A 4-language Wikipedia synthesized term association network over hyperlinks has been completed • Next schedule • Joining the English resource ASAP (half completed) • Computing neighborhood similarity over 5M pairs is expensive • A network at a single point in time leads to unrealistic analysis approaches, so I am adapting temporal analysis into the network model (surveying related work, preprocessing data) • Experimenting with other resources (acquiring resources) • Reuters multilingual corpus [Appendix 1] • To analyze the multilingual synthesis impact • Biomedical literature (MEDLINE) • To analyze the network model impact • Common: to compare links with co-occurring terms

  39. Appendix

  40. [A1] Reuters Corpus • Reuters Corpus, Volume 2, Multilingual Corpus • Release date: 2005-05-31 • Contains over 487,000 Reuters news stories in 13 languages • Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish • Period: 1996-08-20 to 1997-08-19 • The stories are NOT PARALLEL; they are written by local reporters in each language

  41. References • [1] Turney, P. D. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL ’01: Proceedings of the 12th European Conference on Machine Learning. Springer-Verlag, London, UK, 491–502. • [2] Cilibrasi, R. and Vitányi, P. M. B. 2006. Similarity of objects and the meaning of words. CoRR abs/cs/0602065. Informal publication.
