
Progress Report on Multilingual Synchronization






Presentation Transcript


  1. Progress Report on Multilingual Synchronization: Mining Web Link Structure for Building a Multilingual Synthesized Association Network

  2. Main Goal of Work • Multilingual Synchronization focusing on Wikipedia • Motivation • Monolingual sources are often incomplete, with values missing for some entities • Two linked articles in two different languages can contain different amounts of information • Wikipedia spanned over 270 languages in 2011

  3. Main Goal of Work • Multilingual Synchronization focusing on Wikipedia • Proposed solution • Multilingual or cross-lingual synthesized data can offer more precise and detailed information, reflecting the different intentions and backgrounds of different community members (diagram: Korean, Spanish, Chinese, English, French, and other editions feed into the synthesis)

  4. Sub Goal of Work • Building a co-occurrence network (COCNET) using multilingual hyperlinks • Process • Build several monolingual COCNETs • Unify the COCNETs using mapping resources to generate M-COCNET • A bilingual dictionary provides translations relating nodes between graphs • Compute the synthesized relatedness to discover related terms from multiple graphs • Contributions • Supplementing new/hidden words (links) from different resources (diagram: per-language relatedness calculation, translation, and union of graphs A, B, C)
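The unification step above might be sketched as follows, assuming each COCNET is a dict from sorted term pairs to relatedness scores and the bilingual dictionary is a plain mapping; all names and weights here are illustrative, not the actual system:

```python
def unify(en_net, ko_net, ko_to_en):
    """Merge a Korean COCNET into an English one, translating node
    labels with a bilingual dictionary before combining edges."""
    merged = dict(en_net)
    for (u, v), w in ko_net.items():
        pair = tuple(sorted((ko_to_en.get(u, u), ko_to_en.get(v, v))))
        # Keep the stronger relatedness when both graphs have the edge;
        # otherwise the Korean edge supplements a link missing in English.
        merged[pair] = max(merged.get(pair, 0.0), w)
    return merged

en = {("Baldness", "Hair"): 0.4}
ko = {("탈모증", "털"): 0.7, ("탈모증", "Hamilton-Norwood_scale"): 0.5}
m_cocnet = unify(en, ko, {"탈모증": "Baldness", "털": "Hair"})
```

The `max` rule for duplicate edges is one simple choice; the actual synthesized relatedness measure is introduced later in the deck.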

  5. Motivation of Sub Work: Why We Need This Work? • Previous work • Infobox Synchronization: 명탐정코난 (Detective Conan)

  6. Motivation of Sub Work: Example of cross-lingual supplement (English graph: Baldness, Hair)

  7. Motivation of Sub Work: Example of cross-lingual supplement (English: Baldness, Hair; Korean: 탈모증 [baldness], 털 [hair])

  8. Motivation of Sub Work: Example of cross-lingual supplement (the Korean edge increases the strength of the corresponding English edge)

  9. Motivation of Sub Work: Example of cross-lingual supplement (the Korean graph contributes the node Hamilton-Norwood_scale)

  10. Motivation of Sub Work: Example of cross-lingual supplement (link completion or keyword recommendation: Hamilton-Norwood_scale is supplied to the English graph)

  11. Why We Need This Work? • Infobox Synchronization: 명탐정코난 (Detective Conan) • Need to evaluate newly added values • Compute relatedness between Title and Value • e.g., (“명탐정 코난” [Detective Conan], “희극” [comedy]), (“명탐정 코난”, “소년만화” [shōnen manga])

  12. Background: What is Co-occurrence Network?

  13. Graph & Network • Graph • A graph G is a pair (V,E), where V is a finite set of vertices or nodes, and E is a set of edges, each an unordered pair {u,v} of distinct nodes • Digraph (directed graph) • A digraph is a pair (V,A), where V is a finite set of nodes, and A is a set of arcs, i.e., ordered pairs (u,v) ∈ V × V with u ≠ v • Network • A node-weighted or edge-weighted graph, or both
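These definitions translate directly into data structures; a minimal illustrative sketch (not part of the slides):

```python
# A graph: vertices plus unordered edges (frozensets of two distinct nodes).
graph = {"V": {"a", "b", "c"},
         "E": {frozenset({"a", "b"}), frozenset({"b", "c"})}}

# A digraph: vertices plus arcs, i.e. ordered pairs (u, v) with u != v.
digraph = {"V": {"a", "b", "c"}, "A": {("a", "b"), ("b", "c")}}

# An edge-weighted network: the same edges, each carrying a strength.
network = {frozenset({"a", "b"}): 0.8, frozenset({"b", "c"}): 0.3}
```

Using `frozenset` for undirected edges makes {u,v} and {v,u} compare equal, matching the "unordered pair" in the definition.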

  14. Construction of a Co-occurrence Network • Co-occurrence • Two terms are said to co-occur when they frequently appear close to each other within texts • The collective interconnection of terms based on their paired presence • Network • Can be generated by connecting pairs of terms using a set of criteria defining co-occurrence • Co-occurrence Network • Nodes are the terms • Co-occurrence relations are the edges, each with its own strength
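A COCNET of this kind can be built in a few lines. In this toy sketch each "document" is its list of terms and the edge strength is simply the pair count; the real system uses the relatedness measures introduced later:

```python
from collections import Counter
from itertools import combinations

def build_cocnet(documents):
    """Nodes are terms; an edge connects two terms that appear in the
    same document, weighted by how many documents they share."""
    edges = Counter()
    for terms in documents:
        # Sorting gives each unordered pair one canonical key.
        for u, v in combinations(sorted(set(terms)), 2):
            edges[(u, v)] += 1
    return dict(edges)

docs = [["iPhone", "Apple", "iPad"], ["iPhone", "iPad"], ["Apple", "iPod"]]
cocnet = build_cocnet(docs)  # cocnet[("iPad", "iPhone")] == 2
```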

  15. “Term”: Our Definition • Definitions • Term: a hyperlinked (internal link) word of a document • Document • Terms = {tablet computers, Apple Inc., smartphones, laptop, operating system, iPod Touch, iPhone, modification, online store}

  16. “M-COCNET”: Our Definition • M-COCNET • Multilingual synthesized Co-OCcurrence NETwork • Nodes: titles or terms • Edges: link presences between nodes • Edge weight: the strength of each edge, computed by the proposed measure

  17. Proposed System to M-COCNET

  18. Workflow of Proposed System • Building monolingual COCNETs (diagram: Wikipedia articles → link extraction → relatedness computation → COCNET over iPhone, Apple, iPad)

  19. Workflow of Proposed System • Building monolingual COCNETs • Building M-COCNET (diagram: the monolingual COCNETs are synthesized using a bilingual dictionary)

  20. Workflow of Proposed System • Building monolingual COCNETs • 40k pages are selected (5-lingual clique)

  21. Workflow of Proposed System • Building monolingual COCNETs (diagram as in slide 18)

  22. Workflow of Proposed System • Building monolingual COCNETs (diagram as in slide 18)

  23. (1) TF-IDF-based Relatedness Measure • TF-IDF in Wikipedia • Associations between words are extracted by identifying the important hyperlinks in a page by means of TF-IDF • A page corresponds to a concept (word) • Hyperlinks clearly represent semantic associations to other concepts • tf(l,d) is the number of appearances of hyperlink l in article d • df(l) is the number of articles containing the hyperlink l, and N is the number of articles
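With tf(l,d), df(l), and N defined as above, the weight presumably follows the standard TF-IDF form; the slide does not give the exact variant, so the log-scaled version below is an assumption:

```python
import math

def tfidf(tf_ld, df_l, n_articles):
    """tf(l,d) * log(N / df(l)): hyperlinks frequent in this article
    but rare across Wikipedia receive high weight."""
    return tf_ld * math.log(n_articles / df_l)

# A hyperlink appearing 3 times, present in 10 of 40,000 articles:
w = tfidf(3, 10, 40000)
```

Note that a link present in every article gets weight 0, which is what the df(l) factor is for: ubiquitous links carry no discriminative association.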

  24. Analysis of the TF-IDF-based Method • Shortcoming of df(l) • df(l) is the number of articles containing the hyperlink l, and N is the number of articles • Note: our dataset is only a small subset of the whole Wikipedia • Some <title, link> pairs receive high scores even though df is high

  25. (2) Neighbor-based Relatedness Measure • A supplementary method is added that computes relatedness using the graph structure • Jaccard’s coefficient • A commonly used similarity metric in IR • Measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has • Here we take the “features” to be neighbors (in-links + out-links) • Previous work used in-links and out-links separately

  26. (2) Neighbor-based Relatedness Measure • A supplementary method is added that computes relatedness using the graph structure • Jaccard’s coefficient • A commonly used similarity metric in IR • Measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has • Here we take the “features” to be neighbors (in-links + out-links) • Previous work used in-links and out-links separately (diagram: iPhone and iPad share the neighbors Apple, Mac, iPod, and Mobile, so JC(iPhone, iPad) = 1)
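Jaccard's coefficient over combined neighbor sets is straightforward; this sketch reproduces the slide's JC(iPhone, iPad) = 1 example:

```python
def jaccard(neighbors_x, neighbors_y):
    """|N(x) & N(y)| / |N(x) | N(y)| over neighbors (in-links + out-links)."""
    union = neighbors_x | neighbors_y
    return len(neighbors_x & neighbors_y) / len(union) if union else 0.0

# Both pages share the same four neighbors in the toy graph:
n_iphone = {"Apple", "Mac", "iPod", "Mobile"}
n_ipad = {"Apple", "Mac", "iPod", "Mobile"}
jc = jaccard(n_iphone, n_ipad)  # 1.0
```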

  27. Result Analysis of M-COCNET

  28. Multilingual Synthesis Impact? Co-occurrence Pairs • Unique pairs in common pages (40k pages) • Pairs in common pages

  29. Multilingual Synthesis Impact? Overlap Check in Multilingual & Cross-lingual • Overlap check in multilingual • Overlap check in cross-lingual

  30. Result Sample of M-COCNET • Associative terms with iPhone in COCNET

  31. Result Sample of M-COCNET • Associative terms with iPhone in COCNET • Language-unique terms are in blue!

  32. Result Sample of M-COCNET • Associative terms with iPhone in COCNET • Terms overlapping across languages in the result are in blue! • M-COCNET http://swrc.kaist.ac.kr/msync/index.php/Researchpage

  33. Evaluation • Using Measures of Semantic Relatedness (MSR) • PMI [1] • Normalized Google Distance [2] • How-to: evaluation data • We try to find out which target words are associated with cue words • e.g., when the cue is ‘iPad’ • Targets: "Apple Inc.", "AppStore", "IPhone", "Apple A4", "Safari (web browser)", "Tablet personal computer" were chosen from our system's top-N results • Distracters: "People's Republic of China", "Category:Australia", "Berlin", "Japan", "Le Monde" were chosen from our system's bottom-N results • We took a random sample of 50 cue-targets-distracters test cases to evaluate (1 case per template)

  34. Evaluation • How-to: method • Computing M(cue, word) • where word ∈ {targets, distracters}, N = 5

  35. Evaluation • How-to: method • Computing M(cue, word), where word ∈ {targets, distracters}, N = 5 • (diagram: the M values are ordered; a test case scores 1 if the targets rank above the distracters, and 0 otherwise)
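The scoring on this slide might be implemented as below. It assumes, from the diagram, that a test case scores 1 only when the N targets occupy the top N ranks under the measure M, and 0 otherwise:

```python
def score_test_case(msr, cue, targets, distracters):
    """Rank all candidates by M(cue, word); score 1 if the top-N slots
    are exactly the targets, else 0."""
    words = list(targets) + list(distracters)
    ranked = sorted(words, key=lambda w: msr(cue, w), reverse=True)
    return 1 if set(ranked[:len(targets)]) == set(targets) else 0

# Toy measure: a lookup table standing in for PMI or NGD.
scores = {"Apple Inc.": 0.9, "IPhone": 0.8, "Berlin": 0.1, "Japan": 0.2}
result = score_test_case(lambda cue, w: scores[w], "iPad",
                         ["Apple Inc.", "IPhone"], ["Berlin", "Japan"])  # 1
```

The average of these 0/1 scores over the 50 test cases gives figures like the 0.756 (PMI) and 0.736 (NGD) reported on the following slides.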

  36. PMI: Avg. Score: 0.756

  37. NGD: Avg. Score: 0.736

  38. Conclusion • Progress • A 4-language Wikipedia synthesized term association network over hyperlinks has been completed • Next schedule • Joining the English resource ASAP (half completed) • Computing neighborhood similarity over 5M pairs is expensive • A network at a single point in time leads to unrealistic analysis approaches, so I am adapting temporal analysis into the network model (surveying related work, preprocessing data) • Experimenting with other resources (acquiring resources) • Reuters multilingual corpus [Appendix 1] • To analyze the multilingual synthesis impact • Biomedical literature (MEDLINE) • To analyze the network model impact • Common: to compare links with co-occurring terms

  39. Appendix

  40. [A1] Reuters Corpus • Reuters Corpus, Volume 2, Multilingual Corpus • Release date: 2005-05-31 • Contains over 487,000 Reuters news stories in 13 languages • Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish • Period: 1996-08-20 to 1997-08-19 • The stories are NOT PARALLEL; they are written by local reporters in each language

  41. References • [1] Turney, P. D. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL ’01: Proceedings of the 12th European Conference on Machine Learning. Springer-Verlag, London, UK, 491–502. • [2] Cilibrasi, R. and Vitányi, P. M. B. 2006. Similarity of objects and the meaning of words. CoRR abs/cs/0602065. Informal publication.
