Implications of Web 2.0 on Information Research

Wen-Lian Hsu (許聞廉) Institute of Information Science, Academia Sinica, Taiwan hsu@iis.sinica.edu.tw

Outline • What is Web 2.0? • Web 2.0 and Research • Human-based Computation • Folksonomy (Social Tagging) • Academic Data Analysis • GIO-Info • Conclusion

What is Web 2.0? • Web 2.0 Conference (October 2004) • Tim O'Reilly • The Web As a Platform • Harnessing Collective Intelligence • Data is the Next Intel Inside • End of the Software Release Cycle • Lightweight Programming Models • Software Above the Level of a Single Device • Rich User Experiences

What is Web 2.0? • Web 2.0 is the combination of “tools and technologies”, “business strategies” and social/cultural trends, which drive the individual creation and sharing of content on the Internet. ED YOURDON • Web 2.0 opens up the Long Tail, making it increasingly cost-effective to service the interests of large numbers of relatively small groups of individuals, and to enable them to benefit from key pieces of the platform while fulfilling their own needs. PAUL MILLER

What is Web 2.0? • "Web 2.0" seems to be like Pink Floyd lyrics: It can mean different things to different people, depending on your state of mind. KEVIN MANEY • “Web 2.0 definitely is a buzzword, and it’s overused. But the movement is only starting. That movement is about leveraging the power of people” CHAD HURLEY • “What we’re seeing is a return to the roots of the web.” CATARINA FAKE • Web 2.0 “is enabling a fundamental shift in power that really is giving power to the consumer” MARK PARKER • “It’s a way to collaborate with your customers, to allow them to co-create with you”.

Web 2.0 Sites Source: http://www.go2web20.net/

Key Web 2.0 services/applications Blogs Wikis Tagging and social bookmarking Multimedia sharing RSS and syndication Podcasting P2P

Sharing and Collecting Resources • Content: Blogger, Wikipedia, Flickr, YouTube • Opinion: Digg, Hemidemi, 推推王 • Bandwidth: eMule, BT, Skype, Joost, PPStream • Computing: SETI, Grid • Innovation: Second Life • Money: Din.Ben.Don (group-buying site for ordering boxed lunches), carpooling sites

Social Bookmarking Source: http://funp.com/push/

Source: http://digg.com/ Source: http://www.hemidemi.com/

Blog [Screenshot labels: social bookmark, AdSense, content, comments] Source: http://carol.bluecircus.net/

Skype Source: S.A Baset, H. Schulzrinne (September 14, 2004). An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol. Technical Report. Columbia University.

Wikipedia

Second Life

Symbiosis is the Key • Blog • Social bookmark

The Web Changes in Several Dimensions • Dynamics • Heterogeneity • Collaboration • Composition • Socialization

Current Research Activities • Information Retrieval on Blogs • NTCIR-7 CLIRB (Cross-Lingual Information Retrieval for Blog) • Question Answering on Blogs • TREC 2007 QA Track • Question Answering on Wikipedia • QA@CLEF 2007 • CLEF 2006 WiQA • given a Wikipedia page, locate information snippets in Wikipedia • PASCAL Ontology Learning Challenge • Ontology construction • Ontology extension • Ontology population • Concept naming • LinkKDD2006, Textlink2007, MRDM2007

Web 2.0 and Research • Human-based Computation • Folksonomy (Social Tagging) • Academic Data Analysis • GIO-Info

Human-based Computation

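Human-based Computation • Social Search • wayfinding tools informed by human judgment • CAPTCHA • a reverse Turing test (in a Turing test, a human interrogates the system; here, the system interrogates the user) • Interactive Genetic Algorithm (IGA) • a genetic algorithm informed by human judgment • humans supply the result of the fitness function • Example: sketching a suspect's portrait. The system generates candidate portraits with a GA, the eyewitness scores which ones look most like the suspect, and the process repeats until the portrait comes close to the suspect's appearance.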
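The interactive genetic algorithm loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: a portrait is reduced to a short bit string of features, and the eyewitness is replaced by a stand-in function (`simulated_witness`) so the loop can run unattended. All names here are hypothetical and not taken from any particular IGA system.

```python
import random

def interactive_ga(rate, genome_len=8, pop_size=6, generations=30, seed=0):
    """Evolve bit-string candidates; `rate` stands in for the human judge."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # In a real IGA, a person would score each candidate at this point.
        ranked = sorted(population, key=rate, reverse=True)
        parents = ranked[: pop_size // 2]      # keep the best-rated half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]          # one-point crossover
            i = rng.randrange(genome_len)
            child[i] = 1 - child[i]            # flip one random feature
            children.append(child)
        population = parents + children
    return max(population, key=rate)

# Assumption: the witness "remembers" TARGET and scores by matching features.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]

def simulated_witness(candidate):
    return sum(c == t for c, t in zip(candidate, TARGET))

best = interactive_ga(simulated_witness)
```

Because the best-rated half survives each round, the top score never drops from one generation to the next; in practice the number of rounds is limited by how many ratings a person is willing to give.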

CAPTCHA Completely Automated Public Turing test to tell Computers and Humans Apart • A CAPTCHA is a type of challenge-response test used in computing to determine whether the user is human. Wikipedia Source: http://recaptcha.net/

[Diagram labels: blogs, CAPTCHAs, recognized text, unrecognized text]

The ESP Game • A two-player game • The goal is to guess what your partner is typing for each image. • Once you both type the same word(s), you score points. Source: http://www.espgame.org/
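The matching rule of the game can be sketched as below. This is a simplified illustration, not the actual ESP Game implementation: it assumes the two players take turns typing, player A first, and that matching ignores case and surrounding whitespace. The function name `esp_round` is hypothetical.

```python
def esp_round(guesses_a, guesses_b):
    """Return the first word that both players have typed, or None.

    Each list holds one player's guesses in typing order. Assumption of
    this sketch: players alternate turns, with player A going first.
    """
    seen_a, seen_b = set(), set()
    for i in range(max(len(guesses_a), len(guesses_b))):
        if i < len(guesses_a):
            guess = guesses_a[i].strip().lower()
            if guess in seen_b:
                return guess       # A just typed something B already typed
            seen_a.add(guess)
        if i < len(guesses_b):
            guess = guesses_b[i].strip().lower()
            if guess in seen_a:
                return guess       # B just typed something A already typed
            seen_b.add(guess)
    return None

label = esp_round(["dog", "grass", "puppy"], ["animal", "puppy", "dog"])
```

The agreed word becomes a label for the image, which is how the game turns play into image annotation.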

The Phetch Game Play as a describer

The Phetch Game Play as a seeker

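How about a game for describing idioms? • 罄竹難書: "has done too many bad things" (壞事做太多) • 虎頭蛇尾: "lacks the perseverance to finish things" (做事沒有毅力) • ……… Example terms: 高抬貴手, 不動如山, 壞事做太多, 罄竹難書, 如沐春風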

Folksonomy (Social Tagging)

Folksonomy (Social Tagging) • Also known as social tagging, collaborative tagging, social classification, social indexing • Folksonomy is the practice and method of collaboratively creating and managing tags to annotate and categorize content. Wikipedia

del.icio.us • Tags: descriptive words applied by users to links. Tags are searchable. • My Tags: words I’ve used to describe links in a way that makes sense to me

Tag Cloud
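A rough sketch of how searchable tags, "My Tags", and a tag cloud could be derived from the same bookmark data. The bookmarks, URLs, and function names are invented for illustration, and the linear font-size scaling is just one simple choice, not how del.icio.us computes its cloud.

```python
from collections import Counter, defaultdict

# Hypothetical bookmarks: (user, url, tag)
BOOKMARKS = [
    ("alice", "http://example.org/a", "web2.0"),
    ("alice", "http://example.org/a", "ajax"),
    ("bob",   "http://example.org/a", "web2.0"),
    ("bob",   "http://example.org/b", "tagging"),
    ("carol", "http://example.org/b", "web2.0"),
]

def build_index(bookmarks):
    """Map each tag to the set of links it was applied to (tag search)."""
    index = defaultdict(set)
    for _user, url, tag in bookmarks:
        index[tag.lower()].add(url)
    return index

def my_tags(bookmarks, user):
    """Tags a single user has applied."""
    return {tag.lower() for u, _url, tag in bookmarks if u == user}

def tag_cloud(bookmarks, min_size=10, max_size=30):
    """Scale tag counts linearly to font sizes between min_size and max_size."""
    counts = Counter(tag.lower() for _user, _url, tag in bookmarks)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1   # avoid dividing by zero when all counts are equal
    return {tag: min_size + (n - lo) * (max_size - min_size) // span
            for tag, n in counts.items()}

index = build_index(BOOKMARKS)
cloud = tag_cloud(BOOKMARKS)
```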

Semantic Web Source: Tim Berners-Lee

Using Folksonomy to Help Semantic Web • Top-down Semantic Annotation • Approach • Define an ontology first • Use the ontology to add semantic markups to web resources. • The semantics is provided by the ontology which is shared among different web agents and applications. • Problem • Negotiation • Evolution (hard to maintain) • High Barrier (background) Source: Xian Wu, Lei Zhang, Yong Yu. “Exploring Social Annotations for the Semantic Web”

Using Folksonomy to Help Semantic Web • Bottom-up approach with social tagging • Advantage • No common ontology or dictionary is needed • Easy to access • Sensitive to information drift • Disadvantage • Ambiguity Problem: For example, “XP” can refer to either “Extreme Programming” or “Windows XP”. • Group Synonymy Problem: two seemingly different annotations may bear the same meaning. Source: Xian Wu, Lei Zhang, Yong Yu. “Exploring Social Annotations for the Semantic Web”
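One simple way to make the synonymy and ambiguity problems concrete is to compare tags by the resources they are applied to. The sketch below uses plain cosine similarity over tag-to-resource counts; this is a generic co-occurrence heuristic chosen for illustration, not the model proposed by Wu, Zhang, and Yu, and the annotations are made up.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotations: (resource, tag)
ANNOTATIONS = [
    ("u1", "mac"), ("u1", "apple"),
    ("u2", "mac"), ("u2", "apple"),
    ("u3", "apple"), ("u3", "fruit"),
    ("u4", "xp"), ("u4", "windows"),
    ("u5", "xp"), ("u5", "agile"),
]

def tag_vectors(annotations):
    """Represent each tag as a count vector over the resources it annotates."""
    vectors = defaultdict(Counter)
    for resource, tag in annotations:
        vectors[tag][resource] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vecs = tag_vectors(ANNOTATIONS)
mac_apple = cosine(vecs["mac"], vecs["apple"])      # candidates for synonymy
xp_windows = cosine(vecs["xp"], vecs["windows"])    # "xp" is split between
xp_agile = cosine(vecs["xp"], vecs["agile"])        # two unrelated senses
```

In this toy data, "mac" and "apple" score high because they annotate the same resources, while "xp" is equally close to "windows" and "agile", which is the signature of an ambiguous tag.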

Or Folksonomy is the Solution? • Ontology is Overrated • Classification of the web has failed • Classification itself is filled with bias and error • Tagging is the solution Source: http://www.shirky.com/writings/ontology_overrated.html

Academic Data Analysis

Academic Data Analysis • Users participate and interact with data and people • Add My Library, Tag (e.g., CiteULike, BibSonomy) • Add Comments, Rating, Recommendation (e.g., TechLens) • Domain Focus Groups (e.g., Botanicus, arXiv) • e-Lib: Library 2.0 concepts are being added to applications, so search platforms provide open APIs for collecting more data • Google Scholar • Windows Live Academic Search • PubMed • CiteSeer • Citation index: papers, journals/conferences, authors

An Example • Let’s use TechLens as an example to imagine what IR/NLP research can do. [Diagram labels: Authors, Readers, Papers]

The Terminology • Entities: Alfred V Aho • References: Aho, A. V.; Alfred Aho; AV Aho • Links: Alfred Aho, John Hopcroft, Jeffrey Ullman; AV Aho, BW Kernighan, PJ Weinberger • Entity Groups: G1 (Programming Languages), G2 (Databases), G3 (Algorithms)

Imagine how we can make use of them [Diagram labels: Papers, Reference Extraction, Entity Resolution, Authors, Rating, Comments, Readers]

New Research Topics • Given these changes, a key emerging challenge for data mining is dealing with richly structured, heterogeneous datasets and finding the patterns behind them. • Several lines of research focus on these problems, such as • (Social) Network Analysis • Link Mining • PASCAL Ontology Learning Challenge • …

Society • Nodes: individuals (Authors, Readers) • Links: social relationships (family, work, friendship, belonging to, etc.) • S. Milgram (1967) Six Degrees of Separation, Science John Guare • Social networks: many individuals with diverse social interactions between them. Source: www.cs.uiuc.edu/~hanj
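With individuals as nodes and relationships as links, "degrees of separation" is simply the shortest path length between two people. A minimal breadth-first search sketch, assuming an undirected graph stored as an adjacency dictionary; the names and the tiny network are hypothetical.

```python
from collections import deque

def degrees_of_separation(graph, start, goal):
    """Length of the shortest chain of acquaintances, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical network of authors and readers (links are symmetric).
NETWORK = {
    "author1": ["reader1", "author2"],
    "author2": ["author1", "reader2"],
    "reader1": ["author1"],
    "reader2": ["author2", "reader3"],
    "reader3": ["reader2"],
    "loner": [],
}

hops = degrees_of_separation(NETWORK, "reader1", "reader3")
```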

Communication networks • The Earth is developing an electronic nervous system, a network with diverse nodes and links. • Nodes: computers, routers, satellites • Links: phone lines, TV cables, EM waves • Artifacts in TechLens: papers, user IPs, comments, responses, …, linked by relations between artifacts • Communication networks: many non-identical components with diverse connections between them. Source: www.cs.uiuc.edu/~hanj

Link-based Object Ranking • Perhaps the best-known link mining task is link-based object ranking (LBR), a primary focus of the link analysis community. The objective of LBR is to exploit the link structure of a graph to order or prioritize the set of objects within the graph. • Example • PageRank • Which paper is most important in this area? • Which journal/conference is most important in this area? • Which topic is important in this area?
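As a concrete instance of link-based ranking, here is a small power-iteration version of PageRank applied to a citation graph. It is a textbook-style sketch, assuming a damping factor of 0.85 and spreading the rank of papers with no outgoing citations evenly over all papers; the four-paper graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: distribute its rank evenly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical citations: papers A and B cite C, and C cites D.
CITATIONS = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(CITATIONS)
most_important = max(ranks, key=ranks.get)
```

The same routine could rank journals, conferences, or topics once they are expressed as nodes with links between them.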

Link-based Object Classification / Link-based Classification (LBC) • Predicting the category of an object based on its attributes, its links, and the attributes of linked objects • Web: predict the category of a web page based on the words that occur on the page, links between pages, anchor text, HTML tags, etc. • Citation: predict the topic of a paper based on word occurrence, citations, and co-citations • Epidemics: predict disease type based on characteristics of the people; predict a person’s age based on the ages of people they have been in contact with and disease type
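The idea of combining an object's own attributes with information from its links can be illustrated with a very small scoring rule for the citation case. This is a toy sketch, not a published LBC algorithm: keyword lists, the `link_weight` parameter, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical topic vocabularies.
KEYWORDS = {
    "algorithms": {"graph", "sorting", "complexity"},
    "databases": {"query", "index", "transaction"},
}

def classify(words, neighbor_labels, keywords=KEYWORDS, link_weight=1.0):
    """Pick a topic from the paper's words plus the topics of linked papers."""
    scores = Counter()
    for label, vocabulary in keywords.items():
        scores[label] += len(vocabulary & set(words))   # attribute evidence
    for label in neighbor_labels:
        scores[label] += link_weight                    # link evidence
    # sorted() makes ties resolve alphabetically, so the result is stable.
    return max(sorted(scores), key=scores.get)

# The words alone are ambiguous; the cited papers tip the decision.
topic = classify(["graph", "query"], ["databases", "databases", "algorithms"])
```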

Group Detection • Cluster the nodes in the graph into groups that share common characteristics; that is, predict when a set of entities belongs to the same group by clustering on both object attribute values and link structure. • Web: identifying communities • Citation: identifying research communities
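A minimal sketch of grouping that uses both attribute values and link structure: two linked papers are merged into the same group only if their keyword sets are similar enough. The Jaccard threshold, the union-find merging, and the sample papers are assumptions for illustration; real community detection methods are considerably more refined.

```python
def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_groups(attributes, links, min_similarity=0.3):
    """Merge linked nodes whose attributes are similar; return sorted groups."""
    parent = {node: node for node in attributes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for u, v in links:
        if jaccard(attributes[u], attributes[v]) >= min_similarity:
            parent[find(u)] = find(v)

    groups = {}
    for node in attributes:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical papers with keywords, and citation links between them.
PAPERS = {
    "p1": {"sql", "index"},
    "p2": {"sql", "query"},
    "p3": {"graph", "sorting"},
    "p4": {"graph", "complexity"},
}
CITES = [("p1", "p2"), ("p3", "p4"), ("p2", "p3")]

communities = detect_groups(PAPERS, CITES)
```

Here the link between p2 and p3 is ignored because the two papers share no keywords, so the result is two research communities rather than one.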

Entity Resolution • Predicting when two objects are the same, based on their attributes and their links • Web: predict when two sites are mirrors of each other • Citation: predicting when two citations are referring to the same paper • Epidemics: predicting when two disease strains are the same • Biology: learning when two names refer to the same protein
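For the citation case, a rough sketch of resolving author references using both attributes (the name string) and links (co-authors). The name normalization, the 0.5/0.5 weighting, and the abbreviated co-author strings are assumptions for illustration only; they are not taken from TechLens or any specific system.

```python
def name_key(name):
    """Reduce a name string to (surname, first initial), lowercased."""
    name = name.replace(".", " ")
    if "," in name:
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return surname.lower(), given[:1].lower()

def match_score(ref_a, ref_b):
    """Score in [0, 1]: 0 if names disagree, higher with shared co-authors."""
    name_a, coauthors_a = ref_a
    name_b, coauthors_b = ref_b
    if name_key(name_a) != name_key(name_b):
        return 0.0
    keys_a = {name_key(c) for c in coauthors_a}
    keys_b = {name_key(c) for c in coauthors_b}
    union = keys_a | keys_b
    overlap = len(keys_a & keys_b) / len(union) if union else 0.0
    return 0.5 + 0.5 * overlap

# Each reference: (author string as it appears, co-author strings).
r1 = ("Aho, A. V.", ["J. Hopcroft", "J. Ullman"])
r2 = ("Alfred Aho", ["John Hopcroft", "Jeffrey Ullman"])
r3 = ("AV Aho", ["BW Kernighan", "PJ Weinberger"])
r4 = ("A. Ahmed", ["J. Ullman"])

same_coauthors = match_score(r1, r2)
name_only = match_score(r1, r3)
different_person = match_score(r1, r4)
```

A name match alone gives a middling score, and shared co-authors raise it, which mirrors the point that links carry evidence the attributes do not.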

Link Prediction • Predict whether a link exists between two entities, based on attributes and other observed links • Web: predict if there will be a link between two pages • Citation: predicting if a paper will cite another paper, or predict the venue type of a publication (conference, journal, workshop) based on properties of the paper • Epidemics: predicting who a patient’s contacts are (in epidemiology, the source of infection, or its focus, needs to be traced)
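A common baseline for predicting links from other observed links is to count shared neighbors: two unlinked nodes with many neighbors in common are likely candidates for a future link. The sketch below assumes an undirected graph stored as sets of neighbors and ignores node attributes entirely; the graph is hypothetical.

```python
def common_neighbor_scores(graph):
    """Rank currently unlinked pairs by how many neighbors they share."""
    scores = {}
    nodes = sorted(graph)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in graph[u]:
                continue                     # already linked
            shared = graph[u] & graph[v]
            if shared:
                scores[(u, v)] = len(shared)
    # Highest score first; ties broken by pair name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical undirected graph of related papers.
RELATED = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

candidates = common_neighbor_scores(RELATED)
```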
