This paper investigates the use of social annotations to address two critical issues in language modeling for information retrieval (LMIR): the data sparseness problem and the term independence assumption. By leveraging folksonomy services such as del.icio.us, we derive datasets from annotations that summarize document content and that measure similarities among documents and among annotations. The paper presents a language annotation model and reports experiments on a large dataset that validate the effectiveness of integrating social annotations, paving the way for further exploration and refinement in LMIR.
Using Social Annotations to Improve Language Model for Information Retrieval • Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University) • Yunbo Cao (Microsoft Research Asia) • CIKM’07 poster
Introduction • Language modeling for IR has been shown to be an efficient and effective way of modeling the relevance between queries and documents • Two critical problems in LMIR: data sparseness and the term independence assumption • In recent years, many web sites that provide folksonomy services have emerged, e.g. del.icio.us • This paper explores the use of social annotations to address these two critical problems in LMIR
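To make the starting point concrete, below is a minimal sketch of query-likelihood scoring with Dirichlet prior smoothing, the standard LMIR setup that the two problems above refer to. The function and parameter names (`score_query_likelihood`, `mu`) are illustrative and not from the paper; the unigram product is where the term independence assumption enters, and the smoothing term is the usual remedy for data sparseness.

```python
# Minimal sketch of query-likelihood LMIR with Dirichlet prior smoothing.
# Names are illustrative, not from the paper.
import math
from collections import Counter

def score_query_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """log P(q | d) under a Dirichlet-smoothed unigram document model."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_score = 0.0
    for term in query_terms:
        # Background probability from the whole collection.
        p_coll = collection_tf.get(term, 0) / collection_len
        # Dirichlet smoothing: interpolate the sparse document counts with the
        # collection model so unseen terms do not zero out the score.
        p_term = (doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        if p_term == 0.0:
            return float("-inf")   # term unseen even in the collection
        log_score += math.log(p_term)  # unigram: query terms assumed independent
    return log_score
```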
Properties of Social Annotations • The keyword property • Social annotations can be seen as good keywords for describing the respective documents from various aspects • The concatenation of all the annotations of a document is a summary of the document from users’ perspective • The structure property • An annotation may be associated with multiple documents and vice versa • The structure of social annotations can be used to explore two types of similarity: document-document similarity and annotation-annotation similarity
Deriving Data from Social Annotations • On the basis of social annotations, three sets of data can be derived • A summary dataset: sum_ann = {ds1, ds2, …, dsn}, where dsi is the summary of the ith document • A dataset of document similarity: sim_doc = {(doci, docj, simscore_docij) | 0 ≤ i ≤ j ≤ n} • A dataset of annotation similarity: sim_ann = {(anni, annj, simscore_annij) | 0 ≤ i ≤ j ≤ m} (Define t as a triple of sim_doc or sim_ann; t[i] denotes the ith dimension of t)
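A sketch of how the three datasets could be assembled from raw (document, annotation) bookmarking records. The paper measures the similarities with SSR and SMM (see the experiment setup); plain cosine similarity over document-annotation co-occurrence counts is used here only as a stand-in, and all names are illustrative.

```python
# Sketch: derive sum_ann, sim_doc, sim_ann from (doc_id, annotation) records.
# Cosine over co-occurrence counts stands in for the paper's SSR/SMM measures.
import math
from collections import Counter, defaultdict

def derive_datasets(records):
    """records: iterable of (doc_id, annotation) pairs from a folksonomy dump."""
    doc_anns = defaultdict(Counter)   # document -> annotation counts
    ann_docs = defaultdict(Counter)   # annotation -> document counts
    for doc, ann in records:
        doc_anns[doc][ann] += 1
        ann_docs[ann][doc] += 1

    # sum_ann: the concatenation of each document's annotations as its summary
    sum_ann = {doc: list(cnt.elements()) for doc, cnt in doc_anns.items()}

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = math.sqrt(sum(x * x for x in u.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    docs, anns = sorted(doc_anns), sorted(ann_docs)
    sim_doc = [(d1, d2, cosine(doc_anns[d1], doc_anns[d2]))
               for i, d1 in enumerate(docs) for d2 in docs[i + 1:]]
    sim_ann = [(a1, a2, cosine(ann_docs[a1], ann_docs[a2]))
               for i, a1 in enumerate(anns) for a2 in anns[i + 1:]]
    return sum_ann, sim_doc, sim_ann
```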
Language Annotation Model (LAM) Figure. Bayesian network for generating a term in LAM
Content Model (CM) • Content Unigram Model (CUM) • Match the query against the literal content of a document • Topic Cluster Model (TCM) • Match the query against the latent topic of a document • Assume that documents similar to document d more or less share the same latent topic as d • The term distribution over d’s topic cluster can therefore be used to smooth d’s language model
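A minimal sketch of TCM-style smoothing under stated assumptions: documents whose sim_doc score with d exceeds a threshold are pooled into d's topic cluster, and the resulting cluster term distribution is interpolated with the document unigram model. The threshold, the interpolation weight `beta`, and the linear combination are assumptions; the slide does not give the exact form used in the paper.

```python
# Sketch: smooth a document model with its topic-cluster term distribution.
# Threshold and beta are assumed values, not taken from the paper.
from collections import Counter

def topic_cluster_model(doc_id, doc_terms_by_id, sim_doc, threshold=0.5):
    """Pool term counts of documents similar to doc_id into one cluster model."""
    cluster_tf = Counter(doc_terms_by_id[doc_id])
    for d1, d2, score in sim_doc:
        if score >= threshold and doc_id in (d1, d2):
            other = d2 if d1 == doc_id else d1
            cluster_tf.update(doc_terms_by_id[other])
    total = sum(cluster_tf.values())
    return {t: c / total for t, c in cluster_tf.items()}

def smoothed_content_model(term, doc_tf, doc_len, cluster_model, beta=0.7):
    """P_cm(term|d) = beta * P_cum(term|d) + (1 - beta) * P_tcm(term|d)."""
    p_cum = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
    p_tcm = cluster_model.get(term, 0.0)
    return beta * p_cum + (1.0 - beta) * p_tcm
```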
Annotation Model (AM) • AM consists of two sub-models: an independency model and a dependency model • Annotation Unigram Model (AUM) • A unigram language model that matches query terms against the annotation summaries • Annotation Dependency Model (ADM)
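A hedged sketch of the two AM sub-models. AUM is a plain (unsmoothed) unigram over the annotation summary ds; for ADM, the decomposition P(qi|ds) = Σa P(qi|a)·P(a|ds) is an assumption inferred from the probabilities listed on the parameter estimation slide, with P(a|ds) taken by maximum likelihood as stated there. The lookup table `p_term_given_ann` is a hypothetical stand-in for however P(qi|a) is approximated.

```python
# Sketch of the AM sub-models. The ADM decomposition below is an assumption
# based on the probabilities {P(qi|a), P(a|ds)} named on the next slide.
from collections import Counter

def p_aum(term, summary_terms):
    """Unigram probability of `term` in the annotation summary ds (unsmoothed)."""
    tf = Counter(summary_terms)
    return tf.get(term, 0) / len(summary_terms) if summary_terms else 0.0

def p_adm(term, summary_terms, p_term_given_ann):
    """Dependency model: route the query term through each annotation a in ds."""
    tf = Counter(summary_terms)
    total = len(summary_terms)
    score = 0.0
    for ann, count in tf.items():
        p_a_given_ds = count / total                     # maximum likelihood P(a|ds)
        p_q_given_a = p_term_given_ann.get((term, ann), 0.0)  # assumed lookup
        score += p_q_given_a * p_a_given_ds
    return score
```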
Parameter Estimation • 5 model probabilities {Pcum(qi|d), Paum(qi|ds), Ptcm(qi|d), P(qi|a), P(a|ds)} and 3 mixture parameters (λc, λa, λd) have to be estimated • Use the EM algorithm to estimate λc, λa, and λd • Dirichlet prior smoothing for CUM, AUM, and TCM • Ptcm(qi|d) is estimated using a unigram language model on the topic clusters • P(a|ds) is approximated by maximum likelihood estimation • Approximate P(qi|a):
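A sketch of the EM step for the three mixture weights, assuming the component probabilities (content, annotation-unigram, annotation-dependency) are held fixed for each observed query term and only λc, λa, λd are re-estimated. The training pairs, initialization, and iteration count are assumptions.

```python
# Sketch: EM for the mixture weights (lambda_c, lambda_a, lambda_d) with the
# per-term component probabilities treated as fixed inputs.

def em_mixture_weights(component_probs, iters=50):
    """component_probs: list of (p_cm, p_aum, p_adm) for observed query terms."""
    lam = [1.0 / 3] * 3                       # lambda_c, lambda_a, lambda_d
    for _ in range(iters):
        resp_sums = [0.0, 0.0, 0.0]
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lam, probs))
            if mix == 0.0:
                continue
            for k in range(3):                # E-step: posterior of each component
                resp_sums[k] += lam[k] * probs[k] / mix
        total = sum(resp_sums)
        if total == 0.0:
            break
        lam = [s / total for s in resp_sums]  # M-step: renormalize responsibilities
    return lam
```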
Experiment Setup • 1,736,268 web pages with 269,566 distinct annotations were crawled from del.icio.us • 80 queries with 497 relevant documents were manually collected by a group of CS students • Merged Source Model (MSM) as the baseline • Merge each document’s annotations into its content and apply a Dirichlet prior smoothed unigram language model to the merged source • SocialSimRank (SSR) and the Separable Mixture Model (SMM) are used to measure the similarity between documents and between annotations
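A sketch of the MSM baseline as described above: append a document's annotation summary to its content and score the merged source with the same Dirichlet-smoothed unigram model sketched after the introduction slide (`score_query_likelihood` is that earlier illustrative helper).

```python
# Sketch of the Merged Source Model baseline: annotations are simply merged
# into the page content before unigram scoring.
# `score_query_likelihood` is the Dirichlet-smoothed helper sketched earlier.

def score_msm(query_terms, doc_terms, summary_terms, collection_tf, collection_len, mu=2000.0):
    merged = list(doc_terms) + list(summary_terms)   # merged source: content + annotations
    return score_query_likelihood(query_terms, merged, collection_tf, collection_len, mu)
```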
SSR and SMM Table. Top 3 most similar annotations for 5 sample annotations, as identified by SSR and SMM
Retrieval Performance Table. MAP of each model
Conclusions and Future Work • The problem of integrating social annotations into LMIR is studied. • Two properties of social annotations are studied and effectively utilized to alleviate the data sparseness problem and relax the term independence assumption. • In the future, we plan to explore more features of social annotations and more sophisticated ways of using the annotations.