This paper investigates the use of social annotations to address two critical issues in language modeling for information retrieval (LMIR): the data sparseness problem and the term independence assumption. By leveraging folksonomy services such as del.icio.us, we derive datasets from annotations that summarize document content and that measure similarities among documents and among annotations. The paper presents a language annotation model and reports experiments on a large dataset that validate the effectiveness of integrating social annotations, paving the way for further exploration and refinement in LMIR.
Using Social Annotations to Improve Language Model for Information Retrieval • Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University) • Yunbo Cao (Microsoft Research Asia) • CIKM’07 poster
Introduction • Language modeling for IR has been shown to be an efficient and effective way of modeling the relevance between queries and documents • Two critical problems in LMIR: data sparseness and the term independence assumption • In recent years, many web sites that provide folksonomy services have emerged, e.g. del.icio.us • This paper explores the use of social annotations to address these two critical problems in LMIR
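To make the starting point concrete, below is a minimal sketch of query-likelihood scoring with Dirichlet prior smoothing, the standard LMIR setup that the two problems above refer to. The function and parameter names (`score_query_likelihood`, `mu`) are illustrative and not from the paper; the unigram product is where the term independence assumption enters, and the smoothing term is the usual remedy for data sparseness.

```python
# Minimal sketch of query-likelihood LMIR with Dirichlet prior smoothing.
# Names are illustrative, not from the paper.
import math
from collections import Counter

def score_query_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """log P(q | d) under a Dirichlet-smoothed unigram document model."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_score = 0.0
    for term in query_terms:
        # Background probability from the whole collection.
        p_coll = collection_tf.get(term, 0) / collection_len
        # Dirichlet smoothing: interpolate the sparse document counts with the
        # collection model so unseen terms do not zero out the score.
        p_term = (doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        if p_term == 0.0:
            return float("-inf")   # term unseen even in the collection
        log_score += math.log(p_term)  # unigram: query terms assumed independent
    return log_score
```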
Properties of Social Annotations • The keyword property • Social annotations can be seen as good keywords for describing the respective documents from various aspects • The concatenation of all the annotations of a document is a summary of the document from users’ perspective • The structure property • An annotation may be associated with multiple documents and vice versa • The structure of social annotations can be used to explore two types of similarity: document-document similarity and annotation-annotation similarity
Deriving Data from Social Annotations • On the basis of social annotations, three sets of data can be derived • A summary dataset: sum_ann = {ds1, ds2, …, dsn}, where dsi is the summary of the ith document • A dataset of document similarity: sim_doc = {(doci, docj, simscore_docij) | 0 ≤ i ≤ j ≤ n} • A dataset of annotation similarity: sim_ann = {(anni, annj, simscore_annij) | 0 ≤ i ≤ j ≤ m} (Define t as a triple of sim_doc or sim_ann; t[i] denotes the ith dimension of t)
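A sketch of how the three datasets could be assembled from raw (document, annotation) bookmarking records. The paper measures the similarities with SSR and SMM (see the experiment setup); plain cosine similarity over document-annotation co-occurrence counts is used here only as a stand-in, and all names are illustrative.

```python
# Sketch: derive sum_ann, sim_doc, sim_ann from (doc_id, annotation) records.
# Cosine over co-occurrence counts stands in for the paper's SSR/SMM measures.
import math
from collections import Counter, defaultdict

def derive_datasets(records):
    """records: iterable of (doc_id, annotation) pairs from a folksonomy dump."""
    doc_anns = defaultdict(Counter)   # document -> annotation counts
    ann_docs = defaultdict(Counter)   # annotation -> document counts
    for doc, ann in records:
        doc_anns[doc][ann] += 1
        ann_docs[ann][doc] += 1

    # sum_ann: the concatenation of each document's annotations as its summary
    sum_ann = {doc: list(cnt.elements()) for doc, cnt in doc_anns.items()}

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = math.sqrt(sum(x * x for x in u.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    docs, anns = sorted(doc_anns), sorted(ann_docs)
    sim_doc = [(d1, d2, cosine(doc_anns[d1], doc_anns[d2]))
               for i, d1 in enumerate(docs) for d2 in docs[i + 1:]]
    sim_ann = [(a1, a2, cosine(ann_docs[a1], ann_docs[a2]))
               for i, a1 in enumerate(anns) for a2 in anns[i + 1:]]
    return sum_ann, sim_doc, sim_ann
```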
Language Annotation Model (LAM) Figure. Bayesian network for generating a term in LAM
Content Model (CM) • Content Unigram Model (CUM) • Match the query against the literal content of a document • Topic Cluster Model (TCM) • Match the query against the latent topic of a document • Assume that documents similar to document d more or less share the same latent topic as d • The term distribution over d’s topic cluster can therefore be used to smooth d’s language model
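A minimal sketch of TCM-style smoothing under stated assumptions: documents whose sim_doc score with d exceeds a threshold are pooled into d's topic cluster, and the resulting cluster term distribution is interpolated with the document unigram model. The threshold, the interpolation weight `beta`, and the linear combination are assumptions; the slide does not give the exact form used in the paper.

```python
# Sketch: smooth a document model with its topic-cluster term distribution.
# Threshold and beta are assumed values, not taken from the paper.
from collections import Counter

def topic_cluster_model(doc_id, doc_terms_by_id, sim_doc, threshold=0.5):
    """Pool term counts of documents similar to doc_id into one cluster model."""
    cluster_tf = Counter(doc_terms_by_id[doc_id])
    for d1, d2, score in sim_doc:
        if score >= threshold and doc_id in (d1, d2):
            other = d2 if d1 == doc_id else d1
            cluster_tf.update(doc_terms_by_id[other])
    total = sum(cluster_tf.values())
    return {t: c / total for t, c in cluster_tf.items()}

def smoothed_content_model(term, doc_tf, doc_len, cluster_model, beta=0.7):
    """P_cm(term|d) = beta * P_cum(term|d) + (1 - beta) * P_tcm(term|d)."""
    p_cum = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
    p_tcm = cluster_model.get(term, 0.0)
    return beta * p_cum + (1.0 - beta) * p_tcm
```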
Annotation Model (AM) • AM consists of two sub-models: an independency model and a dependency model • Annotation Unigram Model (AUM) • A unigram language model that matches query terms against the annotation summaries • Annotation Dependency Model (ADM)
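A hedged sketch of the two AM sub-models. AUM is a plain (unsmoothed) unigram over the annotation summary ds; for ADM, the decomposition P(qi|ds) = Σa P(qi|a)·P(a|ds) is an assumption inferred from the probabilities listed on the parameter estimation slide, with P(a|ds) taken by maximum likelihood as stated there. The lookup table `p_term_given_ann` is a hypothetical stand-in for however P(qi|a) is approximated.

```python
# Sketch of the AM sub-models. The ADM decomposition below is an assumption
# based on the probabilities {P(qi|a), P(a|ds)} named on the next slide.
from collections import Counter

def p_aum(term, summary_terms):
    """Unigram probability of `term` in the annotation summary ds (unsmoothed)."""
    tf = Counter(summary_terms)
    return tf.get(term, 0) / len(summary_terms) if summary_terms else 0.0

def p_adm(term, summary_terms, p_term_given_ann):
    """Dependency model: route the query term through each annotation a in ds."""
    tf = Counter(summary_terms)
    total = len(summary_terms)
    score = 0.0
    for ann, count in tf.items():
        p_a_given_ds = count / total                     # maximum likelihood P(a|ds)
        p_q_given_a = p_term_given_ann.get((term, ann), 0.0)  # assumed lookup
        score += p_q_given_a * p_a_given_ds
    return score
```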
Parameter Estimation • 5 model probabilities {Pcum(qi|d), Paum(qi|ds), Ptcm(qi|d), P(qi|a), P(a|ds)} and 3 mixture parameters (λc, λa, λd) have to be estimated • Use the EM algorithm to estimate λc, λa, and λd • Dirichlet prior smoothing for CUM, AUM, and TCM • Ptcm(qi|d) is estimated using a unigram language model on the topic clusters • P(a|ds) is approximated by maximum likelihood estimation • Approximate P(qi|a):
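A sketch of the EM step for the three mixture weights, assuming the component probabilities (content, annotation-unigram, annotation-dependency) are held fixed for each observed query term and only λc, λa, λd are re-estimated. The training pairs, initialization, and iteration count are assumptions.

```python
# Sketch: EM for the mixture weights (lambda_c, lambda_a, lambda_d) with the
# per-term component probabilities treated as fixed inputs.

def em_mixture_weights(component_probs, iters=50):
    """component_probs: list of (p_cm, p_aum, p_adm) for observed query terms."""
    lam = [1.0 / 3] * 3                       # lambda_c, lambda_a, lambda_d
    for _ in range(iters):
        resp_sums = [0.0, 0.0, 0.0]
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lam, probs))
            if mix == 0.0:
                continue
            for k in range(3):                # E-step: posterior of each component
                resp_sums[k] += lam[k] * probs[k] / mix
        total = sum(resp_sums)
        if total == 0.0:
            break
        lam = [s / total for s in resp_sums]  # M-step: renormalize responsibilities
    return lam
```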
Experiment Setup • 1,736,268 web pages with 269,566 distinct annotations were crawled from del.icio.us • 80 queries with 497 relevant documents were manually collected by a group of CS students • Merged Source Model (MSM) as the baseline • Merge each document’s annotations into its content and apply a Dirichlet prior smoothed unigram language model to the merged source • SocialSimRank (SSR) and the Separable Mixture Model (SMM) are used to measure the similarity between documents and between annotations
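A sketch of the MSM baseline as described above: append a document's annotation summary to its content and score the merged source with the same Dirichlet-smoothed unigram model sketched after the introduction slide (`score_query_likelihood` is that earlier illustrative helper).

```python
# Sketch of the Merged Source Model baseline: annotations are simply merged
# into the page content before unigram scoring.
# `score_query_likelihood` is the Dirichlet-smoothed helper sketched earlier.

def score_msm(query_terms, doc_terms, summary_terms, collection_tf, collection_len, mu=2000.0):
    merged = list(doc_terms) + list(summary_terms)   # merged source: content + annotations
    return score_query_likelihood(query_terms, merged, collection_tf, collection_len, mu)
```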
SSR and SMM Table. Top 3 most similar annotations for 5 sample annotations, as identified by SSR and SMM
Retrieval Performance Table. MAP of each model
Conclusions and Future Work • The problem of integrating social annotations into LMIR is studied. • Two properties of social annotations are studied and effectively utilized to alleviate the data sparseness problem and relax the term independence assumption. • In the future, we plan to explore more features of social annotations and more sophisticated ways of using the annotations.