
Using Social Annotations to Improve Language Model for Information Retrieval

Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University); Yunbo Cao (Microsoft Research Asia). CIKM '07 poster.


Presentation Transcript


  1. Using Social Annotations to Improve Language Model for Information Retrieval
     Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University)
     Yunbo Cao (Microsoft Research Asia)
     CIKM '07 poster

  2. Introduction
     • Language modeling for IR (LMIR) has proven to be an efficient and effective way to model relevance between queries and documents
     • Two critical problems in LMIR: data sparseness and the term independence assumption
     • In recent years, many web sites providing folksonomy services have emerged, e.g. del.icio.us
     • This paper explores the use of social annotations to address these two problems

  3. Properties of Social Annotations
     • The keyword property
       • Social annotations serve as good keywords that describe their documents from various aspects
       • The concatenation of all the annotations of a document is a summary of the document from the users' perspective
     • The structure property
       • An annotation may be associated with multiple documents, and a document with multiple annotations
       • The structure of social annotations can be used to explore two types of similarity: document-document similarity and annotation-annotation similarity

  4. Deriving Data from Social Annotations
     • On the basis of social annotations, three sets of data can be derived
       • A summary dataset: sum_ann = {ds_1, ds_2, …, ds_n}, where ds_i is the summary of the i-th document
       • A dataset of document similarity: sim_doc = {(doc_i, doc_j, simscore_doc_ij) | 0 ≤ i ≤ j ≤ n}
       • A dataset of annotation similarity: sim_ann = {(ann_i, ann_j, simscore_ann_ij) | 0 ≤ i ≤ j ≤ m}
     • For a triple t in sim_doc or sim_ann, t[i] denotes the i-th component of t
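
The derivation on slide 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy `bookmarks` list and the function names are hypothetical, and the similarity score is a plain Jaccard overlap of annotation sets used as a stand-in, since the slides compute similarity with SSR and SMM instead.

```python
from collections import defaultdict

# Hypothetical toy data: (document id, annotation) pairs as a folksonomy
# site might record them.
bookmarks = [
    ("doc1", "python"), ("doc1", "tutorial"),
    ("doc2", "python"), ("doc2", "web"),
    ("doc3", "cooking"),
]

def build_summaries(pairs):
    """Concatenate each document's annotations into one summary string."""
    by_doc = defaultdict(list)
    for doc, ann in pairs:
        by_doc[doc].append(ann)
    return {doc: " ".join(anns) for doc, anns in by_doc.items()}

def jaccard_doc_similarity(pairs):
    """Stand-in document similarity (assumption): Jaccard overlap of
    annotation sets. This is NOT SocialSimRank or the Separable Mixture Model."""
    ann_sets = defaultdict(set)
    for doc, ann in pairs:
        ann_sets[doc].add(ann)
    docs = sorted(ann_sets)
    triples = []
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            shared = len(ann_sets[a] & ann_sets[b])
            total = len(ann_sets[a] | ann_sets[b])
            triples.append((a, b, shared / total))
    return triples

sum_ann = build_summaries(bookmarks)         # summary dataset
sim_doc = jaccard_doc_similarity(bookmarks)  # (doc_i, doc_j, score) triples
```

Swapping the roles of documents and annotations in `jaccard_doc_similarity` would give an analogous stand-in for the annotation similarity dataset.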

  5. Language Annotation Model (LAM)
     Figure: Bayesian network for generating a term in LAM

  6. Content Model (CM)
     • Content Unigram Model (CUM)
       • Matches the query against the literal content of a document
     • Topic Cluster Model (TCM)
       • Matches the query against the latent topic of a document
       • Assumes that documents similar to a document d more or less share d's latent topic
       • The term distribution over d's topic cluster can be used to smooth d's language model

  7. Annotation Model (AM)
     • AM is assumed to contain two sub-models: an independence model and a dependency model
     • Annotation Unigram Model (AUM)
       • A unigram language model that matches query terms against the annotation summaries
     • Annotation Dependency Model (ADM)

  8. Parameter Estimation
     • Five model probabilities {P_cum(q_i|d), P_aum(q_i|ds), P_tcm(q_i|d), P(q_i|a), P(a|ds)} and three mixture parameters (λ_c, λ_a, λ_d) have to be estimated
     • The EM algorithm is used to estimate λ_c, λ_a, and λ_d
     • Dirichlet prior smoothing is used for CUM, AUM, and TCM
     • P_tcm(q_i|d) is estimated using a unigram language model on the topic clusters
     • P(a|ds) is approximated by maximum likelihood estimation
     • Approximate P(q_i|a):
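
The Dirichlet prior smoothing named on slide 8 can be illustrated with the standard textbook form, P(w|d) = (c(w, d) + μ · P(w|C)) / (|d| + μ). The sketch below assumes that form; the toy documents, the function name, and the value of μ are hypothetical, and the slides do not state which μ the authors used.

```python
from collections import Counter

def dirichlet_prob(term, doc_tokens, coll_counts, coll_len, mu):
    """Dirichlet-smoothed unigram probability of `term` given a document.

    P(w|d) = (c(w, d) + mu * P(w|C)) / (|d| + mu), where P(w|C) is the
    term's relative frequency in the whole collection.
    """
    tf = doc_tokens.count(term)
    p_coll = coll_counts[term] / coll_len
    return (tf + mu * p_coll) / (len(doc_tokens) + mu)

# Hypothetical two-document collection.
docs = [
    ["social", "annotation", "model"],
    ["language", "model", "retrieval"],
]
coll_counts = Counter(tok for doc in docs for tok in doc)
coll_len = sum(coll_counts.values())

mu = 3.0  # illustrative value only
p_seen = dirichlet_prob("model", docs[0], coll_counts, coll_len, mu)
p_unseen = dirichlet_prob("retrieval", docs[0], coll_counts, coll_len, mu)

# The smoothed probabilities over the vocabulary still sum to one.
total = sum(dirichlet_prob(w, docs[0], coll_counts, coll_len, mu)
            for w in coll_counts)
```

The point of the smoothing is visible in `p_unseen`: a term absent from the document still receives non-zero probability from the collection model, which is how the data sparseness problem is eased. The same function could be applied to a summary ds or to a topic cluster by passing a different token list.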

  9. Experiment Setup
     • 1,736,268 web pages with 269,566 different annotations were crawled from del.icio.us
     • 80 queries with 497 relevant documents were manually collected by a group of CS students
     • Merged Source Model (MSM) as the baseline
       • Merges each document's annotations into its content and applies a Dirichlet prior smoothed unigram language model to the merged source
     • SocialSimRank (SSR) and Separable Mixture Model (SMM) are used to measure the similarity between documents and between annotations
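
A rough sketch of the MSM baseline as slide 9 describes it: annotations are appended to the document text, and documents are ranked by Dirichlet-smoothed query likelihood over the merged tokens. The toy corpus, the function names, whitespace tokenization, and the value of μ are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document has content and social annotations.
docs = {
    "d1": {"content": "intro to language models", "annotations": ["retrieval", "nlp"]},
    "d2": {"content": "pasta recipes", "annotations": ["cooking", "food"]},
}

def merged_tokens(doc):
    """Merge a document's annotations into its content (the MSM idea)."""
    return doc["content"].split() + list(doc["annotations"])

def rank(query, corpus, mu=10.0):
    """Rank documents by Dirichlet-smoothed query log-likelihood."""
    merged = {doc_id: merged_tokens(doc) for doc_id, doc in corpus.items()}
    coll = Counter(tok for toks in merged.values() for tok in toks)
    coll_len = sum(coll.values())
    scores = {}
    for doc_id, toks in merged.items():
        tf = Counter(toks)
        score = 0.0
        for q in query.split():
            if coll[q] == 0:
                continue  # term unseen in the whole collection: skip it
            p = (tf[q] + mu * coll[q] / coll_len) / (len(toks) + mu)
            score += math.log(p)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank("retrieval models", docs)
```

In this toy corpus the term "retrieval" occurs only as an annotation of d1, so d1 can match it only because the annotations were merged into the source.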

  10. SSR and SMM
      Table: Top 3 most similar annotations of 5 sample annotations, as found by SSR and SMM

  11. Retrieval Performance
      Table: MAP of each model
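
For readers unfamiliar with the metric in the table on slide 11, MAP (mean average precision) is conventionally computed as below. This is the standard definition, not code from the authors; the ranked lists and relevance sets are hypothetical, and the slides do not say which evaluation tooling was used.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical results for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # relevant docs at ranks 1 and 3
    (["x", "y"], {"y"}),                 # relevant doc at rank 2
]
map_score = mean_average_precision(runs)
```

Relevant documents that a system never retrieves still count in the denominator, so missing them lowers the score.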

  12. Conclusions and Future Work
      • The problem of integrating social annotations into LMIR is studied
      • Two properties of social annotations are studied and used effectively to alleviate the data sparseness problem and relax the term independence assumption
      • In future work, we plan to explore more features of social annotations and more sophisticated ways of using them
