Explore methods to improve access and browsing in federated digital libraries by enhancing the representation of documents and document aggregations and by removing weakly topical documents. The study uses latent Dirichlet allocation (LDA) to induce topics, identifying and excluding weakly topical documents during training to improve the quality of the induced model. Experimental assessments compare topic coherence between models built from sampled and raw corpora. Stop documents are identified by their homogeneity, using a parameter expressed as a cumulative-normal confidence level. The proposed algorithm aims to refine the induced topic model and improve resource discovery in federated libraries.
Building Topic Models in a Federated Digital Library Through Selective Document Exclusion
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign
ASIST 2011, New Orleans, LA, October 10, 2011
Supported by IMLS LG-06-07-0020.
The Setting: IMLS DCC
[Diagram: multiple data providers (IMLS NLG & LSTA collections) expose metadata via OAI-PMH; the DCC service provider harvests that metadata and builds DCC services on top of it.]
High-Level Research Interest
• Improve “access” to data harvested for federated digital libraries by enhancing:
  • Representation of documents
  • Representation of document aggregations
  • Capitalizing on the relationship between aggregations and documents
• PS: By “document” I mean a single metadata (usually DC) record.
Motivation for Our Work
• Most empirical approaches to this type of problem rely on some kind of analysis of term counts.
• Unreliable for our data:
  • Vocabulary mismatch
  • Poor probability estimates
The Problem: Supporting End-User Experience
• Full-text search
• Browse by “subject”
• Desired:
  • Improved browsing
  • Support for high-level aggregation understanding and resource discovery
• Approach: empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).
Research Question
• Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Approach: Identify and remove “weakly topical” documents during model training.
Latent Dirichlet Allocation
• Given a corpus of documents C and an empirically chosen integer k:
• Assume that a generative process involving k latent topics generated the word occurrences in C.
• The generative story, per document:
  • Choose document length N ~ Poisson(mu).
  • Choose topic-probability vector Theta ~ Dir(alpha).
  • For each word position n in 1:N:
    • Choose topic z_n ~ Multinomial(Theta).
    • Choose word w_n from Pr(w_n | z_n, Beta).
• End result, for each topic T_1 … T_k, a given word w, and a given document D:
  • Pr(w | T_i)
  • Pr(D | T_i)
  • Pr(T_i)
• Calculate estimates via iterative methods: MCMC / Gibbs sampling.
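To make the estimates concrete, here is a minimal sketch of inducing k topics over tokenized metadata records. The talk names no toolkit; the gensim library and the toy records are our assumptions for illustration.

```python
# A minimal sketch of LDA topic induction, assuming the gensim library
# (not named in the talk); the toy records are invented.
from gensim import corpora, models

# Toy "documents": tokenized Dublin Core metadata records.
records = [
    ["civil", "war", "letters", "illinois", "regiment"],
    ["quilt", "textile", "photograph", "collection"],
    ["civil", "war", "diary", "soldier", "illinois"],
]

dictionary = corpora.Dictionary(records)
bows = [dictionary.doc2bow(rec) for rec in records]

k = 2  # empirically chosen number of latent topics
lda = models.LdaModel(bows, num_topics=k, id2word=dictionary,
                      passes=10, random_state=1)

# Pr(w | T_i): the most probable words in each induced topic.
for i in range(k):
    print(i, lda.show_topic(i, topn=5))
```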
Proposed Pipeline
[Diagram: Full corpus → proposed algorithm → reduced corpus → train the model, yielding Pr(w | T), Pr(D | T), Pr(T) → inference carries these estimates back over the full corpus.]
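Continuing the gensim sketch above, the two stages might look like this; `keep_flags` is a hypothetical name standing in for the output of the exclusion algorithm.

```python
# Sketch of the two-stage pipeline (names illustrative): train only on the
# reduced corpus, then use LDA inference to assign topics to every record
# in the full corpus, including the excluded "stop documents."
keep_flags = [True, True, False]  # hypothetical output of the exclusion step

reduced_bows = [b for b, keep in zip(bows, keep_flags) if keep]
lda = models.LdaModel(reduced_bows, num_topics=k, id2word=dictionary,
                      passes=10, random_state=1)

# Inference over the full corpus: Pr(T | D) for each document D.
full_corpus_topics = [lda.get_document_topics(b) for b in bows]
```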
Documents’ Topical Strength
• Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
• Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
• After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
Identifying “Stop Documents”
• The time at which documents enter a repository is often informative (e.g. bulk uploads).
• Score each document by log Pr(d_i | M_C), where M_C is the collection language model and d_i is the words comprising the ith document.
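A minimal sketch of this score, assuming a maximum-likelihood unigram collection model (any smoothing the paper uses is not given here):

```python
# Score each record by log Pr(d_i | M_C) under a maximum-likelihood
# unigram collection language model (smoothing choices are assumptions).
import math
from collections import Counter

def collection_lm(docs):
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_likelihood(doc, lm, floor=1e-12):
    # Sum of per-word log probabilities; floor guards unseen words.
    return sum(math.log(lm.get(w, floor)) for w in doc)

m_c = collection_lm(records)
scores = [log_likelihood(d, m_c) for d in records]
```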
Identifying “Stop Documents”
• Our paper outlines an algorithm for accomplishing this.
• Intuition:
  • Given a document d_i, decide whether it is part of a “run” of near-identical records.
  • Remove all records that occur within a run.
• The amount of homogeneity required to identify a run is guided by a parameter tol, expressed as a cumulative-normal confidence level (e.g. 95% or 99%); see the sketch below.
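The exact procedure is in the paper; the following is only a hypothetical reading of the intuition, flagging consecutive records whose scores are implausibly close, with the cutoff taken from the cumulative normal at confidence tol.

```python
# Hypothetical sketch only: the actual run-detection algorithm is in the
# paper. Consecutive records whose log-likelihood scores differ by
# significantly less than is typical are treated as a homogeneous "run."
from statistics import mean, stdev
from scipy.stats import norm

def flag_runs(scores, tol=0.95):
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    mu, sigma = mean(diffs), (stdev(diffs) or 1e-9)
    z = norm.ppf(tol)  # cutoff from the cumulative normal, e.g. 1.645 at 95%
    in_run = [False] * len(scores)
    for i, d in enumerate(diffs):
        if (d - mu) / sigma < -z:  # unusually homogeneous neighbors
            in_run[i] = in_run[i + 1] = True
    return in_run  # records flagged True are candidate stop documents
```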
Experimental Assessment
• Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?
• Intrusion detection:
  • Find the 10 most probable words for topic T_i.
  • Replace one of these 10 with a word chosen from the corpus with uniform probability.
  • Ask human assessors to identify the “intruder” word.
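A sketch of constructing one intrusion item, reusing the gensim model from above; the function and variable names are ours, not from the paper.

```python
# Build one word-intrusion item for topic T_i: take the topic's ten most
# probable words and swap one for a word drawn uniformly from the rest of
# the vocabulary. Names here are illustrative.
import random

def make_intrusion_item(lda, dictionary, topic_id, rng=random.Random(0)):
    top10 = [w for w, _ in lda.show_topic(topic_id, topn=10)]
    vocabulary = list(dictionary.token2id)  # all corpus words
    intruder = rng.choice([w for w in vocabulary if w not in top10])
    item = top10[:]
    item[rng.randrange(len(item))] = intruder  # replace one top word
    rng.shuffle(item)  # assessors should not see the intruder's position
    return item, intruder
```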
Experimental Assessment
• For each topic T_i, have 20 assessors try to find an intruder (20 different intruders). Repeat for both “sampled” and “raw” models.
  • i.e. 20 × 2 × 100 = 4,000 assessments.
• Let A^s_i be the percent of workers who correctly found the intruder in the ith topic of the sampled model, and A^r_i the analogous figure for the raw model.
• Testing the alternative hypothesis A^s_i > A^r_i against the null of no difference yields p < 0.001.
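The slide does not name the test; one plausible reading is a one-sided paired test over the 100 per-topic accuracies, sketched here with SciPy.

```python
# Sketch of the significance test (the exact test is not stated on the
# slide): a one-sided paired t-test over per-topic intruder accuracies
# A^s_i (sampled model) vs. A^r_i (raw model).
from scipy.stats import ttest_rel

# Hypothetical per-topic accuracies; the real study has 100 per model.
sampled_acc = [0.85, 0.70, 0.90, 0.80]
raw_acc = [0.60, 0.65, 0.75, 0.55]

t_stat, p_value = ttest_rel(sampled_acc, raw_acc, alternative="greater")
print(t_stat, p_value)  # the paper reports p < 0.001
```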
Experimental Assessment
• For each topic T_i, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.
Current & Future Work
• Testing breadth of coverage
• Assessing the value of induced topics
• Topic information for document priors in the language-modeling IR framework [next slide]
• Massive document expansion for improved language model estimation [under review]
Thank You
Miles Efron, Peter Organisciak, Katrina Fenlon
Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign
ASIST 2011, New Orleans, LA, October 10, 2011