
Challenges in Information Retrieval and Language Modeling


Presentation Transcript


  1. Challenges in Information Retrieval and Language Modeling Report of a Workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002

  2. What is Information Retrieval? • Salton (circa 1960s) defines it as being “… a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” • Throughout the 1970s and 1980s, nearly all research in the field focused on document (textual) retrieval. • In the last two decades, due to the huge increase in the types of content found online, the field has expanded into many different topics.

  3. Some of these include… • Question Answering • Topic Detection and Tracking • Summarization • Multimedia Retrieval (image and music) • Software Engineering • Chemical and Biological Informatics • Text structuring • Text mining • Genomics

  4. Similarity to other fields…? The report notes the similarities some might see between IR and database systems research. The main factor distinguishing the two fields is that IR focuses solely on deriving information from “unstructured” data sources. Recently, however, this boundary has blurred: “marked up” text (such as HTML or XML) can fall into either category.

  5. Google is quite advanced; why continue research into IR? • Web search and IR are not equivalent; web search is a subset of IR. • Web queries do not represent all information needs. • Web search engines are effective for some types of queries in some contexts.

  6. Contents • Long term challenges • Cross-lingual retrieval • Web search • Summarization • Question Answering • Multimedia Retrieval • Information Extraction • Testbeds

  7. Long Term Challenges • Global Information Access – develop massively distributed, multi-lingual retrieval systems that would take as input an information need, encoded in any language, and return relevant results, encoded in any language. • Contextual Retrieval – combine search technologies and knowledge about query and user context into a single framework in order to provide the most “appropriate” answer for a user’s information needs (example from paper).

  8. Topic Discussions • The following are topical discussions of subsets of the IR field that the committee considered the most important areas to address. • Note that for many of these areas, one of the prime needs for furthering research is a good set of test data. Many cite specific public-domain test sets that can be used to verify new technologies and prove correctness; sometimes these are just raw data such as newspaper archives. However, fields such as Summarization, which seek to test themselves in genres outside the news domain, sometimes need to compile test sets that they know will produce results scalable to the real world (see Testbeds section).

  9. Cross-Lingual Information Retrieval (CLIR) • Purpose is to support queries in one language against a collection in other languages. • Recently achieved milestone: cross-lingual document retrieval performs essentially as accurately as monolingual retrieval. • Challenges: Effective User Functionality, New more complex applications, Languages with sparse data

  10. Web Search • Has moved beyond just searching for specific web sites. Real Estate, Cars, Music, and Movies are just a few examples of other types of media people search for. • Challenges: develop a formal Web Structure, Crawling and Indexing (keeping cached search results fresh and up to date), Searching (develop more efficient methods that exploit the newly defined structure)

  11. Summarization • Shares some basic techniques with Indexing, since both are concerned with identifying the essence of a document. • Brings together techniques from different areas • Challenges: Define clearly specified summarization task(s) in an IR setting, move to summarization in new genres/context (newswire/newspapers are trivial), Integrate user’s prior knowledge into models.

  12. Question Answering • QA systems take as input a natural language question and a source collection, and produce a targeted, contextualized natural language answer. To build the answer, the system gathers relevant data, summary statistics, and relations from the sources. • Challenges: improve performance of “factoid” QA so the public would find it reliable, create systems that provide richer answers and leverage richer data sources, develop better interaction (UI) with the human user

  13. Multimedia Retrieval • Large problem space because of the huge variance in the types of objects to be retrieved and their distinguishing factors; for instance, the underlying representation for most of these content types is binary, yet movies, music, and PDF files compare very differently. • Challenges: given a non-text media object, the following options are available: text may be associated with the object (captions, etc.); part of the object might be converted to text (via speech recognition or OCR); metadata might be assigned manually; or media-specific features might be extracted.

  14. Information Extraction • IE fills slots in an ontology or database by selecting and normalizing sub-segments of human readable text. One example is finding names of entities and relationships between them in a source text. • Meant to be more of a sub-system, used by many of the other IR systems like QA, cross-lingual retrieval, and summarization. • Challenges: improve accuracy so IE can be more easily used by other systems, ability to extract literal meaning from text, large scale reference matching, cross-lingual information extraction.

  15. Testbeds Over the past decade, the IR research community has benefited from the annual US government sponsored TREC (Text REtrieval Conference) evaluations, which provided a level playing field for evaluating IR algorithms and systems. In addition to the evaluation exercise, these conferences created a number of significant data sets that fueled further research in IR. However, these data sets are too small (by roughly a thousand-fold) to be representative of the real world. Challenge: a community-based effort is needed to create a more realistic (in scale and function) common data set to fuel further research and increase the relevance of IR.

  16. Conclusion Near-exponential growth in technology over the past twenty years, combined with the increase in raw data and the different representations it can take (beyond the plain text around which the IR field was developed), requires the IR field to expand and grow in order to accurately perform its task: information retrieval.

  17. A Neural Network approach to Topic Spotting Applying nonlinear neural networks to topic spotting.

  18. What is Topic Spotting? Topic Spotting is the problem of identifying which of a set of predefined topics are present in a natural language document. More formally, given a set of n topics and a document, the task is to output for each topic the probability that the topic is present.
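
  A minimal, self-contained sketch of this task definition (not from the paper; the keyword lists and the crude scoring rule below are invented purely to illustrate the input/output contract):

    def spot_topics(document, topic_keywords):
        # return {topic: estimated P(topic present | document)} for every predefined topic
        words = document.lower().split()
        probs = {}
        for topic, keywords in topic_keywords.items():
            hits = sum(words.count(k) for k in keywords)
            probs[topic] = hits / (hits + 1.0)   # crude monotone score in [0, 1)
        return probs

    print(spot_topics("Wheat and barley exports rose sharply",
                      {"grain": ["wheat", "barley"], "metals": ["zinc", "gold"]}))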

  19. Neural Network The neural network we employ is essentially a non-linear regression model for fitting high-order interactions in some feature space to binary topic assignments. We have to limit the number of input variables to the neural network (so it can operate correctly), so we reduce the dimensionality of the input space using two different methods: term selection, which picks a subset of the original terms to use as features, and Latent Semantic Indexing (LSI), which constructs new features from combinations of a large number of the original terms.
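
  A small sketch of the LSI route via a truncated SVD (assumes numpy; the tiny term-by-document matrix and the choice of k are illustrative, not the paper's settings):

    import numpy as np

    # term-by-document matrix: rows = terms, columns = documents
    A = np.array([[2.0, 0.0, 1.0],
                  [1.0, 3.0, 0.0],
                  [0.0, 1.0, 4.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                        # number of latent dimensions to keep
    doc_features = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional feature vector per document
    print(doc_features.shape)                    # (3 documents, 2 LSI features)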

  20. The Corpus • Reuters-22173 corpus of Reuters newswire stories from 1987 • 21450 stories in full collection • Only used stories that had at least one topic assigned; narrowed it down to 9610 stories for training, and 3662 for testing. • Stories have mean length 90.0 words; standard deviation is 91.6.

  21. Representation • Term-by-document matrix containing word frequency information. The entries for each document vector, called a document profile, are computed as follows: • P_dk = √f_dk / √( Σ_i (√f_di)² ), where f_di is the frequency of word i in document d.
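
  A sketch of this document-profile computation (assumes numpy; the frequency vector is made up):

    import numpy as np

    def document_profile(freqs):
        # P_dk = sqrt(f_dk) / sqrt(sum_i (sqrt(f_di))^2):
        # square-root term frequencies, normalized to unit Euclidean length
        root = np.sqrt(np.asarray(freqs, dtype=float))
        return root / np.linalg.norm(root)

    print(document_profile([4, 1, 0, 9]))   # word frequencies for one hypothetical document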

  22. Term Selection • Find the subset of the original terms which seem the most useful for the classification task. • Divide the problem into 92 independent classification tasks and search, for each topic, for the set of terms which can best discriminate between documents with that topic and those without. This per-topic approach suits the neural network, since it is nearly impossible to select a single set of terms that can adequately discriminate between 92 classes of documents while remaining small enough to serve as the network's feature set.

  23. Term Selection (cont’d) • We score all of the terms according to how well they serve as individual predictors of the topic, then pick the top-scoring terms. This is called the relevancy score. It is related to the relevancy weight, which measures how “unbalanced” the term is across documents with and without the topic. • R_k = log[ (w_tk / d_t + 1/6) / (w_tk′ / d_t′ + 1/6) ], where w_tk is the number of documents with the topic that contain the term, d_t is the total number of documents with the topic, and the primed counts are the same quantities over documents without the topic.
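
  A sketch of this scoring rule (variable names mirror the formula; the example counts are invented):

    import math

    def relevancy_score(w_tk, d_t, w_tk_prime, d_t_prime):
        # R_k = log[(w_tk/d_t + 1/6) / (w_tk'/d_t' + 1/6)]
        return math.log((w_tk / d_t + 1/6) / (w_tk_prime / d_t_prime + 1/6))

    # a term appearing in 40 of 50 on-topic documents but only 30 of 9560 off-topic documents
    print(relevancy_score(40, 50, 30, 9560))   # large positive score -> strong individual predictor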

  24. Term Selection (cont’d) • 20 terms was found to yield, on average, the best classification performance. • Performance falls off after 20 terms due to Overfitting. Overfitting occurs when the network starts to “memorize” the training patterns; i.e. when it starts fitting the peculiarities of the training data, thus decreasing its performance on out-of-sample data.

  25. Modeling • Basic framework is as a regression model relating the input variables (features) to the output variables (binary topic assignments) which can be fit using training data. • p = 1 / (1 + e^(−η)), where η = βᵀx is a linear combination of the input features.
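
  A sketch of this logistic output (assumes numpy; the weight vector beta and feature vector x are made-up values):

    import numpy as np

    def topic_probability(beta, x):
        # p = 1 / (1 + exp(-eta)) with eta = beta^T x
        eta = np.dot(beta, x)
        return 1.0 / (1.0 + np.exp(-eta))

    print(topic_probability(np.array([0.8, -1.2, 0.3]), np.array([1.0, 0.5, 2.0])))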

  26. Neural Network Classifiers • Three major components of a neural network model: architecture, cost function, and search algorithm. • Architecture defines the functional forms relating the inputs to the outputs (in terms of network topology, unit connectivity, and activation functions). • The search in weight space for a set of weights that minimizes the cost function is the training process.

  27. Neural Networks for Topic Spotting • Network outputs are estimates of the probability of topic presence given the feature vector for a document. • An advantage neural networks have over other techniques is that they can predict multiple topics simultaneously using a single model. • We use two different network architectures, flat and modular.

  28. Flat Architecture • Use entire training set to train a separate network for each topic. • To combat overfitting, we introduce a simple regularization scheme based on weight elimination, in which we add a term penalizing network complexity to the cross-entropy cost function. This term is given by: • Σ_i,j w_ij² / (1 + w_ij²)
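
  A sketch of the regularized cost (assumes numpy; the penalty strength lam and the toy labels, predictions, and weights are illustrative choices, not values from the paper):

    import numpy as np

    def regularized_cost(y_true, y_pred, weights, lam=0.01):
        eps = 1e-12                                   # avoid log(0)
        cross_entropy = -np.mean(y_true * np.log(y_pred + eps)
                                 + (1 - y_true) * np.log(1 - y_pred + eps))
        penalty = np.sum(weights ** 2 / (1.0 + weights ** 2))   # weight-elimination term
        return cross_entropy + lam * penalty

    w = np.array([[0.5, -2.0], [0.1, 3.0]])
    print(regularized_cost(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7]), w))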

  29. Modular Architecture • Decomposes the learning problem into a set of smaller problems. • Outputs of the meta-topic network are multiplied by the outputs of the individual topic networks to get the final topic predictions. [Diagram: individual topic networks (Wool, Barley, Zinc, AL) alongside a Metal Group meta-topic network]
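
  A sketch of this multiplicative combination (the group assignments and probabilities below are invented for illustration):

    # meta-topic network outputs: P(topic group present | document)
    meta_out = {"commodities": 0.7, "metals": 0.2}
    # individual topic network outputs within each group
    topic_out = {"wool": 0.9, "barley": 0.1, "zinc": 0.8}
    group_of = {"wool": "commodities", "barley": "commodities", "zinc": "metals"}

    final = {t: meta_out[group_of[t]] * topic_out[t] for t in topic_out}
    print(final)   # e.g. wool ≈ 0.63: final prediction = group output × topic output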

  30. Results Precision for four flat networks, macro-averaged over four topic frequency ranges: topics 1-54 (all), 1-18 (high frequency), 19-36 (medium), 37-54 (low)

  31. Results (cont’d) Average precision for three modular networks, as well as several flat networks for comparison.

  32. Conclusion • Experiments show LSI representation is able to equal or exceed the performance of selected term representations for high frequency topics, but performs relatively poorly for low frequency topics. • However, task directed LSI representations can improve performance if implemented (relevancy weighting and local LSI).
