Apollo – Automated Content Management System

Srikanth Kallurkar, Quantum Leap Innovations. Work performed under AFRL contract FA8750-06-C-0052.

Capabilities
Automated domain-relevant information gathering: gathers documents relevant to domains of interest from the web or proprietary databases.
Automated content organization: organizes documents by topics, keywords, sources, time references, and features of interest.
Automated information discovery: assists users with automated recommendations on related documents, topics, keywords, sources…

Comparison to the existing manual information gathering method (what most users do currently)
The user performs a “Keyword Search” through a search engine interface backed by a generalized search index.
Index: User Task, Search Engine Task, Data
1. Develop Information Need
2. Form Keywords
3. Search (query submitted through the search engine interface)
4. Results
5. Examine Results
6. Satisfied? If yes, take a break.
7. If no, refine the query (conjure up new keywords) and search again, or 7a. give up.
The goal is to maximize the results for a user keyword query.

Apollo information gathering method (what users do with Apollo)
The user explores, filters, and discovers documents, assisted by Apollo features. The Apollo interface is backed by a specialized domain model with features such as vocabulary, location, time, sources …
Index: User Task, Apollo Task, Data
1. Develop Information Need
2. Explore Features
3. Filter
4. Results
5. Examine Results
6. Satisfied? If yes, take a break.
7. If no, discover new or related information via Apollo features.
The focus is on informative results seeded by a user-selected combination of features.

Apollo Architecture

Apollo Domain Modeling (behind the scenes)
1. Bootstrap Domain
2. Define domain, topics, subtopics
3. Get Training Documents (Option A/B/AB)
A. From the Web: Build Representative Keywords, Query Search Engine(s), Curate (optional)
B. From Specialized Domain Repository (select a small sample)
4. Build Domain Signature: Identify Salient Terms per Domain, Topic, Subtopic; Compute Classification Threshold
5. Organize Documents (Option A/B/AB)
A. From the Web
B. From Specialized Domain Repository
Filter Documents; Classify into defined topics/subtopics; Extract Features (Vocabulary, Location, Time …)

Apollo Data Organization
Snapshot of the Apollo process to collect a domain-relevant document: each document (e.g. published article, news report, journal paper, …) from a data source (e.g. web site, proprietary database, ...) passes through the Apollo collection process. If the document is in a domain (Domain A, Domain B, Domain C, …), it is kept; if not, it is discarded.
Snapshot of the Apollo process to evolve domain-relevant libraries: the Apollo collection/organization process classifies each in-domain document into the defined domain topics/subtopics, extracts features (domain-relevant vocabulary, locations, time references, sources, …), organizes documents by features (e.g. Feature A: Doc 1, Doc 2, Doc N), and stores the document in the Apollo library of domain-relevant documents.
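The "organize documents by features" step can be pictured as an inverted index from each feature to the documents that contain it. The following is only a minimal sketch under that assumption; the function names `build_feature_index` and `filter_documents` and the sample library are hypothetical and do not come from Apollo.

```python
from collections import defaultdict


def build_feature_index(library):
    """Map each feature to the set of document ids that contain it.

    `library` is a dict of doc id -> set of extracted features
    (hypothetical data layout, assumed for illustration).
    """
    index = defaultdict(set)
    for doc_id, features in library.items():
        for feature in features:
            index[feature].add(doc_id)
    return index


def filter_documents(index, selected_features):
    """Return ids of documents that contain every selected feature."""
    sets = [index.get(feature, set()) for feature in selected_features]
    return set.intersection(*sets) if sets else set()


# Toy library: three documents with already-extracted features.
library = {
    "doc1": {"global warming", "ice core", "Antarctica"},
    "doc2": {"global warming", "greenhouse gas emissions"},
    "doc3": {"sea level", "Antarctica"},
}
index = build_feature_index(library)
matches = filter_documents(index, ["global warming", "Antarctica"])
```

Filtering on a combination of features then reduces to a set intersection, which is one simple way to support the "user-selected combination of features" described earlier.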

Apollo Information Discovery
1. The user selects a feature via the Apollo interface, e.g. the phrase “global warming” from the domain “climate change”.
2. Apollo builds a set of documents from the library that contain the feature, e.g. a set of n documents containing the phrase “global warming”.
3. Apollo collates all other features from the set and ranks them by domain relevance.
4. The user is presented with co-occurring features, e.g. the user sees “greenhouse gas emissions” and “ice core” as phrases co-occurring with “global warming” and explores documents containing those phrases.
5. The user can use discovered features to expand or restrict the focus of the search based on driving interests.
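The discovery loop above can be sketched in a few lines. The slides do not say how "domain relevance" is computed, so this sketch assumes a co-occurrence count optionally scaled by a per-feature relevance weight; the function `cooccurring_features`, the `relevance` argument, and the sample data are all hypothetical.

```python
from collections import Counter


def cooccurring_features(library, selected, relevance=None):
    """Rank features that co-occur with `selected` in the library.

    `library` maps doc id -> set of features. `relevance` is an optional
    dict of feature -> weight (an assumed stand-in for domain relevance);
    features without a weight default to 1.0.
    """
    relevance = relevance or {}
    # Documents from the library that contain the selected feature.
    docs = [feats for feats in library.values() if selected in feats]
    # Collate all other features from that document set.
    counts = Counter(f for feats in docs for f in feats if f != selected)

    def score(feature):
        return counts[feature] * relevance.get(feature, 1.0)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(counts, key=lambda f: (-score(f), f))


library = {
    "doc1": {"global warming", "greenhouse gas emissions", "ice core"},
    "doc2": {"global warming", "greenhouse gas emissions"},
    "doc3": {"ice core", "sea level"},
}
by_count = cooccurring_features(library, "global warming")
by_weight = cooccurring_features(library, "global warming", {"ice core": 3.0})
```

With plain counts, "greenhouse gas emissions" ranks ahead of "ice core"; supplying a larger weight for "ice core" reverses the order, which illustrates how a domain model could steer the recommendations.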

Illustration: Apollo Web Content Management Application for the domain “Climate Change”

“Climate Change” Domain Model
Vocabulary (phrases, keywords, idioms) identified for the domain from training documents collected from the web forms the building blocks of the model of the domain.
Modeling error arises from noise in the training data and can be reduced by input from human experts.

Apollo Prototype
Screenshot callouts: keyword filter; domain; extracted “Locations” across the collection of documents; document results of filtering; extracted “Keywords” or phrases across the collection of documents; automated document summary.

Inline Document View
Screenshot callouts: features extracted only for this document; filter interface; additional features.

Expanded Document View
Screenshot callouts: features extracted for this document; cached text of the document.

Automatically Generated Domain Vocabulary
Vocabulary is collated across the domain library. Importance changes as the library changes. Font size and thickness show domain importance.

Apollo Performance

Experiment Setup
The experiment used the Text Retrieval Conference (TREC) document collection from the 2002 filtering track [1]. The collection statistics were:
The collection contained documents from Reuters Corpus Volume 1.
There were 83,650 training and 723,141 testing documents.
There were 50 assessor topics and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories; the relevant documents for an intersection topic are taken to be those to which both category labels have been assigned.
The main metrics were T11F, an F-beta measure with a coefficient of 0.5, and T11SU, a normalized linear utility.
1. http://trec.nist.gov/data/filtering/T11filter_guide.html
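For readers unfamiliar with the two metrics, the sketch below computes them from per-topic counts. It assumes the usual F-beta formula with beta = 0.5 for T11F, and, for T11SU, a linear utility of 2 points per relevant retrieved document minus 1 per non-relevant retrieved document, normalized by the maximum possible utility and scaled with a floor of -0.5. These constants reflect our reading of the track guidelines; the guide at [1] is authoritative, and the function names are hypothetical.

```python
def t11f(rel_retrieved, retrieved, relevant, beta=0.5):
    """F-beta from counts: relevant retrieved, total retrieved, total relevant."""
    denom = retrieved + beta ** 2 * relevant
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * rel_retrieved / denom


def t11su(rel_retrieved, nonrel_retrieved, relevant, min_nu=-0.5):
    """Scaled, normalized linear utility (constants assumed, see lead-in)."""
    if relevant == 0:
        raise ValueError("utility is undefined for a topic with no relevant documents")
    utility = 2 * rel_retrieved - nonrel_retrieved  # linear utility
    normalized = utility / (2 * relevant)           # divide by max utility
    # Clip at the floor, then rescale so the result lies in [0, 1].
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)


# Example topic: 40 relevant documents; the filter retrieves 20, 10 of them relevant.
f_score = t11f(rel_retrieved=10, retrieved=20, relevant=40)
su_score = t11su(rel_retrieved=10, nonrel_retrieved=10, relevant=40)
```

Under these assumptions a filter that retrieves nothing scores 0 on T11F but about 0.33 on T11SU, so the two metrics penalize over-cautious filtering differently.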

Experiment
Each topic was set up as an independent domain in Apollo. Only the relevant documents from the topic's training set were used to create the topic signature. The topic signature was used to output a vector, called the filter vector, that comprised single-word terms weighted by their ranks. A comparison threshold was calculated from the mean and standard deviation of the cross products of the training documents with the filter vector. Different distributions were assumed in order to estimate appropriate thresholds. In addition, the number of documents to be selected was set to a multiple of the training sample size. The entire testing set was indexed using Lucene. For each topic, the test documents were compared with the topic filter vector using the cross product, in the document order prescribed by TREC.
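The filtering procedure can be illustrated with a toy sketch. Several details are assumptions rather than Apollo's actual method: terms are ranked by raw frequency in the relevant training documents, the rank weight is linear (top term gets 1.0), the "cross product" is read as an inner product between term counts and the filter vector, and the threshold is taken as the mean minus `k` standard deviations. All function names and the parameter `k` are hypothetical.

```python
import statistics
from collections import Counter


def build_filter_vector(training_docs, size=5):
    """Rank-weighted filter vector from relevant training documents."""
    counts = Counter(term for doc in training_docs for term in doc.split())
    # Rank by frequency (assumed), ties broken alphabetically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))[:size]
    n = len(ranked)
    return {term: (n - rank) / n for rank, term in enumerate(ranked)}


def score(doc, filter_vector):
    """Inner product of the document's term counts with the filter vector."""
    tf = Counter(doc.split())
    return sum(weight * tf[term] for term, weight in filter_vector.items())


def compute_threshold(training_docs, filter_vector, k=1.0):
    """Mean minus k standard deviations of the training scores (assumed form)."""
    scores = [score(doc, filter_vector) for doc in training_docs]
    return statistics.mean(scores) - k * statistics.pstdev(scores)


training = ["oil price rises", "oil exports fall", "crude oil price"]
fv = build_filter_vector(training)
threshold = compute_threshold(training, fv)

# Test documents are accepted when their score reaches the threshold.
accepted = [d for d in ["oil price climbs", "football match result"]
            if score(d, fv) >= threshold]
```

Choosing a different value of `k`, or a different assumed score distribution, shifts the trade-off between precision and recall, which is presumably why the experiment tried several distributions.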

Initial Results
Initial results show that Apollo's filtering effectiveness is competitive with TREC benchmarks.
Precision and recall can be improved by leveraging additional components of the signatures.
2. Cancedda et al., “Kernel Methods for Document Filtering,” in NIST Special Publication 500-251: Proceedings of the Eleventh Text Retrieval Conference, Gaithersburg, MD, 2002.

Topic Performance

Apollo Filtering Performance
Apollo training time was linear in the number and size of the training documents (number of training documents vs. average training time).
On average, the filtering time per document was constant (average test time).
