Comprehensive Overview of Information Filtering Systems

CS533 Information Retrieval Dr. Michal Cutler Lecture #21 April 25, 2000

Information Filtering • The filtering problem • The users and the user profiles • The relationship between information retrieval and filtering • Implementations (cluster based, information retrieval based, knowledge based, semi-automatic, collaborative)

The filtering problem • The task of a filtering system is to send interesting and useful data items to users • Basic assumptions: • Users must receive information in a timely fashion • There is a large number of users • New data items arrive frequently.

The filtering problem • Other terms used: • Information dissemination systems, • Alerting systems, • Routing systems

The users • Researchers discuss proactive and casual users. • A proactive user has specific information needs which may be expressed in a profile • The casual user may not have specific needs and thus finds it difficult to provide a profile

The users • Proactive users also vary between those who • need high recall - cannot afford to miss a relevant item (army analysts), • need high precision - prefer very few and good items

User profile • Who will create it? • What will it contain? • How much data?

User profile - who will create it? • “Experts” interviewing new users • Users experimenting and updating • What help should be provided to users for generating good profiles? • choosing good terms • making the profile more specific and narrow to enhance precision • making the profile more general to enhance recall

User profile - who will create it? • Semi-automatic • The user supplies one or more seed documents, which the system uses to generate a profile • The user’s behavior is monitored and used to create a profile

User profile - who will create it? • Automatic • Users may click on some pages and ignore others • Spend more time reading some pages than others • Save/print certain clicked pages • Follow links on clicked pages to reach more pages • This behavior can be used to automatically learn and update a user’s profile

Morita et al. • Monitor user behavior to identify interesting and uninteresting articles • Reading time is used as the evidence • When a user spends more than t seconds reading an article, the system concludes that the user finds the article interesting

Morita et al. • Checked the relation between an article and its follow-up articles • Found that follow-ups are usually interesting if the original article was, and vice versa
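
The reading-time rule attributed to Morita et al. above can be sketched in a few lines. This is only an illustration: the `ReadEvent` record, the `label_interest` function, and the 20-second value chosen for t are assumptions made here, not details given in the lecture.

```python
from dataclasses import dataclass


@dataclass
class ReadEvent:
    article_id: str
    seconds: float  # time the user spent reading the article


def label_interest(events, t=20.0):
    """Label an article interesting when reading time exceeds t seconds.

    The default t is a hypothetical value chosen for this sketch.
    """
    return {e.article_id: e.seconds > t for e in events}


events = [ReadEvent("a1", 45.0), ReadEvent("a2", 3.5), ReadEvent("a3", 20.0)]
labels = label_interest(events)
```

Here `a1` is labeled interesting, while `a2` and `a3` are not, since the rule requires strictly more than t seconds.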

User profile - what will it contain? • Boolean “queries” and/or natural language descriptions with or without importance weights • Seed document/s • Search domain knowledge + text pattern rules with evidence values • Menus

User profile - how much information? • Filtering systems provide users with some ability to control the amount of information they receive • Number of docs - N • Because the database keeps changing, there may be many good documents at one time and very few at another • Returning N documents to the user in both cases is not a good solution

User profile - how much information? • A similarity threshold - T • It is not clear how the user decides on a threshold • Why 0.5 and not 0.3? • Is a document with 0.48 similarity not important?
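
The two controls discussed above (a fixed count N and a similarity threshold T) can be compared with a small sketch. The function names, document ids, and scores are made up for illustration only.

```python
def select_top_n(scored, n):
    """Return the n highest-scoring (doc, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


def select_by_threshold(scored, t):
    """Return every (doc, score) pair whose score is at least t."""
    return [(doc, score) for doc, score in scored if score >= t]


# Hypothetical similarity scores on a day with many good items
# and on a day with very few.
busy_day = [("d1", 0.91), ("d2", 0.85), ("d3", 0.80), ("d4", 0.12)]
quiet_day = [("d5", 0.48), ("d6", 0.10), ("d7", 0.05)]

top_busy = select_top_n(busy_day, 2)
top_quiet = select_top_n(quiet_day, 2)
thr_busy = select_by_threshold(busy_day, 0.5)
thr_quiet = select_by_threshold(quiet_day, 0.5)
```

With N = 2 the busy day drops a strong document (`d3`) and the quiet day delivers a weak one (`d6`). With T = 0.5 the busy day is handled well, but the quiet day returns nothing even though `d5` scores 0.48.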

Comparison with IR • Filtering systems assume repeated use of queries, versus a one time query assumed in IR • Creating a good profile is essential • Since user interests change, profile modification is also very important in filtering

Comparison with IR • The timeliness issue is more important for filtering than for IR • parallelism

Comparison with IR • IR assumes a relatively static database. • Filtering is mainly interested in selecting text from a dynamic data stream

Comparison with IR • IR takes advantage of collection statistics to generate stop words, an indexing vocabulary, and to compute good weights for document and query terms (tf*idf) • These statistics may not be available for filtering systems

Comparison with IR • Filtering systems tend to create an inverted index for the user profiles and not the data base
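
A minimal sketch of indexing the profiles rather than the documents, assuming each profile is just a set of terms. The names `build_profile_index` and `candidate_users` and the sample profiles are hypothetical.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Map each term to the set of users whose profile contains it."""
    index = defaultdict(set)
    for user, terms in profiles.items():
        for term in terms:
            index[term].add(user)
    return index


def candidate_users(index, document_terms):
    """Return users whose profile shares at least one term with the document."""
    users = set()
    for term in set(document_terms):
        users |= index.get(term, set())
    return users


profiles = {
    "ann": {"filtering", "retrieval"},
    "bob": {"cooking"},
    "cy": {"retrieval", "indexing"},
}
index = build_profile_index(profiles)
matched = candidate_users(index, ["new", "retrieval", "model"])
```

Each arriving document is treated as the query: only the profiles listed under its terms need to be scored, so the other profiles are never touched.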

Comparison with IR • Users of filtering systems may not have a specific purpose (entertainment) • Both IR and filtering systems deal with the query/profile vocabulary issues • In filtering some users may not be motivated or able to specify a profile

Types of filtering systems • User profile used for filtering • Profile provided by users • Profile learned automatically or semi automatically from user behavior • User profile and opinions of other users used for filtering

Implementations • Profile provided by users: • Cluster based (NetNews) • IR based (SIFT, Individual) • Knowledge based (Rubric, Topic by Verity)

Implementations • Profile learned from behavior (LSI, Autodesk) • Collaborative filtering and recommendation systems

Cluster based filtering (NetNews) • http://www.switch.ch/netnews/ • News items are classified into categories • A user subscribes to some categories, and from then on receives copies of all new items • Millions of users • Users may be interested in a much finer filtering capability

SIFT (Garcia Molina) • Based on WAIS (free software on the Internet) • The database of user profiles is indexed • A profile is a list of (term, importance) pairs plus a relevance threshold

SIFT (Garcia Molina) • Term weights use the augmented term frequency (0.5 + 0.5 tf / max tf) and idf; similarity is the inner product • Threshold used to increase efficiency • http://sift.stanford.edu www.reference.com • Offers two ways to assist users with profile construction

Assistance provided by SIFT for creating a profile • The user can run a candidate profile against the current day's articles and refine it iteratively until the good documents rise to the top • To help maintain profiles over time, the words which contributed to the selection of an article are highlighted • Users can select additional words which should not appear together with the profile word
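
The SIFT scoring ingredients listed above (augmented tf, idf, inner product, and a per-profile threshold) can be combined as in the sketch below. It assumes the tf weighting is applied to the incoming document and that idf values are already available in a dictionary; the slides do not specify either point, and the function names and numbers are illustrative.

```python
from collections import Counter


def doc_weights(tokens, idf):
    """Weight each document term as (0.5 + 0.5 * tf / max_tf) * idf."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    # Terms without a known idf get weight 0 (an assumption of this sketch).
    return {t: (0.5 + 0.5 * c / max_tf) * idf.get(t, 0.0) for t, c in tf.items()}


def score(profile, weights):
    """Inner product of profile importances and document term weights."""
    return sum(imp * weights.get(term, 0.0) for term, imp in profile.items())


def matches(profile, threshold, tokens, idf):
    """A document is delivered when its score reaches the profile threshold."""
    return score(profile, doc_weights(tokens, idf)) >= threshold


idf = {"filtering": 2.0, "news": 1.0}          # hypothetical idf values
profile = {"filtering": 1.0, "news": 0.5}      # (term, importance) pairs
tokens = ["filtering", "filtering", "news", "the"]
s = score(profile, doc_weights(tokens, idf))
```

For this input "filtering" gets weight (0.5 + 0.5 · 2/2) · 2.0 = 2.0 and "news" gets (0.5 + 0.5 · 1/2) · 1.0 = 0.75, so the score is 1.0 · 2.0 + 0.5 · 0.75 = 2.375.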

Individual (Commercial) • http://www.individual.com/ • Based on SMART • Domain and SMART experts manage • customer profiles • the company's extensive Topic Library collection.

Filtering stages • Thesaurus • The core SMART engine • The Post Processor.

Thesaurus stage • Adds semantic equivalents of important profile words • Can recognize highly relevant words that may be used infrequently in a story and give them more weight • Thesaurus represents hundreds of thousands of person-hours of data entry and analysis

The core SMART engine • The database is a set of vectors associated with query topics • Each document is sent to SMART as a query. • The similarity of the document to the query topics is computed • Relevance feedback is used to improve new query topics

Post Processing. • Subject specialists add fuzzy Boolean rules to customer profiles • P-Norm is used to compare story to fuzzy Boolean rules in customer’s profile
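
For reference, the equal-weight form of the p-norm extended Boolean model can be written as below. How Individual configures its rules and which value of p it uses are not stated in the slides, so the default p and the sample term weights are assumptions.

```python
def pnorm_or(weights, p=2.0):
    """p-norm OR over term weights in [0, 1] (equal clause weights)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """p-norm AND over term weights in [0, 1] (equal clause weights)."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# A story that contains one of two rule terms fully and the other not at all.
story = [1.0, 0.0]
or_score = pnorm_or(story)
and_score = pnorm_and(story)
```

With p = 1 both operators reduce to the average of the term weights; as p grows they approach the strict fuzzy max (OR) and min (AND), which is what makes the Boolean rules "fuzzy".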

Learning user profile from relevance feedback • Profile is learned from: • A set of old queries and a user’s selection of good documents • A collection of old documents • Profile is generated by using relevance feedback

Learning user profile from relevance feedback • In relevance feedback terms in good and possibly bad documents and old queries are used to generate a new query • In LSI a weighted sum of relevant documents is used as the user profile (expanded query) • The smaller number of concepts used in LSI helps in the feedback process
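
One common way to build a new query from an old query plus good and bad documents is Rocchio-style feedback. The slides do not name a specific formula, so the sketch below, including the `rocchio` name and the alpha, beta, gamma values, is an assumed illustration rather than the method used by any system discussed here.

```python
from collections import defaultdict


def rocchio(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*query + beta*mean(good_docs) - gamma*mean(bad_docs).

    Vectors are dicts mapping term -> weight. Terms whose final weight
    is not positive are dropped from the new profile.
    """
    new = defaultdict(float)
    for term, w in query.items():
        new[term] += alpha * w
    for docs, coeff in ((good_docs, beta), (bad_docs, -gamma)):
        for doc in docs:
            for term, w in doc.items():
                new[term] += coeff * w / len(docs)
    return {term: w for term, w in new.items() if w > 0}


old_query = {"filtering": 1.0}
good = [{"filtering": 1.0, "profile": 1.0}, {"profile": 1.0}]
bad = [{"cooking": 1.0}]
learned = rocchio(old_query, good, bad)
```

The learned profile keeps "filtering" (weight 1.375), picks up "profile" (weight 0.75) from the good documents, and leaves out "cooking", which appears only in a bad document.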

Recommendation systems • Systems that recommend restaurants, movies, etc. • Here the recommendation systems will recommend: • good documents, • good URLs, or • authors of documents, etc.

Recommendation systems • From (Miller 96): Collaborative filtering systems make use of the reactions and opinions of people who have already seen a piece of information to make predictions about the value of that piece of information for people who have not yet seen it.

Recommendation systems • Collaborative filtering systems often recommend to a user (or for a query) the documents that were liked (found useful) by similar users (e.g., users who have similar profiles) or for similar queries.

Contents of a recommendation • Can be a numeric value assigned by users to rate a document (explicit) • Mention of a person, a URL, or a citation of a document (mining) • Value derived automatically by observing user behavior (monitoring)

Learning interesting documents by monitoring • When many users read, save, or print a document, there is evidence that it is interesting • When many users ignore a document, or click on it and spend only a short time on it, this indicates an uninteresting document

The privacy issue • A lot can be learned about users by observing their behavior • Users may not want other users to know which material they read • Users may not want authors to know who evaluated their work • Some systems allow the use of pseudonyms

The privacy issue • The credibility of a recommendation can be enhanced by including the names of the users who recommended or rejected the material • In this case recommendations are attributed

The use of a recommendation • Some systems display the recommendations alongside articles • Other systems use the recommendations in order to select the documents which will be returned to a user

Aggregation of recommendations • Combining multiple recommendations into a useful measure. • Personalized weighting based on past agreement among recommenders • Personalized weighting combined with content analysis • Count the number of recommenders, or the frequency of mention of URLs or documents

Collaborative (Tapestry) • First system to use the notion of collaboration for filtering • Developed at Xerox Palo Alto to control volume of email sent to users • Innovation is in the use of user reactions to messages (stored as annotations) for selecting messages for other users

Collaborative (Tapestry) • Messages are stored in a relational database • A user knows that Smith keeps track of documents in some area of interest • The system allows filtering on “documents replied to by Smith” • This means that outgoing email messages become part of the selection process • Filtering becomes an iterative process

Collaborative (Tapestry) • A filter can contain some keywords with the added condition of 3 or more endorsements • Users can write ad-hoc queries or filter queries to receive data • A user can ask to use someone else's filter • Uses its own query language, which is similar to first-order logic
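
The kind of filter described above (keywords plus a minimum number of endorsements) can be imitated in Python as follows. This is not Tapestry's query language; the message fields (`id`, `text`, `endorsed_by`) and the function name are hypothetical.

```python
def keyword_endorsement_filter(messages, keywords, min_endorsements=3):
    """Select messages that mention a keyword and have enough endorsers."""
    wanted = {k.lower() for k in keywords}
    selected = []
    for msg in messages:
        words = set(msg["text"].lower().split())
        endorsers = set(msg["endorsed_by"])  # count each endorser once
        if words & wanted and len(endorsers) >= min_endorsements:
            selected.append(msg["id"])
    return selected


messages = [
    {"id": 1, "text": "New filtering paper", "endorsed_by": ["smith", "lee", "kim"]},
    {"id": 2, "text": "Filtering workshop", "endorsed_by": ["smith"]},
    {"id": 3, "text": "Lunch menu", "endorsed_by": ["a", "b", "c", "d"]},
]
picked = keyword_endorsement_filter(messages, ["filtering"])
```

Only message 1 passes: message 2 has too few endorsements and message 3 does not mention the keyword.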

PHOAKS (People Helping One Another Know Stuff) • Recommends URLs. • Mining: Mention of a URL in a news article is used except for: • URLs in headers and quoted sections. • Articles posted to too many newsgroups. • URLs in announcements or ads. • Aggregation: number of distinct recommenders of each URL.
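
The mining and aggregation rules listed above can be sketched as a filter followed by a count of distinct recommenders per URL. The record fields and the cutoff for "too many newsgroups" are assumptions; the slides do not give the actual cutoff.

```python
from collections import defaultdict


def count_distinct_recommenders(mentions, max_newsgroups=5):
    """Count distinct authors recommending each URL after applying the exclusions.

    max_newsgroups is a hypothetical cutoff for heavily cross-posted articles.
    """
    recommenders = defaultdict(set)
    for m in mentions:
        if m["in_header_or_quote"] or m["is_ad"]:
            continue  # URL appears in a header, a quoted section, or an ad
        if m["newsgroup_count"] > max_newsgroups:
            continue  # article posted to too many newsgroups
        recommenders[m["url"]].add(m["author"])
    return {url: len(authors) for url, authors in recommenders.items()}


def mention(url, author, quoted=False, ad=False, groups=1):
    """Build one mention record (helper for the example data)."""
    return {"url": url, "author": author, "in_header_or_quote": quoted,
            "is_ad": ad, "newsgroup_count": groups}


mentions = [
    mention("http://a.example", "ann"),
    mention("http://a.example", "ann"),            # same person again
    mention("http://a.example", "bob"),
    mention("http://a.example", "cy", quoted=True),
    mention("http://b.example", "dee", ad=True),
    mention("http://b.example", "eve", groups=12),
    mention("http://c.example", "fay"),
]
counts = count_distinct_recommenders(mentions)
```

Repeated mentions by the same author count once, so `http://a.example` ends up with two recommenders, and `http://b.example` does not appear at all because both of its mentions are excluded.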

GroupLens • Collaborative filtering for Usenet news • Used for rec.humor, rec.food.recipes, rec.arts.movies.current-films, etc. • Recommendations are both explicit, by providing a rating of 1-5, and implicit, by monitoring reading time • Recommendations are displayed alongside an article reference

GroupLens • Pseudonyms are used • Selects a group of people to act as personal moderators • The moderators are users with whom you have had substantial agreement on past articles • When a user fetches articles from a newsgroup, evaluation predictions are displayed • The user may enter ratings • The ratings serve as input for predicting the value for other users and for correlating the user with other users
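
The idea of correlating users on past ratings and then predicting a rating for an unseen article can be sketched with a Pearson-correlation weighted average. The slides only say that users are correlated, so the exact formula, the function names, and the sample ratings below are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation over the articles both users have rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = math.sqrt(sum((a[i] - ma) ** 2 for i in common)
                    * sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, user, article):
    """Predict a rating as the user's mean plus weighted deviations of others."""
    mine = ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or article not in theirs:
            continue
        w = pearson(mine, theirs)  # agreement on past articles
        their_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[article] - their_mean)
        den += abs(w)
    return my_mean + num / den if den else my_mean


ratings = {
    "u": {"a": 5, "b": 1, "c": 4},
    "v": {"a": 5, "b": 1, "c": 4, "d": 5},   # agrees with u
    "w": {"a": 1, "b": 5, "c": 2, "d": 1},   # disagrees with u
}
prediction = predict(ratings, "u", "d")
```

User `v` agrees with `u` and liked article `d`; user `w` consistently disagrees with `u` and disliked it. Both facts push the prediction above `u`'s average rating. The sketch does not clip the result to the 1-5 scale.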
