
CS533 Information Retrieval

CS533 Information Retrieval. Dr. Michal Cutler. Lecture #21, April 25, 2000. Information Filtering: the filtering problem; the users and the user profiles; the relationship between information retrieval and filtering.


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #21 April 25, 2000

  2. Information Filtering • The filtering problem • The users and the user profiles • The relationship between information retrieval and filtering • Implementations (cluster based, information retrieval based, knowledge based, semi-automatic, collaborative)

  3. The filtering problem • The task of a filtering system is to send interesting and useful data items to users • Basic assumptions: • Users must receive information in a timely fashion • There is a large number of users • New data items arrive frequently.

  4. The filtering problem • Other terms used: • Information dissemination systems, • Alerting systems, • Routing systems

  5. The users • Researchers distinguish between proactive and casual users. • A proactive user has specific information needs, which may be expressed in a profile • The casual user may not have specific needs and may thus find it difficult to provide a profile

  6. The users • Proactive users also vary between those who • need high recall - cannot afford to miss a relevant item (army analysts), • need high precision - prefer very few and good items

  7. User profile • Who will create it? • What will it contain? • How much data?

  8. User profile - who will create it? • “Experts” interviewing new users • Users experimenting and updating • What help should be provided to users for generating good profiles? • choosing good terms • making the profile more specific and narrow to enhance precision • making the profile more general to enhance recall

  9. User profile - who will create it? Semi-automatic • The user supplies one or more seed documents, which the system uses to generate a profile • The user’s behavior is monitored and used to create a profile

  10. User profile - who will create it? Automatic • Users may click on some pages and ignore others • Spend more time reading some pages than others • Save/print certain clicked pages • Follow links on clicked pages to reach more pages • This behavior can be used to automatically learn and update a user’s profile

  11. Morita et al. • Monitor user behavior to identify interesting and uninteresting papers • Time spent reading is used as the evidence • When a user spends more than t seconds reading an article, the system concludes that the user considers the article interesting

  12. Morita et al. • Checked the relation between an article and its follow-up articles and found that a follow-up is usually interesting if the original article was, and vice versa
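
The reading-time rule above can be sketched in a few lines of Python. This is only an illustration: the function name `infer_interest`, the article ids, and the default threshold of 20 seconds are hypothetical, since the slide gives no value for t.

```python
def infer_interest(reading_times, t=20.0):
    """Label an article interesting if it was read for more than t seconds.

    reading_times maps an article id to the seconds spent reading it.
    The default t is a placeholder, not a value taken from the lecture.
    """
    return {article: seconds > t for article, seconds in reading_times.items()}


# Hypothetical reading log for one user.
times = {"a1": 45.0, "a2": 3.5, "a3": 20.0}
labels = infer_interest(times)
print(labels)
```

An article read for exactly t seconds is not labeled interesting here, because the slide says "more than t seconds".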

  13. User profile - what will it contain? • Boolean “queries” and/or natural language descriptions with or without importance weights • Seed document/s • Search domain knowledge + text pattern rules with evidence values • Menus

  14. User profile - how much information? • Filtering systems give users some ability to control the amount of information they receive • Number of docs - N • Because the database keeps changing, there may be many good documents at one time and very few at another. • Returning N documents to the user in both cases is not a good solution

  15. User profile - how much information? • A similarity threshold - T • It is not clear how the user decides on a threshold. • Why 0.5 and not 0.3? • Is a document with 0.48 similarity not important?
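
The trade-off between the two controls (a fixed count N and a similarity threshold T) can be seen in a small Python sketch. The function names, document ids, and scores below are made up for illustration.

```python
def select_top_n(scored, n):
    """Return the n highest-scoring (doc_id, similarity) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


def select_by_threshold(scored, t):
    """Return every (doc_id, similarity) pair whose similarity is at least t."""
    return [pair for pair in scored if pair[1] >= t]


# Hypothetical similarity scores on a day with many good documents
# and on a day with none.
busy_day = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.48)]
slow_day = [("d5", 0.2), ("d6", 0.1)]

print(select_top_n(busy_day, 2))           # drops the good document d3
print(select_top_n(slow_day, 2))           # delivers two poor documents
print(select_by_threshold(busy_day, 0.5))  # misses d4 at 0.48
print(select_by_threshold(slow_day, 0.5))  # delivers nothing
```

With a fixed N the user gets the same volume on both days regardless of quality; with a threshold the volume adapts, but a borderline document such as `d4` is cut off by an arbitrary choice of T.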

  16. Comparison with IR • Filtering systems assume repeated use of queries, versus a one time query assumed in IR • Creating a good profile is essential • Since user interests change, profile modification is also very important in filtering

  17. Comparison with IR • The timeliness issue is more important for filtering than for IR • parallelism

  18. Comparison with IR • IR assumes a relatively static database. • Filtering is mainly interested in selecting text from a dynamic data stream

  19. Comparison with IR • IR takes advantage of collection statistics to generate stop words, an indexing vocabulary, and to compute good weights for document and query terms (tf*idf) • These statistics may not be available for filtering systems

  20. Comparison with IR • Filtering systems tend to create an inverted index for the user profiles and not the data base
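
A minimal Python sketch of this reversal, assuming each profile is just a set of terms: the index maps a term to the users whose profiles contain it, and each arriving document is used as the lookup key. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Map each profile term to the set of users whose profile contains it."""
    index = defaultdict(set)
    for user, terms in profiles.items():
        for term in terms:
            index[term].add(user)
    return index


def candidate_users(index, document_terms):
    """Return the users whose profile shares at least one term with the document."""
    users = set()
    for term in set(document_terms):
        users |= index.get(term, set())
    return users


# Hypothetical profiles and one incoming document.
profiles = {
    "alice": {"filtering", "retrieval"},
    "bob": {"recipes"},
    "carol": {"retrieval", "news"},
}
index = build_profile_index(profiles)
print(candidate_users(index, ["news", "about", "retrieval"]))
```

Only the candidate users then need a full similarity computation, which matters when there are many users and documents arrive frequently.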

  21. Comparison with IR • Users of filtering systems may not have a specific purpose (entertainment) • Both IR and filtering systems deal with the query/profile vocabulary issues • In filtering some users may not be motivated or able to specify a profile

  22. Types of filtering systems • User profile used for filtering • Profile provided by users • Profile learned automatically or semi-automatically from user behavior • User profile and opinions of other users used for filtering

  23. Implementations • Profile provided by users: • Cluster based (NetNews) • IR based (SIFT, Individual) • Knowledge based (Rubric, Topic by Verity)

  24. Implementations • Profile learned from behavior (LSI, Autodesk) • Collaborative filtering and recommendation systems

  25. Cluster based filtering (NetNews) • http://www.switch.ch/netnews/ • News items are classified into categories • A user subscribes to some categories, and from then on receives copies of all new items • Millions of users • Users may be interested in a much finer filtering capability

  26. SIFT (Garcia Molina) • Based on WAIS (free software on the Internet) • The database of user profiles is indexed • A profile is a list of (term, importance) pairs plus a relevance threshold

  27. SIFT (Garcia Molina) • Uses the term weight (0.5 + 0.5 tf / max tf), idf, and an inner product • Threshold used to increase efficiency • http://sift.stanford.edu www.reference.com • Offers two ways to assist users with profile construction
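
The ingredients named on this slide can be combined in a short Python sketch. It is not SIFT's actual code: the idf form log(N/df), the source of the document frequencies, and all function names are assumptions made for illustration.

```python
import math


def document_weights(term_counts, doc_freq, n_docs):
    """Weight each document term by (0.5 + 0.5 * tf / max_tf) * idf.

    idf is taken as log(n_docs / df), which is an assumed form. doc_freq
    must contain every term of the document; how a filtering system
    estimates these statistics is left open here.
    """
    max_tf = max(term_counts.values())
    weights = {}
    for term, tf in term_counts.items():
        idf = math.log(n_docs / doc_freq[term])
        weights[term] = (0.5 + 0.5 * tf / max_tf) * idf
    return weights


def profile_score(profile, weights):
    """Inner product of (term, importance) pairs with the document weights."""
    return sum(importance * weights.get(term, 0.0)
               for term, importance in profile.items())


def matches(profile, threshold, weights):
    """Deliver the document only if its score reaches the profile's threshold."""
    return profile_score(profile, weights) >= threshold


# Hypothetical document, statistics, and profile.
term_counts = {"filtering": 4, "news": 2, "the": 4}
doc_freq = {"filtering": 10, "news": 100, "the": 1000}
w = document_weights(term_counts, doc_freq, n_docs=1000)
profile = {"filtering": 1.0, "sports": 0.5}
print(profile_score(profile, w), matches(profile, 2.0, w))
```

A term that occurs in every document (here "the") gets idf 0 and contributes nothing, whatever its frequency in the story.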

  28. Assistance provided by SIFT for creating a profile • The user can run a candidate profile against present-day articles and iteratively refine the profile to force good documents to the top • To help maintain profiles over time, words which contributed to the selection of an article are highlighted. Users can select additional words which should not appear with the profile word

  29. Individual (Commercial) • http://www.individual.com/ • Based on SMART • Domain and SMART experts manage • customer profiles • the company's extensive Topic Library collection.

  30. Filtering stages • Thesaurus • The core SMART engine • The Post Processor.

  31. Thesaurus stage • Adds semantic equivalents of important profile words • Can recognize highly relevant words that may be used infrequently in a story and give them more weight • Thesaurus represents hundreds of thousands of person-hours of data entry and analysis

  32. The core SMART engine • The database is a set of vectors associated with query topics • Each document is sent to SMART as a query. • The similarity of the document to the query topics is computed • Relevance feedback is used to improve new query topics

  33. Post Processing. • Subject specialists add fuzzy Boolean rules to customer profiles • P-Norm is used to compare story to fuzzy Boolean rules in customer’s profile
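
Assuming that "P-Norm" refers to the extended Boolean model of Salton, Fox, and Wu, the comparison can be sketched in Python for the simplest case of unweighted rule terms. How Individual actually combined rules and weights is not stated on the slide, and the function names are hypothetical.

```python
def pnorm_or(term_weights, p):
    """P-norm similarity of a story to an OR of terms.

    term_weights are the story's weights in [0, 1] for the rule's terms.
    """
    total = sum(w ** p for w in term_weights)
    return (total / len(term_weights)) ** (1.0 / p)


def pnorm_and(term_weights, p):
    """P-norm similarity of a story to an AND of terms."""
    total = sum((1.0 - w) ** p for w in term_weights)
    return 1.0 - (total / len(term_weights)) ** (1.0 / p)


# A story that contains the first rule term fully and the second not at all.
story = [1.0, 0.0]
for p in (1, 2, 50):
    print(p, pnorm_or(story, p), pnorm_and(story, p))
```

With p = 1 both operators reduce to the average of the term weights, so AND and OR are indistinguishable; as p grows, OR approaches the maximum weight and AND approaches the minimum, which is strict Boolean behavior. Intermediate values of p give the "fuzzy" rules the slide mentions.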

  34. Learning user profile from relevance feedback • Profile is learned from: • A set of old queries and a user’s selection of good documents • A collection of old documents • Profile is generated by using relevance feedback

  35. Learning user profile from relevance feedback • In relevance feedback terms in good and possibly bad documents and old queries are used to generate a new query • In LSI a weighted sum of relevant documents is used as the user profile (expanded query) • The smaller number of concepts used in LSI helps in the feedback process
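
One common way to build a new query from good and bad documents is Rocchio-style feedback; the sketch below uses it as an example, without claiming it is the exact variant the lecture had in mind. The coefficient values and all names are assumptions.

```python
def mean_weight(docs, term):
    """Average weight of a term over a list of {term: weight} documents."""
    if not docs:
        return 0.0
    return sum(doc.get(term, 0.0) for doc in docs) / len(docs)


def rocchio(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward good documents and away from bad ones.

    alpha, beta, and gamma are illustrative defaults. Terms whose new
    weight is not positive are dropped from the profile.
    """
    terms = set(query)
    for doc in good_docs + bad_docs:
        terms |= set(doc)
    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * mean_weight(good_docs, term)
                  - gamma * mean_weight(bad_docs, term))
        if weight > 0:
            new_query[term] = weight
    return new_query


# Hypothetical old query and judged documents.
old_query = {"filtering": 1.0}
good = [{"filtering": 1.0, "profile": 1.0}, {"profile": 1.0}]
bad = [{"sports": 1.0}]
new_profile = rocchio(old_query, good, bad)
print(new_profile)
```

The new profile gains the term "profile" from the good documents, while "sports", seen only in a bad document, gets a negative weight and is left out.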

  36. Recommendation systems • Systems that recommend restaurants, movies, etc. • Here the recommendation systems will recommend: • good documents, • good URLs, or • authors of documents, etc.

  37. Recommendation systems • From (Miller 96): Collaborative filtering systems make use of the reactions and opinions of people who have already seen a piece of information to make predictions about the value of that piece of information for people who have not yet seen it.

  38. Recommendation systems • Collaborative filtering systems often recommend to a user (or for a query) documents that were liked (found useful) by similar users, e.g., users who have similar profiles (or who issued similar queries).

  39. Contents of a recommendation • Can be a numeric value assigned by users to rate a document (explicit) • Mention of a person, a URL, or a citation of a document (mining) • Value derived automatically by observing user behavior (monitoring)

  40. Learning interesting documents by monitoring • When many users read, save, or print a document, there is evidence that it is interesting • When many users ignore a document, or click on it and spend only a short time on it, this indicates an uninteresting document

  41. The privacy issue • A lot can be learned about users by observing their behavior • Users may not want other users to know which material they read • Users may not want authors to know who evaluated their work • Some systems allow the use of pseudonyms

  42. The privacy issue • The credibility of a recommendation can be enhanced by including the names of the users who recommended or rejected material • In this case recommendations are attributed

  43. The use of a recommendation • Some systems display the recommendations alongside articles • Other systems use the recommendations in order to select the documents which will be returned to a user

  44. Aggregation of recommendation • Combining multiple recommendations into a useful measure. • Personalized weighting based on past agreement among recommenders • Personalized weighting combined with content analysis • Count the number of recommenders, or the frequency of mention of URLs or documents

  45. Collaborative (Tapestry) • First system to use the notion of collaboration for filtering • Developed at Xerox Palo Alto to control volume of email sent to users • Innovation is in the use of user reactions to messages (stored as annotations) for selecting messages for other users

  46. Collaborative (Tapestry) • Messages are stored in a relational database • A user knows that Smith keeps track of documents in some area of interest • The system allows filtering on “documents replied to by Smith” • This means that outgoing email messages become part of the selection process • Filtering becomes an iterative process

  47. Collaborative (Tapestry) • A filter can contain some keywords with the added condition of 3 or more endorsements • Users can write ad-hoc queries or filter queries to receive data • A user can ask to use someone else's filter • Uses its own query language, which is similar to first-order logic
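
The "keywords plus 3 or more endorsements" filter can be approximated in Python as below. This is not Tapestry's query language: the message layout, the field names `text` and `endorsed_by`, and the choice to match on any one keyword are assumptions made for the sketch.

```python
def keyword_endorsement_filter(messages, keywords, min_endorsements=3):
    """Select ids of messages that mention a keyword and have enough endorsements.

    Each message is a dict with hypothetical fields "id", "text", and
    "endorsed_by" (the set of users who endorsed it).
    """
    selected = []
    for msg in messages:
        words = set(msg["text"].lower().split())
        if keywords & words and len(msg["endorsed_by"]) >= min_endorsements:
            selected.append(msg["id"])
    return selected


# Hypothetical annotated messages.
messages = [
    {"id": 1, "text": "New filtering paper", "endorsed_by": {"smith", "lee", "kim"}},
    {"id": 2, "text": "Filtering survey", "endorsed_by": {"smith"}},
    {"id": 3, "text": "Lunch menu", "endorsed_by": {"a", "b", "c", "d"}},
]
print(keyword_endorsement_filter(messages, {"filtering"}))
```

Message 2 matches the keyword but has too few endorsements, and message 3 is well endorsed but off topic, so only message 1 is selected.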

  48. PHOAKS (People Helping One Another Know Stuff) • Recommends URLs. • Mining: Mention of a URL in a news article is used except for: • URLs in headers and quoted sections. • Articles posted to too many newsgroups. • URLs in announcements or ads. • Aggregation: number of distinct recommenders of each URL.
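
The aggregation step, counting distinct recommenders per URL, is easy to sketch in Python. The input is assumed to be (author, URL) mentions that have already passed the mining rules listed above; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def count_distinct_recommenders(mentions):
    """Count how many different authors mentioned each URL.

    mentions is an iterable of (author, url) pairs. Repeated mentions
    of a URL by the same author count once.
    """
    recommenders = defaultdict(set)
    for author, url in mentions:
        recommenders[url].add(author)
    return {url: len(authors) for url, authors in recommenders.items()}


# Hypothetical mentions mined from news articles.
mentions = [
    ("ann", "http://a.example"),
    ("ann", "http://a.example"),
    ("bob", "http://a.example"),
    ("bob", "http://b.example"),
]
counts = count_distinct_recommenders(mentions)
print(counts)
```

Counting distinct authors rather than raw mentions keeps one enthusiastic poster from inflating a URL's score.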

  49. GroupLens • Collaborative filtering for Usenet news • Used for rec.humor, rec.food.recipes, rec.arts.movies.current-films, etc. • Recommendations are both explicit, by providing a rating of 1-5, and implicit, by monitoring reading time • Recommendations are displayed alongside a reference

  50. GroupLens • Pseudonyms are used • Selects a group of people to act as personal moderators • The moderators are users with whom you have substantial agreement on past articles • When a user fetches articles from a newsgroup, evaluation predictions are displayed • The user may enter ratings • The ratings serve as input for predicting the value for other users and for correlating the user with other users
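
A correlation-weighted prediction in the spirit of this description can be sketched in Python. The Pearson correlation over co-rated articles and the mean-offset prediction formula are a common formulation, used here as an assumption and not taken from the slides; all names and ratings are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation of two users over the articles both have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = mean(a[i] for i in common)
    mean_b = mean(b[i] for i in common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common)
                    * sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(user, others, article):
    """Predict the user's rating from other users' deviations from their means."""
    user_mean = mean(user.values())
    num = 0.0
    den = 0.0
    for other in others:
        if article not in other:
            continue
        weight = pearson(user, other)
        num += weight * (other[article] - mean(other.values()))
        den += abs(weight)
    if den == 0.0:
        return user_mean
    return user_mean + num / den


# Hypothetical 1-5 ratings; the user has not yet seen article "a3".
user = {"a1": 5, "a2": 1}
agrees = {"a1": 4, "a2": 2, "a3": 5}
disagrees = {"a1": 1, "a2": 5, "a3": 1}
prediction = predict(user, [agrees, disagrees], "a3")
print(prediction)
```

The user who agreed on past articles pulls the prediction up with a high rating, and the user who consistently disagreed also pulls it up, because a low rating from a negatively correlated user counts as evidence in favor of the article.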
