
Information Retrieval


Presentation Transcript


  1. Information Retrieval March 25, 2005 Handout #11

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M 11-12 & Th 12-1 or via email • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

  3. Text classification

  4. Introduction • Text classification: assigning documents to predefined categories • Hierarchical vs. flat • Many techniques: generative (Naïve Bayes) vs. discriminative (maxent/logistic regression, kNN, SVM) • Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.

  5. Instance-based models: kNN • K-nearest neighbors • Very easy to program • Issues: choosing k and the threshold b
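
kNN really is easy to program. Below is a minimal, illustrative sketch (not the course's reference implementation): cosine similarity over raw term counts and a majority vote among the k nearest training documents. The toy training data and function names are invented for the example.

```python
# Minimal kNN text classifier: cosine similarity over term counts,
# majority vote among the k most similar training documents.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query: str, train: list, k: int = 3) -> str:
    """Return the majority label among the k training docs most similar to query."""
    q = Counter(query.lower().split())
    ranked = sorted(((cosine(q, Counter(text.lower().split())), label)
                     for text, label in train), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [("cheap pills buy now", "spam"), ("meeting agenda attached", "ham"),
         ("buy cheap watches now", "spam"), ("lunch meeting tomorrow", "ham")]
print(knn_classify("buy pills now", train))   # -> spam
```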

  6. Feature selection: The 2 test • For a term t: • Testing for independence:P(C=0,It=0) should be equal to P(C=0) P(It=0) • P(C=0) = (k00+k01)/n • P(C=1) = 1-P(C=0) = (k10+k11)/n • P(It=0) = (k00+K10)/n • P(It=1) = 1-P(It=0) = (k01+k11)/n

  7. Feature selection: The 2 test • High values of 2 indicate lower belief in independence. • In practice, compute 2 for all words and pick the top k among them.

  8. Feature selection: mutual information • No document length scaling is needed • Documents are assumed to be generated according to the multinomial model
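
The formula itself was a figure on the original slide and did not survive extraction. One standard formulation of the mutual information between the term indicator It and the class C (an assumption about what the slide showed, stated in the same notation as the χ² slides) is:

```latex
MI(t; C) = \sum_{c \in \{0,1\}} \sum_{i \in \{0,1\}}
           P(I_t = i,\, C = c) \,
           \log \frac{P(I_t = i,\, C = c)}{P(I_t = i)\, P(C = c)}
```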

  9. Naïve Bayesian classifiers • Naïve Bayesian classifier: choose the class c maximizing P(c) P(d|c) • Assuming statistical independence of the terms given the class, P(d|c) factors into a product of per-term probabilities
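
The slide's equations were images; as a hedged stand-in, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. The function names and whitespace tokenization are choices made for the example, not the course's code.

```python
# Multinomial Naive Bayes with add-one (Laplace) smoothing, in log space.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns log-priors, log-likelihoods, vocab."""
    n_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)            # class -> term -> count
    vocab = set()
    for text, label in docs:
        terms = text.lower().split()
        term_counts[label].update(terms)
        vocab.update(terms)
    log_prior = {c: math.log(n / len(docs)) for c, n in n_docs.items()}
    V = len(vocab)
    log_lik = {}
    for c in n_docs:
        total = sum(term_counts[c].values())
        log_lik[c] = {t: math.log((term_counts[c][t] + 1) / (total + V))
                      for t in vocab}
    return log_prior, log_lik, vocab

def classify_nb(text, log_prior, log_lik, vocab):
    """argmax over classes of log P(c) + sum_t log P(t|c) (independence assumption)."""
    terms = [t for t in text.lower().split() if t in vocab]
    return max(log_prior, key=lambda c: log_prior[c] + sum(log_lik[c][t] for t in terms))
```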

  10. Spam recognition

Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

  11. Well-known datasets • 20 newsgroups (/data0/projects/graph/20ng) • http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/ • Reuters-21578 (/data2/corpora/reuters21578) • Categories: grain, acquisitions, corn, crude, wheat, trade… • WebKB (/data2/corpora/webkb) • http://www-2.cs.cmu.edu/~webkb/ • Classes: course, student, faculty, staff, project, dept, other • NB performance on WebKB (2000), precision/recall per class: course 26/83, student 43/75, faculty 18/77, staff 6/9, project 13/73, dept 2/100, other 94/35

  12. Support vector machines • Introduced by Vapnik in the early 90s.

  13. Semi-supervised learning • EM • Co-training • Graph-based

  14. Exploiting Hyperlinks – Co-training • Each document instance has two alternate views (Blum and Mitchell 1998) • terms in the document, x1 • terms in the hyperlinks that point to the document, x2 • Each view alone is sufficient to determine the class of the instance • The labeling function that classifies examples is the same whether applied to x1 or x2 • x1 and x2 are conditionally independent, given the class [Slide from Pierre Baldi]

  15. Co-training Algorithm • Labeled data are used to infer two Naïve Bayes classifiers, one for each view • Each classifier will • examine unlabeled data • pick the most confidently predicted positive and negative examples • add these to the labeled examples • Classifiers are now retrained on the augmented set of labeled examples [Slide from Pierre Baldi]
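
A hedged sketch of the loop just described, assuming binary labels in {0, 1} and two classifiers with scikit-learn-style fit/predict_proba interfaces; moving exactly one positive and one negative example per view per round is a simplification of Blum and Mitchell's schedule.

```python
def co_train(labeled, unlabeled, clf1, clf2, rounds=10):
    """labeled: list of ((x1, x2), y) with y in {0, 1}; unlabeled: list of (x1, x2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view, clf in ((0, clf1), (1, clf2)):
            # Retrain this view's classifier on the (growing) labeled set.
            clf.fit([x[view] for x, _ in labeled], [y for _, y in labeled])
            if len(unlabeled) < 2:
                continue
            # Confidence of the positive class for every unlabeled example.
            probs = [clf.predict_proba([u[view]])[0][1] for u in unlabeled]
            pos = max(range(len(probs)), key=probs.__getitem__)
            neg = min(range(len(probs)), key=probs.__getitem__)
            if pos == neg:
                continue
            # Move the most confident positive and negative into the labeled set
            # (pop the higher index first so the lower one stays valid).
            for idx, y in sorted([(pos, 1), (neg, 0)], reverse=True):
                labeled.append((unlabeled.pop(idx), y))
    return clf1, clf2
```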

  16. Additional topics • Soft margins • VC dimension • Kernel methods

  17. Conclusion • SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters. • NB is also good in many circumstances

  18. Information extraction

  19. Information Extraction • Automatically extract structured information from unstructured text on Web pages • Represent extracted information in some well-defined schema • E.g. • crawl the Web searching for information about certain technologies or products of interest • extract information on authors and books from various online bookstore and publisher pages [Slide from Pierre Baldi]

  20. Info Extraction as Classification • Represent each document as a sequence of words • Use a ‘sliding window’ of width k as input to a classifier • each of the k inputs is a word in a specific position • The system is trained on positive and negative examples (typically manually labeled) • Limitation: no account of sequential constraints • e.g. the ‘author’ field usually precedes the ‘address’ field in the header of a research paper • can be fixed by using stochastic finite-state models [Slide from Pierre Baldi]
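
A minimal sketch of the sliding-window setup: every k-word window becomes one candidate instance with position-indexed word features. The feature-naming scheme is invented for the example, and the downstream classifier is out of scope.

```python
# Turn a document into k-word windows with position-indexed word features.
def windows(tokens: list, k: int):
    """Yield (start_index, window) for every k-token window in the document."""
    for i in range(len(tokens) - k + 1):
        yield i, tokens[i:i + k]

def window_features(window: list) -> dict:
    """Each of the k inputs is a word in a specific position, as on the slide."""
    return {f"w{pos}={w.lower()}": 1 for pos, w in enumerate(window)}

doc = "Information Retrieval Dragomir R. Radev University of Michigan".split()
for i, win in windows(doc, k=3):
    feats = window_features(win)   # features for one candidate instance,
                                   # to be fed to any trained classifier
```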

  21. Hidden Markov Models Example: Classify short segments of text in terms of whether they correspond to the title, author names, addresses, affiliations, etc. [Slide from Pierre Baldi]

  22. Hidden Markov Model • Each state corresponds to one of the fields that we wish to extract • e.g. paper title, author name, etc. • The true Markov state sequence is hidden at parse time • we can only see noisy observations from each state • namely, the sequence of words from the document • Each state has a characteristic probability distribution over the set of all possible words • e.g. a specific distribution of words for the state ‘title’ [Slide from Pierre Baldi]

  23. Training HMMs • Given a sequence of words and an HMM • parse the observed sequence into a corresponding sequence of inferred states • Viterbi algorithm (sketched below) • Can be trained • in a supervised manner with manually labeled data • or bootstrapped using a combination of labeled and unlabeled data [Slide from Pierre Baldi]
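
A compact, log-space Viterbi sketch for parsing a word sequence into its most likely state sequence, assuming the initial (pi), transition (A) and emission (B) probabilities have already been estimated for all states; the 1e-12 floor for unseen words is an illustrative smoothing choice.

```python
import math

def viterbi(words, states, pi, A, B):
    """pi[s]: initial prob; A[r][s]: transition prob r->s; B[s][w]: emission prob."""
    # Log-probability of the best path ending in each state, for the first word.
    V = [{s: math.log(pi[s]) + math.log(B[s].get(words[0], 1e-12)) for s in states}]
    back = []                                     # backpointers per position
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: V[-1][r] + math.log(A[r][s]))
            col[s] = V[-1][best] + math.log(A[best][s]) + math.log(B[s].get(w, 1e-12))
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```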

  24. Human behavior on the Web [The slides in this section are from Pierre Baldi]

  25. Web data and measurement issues Background: • Important to understand how data is collected • Web data is collected automatically via software logging tools • Advantage: • No manual supervision required • Disadvantage: • Data can be skewed (e.g. due to the presence of robot traffic) • Important to identify robots (also known as crawlers, spiders)

  26. A time-series plot of Web requests: number of page requests per hour as a function of time, from the www.ics.uci.edu Web server logs during the first week of April 2002.

  27. Robot / human identification • Robot requests are identified by classifying page requests using a variety of heuristics • e.g. some robots self-identify in the server logs via the user-agent field, or reveal themselves by requesting the robots.txt file • Robots tend to explore a website in breadth-first fashion • Humans access web pages in depth-first fashion • Tan and Kumar (2002) discuss more techniques
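
Two of these heuristics in a small sketch; the record field names and the bot-substring list are assumptions about the log layout for illustration, not a standard API.

```python
KNOWN_BOT_AGENTS = ("googlebot", "slurp", "msnbot")   # illustrative, not exhaustive

def looks_like_robot(session: list) -> bool:
    """session: request records (dicts) with 'agent' and 'url' fields."""
    agent = session[0].get("agent", "").lower()
    if any(bot in agent for bot in KNOWN_BOT_AGENTS):
        return True                                # self-identifying user-agent
    if any(r["url"].endswith("robots.txt") for r in session):
        return True                                # polite crawlers fetch robots.txt
    return False
```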

  28. Robot / human identification • Robot traffic consists of two components • Periodic spikes (can overload a server) • requests by “bad” robots • A lower-level constant stream of requests • requests by “good” robots • Human traffic has • A day-of-week pattern: higher Monday to Friday • An hour-of-day pattern: peak around midday, low traffic from midnight to early morning

  29. Server-side data Data logging at Web servers • The Web server sends requested pages to the requesting browser • It can be configured to archive these requests in a log file, recording • URL of the page requested • Time and date of the request • IP address of the requester • Requester browser information (agent)

  30. Data logging at Web servers • Status of the request • Referrer page URL if applicable • Server-side log files • provide a wealth of information • require considerable care in interpretation • More information in Cooley et al. (1999), Mena (1999) and Shahabi et al. (2001)

  31. Page requests, caching, and proxy servers • In theory, the requesting browser requests a page from a Web server and the request is processed directly • In practice, the picture is complicated by • Other users • Browser caching • Dynamic addressing in the local network • Proxy-server caching

  32. Page requests, caching, and proxy servers A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.

  33. Identifying individual users from Web server logs • Useful to associate specific page requests to specific individual users • IP address most frequently used • Disadvantages • One IP address can belong to several users • Dynamic allocation of IP address • Better to use cookies • Information in the cookie can be accessed by the Web server to identify an individual user over time • Actions by the same user during different sessions can be linked together

  34. Identifying individual users from Web server logs • Commercial websites use cookies extensively • 90% of users have cookies enabled permanently on their browsers • However … • There are privacy issues – need implicit user cooperation • Cookies can be deleted / disabled • Another option is to enforce user registration • High reliability • Can discourage potential visitors

  35. Client-side data • Advantages of collecting data at the client side: • Direct recording of page requests (eliminates ‘masking’ due to caching) • Recording of all browser-related actions by a user (including visits to multiple websites) • More reliable identification of individual users (e.g. by login ID for multiple users on a single computer) • Preferred mode of data collection for studies of navigation behavior on the Web • Companies like comScore and Nielsen use client-side software to track home computer users • Zhu, Greiner and Häubl (2003) used client-side data

  36. Client-side data • Statistics like ‘time per session’ and ‘page-view duration’ are more reliable in client-side data • Some limitations • Even so, statistics like ‘page-view duration’ cannot be totally reliable, e.g. the user might step away to fetch coffee • Need explicit user cooperation • Typically recorded on home computers – may not reflect a complete picture of Web browsing behavior • Web surfing data can also be collected at intermediate points like ISPs and proxy servers • Can be used to create user profiles and to target advertising

  37. Handling massive Web server logs • Web server logs can be very large • A small university department website gets a million requests per month • Amazon and Google can get tens of millions of requests each day • Logs exceed main-memory capacity and are stored on disk • Time costs of data access place significant constraints on the types of analysis possible • In practice • Analyze a subset of the data • Filter out events and fields of no direct interest

  38. Empirical client-side studies of browsing behavior • Data for client-side studies are collected at the client side over a period of time • Reliable page-revisitation patterns can be gathered • Explicit user permission is required • Typically conducted at universities • The number of individuals is small • Can introduce bias because of the nature of the population being studied • Caution must be exercised when generalizing observations • Nevertheless, such studies provide good data for studying human behavior

  39. Early studies from 1995 to 1997 • The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997) • In both studies, data was collected by logging Web browser commands • The population consisted of faculty, staff and students • Both studies found that • clicking on hypertext anchors was the most common action • using the ‘back button’ was the second most common action

  40. Early studies from 1995 to 1997 • High probability of page revisitation (~0.58-0.61) • A lower bound, because page requests prior to the start of the studies are not accounted for • Humans are creatures of habit? • Content of the pages changed over time? • A strong recency effect (the page being revisited is usually one that was visited in the recent past) • Correlates with ‘back button’ usage • Similar repetitive actions are found in telephone number dialing, etc.

  41. The Cockburn and McKenzie study from 2002 • The previous studies are relatively old • The Web has changed dramatically in the intervening years • Cockburn and McKenzie (2002) provide a more up-to-date analysis • Analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months • The population studied consisted of faculty, staff and graduate students • The study found revisitation rates higher than the earlier 1995 and 1997 studies (~0.81) • The time window is three times that of the past studies

  42. The Cockburn and McKenzie study from 2002 • Is the revisitation rate less biased than in the previous studies? • Has human behavior changed from an exploratory mode to a utilitarian mode? • The more pages a user visits, the more requests there are for new pages • The most frequently requested page for each user can account for a relatively large fraction of his/her page requests • Useful to see the scatter plot of the distinct number of pages requested per user versus the total pages requested • A log-log plot is also informative

  43. The Cockburn and McKenzie study from 2002 The total number of pages requested versus page vocabulary size (number of distinct pages) for each of the 17 users in the Cockburn and McKenzie (2002) study

  44. The Cockburn and McKenzie study from 2002 The total number of pages requested versus page vocabulary size (number of distinct pages) for each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot)

  45. The Cockburn and McKenzie study from 2002 Bar chart of the ratio of the number of page requests for the most frequent page divided by the total number of page requests, for the 17 users in the Cockburn and McKenzie (2002) study

  46. Video-based analysis of Web usage • Byrne et al. (1999) analyzed video-taped recordings of eight different users over periods of 15 minutes to 1 hour • Audio descriptions by the users were combined with the video recordings of their screens for analysis • The study found that • users spent a considerable amount of time scrolling Web pages • users spent a considerable amount of time waiting for pages to load (~15% of time)

  47. Probabilistic models of browsing behavior • Useful to build models that describe the browsing behavior of users • Can generate insight into how we use Web • Provide mechanism for making predictions • Can help in pre-fetching and personalization

  48. Markov models for page prediction • General approach is to use a finite-state Markov chain • Each state can be a specific Web page or a category of Web pages • If only interested in the order of visits (and not in time), each new request can be modeled as a transition of states • Issues • Self-transition • Time-independence

  49. Markov models for page prediction • For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states • Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C; st is the state at position t (1 ≤ t ≤ L) • In general, P(s) = P(s1) ∏t=2..L P(st | s1, …, st−1) • Under a first-order Markov assumption, this simplifies to P(s) = P(s1) ∏t=2..L P(st | st−1) • This provides a simple generative model for producing sequential data

  50. Markov models for page prediction • If we denote Tij = P(st = j | st−1 = i), we can define an M x M transition matrix • Properties • Strong first-order assumption • Simple way to capture sequential dependence • If each page is a state and there are W pages, the matrix has O(W²) entries, and W can be of the order of 10^5 to 10^6 for the CS department of a university • To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model • Clustering can be done manually, based on the directory structure on the Web server, or automatically using clustering techniques
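
A small sketch of the estimation step: Tij is taken as the maximum-likelihood estimate from observed transition counts, after which next-page prediction is an argmax over the current state's row. The example reuses the toy sequence from slide 49; function names are illustrative.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequence: str) -> dict:
    """MLE of Tij = P(s_t = j | s_{t-1} = i) from one observed state sequence."""
    counts = defaultdict(Counter)
    for i, j in zip(sequence, sequence[1:]):
        counts[i][j] += 1                  # count each observed transition i -> j
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}

T = estimate_transitions("ABBCAABBCCBBAA")   # toy sequence from slide 49
print(max(T["B"], key=T["B"].get))           # most likely next state after B
```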
