
Policy Search for Focused Web Crawling



  1. Policy Search for Focused Web Crawling Charlie Stockman NLP Group Lunch August 28, 2003

  2. Outline • Focused Crawling • Reinforcement Learning • Policy Search • Results • Discussion and Future Work

  3. Focused Crawling • Web crawling is the automated searching of the web, usually with the purpose of retrieving pages for a search engine. • Large search engines like Google attempt to visit as many text pages as possible using strategies like breadth-first search. • Focused crawling aims to search only the promising parts of the web where specific targets will be found. • Internet portals like Citeseer might use focused crawling because they aim to answer only domain-specific queries.

  4. Why Use Focused Crawling? • Why don’t we just crawl the entire web? • There is a lot of data to save and index. (3 billion plus pages according to Google) • Crawling takes time, and you want to be able to revisit relevant pages that might change. (Our crawler has limited resources and crawls ~1 million pages a day. It would take us about 10 years to crawl the entire web.) • Why don’t we just use Google to search for our targets? • Google won’t allow you to make millions of queries a day to its search engine.

  5. Previous Approaches • Cho, Garcia-Molina, and Page (Computer Networks ‘98) • Uses Backlink and PageRank (weighted Backlink) metrics to decide which URLs to pursue. • Performed well when Backlink was used to measure the importance of a page. • Performed poorly for a similarity-based performance measure (e.g. whether the page was computer-related). • Chakrabarti, Van den Berg, and Dom (Computer Networks ‘99) • Works on the assumption that good pages are grouped together. • Uses a user-supplied hierarchical tree of categories, like Yahoo!’s. • Uses bag-of-words classification to put each downloaded page in a category. • Neighbors and children of this page are pursued depending upon whether this category or one of its ancestors has been marked as good. • Puts the burden on the user to provide this tree and example documents. • Rennie and McCallum – Cora portal for CS papers • Uses Reinforcement Learning.

  6. Reinforcement Learning • An RL task consists of: • a finite set of states, S • a finite set of actions, A • a transition function, T: S × A → S • a reward function, R: S × A → ℝ • The goal of RL is to find a policy, π: S → A, that maximizes the expected sum of future rewards. (Diagram: a chain of states whose transitions give rewards r=0, r=0, r=1.)
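The Python sketch below writes out this RL task as plain data, using a small chain with rewards 0, 0, 1 like the slide's diagram. It is only an illustration of the definitions; all names (states, transition, reward, policy) are my own, not from the talk.

```python
# A minimal sketch of an RL task (S, A, T, R) plus a policy.
GAMMA = 0.9  # discount factor

states = ["s0", "s1", "s2", "s3"]   # finite set of states (s3 is the absorbing target)
actions = ["forward"]               # finite set of actions

def transition(state, action):
    """Deterministic transition function T(s, a) -> s'."""
    order = {"s0": "s1", "s1": "s2", "s2": "s3", "s3": "s3"}
    return order[state]

def reward(state, action):
    """Reward function R(s, a): 1 for the step that reaches the target s3, else 0."""
    return 1.0 if state == "s2" else 0.0

def policy(state):
    """A policy maps each state to an action; here there is only one choice."""
    return "forward"

# Expected discounted return of the policy from s0: 0 + 0.9*0 + 0.81*1 ~= 0.81
value, s, discount = 0.0, "s0", 1.0
for _ in range(3):
    a = policy(s)
    value += discount * reward(s, a)
    s = transition(s, a)
    discount *= GAMMA
print(value)
```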

  7. Value Functions in RL • In the infinite-horizon model the value of a state s under a policy π is equal to V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ], where γ is a discount factor. • This value function can be defined recursively as V^π(s) = R(s, π(s)) + γ V^π(T(s, π(s))). • The optimal value function, V*, is defined as the value of a state using the optimal policy, π*. • It is also convenient to define the function Q*(s, a) to be the value of taking action a in state s and acting optimally from then on: Q*(s, a) = R(s, a) + γ V*(T(s, a)).
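To make these definitions concrete, here is a short value-iteration sketch that computes V* and Q* for the toy chain from the previous snippet. This is purely illustrative of the equations on the slide, not part of the crawler.

```python
# Value iteration: repeatedly apply the recursive definitions of Q* and V*.
def value_iteration(states, actions, transition, reward, gamma=0.9, sweeps=50):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # Q*(s, a) = R(s, a) + gamma * V*(T(s, a))
        Q = {(s, a): reward(s, a) + gamma * V[transition(s, a)]
             for s in states for a in actions}
        # V*(s) = max_a Q*(s, a)
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
    return V, Q

V, Q = value_iteration(states, actions, transition, reward)
print(V["s0"], Q[("s2", "forward")])   # roughly 0.81 and 1.0 for the toy chain
```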

  8. RL in Focused Crawling • Naïve approach: • States are web pages, actions are links, and the transition function is deterministic. • This is an inaccurate model of focused crawling. • Better approach: • A state s is defined by the set of pages we have visited (and the set not visited). • Our set of actions, A, is the set of links we have seen but have not visited. • Our reward function, R, for an action a is positive if a is downloading a target page, and negative or zero if not. • Our transition function, T, adds the page we are following to the set of seen pages, and the links on the page to our action set. • If we knew Q*, our optimal policy would simply be to pick the action with the highest Q-value. (Diagram: the web partitioned into visited pages, unvisited pages, and target pages.)
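A sketch of how this "better approach" could be represented in code: the state is the set of visited pages, the action set is the frontier of seen-but-unvisited links, and following a link gives a positive reward iff it hits a target page. The class and member names are hypothetical; the `targets` set would only be known on a downloaded training set.

```python
# A sketch (not the authors' code) of the crawl-as-MDP formulation on this slide.
class CrawlState:
    def __init__(self, targets):
        self.visited = set()       # pages downloaded so far
        self.frontier = set()      # links seen but not yet followed (the action set A)
        self.targets = targets     # assumed set of known target URLs (training time)

    def reward(self, url):
        """R(s, a): positive if the followed link is a target page, zero otherwise."""
        return 1.0 if url in self.targets else 0.0

    def step(self, url, outlinks):
        """T(s, a): mark the page visited and add its outlinks to the action set."""
        r = self.reward(url)
        self.visited.add(url)
        self.frontier.discard(url)
        self.frontier.update(link for link in outlinks if link not in self.visited)
        return r

# Greedy crawling with a known Q function would then be:
#   next_url = max(state.frontier, key=lambda a: Q(state, a))
```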

  9. Problems with RL • Note that the Q function is specified as a table of values, one for each state-action pair. • In order to compute Q* we would have to iterate repeatedly over all of the states and actions. • Problem – our goal is precisely not to visit all web pages and links, and the web is too large for this even if we did. • We could use a function of the features of a state-action pair to approximate the true Q function, and use machine learning to learn this function during the crawl. • Problem – RL with value-function approximation is not proven to converge to the true values. • We could instead download a dataset, compute something close to the exact Q function, learn an approximation to it, and use the approximate function in our crawls. • This is what Rennie and McCallum do.

  10. Rennie and McCallum • They compute the Q-values associated with a near-optimal policy on a downloaded training set. • Their near-optimal policy selects the actions from a state that lead to the closest target page. • This makes sense: you can always “jump” to any other page that you’ve seen. • They then train a Naïve Bayes classifier, using the binned Q-value of a link as the class, and the text, context, URL, etc. as features. (Diagram: under the near-optimal policy, the sequence of rewards if you pursue the immediate reward first is {1,0,1,1,1,1}, while it is {0,1,1,1,1,1} if you go to the “hub” first.)
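A rough sketch of this kind of training signal: on a downloaded link graph, score each link by the discounted distance from its destination to the nearest target (one simplified reading of the "near-optimal" policy that ignores rewards beyond the nearest target), then bin the score to get a class label for a text classifier. The discount, thresholds, and function names are my assumptions, not Rennie and McCallum's exact recipe.

```python
from collections import deque

def distance_to_nearest_target(graph, targets):
    """Multi-source BFS over reversed edges: hops from each page to the nearest target."""
    reverse = {}
    for u, outlinks in graph.items():
        for v in outlinks:
            reverse.setdefault(v, []).append(u)
    dist = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        v = queue.popleft()
        for u in reverse.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def link_q_value(dest, dist, gamma=0.5):
    """Q of following a link ~ gamma^(hops from the destination page to a target)."""
    return gamma ** dist[dest] if dest in dist else 0.0

def q_bin(q, thresholds=(0.5, 0.25, 0.1)):
    """Map a Q value to a discrete class label for the Naive Bayes classifier."""
    for label, threshold in enumerate(thresholds):
        if q >= threshold:
            return label
    return len(thresholds)
```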

  11. Rennie and McCallum Problems • Rennie and McCallum require that the user select and supply large training sets. • The Q function is binned. • The near-optimal policy is not optimal: • It misses the fact that some links lead to a single target k steps away, while others lead to multiple targets k steps away. • It chooses arbitrarily between two equidistant rewards, so it could make the wrong decision if one of the pages has more rewards right behind it.

  12. Policy Search • Approximating the value function is difficult. • The exact value of a state is hard to approximate, and more information than we need. • Approximating the policy is easier. • All we need to learn is whether or not to take a certain action. • However: • The space of possible policies is large. For a policy that is a linear function of features, there are as many parameters as features. • We can’t do hill-climbing in policy space. • Therefore: • We need at the very least the gradient of the performance function with respect to the parameters of the policy, so we know in which direction to search. • So… • We need a probabilistic model of focused crawling, in which we can state the performance of a policy as a differentiable function of the policy parameters.

  13. Probabilistic Model • A joint probability distribution over random variables representing the status of each page at each iteration of the crawl. • N rows for the N pages in the web. • T columns for the T steps away from the start page you travel. • Each variable s_i^t can take the value 1, meaning that page i was visited before time t, or 0. • A dynamic Bayesian network (DBN) provides a compact representation. • A node has parents in the previous column only; the parents are the nodes corresponding to the web pages that link to it, as well as itself (to ensure that a page remains visited). • Whether a page has been visited depends on whether its parents had been visited by the previous step. (Diagram: an N × T grid of nodes, with pages 1…N as rows and time steps 1…T as columns.)
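One plausible way to hold this DBN's structure in memory is, for each page, the list of pages that link to it plus a feature vector on each such link; the later forward-pass sketch consumes exactly these two maps. The function names and the `build_features` featurizer are placeholders for whatever text/URL features the crawler extracts.

```python
# A sketch of the DBN inputs: parents per page and per-link feature vectors.
def build_dbn_inputs(link_graph, build_features):
    """link_graph maps each page id j to the list of page ids it links to."""
    parents = {}      # parents[i] = pages j that link to i
    features = {}     # features[(i, j)] = F_ji, the feature vector on the link j -> i
    for j, outlinks in link_graph.items():
        for i in outlinks:
            parents.setdefault(i, []).append(j)
            features[(i, j)] = build_features(j, i)   # e.g. anchor text, URL tokens
    return parents, features

# Each page is also implicitly its own parent (with weight 1), so that once a page
# is visited it stays visited at every later time step; the forward pass handles this.
```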

  14. Probability of Visiting a Page • The probability that a node i is visited, given that its parent j is visited, is represented as w_ij. • w_ij is calculated as a logistic regression function of the dot product of the parameters, θ, and the features of the link from j to i, F_ji: w_ij = σ(θ · F_ji). • We use a Noisy-OR conditional probability distribution over parents for efficiency of representation and computation. (Diagram: a node at time t with its parents at time t−1 and the corresponding edge weights w_ij.)

  15. Probability of Visiting (cont) • Noisy-OR approximation: P(s_i^t = 1 | parents) = 1 − Π_j (1 − w_ij)^{u_j}, where the product runs over all the parents j in the previous time step, (1 − w_ij) is the probability that i is not visited in this time step even though j has been visited, and u_j = 1 if the parent j has been visited by the previous time step, u_j = 0 otherwise.
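Below is a sketch of one forward pass through the DBN, combining the logistic link weights of slide 14 with the Noisy-OR of slide 15. Using the parents' visit probabilities in place of the binary indicators u_j inside the Noisy-OR is my assumption about how the forward computation is carried out, not something stated on the slides.

```python
import numpy as np

def link_weight(theta, F_ji):
    """w_ij = sigmoid(theta . F_ji): probability that a visit to j leads to i."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, F_ji)))

def forward(parents, features, theta, start, N, T):
    """Propagate visit probabilities p[t][i] through T time steps.

    parents[i]        -- pages j that link to i (page i is also its own parent,
                         with weight 1, so a visited page stays visited)
    features[(i, j)]  -- feature vector F_ji on the link from j to i
    """
    p = np.zeros((T + 1, N))
    p[0, start] = 1.0                                    # the crawl begins at the start page
    for t in range(1, T + 1):
        for i in range(N):
            not_visited = 1.0 - p[t - 1, i]              # self-parent, weight 1
            for j in parents.get(i, []):
                w_ij = link_weight(theta, features[(i, j)])
                not_visited *= 1.0 - w_ij * p[t - 1, j]  # Noisy-OR over parents
            p[t, i] = 1.0 - not_visited
    return p
```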

  16. Performance of Policy • The performance of a policy is computed at the last time step, T. • It is equal to the sum, over pages, of the probability of visiting each page multiplied by the reward of that page: Perf = Σ_i P(s_i^T = 1) · r_i. • Possible rewards: r_i is positive if page i is a target, and negative or zero if it is not. (Diagram: example pages 1…N with reward values such as 0.5, 0.2, 1, 0.3, …, 0.4.)
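Continuing the forward-pass sketch above, the performance of a parameter vector θ is then just the reward-weighted sum of the final-step visit probabilities.

```python
def performance(p, rewards):
    """Perf(theta) = sum_i P(page i visited by step T) * r_i."""
    return float(np.dot(p[-1], rewards))

# e.g. rewards = np.array([...]) with r_i > 0 for target pages and 0 otherwise,
#      p = forward(parents, features, theta, start, N, T)
```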

  17. Computing the Gradient • By the chain rule of partial derivatives, the gradient of the performance with respect to the policy parameters decomposes as ∂Perf/∂θ = Σ_{i,t,j} (∂Perf/∂p_i^t) · (∂p_i^t/∂w_ij) · (∂w_ij/∂θ), where p_i^t = P(s_i^t = 1) is the marginal probability that page i has been visited by step t, and ∂Perf/∂p_i^t is the sensitivity of the performance function with respect to a specific node.

  18. Gradient of the Policy • Computing this derivative, ∂w_ij/∂θ, is fairly trivial. • The logistic function has a nice derivative: ∂w_ij/∂θ = w_ij (1 − w_ij) F_ji. • We already have everything we need here. • We have the parameters, θ. • We have the features, F_ji. • And we have w_ij; we calculate it when we go forward through the network, creating all of the probabilities and getting a value for our policy. • Remember, w_ij is the same at every time step.

  19. Gradient of a Marginal Prob • Calculating this derivative, ∂p_i^t/∂w_ij, is less trivial, but it can be done with a little bit of math. • Again we have everything we need: we calculate all of the probabilities and weights during our forward run through the network.

  20. Computing the sensitivity • For the final time step, T, this derivative is simple. • Recall that Perf = Σ_i p_i^T r_i, so ∂Perf/∂p_i^T = r_i. • For all other time steps, ∂Perf/∂p_j^t = Σ_i (∂Perf/∂p_i^{t+1}) · (∂p_i^{t+1}/∂p_j^t). • Note that the i on the left side has changed to a j: the sensitivity of a node j is determined by the sensitivities of its children, i, in the next time step. • Thus, when computing the gradient, we only have to keep the sensitivities for one layer back.
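Here is a sketch of the backward pass implied by slides 17–20: sensitivities start out as the rewards at step T and are pushed back one layer at a time, while each edge contributes sensitivity × ∂p/∂w × ∂w/∂θ to the gradient. The concrete derivative expressions below follow from the Noisy-OR and logistic forms used in the forward sketch above, so they are my reconstruction under those assumptions, not the talk's own formulas. It reuses `link_weight`, `parents`, and `features` from the earlier sketches.

```python
def gradient(parents, features, theta, p, rewards):
    """Backward pass sketch: dPerf/dtheta via layer-by-layer sensitivities."""
    T = p.shape[0] - 1
    grad = np.zeros_like(theta)
    sens = rewards.astype(float).copy()          # dPerf/dp[T][i] = r_i
    for t in range(T, 0, -1):
        prev_sens = np.zeros_like(sens)
        for i in range(p.shape[1]):
            not_visited = 1.0 - p[t, i]          # product of all "not reached" terms
            # self-parent (weight 1): a visited page stays visited
            prev_sens[i] += sens[i] * not_visited / max(1.0 - p[t - 1, i], 1e-12)
            for j in parents.get(i, []):
                F_ji = features[(i, j)]
                w_ij = link_weight(theta, F_ji)
                safe = max(1.0 - w_ij * p[t - 1, j], 1e-12)
                dp_dw = not_visited * p[t - 1, j] / safe        # d p[t][i] / d w_ij
                dp_dparent = not_visited * w_ij / safe          # d p[t][i] / d p[t-1][j]
                dw_dtheta = w_ij * (1.0 - w_ij) * F_ji          # logistic derivative
                grad += sens[i] * dp_dw * dw_dtheta
                prev_sens[j] += sens[i] * dp_dparent
        sens = prev_sens                          # keep only one layer of sensitivities
    return grad
```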

  21. We Have the Gradient! • Because we just have to go once forward and once backward through the network, this process is of complexity O(N*T*P*F) • N~=50,000 • T~=10 • P~=10 (number of parents) • F~=10,000 (number of features) • ~50 billion calculations. • Training on a set of 15,000 pages with 60,000 features took about a minute.

  22. Results

  23. Results • The performance of policy search isn’t yet beating that of Rennie and McCallum. • But we believe that we can improve performance vastly by adjusting the algorithm: • Larger training sets • Different parameterizations • Better optimization techniques • Better feature selection • Optimizing with respect to alternative performance metrics.

  24. Discussion and Future Work • We still need to play around with it to see how much we can get its performance to improve. • It is still an offline algorithm; we would like to do an online algorithm of some sort. • A batch-processing algorithm where we crawl for a while, collecting a new data set, and then learn for a while. • We would have to throw in some kind of random exploration if that were the case. • Or calculate the gradient and search the parameters online?

  25. Backup Slides

  26. Second Part of Sum

  27. Second Part of Sum (cont.)

  28. Second Part of Sum (cont.)

  29. Reinforcement Learning
