

  1. Application of the A* Informed Search Heuristic to finding Information on the World Wide Web
  Daniel J. Sullivan, Presenter
  Date: April 30th, 2003
  Location: SL210
  For: CS590 Intelligent Systems
  Related Subject Areas: Artificial Intelligence, Graphs, Epistemology, Knowledge Management and Information Filtering

  2. Problem Domain This project explores the application of the A* heuristic search function to the problem of document retrieval and classification based upon a relevancy criterion. This work includes a modification of A* and proposes a means of determining relevancy as a function of independent textual mappings.

  3. The Principal Objectives of this Project • The problem of retrieving useful information from the WWW. • The A* (A-Star) heuristic approach to searching a state space. • The development of a simple relevance heuristic which does not require a large sample base. • The development and testing of a basic search agent.

  4. The A* Heuristic • An informed search technique. • A function which evaluates the total cost of a path based upon the actual cost [ G(n) ] up to the current node and the estimated cost [ H(n) ] from the current node to the goal node. • Requires an effective means of predicting expected path cost, and the heuristic must be admissible: H(n) must never overestimate the cost of reaching the goal node.

  5. A* Function with User-Set Time Limit • F(n) = G(n) + H(n), where n is time in seconds • G(n) = total time elapsed • DV (Document Value) = Relevance * Size (in bytes) • CRI = current bytes of relevant information • BP (Best Path) = Max_Bandwidth * total_time_avail; this perfect case serves as the admissibility criterion • H(n) = (BP - CRI) / DV, which yields the estimated number of seconds left if this path is followed • Links with the lowest total time left to reach the information goal are inserted into the priority queue and explored first.
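The scoring formula above can be sketched in a few lines of Perl (the implementation language named on the design slide below). The subroutine and argument names here are illustrative assumptions, not the thesis code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Compute the A*-style score for a candidate link, in seconds.
    # All names here are illustrative, not taken from the actual program.
    sub astar_seconds_left {
        my (%arg) = @_;

        my $g  = $arg{elapsed_seconds};                  # G(n): total time elapsed so far
        my $dv = $arg{relevance} * $arg{size_bytes};     # DV = Relevance * Size
        return undef if $dv <= 0;                        # no relevant content on this path

        my $bp  = $arg{max_bandwidth} * $arg{total_time}; # BP: perfect-case byte total
        my $cri = $arg{relevant_bytes};                   # CRI: relevant bytes gathered so far

        my $h = ($bp - $cri) / $dv;                      # H(n): estimated seconds remaining
        return $g + $h;                                  # F(n) = G(n) + H(n)
    }

    # Example: 120 s elapsed, 50 KB/s bandwidth, 600 s budget, 400 KB of
    # relevant bytes so far, candidate document of relevance 0.4 and 80 KB.
    my $f = astar_seconds_left(
        elapsed_seconds => 120,
        max_bandwidth   => 50_000,
        total_time      => 600,
        relevant_bytes  => 400_000,
        relevance       => 0.4,
        size_bytes      => 80_000,
    );
    print "F(n) = $f seconds\n";    # prints F(n) = 1045 seconds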

  6. Relevance • The technique used in this project is simply a comparison of text sample features. • It begins with a single sample without specifying how large the sample needs to be. • It uses more than one functional mapping for comparison and expects that the weights assigned to each mapping accurately reflect their specificity.

  7. Text Document Mappings
F1 : S → WL, where S is the sample document and WL is the set of ordered pairs (a, b) such that a is a word in S and b is its relative frequency. This is the most basic lexical comparison between documents.
F2 : S → WC, where WC is a set of ordered pairs (a, b) such that a is a content-related token from S (a ∈ C) and b is the relative frequency of this token.
F3 : S → TC, where TC is a set of ordered pairs (a, b) such that a ∈ O (the set of operators) and b is the relative frequency of a.
F4 : S → OP, where OP is a set of 3-tuples (a, b, c) such that a is in S, b is in S, b immediately follows a in the word ordering (position(b) = position(a) + 1), and c is the relative frequency of this pair of words.
F5 : S → MXST, where MXST is a set of 3-tuples, a subset of OP, representing the maximum spanning tree connecting all words in the document based upon their ordering.
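A minimal Perl sketch of the most basic mapping, F1; the tokenization rules here (lowercasing, splitting on non-word characters) are assumptions, since the slides do not specify a tokenizer:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # F1 : S -> WL.  Map a document's text to { word => relative frequency }.
    sub word_frequency_map {
        my ($text) = @_;
        my (%count, $total);
        for my $word (split /\W+/, lc $text) {
            next unless length $word;
            $count{$word}++;
            $total++;
        }
        return {} unless $total;
        my %wl = map { $_ => $count{$_} / $total } keys %count;
        return \%wl;
    }

    my $wl = word_frequency_map("The cat sat on the mat.");
    printf "%-6s %.3f\n", $_, $wl->{$_} for sort keys %$wl;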

  8. Value of Different Mappings? Document 1: This is the original sample document; the user wants to find a maximum of materials related to this sample. The set shown is the one produced by F3 above, which should show similarity for most documents, even those which are not really relevant. But are there small distinctions which can be used to judge similarity? The diagonal region is meant to indicate the intersection between the sets. Document 2: This is the document downloaded from the WWW, which may contain related information. For this case, let's assume the comparison uses the F2 mapping. The intersection is small, but clearly not of the same magnitude as testing whether these documents use a similar frequency of operators. In this case, it is obvious we would not want to weight these mappings equally.
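One way the weighting idea could be realized, sketched in Perl. The overlap measure (sum of shared minimum frequencies) and the weight values are assumptions, chosen only to illustrate discounting the near-universal operator mapping F3:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(min);

    # Overlap of two relative-frequency maps: sum of the minimum frequency
    # over shared keys (1.0 = identical distributions, 0.0 = disjoint).
    sub overlap {
        my ($a, $b) = @_;
        my $score = 0;
        for my $key (keys %$a) {
            $score += min($a->{$key}, $b->{$key}) if exists $b->{$key};
        }
        return $score;
    }

    # Illustrative weights: the operator mapping F3 matches almost any
    # document, so it is discounted relative to the content mappings.
    my %weight = (F1 => 0.30, F2 => 0.35, F3 => 0.05, F4 => 0.15, F5 => 0.15);

    sub relevance {
        my ($sample_maps, $doc_maps) = @_;   # hashrefs: mapping name => freq map
        my $total = 0;
        for my $f (keys %weight) {
            next unless $sample_maps->{$f} && $doc_maps->{$f};
            $total += $weight{$f} * overlap($sample_maps->{$f}, $doc_maps->{$f});
        }
        return $total;   # 0..1, usable as Relevance in the Document Value formula
    }

    # Toy usage with only the F1 maps filled in:
    my $sample = { F1 => { cat => 0.5, mat => 0.5 } };
    my $doc    = { F1 => { cat => 0.4, dog => 0.6 } };
    printf "relevance = %.2f\n", relevance($sample, $doc);   # 0.30 * 0.4 = 0.12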

  9. Reasons to work with Web Search Agents… • To investigate general and common problems for all forms of intelligence. • To experiment in a domain where machines are on a more ‘equal’ footing in terms of perception. • To confront the common and real problem of information overload.

  10. What my program does.... This program takes a sample of text (possibly a very small one) and conducts a search for similar text documents (in HTML format) on the World Wide Web.

  11. Principal Objects in Design
CONNECTION OBJECT: Opens a connection with a web site, downloads the information, and returns it.
MAIN: All of the functionality, including the execution of A*, is contained in the Main module.
VISIT LIST: A hash table (implemented simply as a Perl hash) which ensures that there are no duplicate visits.
HEURISTICS: Contains all of the code related to the main investigation of this thesis, including the A* implementation.
PRIORITY QUEUE: Ensures that the links with the lowest A* value are visited first.
TEXT PROCESSOR: Prepares information for processing, removes links, and initializes key data points.
LINK OBJECT: The actual data type managed by the Priority Queue; it holds two values: a hyperlink and an A* score.
DATABASE: Manages important data which needs to be persistent.
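A compressed Perl sketch of how the VISIT LIST, LINK OBJECT, and PRIORITY QUEUE might interact; the slides do not reproduce the actual source, so these names and the sorted-array queue are assumptions:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %visited;   # VISIT LIST: a plain Perl hash keyed by URL
    my @queue;     # PRIORITY QUEUE: kept sorted by ascending A* score

    # LINK OBJECT: two values, a hyperlink and its A* score.
    sub enqueue_link {
        my ($url, $score) = @_;
        return if $visited{$url};                  # no duplicate visits
        push @queue, { url => $url, score => $score };
        # Re-sorting on every insert keeps the sketch short; a real
        # implementation would use a heap.
        @queue = sort { $a->{score} <=> $b->{score} } @queue;
    }

    # Remove the link with the lowest A* score; it is visited first.
    sub next_link {
        my $link = shift @queue or return;
        $visited{ $link->{url} } = 1;
        return $link;
    }

    enqueue_link("http://example.com/a", 1045);
    enqueue_link("http://example.com/b", 870);
    my $best = next_link();
    print "visit $best->{url} (score $best->{score})\n";   # b comes out first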

  12. Simplified High-Level Process Flow
1. Process the text sample and create the comparison tables.
2. Submit an initialization query to an Internet search engine.
3. Place the returned links in the priority queue with an initially low seconds score.
4. Remove the lowest-score link from the priority queue and download its information.
5. Has the time limit been reached? If YES, halt.
6. If NO, process the retrieved data as directed by the user and the purpose of the search.
7. Apply the A* function to the retrieved data, insert all links into the priority queue with their scores, and return to step 4.
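The loop below is a Perl sketch of this flow. Every subroutine stubbed at the bottom is a hypothetical placeholder standing in for the corresponding module on the previous slide, not the thesis code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (%visited, @queue);

    sub enqueue_link {
        my ($url, $score) = @_;
        return if $visited{$url}++;                 # VISIT LIST: dedup URLs
        @queue = sort { $a->{score} <=> $b->{score} }
                 (@queue, { url => $url, score => $score });
    }

    sub run_search {
        my ($sample_text, $time_limit) = @_;
        my $start  = time;
        my $tables = process_sample($sample_text);      # comparison tables

        # Seed the queue with an initially low seconds score.
        enqueue_link($_, 0) for seed_query($tables);

        while (my $link = shift @queue) {               # lowest score first
            last if time - $start >= $time_limit;       # HALT on time limit

            my $page = download($link->{url}) or next;  # CONNECTION OBJECT
            for my $url (extract_links($page)) {        # TEXT PROCESSOR
                enqueue_link($url, astar_score($page, $tables, time - $start));
            }
        }
    }

    # --- placeholder stubs so the sketch runs end to end ---
    sub process_sample { return {} }
    sub seed_query     { return ("http://example.com/seed") }
    sub download       { return "stub page for $_[0]" }
    sub extract_links  { return () }
    sub astar_score    { return 0 }

    run_search("sample text", 5);
    print "search finished\n";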

  13. Lessons Learned • Use an artificial neural network (ANN) for the relevance function. • Investigate whether this problem is better solved using hill-climbing. • Use Java and distributed objects to break the tasks down further and enable simultaneous processing; many tasks can be performed at the same time.
