FACT: A Learning Based Web Query Processing System
FACT: A Learning Based Web Query Processing System Hongjun Lu, Yanlei Diao Hong Kong U. of Science & Technology Songting Chen, Zengping Tian Fudan University
Outline • Introduction • Learning Based Web Query Processing • FACT: A Prototype System • Preliminary System Evaluation • Conclusions
How Do We Query the Web? • Use a search engine • Form query keywords • An example: find room rates of hotels in Hong Kong • search engine used: www.yahoo.com • keywords: Hong Kong+hotel
[Screenshot: browser window showing Hotel 1, Hotel 2, ...] Look at the number!
Query the Web -- Current Situation • Search engines return a long list of URLs. The user is required to browse the web pages to find the information. • The information required is often not on the returned page -- navigation through hyperlinks is often required (those links may or may not be obvious). • The target information is in different forms (paragraphs, lists, tables …) • A lot of web pages have to be browsed. Are we happy with this?
Efforts to Improve the Situation • Search engines • better indexes, improved precision/recall, metasearch engines, better presentation of results, … • IR techniques applied to the Web • document clustering/indexing, better models, similarity functions, document ranking, … • Intelligent agents • user profiling, hyperlink recommendation, … • Database approach • wrappers, query languages, …
Our Dream • Querying the Web is as easy as querying a relational database • An SQL query returns a table of hotel prices: SELECT room_rates FROM web.hotel WHERE city = 'hong kong' • May remain a dream for a while :-(
A Practical Goal • Use keywords to express query requirements • simple, no need to know the schema of the data • inaccurate • Relieve users from tedious browsing as much as possible • Not URLs, not Web sites, not even Web pages • Present query results to users as accurately and concisely as possible • Tables, lists, paragraphs, … containing the information the user requires
Query Results -- Queried Segments • Return query results as accurately and concisely as possible. • Basic idea: • Break a Web page into segments: a row in a table, a table, an item in a list, a list, a paragraph • Return only queried segments to users • Queried segments: segments that contain the information the user is interested in.
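The segmentation idea on this slide can be illustrated with a minimal Python sketch. It is not FACT's implementation: the class `SegmentCollector`, the function `queried_segments`, and the plain keyword filter are hypothetical stand-ins (FACT selects segments using knowledge learned from the user), and the sketch assumes reasonably well-formed HTML in which segment tags are explicitly closed.

```python
from html.parser import HTMLParser

# Tags treated as segments, following the slide: paragraph, list item,
# list, table row, table. (Assumed mapping to HTML elements.)
SEGMENT_TAGS = {"p", "li", "ul", "ol", "tr", "table"}


class SegmentCollector(HTMLParser):
    """Collects the text of every segment element in a page."""

    def __init__(self):
        super().__init__()
        self.open_segments = []  # stack of [tag, text pieces] still open
        self.segments = []       # finished (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENT_TAGS:
            self.open_segments.append([tag, []])

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text belongs to every enclosing segment (a row and its table).
            for _, pieces in self.open_segments:
                pieces.append(text)

    def handle_endtag(self, tag):
        if tag not in SEGMENT_TAGS:
            return
        # Close the innermost open segment with this tag.
        for i in range(len(self.open_segments) - 1, -1, -1):
            if self.open_segments[i][0] == tag:
                seg_tag, pieces = self.open_segments.pop(i)
                self.segments.append((seg_tag, " ".join(pieces)))
                break


def queried_segments(html, keywords):
    """Return the segments whose text contains all keywords (case-insensitive)."""
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    wanted = [k.lower() for k in keywords]
    return [(tag, text) for tag, text in collector.segments
            if all(k in text.lower() for k in wanted)]
```

Because segments nest, a matching table row is returned together with its enclosing table; a real system would still have to choose which granularity to present.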
Outline • Introduction • Learning Based Web Query Processing • FACT: A Prototype System • Preliminary System Evaluation • Conclusions
Learning Based Query Processing • The fundamental difficulties in Web query processing: • Web is a huge, ever growing, heterogeneous, semi-structured data source • Most users of Web are naïve users issuing ad hoc queries • Learn the knowledge for query processing from the User!
A Learning Based Technique • Learn from the user as he or she browses the first few URLs • how to navigate through the web pages • how to identify the required information in a web page • Process the remaining URLs automatically and retrieve queried segments
[Screenshot: browser window showing Hotel 1, Hotel 2, ...] User browses it!
[Screenshot] User clicks here!
[Screenshot: room information] User marks it!
[Screenshot] FACT starts here!
[Screenshot: room rates] FACT chooses it!
[Screenshot] FACT finds it!
Outline • Introduction • Learning Based Web Query Processing • FACT: A Prototype System • Preliminary System Evaluation • Conclusions
A Query Processing System A learning based query processing system: • User Interface: accepts user queries, presents query results, a browser capable of capturing user actions • Query Analyzer: analyzes and transforms user queries • Session Controller: coordinates learning and locating • Learner: generates knowledge from captured user actions • Locator: applies knowledge and locates query results • Crawler & Parser: retrieves pages and parses to trees • Knowledge Base: stores learned knowledge
Reference Architecture [Diagram components: User, User Interface, Query Analyzer, Session Controller, Learner, Locator, Knowledge Base, Crawler & Parser, Search Engine, Web]
A Query Session [Diagram labels: Learning Process, Scripts, Learner, Browser, User Actions, Session Controller, URLs, Knowledge Base, Result Buffer, Training Strategy, Segment Graph, Query Results, Checking, Locating Process, Locator, Query Result Presenter]
Training Strategies • Sequential • First n sites: user browses and system learns • Next N - n sites: system processes • Random • Randomly choose n sites: user browses and system learns • The system processes the rest • Interleaved • First n0 sites: user browses and system learns • Next n - n0 sites: system makes decisions; for incorrect ones, user browses and system re-learns • Next N - n sites: system processes
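The three strategies can be sketched in Python as below. This is a minimal illustration, not FACT's code: the function names and the callbacks `learn`, `decide`, and `is_correct` are hypothetical placeholders for the Learner, the Locator, and the user's verdict on a system decision.

```python
import random


def sequential_split(sites, n):
    """Sequential: the first n sites are for training, the rest are automatic."""
    return sites[:n], sites[n:]


def random_split(sites, n, seed=0):
    """Random: n randomly chosen sites are for training, the rest are automatic."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(sites)), n))
    train = [s for i, s in enumerate(sites) if i in chosen]
    auto = [s for i, s in enumerate(sites) if i not in chosen]
    return train, auto


def interleaved_session(sites, n0, n, learn, decide, is_correct):
    """Interleaved: learn on the first n0 sites, let the system decide on the
    next n - n0 sites and re-learn on its mistakes, then process the rest.

    Returns the sites the user had to browse and the automatic results.
    """
    user_browsed = []
    for site in sites[:n0]:
        learn(site)                      # user browses, system learns
        user_browsed.append(site)
    for site in sites[n0:n]:
        if not is_correct(site, decide(site)):
            learn(site)                  # wrong decision: user browses, system re-learns
            user_browsed.append(site)
    results = {site: decide(site) for site in sites[n:]}
    return user_browsed, results
```

In the interleaved case the user browses only the sites on which the system was wrong, so the number of training samples depends on how quickly the learned knowledge becomes good enough.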
Outline • Introduction • Learning Based Web Query Processing • FACT: A Prototype System • Preliminary System Evaluation • Conclusions
System Evaluation • Functionality • Performance • precision, recall, correctness • efficiency: in a site, how many pages the system visits to find a result • training efficiency: how many training samples are needed • User interface
System Evaluation - Effectiveness • Given a set of keywords, the system makes N decisions, N = N1 + N2 + N3 + N4 • Precision = N1 / (N1 + N3) • Recall = N1 / (# relevant sites) • Correctness = (N1 + N2) / N
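The three measures can be computed directly from the four counts. The sketch below is illustrative only: the transcript does not preserve the definitions of N1 to N4, so the hypothetical function `effectiveness` treats them as opaque counts and applies the formulas exactly as written on the slide. The zero-denominator handling is an added assumption.

```python
def effectiveness(n1, n2, n3, n4, relevant_sites):
    """Apply the slide's formulas to the decision counts N1..N4.

    relevant_sites is the number of relevant sites used in the recall formula.
    Ratios with a zero denominator are reported as 0.0 (an assumption).
    """
    total = n1 + n2 + n3 + n4                      # N = N1 + N2 + N3 + N4
    precision = n1 / (n1 + n3) if (n1 + n3) else 0.0
    recall = n1 / relevant_sites if relevant_sites else 0.0
    correctness = (n1 + n2) / total if total else 0.0
    return {"precision": precision, "recall": recall, "correctness": correctness}
```

For example, with invented counts N1 = 6, N2 = 2, N3 = 1, N4 = 1 and 8 relevant sites, precision is 6/7, recall is 0.75, and correctness is 0.8.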
System Evaluation - Efficiency • How efficiently does the system find a queried segment in a site? • Level of a Queried Segment = the length of the shortest path to find it • Absolute Path Length = # crawled pages • Relative Path Length = # crawled pages / Level of the Queried Segment
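A short Python sketch of the efficiency measures follows. It assumes a site is given as a dictionary mapping each page to the pages it links to, and that the level counts pages on the shortest path including the entry page, so a segment on the entry page has level 1; the slide does not say how the path is counted. The function names are hypothetical.

```python
from collections import deque


def segment_level(links, start, target):
    """Number of pages on the shortest link path from start to target.

    links maps a page to the list of pages it links to.
    Returns None if target cannot be reached from start.
    """
    seen = {start}
    queue = deque([(start, 1)])          # the entry page itself is level 1
    while queue:
        page, level = queue.popleft()
        if page == target:
            return level
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return None


def relative_path_length(crawled_pages, level):
    """Relative path length = # crawled pages / level of the queried segment."""
    return crawled_pages / level
```

Under this counting convention a relative path length of 1 means the system crawled no page outside a shortest path, and larger values mean wasted page visits.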
Basic Performance • Q11: Hong Kong Hotel Room Rate • Q12: Hong Kong Hotel Sequential training
Query Q12: Effects of Training Strategies
Improved Performance Interleaved training
Outline • Introduction • Learning Based Web Query Processing • FACT: A Prototype System • Preliminary System Evaluation • Conclusions
Conclusions • Proposed and implemented learning based Web query processing with the following features • Returning succinct results: segments of pages; • No a priori knowledge or preprocessing, suited to ad hoc queries; • Exploiting page formatting and linkage information simultaneously. • The preliminary results are promising
Future Work • Better knowledge • key factor that affects system performance • Dynamic web pages? • Integrating results from another project • System evaluation • Prototype → product → dot-com company → $$$ ???