Query Rewriting for Extracting Data Behind HTML Forms
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University August, 2002 Funded by National Science Foundation
Motivation • Web information is stored in databases • Databases are accessed through forms • Forms are designed in various ways • Automated agents are of great value
System Flowchart • Components shown in the figure: User Query, Application Ontology, Input Analyzer, Site Form, Retrieved Page(s), Output Analyzer, Extracted Information
User Query Acquisition • Our system provides a form created from an application-specific ontology
Site Form Analysis • Understands the type, name, and/or values of each field
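A minimal sketch of this kind of form analysis, assuming the site form is available as an HTML string. The class name FormFieldParser, the function analyze_form, and the sample form are hypothetical illustrations, not the system's actual analyzer. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects the type, name, and listed values of each form field (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}      # field name -> {"type": ..., "values": [...]}
        self._select = None   # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name"):
            field = self.fields.setdefault(
                a["name"], {"type": a.get("type", "text"), "values": []}
            )
            if a.get("value") is not None:
                field["values"].append(a["value"])
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields[self._select] = {"type": "select", "values": []}
        elif tag == "option" and self._select is not None:
            # Assumption: an option without a value attribute is recorded as "".
            self.fields[self._select]["values"].append(a.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a dict describing every named field found in the HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample form, for illustration only.
SAMPLE_FORM = (
    '<form action="/search">'
    '<input type="text" name="zip">'
    '<select name="make">'
    '<option value="">All</option><option value="ford">Ford</option>'
    "</select></form>"
)
fields = analyze_form(SAMPLE_FORM)
```

The resulting dictionary gives, per field name, the field type and any values the form itself lists, which is the information the form-filling step needs.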
Form Filling • Name matched? • Case 0 – 5 • Field matched? • Case 1, 2 • Value matched? • Case 3, 4, 5
Form Filling: Case 0 • Fields specified in a user query are the same as in a site form.
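When every query field has a counterpart in the site form, the query can be submitted directly. The sketch below assumes the form uses the GET method; the function build_get_url, the URL, and the field names are hypothetical.

```python
from urllib.parse import urlencode


def build_get_url(action_url, query):
    """Encode the user's field/value pairs into a GET request URL."""
    return action_url + "?" + urlencode(query)


# Hypothetical site and fields, for illustration only.
url = build_get_url("http://example.com/search", {"make": "Ford", "year": "2001"})
```

A form that uses POST would send the same encoded pairs in the request body instead of the URL.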
Form Filling: Case 1 • Fields specified in a user query are not contained in a site form, but are in the returned information.
Form Filling: Case 2 • Fields specified in a user query are not contained in a site form, and are not in the returned information. Color?
Form Filling: Case 3 • Fields required by a site form are not provided in a user query, but a general default value, such as “All” or “Any”, is provided by the site form.
Form Filling: Case 4 • Fields that appear in a site form are not provided in a user query, and the default value provided by the site form is specific, not “All” or “Any”.
Form Filling: Case 5 • Values specified in a user query do not match the values provided in a site form.
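The six cases above can be sketched as a per-field classification. This is an illustrative reading of the slides, not the system's actual matcher: it assumes field names match exactly (the real system presumably matches through the ontology), and the function classify_fields, its data layout, and the sample car-ad fields are all hypothetical.

```python
GENERAL_DEFAULTS = {"all", "any"}  # defaults the slides treat as "general"


def classify_fields(query, form_fields, returned_fields):
    """Map each field name to the form-filling case (0-5) it falls under.

    query:           field name -> value requested by the user
    form_fields:     field name -> {"values": [...], "default": str or None}
    returned_fields: set of field names present in the returned information
    """
    cases = {}
    for name, value in query.items():
        if name not in form_fields:
            # Cases 1 and 2: the site form has no such field.
            cases[name] = 1 if name in returned_fields else 2
            continue
        allowed = form_fields[name].get("values")
        # Case 5: the form lists values and the requested one is not among them.
        cases[name] = 5 if allowed and value not in allowed else 0
    for name, spec in form_fields.items():
        if name in query:
            continue
        default = spec.get("default")
        if default is None:
            cases[name] = None  # assumption: not covered by the cases on the slides
        elif default.strip().lower() in GENERAL_DEFAULTS:
            cases[name] = 3
        else:
            cases[name] = 4
    return cases


# Hypothetical example data.
query = {"make": "Ford", "color": "red", "mileage": "low", "year": "2001"}
form_fields = {
    "make": {"values": ["Ford", "Honda"], "default": "Ford"},
    "year": {"values": ["1999", "2000"], "default": "1999"},
    "body": {"values": ["All", "Sedan"], "default": "All"},
    "state": {"values": ["UT", "CA"], "default": "UT"},
}
returned_fields = {"make", "year", "color", "price"}
cases = classify_fields(query, form_fields, returned_fields)
```

In this example "color" falls under Case 1 because it can still be checked against the returned information, while "mileage" falls under Case 2 because it appears nowhere on the site.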
Post-processing • Valid pages? Error pages? Pages with error messages [Yau01] • Concatenates the results [Yau01] • Recognizes the boundary of each record [EJN99] • Identifies the formats of the retrieved pages
Post-processing (cont’d) • Removes duplicates [Yau01] • Extracts key information [Deg02, ETL02] • Places the results in a database [Deg02] • Executes the original user query and displays the results.
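A rough sketch of three of the post-processing steps: concatenating results, removing duplicates, and re-running the original query. It assumes records have already been extracted as flat dictionaries of strings. Exact-match deduplication and exact-match filtering stand in for whatever methods the cited work uses; merge_records and apply_query are hypothetical names.

```python
def merge_records(pages):
    """Concatenate the records from all retrieved pages, dropping exact duplicates."""
    seen = set()
    merged = []
    for records in pages:
        for record in records:
            key = tuple(sorted(record.items()))  # order-independent identity
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


def apply_query(records, query):
    """Keep only the records whose fields equal every value in the user query."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]


# Hypothetical extracted records from two retrieved pages.
pages = [
    [{"make": "Ford", "year": "2001"}],
    [{"make": "Ford", "year": "2001"}, {"make": "Honda", "year": "2000"}],
]
merged = merge_records(pages)
hondas = apply_query(merged, {"make": "Honda"})
```

Running the original query over the merged records matters most when the submitted form was broader than the query, for instance when a query field was missing from the site form.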
Measurements • Field-matching Efficiency • Submission Efficiency • Post-processing Efficiency
Contributions • It enhances the effectiveness of the data-extraction process • It presents another technique, in addition to [RGa01], to access data behind HTML forms.