Query Rewriting for Extracting Data Behind HTML Forms. Xueqi Chen, Department of Computer Science, Brigham Young University. March 2003. Funded by the National Science Foundation.
Motivation • Web information is stored in databases • Databases are accessed through forms • Automated agents are of great value • The process is difficult because of the nature of forms
System Flowchart [figure; components shown: User Query, Site Form, Application Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information]
User Query Acquisition • Our system provides a form generated from an application-specific ontology
Site Form Analysis • Understand type, name, and/or values for each field
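As a rough illustration of this step, the sketch below collects the type, name, and listed values of each field using Python's standard `html.parser`. The class name `FormFieldParser`, the helper `analyze_form`, and the sample form are hypothetical and are not taken from the presented system, which may analyze forms quite differently.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []      # one dict per field, in document order
        self._select = None   # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            value = attrs.get("value")
            self.fields.append({
                "type": attrs.get("type", "text"),  # HTML default type is text
                "name": attrs.get("name"),
                "values": [value] if value is not None else [],
            })
        elif tag == "select":
            self._select = {"type": "select", "name": attrs.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            # Only options with an explicit value attribute are recorded.
            if attrs.get("value") is not None:
                self._select["values"].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html):
    """Return a list of field descriptions found in the given HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample form, for illustration only.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="make">
  <select name="color">
    <option value="red">Red</option>
    <option value="blue">Blue</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

fields = analyze_form(SAMPLE_FORM)
```

Here `fields` holds one entry per field; a text box has an empty value list, while a selection list carries the values the site offers.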
Form Filling • Name matching • Regular expressions – for fields with values provided • Stemming • Levenshtein edit distance • Longest common subsequence • Soundex • WordNet • Value matching
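Two of the name-matching techniques listed above, Levenshtein edit distance and longest common subsequence, can be sketched as follows. How the presented system combines or thresholds these scores is not stated on the slides; the `name_similarity` and `best_match` helpers, the use of the maximum of the two scores, and the 0.6 threshold are assumptions made only for this illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Score in [0, 1]; the max of the two normalized scores is an assumption."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    edit_score = 1 - levenshtein(a, b) / longest
    lcs_score = lcs_length(a, b) / longest
    return max(edit_score, lcs_score)


def best_match(query_name, field_names, threshold=0.6):
    """Return the most similar site field name, or None if nothing is close enough."""
    scored = [(name_similarity(query_name, f), f) for f in field_names]
    score, name = max(scored, default=(0.0, None))
    return name if score >= threshold else None
```

For example, `best_match("colour", ["make", "color", "price"])` selects `"color"`, since the two names differ by a single inserted letter, while a query name with no close counterpart yields `None`.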
Value Matching: Case 2
Value Matching: Case 3 (Color?)
Measurements • Matching Efficiency • Submission Efficiency • Post-processing Efficiency
Contributions • This work enhances the effectiveness of the data-extraction process • It presents another technique, in addition to [RGa01], for accessing data behind HTML forms