This presentation describes a method for automatically filtering multiple-record web documents based on application ontologies, that is, for deciding whether a document is suitable for a given application such as car advertisements or obituaries. Three heuristics (density, expected values, and grouping) are computed for each document and combined by a decision tree learned with C4.5 from human-classified training documents. On the test sets, the application-specific rules reach 96.7% accuracy for both the car and obituary applications, and a single universal rule reaches 93.4%, for roughly 95% accuracy overall.
Filtering Multiple-Record Web Documents Based on Application Ontologies
Presenter: L. Xu
Advisor: D. W. Embley
Examples
• D1: Car
• D2: Item for Sale or Rent
Car Ontology
Car[->object];
Car[0..0.975..1] has Year;
Car[0..0.925..1] has Make;
Car[0..0.908..1] has Model;
Car[0..0.45..1] has Mileage;
Car[0..2.1..*] has Feature;
Car[0..0.8..1] has Price;
PhoneNr is for Car[1..1.15..*];
Year matches [4]
  constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },
  . . .
End;
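The Year rule in the ontology can be read as a three-step recognizer: find a context match, pull the two digits out of it, then prefix "19". The sketch below is one possible reading in Python, not the system's actual implementation. The function name `extract_years` and the way the three clauses are chained are assumptions; only the three patterns come from the slide.

```python
import re

# Patterns copied from the Year data frame on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years recognized by the two-digit Year rule.

    Assumed semantics (hypothetical): 'extract' is applied inside each
    'context' match, and 'substitute' rewrites the extracted string.
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        found = EXTRACT.search(ctx.group(0))
        if found:
            # substitute "^" -> "19": prepend the century to the two digits.
            years.append(re.sub(r"^", "19", found.group(0)))
    return years


print(extract_years("'94, Elantra, low miles"))
print(extract_years("Only $95, call now"))
```

Under this reading the context pattern keeps a price such as "$95," from being taken for a year, because the character before the digits may not be "$" or another digit.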
Filtering Heuristics
• H1: Density
• H2: Expected-values
• H3: Grouping
H1: Density
• Car (D1)
  • Total Number of Characters: 2048
  • Number of Matched Characters: 626
  • Density: 0.306
• Item for Sale or Rent (D2)
  • Total Number of Characters: 2671
  • Number of Matched Characters: 196
  • Density: 0.073
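The density values above are consistent with a simple ratio of matched characters to total characters. A minimal sketch of that arithmetic follows; the function name `density` is illustrative, and for the second document it assumes 196 matched characters out of 2671 in total, which is the only assignment that yields the reported 0.073.

```python
def density(matched_chars, total_chars):
    """Fraction of a document's characters matched by the ontology's recognizers."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


# Character counts taken from the slide.
print(round(density(626, 2048), 3))   # D1, car document: 0.306
print(round(density(196, 2671), 3))   # D2, item for sale or rent: 0.073
```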
H2: Expected-values
Object set   OV     D1   D2
Year         0.98   16    6
Make         0.93   10    0
Model        0.91   12    0
Mileage      0.45    6    2
Price        0.80   11    8
Feature      2.10   29    0
PhoneNr      1.15   15   11
H2 value
• D1: 0.996
• D2: 0.567
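The slide does not spell out how the H2 value is computed. Treating OV, D1, and D2 as vectors and taking the cosine of the angle between the ontology vector and each document vector reproduces both reported numbers, so the sketch below assumes that measure. The function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr (values from the slide).
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]

print(round(cosine_similarity(OV, D1), 3))   # 0.996
print(round(cosine_similarity(OV, D2), 3))   # 0.567
```

Because cosine similarity ignores vector length, a long document and a short one with the same mix of object sets score the same; only the proportions matter.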
H3: Grouping
• Car (D1)
  • Year: 2000, Year: 1989, Make: Subaru, Model: SW
    Nr of Distinct "One Max" Object: 3
  • Price: 1900, Year: 1998, Model: Elantra, Year: 1994
    Nr of Distinct "One Max" Object: 3
  • . . .
  • Grouping Factor is: 0.865
• Item for Sale or Rent (D2)
  • Year: 1999, Year: 1998, Year: 1960, Mileage: 10000
    Nr of Distinct "One Max" Object: 2
  • Mileage: 401000, Year: 1940, Price: 17500, Year: 10971
    Nr of Distinct "One Max" Object: 3
  • . . .
  • Grouping Factor is: 0.5
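The grouping example can be sketched as follows, under two assumptions that the slide does not state explicitly: extracted values for "one max" object sets (those with a maximum of 1 in the ontology: Year, Make, Model, Mileage, Price) are cut into consecutive groups of four, and the grouping factor is the total number of distinct object sets per group divided by the number of groups times the group size. The names `ONE_MAX` and `grouping_factor` are illustrative. Only the two groups visible on the slide are used here, so the results differ from the slide's 0.865 and 0.5, which cover the full documents.

```python
ONE_MAX = {"Year", "Make", "Model", "Mileage", "Price"}


def grouping_factor(object_sets, group_size=4):
    """Average share of distinct one-max object sets per consecutive group.

    Assumption: a trailing partial group is ignored.
    """
    one_max = [name for name in object_sets if name in ONE_MAX]
    last_start = len(one_max) - group_size + 1
    groups = [one_max[i:i + group_size] for i in range(0, last_start, group_size)]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * group_size)


# Object-set names of the extracted values, in document order (first two groups only).
d1 = ["Year", "Year", "Make", "Model", "Price", "Year", "Model", "Year"]
d2 = ["Year", "Year", "Year", "Mileage", "Mileage", "Year", "Price", "Year"]

print(grouping_factor(d1))   # (3 + 3) / (2 * 4)
print(grouping_factor(d2))   # (2 + 3) / (2 * 4)
```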
Combining Heuristics
• Decision tree learning algorithm: C4.5
  • Learning task: suitability
  • Performance measure: accuracy
  • Training experience: human-classified documents
• Training set
  • 20 positive examples (from 10 geographical regions of the United States)
  • 30 negative examples
• Test set
  • 10 positive examples
  • 20 negative examples
Generated Rules
• Car application
  H2 <= 0.8767: NO
  H2 > 0.8767: YES
• Obituary application
  H2 <= 0.6793: NO
  H2 > 0.6793
  | H1 <= 0.2171: NO
  | H1 > 0.2171: YES
• Universal rule
  H3 <= 0.625
  | H1 <= 0.369: NO
  | H1 > 0.369
  | | H2 <= 0.6263: NO
  | | H2 > 0.6263: YES
  H3 > 0.625: YES
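The generated rules translate directly into threshold checks. The sketch below encodes them as plain Python functions; the function names are illustrative, and the thresholds are those on the slide. The two example documents are then classified using the H1, H2, and H3 values reported on the earlier slides.

```python
def car_rule(h1, h2, h3):
    """Car application: only the expected-values heuristic is used."""
    return h2 > 0.8767


def obituary_rule(h1, h2, h3):
    """Obituary application: expected values first, then density."""
    return h2 > 0.6793 and h1 > 0.2171


def universal_rule(h1, h2, h3):
    """Universal rule: grouping first, then density and expected values."""
    if h3 > 0.625:
        return True
    return h1 > 0.369 and h2 > 0.6263


# (H1 density, H2 expected values, H3 grouping) as reported for the examples.
d1 = (0.306, 0.996, 0.865)   # car document
d2 = (0.073, 0.567, 0.5)     # item for sale or rent

print(car_rule(*d1), car_rule(*d2))               # True False
print(universal_rule(*d1), universal_rule(*d2))   # True False
```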
Experiment Results
• Car application
  • Accuracy: 96.7%
  • Precision: 100%
  • Recall: 91%
• Obituary application
  • Accuracy: 96.7%
  • Precision: 91%
  • Recall: 100%
• Universal rule
  • Accuracy: 93.4%
  • Precision: 84%
  • Recall: 100%
Summary
• Objective: Automatically filter multiple-record web documents
• Approach: Filtering heuristics
  • Density
  • Expected-values
  • Grouping
• Result: ~95% accuracy