Recognizing Ontology-Applicable Multiple-Record Web Documents
David W. Embley, Dennis Ng, Li Xu
Brigham Young University

Problem: Recognizing Applicable Documents
• Document 1: Car Ads
• Document 2: Items for Sale or Rent

Car-Ads Ontology
Car [->object];
Car [0:0.975:1] has Year [1:*];
Car [0:0.925:1] has Make [1:*];
Car [0:0.908:1] has Model [1:*];
Car [0:0.45:1] has Mileage [1:*];
Car [0:2.1:*] has Feature [1:*];
Car [0:0.8:1] has Price [1:*];
PhoneNr [1:*] is for Car [1:1.15:*];
Year matches [4]
  constant {
    extract "\d{2}";
    context "([^\$\d]|^)[4-9]\d,[^\d]";
    substitute "^" -> "19"; },
  …
End;

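
The Year rule above can be read as a three-step recognizer: find a context match, extract two digits from it, and prefix "19". The following is a minimal Python sketch of that reading only. The function name `extract_years` is hypothetical, and the way the three clauses interact is an assumption, not the ontology language's documented semantics.

```python
import re

# Patterns copied from the Year rule in the ontology.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text: str) -> list[str]:
    """Return four-digit years found via the context/extract/substitute steps.

    Assumption: 'extract' is applied inside each 'context' match, and
    'substitute "^" -> "19"' means prefixing the extracted digits with "19".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        digits = EXTRACT.search(ctx.group(0))
        if digits:
            years.append("19" + digits.group(0))
    return years
```

Under this reading, a two-digit number preceded by "$" is not taken as a year, because the context pattern excludes "$" and digits in front of the match.
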
Recognition Heuristics
• H1: Density
• H2: Expected Values
• H3: Grouping

H1: Density
• Document 1: Car Ads
• Document 2: Items for Sale or Rent

H1: Density
• Car Ads
  • Number of Matched Characters: 626
  • Total Number of Characters: 2048
  • Density: 0.306
• Items for Sale or Rent
  • Number of Matched Characters: 196
  • Total Number of Characters: 2671
  • Density: 0.073

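
The density figures on this slide follow from dividing matched characters by total characters. A minimal Python sketch of that arithmetic (the function name `density` is hypothetical; how the matched characters are counted is not shown on the slide and is not modeled here):

```python
def density(matched_chars: int, total_chars: int) -> float:
    """Fraction of a document's characters matched by the ontology's recognizers."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return matched_chars / total_chars


car_ads = density(626, 2048)      # Document 1 counts from the slide
sale_items = density(196, 2671)   # Document 2 counts from the slide
```

Rounded to three places, these give the 0.306 and 0.073 reported above.
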
H2: Expected Values
            Document 1: Car Ads    Document 2: Items for Sale or Rent
Year                  3                          1
Make                  2                          0
Model                 3                          0
Mileage               1                          1
Price                 1                          0
Feature              15                          0
PhoneNr               3                          4

H2: Expected Values
            OV      D1    D2
Year        0.98    16     6
Make        0.93    10     0
Model       0.91    12     0
Mileage     0.45     6     2
Price       0.80    11     8
Feature     2.10    29     0
PhoneNr     1.15    15    11

D1: 0.996
D2: 0.567
[Figure labels: D1, OV, D2]

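
The slide does not name the measure behind the 0.996 and 0.567 scores. Cosine similarity between the ontology vector OV and each document's count vector reproduces both values, so the sketch below assumes that measure. The function and variable names are hypothetical.

```python
import math

# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr (from the table above)
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a document with no matches scores zero
    return dot / (norm_u * norm_v)


score_d1 = cosine_similarity(OV, D1)
score_d2 = cosine_similarity(OV, D2)
```

Because cosine similarity ignores vector length, a long document and a short one with the same proportions of matches receive the same score.
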
H3: Grouping (of 1-Max Object Sets)
• Document 1: Car Ads
  Year Make Model Price Year Model Year Make Model Mileage …
• Document 2: Items for Sale or Rent
  Year Mileage … Mileage Year Price Price …

H3: Grouping
Expected Number in a Group = sum of Ave over the 1-Max object sets = 4 (for our example)
Grouping = (Sum of Distinct 1-Max in each Group) / (Number of Groups × Expected Number in a Group)

Car Ads
  Group                         Distinct 1-Max
  Year Year Make Model                3
  Price Year Model Year               3
  Make Model Mileage Year             4
  Model Mileage Price Year            4
  …
  (3+3+4+4) / (4×4) = 0.875
  Grouping: 0.865

Sale Items
  Group                         Distinct 1-Max
  Year Year Year Mileage              2
  Mileage Year Price Price            3
  Year Price Price Year               2
  Price Price Price Price             1
  …
  (2+3+2+1) / (4×4) = 0.500
  Grouping: 0.500

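
The grouping computation can be sketched in Python as follows: split the sequence of recognized 1-max object sets into consecutive groups of the expected size, count the distinct object sets in each group, and normalize. The function name `grouping_score` is hypothetical, and dropping an incomplete trailing group is an assumption the slide does not address. The two sequences are the four example groups listed on the slide, concatenated.

```python
def grouping_score(object_sets: list[str], group_size: int) -> float:
    """Sum of distinct object sets per group / (number of groups * group size)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Assumption: only complete groups are scored; a short tail is ignored.
    n_groups = len(object_sets) // group_size
    if n_groups == 0:
        return 0.0
    distinct_total = 0
    for i in range(n_groups):
        group = object_sets[i * group_size:(i + 1) * group_size]
        distinct_total += len(set(group))
    return distinct_total / (n_groups * group_size)


car_ads = ("Year Year Make Model Price Year Model Year "
           "Make Model Mileage Year Model Mileage Price Year").split()
sale_items = ("Year Year Year Mileage Mileage Year Price Price "
              "Year Price Price Year Price Price Price Price").split()

car_score = grouping_score(car_ads, 4)
sale_score = grouping_score(sale_items, 4)
```

For these four-group examples the sketch yields 0.875 and 0.500, matching the worked fractions on the slide.
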
Combining Heuristics
• Decision-Tree Learning Algorithm C4.5
  • (H1, H2, H3, Positive)
  • (H1, H2, H3, Negative)
• Training Set
  • 20 positive examples
  • 30 negative examples (some purposely similar, e.g., classified ads)
• Test Set
  • 10 positive examples
  • 20 negative examples

Car Ads: Rule & Results
• Precision: 100%
• Recall: 91%
• Accuracy: 97%
• Harmonic Mean
  • 2/(1/Precision + 1/Recall)

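
The harmonic-mean formula above can be applied directly to the reported precision and recall. A minimal Python sketch (the function name `harmonic_mean` is hypothetical; the slides do not report the resulting values, so none are quoted here as the authors' results):

```python
def harmonic_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    if precision <= 0 or recall <= 0:
        return 0.0  # assumption: treat a zero component as a zero score
    return 2 / (1 / precision + 1 / recall)


# Precision and recall as reported for the car-ads rule.
car_ads_f = harmonic_mean(1.00, 0.91)
```

The harmonic mean is pulled toward the smaller of the two inputs, so a rule cannot score well by maximizing precision at the expense of recall, or the reverse.
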
Obituaries: Rule & Results
• Precision: 91%
• Recall: 100%
• Accuracy: 97%

Universal Rule
• Precision: 84%
• Recall: 100%
• Accuracy: 93%

Additional and Future Work
• Other Approaches
  • Naïve Bayes [McCallum96] (accuracy near 90%)
  • Logistic Regression [Wang01] (accuracy near 95%)
  • Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)
• More Extensive Testing
  • Similar documents (motorcycles, wedding announcements, …)
  • Accuracy drops to near 87%
  • Naïve Bayes drops to near 77%
  • Others … ?
• Other Types of Documents
  • XML Documents
  • Forms and the Hidden Web
  • Tables

Summary
• Objective: Automatically Recognize Document Applicability
• Approach:
  • Conceptual Modeling
  • Recognition Heuristics
    • Density
    • Expected Values
    • Grouping
• Result: Accuracy Near 95%
www.deg.byu.edu