Explore the generation of semantic annotations and ontology from machine-generated web pages. Learn about automatic annotation techniques such as TISP and FOCIH for extracting knowledge from hidden web content. Discover methods for ontology creation and information harvesting.
Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages Cui Tao PhD Dissertation Defense
Motivation • Birth date of my great-grandpa • Price and mileage of red Nissans, 1990 or newer • Protein and amino-acid information for gene cdk-4 • US states with property crime rates above 1%
Search the Hidden Web: “cdk-4” • The Hidden Web: • Hidden behind forms • Hard to query
Query for Data • The Hidden Web: • Hidden behind forms • Hard to query • Find the protein and amino-acid information for gene “cdk-4”
A Web of Pages → A Web of Knowledge • Web of Knowledge • Machine-“understandable” • Publicly accessible • Queryable by standard query languages • Semantic annotation • Domain ontologies • Populated conceptual model • Problems to resolve • How do we create ontologies? • How do we annotate pages for ontologies?
Contributions of Dissertation Work • Web of Pages → Web of Knowledge • Knowledge & meta-knowledge extraction • Reformulation as machine-“understandable” knowledge • Automatic & semi-automatic solutions via: • Sibling tables (TISP/TISP++) • User-created forms (FOCIH)
Automatic Annotation with TISP (Table Interpretation with Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations
Recognize Tables • Layout tables (discard) • Data table • Nested data tables
Find Label/Value Associations • Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
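The dotted paths in the example above can be sketched in a few lines. This is only an illustration, assuming a table with one header row of column labels; the function name `associate` and the placeholder cell values (everything except `WP:CE28918`) are hypothetical and not taken from the slides.

```python
def associate(table_path, column_labels, rows):
    """Map (column path, row path) pairs to cell values.

    table_path    -- labels of the enclosing tables, outermost first
    column_labels -- labels from the table's header row
    rows          -- value rows, each a list of cell strings
    """
    associations = {}
    for row_number, row in enumerate(rows, start=1):
        row_path = ".".join(table_path + [str(row_number)])
        for label, value in zip(column_labels, row):
            column_path = ".".join(table_path + [label])
            associations[(column_path, row_path)] = value
    return associations


# Hypothetical sample table; only WP:CE28918 appears in the slides.
found = associate(
    ["Identification", "Gene model(s)"],
    ["Gene Model", "Protein"],
    [["model-A", "protein-A"], ["model-B", "WP:CE28918"]],
)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(found[key])
```

Keying each value by both a column path and a row path keeps the association unambiguous when the same label occurs in several nested tables.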
Interpretation Technique: Sibling Page Comparison • Almost same
Interpretation Technique: Sibling Page Comparison • Different • Same
Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (layout table: discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
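A rough sketch of the matching idea: cells that are identical across two sibling pages are likely labels, cells that differ are likely values, and a table that is identical everywhere is likely layout. The slides do not show the actual matching procedure, so the position-by-position comparison, the function name `compare_tables`, and the `threshold` value below are all assumptions made for illustration.

```python
def compare_tables(cells_a, cells_b, threshold=0.3):
    """Compare the cell texts of two candidate sibling tables.

    Returns (match kind, per-cell roles). The threshold is an
    illustrative assumption, not a value from the dissertation.
    """
    if not cells_a or len(cells_a) != len(cells_b):
        return "no match", []
    # Same text in both pages suggests a label; different text suggests a value.
    roles = ["label" if a == b else "value" for a, b in zip(cells_a, cells_b)]
    same_fraction = roles.count("label") / len(roles)
    if same_fraction == 1.0:
        return "perfect", roles      # identical everywhere: layout, discard
    if same_fraction >= threshold:
        return "reasonable", roles   # sibling data table
    return "no match", roles


# Hypothetical cell texts from the same table position in two sibling pages.
page_1 = ["Gene", "cdk-4", "Protein", "WP:CE28918"]
page_2 = ["Gene", "other-gene", "Protein", "other-protein"]
kind, roles = compare_tables(page_1, page_2)
print(kind, roles)
```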
Table Structure Patterns • Regularity Expectations: • (<tr><(td|th)> {L} <(td|th)> {V})^n • <tr>(<(td|th)> {L})^n • (<tr>(<(td|th)> {V})^n)+ • … • Pattern combinations are also possible.
Table Structure Patterns • <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
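To make the patterns concrete, here is a minimal sketch of how two of them could be applied once a table has already been parsed into rows of cell strings. The function names and the car-ad sample data are hypothetical; pattern discovery and dynamic adjustment are not covered.

```python
def extract_label_value_rows(rows):
    """Pattern (<tr><td>{L}<td>{V})^n: every row holds a label, then its value."""
    return {row[0]: row[1] for row in rows}


def extract_header_table(rows):
    """Pattern <tr>(<td>{L})^n (<tr>(<td>{V})^n)+: one label row, then value rows."""
    labels, *value_rows = rows
    return [dict(zip(labels, values)) for values in value_rows]


# Hypothetical sample tables.
vertical = [["Make", "Nissan"], ["Year", "1995"]]
horizontal = [["Make", "Year"], ["Nissan", "1995"], ["Honda", "2001"]]

print(extract_label_value_rows(vertical))
print(extract_header_table(horizontal))
```

Both functions return label-to-value mappings, so downstream annotation code would not need to know which layout a table used.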
TISP++ • Automatic ontology generation • Automatic information annotation
Ontology Generation – OSM • Object set: table labels • Lexical: labels that associate with actual values • Non-lexical: labels that associate with other tables • Relationship set: table nesting • Constraints: updates based on observation
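The mapping from interpreted tables to object and relationship sets can be sketched as a small recursive walk. This assumes the interpreted table is available as a nested dictionary in which a nested dictionary stands for a nested table; the function name `build_osm`, the dictionary layout, and the sample data are hypothetical, and constraint inference is left out.

```python
def build_osm(name, table, model=None):
    """Collect object sets and relationship sets from a nested label/value table."""
    if model is None:
        model = {"lexical": set(), "nonlexical": set(), "relationships": set()}
    model["nonlexical"].add(name)              # a table itself is non-lexical
    for label, content in table.items():
        model["relationships"].add((name, label))   # nesting becomes a relationship
        if isinstance(content, dict):
            build_osm(label, content, model)   # label associates with another table
        else:
            model["lexical"].add(label)        # label associates with an actual value
    return model


# Hypothetical interpreted page for one gene.
page = {
    "Name": "cdk-4",
    "Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}},
}
model = build_osm("Gene", page)
print(sorted(model["lexical"]))
print(sorted(model["nonlexical"]))
```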
Ontology Generation – OWL • Object set: OWL class • Relationship set: OWL object property • Lexical object set: • OWL data type property • Different annotation properties to keep track of the provenance
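As an illustration of this mapping, the sketch below turns object and relationship sets into Turtle-style OWL statements. The naming scheme (`has_` prefixes, underscores in local names) and the function names are assumptions; prefix declarations and the provenance annotation properties are omitted.

```python
import re


def local_name(label):
    """Turn a table label into a simple local name (illustrative scheme)."""
    return re.sub(r"\W+", "_", label).strip("_")


def to_turtle(nonlexical, lexical, relationships):
    """Emit one Turtle statement per OWL class and property."""
    lines = [f":{local_name(c)} a owl:Class ." for c in sorted(nonlexical)]
    for parent, child in sorted(relationships):
        p, c = local_name(parent), local_name(child)
        if child in lexical:
            # Lexical object set: OWL datatype property on the parent class.
            lines.append(f":{c} a owl:DatatypeProperty ; rdfs:domain :{p} .")
        else:
            # Relationship set between two classes: OWL object property.
            lines.append(
                f":has_{c} a owl:ObjectProperty ; rdfs:domain :{p} ; rdfs:range :{c} ."
            )
    return lines


# Hypothetical object and relationship sets.
statements = to_turtle(
    {"Gene", "Identification"},
    {"Name"},
    {("Gene", "Identification"), ("Gene", "Name")},
)
print("\n".join(statements))
```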
Query the Data • Find the protein and amino-acid information for gene “cdk-4”
TISP Evaluation • Applications • Commercial: car ads • Scientific: molecular biology • Geopolitical: US states and countries • Data: > 2,000 tables in 35 sites • Evaluation • Initial two sibling pages • Correct separation of data tables from layout tables? • Correct pattern recognition? • Remaining tables in site • Information properly extracted? • Able to detect and adjust for pattern variations?
Experimental Results • Table recognition: correctly discarded 157 of 158 layout tables • Pattern recognition: correctly found 69 of 72 structure patterns • Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
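For reference, the counts above correspond to the following success rates. The snippet only does the arithmetic on the reported numbers; the dictionary keys are descriptive names chosen here.

```python
# (correct, total) counts as reported on the slide.
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
}

# Success rate in percent, rounded to one decimal place.
rates = {name: round(100 * ok / total, 1) for name, (ok, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

This gives about 99.4% for layout-table recognition and about 95.8% for pattern recognition.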
TISP++ Performance • Performance depends on TISP • TISP test set • Generates all ontologies correctly • Annotates all information in tables correctly
Form-based Ontology Creation and Information Harvesting (FOCIH) • Personalized ontology creation by form • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Automated ontology creation • Automated information harvesting
Almost Ready to Harvest • Need reading path: DOM-tree structure • Need to resolve mapping problems • Pattern recognition • Instance recognition
Pattern & Instance Recognition • Regular expression for decimal number • Left context • Right context
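One way to picture a value pattern with left and right context is a regular expression built from three parts. This is a sketch under assumptions: the decimal expression, the function name `context_pattern`, and the sample sentence are hypothetical and not taken from FOCIH itself.

```python
import re

# Illustrative regular expression for a decimal number.
DECIMAL = r"\d+\.\d+"


def context_pattern(left, right, value=DECIMAL):
    """Build a pattern that captures `value` only between the given contexts."""
    return re.compile(re.escape(left) + "(" + value + ")" + re.escape(right))


# Hypothetical page text.
text = "Property crime rate: 3.45% (2006)"
match = context_pattern("rate: ", "%").search(text)
print(match.group(1))
```

Escaping the contexts with `re.escape` keeps literal characters such as `$` or `(` from being read as regex syntax.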
Pattern & Instance Recognition • List pattern; delimiter is “,”
Pattern & Instance Recognition • List pattern; delimiter is a regular expression for percentage numbers and a comma
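Both list patterns can be sketched with a delimiter regular expression passed to `re.split`. The two delimiter expressions, the helper name `split_list`, and the sample strings are assumptions made for illustration.

```python
import re

# Delimiter is a plain comma (surrounding whitespace is dropped).
COMMA = r"\s*,\s*"
# Delimiter is a percentage number followed by an optional comma.
PERCENT_AND_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?\s*"


def split_list(text, delimiter):
    """Split a list-valued field on the delimiter, dropping empty items."""
    return [item for item in re.split(delimiter, text) if item]


# Hypothetical field values.
colors = split_list("red, blue,green", COMMA)
groups = split_list("Group A 85.2%, Group B 3.1%, Group C 2.0%", PERCENT_AND_COMMA)
print(colors)
print(groups)
```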