This paper presents a semiautomatic approach to generate resilient data extraction ontologies aimed at efficiently extracting structured data from web pages. By leveraging concepts, relations, and participation constraints from the Mikrokosmos ontology and predefined data frame libraries, the methodology addresses the challenges of manual ontology generation. The study discusses the process of knowledge selection, conflict resolution, and the generation of a data extraction ontology, while providing insights into the assumptions regarding knowledge bases and meaningful relationships within the data.
Semiautomatic Generation of Resilient Data Extraction Ontologies • Yihong Ding • Data Extraction Group • Brigham Young University • Sponsored by NSF
Data Extraction Ontology • Goal: extract data from web pages • Components • concepts • relations between the concepts • participation constraints • Resilient • Difficulty: manual ontology generation is costly
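The three components listed above can be pictured as a small data structure. The following is a minimal sketch only: the class and field names (`Relation`, `ExtractionOntology`, `source_card`, `target_card`) are illustrative assumptions, not the format used by the author's system. The example relation is the one shown on the later Participation Constraints slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A named relation between two concepts (hypothetical layout)."""
    source: str
    name: str
    target: str
    source_card: str  # participation constraint on the source, e.g. "1:1"
    target_card: str  # participation constraint on the target, e.g. "1:*"


@dataclass
class ExtractionOntology:
    """Concepts, relations, and participation constraints in one container."""
    concepts: set[str] = field(default_factory=set)
    relations: list[Relation] = field(default_factory=list)

    def add_relation(self, rel: Relation) -> None:
        # Both endpoints of a relation become concepts of the ontology.
        self.concepts.update((rel.source, rel.target))
        self.relations.append(rel)


ontology = ExtractionOntology()
ontology.add_relation(Relation("City", "PartOf", "Nation", "1:1", "1:*"))
```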
Data-Extraction Ontology Generation Procedure • [Diagram: train and test documents, knowledge sources, knowledge selection processing, extraction processing, database]
Knowledge Collection • Assumptions about knowledge base • general • contains meaningful relationships • pre-existing • XML or easy to transfer to XML • Current input • Mikrokosmos ontology [Mik] • auxiliary data frame library
Selection of Concepts
PROCEDURE ConceptSelection(Tdoc, Kbase)
    SourceDoc = Parse(Tdoc);
    PrimarySelectedConceptsList = MikroSelection(M-Ontology);
    SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
    ConflictHandling();
    SelectedSubgraphGeneration();
• Many issues: selection strategies, conflict resolution, …
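The first steps of the procedure above might look as follows in Python. This is a sketch under stated assumptions: the two selectors are passed in as plain functions, the `Tags` type (matched string mapped to candidate concept names) is invented for illustration, and the toy selectors merely stand in for MikroSelection and DataFrameSelection. Conflict handling and subgraph generation are not modeled.

```python
from typing import Callable

# Hypothetical representation: matched string -> candidate concept names.
Tags = dict[str, set[str]]


def concept_selection(
    training_doc: str,
    select_from_ontology: Callable[[str], Tags],
    select_from_data_frames: Callable[[str], Tags],
) -> Tags:
    """Merge the candidate concepts proposed by both knowledge sources."""
    merged: Tags = {}
    for selector in (select_from_ontology, select_from_data_frames):
        for token, concepts in selector(training_doc).items():
            merged.setdefault(token, set()).update(concepts)
    # Conflict handling and subgraph generation would follow here.
    return merged


# Toy stand-ins for the two selectors (assumptions, not the real components).
def primary(doc: str) -> Tags:
    return {"Area": {"GeographicalArea"}} if "Area" in doc else {}


def secondary(doc: str) -> Tags:
    return {"648,000": {"Area", "Mileage"}} if "648,000" in doc else {}


selected = concept_selection("Area: 648,000 sq. km.", primary, secondary)
```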
Basic Selection Strategy • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. • Select from Mikrokosmos Ontology
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Afghanistan<Nation> • smaller than Texas<USState>. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
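Selection by concept names, values, and synonyms amounts to a lexicon lookup over the words of the training document. The sketch below assumes a tiny hand-written lexicon (`LEXICON`) whose entries are copied from the tags on the slide. The real Mikrokosmos ontology is far larger and its access interface is not shown in this presentation, so the function name and data layout are hypothetical.

```python
import re

# Hypothetical lexicon: lower-cased surface form -> concepts it may denote.
LEXICON = {
    "area": {"GeographicalArea"},
    "capital": {"CapitalCity", "FinancialCapital"},
    "population": {"Population"},
    "afghanistan": {"Nation"},
    "texas": {"USState"},
    "kabul": {"CapitalCity"},
    "wheat": {"FoodStuff", "AgriculturalProduct"},
}


def mikro_selection(text, lexicon=LEXICON):
    """Tag every alphabetic word that appears in the lexicon."""
    tagged = {}
    for word in re.findall(r"[A-Za-z]+", text):
        concepts = lexicon.get(word.lower())
        if concepts:
            tagged[word] = set(concepts)
    return tagged


tags = mikro_selection("Capital--Kabul, Population: 17.7 million.")
```

Words without a lexicon entry, such as "million" here, simply stay untagged at this stage.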
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • extract result based on the data frames • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
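A data frame can be thought of as a recognizer for the values of one concept. As a rough illustration, the sketch below models three data frames as regular expressions. The patterns are invented for this example and are not taken from the author's data frame library. They are chosen only so that "648,000" is claimed by both Area and Mileage and "17.7" by Time, as on the slide.

```python
import re

# Hypothetical data frames: concept name -> value recognizer.
DATA_FRAMES = {
    # A comma-grouped number followed by a "sq." unit.
    "Area": re.compile(r"\b\d{1,3}(?:,\d{3})+(?=\s*sq\.)"),
    # Any comma-grouped number.
    "Mileage": re.compile(r"\b\d{1,3}(?:,\d{3})+\b"),
    # Two short digit groups separated by a dot.
    "Time": re.compile(r"\b\d{1,2}\.\d{1,2}\b"),
}


def data_frame_selection(text, frames=DATA_FRAMES):
    """Collect, per matched string, every concept whose recognizer fires."""
    tagged = {}
    for concept, pattern in frames.items():
        for match in pattern.finditer(text):
            tagged.setdefault(match.group(), set()).add(concept)
    return tagged


frame_tags = data_frame_selection("Area: 648,000 sq. km. Population: 17.7 million.")
```

Because the recognizers run independently, one string can receive several tags, which is what produces the conflicts discussed on the following slides.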
Document-Level Conflict • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
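Before a conflict can be resolved it has to be detected. The sketch below covers only that detection step, assuming a conflict is any string in the document that carries more than one candidate concept. The presentation does not state how the conflicts are resolved, so no resolution rule is shown, and the function name is hypothetical.

```python
def find_conflicts(tags):
    """Return the strings that carry more than one candidate concept."""
    return {
        token: sorted(concepts)
        for token, concepts in tags.items()
        if len(concepts) > 1
    }


# Candidate tags copied from the slide above.
candidate_tags = {
    "648,000": {"Area", "Mileage"},
    "17.7": {"Time"},
    "million": {"Population", "Price"},
    "Kabul": {"CapitalCity"},
}
conflicts = find_conflicts(candidate_tags)
```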
Concept-Level Conflict • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000<Area> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million<Population>. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
Relation Retrieval • Theoretical solution • all paths in the subgraph • too expensive: NP-Complete • Heuristic solution • find the shortest path between any two nodes • set a threshold distance
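The heuristic above can be sketched with a breadth-first search: compute the shortest distance between each pair of selected concepts and keep the pair only if that distance is within the threshold. The graph below is a hypothetical fragment treated as unweighted and undirected. Only the CapitalCity, City, Nation chain comes from the slides; the GeographicalArea edge and the threshold value are assumptions.

```python
from collections import deque
from itertools import combinations


def shortest_distance(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def retrieve_relations(graph, selected, threshold):
    """Keep concept pairs whose shortest path is no longer than threshold."""
    related = []
    for a, b in combinations(sorted(selected), 2):
        dist = shortest_distance(graph, a, b)
        if dist is not None and dist <= threshold:
            related.append((a, b, dist))
    return related


# Hypothetical ontology fragment as an adjacency list.
GRAPH = {
    "CapitalCity": ["City"],
    "City": ["CapitalCity", "Nation"],
    "Nation": ["City", "GeographicalArea"],
    "GeographicalArea": ["Nation"],
}
pairs = retrieve_relations(
    GRAPH, {"CapitalCity", "Nation", "GeographicalArea"}, threshold=2
)
```

With a threshold of 2, CapitalCity and GeographicalArea (three edges apart) are not related, while the two closer pairs are kept.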
Participation Constraints • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. • CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]
Participation Constraints (cont.) • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<City>, • Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City> • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. • City [1:1] PartOf Nation [1:*]
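One way to read the two constraint slides is that a constraint's upper bound reflects how many related instances are observed in the training document. The sketch below encodes that reading. It is an assumption made for illustration, not necessarily the procedure used by the author, and the function name is hypothetical.

```python
def participation(related_counts):
    """Guess a constraint from how many partners each instance has.

    related_counts holds, for each instance of a concept, the number of
    instances of the other concept it is related to. The lower bound is
    fixed at 1 in this sketch.
    """
    return "1:*" if any(n > 1 for n in related_counts) else "1:1"


# One nation (Afghanistan) is related to four tagged cities.
nation_side = participation([4])
# Each of the four cities is part of exactly one nation.
city_side = participation([1, 1, 1, 1])
# The nation has exactly one tagged capital city.
capital_side = participation([1])
```

Under this reading the example yields City [1:1] PartOf Nation [1:*], matching the constraint shown on the slide.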
Performance Evaluation • Speed of generation • Precision and recall of the generation process • Precision and recall of the generated ontology
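Precision and recall can be computed by comparing what was generated or extracted against a hand-built reference. The sketch below uses the standard definitions. The two concept sets are made-up inputs chosen to show the arithmetic and are not results from this work.

```python
def precision_recall(produced, expected):
    """Standard precision and recall over two collections of items."""
    produced, expected = set(produced), set(expected)
    correct = len(produced & expected)
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Illustrative inputs only: concepts produced vs. concepts in a reference.
precision, recall = precision_recall(
    {"Nation", "CapitalCity", "Price"},
    {"Nation", "CapitalCity", "Population", "GeographicalArea"},
)
```

Here two of the three produced concepts are correct (precision 2/3) and two of the four reference concepts are found (recall 1/2).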
Conclusion • Data Extraction Ontology generated • Knowledge sources exploited • Many issues addressed • Many more to explore