
A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

A Self-Healing Approach for A Domain-Specific Deep Web Search Tool. Fan Wang and Gagan Agrawal The Ohio State University. Presented by : Tantan Liu. The Deep Web. The definition of “Deep web” from Wikipedia.


Presentation Transcript


  1. A Self-Healing Approach for A Domain-Specific Deep Web Search Tool Fan Wang and Gagan Agrawal The Ohio State University Presented by : Tantan Liu

  2. The Deep Web • The definition of “Deep web” from Wikipedia: the Deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.

  3. Deep Web in Biological Domain • 500 times larger than the surface web • Nearly 800 deep web data sources in the bio-domain • 95 percent of the deep web is publicly accessible

  4. Search the Deep Web: Solved and Unsolved Issues • Data Source Integration: schema matching and schema mining • Query Planning and Answering: keyword search and structured query answering • Fault Tolerance: data access over wide-area networks; unpredictable data source inaccessibility/unavailability; network contention; yet users still expect an uncompromised search experience

  5. Our Solution: A Redundancy based Self-Healing Approach • Identify data redundancy across independent data sources • Find the minimal “have to be replaced” sub-plan caused by data source unavailability/inaccessibility • Find the sub-query corresponding to the “have to be replaced” sub-plan • Generate a new replacing sub-plan based on redundancy using other data sources

  6. Roadmap • Introduction and Motivation • Problem Formulation in Detail • Our Self-Healing Approach • Evaluation • Conclusion

  7. Data Redundancy Model • A data source is represented by a three-tuple (IN, O, Con) • IN: input attributes • O: output attributes • Con: attribute conditions imposed on the data source • Data redundancy condition between data sources A and B • They have the same input attributes • They have overlapping output attributes • They have non-conflicting attribute conditions
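The three redundancy conditions can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `DataSource` class, the encoding of Con as (attribute, value) pairs, and the rule that two conditions conflict when they bind the same attribute to different values are all assumptions, since the slide does not define "non-conflicting".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Hypothetical encoding of the (IN, O, Con) three-tuple."""
    name: str
    inputs: frozenset        # IN: input attributes
    outputs: frozenset       # O: output attributes
    conditions: tuple = ()   # Con: assumed to be (attribute, value) pairs


def conditions_conflict(a, b):
    # Assumption: a conflict means the same attribute is bound to
    # different values by the two sources.
    cond_a, cond_b = dict(a.conditions), dict(b.conditions)
    return any(attr in cond_b and cond_b[attr] != value
               for attr, value in cond_a.items())


def is_redundant(a, b):
    # Same inputs, overlapping outputs, non-conflicting conditions.
    return (a.inputs == b.inputs
            and bool(a.outputs & b.outputs)
            and not conditions_conflict(a, b))
```

Under this encoding, two sources keyed on the same gene name that both return SNP data are redundant unless one is restricted to human data and the other to mouse data.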

  8. Query and Query Plan • Query: SQL-like format: select t1, t2, …, tn (search term set ST) from the deep web where in1=e1 and in2=e2 and … and inm=em (input term set INT) • Query Plan: a DAG of data source nodes that “covers” the user query • Query plan nodes: output attributes may be user-requested search terms • Starting node: its input attributes are input terms in the query • Edges: data source dependency
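One way to picture a plan that "covers" a query is the sketch below. The representation is an assumption made for illustration: `plan_nodes` maps a source name to an (inputs, outputs) pair of sets, `edges` maps a source to the sources that depend on it, and coverage is read as "starting nodes consume only query input terms, and every search term is produced by some node". The slide does not give a formal definition.

```python
def starting_nodes(edges, plan_nodes):
    # Starting nodes are the plan nodes with no incoming dependency edge.
    targets = {v for successors in edges.values() for v in successors}
    return [n for n in plan_nodes if n not in targets]


def covers(plan_nodes, edges, search_terms, input_terms):
    # Assumed coverage test, see the lead-in for the reading used here.
    starts = starting_nodes(edges, plan_nodes)
    starts_ok = all(plan_nodes[n][0] <= input_terms for n in starts)
    produced = set().union(*(outputs for _, outputs in plan_nodes.values()))
    return starts_ok and search_terms <= produced


# Hypothetical two-node plan: A takes t1 and feeds t2 to B.
plan = {"A": ({"t1"}, {"t2"}), "B": ({"t2"}, {"t3", "t4"})}
deps = {"A": ["B"]}
plan_covers_query = covers(plan, deps, {"t3", "t4"}, {"t1"})
```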

  9. Algorithm Overview (1) • Find the part of the query plan that needs to be replaced • Impacted sub-plan: the sub-graph reachable from the unavailable data source nodes • Minimal impacted sub-plan: the impacted sub-plan minus the data source nodes that remain usable thanks to the given data redundancy
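The impacted sub-plan is plain graph reachability, which a breadth-first search computes directly. In this sketch the adjacency-dict representation and the function name `impacted_subplan` are assumptions chosen for illustration.

```python
from collections import deque


def impacted_subplan(edges, unavailable):
    """Return every node reachable from the unavailable nodes, inclusive.

    edges maps a data source to the sources that depend on it.
    """
    seen = set(unavailable)
    queue = deque(unavailable)
    while queue:
        node = queue.popleft()
        for successor in edges.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen
```

For a plan with edges A→B, A→C, B→D, C→E and I→J, losing B and I impacts B, D, I and J while A, C and E are untouched.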

  10. Algorithm Overview (2) • Find Maximal Fixable Sub-Query • The sub-query corresponding to the minimal impacted sub-plan • New Sub-Plan Generation • Use our existing query planning algorithm • Example sub-query: select t3, t4 where input=t1

  11. Minimal Impacted Sub-Plan Algorithm • 1. Identify unavailable data sources (in the example, {B, I}) • 2. Find the sub-graph reachable from them (the impacted sub-plan) • 3. Cascading-crash conditions for data source X which is dependent on data source D • A. At least one data source sharing redundant data with D is not crashed • B. At least one such data source has the same usage as D
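The three steps can be sketched as a fixpoint computation. Two assumptions are made here that the slide does not spell out. First, conditions A and B are read together as the test under which a dependent X stays usable: if some live source is redundant with D and has the same usage, the crash of D does not cascade to X. Second, the hypothetical `substitutes` map is taken as given and lists, for each source, the redundant sources with the same usage.

```python
def minimal_impacted_subplan(edges, unavailable, substitutes):
    """Sketch of steps 1-3 under the reading described above.

    edges:       source -> sources that depend on it
    unavailable: self-crashed sources (step 1)
    substitutes: source -> redundant sources with the same usage (assumed input)
    """
    crashed = set(unavailable)
    changed = True
    while changed:  # repeat until no further source crashes
        changed = False
        for d in list(crashed):
            # Conditions A and B: a same-usage redundant source is still alive,
            # so the dependents of d remain usable.
            if any(s not in crashed for s in substitutes.get(d, ())):
                continue
            for x in edges.get(d, ()):
                if x not in crashed:
                    crashed.add(x)  # cascading crash
                    changed = True
    return crashed
```

The loop reruns until nothing changes because a substitute that looked alive on one pass may itself crash on a later one.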

  12. Minimal Impacted Sub-Plan Fixability • How much of the minimal impacted sub-plan can be fixed using other data sources by taking advantage of data redundancy • Dead Attribute • No un-crashed data source can provide the attribute as its output attribute • Plan Fixability Categorization • Fully fixable: only self-crashed nodes, no dead attribute • Partially fixable: only self-crashed nodes, dead attribute • Cascading fully fixable: cascading-crashed nodes, no dead attribute • Cascading partially fixable: cascading-crashed nodes, dead attribute
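The four categories follow from two yes/no questions: did any node crash by cascade, and is any needed attribute dead? A minimal sketch, assuming the caller supplies the crashed sets and a hypothetical `outputs_by_source` map from source name to its output attributes:

```python
def dead_attributes(needed, outputs_by_source, crashed):
    # An attribute is dead when no un-crashed source outputs it.
    live = set()
    for name, outputs in outputs_by_source.items():
        if name not in crashed:
            live |= outputs
    return set(needed) - live


def classify_fixability(self_crashed, crashed, needed, outputs_by_source):
    # Any crashed node that is not self-crashed must have crashed by cascade.
    cascading = bool(set(crashed) - set(self_crashed))
    has_dead = bool(dead_attributes(needed, outputs_by_source, crashed))
    if cascading:
        return "cascading partially fixable" if has_dead else "cascading fully fixable"
    return "partially fixable" if has_dead else "fully fixable"
```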

  13. Maximal Fixable Sub-Query Generation • For each source in the minimal impacted sub-plan, we compute • Input set IN • Requested output set RO • Linking set L • Maximal Fixable Sub-Query • Input term set: input attributes of all data sources in the minimal impacted sub-plan without incoming edges (self-crashed data sources) • Search term set • User-requested search terms that are supposed to be covered by the minimal impacted sub-plan • Terms in the linking sets of the nodes in the minimal impacted sub-plan that have outgoing edges to data sources outside the minimal impacted sub-plan • Example: IN={t1}, L={t3,t4}
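The construction of the sub-query can be sketched as follows. The per-node sets IN, RO and L are taken as given inputs (`inputs`, `requested`, `linking`; these names are hypothetical), and "without incoming edges" is read as "without incoming edges from inside the sub-plan", which is an assumption.

```python
def maximal_fixable_subquery(subplan, edges, inputs, requested, linking):
    """Return (input_terms, search_terms) for the minimal impacted sub-plan.

    subplan:   set of source names in the minimal impacted sub-plan
    edges:     source -> sources that depend on it
    inputs:    source -> IN, requested: source -> RO, linking: source -> L
    """
    # Nodes that receive an edge from another node of the sub-plan.
    internal_targets = {v for u in subplan
                        for v in edges.get(u, ()) if v in subplan}

    input_terms = set()
    for node in subplan:
        if node not in internal_targets:
            input_terms |= inputs[node]

    search_terms = set()
    for node in subplan:
        search_terms |= requested.get(node, set())
        # Linking terms matter only if a source outside the sub-plan needs them.
        if any(v not in subplan for v in edges.get(node, ())):
            search_terms |= linking.get(node, set())
    return input_terms, search_terms
```

For a hypothetical sub-plan {I, J} where I takes t1, the user requested t3 from I, and J passes t4 to a source outside the sub-plan, the result is the sub-query "select t3, t4 where input=t1".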

  14. Roadmap • Introduction and Motivation • Problem Formulation in Detail • Our Self-Healing Approach • Evaluation • Conclusion

  15. Evaluation • 12 biological deep web data sources • 20 queries in 4 groups • Each group corresponds to one fixability category • Methods compared • Baseline: start from scratch • Our method

  16. Query Answering Time Comparison • 1. Our method is more efficient in fixing failed query plans than the baseline method • 2. Our method is at least 20% faster for all queries in this figure

  17. Query Result Quality Comparison • For 18 out of 20 cases, the recall from our method is exactly the same as the ideal recall from the baseline method

  18. Conclusion • Propose a self-healing approach to support fault tolerance for deep web searches • Find the minimal impacted sub-plan caused by unavailable/inaccessible data sources • Find a new plan to replace the minimal impacted sub-plan • Our method outperforms a baseline method in terms of both efficiency and result quality

  19. Questions? Contact us: Fan Wang wangfa@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu
