
Structural Web Search Using a Graph-Based Discovery System

Nitish Manocha, Diane J. Cook, and Lawrence B. Holder, University of Texas at Arlington. cook@cse.uta.edu, http://www-cse.uta.edu/~cook





Presentation Transcript


  1. Structural Web Search Using a Graph-Based Discovery System
     Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
     University of Texas at Arlington
     cook@cse.uta.edu
     http://www-cse.uta.edu/~cook

  2. Structured Web Search
     • Existing search engines use linear feature matching
     • The Web contains structural information as well
     • Hyperlink information
     • Web viewed as a graph [Kleinberg]
     • Subdue searches based on structure
     • Use Subdue as the foundation of a structural search engine
     • Incorporating WordNet allows for synonym matches

  3. SUBDUE
     • Discovers structural patterns in input graphs
     • A substructure is a connected subgraph
     • An instance of a substructure is a subgraph that is isomorphic to the substructure definition
     • Pattern discovery, classification, clustering
     [Figure: an input database of shapes; substructure S1 in graph form (a triangle object on a square object, with shape edges); and the compressed database in which instances are replaced by S1]

  4. Subdue Algorithm
     • Start with individual vertices
     • Keep only the best substructures on the queue
     • Expand each substructure by adding an edge/vertex
     • Compress the graph and repeat to generate a hierarchical description
     • Optional use of background knowledge
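
The loop on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Subdue's implementation: the function name `discover`, the label-only `signature` (a stand-in for the isomorphism test), and the `instances × size` score (a stand-in for Subdue's compression-based evaluation) are all assumptions made for the example, and the compression and background-knowledge steps are left out.

```python
from collections import defaultdict

def discover(vertices, edges, beam_width=4, rounds=2):
    """vertices: {id: label}; edges: list of (src, dst, label).
    Returns (signature, instances) for the best-scoring pattern found."""

    def signature(inst):
        # Label-only summary of an instance; an assumed stand-in for a
        # real graph-isomorphism test.
        vs, es = inst
        return (tuple(sorted(vertices[v] for v in vs)),
                tuple(sorted((vertices[edges[i][0]], edges[i][2],
                              vertices[edges[i][1]]) for i in es)))

    def group(instances):
        groups = defaultdict(set)
        for inst in instances:
            groups[signature(inst)].add(inst)
        return groups

    def score(sig, insts):
        # Assumed stand-in for compression value: instances x pattern size.
        return len(insts) * (len(sig[0]) + len(sig[1]))

    # Start with individual vertices (instances with no edges).
    queue = group((frozenset([v]), frozenset()) for v in vertices)
    best = None
    for _ in range(rounds):
        # Expand every instance by one incident edge.
        expanded = set()
        for insts in queue.values():
            for vs, es in insts:
                for i, (src, dst, _label) in enumerate(edges):
                    if i not in es and (src in vs or dst in vs):
                        expanded.add((vs | {src, dst}, es | {i}))
        if not expanded:
            break
        # Keep only the best substructures on the queue.
        ranked = sorted(group(expanded).items(),
                        key=lambda kv: score(*kv), reverse=True)
        queue = dict(ranked[:beam_width])
        if best is None or score(*ranked[0]) > score(*best):
            best = ranked[0]
    return best

vertices = {1: "triangle", 2: "square", 3: "triangle", 4: "square", 5: "circle"}
edges = [(1, 2, "on"), (3, 4, "on"), (5, 1, "on")]
pattern, instances = discover(vertices, edges)
```

On this toy input the repeated "triangle on square" pattern wins because it has two instances, which mirrors the idea of preferring substructures that compress the graph well.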

  5. Inexact Graph Match
     • Some variations may occur between instances
     • Want to abstract over minor differences
     • Difference = cost of transforming one graph to make it isomorphic to another
     • Match if cost/size < threshold
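
The matching rule on this slide can be illustrated with a small sketch. The assumptions are loud ones: the vertex correspondence between the two graphs is taken as given (shared vertex ids), every edit costs 1, and "size" is taken to be the vertex count plus edge count of the larger graph. Subdue itself searches over vertex mappings; the names `transformation_cost` and `is_match` are hypothetical.

```python
def transformation_cost(g1, g2):
    """Unit-cost edits needed to turn g1 into g2.
    Each graph is (vertices, edges): {id: label} and a set of (src, dst, label).
    Assumes vertices with the same id correspond to each other."""
    v1, e1 = g1
    v2, e2 = g2
    cost = 0
    for vid in set(v1) | set(v2):
        if v1.get(vid) != v2.get(vid):
            cost += 1          # vertex inserted, deleted, or relabeled
    cost += len(set(e1) ^ set(e2))  # edges present in only one graph
    return cost

def is_match(g1, g2, threshold):
    """Apply the slide's rule: match if cost / size < threshold."""
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return transformation_cost(g1, g2) / size < threshold

query = ({1: "page", 2: "page", 3: "jobs"},
         {(1, 2, "hyperlink"), (2, 3, "word")})
candidate = ({1: "page", 2: "page", 3: "work"},
             {(1, 2, "hyperlink"), (2, 3, "word")})
```

Here the candidate differs from the query by one vertex label, so the cost is 1 and the size is 5; the pair matches at a threshold of 0.25 but not at 0.2, because the comparison is strict.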

  6. Application Domains
     • Protein data
     • Human Genome DNA data
     • Spatial-temporal domains
     • Earthquake data
     • Aircraft Safety and Reporting System
     • Telecommunications data
     • Program source code
     • Web data

  7. Represent Web as Graph
     • Breadth-first search of domain to generate graph
     • Nodes represent pages / documents
     • Edges represent hyperlinks
     • Additional nodes represent document keywords
     [Figure: page vertices joined by hyperlink edges, with word edges to keyword vertices such as subdue, texas, projects, university, work, parallel, group, learning, robotics, planning]
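
A minimal sketch of this graph construction, assuming the site has already been fetched into an in-memory dictionary (the `site` structure, the function name `build_graph`, and the choice of one keyword vertex per page/word occurrence are illustrative assumptions, not details from the slides):

```python
from collections import deque

def build_graph(site, start):
    """Breadth-first traversal of `site` starting at `start`.
    site: {url: {"words": [...], "links": [...]}} (hypothetical layout).
    Returns (vertices, edges, page_id): {id: label}, [(src, dst, label)],
    and a url -> vertex id map."""
    vertices, edges, page_id = {}, [], {}
    next_id = 1
    page_id[start] = next_id
    vertices[next_id] = "page"
    next_id += 1
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # One keyword vertex per word on the page, joined by a "word" edge.
        for word in site[url]["words"]:
            vertices[next_id] = word
            edges.append((page_id[url], next_id, "word"))
            next_id += 1
        for target in site[url]["links"]:
            if target not in site:
                continue  # link leaves the crawled domain
            if target not in page_id:
                page_id[target] = next_id
                vertices[next_id] = "page"
                next_id += 1
                queue.append(target)
            edges.append((page_id[url], page_id[target], "hyperlink"))
    return vertices, edges, page_id

site = {
    "/": {"words": ["texas"], "links": ["/projects"]},
    "/projects": {"words": ["subdue"], "links": ["/", "http://elsewhere.example"]},
}
vertices, edges, page_id = build_graph(site, "/")
```

Every page vertex carries the same label, `page`, so that a structural query can match any page; the URL is kept only in the side table `page_id`.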

  8. WebSubdue’s Structural Search
     • Formulate query as graph
     • Use Subdue’s predefined substructure option to search for instances of query
     [Figure: example pages and links with labels Instructor, Teaching, Research, Publication, Robotics, Postscript | PDF, http]

  9. Query: Find all pages which link to a page containing term ‘Subdue’
     [Figure: page -hyperlink-> page -word-> Subdue]

     Query graph file:
       /* Vertex ID Label */
       s
       v 1 page
       v 2 page
       v 3 Subdue
       /* Edge Vertex 1 Vertex 2 Label */
       d 1 2 hyperlink
       d 2 3 word

     Subgraph vertices:
       • 1 page
         • URL: http://cygnus.uta.edu
       • 7 page
         • URL: http://cygnus.uta.edu/projects.html
       • Subdue
       • [1->7] hyperlink
       • [7->8] word
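
To make the search step concrete, here is a small backtracking matcher that finds exact instances of a labeled, directed query graph inside a larger graph. It is an illustrative sketch rather than Subdue's predefined-substructure code; the function name `find_instances` and the dictionary-based graph layout are assumptions, and the toy data graph only borrows the vertex ids shown on the slide.

```python
def find_instances(query, graph):
    """query, graph: (vertices, edges) with {id: label} and [(src, dst, label)].
    Returns a list of {query_vertex: graph_vertex} mappings."""
    q_vertices, q_edges = query
    g_vertices, g_edges = graph
    g_edge_set = set(g_edges)
    order = list(q_vertices)
    results = []

    def consistent(mapping):
        # Every query edge whose endpoints are both mapped must exist
        # in the data graph with the same direction and label.
        return all((mapping[s], mapping[d], label) in g_edge_set
                   for s, d, label in q_edges
                   if s in mapping and d in mapping)

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        q = order[len(mapping)]
        for g, label in g_vertices.items():
            if label == q_vertices[q] and g not in mapping.values():
                mapping[q] = g
                if consistent(mapping):
                    extend(mapping)
                del mapping[q]

    extend({})
    return results

query = ({1: "page", 2: "page", 3: "Subdue"},
         [(1, 2, "hyperlink"), (2, 3, "word")])
graph = ({1: "page", 2: "texas", 7: "page", 8: "Subdue"},
         [(1, 2, "word"), (1, 7, "hyperlink"), (7, 8, "word")])
matches = find_instances(query, graph)
```

Each returned mapping corresponds to one instance of the query; on the toy graph the only instance maps the query's vertices 1, 2, 3 to data vertices 1, 7, 8.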

  10. Search for Presentation Pages
      • WebSubdue: 22 instances
      • AltaVista: 12 instances, using the query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
      [Figure: query graph of page vertices connected by hyperlink edges]

  11. Search for Reference Pages
      • Search for a page with at least 35 in-links
      • WebSubdue found 5 pages in www-cse
      • AltaVista cannot perform this type of search
      [Figure: page vertices (page, page, page, …), each with a hyperlink edge to a single page vertex]
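
For this particular query the structural pattern reduces to an in-degree count, which can be sketched directly. The function name and graph layout below are assumptions for illustration; WebSubdue itself expresses the query as a graph with many hyperlink edges pointing at one page.

```python
from collections import Counter

def pages_with_min_in_links(vertices, edges, min_in_links):
    """vertices: {id: label}; edges: [(src, dst, label)].
    Returns sorted ids of page vertices with enough incoming hyperlinks."""
    in_links = Counter(dst for _src, dst, label in edges
                       if label == "hyperlink")
    return sorted(v for v, label in vertices.items()
                  if label == "page" and in_links[v] >= min_in_links)

vertices = {1: "page", 2: "page", 3: "page", 4: "subdue"}
edges = [(1, 3, "hyperlink"), (2, 3, "hyperlink"),
         (3, 1, "hyperlink"), (3, 4, "word")]
reference_pages = pages_with_min_in_links(vertices, edges, 2)
```

With the slide's setting, `min_in_links` would be 35; the toy graph uses 2 so that a single page qualifies.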

  12. Inclusion of WordNet
      • When generating graph
        • Use common stopword list
      • When searching for subgraph instances
        • Morphology functions
          • October = Oct
          • teaching = teach
        • Synsets
        • Optional allowance of synonyms

  13. Search for pages on ‘jobs in computer science’
      • Inexact match: allow one level of synonyms
      • WebSubdue found 33 matches
      • Words include employment, work, job, problem, task
      • AltaVista found 2 matches
      [Figure: a page vertex with word edges to jobs, computer, and science]
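
The word-matching behavior described on the last two slides can be approximated as follows. To keep the sketch self-contained it does not call WordNet: the `STEMS` table and the single entry in `SYNSETS` are hand-written stand-ins built from the examples on the slides (the `jobs` to `job` entry is an added assumption), and `words_match` is a hypothetical helper name.

```python
# Hand-written stand-ins for WordNet's morphology and synset lookups.
STEMS = {"teaching": "teach", "october": "oct", "jobs": "job"}
SYNSETS = [{"job", "employment", "work", "problem", "task"}]

def base_form(word):
    """Lower-case the word and reduce it to a base form if one is listed."""
    w = word.lower()
    return STEMS.get(w, w)

def words_match(query_word, page_word, allow_synonyms=False):
    """Exact match on base forms, optionally widened by one level of synonyms."""
    q, p = base_form(query_word), base_form(page_word)
    if q == p:
        return True
    if allow_synonyms:
        return any(q in synset and p in synset for synset in SYNSETS)
    return False

strict = words_match("jobs", "employment")
relaxed = words_match("jobs", "employment", allow_synonyms=True)
```

Without synonyms only morphological variants match; with `allow_synonyms=True` a page containing `employment` or `work` can also satisfy a query for `jobs`, which is the kind of widening behind the larger match count reported on the slide.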

  14. Search for ‘authority’ hub and authority pages
      • WebSubdue found 3 hub (and 3 authority) pages
      • AltaVista cannot perform this type of search
      • Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
      • WebSubdue found 13 matches
      [Figure: query graph with three page vertices labeled HUBS, hyperlink edges, three page vertices labeled AUTHORITIES, and word edges to three ‘algorithms’ vertices]
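
The two numbers on this slide are tied together by the inexact-match rule, match if cost/size < threshold. A short worked calculation shows the query size they imply; the breakdown of that size into vertices and edges is an assumption about the figure, not something the slide states.

```latex
\[
\frac{\text{cost}}{\text{size}} < t
\quad\Longrightarrow\quad
\text{cost} < t \cdot \text{size},
\qquad
t = 0.2,\ \ t \cdot \text{size} = 4.2
\quad\Longrightarrow\quad
\text{size} = \frac{4.2}{0.2} = 21 .
\]
% One reading consistent with the figure (assumed): size counts vertices plus edges,
% with 3 hub pages, 3 authority pages, and 3 word vertices (9 vertices),
% 3 x 3 = 9 hyperlink edges and 3 word edges (12 edges), giving 9 + 12 = 21.
```

Under that reading, up to four unit-cost transformations of the query graph are tolerated, which would explain why the inexact search returns more matches than the exact one.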

  15. Subdue Learning from Web Data
      • Distinguish professors’ and students’ web pages
        • Learned concept (professors have “box” in address field)
      • Distinguish online stores and professors’ web pages
        • Learned concept (stores have more levels in graph)
      [Figure: a page vertex with a word edge to ‘box’; a multi-level graph of page vertices]

  16. Conclusions
      • WebSubdue can be used to search for structural web data
      • Could be enhanced with additional WordNet features such as synset path length
      • Efficient structural search is necessary for the future of web search tools

  17. To Learn More
      cygnus.uta.edu/subdue
      cook@cse.uta.edu
      http://www-cse.uta.edu/~cook
