
HITS Hypertext Induced Topic Selection

Gyozo Gidofalvi, Uppsala Database Laboratory


Presentation Transcript


  1. HITS: Hypertext Induced Topic Selection
     Gyozo Gidofalvi, Uppsala Database Laboratory

  2. Idea
     • Given a set of web pages that are all concerned with the same topic, we want to find
       • the most interesting pages
     • By examining the internal link structure in the set, we want to find
       • the pages that are most likely to guide us to interesting pages

  3. Foundation
     • Identify hubs and authorities
     • The definition is mutually recursive:
       • A good hub points to good authorities
       • A good authority is pointed to by good hubs
     • The hub value of a site is the sum of the authority values of the sites that it points to.
     • The authority value of a site is the sum of the hub values of the sites that point to it.
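
The two definitions above can be written compactly. The notation is an assumption made here and does not appear on the slide: h(p) and a(p) are the hub and authority values of page p, and "p → q" means that p links to q.

```latex
h(p) = \sum_{q \,:\, p \to q} a(q),
\qquad
a(p) = \sum_{q \,:\, q \to p} h(q)
```

Each value is defined in terms of the other, which is why the scores are computed by repeated updates rather than in a single pass.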

  4. Pseudo-code
     1. Find a set of pages about a given subject
        • You may use an existing search engine (such as Google)
        • In the assignment, you are provided with a set of pages and their links
     2. Preprocess the link structure
     3. Initialize the hub and authority vectors
     4. Normalize the vectors to length 1
     5. Calculate the new authority vector based on the link structure and the hub vector
     6. Calculate the new hub vector based on the link structure and the authority vector
     7. If the new values of the hub and authority vectors are similar enough to the old ones, we are done; otherwise repeat from step 4
     8. Sort the vectors and find the top authorities and hubs
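
The iteration in the pseudo-code might be sketched in Python as follows. This is a minimal sketch under several assumptions that the slides do not fix: the link structure is a dict mapping each page to the set of pages it links to, both vectors start at all ones, the hub update uses the freshly computed authority values, and the stopping rule compares the summed absolute change against a tolerance. The names `hits`, `normalize`, `tol`, and `max_iter` are illustrative, and the toy graph is hypothetical rather than one of the assignment's input files.

```python
import math


def normalize(vec):
    """Scale a score dict to Euclidean length 1 (left unchanged if all zeros)."""
    length = math.sqrt(sum(v * v for v in vec.values()))
    return {p: v / length for p, v in vec.items()} if length > 0 else dict(vec)


def hits(links, tol=1e-9, max_iter=1000):
    """Iterate the hub/authority updates on `links` (page -> set of targets)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = normalize({p: 1.0 for p in pages})
    auth = normalize({p: 1.0 for p in pages})
    for _ in range(max_iter):
        # New authority: sum of the hub values of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        new_auth = normalize(new_auth)
        # New hub: sum of the (new) authority values of the pages it links to.
        new_hub = normalize(
            {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        )
        # Assumed stopping rule: total absolute change below a tolerance.
        change = sum(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p]) for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical toy graph, not one of the assignment's input files.
links = {"a": {"c", "d"}, "b": {"c"}, "c": set(), "d": set()}
hub, auth = hits(links)
top_hubs = sorted(hub, key=hub.get, reverse=True)
top_auths = sorted(auth, key=auth.get, reverse=True)
```

In this toy graph, page c is linked to by both a and b, so it ends up as the top authority, and a, which links to both c and d, ends up as the top hub.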

  5. Calculating the hub and authority vectors
     • First, we initialize the hub and authority vectors to some value.
       • What initial values are appropriate?
       • Does it matter what we initialize to?
     • Next, we calculate the new hub and authority vectors using the formulas
       • Does it matter in which order these calculations happen?
       • Do we need to normalize the vectors in each iteration?
     • How do we know when to stop?
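
One way to reason about these questions is to write the updates in matrix form. The notation is an assumption made here and is not taken from the slides: A is the adjacency matrix with A_ij = 1 when page i links to page j, and the vectors a and h hold the authority and hub values. Ignoring normalization, and using the new authority vector in the hub update:

```latex
\mathbf{a}^{(k+1)} = A^{\top}\,\mathbf{h}^{(k)}, \qquad
\mathbf{h}^{(k+1)} = A\,\mathbf{a}^{(k+1)}
\quad\Longrightarrow\quad
\mathbf{a}^{(k+1)} = \bigl(A^{\top}A\bigr)\,\mathbf{a}^{(k)}, \qquad
\mathbf{h}^{(k+1)} = \bigl(AA^{\top}\bigr)\,\mathbf{h}^{(k)}
```

The right-hand form holds after the first round and has the shape of a power iteration. Without rescaling, the entries typically grow or shrink geometrically from round to round. Normalizing changes only the overall scale, not the relative order of the pages.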

  6. Preprocessing
     • Preprocessing will improve the accuracy
     • Several links may point to the same page:
       • http://www.it.uu.se
       • http://www.it.uu.se/index.html
       • www.it.uu.se
     • Remove site-internal links, as these can make a site seem more important than it really is.
     • Remove links to sites for which we do not know the link structure.
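
The three preprocessing steps might look like the following Python sketch. The rules are assumptions, not the directions on the lab course web page: a missing scheme is treated as http, a trailing index.html or slash is dropped, "site" is taken to mean the host name, and a site's link structure counts as known only if it appears as a source in the input. The names `canonical` and `preprocess` and the example.org and example.com URLs are illustrative.

```python
from urllib.parse import urlsplit


def canonical(url):
    """Map URL variants of the same page to one key (assumed rules)."""
    if "://" not in url:
        url = "http://" + url  # e.g. "www.it.uu.se" has no scheme
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    path = path.rstrip("/")
    return host + path


def preprocess(raw_links):
    """Clean a {source_url: [target_url, ...]} mapping."""
    known = {canonical(src) for src in raw_links}
    cleaned = {page: set() for page in known}
    for src, targets in raw_links.items():
        s = canonical(src)
        s_host = s.split("/", 1)[0]
        for t in targets:
            t = canonical(t)
            same_site = t.split("/", 1)[0] == s_host
            # Keep only links between different sites whose structure we know.
            if t in known and not same_site:
                cleaned[s].add(t)
    return cleaned


raw = {
    "http://www.it.uu.se": [
        "http://www.it.uu.se/index.html",  # site-internal: dropped
        "http://www.example.org/",         # known, other site: kept
        "http://unknown.example.com/x",    # link structure unknown: dropped
    ],
    "http://www.example.org/index.html": ["www.it.uu.se"],
}
cleaned = preprocess(raw)
```

With these assumed rules, all three spellings of the www.it.uu.se start page collapse to the same key, so links to any of them count toward a single page.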

  7. The assignment
     • You will mine four different link structures for four different queries.
     • We have done the web crawling and some of the preprocessing for you!
       • Input files are on the lab course web page
     • However, you must
       • Do some preprocessing yourselves
         • Directions for preprocessing are on the lab course web page
       • Validate your implementation
         • Think of how to verify your solution
         • Your validation does not have to be fancy, or even automated
         • At least implement the test case on the following slide, and see what output it gives you.
         • Make sure that the test case output is reasonable
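
As one possible way to judge whether an output is "reasonable", a few properties can be checked that should hold for any result of the update rules, assuming non-negative starting values and vectors normalized to length 1. The function `check_scores` below is a hypothetical helper, not part of the assignment, and the two-page graph is made up for illustration; it is not the test case on the following slide.

```python
import math


def check_scores(links, hub, auth, tol=1e-6):
    """Return a list of problems found in a HITS result (empty = none found)."""
    problems = []
    for name, vec in (("hub", hub), ("authority", auth)):
        length = math.sqrt(sum(v * v for v in vec.values()))
        if abs(length - 1.0) > tol:
            problems.append(f"{name} vector has length {length:.4f}, expected 1")
        if any(v < -tol for v in vec.values()):
            problems.append(f"{name} vector has negative entries")
    linked_to = {q for targets in links.values() for q in targets}
    for p in hub:
        # A page nobody links to receives an empty sum as its authority value.
        if p not in linked_to and abs(auth[p]) > tol:
            problems.append(f"{p} has no in-links but authority {auth[p]:.4f}")
        # A page that links to nothing receives an empty sum as its hub value.
        if not links.get(p) and abs(hub[p]) > tol:
            problems.append(f"{p} has no out-links but hub {hub[p]:.4f}")
    return problems


# Hypothetical two-page graph: a links to b.
links = {"a": {"b"}, "b": set()}
good = check_scores(links, hub={"a": 1.0, "b": 0.0}, auth={"a": 0.0, "b": 1.0})
bad = check_scores(links, hub={"a": 1.0, "b": 0.0}, auth={"a": 0.6, "b": 0.8})
```

Here `good` is an empty list, while `bad` reports that page a has a non-zero authority value even though nothing links to it.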

  8. Example (test case)
     • Rank the pages according to hub and authority value in this link structure:
     [Figure: link structure over the pages a, b, c, d; the links themselves are not recoverable from the transcript]
