
Automatic Web Page Categorization by Link and Context Analysis

Giuseppe Attardi, Antonio Gulli, Fabrizio Sebastiani. Introduction: document retrieval on the Web relies either on search engines (keyword-based searches) or on classified categories, where each category lists the Web sites relevant to it.


Presentation Transcript


  1. Automatic Web Page Categorization by Link and Context Analysis • Giuseppe Attardi, Antonio Gulli, Fabrizio Sebastiani

  2. Introduction • Document retrieval on the Web • Search engines – keyword-based searches • Classified categories – each category lists Web sites relevant to that category

  3. Introduction • Document d → Category c • Requires understanding of both d and c • Has traditionally been accomplished manually • Disadvantages • Growth rate, number of web pages • Highly subjective, lower quality

  4. Introduction • Automatic classification • Text categorization • Build the representation of a category using a training set of documents pre-categorized under it • Compare representation of a given document d with representation of the category c to decide if d belongs to c • Other approaches • Basic idea – categorization by content
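
The content-based scheme on this slide (build a representation of a category from pre-categorized training documents, then compare a new document against it) can be sketched roughly as follows. This is only an illustration: the bag-of-words profile, the cosine comparison, the `threshold` value, and all function names are assumptions made for this sketch, not details given in the presentation.

```python
import math
from collections import Counter


def tokenize(text):
    # Lower-case and keep purely alphabetic tokens (a deliberately crude tokenizer).
    return [word for word in text.lower().split() if word.isalpha()]


def build_profile(training_docs):
    # Category representation: summed term frequencies of its training documents.
    profile = Counter()
    for doc in training_docs:
        profile.update(tokenize(doc))
    return profile


def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def belongs_to(document, profile, threshold=0.3):
    # Decide whether document d belongs to category c (threshold is an assumed value).
    return cosine(Counter(tokenize(document)), profile) >= threshold


biology = build_profile(["cell biology and genetics", "molecular biology of the cell"])
print(belongs_to("genetics and cell biology", biology))
print(belongs_to("football league results", biology))
```

Note that this approach needs the text of the document itself, which is exactly what categorization by context avoids.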

  5. Introduction • Categorization by context • Uses the context surrounding a link • Uses relevance hints that are present in the structure of HTML documents • Advantage • Ability to deal with multimedia material since it analyzes context and not content • Theseus [Teseo]

  6. Improving Web search engines • AltaVista: “refine” capability • Infoseek: grouping of query results, retrieving similar pages • Automatic categorization techniques → better Web retrieval tools, organized material, e.g. Lycos, Infoseek (Content Classification Engine - CCE)

  7. Categorization by context • Basic idea • The referring Web page must contain enough hints about the document’s content • These hints are sufficient to classify the document • What are these hints? • Anchor text of a link: <A>…</A> • Page title • Section titles

  8. Architecture • Tasks performed • Spidering • Structure analysis • URL categorization • Weight combination • Catalog update

  9. Spidering and HTML Structure Analysis <html> <head> <title> Yahoo! – Science: Biology </title> </head> <body> ... <ul> <li> <a href="http://esg-www.mit.edu:8001/esgbio/">MIT Biology Hypertextbook</a> - introductory resource including information on chemistry, biochemistry, genetics, cell and molecular biology, and immunology. <li> ...

  10. Spidering and HTML Structure Analysis • The following URL context path is created: http://esg-www.mit.edu:8001/esgbio: “MIT Biology Hypertextbook”: “introductory resource including information on chemistry, biochemistry, genetics, cell and molecular biology, and immunology”: “Yahoo! – Science: Biology”
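
A context path like the one above can be extracted with a small HTML parser. The sketch below uses Python's standard `html.parser` module; it is not the Theseus analyzer (which the slides describe as written in Perl and Java). The class name, the function names, the shortened sample page, and the rule that a link's description ends at the next `<li>` are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class ContextPathParser(HTMLParser):
    """Collects, for every link, its anchor text and the text that follows it."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # one dict per <a href=...> found
        self._in_title = False
        self._in_anchor = False
        self._current = None     # link whose trailing description is being read

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"url": href, "anchor": "", "description": ""}
                self.links.append(self._current)
                self._in_anchor = True
        elif tag == "li":
            self._current = None  # assumed rule: a new list item ends the description

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False
        elif tag in ("li", "ul"):
            self._current = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current is not None:
            key = "anchor" if self._in_anchor else "description"
            self._current[key] += data


def _clean(text):
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def context_paths(html):
    # Returns one (url, anchor text, description, page title) tuple per link.
    parser = ContextPathParser()
    parser.feed(html)
    parser.close()
    title = _clean(parser.title)
    return [
        (link["url"], _clean(link["anchor"]),
         _clean(link["description"]).lstrip("- "), title)
        for link in parser.links
    ]


SAMPLE_PAGE = """<html><head><title> Yahoo! – Science: Biology </title></head>
<body><ul>
<li><a href="http://esg-www.mit.edu:8001/esgbio/">MIT Biology Hypertextbook</a>
 - introductory resource including information on chemistry.
<li><a href="http://example.org/">Hypothetical second entry</a> - placeholder text
</ul></body></html>"""

for path in context_paths(SAMPLE_PAGE):
    print(": ".join(path))
```

The page title goes last in each tuple, mirroring the order of the context path on the slide. Section titles, which the slides also list as hints, are not handled here.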

  11. URL Categorization • One URL may have several context paths • Category tree – each node identifies a category • URL categorization finds the most appropriate categories to which the URL should belong • Produces a sequence of weights, one for each node in the category tree • URL: N1=w1, N2=w2, N3=w3, …, Nn=wn • Each weight wi is a degree of confidence that the URL belongs under node Ni

  12. Weight Combination • Weights from all context paths for a URL are added and normalized • If the weight of a node is greater than a certain threshold, the URL is categorized under that node
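
The combination step can be illustrated with a few lines of Python. The slide says only that weights are "added and normalized", so dividing by the grand total is an assumed normalization. The node names, the example weights, and the `threshold` of 0.5 are likewise hypothetical.

```python
from collections import defaultdict


def combine_weights(path_weights, threshold=0.5):
    """path_weights: one {category node: weight} dict per context path of a URL."""
    totals = defaultdict(float)
    for weights in path_weights:
        for node, weight in weights.items():
            totals[node] += weight          # add weights from all context paths

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}, []

    # Assumed normalization: scale so that the combined weights sum to 1.
    normalized = {node: weight / grand_total for node, weight in totals.items()}
    selected = sorted(node for node, weight in normalized.items() if weight > threshold)
    return normalized, selected


# Two hypothetical context paths for the same URL.
paths = [
    {"Science/Biology": 0.8, "Education": 0.2},
    {"Science/Biology": 0.6, "Science/Chemistry": 0.4},
]
normalized, selected = combine_weights(paths)
print(normalized)
print(selected)
```

With these made-up inputs only "Science/Biology" exceeds the threshold, so the URL would be filed under that node alone.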

  13. Theseus • Theseus is a tool built to verify the validity of the method • Components • TreeTagger: a part-of-speech tagger • HTML parser written in Perl • HTML structure analyzer (produces the context tree) written in Java • Experimented using the Arianna catalog

  14. Theseus: Exploiting Noun Phrases • What is noun-phrase analysis? • “a high school female student” • without noun-phrase analysis → “high school” • with noun-phrase analysis → detects that the subject of the phrase is not “high school” • Does it improve the effectiveness of classification? • Fewer documents per category • Overall improvement of about 5%

  15. Theseus: Identifying Site Structure, Link Identification • Performs initial breadth-first analysis to a depth of 3 • Repeated links (occurrence of 90% or more) are considered structural links and eventually get discarded • Link identification is performed in the initial phase of site analysis • Ability to recognize CGI references
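
The structural-link filter can be sketched as below. It reads "occurrence of 90% or more" as "the link appears on at least 90% of the pages visited", which is an interpretation rather than something the slide states. The site is modelled as an in-memory dictionary instead of being fetched over HTTP, and all names and sample URLs are hypothetical.

```python
from collections import Counter, deque


def crawl_breadth_first(site, start, max_depth=3):
    """site maps a page URL to the list of URLs it links to (stand-in for fetching)."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue                      # do not expand pages at the depth limit
        for target in site.get(page, []):
            if target in site and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


def structural_links(site, pages, ratio=0.9):
    # Count on how many visited pages each link occurs (once per page).
    counts = Counter()
    for page in pages:
        counts.update(set(site.get(page, [])))
    return {link for link, n in counts.items() if n / len(pages) >= ratio}


def content_links(site, start, max_depth=3, ratio=0.9):
    # Links left on each page after discarding the structural ones.
    pages = crawl_breadth_first(site, start, max_depth)
    structural = structural_links(site, pages, ratio)
    return {page: [link for link in site.get(page, []) if link not in structural]
            for page in pages}


SITE = {
    "/": ["/", "/about", "/a", "/b"],
    "/about": ["/", "/about"],
    "/a": ["/", "/about", "http://ext.example/x"],
    "/b": ["/", "/about", "/a"],
}
print(content_links(SITE, "/"))
```

In this toy site the home and "about" links occur on every page, so they are treated as navigation and dropped. The remaining links are the ones worth categorizing.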

  16. Theseus: Integration With a Search Engine • Example: Yahoo! • Several benefits • avoid separate spidering of Web documents • provide support for queries within categories – “Search within this category” • Vice-versa • category information can be used to group query results – improved presentation

  17. Theseus: Assessment • Experiment: Categorize a subset of Yahoo! pages • Obtained the same categorization in most cases • Classifies approximately 500 sites per hour • Is more precise • “microbiology journals” instead of “biology journals”

  18. Theseus: Assessment

  19. Open Issues • Building category profiles • By hand • Learning techniques • Possible solution: minimal category profiles, to be extended in the learning phase • Proper ranking of documents in the catalog

  20. Part-of-speech Tagging • The task of POS-tagging is to assign part-of-speech tags to words, reflecting their syntactic category. But often, words can belong to different syntactic categories in different contexts. For instance, the string "books" can have two readings: in the sentence "he books tickets" the word "books" is a third person singular verb, but in the sentence "he reads books" it is a plural noun. A POS-tagger should segment a word, determine its possible readings, and assign the right reading given the context.
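
The "books" example can be reproduced with a toy tagger. This is not how TreeTagger works. The four-word lexicon, the tag names, and the single disambiguation rule (prefer the verb reading directly after a pronoun, otherwise the noun reading) are all assumptions chosen only to show what "assigning the right reading given the context" means.

```python
# Hypothetical lexicon: each word lists its possible readings.
LEXICON = {
    "he": ["PRON"],
    "books": ["VERB", "NOUN"],
    "reads": ["VERB"],
    "tickets": ["NOUN"],
}


def tag(sentence):
    tags = []
    for word in sentence.lower().split():
        readings = LEXICON.get(word, ["NOUN"])   # unknown words default to NOUN
        if len(readings) == 1:
            choice = readings[0]
        else:
            # Toy context rule: right after a pronoun prefer VERB, otherwise NOUN.
            previous = tags[-1][1] if tags else None
            preferred = "VERB" if previous == "PRON" else "NOUN"
            choice = preferred if preferred in readings else readings[0]
        tags.append((word, choice))
    return tags


print(tag("he books tickets"))
print(tag("he reads books"))
```

The same string "books" receives a different tag in each sentence because the rule looks at the tag assigned to the preceding word.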
