
Web Page Categorization without the Web Page

Author: Min-Yen Kan, WWW-2004.



Presentation Transcript


  1. Web Page Categorization without the Web Page • Author: Min-Yen Kan • WWW-2004

  2. Basic Idea
     • Web page categorization ~ text categorization
     • Some approaches retrieve the whole document
     • This yields URLs of additional documents
     • Could result in cyclic or non-terminating crawling
     • Instead, glean information from intuitive URLs
     • Avoid the retrieval bottleneck

  3. An Example
     • http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html
     • Classify the above web page into one of the following categories:
       • Course
       • Faculty
       • Project
       • Student

  4. Approach
     • Two-phase URL segmentation
     • First phase
       • Baseline
         • scheme://host/path-elements/document.extension
         • Further segmentation at punctuation, e.g. faculty-info → faculty info
       • Refined
         • Break the URL wherever a transition between uppercase, lowercase and digits is observed
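The first-phase segmentation on the slide above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the function names `baseline_segments` and `refined_segments` are hypothetical, and keeping a leading capital attached to the lowercase run that follows it (so that "Info" stays one token rather than "I" + "nfo") is an assumption the slide does not spell out.

```python
import re
from urllib.parse import urlsplit


def baseline_segments(url):
    """Baseline: split scheme, host, path elements, document and extension
    at every run of non-alphanumeric characters (so 'faculty-info' is split)."""
    parts = urlsplit(url)
    raw = " ".join([parts.scheme, parts.netloc, parts.path])
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", raw) if tok]


def refined_segments(url):
    """Refined: additionally break each baseline segment at transitions
    between uppercase, lowercase and digits.

    Assumption (not from the slides): a capital immediately followed by
    lowercase letters starts a title-case word and stays attached to it.
    """
    pattern = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")
    tokens = []
    for seg in baseline_segments(url):
        tokens.extend(pattern.findall(seg))
    return tokens


url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
print(baseline_segments(url))
# ['http', 'cs', 'cornell', 'edu', 'Info', 'Courses', 'Current', 'CS415', 'CS415', 'html']
print(refined_segments(url))
# ['http', 'cs', 'cornell', 'edu', 'Info', 'Courses', 'Current', 'CS', '415', 'CS', '415', 'html']
```

On the example URL from slide 3, the refined pass separates the course code "CS415" into "CS" and "415", which the baseline pass leaves fused.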

  5. Approach
     • Second phase
       • Information content reduction
         • Examines all possible partitions of the segment
         • Sums the information content (IC) of the tokens in each partition
         • Picks the partition with the lowest total IC
       • Title-token-based finite-state transducer
         • What about acronyms?
         • A non-deterministic weighted finite-state transducer splits and expands segments based on previously seen web page titles
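The information content reduction step above can be illustrated with a small brute-force sketch. Everything concrete here is assumed for illustration: the token counts in `COUNTS` are made up, IC is taken to be the negative log2 of a token's relative frequency, and the per-character penalty for unseen tokens is a placeholder choice rather than anything stated in the slides.

```python
import math

# Hypothetical token counts; a real system would estimate these from a corpus.
COUNTS = {"home": 50, "page": 40, "homepage": 2, "me": 30, "ho": 1}
TOTAL = sum(COUNTS.values())


def info_content(token):
    """IC(token) = -log2 P(token). Unseen tokens are charged a heavy
    per-character penalty (an assumption made for this sketch)."""
    if token in COUNTS:
        return -math.log2(COUNTS[token] / TOTAL)
    return len(token) * math.log2(TOTAL + 1)


def partitions(segment):
    """Yield every way of cutting `segment` into contiguous pieces."""
    if not segment:
        yield []
        return
    for i in range(1, len(segment) + 1):
        head = segment[:i]
        for rest in partitions(segment[i:]):
            yield [head] + rest


def best_partition(segment):
    """Return the partition whose tokens have the lowest summed IC."""
    return min(partitions(segment), key=lambda p: sum(info_content(t) for t in p))


print(best_partition("homepage"))  # ['home', 'page']
```

A segment of length n has 2^(n-1) partitions, so exhaustive enumeration like this is only practical for short URL segments; a dynamic-programming variant would be the natural replacement for longer strings.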

  6. An Example
     • nytimes → New York Times
     • ФNewYorkTimes
     • Score of 12 and outputs |n|y|times
     • R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4

  7. Experiments
     • Dataset used: WebKB (4167 pages)
     • Pages classified as student, faculty, course or project
     • Classifier used: SVM
     • Compared with: FOIL-PILFS (based on inductive logic programming)
     • Features evaluated: (U)RL {Ub,Ur,Ui,Uf}, (A)nchor text, (T)itle text and page te(X)t

  8. Experiments

  9. Conclusion
     • URLs contain tokens that are effective for classification
     • It's faster
     • Careful URL segmentation boosts classification
     • URL segmentation is more powerful than expansion
     • Can assist source-based classification to a limited extent
     • The FST cannot expand what it hasn't seen
     • Cryptic URLs are hard to tackle
