Using Web Structure for Classifying and Describing Web Pages

Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake. International World Wide Web Conference, 2002. Presented by Zaihan Yang, CSE Web Mining.



  1. Using Web Structure for Classifying and Describing Web Pages
     Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake
     International World Wide Web Conference, 2002
     Presented by Zaihan Yang, CSE Web Mining

  2. Introduction
     • Aim
       • Classification of web pages
       • Description of web pages (to name clusters of web pages)
     • Using web structure
       • Extracting patterns from hyperlinks on the web
     • Hyperlink
       • The destination page
       • Associated anchortext describing the link

  3. Typical Text-Based Classification
     • Uses the words (or phrases) of the target document itself, keeping the most significant features.
     • Often not effective, because a page may not describe itself.
       • E.g. the home page of General Motors (www.gm.com) does not state that they are a car company.
     • Full text
     • Anchortext
     • Extended anchortext
     • A combination

  4. Virtual Document
     • A virtual document is:
       • A collection of anchortexts or extended anchortexts from links pointing to the target document.
     • Anchortext:
       • The words occurring inside a link.
     • Extended anchortext:
       • The set of rendered words occurring up to 25 words before and after an associated link (as well as the anchortext itself).
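The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each citing page has already been rendered and tokenized into a list of words, and that the link is identified by a hypothetical `(link_start, link_end)` word span.

```python
def extended_anchortext(words, link_start, link_end, window=25):
    """Return the anchortext plus up to `window` words on either side.

    `words` is the rendered text of a citing page as a list of tokens.
    The link's anchortext is words[link_start:link_end] (an assumed,
    illustrative representation of where the link sits in the page).
    """
    before = words[max(0, link_start - window):link_start]
    anchor = words[link_start:link_end]
    after = words[link_end:link_end + window]
    return before + anchor + after


def virtual_document(citing_pages, window=25):
    """Concatenate the extended anchortexts of all links to one target.

    `citing_pages` is an iterable of (words, link_start, link_end) tuples,
    one per inbound link.
    """
    doc = []
    for words, start, end in citing_pages:
        doc.extend(extended_anchortext(words, start, end, window))
    return doc
```

Setting `window=0` yields the plain anchortext virtual document, so the same helper covers both variants described above.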

  5. Main Method
     • Full-text classifier
     • Virtual-document classifier
     • Two improvement methods
     • Naming a cluster
     • Main procedure: Datasets → Features → Expected entropy loss ranking → Train SVM

  6. Datasets
     • Positive: a set of web pages downloaded from various Yahoo! categories.
     • Negative: random documents from outside Yahoo!
     • WebKB dataset
     • Features:
       • All words and two- or three-word phrases
       • E.g. "My favorite game is scrabble."
       • Possible features: my, my favorite, my favorite game, favorite, favorite game, etc.
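The feature definition above (all words plus two- and three-word phrases) amounts to extracting word n-grams for n = 1 to 3. A minimal sketch follows; the lowercasing and the simple regular-expression tokenizer are assumptions made for illustration, since the slides do not say how text was normalized.

```python
import re


def phrase_features(text, max_n=3):
    """Return the set of all 1- to max_n-word phrases in `text`."""
    # Assumed normalization: lowercase, keep letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features
```

For the example sentence on the slide, `phrase_features("My favorite game is scrabble.")` contains "my", "my favorite", "my favorite game", "favorite", "favorite game", and so on.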

  7. Dimensionality Reduction
     • Goal: remove useless features.
     • Two-step process:
       • First, remove every feature that does not occur in at least a specified percentage of documents, i.e. remove feature $f$ when both $|A_f|/|A| < T^+$ and $|B_f|/|B| < T^-$ hold.
         • $A$: the set of positive examples.
         • $B$: the set of negative examples.
         • $A_f$: documents in $A$ that contain feature $f$.
         • $B_f$: documents in $B$ that contain feature $f$.
         • $T^+$: threshold for positive features.
         • $T^-$: threshold for negative features.
       • Second, rank the remaining features by expected entropy loss.
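The first step, thresholding on document frequency, can be sketched as follows. This is an illustrative reading of the rule on the slide, assuming each document is represented as a set of features and that both example sets are non-empty; the function name and argument names are hypothetical.

```python
def prune_rare_features(pos_docs, neg_docs, t_pos, t_neg):
    """Drop features that are rare in BOTH the positive and negative sets.

    pos_docs, neg_docs: non-empty lists of feature sets (A and B).
    t_pos, t_neg: the thresholds T+ and T- as fractions in [0, 1].
    A feature f is removed when |A_f|/|A| < T+ and |B_f|/|B| < T-.
    """
    def doc_fraction(docs):
        counts = {}
        for feats in docs:
            for f in feats:
                counts[f] = counts.get(f, 0) + 1
        return {f: c / len(docs) for f, c in counts.items()}

    frac_pos = doc_fraction(pos_docs)
    frac_neg = doc_fraction(neg_docs)
    kept = set()
    for f in set(frac_pos) | set(frac_neg):
        rare_pos = frac_pos.get(f, 0.0) < t_pos
        rare_neg = frac_neg.get(f, 0.0) < t_neg
        if not (rare_pos and rare_neg):
            kept.add(f)
    return kept
```

A feature survives if it is frequent enough in either set, so words that are common only among negatives are also kept as potentially discriminative.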

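  8. Expected Entropy Loss
     • Notation: $C$ is the event that a document is a positive example, $f$ the event that it contains the feature, and $\lg$ is the base-2 logarithm.
     • The prior entropy of the class distribution: $e = -\Pr(C)\lg\Pr(C) - \Pr(\bar{C})\lg\Pr(\bar{C})$
     • The posterior entropy of the class when the feature is present: $e_f = -\Pr(C \mid f)\lg\Pr(C \mid f) - \Pr(\bar{C} \mid f)\lg\Pr(\bar{C} \mid f)$
     • The posterior entropy of the class when the feature is absent: $e_{\bar{f}} = -\Pr(C \mid \bar{f})\lg\Pr(C \mid \bar{f}) - \Pr(\bar{C} \mid \bar{f})\lg\Pr(\bar{C} \mid \bar{f})$
     • The expected entropy loss: $e - \left(\Pr(f)\,e_f + \Pr(\bar{f})\,e_{\bar{f}}\right)$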
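The four quantities named on this slide can be computed directly from document counts. The sketch below assumes the probabilities are estimated as simple count ratios over the positive and negative sets (the slides do not state how they were estimated), and treats the entropy of a certain outcome as zero.

```python
import math


def entropy(p):
    """Binary entropy (base 2) of a class distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Expected entropy loss of one feature, from document counts.

    pos_with / neg_with: positive / negative documents containing the feature.
    pos_total / neg_total: sizes of the positive / negative sets.
    """
    total = pos_total + neg_total
    with_f = pos_with + neg_with
    without_f = total - with_f

    prior = entropy(pos_total / total)      # entropy before seeing the feature
    p_f = with_f / total                    # estimated Pr(feature present)
    e_present = entropy(pos_with / with_f) if with_f else 0.0
    e_absent = entropy((pos_total - pos_with) / without_f) if without_f else 0.0
    return prior - (p_f * e_present + (1 - p_f) * e_absent)
```

A feature present in every positive and no negative document removes all uncertainty about the class, while one spread evenly across both sets removes none.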

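  9. Train SVM
     • A set of data points: $\{(x_1, y_1), \ldots, (x_N, y_N)\}$
     • $x_i$ is an input and $y_i$ is a target output (1 or -1).
     • Separating hyperplane: $w \cdot \phi(x) + b = 0$
       • $w \cdot \phi(x_i) + b \ge 1$ if $y_i = 1$
       • $w \cdot \phi(x_i) + b \le -1$ if $y_i = -1$
       • $w$ and $b$ are found by minimizing $\|w\|^2$ subject to these constraints.
     • Output: $f(x) = w \cdot \phi(x) + b$
     • Kernel function: $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$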
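As a rough illustration of how a trained SVM produces its real-valued output, the sketch below evaluates the standard kernel expansion $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ over the support vectors. The expansion is the usual dual form and is an assumption here, as are the linear kernel and all names; the slides do not specify which kernel or training procedure was used.

```python
def linear_kernel(u, v):
    """Dot product, i.e. the kernel for the identity feature map."""
    return sum(a * b for a, b in zip(u, v))


def svm_output(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Real-valued SVM output for input x (sign gives the predicted class).

    support_vectors, labels, alphas: parallel sequences from a trained model
    (hypothetical values in any example; training is not shown here).
    """
    return sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + b
```

The sign of the returned value is the predicted class, and its magnitude indicates how far the input lies from the separating hyperplane.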

  10. Improvement: Uncertainty Sampling
     • The output of an SVM classifier is a real number in (-∞, +∞).
     • An output in the interval (-1, 1) is less certain than one in (-∞, -1) or (1, +∞).
     • The region (-1, 1) is called the "uncertain region".
     • Uncertainty sampling:
       • A human judges the documents that fall in the "uncertain region".
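Selecting the documents for human review is a simple filter on the classifier outputs. The following sketch assumes outputs are available as a mapping from a document identifier to the SVM's real-valued output; the names are illustrative only.

```python
def split_by_certainty(outputs, margin=1.0):
    """Separate confident predictions from those needing human judgment.

    outputs: dict mapping document id -> real-valued SVM output.
    Returns (confident, uncertain): `confident` maps id -> +1 / -1, and
    `uncertain` lists ids whose output lies in the open interval
    (-margin, margin).
    """
    confident, uncertain = {}, []
    for doc_id, score in outputs.items():
        if -margin < score < margin:
            uncertain.append(doc_id)
        else:
            confident[doc_id] = 1 if score > 0 else -1
    return confident, uncertain
```

Only the `uncertain` list is passed to a human judge; the remaining documents keep the classifier's label.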

  11. Improvement: Combination
     • Combines results from the extended-anchortext (extended-AT) classifier with the less accurate full-text classifier.
     • Decision procedure for a web page:
       • Run the extended-AT classifier.
       • Is its result negative but uncertain?
         • No: use the result of the extended-AT classifier.
         • Yes: run the full-text classifier. Is its result positive and |output| > |outputAT|?
           • Yes: positive.
           • No: negative.
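The combination rule on this slide can be written as one small function. This is a sketch under stated assumptions: "negative but uncertain" is taken to mean an extended-AT output strictly between -1 and 0, and an output of exactly 0 is treated as negative; the slides do not define either boundary.

```python
def combine(output_at, output_ft, margin=1.0):
    """Combine extended-anchortext and full-text SVM outputs.

    output_at: real-valued output of the extended-AT classifier.
    output_ft: real-valued output of the full-text classifier.
    Returns +1 (positive) or -1 (negative).
    """
    # Assumed reading of "negative but uncertain": inside (-margin, 0).
    negative_but_uncertain = -margin < output_at < 0
    if not negative_but_uncertain:
        return 1 if output_at > 0 else -1
    # Consult the full-text classifier only for uncertain negatives.
    if output_ft > 0 and abs(output_ft) > abs(output_at):
        return 1
    return -1
```

The full-text classifier can therefore only turn an uncertain negative into a positive, never the reverse, which is consistent with a rule aimed at recovering missed positives.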

  12. Name the Cluster
     • Use the top-ranked features extracted from the extended-anchortext virtual documents to name a cluster.
     • Assumptions:
       • The words near the anchortexts are descriptions of the target documents.
       • The features ranked highest by expected entropy loss are those that occur in many positive examples and few negative ones.

  13. Results: Classification
     • Anchortext alone is comparable to the full text for classification.
     • Classification accuracy improves significantly when the extended anchortext is used instead of the document full text.
     • The combination method is highly effective at improving positive-class accuracy, but reduces negative-class accuracy.
     • Uncertainty sampling required examining only 8% of the documents on average, while improving average positive-class accuracy by almost 10 percentage points.

  14. Results: Clustering
     • The full text appears comparable to the extended anchortext for describing a category.
     • The anchortext alone appears to do a poor job of describing the category.

  15. Future Work
     • Include other features of the inbound web pages besides extended anchortext.
     • Examine the effects of the number of inbound links.
     • Examine the nature of the category by expanding to thousands of categories.
     • Study the effects of the positive set size.
