Learning to remove Internet advertisements

Learning to remove Internet advertisements. Nicholas Kushmerick Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang Department of Computer Science Michigan Technological University.

Presentation Transcript


  1. Learning to remove Internet advertisements. Nicholas Kushmerick, Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang, Department of Computer Science, Michigan Technological University.

  2. Overview • Background • Introduction of ADEATER • Design of ADEATER • Evaluation • Related Work • Conclusion and Future Work

  3. Background (the slide shows three example advertisement images)
     • Negative impact of advertisement images on the Internet
       • Slow down browsing
       • Consume computer resources
       • Impose extra costs on users

  4. Introduction of ADEATER
     • Definition:
       - A browsing assistant that automatically removes advertisement images from Internet pages
     • Property:
       - Rules are generated by a learning algorithm

  5. Introduction of ADEATER • Examples

  6. Design of ADEATER • System Architecture

  7. Design of ADEATER
     • Encoding instance
       • Fixed-width feature vector
       • An image enclosed in an anchor tag <A> is a candidate advertisement
       • Geometric features of an image:
         - Height (<IMG height=90>)
         - Width (<IMG width=90>)
         - Aspect ratio (ratio of width to height)
       • Local feature:
         - Whether the destination URL and the image URL are in the same Internet domain
         - Destination www.ee.mtu.edu/page.html, image www.cs.mtu.edu/image.jpg: yes
         - Destination www.dell.com/notebook.html, image www.cs.mtu.edu/image.jpg: no
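The geometric and local features on slide 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and treating "same Internet domain" as "same last two host-name labels" is a simplification that the slides do not spell out.

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Crude domain key: the last two labels of the host name (an assumption)."""
    if "//" not in url:
        url = "//" + url  # lets urlparse treat a bare "www.x.com/..." as a host
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def encode_geometry(height, width):
    """Return (height, width, aspect ratio); None marks a missing value."""
    ratio = width / height if height and width else None
    return height, width, ratio


def same_domain(dest_url, img_url):
    """Local feature: do the destination URL and image URL share a domain?"""
    return registered_domain(dest_url) == registered_domain(img_url)
```

With the slide's own URLs, `same_domain("www.ee.mtu.edu/page.html", "www.cs.mtu.edu/image.jpg")` is true and `same_domain("www.dell.com/notebook.html", "www.cs.mtu.edu/image.jpg")` is false.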

  8. Design of ADEATER
     • Encoding instance
       • Fixed-width feature vector
       • Caption feature:
         - Words occurring in the enclosing <A> tag, as phrases of length at most K that occur at least M times
         - K is the maximum phrase length
         - M is the minimum phrase count
       • Alt feature:
         - Set of "alternate" words in the <IMG> tag (<IMG alt="ad">), with the same limits K and M

  9. Design of ADEATER
     • Encoding instance
       • Fixed-width feature vector
       • Ubase, Udest, Uimg:
         - Words occurring in the base URL, destination URL, and image URL, as phrases of length at most K that occur at least M times
         - K is the maximum phrase length
         - M is the minimum phrase count
       • Stop list:
         - Low-information terms ("http", "www", "jpg", etc.)
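The phrase-based features (caption, alt, Ubase, Udest, Uimg) can be illustrated with a short Python sketch. Several details here are assumptions rather than anything the slides state: the tokenizer, the choice to drop stop-list words before forming phrases, counting a phrase once per text, and all function names. Phrases are joined with "+", matching the notation used in the example rules.

```python
import re
from collections import Counter

STOP = {"http", "www", "jpg"}  # terms named on the slide; a real list would be longer


def phrases(text, k):
    """All runs of 1..k consecutive words, joined with '+', stop words dropped."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP]
    return {"+".join(words[i:i + n])
            for n in range(1, k + 1)
            for i in range(len(words) - n + 1)}


def build_vocabulary(texts, k, m):
    """Keep phrases of length at most k that occur in at least m texts."""
    counts = Counter(p for t in texts for p in phrases(t, k))
    return sorted(p for p, c in counts.items() if c >= m)


def encode(text, vocabulary, k):
    """Fixed-width 0/1 vector: one slot per vocabulary phrase."""
    present = phrases(text, k)
    return [1 if p in present else 0 for p in vocabulary]
```

Because the vocabulary is fixed after training, every instance encodes to a vector of the same width, which is what a decision-tree learner expects.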

  10. Design of ADEATER • Encoding instance • Sample of an HTML page

  11. Design of ADEATER • Encoding of samples

  12. Design of ADEATER • Encoding of samples (cont)

  13. Design of ADEATER
      • Gathering examples
        • AD samples are generated by the ADGRABBER browsing assistant
          - Identify candidate advertisements
          - Generate the vector encoding
        • NON-AD samples are generated by a custom-built Internet spider
          - Extract images from randomly generated URLs

  14. Design of ADEATER
      • Learning rules
        • Algorithm
          - C4.5 decision tree learning algorithm
        • Properties
          - Quick on-line execution of the classifier
          - Not overly sensitive to missing features or noise
          - Scales well and is insensitive to irrelevant features
        • Examples of rules
          - If aspect ratio > 4.5833, alt doesn't contain "to" but does contain "click+here", and Udest doesn't contain "http+www", then the instance is an AD
          - If Ubase does not contain "messier" and Udest contains "redirect+cgi", then the instance is an AD
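The two example rules on slide 14 translate directly into boolean tests. The Python sketch below hard-codes just those two rules for illustration; it is not the learned classifier itself, and the dictionary layout of an instance (`aspect_ratio`, `alt`, `ubase`, `udest`) is a hypothetical encoding chosen here.

```python
def is_ad(instance):
    """Apply the two example rules from the slide to one encoded instance.

    instance: dict with 'aspect_ratio' (float or None) and the phrase sets
    'alt', 'ubase', 'udest' (hypothetical field names).
    """
    ratio = instance.get("aspect_ratio")
    alt = instance.get("alt", set())
    ubase = instance.get("ubase", set())
    udest = instance.get("udest", set())

    # Rule 1: wide image whose alt text says "click here".
    rule1 = (ratio is not None and ratio > 4.5833
             and "to" not in alt
             and "click+here" in alt
             and "http+www" not in udest)

    # Rule 2: destination URL goes through a redirect script.
    rule2 = "messier" not in ubase and "redirect+cgi" in udest

    return rule1 or rule2
```

A missing aspect ratio simply makes the first rule fail rather than raising an error, in the spirit of the slide's point about tolerating missing features.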

  15. Design of ADEATER
      • Removing advertisements
        • Process
          - Fetch HTML pages from the Internet
          - Identify candidate advertisements
          - Classify instances with the learned rules
          - Replace the image's URL with the URL of an inconspicuous low-bandwidth image
        • Implementation
          - Removal module as a proxy server
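The replacement step on slide 15 can be sketched as a rewrite of the page's HTML. The Python below is a rough illustration, not the proxy described in the talk: it finds images directly inside an anchor tag with a regular expression (a real proxy would use an HTML parser), the placeholder URL is made up, and `is_ad` stands in for the learned classifier and here sees only the image URL.

```python
import re

PLACEHOLDER = "http://localhost/blank.gif"  # hypothetical low-bandwidth image

# An <img> that immediately follows an opening <a ...> tag; captures its src value.
ANCHORED_IMG = re.compile(
    r"(<a\b[^>]*>\s*<img\b[^>]*?\bsrc=)([\"'])(.*?)\2",
    re.IGNORECASE | re.DOTALL,
)


def remove_ads(html, is_ad):
    """Swap the src of each anchored image that is_ad(src) flags as an ad."""
    def swap(match):
        prefix, quote, src = match.groups()
        if is_ad(src):
            return f"{prefix}{quote}{PLACEHOLDER}{quote}"
        return match.group(0)  # leave non-ads untouched

    return ANCHORED_IMG.sub(swap, html)
```

Images outside an anchor tag are never touched, matching the slides' definition of a candidate advertisement.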

  16. Evaluation
      • Speed and accuracy
        • Experiment setting
          • Total samples
            - AD: 459 examples
            - NON-AD: 2820 examples
          • 10-fold cross-validation
            - Training set: 90% of the examples
            - Test set: 10% of the examples
        • Off-line training phase: 5.8 minutes
        • On-line classification phase: 70 msec/image
        • Average accuracy: 97.1%
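The 10-fold cross-validation setup on slide 16 can be sketched as follows. This Python illustration only shows how the 459 + 2820 = 3279 examples would be split into ten 90%/10% partitions; the shuffling seed, the function names, and the `train_and_score` callback (which would train a classifier on one partition and return its test accuracy) are assumptions, not details from the talk.

```python
import random


def ten_fold_splits(n_examples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def mean_accuracy(n_examples, train_and_score):
    """Average the per-fold accuracy returned by the caller's train_and_score."""
    scores = [train_and_score(train, test)
              for train, test in ten_fold_splits(n_examples)]
    return sum(scores) / len(scores)
```

Each example lands in exactly one test fold, so the reported average is taken over predictions for every example in the data set.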

  17. Evaluation
      • Learning curves
        • Simple methodology
          - Does not recalculate the feature set
        • Realistic methodology
          - Recalculates the feature set

  18. Evaluation • Alternative encodings

  19. Related Work
      • Muffin: Filtering web pages
        • ImageKill filter: hand-crafted rules
          • ImageKill.minheight
            - Only remove images which are at least n pixels high
          • ImageKill.minwidth
            - Only remove images which are at least n pixels wide
          • ImageKill.ratio
            - Remove images which are more than n times as wide as they are high
          • ImageKill.exclude
            - Don't remove images that match the given string/regexp
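For contrast with learned rules, the four hand-set ImageKill options on slide 19 can be read as a single fixed test. The Python sketch below is one possible reading only: how Muffin actually combines the options is not stated in the slides, and the function name, parameter defaults, and the use of a regular-expression search for `exclude` are assumptions made for illustration.

```python
import re


def imagekill_removes(src, width, height,
                      minheight=0, minwidth=0, ratio=0.0, exclude=None):
    """Return True if this reading of the ImageKill options removes the image."""
    if exclude and re.search(exclude, src):
        return False  # explicitly protected by the exclude pattern
    if height < minheight or width < minwidth:
        return False  # too small to be considered
    return width > ratio * height  # wide, banner-shaped images are removed
```

Every threshold here has to be chosen by hand, which is the limitation that a learned classifier is meant to address.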

  20. Related Work • WebFilter: Filtering web pages • Solution - User provides a list of URL templates and corresponding filter scripts

  21. Related Work • Junkbuster: Filtering web pages • Solution - User provides a block file

  22. Related Work
      • Smokey: Detecting abusive messages
        • Solution
          - Collect training samples and generate rules by training
          - Parse messages and generate a feature vector for each
          - Classify the feature vector with the generated rules

  23. Conclusion and Future Work
      • Conclusion
        • High accuracy
        • Modest resource cost (processing time, training samples)
      • Future Work
        • Incremental learning algorithm
        • More efficient feature selection mechanism

  24. Thank you!
