Learning to remove Internet advertisements
Learning to remove Internet advertisements. Nicholas Kushmerick Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang Department of Computer Science Michigan Technological University.
Learning to remove Internet advertisements Nicholas Kushmerick Department of Computer Science, University College Dublin, Ireland Presented by Bo Zhang Department of Computer Science Michigan Technological University
Overview
• Background
• Introduction of ADEATER
• Design of ADEATER
• Evaluation
• Related Work
• Conclusion and Future Work
Background
• Negative impact of advertisement images on the Internet
  - Slow down browsing
  - Consume computer resources
  - Impose extra costs on users
Introduction of ADEATER
• Definition:
  - A browsing assistant that automatically removes advertisement images from Internet pages
• Property:
  - Its rules are generated by a learning algorithm
Introduction of ADEATER • Examples
Design of ADEATER • System Architecture
Design of ADEATER
• Encoding instance
  • Fixed-width feature vector
  • An image enclosed in an anchor tag <A> is a candidate advertisement
  • Geometric features of an image:
    - Height: <IMG height=90>
    - Width: <IMG width=90>
    - Aspect ratio (ratio of width to height)
  • Local feature:
    - Whether the destination URL and the image URL are in the same Internet domain
    - Destination www.ee.mtu.edu/page.html, image www.cs.mtu.edu/image.jpg: yes
    - Destination www.dell.com/notebook.html, image www.cs.mtu.edu/image.jpg: no
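The geometric and local features on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the ADEATER implementation: the function names `domain_of` and `encode_geometry_and_locality` are hypothetical, and "same Internet domain" is assumed here to mean that the last two labels of the host name match (a naive rule that would mishandle hosts such as `example.co.uk`).

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the last two labels of the host name, e.g. 'mtu.edu'.

    Assumption: two labels are enough to identify the domain.
    """
    # urlparse only fills in the host when the URL has a scheme or a leading '//'.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def encode_geometry_and_locality(height, width, dest_url, img_url):
    """Build the geometric and local part of a feature vector.

    height and width may be None when the <IMG> tag omits them; the aspect
    ratio is then left as None (a missing feature).
    """
    aspect_ratio = width / height if height and width else None
    return {
        "height": height,
        "width": width,
        "aspect_ratio": aspect_ratio,
        "same_domain": domain_of(dest_url) == domain_of(img_url),
    }


features = encode_geometry_and_locality(
    90, 90, "www.dell.com/notebook.html", "www.cs.mtu.edu/image.jpg"
)
print(features["aspect_ratio"], features["same_domain"])
```

With the two URL pairs from the slide, the first pair (both under mtu.edu) yields `same_domain` true and the second (dell.com versus mtu.edu) yields false.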
Design of ADEATER
• Encoding instance
  • Fixed-width feature vector
  • Caption feature
    - Words occurring in the enclosing <A> tag, with phrase length < K and phrase count > M
  • Alt feature
    - Set of "alternate" words in the <IMG> tag (<IMG alt="ad">), with phrase length < K and phrase count > M
  • Parameters
    - K is the maximum phrase length
    - M is the minimum phrase count
Design of ADEATER
• Encoding instance
  • Fixed-width feature vector
  • Ubase, Udest, Uimg
    - Words occurring in the base URL, destination URL, and image URL, with phrase length < K and phrase count > M
    - K is the maximum phrase length
    - M is the minimum phrase count
  • Stop list
    - Low-information terms ("http", "www", "jpg", etc.)
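As a rough illustration of the phrase features, the sketch below builds a feature set from a few URLs and encodes a new URL as a 0/1 vector. It is not the ADEATER code, and several details are assumptions: the names `phrases`, `build_feature_set`, and `encode` are hypothetical; words are split on non-alphanumeric characters; phrases are joined with "+" (the notation used in the example rules); the count of a phrase is the number of texts that contain it; and the stop list removes only single-word phrases, so a multi-word phrase such as "http+www" can still become a feature. The strict comparisons (length < K, count > M) follow the slide as written.

```python
import re
from collections import Counter

STOP_LIST = {"http", "www", "jpg"}  # the three terms named on the slide


def phrases(text, k):
    """Return the set of phrases of fewer than k consecutive words in text."""
    words = [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]
    found = set()
    for n in range(1, k):  # phrase length < K
        for i in range(len(words) - n + 1):
            phrase = "+".join(words[i:i + n])
            if phrase not in STOP_LIST:  # assumption: drops single stop words only
                found.add(phrase)
    return found


def build_feature_set(texts, k, m):
    """Keep phrases that occur in more than m of the texts (count > M)."""
    counts = Counter()
    for text in texts:
        counts.update(phrases(text, k))
    return sorted(p for p, c in counts.items() if c > m)


def encode(text, feature_set, k):
    """Encode text as a fixed-width 0/1 vector over feature_set."""
    present = phrases(text, k)
    return [1 if p in present else 0 for p in feature_set]


urls = [
    "http://www.ads.com/redirect.cgi?id=1",
    "http://www.shop.com/redirect.cgi",
    "http://www.example.org/index.html",
]
feature_set = build_feature_set(urls, k=3, m=1)
print(feature_set)
print(encode("http://x.net/redirect.cgi", feature_set, k=3))
```

Because the feature set is fixed after training, every instance maps to a vector of the same width regardless of how long its URL or caption is.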
Design of ADEATER • Encoding instance • Sample HTML page
Design of ADEATER • Encoding of samples
Design of ADEATER • Encoding of samples (cont)
Design of ADEATER
• Gathering examples
  • AD samples are generated by the ADGRABBER browsing assistant
    - Identify candidate advertisements
    - Generate the vector encoding
  • NON-AD samples are generated by a custom-built Internet spider
    - Extract images from randomly generated URLs
Design of ADEATER
• Learning rules
  • Algorithm
    - C4.5 decision tree learning algorithm
  • Properties
    - Quick on-line execution of the classifier
    - Not overly sensitive to missing features or noise
    - Scales well and is insensitive to irrelevant features
  • Examples of rules
    - If aspect ratio > 4.5833, alt doesn't contain "to" but does contain "click+here", and Udest doesn't contain "http+www", then the instance is an AD
    - If Ubase does not contain "messier", and Udest contains "redirect+cgi", then the instance is an AD
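To make the two example rules concrete, here is a minimal Python sketch that applies them to an encoded instance. The instance layout (a dict with an `aspect_ratio` number and sets of phrases under `alt`, `ubase`, and `udest`) and the function name `is_ad` are assumptions made for illustration. The full learned rule set is not shown on the slide, so returning NON-AD when neither rule fires is also an assumption.

```python
def is_ad(instance):
    """Apply the two example rules from the slide to one encoded instance."""
    ratio = instance.get("aspect_ratio")  # None when height or width is missing
    alt = instance.get("alt", set())
    ubase = instance.get("ubase", set())
    udest = instance.get("udest", set())

    # Rule 1: wide banner whose alt text says "click here".
    if (
        ratio is not None
        and ratio > 4.5833
        and "to" not in alt
        and "click+here" in alt
        and "http+www" not in udest
    ):
        return True

    # Rule 2: destination URL goes through a redirect script.
    if "messier" not in ubase and "redirect+cgi" in udest:
        return True

    # Assumption: anything not matched by the two example rules is NON-AD.
    return False


banner = {
    "aspect_ratio": 468 / 60,
    "alt": {"click", "here", "click+here"},
    "ubase": set(),
    "udest": {"ads", "com", "ads+com"},
}
print(is_ad(banner))
```

Each rule is a conjunction of cheap set-membership and threshold tests, which is consistent with the slide's point that the classifier executes quickly on-line.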
Design of ADEATER
• Removing advertisements
  • Process
    - Fetch HTML pages from the Internet
    - Identify candidate advertisements
    - Classify instances with the learned rules
    - Replace the image's URL with the URL of an inconspicuous low-bandwidth image
  • Implementation
    - Removal module runs as a proxy server
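The rewriting step of this process can be sketched as follows. This is only an illustration of the idea, not the ADEATER proxy: fetching pages and serving them through a proxy are left out, the placeholder URL and the `classify` callback are hypothetical, and the regular expressions handle only quoted `src` attributes and non-nested anchors (a real implementation would use an HTML parser).

```python
import re

# Hypothetical URL of a small, inconspicuous replacement image.
PLACEHOLDER = "http://localhost/blank.gif"

# An anchor element and everything inside it (non-greedy, no nesting).
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# The quoted src attribute of an <img> tag: prefix, quote, URL.
IMG_SRC_RE = re.compile(
    r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE | re.DOTALL
)


def remove_ads(html, classify):
    """Replace the src of every anchored image that classify() marks as an ad.

    classify(anchor_html, img_url) stands in for encoding the instance and
    applying the learned rules; it returns True for advertisements.
    """

    def rewrite_anchor(anchor_match):
        anchor = anchor_match.group(0)

        def rewrite_img(img_match):
            prefix, quote, url = img_match.groups()
            if classify(anchor, url):
                return prefix + quote + PLACEHOLDER + quote
            return img_match.group(0)

        return IMG_SRC_RE.sub(rewrite_img, anchor)

    # Only images inside <A>...</A> are candidates; others are left untouched.
    return ANCHOR_RE.sub(rewrite_anchor, html)


page = (
    '<p><a href="http://ads.example/click"><img src="http://ads.example/banner.gif"></a>'
    '<img src="/logo.gif">'
    '<a href="/home"><img src="/home.gif"></a></p>'
)
cleaned = remove_ads(page, lambda anchor, url: "ads.example" in url)
print(cleaned)
```

Swapping the URL rather than deleting the tag keeps the page layout intact while avoiding the download of the advertisement image.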
Evaluation
• Speed and accuracy
  • Experiment setting
    - Total samples: 459 AD examples, 2820 NON-AD examples
    - 10-fold cross-validation: training set 90% of examples, test set 10% of examples
  • Off-line training phase: 5.8 minutes
  • On-line classification phase: 70 msec/image
  • Average accuracy: 97.1%
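For readers unfamiliar with the protocol, the sketch below shows one way to produce 10-fold cross-validation splits over the 3279 examples (459 + 2820) reported on this slide. The helper `k_fold_indices` and the fixed shuffle seed are assumptions; the slides do not say how the folds were drawn. The script also computes a figure that is not on the slide but follows from its numbers: a classifier that always answers NON-AD would be right on 2820 of 3279 examples, which puts the reported 97.1% accuracy in context.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


n_ad, n_non_ad = 459, 2820
total = n_ad + n_non_ad

splits = list(k_fold_indices(total, k=10))
train, test = splits[0]
print(len(train), len(test))  # roughly 90% and 10% of the examples

# Accuracy of always predicting the majority class (NON-AD).
majority_baseline = n_non_ad / total
print(f"{majority_baseline:.1%}")
```

Each example appears in exactly one test fold, so the average over the ten folds uses every example once for testing.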
Evaluation
• Learning curves
  • Simple methodology
    - Does not recalculate the feature set
  • Realistic methodology
    - Recalculates the feature set
Evaluation • Alternative encodings
Related Work
• Muffin: Filtering web pages
  • ImageKill filter: hand-crafted rules
    - ImageKill.minheight: only remove images which are at least n pixels high
    - ImageKill.minwidth: only remove images which are at least n pixels wide
    - ImageKill.ratio: remove images which are more than n times as wide as they are high
    - ImageKill.exclude: don't remove images that match the given string/regexp
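For contrast with the learned rules, a hand-crafted filter of this kind might look like the sketch below. This is not Muffin's code. The function name `imagekill_removes` is hypothetical, and how the four options combine is an assumption: here an image is removed only if it passes the minimum-height, minimum-width, and ratio tests and its URL does not match the exclude pattern.

```python
import re


def imagekill_removes(width, height, src, minheight, minwidth, ratio, exclude=None):
    """Decide whether a hand-tuned size filter would remove an image.

    Assumption: all size conditions must hold, and exclude vetoes removal.
    """
    if exclude is not None and re.search(exclude, src):
        return False  # explicitly protected by the exclude pattern
    if height < minheight or width < minwidth:
        return False  # too small to be considered
    # Banner-shaped: more than `ratio` times as wide as it is high.
    return width > ratio * height


print(imagekill_removes(468, 60, "http://ads.example/banner.gif",
                        minheight=10, minwidth=100, ratio=2))
```

The thresholds must be chosen and maintained by hand, which is the limitation that learning rules from examples is meant to address.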
Related Work
• WebFilter: Filtering web pages
  • Solution
    - User provides a list of URL templates and corresponding filter scripts
Related Work
• Junkbuster: Filtering web pages
  • Solution
    - User provides a block file
Related Work
• Smokey: Detecting abusive messages
  • Solution
    - Generate rules by training on sample messages
    - Parse each message and generate a feature vector
    - Classify the feature vector with the generated rules
Conclusion and Future Work
• Conclusion
  - High accuracy
  - Modest resource cost (processing time, training samples)
• Future Work
  - Incremental learning algorithm
  - More efficient feature selection mechanism