
Advanced Techniques for Automatic Web Filtering



Presentation Transcript


  1. Advanced Techniques for Automatic Web Filtering • James Z. Wang, PNC Tech. Career Dev. Professor, Penn State University • Joint work: Jia Li, Assist. Prof., Penn State Statistics; Gio Wiederhold, Prof., Stanford Computer Science • http://wang.ist.psu.edu • J. Z. Wang, Penn State University

  2. Outline • The problem • Related approaches • Filtering based on image content • Goals and methods • The WIPE system • Experimental results • Website classification by image content • Conclusions and future work

  3. The Size and Content of the Web • 02/99: ~16 million total web servers • Estimated total number of pages on the web: ~800 million • 15 terabytes of text (comparable to the text of the Library of Congress) • Year 2001: 3 to 5 billion pages • Source: Lawrence and Giles, Nature, 1999

  5. Pornography-free Websites • E.g., Yahoo!Kids, disney.com • Useful for protecting children too young to know how to use a Web browser • It is difficult to control access to other sites

  6. Text-based Filtering • E.g., NetNanny, Cyber Patrol, CyberSitter • Methods: store more than 10,000 IPs; block based on keywords; block all image access • Problems: the Internet is dynamic; keywords are not enough (e.g., text incorporated in images); images are needed by all net users
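The text-based methods listed on this slide can be illustrated with a minimal sketch. It is not the logic of any of the named products; the host list, the keyword list, and the `should_block` function are hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
import re

# Hypothetical blocklists; real products are said to store more than 10,000 IPs.
BLOCKED_HOSTS = {"203.0.113.7"}
BLOCKED_KEYWORDS = {"xxx", "adult"}


def should_block(host: str, page_text: str) -> bool:
    """Block a page if its host is listed or its text contains a listed keyword."""
    if host in BLOCKED_HOSTS:
        return True
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not BLOCKED_KEYWORDS.isdisjoint(words)


# A listed host is blocked regardless of its text.
listed = should_block("203.0.113.7", "welcome")
# An unlisted host is blocked only when a keyword appears in the text.
keyword_hit = should_block("198.51.100.1", "Adult content inside")
# A page whose wording is embedded in images exposes no text to match.
text_in_images = should_block("198.51.100.1", "")
```

The last call shows the weakness named on the slide: a new, unlisted site that puts its wording inside images passes both checks.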

  7. Classification of Web Community • Flake, Lawrence, Giles, ACM KDD, 2000 • Graph clustering based on max flow – min cut analysis of the Web connectedness
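As a rough illustration of max flow – min cut clustering, the sketch below finds the pages on the source side of a minimum cut in a small undirected link graph with unit capacities. It is a simplified sketch, not the construction used by Flake, Lawrence, and Giles; the function name, the toy graph, and the choice of one seed page and one outside page are assumptions.

```python
from collections import defaultdict, deque


def min_cut_community(edges, source, sink):
    """Return the nodes on the source side of a minimum source-sink cut."""
    # Each undirected link becomes two directed arcs of capacity 1.
    capacity = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        capacity[u][v] += 1
        capacity[v][u] += 1

    def reachable_with_parents():
        # Breadth-first search over arcs that still have residual capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in capacity[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = reachable_with_parents()
        if sink not in parents:
            # No augmenting path is left: the reachable set is the source side.
            return set(parents)
        # Walk back from the sink to collect the augmenting path.
        path = []
        node = sink
        while parents[node] is not None:
            path.append((parents[node], node))
            node = parents[node]
        bottleneck = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= bottleneck
            capacity[v][u] += bottleneck


# Toy graph: two triangles of pages joined by the single link C-D.
LINKS = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]
community = min_cut_community(LINKS, source="A", sink="F")
```

Here the cheapest way to separate the seed page A from the outside page F is to cut the single link C-D, so the community around A is the first triangle.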

  9. Goals and Methods • The problem comes from images, so we deal with images • Goal: use machine learning and image retrieval to classify Web images and Websites • Requirements: high accuracy and high speed • Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition…

  10. The WIPE System • Inspired by UC Berkeley’s FNP system: detailed analysis of images; skin filter and human figure grouper; speed: 6 minutes of CPU time per image; accuracy: 52% sensitivity and 96% specificity • Stanford WIPE system: wavelet-based feature extraction + image classification + integrated region matching + machine learning; speed: < 1 second of CPU time per image; accuracy: 96% sensitivity and 91% specificity
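For readers unfamiliar with the two accuracy figures, sensitivity is the fraction of objectionable images that the filter rejects, and specificity is the fraction of benign images that it passes. The sketch below computes both from a confusion table; the counts are hypothetical and were chosen only to reproduce the 96% and 91% quoted for WIPE.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    true_positive: int   # objectionable images correctly rejected
    false_negative: int  # objectionable images wrongly passed
    true_negative: int   # benign images correctly passed
    false_positive: int  # benign images wrongly rejected


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of objectionable images that are rejected."""
    return c.true_positive / (c.true_positive + c.false_negative)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of benign images that are passed."""
    return c.true_negative / (c.true_negative + c.false_positive)


# Hypothetical counts: 100 objectionable and 100 benign test images.
counts = ConfusionCounts(true_positive=96, false_negative=4,
                         true_negative=91, false_positive=9)
wipe_sensitivity = sensitivity(counts)
wipe_specificity = specificity(counts)
```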

  11. System Flow • Original Web image → feature extraction (color, texture, shape) → type classification → (photograph) photo classification, using training features → result: REJECT or PASS

  12. Wavelet Principle
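The figure for this slide is not preserved in the transcript. As a generic illustration of the wavelet idea, the sketch below applies one level of the Haar transform, the simplest wavelet, which splits a signal into local averages (coarse content) and local differences (detail). The choice of Haar, the unnormalized averaging, and the function names are assumptions made for illustration; the slides do not say which wavelet WIPE uses.

```python
def haar_1d(signal):
    """One level of an unnormalized Haar transform: averages, then differences."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details


def haar_2d(image):
    """Apply haar_1d to every row, then to every column of the result."""
    rows = [haar_1d(row) for row in image]
    columns = [haar_1d(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


coefficients = haar_1d([4, 2, 6, 8])
flat_patch = haar_2d([[5, 5], [5, 5]])
```

A perfectly smooth patch yields zero detail coefficients, so the size of the detail coefficients gives a simple measure of local texture and edges.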

  13. Type Classification • Graphs: manually generated images with smooth tones

  14. Type Classification • Photographs: images with continuous tones

  15. Photo Classification • Content-based image retrieval + statistical classification
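One simple way to combine retrieval with statistical classification is a nearest-neighbor vote: retrieve the training images whose feature vectors are closest to the query, then decide by majority. The sketch below is an assumption about how such a combination could look, not the WIPE classifier; the two-dimensional feature vectors, the training set, and `classify_photo` are hypothetical.

```python
from math import dist


def classify_photo(query_features, training_set, k=3):
    """Retrieve the k nearest training images and take a majority vote.

    training_set is a list of (feature_vector, is_objectionable) pairs.
    """
    nearest = sorted(training_set,
                     key=lambda item: dist(query_features, item[0]))[:k]
    votes = sum(1 for _, is_objectionable in nearest if is_objectionable)
    return "REJECT" if votes * 2 > k else "PASS"


# Hypothetical 2-D feature vectors; real features would describe color,
# texture, and shape.
TRAINING = [
    ((0.90, 0.80), True), ((0.80, 0.90), True), ((0.85, 0.70), True),
    ((0.10, 0.20), False), ((0.20, 0.10), False), ((0.15, 0.30), False),
]
near_objectionable = classify_photo((0.80, 0.80), TRAINING)
near_benign = classify_photo((0.10, 0.10), TRAINING)
```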

  16. Experimental Results • Tested on a set of over 10,000 photographic images • Speed: less than one second of response time on a Pentium III PC • Accuracy

  17. Comment on Accuracy • The algorithm can be adjusted to trade off specificity for higher sensitivity • In a real-world filtering application, both the sensitivity and the specificity are expected to be higher • Icons and graphs can be classified with almost 100% accuracy → higher specificity • Combining text and image classification → higher sensitivity and higher speed
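The trade-off in the first bullet can be seen by sweeping a decision threshold over classifier scores. The sketch below assumes the classifier outputs a score between 0 and 1 and rejects an image when the score reaches the threshold; the scores, labels, and function name are hypothetical.

```python
def rates_at_threshold(scored_images, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are rejected."""
    tp = fn = tn = fp = 0
    for score, is_objectionable in scored_images:
        rejected = score >= threshold
        if is_objectionable:
            tp += rejected
            fn += not rejected
        else:
            fp += rejected
            tn += not rejected
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical (score, is_objectionable) pairs for eight test images.
SCORED = [(0.95, True), (0.80, True), (0.55, True), (0.35, True),
          (0.60, False), (0.30, False), (0.20, False), (0.05, False)]

strict = rates_at_threshold(SCORED, 0.70)
balanced = rates_at_threshold(SCORED, 0.50)
lenient = rates_at_threshold(SCORED, 0.25)
```

Lowering the threshold from 0.70 to 0.25 raises sensitivity from 0.5 to 1.0 on this toy data while specificity falls from 1.0 to 0.5.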

  18. False Classifications: Benign Images • Partially obscured human • Areas with similar features • Painting, fine art • Partially undressed human • Animals (w/o clothes)

  19. False Classifications: Objectionable Images • Frame and text noise • Undressed area too small • Dark, low contrast • Partially dressed • Dressed but objectionable

  20. Website Classification by Image Content • An objectionable site will have many such images • For a given objectionable Website, we denote by p the probability that an image on the Website is objectionable • p is the percentage of objectionable images over all images provided by the site • We assume a distribution of p over all Websites (e.g., Gaussian, shifted Gaussian) • Classification levels could be provided as a service to filtering software producers

  21. Flow in Website classification

  22. Website Classification • Based on statistical analysis (see paper), we can expect higher than 97% accuracy on Website classification if we download 20-35 images from each site and classify a Website as objectionable when 20-25% of the downloaded images are objectionable • Using text and IP addresses as additional criteria, the accuracy can be further improved, e.g., by skipping IPs for museums, dog shows, beach towns, and sport events
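The reasoning behind such a site-level rule can be sketched with a binomial model. This is not the analysis from the paper: it assumes that images are flagged independently, that an objectionable site has a fixed fraction p of objectionable images (rather than a distribution of p), that benign sites have none, and that the image classifier has the 96% sensitivity and 91% specificity quoted earlier. All function and parameter names are hypothetical.

```python
from math import comb


def binomial_tail(n, q, k):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k, n + 1))


def flag_probability(p, sensitivity, specificity):
    """Chance that one downloaded image is flagged, given site fraction p."""
    return p * sensitivity + (1 - p) * (1 - specificity)


def site_decision_rates(n_images, min_flagged, p,
                        sensitivity=0.96, specificity=0.91):
    """Return (P(objectionable site is caught), P(benign site is passed)).

    A site is called objectionable when at least min_flagged of the
    n_images downloaded images are flagged by the image classifier.
    """
    q_objectionable = flag_probability(p, sensitivity, specificity)
    q_benign = flag_probability(0.0, sensitivity, specificity)  # assumed p = 0
    caught = binomial_tail(n_images, q_objectionable, min_flagged)
    passed = 1.0 - binomial_tail(n_images, q_benign, min_flagged)
    return caught, passed


# Assumed setting: 30 images per site, flag the site at 7 or more hits
# (about 23%, inside the 20-25% range on the slide), and p = 0.5.
caught, passed = site_decision_rates(n_images=30, min_flagged=7, p=0.5)
```

Raising `min_flagged` protects benign sites from the classifier's false positives, while lowering it catches objectionable sites with a smaller p; the download count and threshold on the slide balance these two errors.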

  24. Conclusions and Future Work • Perfect filtering is never possible • Effective filtering based on image content is feasible with the current technology • Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed • Objectionable Websites are automatically identifiable: a service for the community? • The technology can still be improved through further research

  25. References • http://WWW-DB.Stanford.EDU/IMAGE (papers) • http://wang.ist.psu.edu ... /cgi-bin/zwang/wipe2_show.cgi (demo) • http://www-db.stanford.edu ... /pub/gio/inprogress.html#COPA (testimony) • jwang@ist.psu.edu (James Wang) • gio@cs.stanford.edu (Gio Wiederhold) • michel@db.stanford.edu (Michel Bilello)
