Understanding Privacy and Web Robots: Managing Your Online Presence

“Privacy is the claim of individuals, groups or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.”
- Alan Westin, Privacy and Freedom, 1967

Wasim Rangoonwala, ID# 00506259, CS-460 Computer Security

What are WWW Robots? A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all documents that it references. Web robots are also known as Web Wanderers, Web Crawlers, Spiders, or Bots.
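The traversal described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web (a real robot would fetch each URL over HTTP), and the names `LinkExtractor` and `crawl` are made up for this example. The sketch uses a queue (breadth-first order) rather than literal recursion, which visits the same set of pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML (assumption for illustration only).
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "http://example.com/b.html": '<a href="/c.html">C</a>',
    "http://example.com/c.html": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages):
    """Return the URLs reachable from start_url, in the order visited."""
    visited = []
    queued = {start_url}          # URLs already scheduled, to avoid loops
    frontier = deque([start_url])
    while frontier:
        url = frontier.popleft()
        html = pages.get(url)
        if html is None:          # unknown page: nothing to retrieve
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)   # resolve relative references
            if link not in queued:
                queued.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    for page in crawl("http://example.com/", PAGES):
        print(page)
```

The `queued` set is what keeps the robot from looping forever when pages link back to each other.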

Web Spiders / Robots Collecting Data

Controlling how search engines access and index your website. Google refers to its spiders as Googlebot and Googlebot-Image. Google has a set of computers that continually crawl the web; together these machines are known as the Googlebot. In general, you want Googlebot to access your site so that your web pages can be found by people searching on Google.

One key question is: how does Google know which parts of a website the site owner wants to show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages appear in search results and which pages are kept private. The answer: the robots.txt file.

Making Use of the Robots.txt File. Robots.txt has been an industry standard for many years; it lets a site owner control how search engines access their web site. The robots.txt file lists the pages and directories that search engines shouldn't access. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of your site. Compliance is voluntary: well-behaved crawlers such as Googlebot honor the file, but it does not block direct access to the listed URLs.
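Because the file lives in the root directory, a crawler can work out where to look for it from any page URL on the site. A minimal sketch, assuming a hypothetical helper named `robots_txt_url` and example URLs that are not from the slides:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.example.com/shop/cart.html?item=3"))
```

A robots.txt file placed in a subdirectory (for example `/shop/robots.txt`) would not be found this way, which is why the root directory matters.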

Making Use of the Robots.txt File (continued)
Examples of pages you may want to keep private from search engines:
• A directory that contains internal logs.
• News articles that require payment to access.
• The administration area of the website.
• Database configuration strings, stored passwords, credit card details.
• Images that you want to keep private.

Achieving Privacy through the Robots.txt File
Example of a robots.txt file:

    # robots.txt file

    # Currently disallow all images to the Google Image bot
    User-agent: Googlebot-Image
    Disallow: /

    # ALL search engine spiders/crawlers (put at end of file)
    User-agent: *
    Disallow: /admin/
    Disallow: /account_password.html
    Disallow: /address_book.html
    Disallow: /checkout_payment.html
    Disallow: /cookie_usage.html
    Disallow: /login.html
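To see how a crawler might interpret rules like these, the sketch below feeds a shortened version of the example to Python's standard `urllib.robotparser` module. It is an illustration under assumptions: the final group is written with `User-agent: *` to match its "all crawlers" comment, the `example.com` URLs are invented, and `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Shortened rules modeled on the slide's example (URLs below are made up).
ROBOTS_TXT = """\
# Currently disallow all images to the Google Image bot
User-agent: Googlebot-Image
Disallow: /

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /admin/
Disallow: /login.html
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    checks = [
        ("Googlebot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/admin/users"),
        ("Googlebot-Image", "http://example.com/photos/cat.jpg"),
    ]
    for agent, url in checks:
        verdict = "allowed" if rules.can_fetch(agent, url) else "blocked"
        print(agent, url, verdict)
```

A crawler that respects the standard would run a check like `can_fetch` before requesting each page; nothing forces a misbehaving robot to do so.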

Privacy through the Robots <META> Tag
You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not to scan it for links to follow.
Example of a <META> tag:

    <html>
    <head>
    <title>...</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    </head>

• The "NAME" attribute must be "ROBOTS".
• Valid values for the "CONTENT" attribute are "INDEX", "NOINDEX", "FOLLOW", and "NOFOLLOW". Multiple comma-separated values are allowed, but only some combinations make sense.
• If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there is no need to spell that out.
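The rules above (four valid values, comma-separated, default "INDEX,FOLLOW") can be sketched as a small parser built on Python's standard `html.parser` module. The names `RobotsMetaParser` and `robots_directives` are hypothetical, and real crawlers may recognize more values than the four listed on the slide.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Tracks the index/follow decision from a robots <META> tag."""

    def __init__(self):
        super().__init__()
        # Default when no robots <META> tag is present: INDEX, FOLLOW.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().upper() != "ROBOTS":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().upper()
            if token == "NOINDEX":
                self.index = False
            elif token == "INDEX":
                self.index = True
            elif token == "NOFOLLOW":
                self.follow = False
            elif token == "FOLLOW":
                self.follow = True


def robots_directives(html):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<html><head><title>x</title><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>'
    print(robots_directives(page))
```

Unlike robots.txt, this tag is read only after the page has been fetched, so it controls indexing and link-following rather than access.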

Search Engine Web Spider Names
• Yahoo! Search: Yahoo Slurp
• AltaVista: Scooter
• Ask Jeeves: Ask Jeeves/Teoma
• MSN Search: MSNbot
Visit http://www.robotstxt.org/db.html for more details on search engine web spider names.

Bonus

Google: Anatomy
• Google crawlers (Googlebot)
• Multiple distributed crawlers
• Own DNS cache
• 300 connections open at once
• Send fetched pages to the Store Server
• Originally written in Python
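As an illustration of why a crawler might keep its own DNS cache, the sketch below remembers each hostname's address so that repeated requests to the same site skip the lookup. This is not Google's implementation: the `DnsCache` class, its `lookups` counter, and the pluggable `resolver` argument are assumptions made for this example.

```python
import socket
from typing import Callable, Dict


class DnsCache:
    """Caches hostname -> IP address so each host is resolved only once."""

    def __init__(self, resolver: Callable[[str], str] = socket.gethostbyname):
        self._resolver = resolver        # swappable, e.g. for testing offline
        self._cache: Dict[str, str] = {}
        self.lookups = 0                 # number of real resolver calls made

    def resolve(self, host: str) -> str:
        host = host.lower()              # hostnames are case-insensitive
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


if __name__ == "__main__":
    # A fake resolver keeps the demo offline; 192.0.2.1 is a documentation address.
    cache = DnsCache(resolver=lambda host: "192.0.2.1")
    for name in ["example.com", "Example.com", "example.com"]:
        cache.resolve(name)
    print("resolver calls:", cache.lookups)
```

With hundreds of connections open at once, many of them to the same hosts, avoiding repeated lookups is one plausible reason for such a cache.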

Google: Technology • PageRank™ Algorithm • Hypertext-Matching Analysis

Google Webmaster Central
Webmaster Central offers services that let site owners:
• see which parts of a site Googlebot had problems crawling
• upload an XML Sitemap file
• analyze and generate robots.txt files
• remove URLs already crawled by Googlebot
• specify the preferred domain
• identify issues with title and description meta tags
• understand the top searches used to reach a site
• get a glimpse of how Googlebot sees pages
• remove unwanted site links that Google may use in results

Protect your privacy on the Web
• When surfing the internet, avoid “free” offers and protect your information!
• Beware of phishing: fake e-mails sent to try to obtain your personal and financial information.
• Chatting: guard your information unless you are 100% sure who you are chatting with.
• Don’t even open spam; download a spam buster!
• Cookies aren’t just for eating; they may be sending your personal information to others.
• E-mail is not secure and should never be thought of as private.
• Protect your passwords like you would your wallet or car keys. Make them complicated!

• http://www.google.com/support/webmasters/bin/answer.py?answer=80553
• http://www.google.com/bot.html
• http://www.googleguide.com
• http://www.searchengineposition.com
• http://www.google-watch.org
• http://www.robotstxt.org/db.html
• http://www.googleblog.blogspot.com
• For more details, visit http://techwasim.blogspot.com
