
From the Inside Out. Michael Hunter, Reference Librarian, Hobart and William Smith Colleges



  1. From the Inside Out. Michael Hunter, Reference Librarian, Hobart and William Smith Colleges

  2. Google from the Inside Out • Hardware and Database Creation • Relevance Ranking and Link Analysis • Advanced and “Hidden” Search Features • Hands-on Session • Pay-for-Placement and Revenue Issues • Our Google “Wish List” • Other Services to Keep Our Eyes On

  3. Google’s Beginnings • 1996 – Sergey Brin and Larry Page of Stanford develop “BackRub”, based on analysis of links TO a page from other sites • Sept. 7, 1998, Menlo Park, CA – Google launches in beta with over 10,000 queries a day • December 1998 – Listed in PC Magazine’s Top 100 Websites

  4. What’s in a name? • “Google” is a play on “googol”, a term coined by Milton Sirotta, nephew of mathematician Edward Kasner, to refer to the number one followed by 100 zeros

  5. Google’s Hardware • Over 10,000 servers in two locations containing “hundreds of copies of the database” • Index of more than 3 billion web documents • Handles thousands of queries on a sub-second basis • Interviews in MP3 format with Chief Operations Engineer Jim Reese • //technetcast.com/tnc_play_stream.html?stream_id=420 (1 hr. 13 min.) • //technetcast.com/tnc_play_stream.html?stream_id=421 (15 min.)

  6. Google’s Multi-faceted Database • Indexed html pages • Unindexed html pages • Other file types • Html pages that are re-indexed daily

  7. Multi-faceted Database

  8. What types of pages are unindexed? (25%) • Dead or inaccurate links • Duplicate pages • Database-generated URLs • Pages with robots.txt or noindex meta tags • Pages on an intranet • Pages “waiting” to be indexed fully

  9. How did they get into Google? • Google crawls and downloads links in the documents it encounters • Some of these links are dead or inaccurate, or cannot be crawled for other reasons (intranets, robots.txt) • The URLs are in the database, but the documents are not
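
The robots.txt case mentioned above can be sketched with Python's standard urllib.robotparser module. This is only an illustration of how a well-behaved crawler might decide to record a URL without downloading the document; the robots.txt content, the "ExampleBot" user agent, and the example.com URLs are made up, and nothing here describes Google's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the example robots.txt allows fetching the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = can_crawl("http://www.example.com/index.html")
blocked = can_crawl("http://www.example.com/private/report.html")
print(allowed, blocked)
```

A crawler that finds a link to the blocked URL still knows the URL exists, which is how such entries can end up in the database without the document itself.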

  10. Why does Google leave them in? • They are not COMPLETELY unindexed • Indexed elements include • Words in the URL: http://members.home.net/gourdeaud/ • Words in the anchor text on indexed pages that link to the unindexed URL: <a href="http://members.home.net/gourdeaud/">Gourdeaud’s biography</a> • Can be useful in URL searches, unique term queries, and PageRank
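
As a rough illustration of the two indexed elements listed above, the sketch below pulls the anchor text out of a linking page with Python's standard html.parser and combines it with the words in the URL. The class name, the index_terms helper, and the tokenizing rule are assumptions made for this example, not a description of Google's indexer.

```python
import re
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_terms(href: str, anchor_text: str) -> set:
    """Words that could be indexed for a URL whose document was never fetched."""
    return set(re.findall(r"[a-z0-9]+", (href + " " + anchor_text).lower()))


page = '<p>See <a href="http://members.home.net/gourdeaud/">Gourdeaud\'s biography</a>.</p>'
collector = AnchorTextCollector()
collector.feed(page)
links = collector.links
terms = index_terms(*links[0])
print(links, sorted(terms))
```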

  11. How can I distinguish unindexed pages in search results? • No extract • No page size • No cached copy of the page
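
The three signs above can be expressed as a simple check. The SearchResult record and its field names are hypothetical, invented for this sketch; they do not correspond to any real Google results format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """Hypothetical record for one entry in a results list."""
    url: str
    extract: Optional[str] = None        # text snippet shown under the title
    page_size_kb: Optional[int] = None   # page size shown in the listing
    cached_url: Optional[str] = None     # link to the cached copy


def looks_unindexed(result: SearchResult) -> bool:
    """True when the listing has no extract, no page size and no cached copy."""
    return (
        result.extract is None
        and result.page_size_kb is None
        and result.cached_url is None
    )


bare = SearchResult(url="http://members.home.net/gourdeaud/")
full = SearchResult(
    url="http://www.example.com/",
    extract="An example page...",
    page_size_kb=12,
    cached_url="http://cache.example/copy",
)
print(looks_unindexed(bare), looks_unindexed(full))
```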

  12. Deep Web Components: Non-html filetypes (1.75%) • Adobe Portable Document Format (pdf) • Adobe PostScript (ps) • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk…) • Lotus WordPro (lwp) • MacWrite (mw) • Microsoft Excel (xls) • Microsoft PowerPoint (ppt) • Microsoft Word (doc) • Microsoft Works (wks, wps, wdb) • Microsoft Write (wri) • Rich Text Format (rtf) • Text (ans, txt) • SEARCH SYNTAX: “california power shortage” filetype:pdf

  13. Google Non-html Filetypes: Warning! • FOR NON-HTML FILES • Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file • INSTEAD, click the “View as HTML” option; no applications will be opened and there is no risk of a virus or worm • NOTE: Titles for non-html files are frequently not descriptive of content

  14. Non-html filetypes in Google: Notess Study, March 6, 2002 – 25 One-Word Searches

  15. “homeland security” filetype:ppt

  16. Deep Web Components: Daily re-indexed pages (.15%) • Over 3 million • Regular html pages that Google has noticed are frequently updated • Google re-indexes these “every day or so” • Date of Google’s last visit to the page appears in the results listing

  17. Google’s Database • Freshness • Breadth • Depth

  18. Database Freshness • Refreshes its entire web index “on a roughly monthly basis, about every 28 days”. • On-going process • Some segments fresher than others

  19. Notess Study, April 6, 2002: Pages that are updated daily and report that date

  20. Database Breadth (Size) • About 3 billion documents (indexed and unindexed) • Daily figure on the homepage: 3,083,324,652 on March 8, 2003 (not including Images or Usenet) • FAST (alltheweb.com) claimed 2.1 billion indexed documents, March 8, 2003

  21. Database Depth • Google “typically” downloads the first 110 K of a web document • Download includes URLs of outgoing links

  22. Database “Blending” • Results from Google’s News vertical engine are included in results for all searches • Blending is increasingly common among search services • News • Shopping • Directory

  23. Relevance Ranking and Link Analysis Google’s “PageRank” Demystified

  24. Relevance Ranking • Processing and presenting retrieved results • Proprietary information • Search Engine Optimization Industry has made it even more so • “How can I make my site rank high in Google?”

  25. What happens when I enter a search at Google? • Check of search syntax and spelling • Query routed to the appropriate server “based on the [database] segment on which the answer is likely to be found”

  26. What happens when I enter a search at Google? • Processing of Visible text • Search term(s) position – title, heading, text • Search term(s) frequency • Search term(s) proximity • Processing of Invisible text • Meta tags • Anchor text (the link text inside the <a> tag): <a href="http://www.hws.edu">Hobart and William Smith Colleges</a>
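
To make the visible-text factors above concrete, here is a toy scorer covering position, frequency, and proximity. The field weights, function names, and sample documents are all assumptions chosen for illustration; Google's real weighting is proprietary and certainly more elaborate.

```python
import re

# Assumed weights: a term in the title counts more than one in the body text.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "text": 1.0}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def position_frequency_score(doc: dict, terms: list) -> float:
    """Sum weighted term counts over the title, heading and text fields."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(doc.get(field, ""))
        for term in terms:
            score += weight * tokens.count(term)
    return score


def term_distance(tokens: list, first: str, second: str):
    """Smallest gap (in tokens) between two terms, or None if either is absent."""
    first_pos = [i for i, tok in enumerate(tokens) if tok == first]
    second_pos = [i for i, tok in enumerate(tokens) if tok == second]
    if not first_pos or not second_pos:
        return None
    return min(abs(i - j) for i in first_pos for j in second_pos)


terms = ["google", "pagerank"]
doc1 = {
    "title": "Google PageRank explained",
    "text": "PageRank is a link analysis method used by Google",
}
doc2 = {
    "title": "Search engines",
    "text": "Google is one engine. Many pages discuss ranking, including PageRank.",
}
score1 = position_frequency_score(doc1, terms)
score2 = position_frequency_score(doc2, terms)
gap = term_distance(tokenize(doc1["title"]), "google", "pagerank")
print(score1, score2, gap)
```

Both sample documents contain each term once in the body, but the first also has them side by side in the title, so it scores higher under these assumed weights.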

  27. What happens when I enter a search at Google? • PageRank link analysis applied • Click popularity (Google Toolbar voting data) • Link context (Proximity of links to your search term(s) within the document) • Final dynamic mix of “about 25 factors”

  28. PageRank Demystified • Patented link analysis program • Part of Google since its beginnings • Objective – To make ranking more of a “human process” • Assigns each page in Google a PageRank score, which is dynamic (changeable) • Weighs heavily in final ranking of results

  29. PageRank’s Multi-layered processing • Layer I • Do others think your site is of value as demonstrated by linking to you? IF SO … • Layer II • Are these “others” in turn linked to by sites recognized through linkage within “web communities”?

  30. PageRank’s Multi-layered processing • A Favorable Ranking Scenario A .com site selling prosthetics linked TO by A local orthopedic association in turn linked TO by A national orthopedic group in turn linked TO by The National Institutes of Health

  31. Visualizing Linkage in Google’s Database with TouchGraph • Browser: http://www.touchgraph.com/TGGoogleBrowser.html • Instructions: http://www.touchgraph.com/TGGB_FullInstructions.html

  32. How Does Google Identify “Web Communities”? • Mutual linkage patterns • Metadata elements and keywords found in common • Human examination/verification of the quality of key sites within the community • Other proprietary factors ???????

  33. PageRank Nitty Gritty • Every page of a site can have a PageRank score, not just the main page • The value of a link from Site B to Site A is decreased with each additional link from Site B to any other site • Rationale: If Site B has only a few links, each one could be more important than if Site B has hundreds of outgoing links
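
The dilution described above matches the formula in Brin and Page's published 1998 paper, where each linking page passes on its score divided by its number of outgoing links, scaled by a damping factor d (0.85 is the value usually quoted from the paper). The sketch below computes only that single term; the function name is made up, and Google's production scoring is proprietary and may differ.

```python
def link_value(page_rank: float, outgoing_links: int, damping: float = 0.85) -> float:
    """Share of a page's score passed along one of its outgoing links.

    Based on the published formula PR(A) = (1 - d) + d * sum(PR(T) / C(T)),
    where C(T) is the number of outgoing links on linking page T.
    """
    if outgoing_links < 1:
        raise ValueError("a page must have at least one outgoing link to pass value")
    return damping * page_rank / outgoing_links


few = link_value(1.0, 4)      # Site B links to only 4 pages
many = link_value(1.0, 400)   # Site B links to 400 pages
print(few, many)
```

With the same score for Site B, a link from the 4-link page is worth one hundred times as much as a link from the 400-link page.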

  34. PageRank Nitty Gritty • Requires human adjustment in the case of large subject directories and quality lists of links • PageRank scoring is a dynamic process always in flux • To find a page’s PageRank score, go to the Toolbar and click on the green meter

  35. PageRank Feedback • Site A has NO outgoing links, but is linked TO by Site B • Site A decides to create a single link to Site B • This increases Site B’s PageRank score • Site B’s increased score in turn automatically increases Site A’s score
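
The feedback effect above can be reproduced with a small iterative calculation using the formula from Brin and Page's published paper, PR(A) = (1 - d) + d * sum(PR(T) / C(T)). This is a minimal sketch on a two-page graph, assuming d = 0.85 and a fixed number of iterations; it is not Google's production implementation.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 100) -> dict:
    """Iterate the published PageRank formula over a small link graph.

    `links` maps each page to the list of pages it links TO.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    scores = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_scores = {}
        for page in pages:
            # Each linking page contributes its score divided by its link count.
            incoming = sum(
                scores[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_scores[page] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


# Before: Site B links to Site A, and Site A has no outgoing links.
before = pagerank({"B": ["A"], "A": []})
# After: Site A adds a single link back to Site B.
after = pagerank({"B": ["A"], "A": ["B"]})
print(before, after)
```

In this toy graph, Site B's score rises once Site A links back, and Site A's score rises with it, which is the loop the slide describes.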

  36. Sounds easy to manipulate… • Possibilities include • Spam • Link “farms” • Cloaking (sneaky re-directs) • Google is vigilant • If Google detects any manipulation of PageRank, it eliminates the domain from its database and never crawls there again.

  37. PageRank Processing • How does Google know who has linked to Site A, for example? • By searching its database for all sites with links to Site A • No way to do this by examining Site A, as there is no physical change to a document when it is linked TO
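
Because a page carries no record of who links to it, the lookup above has to run over the crawled link data. One common way to do this is to invert the outgoing-link map into a backlink index, as in this sketch; the site names and function name are hypothetical, and the real database structure is not public.

```python
from collections import defaultdict


def build_backlink_index(outgoing: dict) -> dict:
    """Invert a map of page -> pages it links to into page -> pages linking to it."""
    backlinks = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks


# Hypothetical crawl data: each key is a crawled site, each value its outgoing links.
crawl = {
    "siteB.example": ["siteA.example"],
    "siteC.example": ["siteA.example", "siteB.example"],
    "siteA.example": [],
}
index = build_backlink_index(crawl)
linking_to_a = sorted(index["siteA.example"])
print(linking_to_a)
```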

  38. Implications of PageRank • PageRank is entirely dependent on linkage data derived from the Google database • Breadth, depth and freshness of the crawl is critical to accurate and current data for PageRank scoring

  39. A Different Perspective on PR: Anti-Google • Daniel Brandt claims • “PageRank discriminates against new web sites” (which may not yet be linked to by other sites) • “Careless custodian of private information” (Google associates each search with a cookie, set to last 36 years) • Maintains google-watch.org

  40. PageRank – A Summary: Not all links are created equal • Is this site linked TO by “good” web pages associated with this topic? • EXAMPLE: If a page is linked to by a subject directory (Yahoo, OD, LII), its rank will be higher than another page with many links from personal web pages, link “farms”, etc. • NOTE: Link Analysis (PageRank) is not the same as Link Popularity (number of links)

  41. Searching Google: Touring the Known and the Unknown Please share your discoveries with us!

  42. Command Searching with Google’s Fields (aka Search Operators) • Field Searches that cannot be combined with other search elements: • NOTE: No space allowed between operator and following text • cache: retrieves cached version of the specified URL • link: retrieves pages that have links to the specified URL • related: retrieves pages that are “similar” to the specified URL (same as Similar Pages feature in results listing)

  43. Command Searching with Google’s Fields (aka Search Operators) • Field Searches that cannot be combined with other search elements: • info: retrieves information that Google has about the specified URL • stocks: retrieves stock information about the companies whose ticker symbols follow the stocks: operator stocks:intc (Intel)

  44. Command Searching with Google’s Fields (aka Search Operators) • Field Searches that can be combined with other search elements: • site: restrict results to those from the specified domain site:www.google.com PageRank NOTE: retrieves all pages from www.google.com that contain PageRank anywhere
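
The operator rules from the preceding slides (no space after the colon; some fields stand alone, others combine with search terms) can be captured in a small query builder. The function names and the grouping of operators into two sets are assumptions for this sketch, based only on the slides; the search URL uses the familiar q parameter.

```python
from urllib.parse import urlencode

# Operators the slides say cannot be combined with other search elements.
STANDALONE = {"cache", "link", "related", "info", "stocks"}
# Operators the slides show combined with other terms.
COMBINABLE = {"site", "filetype"}


def field_query(operator: str, value: str, extra_terms: str = "") -> str:
    """Build a field search with no space between the operator and its value."""
    if operator not in STANDALONE | COMBINABLE:
        raise ValueError(f"unknown operator: {operator}")
    if operator in STANDALONE and extra_terms:
        raise ValueError(f"{operator}: cannot be combined with other terms")
    query = f"{operator}:{value.strip()}"
    if extra_terms:
        query += " " + extra_terms.strip()
    return query


def search_url(query: str) -> str:
    """Encode the query into a Google search URL."""
    return "http://www.google.com/search?" + urlencode({"q": query})


site_query = field_query("site", "www.google.com", "PageRank")
stock_query = field_query("stocks", "intc")
print(site_query)
print(stock_query)
print(search_url(site_query))
```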
