
Internet Search Engines



Presentation Transcript


  1. Internet Search Engines Julia Vuong Aaron Kurtzhals

  2. Introduction • What is an Internet Search Engine • Brief History • Importance • What an Internet Search Engine does. • How it works

  3. Definition • An internet search engine is an automated system used to find information on the world wide web.

  4. Brief History • 1994 – An early search engine, World Wide Web Worm, indexed about 110,000 web pages. • 1997 – Search engines indexed millions of web pages. • Today – Google indexes over 4 billion web pages.

  5. Why are internet search engines important? • The internet contains large amounts of information. • But this information is not always easy to find. • Where should one look for the information? • DNS (Domain Name System) names are not very forgiving.

  6. What does an Internet Search Engine do? • Crawls the web • Indexes web pages • Responds to searches

  7. Spiders take a web page’s content and create key search words that enable online users to find the pages they’re looking for.

  8. Robots (spiders, crawlers, bots) • A program that downloads web pages. • Similar to a browser, but generates machine readable information rather than a human readable display. • Purpose is to create an index of the internet.
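
As a minimal sketch of what a robot produces from one downloaded page, the Python below extracts the outgoing links and the words from an HTML string. The class name LinkAndTextParser, the helper parse_page, and the example.com URL are assumptions made for illustration; a real crawler would also fetch pages over the network, honor robots.txt, and skip script and style content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects outgoing links and words from one HTML page (illustrative)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Machine-readable output: a flat list of lowercased words.
        self.words.extend(data.lower().split())


def parse_page(base_url, html):
    """Return (links, words) for one page; hypothetical helper."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    return parser.links, parser.words


page = '<html><body><h1>Search Engines</h1><a href="/about.html">About us</a></body></html>'
links, words = parse_page("http://example.com/index.html", page)
```

The links list tells the robot which pages to download next, and the words list is what gets passed on to the indexer.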

  9. Indexing the Internet • A search engine creates a database of words on the internet. • Information about each instance of the word is also stored.
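
One common way to build such a database is an inverted index. The sketch below assumes that the "information about each instance" is simply the word's position on the page; the function name build_index and the two sample pages are made up for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: [positions]} for the given pages."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            # Record every instance of the word, not just its presence.
            index[word].setdefault(url, []).append(position)
    return index


# Hypothetical page contents, keyed by URL.
pages = {
    "a.html": "search engines index the web",
    "b.html": "spiders crawl the web",
}
index = build_index(pages)
```

Looking up index["web"] then returns every page that contains the word, together with where it occurs on each page.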

  10. Executing Searches • Find webpages that contain the desired words. • Webpages that contain the desired words are ranked and displayed to the user.
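
The two steps above can be sketched against a small word-to-pages index. The index contents and the function name search are assumptions, and ranking by the number of occurrences is only one simple choice, not how any particular engine orders results.

```python
def search(index, query):
    """Return URLs containing every query word, most occurrences first."""
    words = query.lower().split()
    if not words:
        return []
    # Step 1: keep only pages that contain all of the desired words.
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))

    # Step 2: rank by the total number of occurrences of the query words.
    def score(url):
        return sum(len(index[word][url]) for word in words)

    return sorted(matches, key=lambda url: (-score(url), url))


# Hypothetical index: word -> {url: [positions of the word on that page]}.
index = {
    "web": {"a.html": [4], "b.html": [3, 7]},
    "crawl": {"b.html": [1], "c.html": [0]},
}
results = search(index, "web")
```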

  11. Problems with searching the Internet • The size and scope of the internet make it difficult to search. • A search engine is not very good at understanding context. • Humans are only able to view a small number of search results.

  12. Ranking Search Results • An internet search can generate thousands of results. • A person is only able to read the first few results. • How does the search engine decide the order of the results?

  13. Ranking Algorithms • No ranking • Paid positioning • Content-based ranking • Pagerank

  14. No ranking • Simple • Requires less storage • Fast • Less helpful to humans • Performance advantages are minimal.

  15. Paid Positioning • If a webpage is willing to pay, it must be important • Similar but not identical to paid inclusion • Can create a backlash • Does not address the issue of ranking websites in general

  16. Content-based Ranking • Attempt to determine context • Relative proximity of words • How many times a word appears on a page. • Usage in HTML tags
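
Two of these signals, word count and relative proximity, can be combined into a toy score as follows. The scoring formula (occurrences plus the inverse of the smallest distance between the two terms) is an assumption chosen for illustration and is not taken from any real search engine.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap in word positions between two terms, or None."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


def content_score(text, term_a, term_b):
    """Toy score: how often the terms appear, plus a bonus when they are close."""
    words = text.lower().split()
    score = float(words.count(term_a) + words.count(term_b))
    distance = min_distance(words, term_a, term_b)
    if distance is not None and distance > 0:
        score += 1.0 / distance  # adjacent terms earn the largest bonus
    return score


close = content_score("web search engines crawl the web", "web", "search")
far = content_score("search for recipes on the world wide web", "web", "search")
```

A page where the query words appear often and next to each other scores higher than one where they appear once and far apart.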

  17. Words inside HTML tags • Title, Header • Links (anchors), both for the page the link is on, and the page the link points to • Allows search engines to find files not accessible to a crawler • Makes “Google-bombing” possible

  18. Words inside HTML tags • Meta tags • Not intended to be rendered by browsers • Supposed to be “what is this page about”, so it would seem ideal for search engines • The use of meta tags to “fool” search engines is a serious drawback.
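
A rough sketch of tag-based weighting: words inside a title, header, or link count for more than words in ordinary text. The weights in TAG_WEIGHTS and the class name TagWeightParser are hypothetical, and meta tags are deliberately ignored here because of the drawback noted above.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; search engines do not publish the values they use.
TAG_WEIGHTS = {"title": 5, "h1": 3, "a": 2}


class TagWeightParser(HTMLParser):
    """Scores each word by the most important tag that encloses it."""

    def __init__(self):
        super().__init__()
        self.depth = Counter()   # how many weighted tags of each kind are open
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        open_weights = [TAG_WEIGHTS[t] for t, n in self.depth.items() if n > 0]
        weight = max(open_weights, default=1)  # ordinary text counts once
        for word in data.lower().split():
            self.scores[word] += weight


html = (
    "<html><head><title>Search Engines</title></head>"
    '<body><h1>Ranking</h1><p>Search results and <a href="x.html">ranking</a></p>'
    "</body></html>"
)
parser = TagWeightParser()
parser.feed(html)
scores = parser.scores
```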

  19. Pagerank • Algorithm to determine the “importance” of a website. • Developed by Google co-founders Sergey Brin and Lawrence Page • Based on hypertext links

  20. Pagerank • Essentially, the Pagerank of a webpage A is calculated from the Pagerank of all webpages that link to A. • The more outgoing links a webpage has, the less it contributes to the Pagerank of each link target.

  21. Pagerank Algorithm • Google’s current implementation of Pagerank is secret and probably differs from the original. • For more information, see “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page and other resources.

  22. Pagerank Algorithm • We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85 ... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
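
The formula can be applied repeatedly until the values settle. The sketch below is a simple iterative reading of the published formula, not Google's implementation; the three-page link graph, the starting value of 1.0, and the fixed iteration count are assumptions, and each page is assumed to list a given target at most once.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this graph C ends up with the highest value because it receives links from both A and B, while B receives only half of A's contribution.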

  23. Google • Google uses a combination of Content and Pagerank to rank its search results. • Google is widely regarded as the best search engine. • Demo http://www.google.com/

  24. The Future • The internet will continue to expand. • Research to improve the effectiveness of search engines continues. • Google plans to implement searchable web-based email. • Google’s lunar facility was an April Fool’s joke.

  25. Conclusion • The internet contains billions of webpages. • Search engines allow people to use the internet more effectively. • Tasks performed by an internet search engine • Crawls the web • Indexes web pages • Responds to searches

  26. Conclusion • People can only view a small number of webpages. • The effectiveness of a search engine depends greatly on how it ranks results.

  27. Questions • Any questions or comments? • If you have questions later, ask Google.

  28. References • “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page, http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm • “Search Engine Research Papers” by James Thornton, http://jamesthornton.com/search-engine-research/ • “When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics” (2000) by Krishna Bharat and George A. Mihaila (Google), www.cs.toronto.edu/~georgem/BM01.html... • “Exploiting the Block Structure of the Web for Computing PageRank” (2003) by Sepandar Kamvar, Taher Haveliwala, Chris Manning, and Gene Golub (Stanford University), www.stanford.edu/~sdkamvar/papers/blockrank.pdf • “Writing a Web Crawler in the Java Programming Language” (January 1998) by Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC, http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/index.html • “How Internet Search Engines Work” by Curt Franklin, http://computer.howstuffworks.com/search-engine.htm • “Checklist for Search Robot Crawling and Indexing” (2003) by Avi Rappoport, Search Tools Consulting, http://www.searchtools.com/robots/robot-checklist.html • “WebBase: A repository of Web pages” (2000) by Jun Hirai, Sriram Raghavan, Andreas Paepcke, and Hector Garcia-Molina (Stanford University), dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1999-26&format=pdf&compressi…
