
Search Engines

Search Engines. INLS 200-001, Week 4, Session 7. Instructor: Sanghee Oh. Logistics: the 2nd draft of the research prospectus is due next Monday. Topics: how search engines work and how Google works. Activity: Google Advanced Search.



Presentation Transcript


  1. Search Engines INLS 200-001, Week 4, Session 7 Instructor: Sanghee Oh

  2. Today • Logistics • 2nd draft of the research prospectus: due next Monday • Topics • How do search engines work? • How does Google work? • Activities • Google Advanced Search

  3. How do search engines work? • Web Crawlers / Spiders • Databases & Indexes (Inverted Index) • Search Results Ranking

  4. How Search Engines Work • Three main parts: • Gather the contents of all web pages (using a program called a crawler or spider) • Organize the contents of pages in a way that allows efficient retrieval (indexing) • Take in a query, determine which pages match, and show the results (ranking and display of results)

  5. Standard Web Search Engine Architecture [Diagram: 1. Crawler machines crawl the web. 2. Databases: check for duplicates, store the documents, create an inverted index. Inverted index -> search engine servers.]

  6. Standard Web Search Engine Architecture [Diagram: 1. Crawler machines crawl the web. 2. Databases: check for duplicates, store the documents, create an inverted index. Inverted index -> search engine servers. 3. Results ranking: user query -> search engine servers -> show results to user.]

  7. 1. Web Crawlers / Spiders • Crawlers gather pages/sites • Programs that move from site to site on the web and gather information about the pages found • Start with a list of domain names (homepages), and follow the hyperlinks on the homepages. • Keep a list of URLs visited and those still to be visited. • At each site, the crawler may be focused on breadth or depth • Breadth – gather top pages and move on to another site • Allows it to find more sites • Depth – gather all pages at the site • Allows it to index more pages in each site • How frequently a site gets crawled varies • From engine to engine • From site to site
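
The "visited" and "still to be visited" lists above can be sketched in a few lines. This is a minimal illustration, not how any real crawler is built: the `TOY_WEB` dictionary and its `*.example` URLs are hypothetical stand-ins for the web, and the breadth/depth choice is simplified to page-level traversal order rather than the per-site policy described on the slide.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page URL -> URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/news", "c.example/"],
    "b.example/news": [],
    "c.example/": ["a.example/"],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, tracking visited and still-to-visit URLs."""
    to_visit = deque(seeds)   # URLs still to be visited (the "frontier")
    seen = set(seeds)         # every URL ever queued, to avoid revisiting
    visited = []              # URLs in the order they were crawled
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()          # FIFO order gives breadth-first
        visited.append(url)
        for link in fetch_links(url):     # follow the hyperlinks on the page
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

# A real crawler would download and parse pages; here we look links up in the toy graph.
order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Replacing `popleft()` with `pop()` would turn the queue into a stack and make the traversal depth-first.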

  8. Web crawlers do collect… • Mostly HTML pages • PDF • Word • PPT, etc.

  9. Web crawlers do not collect… • Documents they are told not to collect. • Documents behind dead links • Documents in the deep Web (the Invisible Web)
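
One common way a site tells crawlers what not to collect is a robots.txt file (the slide does not name the mechanism, so this is an assumption about what it refers to). The sketch below uses Python's standard `urllib.robotparser`; the rules, the `ExampleBot` agent name, and the `site.example` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all crawlers are asked to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text instead of fetching them

def allowed(url, agent="ExampleBot"):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("https://site.example/index.html"))
print(allowed("https://site.example/private/report.html"))
```

A polite crawler would run a check like `allowed(url)` before adding a URL to its list of pages to visit.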

  10. Standard Web Search Engine Architecture [Diagram: 1. Crawler machines crawl the web. 2. Databases: check for duplicates, store the documents, create an inverted index. Inverted index -> search engine servers. 3. Results ranking: user query -> search engine servers -> show results to user.]

  11. 2. Databases & Indexing • Databases • Input from crawlers, from submissions by authors, from related directories • Cached pages • Descriptions of pages (indexes) • The size of the database is an important issue • Even the largest does not cover the entire Web

  12. 2. Databases & Indexing • Indexing • Each page that is included in the database is indexed (automatically) • “All” the words on the page (for full-text search) • Stop words • Metatags: title, others • URL • Hypertext anchors and links • Spamming • Load words into metatags • Load invisible words (e.g., white text on white background)

  13. Inverted Index • How to store the words for fast lookup • Basic steps: • Make a “dictionary” of all the words in all of the web pages • For each word, list all the documents it occurs in. • Often omit very common words • “stop words” • Sometimes stem the words • (also called morphological analysis) • cats -> cat • running -> run • In reality, this index is huge. • Need to store the content across many machines • Need to do optimization tricks to make lookup fast
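
The basic steps above can be sketched as follows. This is a toy, single-machine version: the stop-word list is a small illustrative sample, the `stem` function is a crude suffix stripper written only to reproduce the two examples on the slide (real engines use proper stemming algorithms), and the three documents are invented.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list; real lists are longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "is", "are"}

def stem(word):
    """Crude suffix stripping, enough for cats -> cat and running -> run."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]              # "runn" -> "run"
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

def build_inverted_index(docs):
    """Map each stemmed, non-stop word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[stem(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The cats are running in the park",
    2: "A cat is on the run",
    3: "Running shoes",
}
index = build_inverted_index(docs)
print(index["cat"], index["run"])
```

Looking up a word is then a single dictionary access, which is why the structure is fast at query time.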

  14. Inverted Index: Example

  15. Standard Web Search Engine Architecture [Diagram: 1. Crawler machines crawl the web. 2. Databases: check for duplicates, store the documents, create an inverted index. Inverted index -> search engine servers. 3. Results ranking: user query -> search engine servers -> show results to user.]

  16. 3. Results ranking • Search engine receives a query, then • Looks up the words in the index, retrieves many documents, then • Rank orders the pages and extracts “snippets” or summaries containing query words. • Most web search engines assume the user wants all of the words (Boolean AND, not OR). • These are complex and highly guarded algorithms unique to each search engine.

  17. Some ranking criteria • For a given candidate result page, use: • Number of matching query words in the page • Frequency of terms on the page and in general • Proximity of matching words to one another • Location of terms within the page • Location of terms within tags e.g. <title>, <h1>, link text, body text • Anchor text on pages pointing to this one • Link analysis of which pages point to this one • (Sometimes) Click-through analysis: how often the page is clicked on • How “fresh” is the page
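
A few of these criteria are easy to compute directly. The sketch below covers only three of them (number of matching query words, term frequency on the page, and proximity), using whitespace tokenization and two invented page texts; it is an illustration of the ideas, not any engine's actual feature set.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()

def ranking_features(query, page_text):
    """Compute three simple ranking signals for one candidate page."""
    q_terms = list(dict.fromkeys(tokenize(query)))   # unique terms, order kept
    tokens = tokenize(page_text)
    positions = {t: [i for i, tok in enumerate(tokens) if tok == t] for t in q_terms}
    matched = [t for t in q_terms if positions[t]]
    frequency = sum(len(positions[t]) for t in matched)
    # Proximity: smallest gap between occurrences of the first two matched terms.
    proximity = None
    if len(matched) >= 2:
        proximity = min(abs(i - j)
                        for i in positions[matched[0]]
                        for j in positions[matched[1]])
    return {"matching_words": len(matched), "frequency": frequency, "proximity": proximity}

doc1 = "sleep paralysis is a state between sleep and waking"
doc2 = "lack of sleep is common and paralysis is rare"
f1 = ranking_features("sleep paralysis", doc1)
f2 = ranking_features("sleep paralysis", doc2)
print(f1)
print(f2)
```

Both pages contain both query words, but the first has them adjacent, which a proximity-aware ranker would favor.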

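  18. Number of matching query words in the page [Figure: example query “US Education System”, comparing Doc 1 and Doc 2]

  19. Frequency of terms on the page and in general [Figure: example query “Online Shopping”, comparing Doc 2 and Doc 1]

  20. Proximity of matching words to one another [Figure: example query “Sleep Paralysis”, comparing Doc 1 and Doc 2]

  21. Location of terms within the page [Figure: example query “Transportation Fuel”, comparing Doc 1 and Doc 2]

  22. Anchor text on pages pointing to this one [Figure: example query “tobacco advertising”, comparing Doc 2 and Doc 1]

  23. Link analysis of which pages point to this one [Figure: example query “US Prison Population”, comparing Doc 1 and Doc 2]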

  24. Search Engine Rankings? • Complex formulae combine different ranking criteria together.
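
As a rough illustration of "combining criteria", the simplest possible formula is a weighted sum. Real formulae are far more complex and are not public; the feature names, weights, and per-document values below are all hypothetical.

```python
# Hypothetical weights: how much each criterion contributes to the final score.
WEIGHTS = {"matching_words": 3.0, "frequency": 1.0, "in_title": 2.0, "inbound_links": 0.5}

def score(features, weights=WEIGHTS):
    """Weighted sum of feature values; missing features count as zero."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

# Invented feature values for two candidate pages.
candidates = {
    "doc1": {"matching_words": 2, "frequency": 3, "in_title": 1, "inbound_links": 4},
    "doc2": {"matching_words": 2, "frequency": 5, "in_title": 0, "inbound_links": 1},
}

ranked = sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)
print(ranked, [score(candidates[d]) for d in ranked])
```

Changing the weights changes the order, which is one reason two engines can rank the same pages differently.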

  25. Does Web Searching = Google? How Google Does What It Does

  26. Collecting Results • Googlebot “spider” crawls the billions of pages on the web. • The Google spider asks web servers to send web pages and scans the pages for links, which connect Google to other pages. Each page is assigned a number. • Google builds an index of these numbers.

  27. Presenting Results When someone puts a query into Google… • Google uses the index to find the pages that include the words in the query. • Google ranks the pages in order of relevance.

  28. An Example • Query: civil war • “civil” is in documents: 3, 8, 22, 56, 68, 92 • “war” is in documents: 2, 8, 15, 22, 68, 77 • Which documents have both words? • Each word’s list of documents is called a “posting list”; the result is the intersection of the two lists.
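
The question on this slide amounts to intersecting the two document lists. A minimal sketch, assuming both lists are kept sorted by document number (the function name `intersect` is ours):

```python
def intersect(a, b):
    """Walk two sorted lists in step and keep the document numbers found in both."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1      # advance whichever list is behind
        else:
            j += 1
    return result

civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]
both = intersect(civil, war)
print(both)  # [8, 22, 68]
```

Because each list is scanned once, the work grows with the combined length of the lists rather than their product.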

  29. Ranking Results • Google’s PageRank evaluates: • How many links there are to a web page from other pages. • The quality of the linking sites. • “A link from Page A to Page B is like a vote from Page A to Page B.”
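
The "links as votes" idea can be sketched with a simplified iterative PageRank. This is a textbook-style approximation, not Google's production algorithm: the damping value of 0.85 is the conventional choice, and the four-page link graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with equal rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical graph: B is linked from A, C, and D; A is linked from B and D.
toy_links = {"A": ["B"], "B": ["A"], "C": ["B"], "D": ["A", "B"]}
ranks = pagerank(toy_links)
for page, value in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(value, 3))
```

A vote from a highly ranked page carries more weight than one from a page nobody links to, which is how "quality of the linking sites" enters the calculation.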

  30. Google PageRank [Figure: link graph of Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 7, and Doc 8]

  31. PageRank Value • Values range from 0 to 10 • The PageRank value of a website can be checked in the Google Toolbar

  32. Ranking Results: What else matters? • Proximity of words. • Preference given to pages with your words: • In the order you typed • Close together • In phrases • Word location (like in titles or headings). • Frequency of words. • …And about 97 other factors.

  33. Google Search Tips

  34. Boolean Terms (AND, OR, NOT) • crime AND music • alcoholism OR binge drinking • china NOT dishware
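
On top of an inverted index, the three Boolean operators map onto set operations: AND is intersection, OR is union, and NOT is difference. A minimal sketch with an invented index (the document numbers are made up, and the function names are ours):

```python
# Hypothetical inverted index: word -> set of document numbers containing it.
index = {
    "crime": {1, 2, 5},
    "music": {2, 3, 5},
    "china": {4, 6, 7},
    "dishware": {6},
    "california": {8},
    "oregon": {9, 10},
}

def docs_for(word):
    """Documents containing the word (empty set if the word is not indexed)."""
    return index.get(word, set())

def boolean_and(a, b):
    return docs_for(a) & docs_for(b)   # documents containing both words

def boolean_or(a, b):
    return docs_for(a) | docs_for(b)   # documents containing either word

def boolean_not(a, b):
    return docs_for(a) - docs_for(b)   # documents with a but without b

print(sorted(boolean_and("crime", "music")))
print(sorted(boolean_or("california", "oregon")))
print(sorted(boolean_not("china", "dishware")))
```

AND narrows a result set, OR widens it, and NOT removes an unwanted sense of a word.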

  35. Google Search Tip 1 • AND is the default connector • All your terms should be somewhere: • In the text of result pages • In pages that link to a result page • In other pages on the same site as the result page • You need to force AND using “+” (plus sign) • OR must be capitalized: california OR oregon • Use - (minus sign) for NOT: china -porcelain

  36. Google Search Tip 2 • Phrase searching • Phrase searches use “ ” for the specific phrase rather than separate words: “moon landing”
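
The difference between a phrase search and a search for separate words can be shown with a small check. This sketch compares word sequences in plain text; it is only an illustration (real engines typically store word positions in the index instead), and the two sample sentences are invented.

```python
def contains_phrase(text, phrase):
    """True if the words of the phrase appear consecutively and in order."""
    tokens = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return n > 0 and any(tokens[i:i + n] == target
                         for i in range(len(tokens) - n + 1))

def contains_all_words(text, phrase):
    """True if every word of the phrase appears somewhere (plain AND search)."""
    tokens = set(text.lower().split())
    return all(word in tokens for word in phrase.lower().split())

page_a = "footage of the moon landing"
page_b = "landing a probe on the moon"

print(contains_phrase(page_a, "moon landing"), contains_phrase(page_b, "moon landing"))
print(contains_all_words(page_a, "moon landing"), contains_all_words(page_b, "moon landing"))
```

Both pages match the two separate words, but only the first matches the quoted phrase.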

  37. Google Search Tip 3 • Field searching • intitle: Words must occur in the official title of the page. • Try: intitle:mileage “hybrid cars” or allintitle:mileage “hybrid cars”

  38. Google Search Tip 4 • inurl: Words must occur in the URL. • Try: inurl:ils library science or allinurl:ils library science • filetype: Restricts results to a file type. • Try: filetype:ppt china one child policy

  39. Google Search Tip 5 • Synonym searches • ~food -> recipes, nutrition, cooking • ~help -> guide, tutorial, FAQ, manual • Similar pages searches • Try: related:www.consumerwebwatch.org

  40. Google Search Tip 6 • Find definitions of your words • Try: define:internet • Calculator functions… • Try: 2+2 • Try: What is the mass of an electron?

  41. Google Advanced Search • Find results • all these words -> AND • this exact wording or phrase -> “ ” • with one or more of the words -> OR • Unwanted words -> - (minus sign) • Domain search • Page-Specific Search • Google Advanced Search Exercise
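
The form fields above correspond to operators that can also be typed directly into the search box. The sketch below assembles such a query string; the function and parameter names are hypothetical, the `site:` operator for domain search and the `https://www.google.com/search?q=` URL form are assumptions not stated on the slide, and the example terms are illustrative.

```python
from urllib.parse import urlencode

def compose_query(all_words=(), exact_phrase="", any_words=(), unwanted=(), site=""):
    """Build a query string from Advanced Search style fields."""
    parts = list(all_words)                      # AND is the default connector
    if exact_phrase:
        parts.append(f'"{exact_phrase}"')        # quotes for an exact phrase
    if any_words:
        parts.append(" OR ".join(any_words))     # OR must be capitalized
    parts.extend(f"-{word}" for word in unwanted)  # minus sign for NOT
    if site:
        parts.append(f"site:{site}")             # assumed operator for domain search
    return " ".join(parts)

query = compose_query(
    all_words=["china"],
    exact_phrase="one child policy",
    any_words=["history", "law"],
    unwanted=["porcelain"],
    site="edu",
)
search_url = "https://www.google.com/search?" + urlencode({"q": query})
print(query)
print(search_url)
```

Note that the minus sign must be a plain hyphen with no space before the unwanted word.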

  42. Other Search Engines • Ask.com • Kartoo.com • Others people use?

  43. Next • Due: 2nd draft of research prospectus • Email me by 5:00 pm • Readings for next class • Sullivan, D. (2005). Web Searching Tips. Search Engine Watch. • Search Engine Math, Power Searching for Anyone, and Search Assistance Features. For a summary, Search features chart. Recommended • Ackermann, E., & Hartman, K. (2000). The Information Specialist's Guide to Searching and Researching on the Internet and the World Wide Web. Chicago: Fitzroy Dearborn. [Available from Blackboard] • Chapter 6, Search strategies for search engines, 144-153 only
