
CS 430: Information Discovery

This lecture discusses the challenges and strategies of web search engines, economic models, components of a web search service, and considerations for effective information retrieval. It also includes a case study on Google's architecture and performance.

pcleveland


Presentation Transcript


  1. CS 430: Information Discovery Lecture 18 Web Search Engines: Google

  2. Course Administration

  3. Web Search
     Goal
     Provide information discovery for large amounts of open access material on the web
     Challenges
     • Volume of material -- several billion items, growing steadily
     • Items created dynamically or in databases
     • Great variety -- length, formats, quality control, purpose, etc.
     • Inexperience of users -- range of needs
     • Economic models to pay for the service

  4. Strategies
     Subject hierarchies
     • Yahoo! -- use of human indexing
     Web crawling + automatic indexing
     • General -- Google, AltaVista, Ask Jeeves, NorthernLight, ...
     • Subject based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com, ...
     Mixed models
     • Human directed web crawling and automatic indexing -- BBC News

  5. Components of Web Search Service
     Components
     • Web crawler
     • Indexing system
     • Search system
     Considerations
     • Economics
     • Scalability

  6. Economic Models
     Subscription
     Monthly fee with logon provides unlimited access (introduced by InfoSeek)
     Advertising
     Access is free, with display advertisements (introduced by Lycos)
     Can lead to distortion of results to suit advertisers
     Licensing
     Costs of the company are covered by fees, licensing of software, and specialized services

  7. Cost Example (Google)
     85 people
     50% technical, 14 Ph.D. in Computer Science
     Equipment
     2,500 Linux machines
     80 terabytes of spinning disks
     30 new machines installed daily
     Reported by Larry Page, Google, March 2000
     At that time, Google was handling 5.5 million searches per day
     Increase rate was 20% per month
     By fall 2002, Google had grown to over 400 people.
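A short calculation shows what a 20% monthly increase implies. This is only a sketch: it assumes the rate stays constant, which the slide does not claim, and the function names are illustrative.

```python
import math

def project_daily_searches(start: float, monthly_growth: float, months: int) -> float:
    """Compound `start` by `monthly_growth` (0.20 means 20%) for `months` months."""
    return start * (1.0 + monthly_growth) ** months

def doubling_time_months(monthly_growth: float) -> float:
    """Months needed for the load to double at a constant monthly growth rate."""
    return math.log(2.0) / math.log(1.0 + monthly_growth)

START = 5.5e6   # searches per day, March 2000 (from the slide)
GROWTH = 0.20   # 20% per month (from the slide); constancy is an assumption

print(f"doubling time: {doubling_time_months(GROWTH):.1f} months")
print(f"after 12 months: {project_daily_searches(START, GROWTH, 12) / 1e6:.1f} million/day")
```

Under that assumption the load doubles roughly every 3.8 months, which helps explain why 30 new machines were being installed daily.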

  8. Indexing Goals: Precision
     Short queries applied to very large numbers of items lead to large numbers of hits.
     Usability requires:
     • Ranking hits in an order that fits the user's requirements
     • Effective presentation
       helpful summary records
       removal of duplicates
       grouping results from a single site
     Completeness of the index is not the most important factor.
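Two of the presentation steps above, removal of duplicates and grouping results from a single site, can be sketched in a few lines. This is a minimal illustration, not how any search service does it: duplicates are detected by exact URL match only, and `group_by_site` and `per_site` are hypothetical names.

```python
from urllib.parse import urlsplit

def group_by_site(ranked_urls, per_site=2):
    """Drop exact duplicate URLs, then keep at most `per_site` hits per host.

    Input order is assumed to be rank order; it is preserved within each group.
    """
    seen = set()
    groups = {}  # host -> list of URLs, in first-seen (rank) order
    for url in ranked_urls:
        if url in seen:
            continue  # exact duplicate of an earlier hit
        seen.add(url)
        host = urlsplit(url).netloc.lower()
        hits = groups.setdefault(host, [])
        if len(hits) < per_site:
            hits.append(url)
    return groups

results = [
    "http://www.law.cornell.edu/topics/sports.html",
    "http://www.law.cornell.edu/topics/sports.html",
    "http://www.law.cornell.edu/a.html",
    "http://www.law.cornell.edu/b.html",
    "http://example.org/sports",
]
for host, urls in group_by_site(results).items():
    print(host, urls)
```

Real services also need near-duplicate detection (mirrors, session IDs in URLs), which this sketch does not attempt.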

  9. Effective Information Retrieval
     Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).
     Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
     Full text indexing with ranked retrieval (e.g., news articles).
     Excellent for relatively homogeneous material, but requires available full text.
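Full text indexing with ranked retrieval can be sketched with a small vector space model. The version below uses tf-idf term weights and divides by document vector length so that long documents are not favored. It is one common textbook formulation, chosen here as an assumption; the whitespace tokenizer and the function names are illustrative only.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def rank(query, docs):
    """Return (score, doc_index) pairs, best first, using tf-idf cosine scoring."""
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    def weights(toks):
        tf = Counter(toks)
        return {t: c * math.log(n / df[t]) for t, c in tf.items() if t in df}

    q = weights(tokenize(query))
    scores = []
    for i, toks in enumerate(tokenized):
        d = weights(toks)
        norm = math.sqrt(sum(w * w for w in d.values()))  # document length correction
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        scores.append((dot / norm if norm else 0.0, i))
    return sorted(scores, key=lambda s: (-s[0], s[1]))

docs = [
    "cornell sports news",
    "sports law overview and sports law cases",
    "cooking recipes",
]
for score, i in rank("cornell sports", docs):
    print(f"{score:.3f}  {docs[i]}")
```

The query vector's own length is omitted because it is the same for every document and does not change the ordering.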

  10. Effective Information Retrieval (cont)
     Full text indexing with contextual information and ranked retrieval (e.g., Google).
     Excellent for mixed textual information with rich structure.
     Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval).
     Promising, but still experimental.

  11. Google: Ranking
     1. Paid advertisers
     2. Manually created classification
     3. Vector space ranking with corrections for document length
     4. Extra weighting for specific fields, e.g., title, anchors, etc.
     5. PageRank
     The balance between 3, 4, and 5 is not made public.
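The PageRank component (item 5) can be illustrated with the standard power-iteration formulation. This is a textbook sketch, not Google's production ranking: the damping value of 0.85 is the commonly cited one, the link graph is made up, and, as the slide notes, how PageRank is combined with the other signals is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to.

    Rank held by pages with no outgoing links is spread evenly over all pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Every page gets the teleport share plus an equal share of dangling rank.
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

# Hypothetical four-page web: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph "c" ranks highest because every other page links to it, and "a" comes second because it receives all of the rank that "c" passes on.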

  12. Usability: Display of Results

  13. Usability: Dynamic Abstracts
     Query: Cornell sports
     LII: Law about...Sports
     ...sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ...
     www.law.cornell.edu/topics/sports.html
     Query: NCAA Tarkanian
     LII: Law about...Sports
     ... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ...
     www.law.cornell.edu/topics/sports.html
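The two examples show the same page summarized differently depending on the query. A minimal sketch of that idea is to extract a window of words around each query-term match and join the windows with ellipses. The function name, the window width, and the punctuation handling are assumptions for illustration; the slide does not describe how the abstracts are actually produced.

```python
def dynamic_abstract(text, query, width=5):
    """Build a query-biased summary from word windows around each query-term hit."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    spans = []  # (start, end) word-index ranges, merged when they touch or overlap
    for i, word in enumerate(words):
        if word.lower().strip(".,;:()") in terms:
            start, end = max(0, i - width), min(len(words), i + width + 1)
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return " ... ".join(" ".join(words[s:e]) for s, e in spans)

page = ("Sports law encompasses many areas of law. See NCAA v. Tarkanian "
        "for state action status in amateur athletics.")
print(dynamic_abstract(page, "sports", width=2))
print(dynamic_abstract(page, "NCAA Tarkanian", width=2))
```

A page with no query-term matches yields an empty abstract here; a real service would presumably fall back to a static summary.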

  14. Limitations of Web Crawling
     • Time delay. Typically a monthly cycle. Crawlers are ineffective with sites that change rapidly, e.g., news.
     • Pages not linked to. Crawlers find only those pages that are linked by paths from their seeds.
     • Depth of crawl. Crawlers do not index every page on a site (algorithms to avoid crawler traps).
     but ...
     Creators of information are increasingly organizing it to be accessible to the web search services (e.g., Springer-Verlag)
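The second and third limitations follow directly from how a crawler traverses links. The sketch below runs a breadth-first crawl over an in-memory link graph rather than the live web; the graph, the page names, and the `max_depth` cutoff are hypothetical, and a fixed depth limit is only one simple way to avoid crawler traps.

```python
from collections import deque

def crawl(link_graph, seeds, max_depth):
    """Breadth-first traversal from `seeds`, following links at most `max_depth` hops."""
    seeds = list(dict.fromkeys(seeds))  # drop repeated seeds, keep order
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # do not follow links any deeper
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

graph = {
    "home": ["a", "b"],
    "a": ["deep"],
    "deep": ["deeper"],
    "orphan": ["home"],  # links out, but nothing links to it
}
print(crawl(graph, ["home"], max_depth=2))
```

Here "orphan" is never found because no path from the seed reaches it, and "deeper" is skipped because it lies beyond the depth limit.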

  15. Scalability
     [Figure: the growth of the web, 1994 to 2000; logarithmic vertical axis from 1 to 10,000,000,000]

  16. Scalability
     Web search services are centralized systems
     • Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra functionality.
     • Will this continue?
     • Possible areas for concern are telecommunications costs and disk access rates.

  17. Case Study: Google
     • Python with C/C++
     • Linux
     • Module-based architecture
     • Multi-machine
     • Multi-thread

  18. Performance
     • Storage
       • Scale with the size of the Web
       • Repository is comparatively small
       • Good/Fast compression/decompression
     • System
       • Crawling, Indexing, Sorting
       • Last two simultaneously
     • Searching
       • Bounded by disk I/O
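The storage point about compression can be illustrated with Python's standard-library zlib module. This is a stand-in chosen for the example, the sample page is made up, and the ratio it prints says nothing about real web content, which is far less repetitive.

```python
import zlib

def compress_page(html: str, level: int = 6) -> bytes:
    """Compress one page for storage in a repository."""
    return zlib.compress(html.encode("utf-8"), level)

def decompress_page(blob: bytes) -> str:
    """Recover the original page text from its stored form."""
    return zlib.decompress(blob).decode("utf-8")

# Hypothetical page with the kind of repeated markup that compresses well.
page = "<html><body>" + "<p>Sports law: an overview.</p>" * 200 + "</body></html>"
stored = compress_page(page)

assert decompress_page(stored) == page  # lossless round trip
print(f"{len(page.encode('utf-8'))} bytes -> {len(stored)} bytes")
```

The `level` parameter trades compression ratio against speed, which is the same trade-off the slide alludes to with "Good/Fast".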

  19. Image Search: indexing by contextual information only

  20. Google API

  21. Selective searching

  22. Google News

  23. Conclusion
     • Google:
       • Scalable search engine
       • Complete architecture
     • Many research ideas arise
       • Always something to improve
     • High quality search is the dominant factor
       • precision
       • presentation of results
