Web Search – Summer Term 2006
III. Web Search - Introduction (Cont.)
Jeff Dean, Google's Systems Lab: http://www.researchchannel.org/prog/displayevent.asp?rid=2459
(c) Wolfgang Hürst, Albert-Ludwigs-University
IR vs. Web Search
(Diagram labels: information, information need, data / documents, query)
The initial problem is similar to traditional IR, but the basic conditions and characteristics differ significantly:
- The number of users is huge. Very huge.
- The web is huge. Very huge.
- Big variety in users
- Big variety in data
- Users don't cooperate (short queries, ...)
- Document authors don't cooperate (spam, ...)
Classic IR vs. Web Search: Documents
Huge amount of data, continuous growth, high rate of change
Huge variability and heterogeneity
- Quality, credibility and reputation of the source
- Static vs. dynamic documents
- Different media types (text, pictures, audio, video)
- Different formats (HTML, Flash, PDF, ...)
- Miscellaneous topics
- Continuous text vs. note form / keywords
- Different languages and encodings
Spam and advertisements
Web-specific characteristics
- Hypertext, linking
- Broken links
- Unstructured, not always conforming to standards
Redundancy (syntactic and semantic)
Distributed (documents need to be collected automatically)
Different popularity and access frequency
Classic IR vs. Web Search: Users
Different needs and aims, e.g. users might want
- to learn something ("informational")
- to go to a particular site ("navigational")
- to do something, e.g. shopping, downloading, ... ("transactional")
- to do other, miscellaneous things, e.g. finding hubs, "exploratory search", ...
Different backgrounds, qualifications, languages, ...
Different network connections / bandwidths
Imprecise, unspecific queries: short, ambiguous, inexact, incorrect, no use of operators or special syntax

Classic IR vs. Web Search: Bottom line
Different characteristics that cause lots of problems
But there's also good news: we can take advantage of some of these characteristics (e.g. links, statistics, ...)
References
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 1 (Introduction, general architecture)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 1 (Introduction), Chapter 4.1 (Google Architecture Overview)
General Web Search Engine Architecture (cf. [1], Fig. 1)
[Figure: architecture diagram. Components: WWW, Crawler(s), Crawl Control, Page Repository, Indexer Module, Collection Analysis Module, Indexes (Utility, Structure, Text), Query Engine, Ranking, Client (Queries, Results), Usage Feedback]
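To make the crawler and page repository boxes of the diagram more concrete, here is a minimal sketch in Python. It is not the architecture from [1]: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`), and the names `crawl`, `fetch` and `max_pages` are assumptions made for this illustration. A simple `max_pages` limit stands in for crawl control, and links that cannot be fetched are recorded as broken links.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("welcome to a", ["a.example/about", "b.example/"]),
    "a.example/about": ("about a", ["a.example/", "a.example/missing"]),
    "b.example/": ("b links back", ["a.example/"]),
}

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; returns (page repository, broken links)."""
    repository = {}               # URL -> stored page text
    broken = []                   # URLs that could not be fetched
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # avoids fetching a URL twice
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)         # assumed to return None for a missing page
        if page is None:
            broken.append(url)
            continue
        text, links = page
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository, broken

repository, broken = crawl(["a.example/"], TOY_WEB.get)
```

In this toy run the repository ends up with the three reachable pages, and `a.example/missing` is reported as a broken link. A real crawler would additionally need politeness rules, URL normalization and re-crawl scheduling, none of which are modeled here.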
Recap: IR System & Tasks Involved
[Figure: IR system diagram. Labels: information need, user interface, query, query processing (parsing & term processing), logical view of the information need, documents, select data for indexing, parsing & term processing, index, searching, ranking, result representation, results (docs.), performance evaluation]
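The tasks named in the recap diagram (parsing and term processing, index, searching, ranking) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: term processing is plain lowercasing and splitting, queries are treated as a conjunction of all terms, and ranking is the sum of raw term frequencies. The function names `terms`, `build_index` and `search` are hypothetical.

```python
import re
from collections import Counter, defaultdict

def terms(text):
    """Parsing and term processing: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(terms(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, query):
    """Doc ids containing all query terms, ranked by summed term frequency."""
    query_terms = terms(query)
    if not query_terms:
        return []
    postings = [index.get(t, {}) for t in query_terms]
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    scores = {d: sum(p[d] for p in postings) for d in candidates}
    # Highest score first; ties are broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "Web search engines crawl the web.",
    "d2": "Classic IR systems index documents.",
    "d3": "A web crawler feeds the search index.",
}
index = build_index(docs)
```

With these three toy documents, `search(index, "web search")` ranks `d1` ahead of `d3` because "web" occurs twice in `d1`. The same query and document text go through the same `terms` function, which mirrors the two "parsing & term processing" boxes in the diagram.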
The Google Search Engine
Founded 1998 (1996) by two Stanford students
Originally an academic / research project that later became a commercial tool
Distinguishing features (then!?):
- Special (and better) ranking
- Speed
- Size
Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
[Figure: architecture diagram. Components: URL Server, Crawlers, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Barrels, Sorters, Lexicon, DumpLexicon, PageRank, Searcher]
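The diagram has a PageRank component that is fed by the extracted links. As a rough illustration of how a rank can be computed from link data alone, here is a minimal power-iteration sketch in Python. It is not the implementation described in [2]: it uses the normalized variant in which all ranks sum to 1, spreads the rank of pages without outgoing links evenly over all pages, and the names `pagerank`, `damping` and `iterations` as well as the toy graph are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a link graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Each page passes its rank on in equal shares to its link targets.
        for page, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical toy link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this toy graph, page `c` is linked to by three pages and ends up with the highest rank, while `d`, which has no incoming links, ends up with the lowest.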
Schedule
Web Search:
- Introduction
- Crawling
- Page Repository
- Indexing
- Ranking (PageRank, HITS)
- Exercises for web search basics
- Advanced / additional web search topics
In parallel:
- Programming project (Lucene)