
Lecture 16: Web Search Engines

Presentation Transcript


  1. Lecture 16: Web Search Engines Oct. 27, 2006 ChengXiang Zhai

  2. Characteristics of Web Information • “Infinite” size (surface vs. deep Web) • Surface = static HTML pages • Deep = dynamically generated HTML pages (database-backed) • Semi-structured • Structured = HTML tags, hyperlinks, etc. • Unstructured = text • Different formats (PDF, Word, PS, …) • Multi-media (textual, audio, images, …) • High variance in quality (much junk and spam) • “Universal” coverage (can be about any content)

  3. General Challenges in Web Search • Handling the size of the Web • How to ensure completeness of coverage? • Efficiency issues • Dealing with or tolerating errors and low-quality information • Addressing the dynamics of the Web • Some pages may disappear permanently • New pages are constantly created

  4. Tasks in Web Information Management • Information Access • Search (search engines, e.g., Google) • Navigation (browsers, e.g., IE) • Filtering (recommender systems, e.g., Amazon) • Information Organization • Categorization (Web directories, e.g., Yahoo!) • Clustering (organizing search results, e.g., Vivisimo) • Information Discovery (mining, e.g., ??)

  5. Typology of Web Search Engines [Diagram: a Web browser accesses three kinds of systems: directory services (e.g., Yahoo!) built on manually maintained directories, meta engines (e.g., Vivisimo) built on profiles of other engines, and autonomous search engines (e.g., Google) built on a Web index populated by a spider/robot.]

  6. Examples of Web Search Engines [Diagram: example systems grouped by type (directory services, meta engines, autonomous engines), all accessed through a Web browser.]

  7. Basic Search Engine Technologies [Architecture diagram: the user's browser sends a query to the retriever, which returns ranked results (with cached pages) by consulting an (inverted) index; the indexer builds that index from pages fetched by the crawler, which gathers host information and pages from the Web. Concerns annotated on the components: efficiency, precision, coverage, freshness, and error/spam handling.]

  8. Component I: Crawler/Spider/Robot • Building a “toy crawler” is easy • Start with a set of “seed pages” in a priority queue • Fetch pages from the Web • Parse fetched pages for hyperlinks; add them to the queue • Follow the hyperlinks in the queue • A real crawler is much more complicated… • Robustness (server failures, traps, etc.) • Crawling courtesy (server load balancing, robot exclusion, etc.) • Handling file types (images, PDF files, etc.) • URL extensions (CGI scripts, internal references, etc.) • Recognizing redundant pages (identical copies and near-duplicates) • Discovering “hidden” URLs (e.g., truncated ones) • Crawling strategy is a main research topic (i.e., which page should be visited next?)
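
To make the “toy crawler” recipe above concrete, here is a minimal sketch in Python. It assumes the third-party requests and beautifulsoup4 packages, and it deliberately ignores everything the slide says makes a real crawler hard (robot exclusion, politeness, traps, duplicate detection, special file types).

```python
# A minimal "toy crawler" following the recipe above: seed URLs in a priority
# queue, fetch, parse for hyperlinks, enqueue.  Assumes the third-party packages
# `requests` and `beautifulsoup4`; real crawlers add robots.txt handling,
# politeness delays, trap detection, duplicate elimination, and much more.
import heapq
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def toy_crawl(seed_urls, max_pages=50):
    """Crawl up to max_pages pages, breadth-first-ish, from the seed URLs."""
    frontier = [(0, url) for url in seed_urls]        # (priority, url); lower = sooner
    heapq.heapify(frontier)
    seen = set(seed_urls)
    pages = {}

    while frontier and len(pages) < max_pages:
        priority, url = heapq.heappop(frontier)
        try:
            resp = requests.get(url, timeout=5)       # fetch the page from the Web
        except requests.RequestException:
            continue                                  # robustness: skip failed fetches
        pages[url] = resp.text

        # Parse the fetched page for hyperlinks and add unseen ones to the queue.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))   # resolve; drop #fragment
            if link.startswith("http") and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority + 1, link))  # BFS-like ordering
    return pages
```

Because the frontier is a priority queue, the crawling strategies on the next slide (focused, incremental) amount to assigning different priorities to queued URLs.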

  9. Major Crawling Strategies • Breadth-first (most common; balances server load) • Parallel crawling • Focused crawling • Targets a subset of pages (e.g., all pages about “automobiles”) • Typically given a query • Incremental/repeated crawling • Can learn from past experience • Probabilistic models are possible The major challenge remains maintaining “freshness” and good coverage with minimal resource overhead

  10. Component II: Indexer • Standard IR techniques are the basis • Lexicons • Inverted index (with positions) • General challenges (at a much larger scale!) • Basic indexing decisions (stop words, stemming, numbers, special symbols) • Indexing efficiency (space and time) • Updating • Additional challenges • Recognizing spam/junk pages • How to index anchor text? • How to support “fast summary generation”?
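
As a concrete illustration of the indexer’s core data structure, the sketch below builds a positional inverted index over an in-memory collection. The tokenizer, stop-word list, and absence of stemming and postings compression are simplifying assumptions, not a description of any particular engine.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in"}   # illustrative stop-word list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """docs: {doc_id: text} -> positional inverted index {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index

# Usage: postings give document ids and in-document positions, which is what
# phrase and proximity matching need.
docs = {1: "web search engines crawl the web", 2: "the indexer builds an inverted index"}
index = build_index(docs)
print(dict(index["web"]))        # {1: [0, 4]}
print(dict(index["inverted"]))   # {2: [3]}
```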

  11. Google’s Solutions [Architecture sketch: a URL queue/list drives crawling; source pages are cached in compressed form; the inverted index exploits many features (e.g., font, layout, …) as well as the hypertext structure.]

  12. Component III: Retriever • Standard IR models apply but aren’t sufficient • Different information needs (home-page finding vs. topic-driven search) • Documents carry additional information (hyperlinks, markup, URL) • Information is often redundant and the quality varies a lot • Server-side feedback is often not feasible • Major extensions • Exploiting links (anchor text, link-based scoring) • Exploiting layout/markup (font, title field, etc.) • Spelling correction • Spam filtering • Redundancy elimination

  13. Effective Web Retrieval Heuristics • High accuracy in home-page finding can be achieved by • Matching the query against the title • Matching the query against the anchor text • Plus URL-based or link-based scoring (e.g., PageRank) • Imposing a conjunctive (“and”) interpretation of the query is often appropriate • Queries are generally very short (all words are necessary) • The size of the Web makes it likely that at least one page will match all the query words
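
A hedged sketch of how these heuristics could be combined: enforce the conjunctive interpretation, reward matches in the title and anchor text, and add a link-based score. The field names, field weights, and the pagerank value are illustrative assumptions, not the configuration of an actual system.

```python
def home_page_score(query_terms, page, w_title=2.0, w_anchor=1.5, w_body=1.0, w_link=1.0):
    """page: dict with 'title', 'anchor', 'body' token lists and a 'pagerank' float.
    Returns None if the page fails the conjunctive ("and") interpretation."""
    fields = {"title": w_title, "anchor": w_anchor, "body": w_body}
    all_tokens = set(page["title"]) | set(page["anchor"]) | set(page["body"])
    if not all(t in all_tokens for t in query_terms):
        return None                                   # drop pages missing any query word
    # Reward query words occurring in the title and anchor text more than in the body.
    score = sum(weight
                for t in query_terms
                for field, weight in fields.items()
                if t in page[field])
    return score + w_link * page["pagerank"]          # add link-based evidence

page = {"title": ["haas", "school", "of", "business"],
        "anchor": ["haas", "business", "school"],
        "body": ["welcome", "to", "the", "haas", "school", "of", "business"],
        "pagerank": 0.8}
print(home_page_score(["haas", "business", "school"], page))
```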

  14. Home/Entry Page Finding Evaluation Results (TREC 2001)
      Query example: “Haas Business School”
          MRR     %top10   %fail
          0.774   88.3      4.8
          0.772   87.6      4.8
          0.382   62.1     11.7
          0.340   60.7     15.9
      The top runs exploit anchor text, structure, or links; among them is unigram query likelihood with a link/URL prior, i.e., p(Q|D) p(D) [Kraaij et al. SIGIR 2002].
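
The p(Q|D) p(D) formulation cited above can be sketched as a smoothed unigram query likelihood multiplied by a document prior derived from the URL. The Jelinek-Mercer smoothing and the simple depth-based prior below are assumptions made for illustration; Kraaij et al. estimated their priors from training data.

```python
import math

def query_likelihood_with_prior(query_terms, doc_terms, collection_tf, collection_len,
                                url_depth, lam=0.5):
    """Score log[p(Q|D) * p(D)]: a Jelinek-Mercer-smoothed unigram language model
    plus a log prior.  The prior simply favors shallow URLs (likely entry pages);
    this constant form is an assumption, not the trained prior of Kraaij et al."""
    doc_len = len(doc_terms)
    log_p_q_given_d = 0.0
    for t in query_terms:
        p_ml = doc_terms.count(t) / doc_len if doc_len else 0.0          # p(t|D), MLE
        p_coll = collection_tf.get(t, 0) / collection_len                # p(t|collection)
        p = lam * p_ml + (1 - lam) * p_coll                              # smoothed p(t|D)
        log_p_q_given_d += math.log(p) if p > 0 else float("-inf")
    log_prior = -math.log(1 + url_depth)       # log p(D): shallower URL, larger prior
    return log_p_q_given_d + log_prior

# Toy usage: a short entry-page-like document and made-up collection statistics.
collection_tf = {"haas": 10, "business": 200, "school": 150}
doc = "haas school of business university of california berkeley".split()
print(query_likelihood_with_prior(["haas", "business", "school"], doc,
                                  collection_tf, collection_len=10000, url_depth=0))
```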

  15. Named Page Finding Evaluation Results (TREC 2002) Query example: “America’s century farms”. Top approaches include Okapi/BM25 + anchor text and a Dirichlet-prior language model + title and anchor text (Lemur) [Ogilvie & Callan SIGIR 2003], compared against the best content-only run.

  16. Many heuristics have been proposed for Web retrieval, but is there a model for Web retrieval? How do we extend regular text retrieval models to the Web? This is mostly an open question!

  17. “Free text” vs. “Structured text” • So far, we’ve assumed “free text” • Document = word sequence • Query = word sequence • Collection = a set of documents • Minimal structure … • But, we may have structures on text (e.g., title, hyperlinks) • Can we exploit the structures in retrieval? • Sometimes, structures may be more important than the text (moving closer to relational databases … )

  18. Examples of Document Structures • Intra-doc structures (=relations of components) • Natural components: title, author, abstract, sections, references, … • Annotations: named entities, subtopics, markups, … • Inter-doc structures (=relations between documents) • Topic hierarchy • Hyperlinks/citations (hypertext)

  19. Structured Text Collection General question: how do we search such a collection? [Diagram: a collection organized under a general topic with subtopic 1 … subtopic k.]

  20. Challenges in Structured Text Retrieval (STR) • Challenge 1 (Information/Doc Unit): What is an appropriate information unit? • Document may no longer be the most natural unit • Components in a document or a whole Web site may be more appropriate • Challenge 2 (Query): What is an appropriate query language? • Keyword (free text) query is no longer the only choice • Constraints on the structures can be posed (e.g., search in a particular field or matching a URL domain)
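
As a small illustration of Challenge 2, the sketch below evaluates a query that mixes free text with structural constraints (a field restriction and a URL-domain constraint). The dictionary-based query format and the document fields are invented for this example; real systems expose such constraints through fielded query languages or operators.

```python
from urllib.parse import urlparse

# Hypothetical structured query: free-text terms plus structural constraints.
query = {"terms": ["haas", "business"], "field": "title", "domain": "berkeley.edu"}

def matches(doc, query):
    """Return True if doc satisfies both the structural and the textual constraints."""
    # Structure-based constraint: the page must come from the given URL domain.
    if query.get("domain") and not urlparse(doc["url"]).netloc.endswith(query["domain"]):
        return False
    # Field-restricted keyword match (conjunctive), e.g. only in the title.
    text = doc[query.get("field", "body")].lower()
    return all(term in text for term in query["terms"])

docs = [
    {"url": "http://www.haas.berkeley.edu/", "title": "Haas School of Business", "body": "..."},
    {"url": "http://example.com/haas", "title": "Haas fan page", "body": "..."},
]
print([d["url"] for d in docs if matches(d, query)])
# -> ['http://www.haas.berkeley.edu/']
```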

  21. Challenges in STR (cont.) • Challenge 3 (Retrieval Model): What is an appropriate model for STR? • Ranking vs. selection • A Boolean model may be more powerful with structures (e.g., domain constraints in Web search) • But ranking may still be desirable • Multiple (potentially conflicting) preferences • Text-based ranking preferences (traditional IR) • Structure-based ranking preferences (e.g., PageRank) • Structure-based constraints (traditional DB) • Text-based constraints • How to combine them?

  22. Challenges in STR (cont.) • Challenge 4 (Learning): How to learn a user’s information need over time or from examples? • How to support relevance feedback? • What about pseudo feedback? • How to exploit structures in text mining (unsupervised learning)? • Challenge 5 (Efficiency): How to index and search a structured text collection efficiently? • Time and space • Extensibility

  23. These challenges are mostly still open questions…. We now look at some retrieval models that exploit intra-document and inter-document structures….

  24. Exploiting Intra-document Structures [Ogilvie & Callan 2003] Think of the query-likelihood model: a document D is divided into parts D1, D2, D3, …, Dk (title, abstract, body part 1, body part 2, …), and anchor text can be treated as another “part” of D. To generate a query word, first select a part Dj, then generate the word from Dj; the “part selection” probability serves as the weight for Dj and can be trained using EM. Intuitively, we want to combine all the parts but give more weight to some of them: p(w|D) = Σj λj p(w|Dj), where λj is the selection probability (weight) of part Dj.
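
A minimal sketch of this part-mixture query likelihood, assuming fixed part weights λj instead of EM-trained ones and unsmoothed within-part estimates:

```python
import math

def part_mixture_query_likelihood(query_terms, parts, weights):
    """log p(Q|D) with p(w|D) = sum_j lambda_j * p(w|D_j).
    `parts` maps a part name (title, abstract, body, anchor, ...) to its token list;
    `weights` maps part names to lambda_j (they should sum to 1).  Fixed weights and
    unsmoothed within-part estimates are assumptions; the slide notes the weights
    can be trained with EM."""
    score = 0.0
    for w in query_terms:
        p_w = sum(weights[name] * (tokens.count(w) / len(tokens) if tokens else 0.0)
                  for name, tokens in parts.items())
        score += math.log(p_w) if p_w > 0 else float("-inf")
    return score

# Toy usage: title, anchor text, and body treated as parts of one document.
parts = {"title":  "web search engines".split(),
         "anchor": "google web search".split(),
         "body":   "lecture on web search engine technology".split()}
weights = {"title": 0.4, "anchor": 0.3, "body": 0.3}
print(part_mixture_query_likelihood(["web", "search"], parts, weights))
```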

  25. Exploiting Inter-document Structures • Document collection has links (e.g., Web, citations of literature) • Query: text query • Results: ranked list of documents • Challenge: how to exploit links to improve ranking?

  26. Exploiting Inter-Document Links What does a link tell us? The anchor text attached to a link provides a description (“extra text”/summary) of the target document, and the link itself indicates the utility of that document, suggesting the notions of authority (a page many others point to) and hub (a page pointing to many others).

  27. PageRank: Capturing Page “Popularity” [Page & Brin 98] • Intuitions • Links are like citations in the literature • A page that is cited often can be expected to be more useful in general • PageRank is essentially “citation counting”, but improves over simple counting • Considers “indirect citations” (being cited by a highly cited paper counts a lot…) • Smoothing of citations (every page is assumed to have a non-zero citation count) • PageRank can also be interpreted as random surfing (thus capturing popularity)

  28. The PageRank Algorithm (Page et al. 98) Random surfing model: at any page, with probability α jump to a random page, and with probability (1−α) pick one of the page’s out-links at random and follow it. Let M be the “transition matrix” over the N pages (d1, d2, d3, d4 in the example graph), with Mij = 1/outdegree(di) if di links to dj, and let Iij = 1/N capture the uniform random jump. Start from the initial value p(d) = 1/N and iterate p(dj) ← Σi [α Iij + (1−α) Mij] p(di) until convergence; the result is the stationary (“stable”) distribution, so we can ignore time.
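
A small sketch of the iteration just described, written with dictionaries rather than an explicit matrix. The damping value 0.15 follows the next slide; treating a page with no out-links as jumping uniformly to all pages is an added assumption that anticipates the zero-outlink issue discussed there.

```python
def pagerank(links, alpha=0.15, tol=1e-8, max_iter=100):
    """links: {page: [pages it links to]}.  Returns the stationary distribution of
    the random-surfer model: p(dj) = alpha/N + (1 - alpha) * sum_i p(di) * Mij."""
    pages = list(links)
    n = len(pages)
    p = {d: 1.0 / n for d in pages}                  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new_p = {d: alpha / n for d in pages}        # random-jump component
        for d, out in links.items():
            if out:                                  # follow a random out-link
                share = (1 - alpha) * p[d] / len(out)
                for target in out:
                    new_p[target] += share
            else:                                    # page with no out-links: spread its
                share = (1 - alpha) * p[d] / n       # mass uniformly (an added assumption)
                for target in pages:
                    new_p[target] += share
        if sum(abs(new_p[d] - p[d]) for d in pages) < tol:
            p = new_p
            break
        p = new_p
    return p

# Toy graph in the d1..d4 style of the slide (made-up links).
links = {"d1": ["d2", "d3"], "d2": ["d4"], "d3": ["d1", "d4"], "d4": ["d1"]}
print(pagerank(links))
```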

  29. PageRank in Practice • Interpretation of the damping factor α (0.15): • Probability of a random jump • Smoothing of the transition matrix (avoids zeros) • Normalization doesn’t affect ranking, leading to some variants • The zero-outlink problem: the p(di)’s don’t sum to 1 • One possible solution = a page-specific damping factor (= 1.0 for a page with no outlink)

  30. HITS: Capturing Authorities & Hubs[Kleinberg 98] • Intuitions • Pages that are widely cited are good authorities • Pages that cite many other pages are good hubs • The key idea of HITS • Good authorities are cited by good hubs • Good hubs point to good authorities • Iterative reinforcement…

  31. The HITS Algorithm [Kleinberg 98] Let A be the “adjacency matrix” of the page graph (d1, d2, d3, d4 in the example). Initialize the authority and hub scores to a = h = 1, then iterate a = A^T h and h = A a, normalizing a and h after each iteration.
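
A corresponding sketch of the HITS iteration, again with dictionaries instead of explicit matrices; the example graph, the fixed number of iterations, and the L2 normalization are illustrative choices.

```python
import math

def hits(links, iterations=20):
    """links: {page: [pages it links to]}.  Iterates a = A^T h, h = A a with
    normalization, starting from a = h = 1."""
    pages = list(links)
    auth = {d: 1.0 for d in pages}
    hub = {d: 1.0 for d in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it (a = A^T h).
        auth = {d: sum(hub[src] for src in pages if d in links[src]) for d in pages}
        # Hub score: sum of authority scores of the pages it links to (h = A a).
        hub = {d: sum(auth[dst] for dst in links[d]) for d in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {d: v / a_norm for d, v in auth.items()}
        hub = {d: v / h_norm for d, v in hub.items()}
    return auth, hub

# Toy graph (made-up links): d4 is cited by everyone, so it should get high authority.
links = {"d1": ["d2", "d4"], "d2": ["d4"], "d3": ["d4"], "d4": []}
auth, hub = hits(links)
print(auth)
print(hub)
```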

  32. The problem of how to combine rank-based scores with content-based scores is still unsolved.

  33. Next Generation Search Engines • More specialized/customized • Special group of users (community engines, e.g., Citeseer) • Personalized • Special genre/domain • Learning over time (evolving) • Integration of search, navigation, and recommendation/filtering (full-fledged information management)

  34. What Should You Know • Special characteristics of Web information • Major challenges in Web information management • Basic search engine technology (crawling, indexing, ranking) • How PageRank and HITS work
