
Collaborative Search

This article surveys collaborative search and intelligent crawlers, contrasting them with traditional information retrieval systems. It discusses the challenges of representing information needs, indexing documents, and formulating queries; analyzes ranking and the factors that influence it; and examines the indexing of web pages and the role of crawlers in discovering and downloading documents from the web.



Presentation Transcript


  1. Collaborative Search Zheng Zhen

  2. Traditional IR • Web search • Crawlers: parallel crawler, intelligent crawler • Collaborative Search • References

  3. Traditional IR System (flow diagram) • Acquisition of documents/objects → Representation (indexing, ...) → Database of indexed documents • User's problem (information need) → Representation (question) → Query formulation (search) • Matching (searching) → Retrieved objects → Feedback to the user

  4. Classic Information Retrieval • Homogeneous documents • Well categorized • 'Small', well-controlled collection • Closed, static environment • Controlled collection growth

  5. Web Search • Web: open, dynamic environment; a vast, uncontrolled collection of PAGES • Web page: heterogeneous (various formats, languages, ...); content may change over time! • Importance of LINKS • Existing search facilities: • Generic: Yahoo, AskJeeves, Google, etc. • Specialized: Pluribus, Collaborative Spider

  6. Common operations • Indexing - identifies potential index terms in documents • Query processing - form keywords • Search - access indexed file • Ranking

  7. Ranking • Ranking is important • Factors which influence rank: • term location and frequency • proximity to query terms • date of publication • length • popularity • Heuristics: proper nouns may get higher weights • WWW: link analysis and popularity (e.g. Google)

  8. The Web: indexing • Web pages are heterogeneous documents • They contain both text information and meta information • External meta information can be inferred • Pages must be processed before their relevance can be established

  9. Indexing WWW documents • Web pages require preprocessing to get a uniform data structure: - Normalizes the document stream to a predefined format - Breaks the document stream into desired retrievable units - Isolates and metatags subdocument pieces (Diagram: web pages 1..n → preprocessing → uniform format)

  10. Computing weights • Assign a weight to each descriptor for the document and add it to the index • Weights are based on: • term frequency within the document (tf) • global term frequency within the corpus • The global statistic is a problem when parallel, independent agents do the indexing, since no single agent sees the whole corpus
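The weighting the slide describes can be sketched in Python. This assumes the common tf-idf instantiation, in which the global corpus statistic enters as an inverse document frequency; that is one standard choice, not necessarily the exact formula the presenter had in mind:

```python
import math

def tfidf_weights(docs):
    """Compute a tf-idf weight for each term in each document.

    docs: list of documents, each a list of tokens.
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        # term frequency within this document (tf)
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # weight = tf * log(N / df): frequent locally, rare globally
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["web", "crawler", "web"], ["web", "search"], ["ranking"]]
w = tfidf_weights(docs)
# "web" occurs twice in doc 0 but in 2 of the 3 documents,
# so its weight there is 2 * log(3/2)
```

Note how `df` requires a pass over the whole corpus, which is exactly why the slide flags global term frequency as a problem for parallel, independent indexing agents.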

  11. IR on the Web (diagram) • Query → Query Processor → search & match against indexed files → page ranking → responses • Web crawlers browse the Web and feed web pages to the Document Processor, which builds the indexed files

  12. Web: Document discovery • Corpus is very large • Dynamic • Open • Documents must be discovered • …. use Web crawler

  13. Web Crawler • What is a crawler? (diagram of the crawl loop) init with initial URLs → get next URL from the scheduled URLs → get the page from the Web → add it to the visited URLs → extract URLs from the page → schedule them
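The loop in the diagram above can be sketched as follows. `fetch` and `extract_urls` are caller-supplied stand-ins for real HTTP download and link extraction; the toy web below is purely illustrative:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, max_pages=100):
    """Minimal crawler loop: scheduled URLs -> get next URL ->
    fetch page -> mark visited -> extract URLs -> schedule them."""
    scheduled = deque(seed_urls)
    visited = set()
    pages = {}
    while scheduled and len(pages) < max_pages:
        url = scheduled.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        pages[url] = page
        for link in extract_urls(page):
            if link not in visited:
                scheduled.append(link)
    return pages

# toy web: each "page" is just its list of outlinks
web = {"a": ["b", "c"], "b": ["c"], "c": [], "d": []}
pages = crawl(["a"], fetch=web.get, extract_urls=lambda p: p or [])
# crawls a, b, c; "d" has no incoming link and is never discovered
```

The unreachable page "d" previews the document-discovery problem that the parallel-crawler slides address next.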

  14. Parallel Crawler Advantages: • Faster... • Imperative for large-scale crawling • Can be run on cheaper machines • Network-load dispersion • Network-load reduction (Diagram: Crawler1 ... CrawlerN downloading web pages in parallel) *Parallel Crawlers by Cho, Junghoo et al., University of California, WWW2002, Honolulu, Hawaii, USA

  15. Evaluation Metrics • Overlap: 1 - (# of unique pages downloaded / # of pages downloaded by the team of crawlers) • Coverage: # of pages downloaded by the parallel crawler / total # of reachable pages • Communication overhead: # of exchanged messages / # of page downloads
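The three metrics are simple ratios; a sketch with invented example numbers (1200 total downloads, 1000 distinct pages, 2000 reachable pages, 300 messages):

```python
def overlap(total_downloads, unique_pages):
    """Overlap: fraction of downloads that duplicated another crawler's work."""
    return 1 - unique_pages / total_downloads

def coverage(unique_pages, reachable_pages):
    """Coverage: fraction of reachable pages the parallel crawler found."""
    return unique_pages / reachable_pages

def communication_overhead(messages, downloads):
    """Messages exchanged per page downloaded."""
    return messages / downloads

ov = overlap(1200, 1000)                    # 200 duplicated downloads
cv = coverage(1000, 2000)                   # half the reachable web
co = communication_overhead(300, 1200)      # 0.25 messages per page
```

These are the quantities the later comparison slide trades off: firewall mode has zero overlap and overhead but poor coverage, while cross-over mode buys coverage at the cost of overlap.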

  16. Assignment of search areas • Partitioning the Web • Address division: .net, .ca, UdeM.ca • Topic • Static assignment (see next page) • Dynamic assignment (see multi-agent collaborative search)

  17. Partition function There is a multitude of ways to partition the web: • Site-hashing: based on the hash value of the site name of a URL • URL-hashing: based on the hash value of the entire URL • Hierarchical: partition the web hierarchically based on the URLs of the pages Partitioning will come up again with agents!
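The first two partition functions can be sketched directly. MD5 is used here only as a stable string hash; the choice of hash is an implementation detail, not something the slide specifies:

```python
import hashlib
from urllib.parse import urlparse

def _hash(s, n):
    # stable hash of a string into n buckets
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % n

def site_hash_partition(url, n_crawlers):
    """Assign a URL by hashing only its site (host) name, so every
    page of one site goes to the same crawler."""
    return _hash(urlparse(url).netloc, n_crawlers)

def url_hash_partition(url, n_crawlers):
    """Assign a URL by hashing the entire URL; pages of one site may
    end up split across crawlers."""
    return _hash(url, n_crawlers)

# under site-hashing, all pages of example.com land on one crawler
a = site_hash_partition("http://example.com/page1", 4)
b = site_hash_partition("http://example.com/page2", 4)
```

Site-hashing keeps intra-site links local to one crawler (fewer inter-crawler edges), which matters for the crawling modes on the next slides.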

  18. Crawling modes (examples)* • Firewall mode, cross-over mode, exchange mode (Example graph: Site1, handled by Crawler1, contains pages a, b, c, d, e; Site2, handled by Crawler2, contains pages f, g, h, i) *Parallel Crawlers by Cho, Junghoo et al., University of California, Los Angeles, WWW2002, Honolulu, Hawaii, USA

  19. Firewall mode: download only within partitions Crawler1: a→b, a→c Crawler2: f→g, g→h, g→i (Site1 = Crawler1, Site2 = Crawler2) d and e are overlooked!

  20. Cross-over mode: also download between partitions Crawler1: a→b, a→c; then a→g, g→h, h→d, d→e, g→i Crawler2: f→g, g→h, g→i; then h→d, d→e Duplication of work!

  21. Exchange mode: download within partitions, exchange info Crawler1: a→b, a→c; then sends g to Crawler2 Crawler2: f→g, g→h, g→i; then sends d to Crawler1 Requires communication

  22. Minimizing communication in Exchange Mode • Batch communication • Allow replication: 1) links to pages follow a Zipf distribution (a 20-80 effect) 2) replicate some popular URLs at each crawler (Figure: Zipf distribution of incoming links per page)

  23. Evaluating quality • We want important pages • Quality measure: |Pages ∩ Top_k| / |Top_k| • Pages: the k pages actually downloaded • Top_k: the k most important pages* *indication of importance: backlink count
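The quality measure is a set intersection; a minimal sketch with made-up page names:

```python
def quality(downloaded_pages, top_k):
    """Quality = |downloaded ∩ Top_k| / |Top_k|: the fraction of the
    k most important pages (e.g. by backlink count) that the crawler
    actually fetched."""
    top = set(top_k)
    return len(set(downloaded_pages) & top) / len(top)

# the crawler fetched 3 of the 4 most important pages
q = quality(["a", "b", "c", "x", "y"], ["a", "b", "c", "d"])
# q == 0.75
```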

  24. Comparison[2] From the experiments in [2]: 1) firewall mode: suitable when the number of parallel crawlers is < 4, but yields lower quality 2) exchange mode: small network traffic and maximizes quality 3) replicating between 10,000 and 100,000 popular URLs reduces communication overhead by 40%

  25. Intelligent crawling* • Indiscriminate crawlers (e.g. for Google): any new page is good • Topic-oriented crawlers (e.g. for calls for tenders): we just want new pages on a topic of interest • → intelligent crawler *Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al., IBM T.J. Watson Research Center, WWW10, Hong Kong, 2001

  26. Focused Crawling • Which node to explore next? Depth-first? Breadth-first? Best-first! But what is "best"? • Focused crawling is best; how to establish focus? - Linkage locality: pages on topic X tend to link to other pages on topic X - Sibling locality: if a page links to a page on topic Y, its other outlinks (siblings) are likely on topic Y too

  27. Focused Crawling • Objective: given a specific query, find: - good sources of content (authorities): many links TO them - good sources of links (hubs): many links FROM them • Given an arbitrary query, can we auto-focus? - learning capability - learning model

  28. Learning Model • Analyze links from pages on the search periphery • Learn how to pick good links to follow (Diagram: visited web pages, their hyperlinks, and candidate pages still to visit)

  29. Learning Model • Clues based on: content, URL tokens, linkage info, sibling structure • Different needs require different learning: - the crawler needs to learn during the crawl - learned information should be reusable • The crawler should be intelligent

  30. Intelligent Crawling • Priority list of URLs to be explored (PList) • User-defined predicate to compute the interest of a page (= the processed query) • KB: knowledge base

  31. Intelligent Crawling • Algorithm Intelligent-Crawler():
      Priority-List (PList) = { starting seeds };
      while not (termination) do
          reorder the URLs on PList using the KB;
          drop unimportant items from PList;
          W = pop the first element of PList;
          fetch the web page W;
          parse W and add all the outlinks in W to PList;
          if W satisfies the user-defined predicate, then store W;
          update the KB using content and link information from W;
      end
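The algorithm above can be turned into a runnable sketch. Here the KB is reduced to a `score(url)` function and the predicate to a `predicate(page)` callable; both, together with the toy web below, are illustrative assumptions rather than the paper's actual machinery (in particular, the KB update step is left as a comment):

```python
def intelligent_crawl(seeds, fetch, extract_urls, predicate, score,
                      max_pages=100):
    """Sketch of the Intelligent-Crawler loop: PList is repeatedly
    reordered by the KB's score, and pages satisfying the
    user-defined predicate are stored."""
    plist = list(seeds)
    seen = set(seeds)
    stored = []
    while plist and len(seen) <= max_pages:
        # reorder PList using the KB and pop the most promising URL
        plist.sort(key=score, reverse=True)
        url = plist.pop(0)
        page = fetch(url)
        # parse the page and add its outlinks to PList
        for link in extract_urls(page):
            if link not in seen:
                seen.add(link)
                plist.append(link)
        if predicate(page):
            stored.append(url)  # W satisfies the predicate: store it
        # a full crawler would now update the KB from content/links of W
    return stored

# toy web echoing the 'online malls' example on the next slides
web = {
    "seed": {"text": "", "links": ["eshop1", "misc"]},
    "eshop1": {"text": "online malls", "links": []},
    "misc": {"text": "news", "links": []},
}
found = intelligent_crawl(
    ["seed"],
    fetch=web.get,
    extract_urls=lambda p: p["links"],
    predicate=lambda p: "online malls" in p["text"],
    score=lambda u: 1 if "eshop" in u else 0,  # URL-token clue (the KB)
)
```

With the URL-token score, the crawler visits "eshop1" before "misc", illustrating how the KB biases exploration order.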

  32. Intelligent Crawler During the crawling process we can accumulate statistics such as: • number of URLs crawled, N1 • number of crawled URLs that satisfy the predicate, N2 • # of pages containing word i that satisfy the predicate, N3 • # of pages with a given keyword in the URL that satisfy (or not) the predicate, ... • How to create a KB? A later example illustrates URL-based learning

  33. Intelligent Crawler Example: the user is interested in 'online malls', BUT only 0.1% of web pages contain 'online malls'. HOWEVER, if the word 'eshop' is in the URL, the probability of the page containing 'online malls' rises to 5%. Thus we should add to the KB the fact that 'eshop' in the URL is a useful criterion in choosing pages to explore.

  34. Formal view* • C: the event that a crawled web page satisfies the given predicate • P(C): probability of event C, P(C) = N2 / N1 • E: a fact that we know about a candidate URL • Knowledge of the event E may change the probability of C: P(C|E) = P(C ∩ E) / P(E) • Define the interest ratio of C given E as IR(C,E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) * P(E)) • The values of P(C ∩ E) and P(E) can be estimated during the crawl *from: Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al.

  35. Mall example • 0.1% of web pages contain 'online malls', i.e. satisfy the predicate (P(C) = 0.1%) • if the word 'eshop' occurs in the URL (event E), the probability of satisfying the predicate increases to 5% (P(C|E) = 5%) • So the interest ratio IR(C,E) = P(C|E) / P(C) = 5% / 0.1% = 50
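The interest ratio can be estimated from the crawl counters of slide 32. The concrete counts below (100,000 URLs crawled, 1,000 with 'eshop' in the URL, 50 of those satisfying the predicate) are invented to reproduce the slide's 0.1% and 5% figures:

```python
def interest_ratio(n_crawled, n_satisfy, n_event, n_event_and_satisfy):
    """IR(C,E) = P(C ∩ E) / (P(C) * P(E)), estimated from counters."""
    p_c = n_satisfy / n_crawled            # P(C)
    p_e = n_event / n_crawled              # P(E)
    p_c_and_e = n_event_and_satisfy / n_crawled  # P(C ∩ E)
    return p_c_and_e / (p_c * p_e)

ir = interest_ratio(100_000, 100, 1_000, 50)
# P(C) = 0.1%, P(C|E) = 50/1000 = 5%, so IR = 50
```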

  36. Collaborative Search • 3 ways to search for information: browsing, querying, and filtering • Collaboration types [10]: collaborative browsing, mediated searching, collaborative information filtering, collaborative agents, collaborative reuse of results

  37. Collaborative Search • What do we mean by collaboration? • Human ↔ computer ↔ human • Human ↔ computer • Computer agent ↔ computer agent

  38. Collaborative Search • Man-machine: - Collaborative browsing --- Ariadne system [23] - Collaborative reuse of results --- Pluribus [21] (2000) - Collaborative information filtering --- collaborative filtering [25] - Mediated searching --- DIAMS [22] (2000) • Machine-machine (... collaborative agents): - meta-search engines: Meta Crawler, Mamma, Metagopher, Copernic - topic-oriented collaborative crawler [11] (2002) - Collaborative Spider [16] (2002) - UbiCrawler [5] (2003) - Collaborator [19] (under development)

  39. Existing systems: meta-search engines • Meta Crawler, Mamma, Metagopher, Copernic • passes the query to other search engines • collects the results from those search engines • combines the results for the user

  40. Topic-oriented collaborative crawlers [11] (2002) • Each crawler is given a specific topic • It knows the topics of its colleagues • It sends URLs of pages it doesn't care about to the crawler responsible for that topic Problems: • static, predefined topic categories • static-assignment partition function • a controller assigns sites to each crawler

  41. Collaborative spiders [16] (2002) • JATLite (Java Agent Template Lite), uses KQML • user agents + ONE scheduler agent, plus a collaborator agent (as a mediator) • search, content mining, post-retrieval analysis • the system lets groups of users share information

  42. UbiCrawler [5] (2003) • consistent hashing as the partition function: buckets are agents, keys are hosts • failure detector --- the only synchronous component • each agent keeps track of the visited URLs in a hash table • pure Java application, RMI-based, multi-threaded agents

  43. Collaborator [19] (under development) • a shared-workspace framework for virtual teams • 3-tier architecture, J2EE + agents (BlueJADE): client tier, middle tier, enterprise information systems tier • personal agents, session-management agents • accessible from desktop or wireless devices • JADE, FIPA

  44. Conclusion Current collaborative search is: - collaborative - dynamic - adaptive in its exploration - intelligent - decentralized Trend: agents

  45. Multi-agent collaborative search Challenges? (Diagram: a query is handled by agent_1 ... agent_n, each with its own data store, collectively covering the Web)

  46. Challenges • Dynamic partition? - dynamically assigning web domains to agents • Load balancing? - each cache stores roughly the same # of pages • Content lookup? - an agent can easily locate the store holding particular content • Solution: web caching & consistent hashing
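The consistent-hashing scheme mentioned above (and used by UbiCrawler, where buckets are agents and keys are hosts) can be sketched as a minimal ring. MD5 and the replica count are implementation assumptions, not details from the slides:

```python
import bisect
import hashlib

def _hash(key):
    # stable 128-bit hash of a string
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHash:
    """Agents are buckets on a hash ring; hosts are keys."""

    def __init__(self, agents, replicas=100):
        # each agent gets `replicas` points on the ring for load balance
        self._points = sorted(
            (_hash(f"{agent}#{i}"), agent)
            for agent in agents
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._points]

    def agent_for(self, host):
        # the first ring point at or after the host's hash owns it
        i = bisect.bisect(self._keys, _hash(host)) % len(self._keys)
        return self._points[i][1]

ring = ConsistentHash(["agent1", "agent2", "agent3"])
owner = ring.agent_for("www.example.com")
# every agent can compute the same owner locally,
# so content lookup needs no central controller
```

Because adding or removing an agent only moves the keys on its arcs of the ring, this addresses all three challenges at once: dynamic partitioning, roughly balanced load, and decentralized content lookup.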

  47. Web Caching • Content (URL → content): for download efficiency • Indexing information (keyword → URL): for search efficiency

  48. Browser caching 1. For efficiency 2. Each client has its own cache (Diagram: clients with individual caches accessing www.abc.com)

  49. Proxy caches 1. Each cache stores a subset of all pages 2. Each client knows several caches (Diagram: domain caches between the clients and www.abc.com)

  50. Agents' web-cache communication (Diagram: users talk to agents; each agent has its own web cache; agents communicate with each other and with the Web)
