CSE 450 – Web Mining Seminar, Professor Brian D. Davison, Fall 2005. A Presentation on Searching the Workplace Web (R. Fagin, R. Kumar, K. McCurley, J. Novak, D. Sivakumar, J. Tomlin & D. Williamson; WWW2003, Budapest, Hungary) by Osama Ahmed Khan, 11/03/2005 (It’s my birthday!)
Problem • Intranet Search vs. Internet Search
Solution • A case study of IBM’s intranet
Definition • Intranet: A corporate network that is both similar and dissimilar to the Internet
Internet • Democratic: Reflects collective voice of many authors • Interesting Content: Attracting user traffic (Axiom 1) • Targets various ‘Best Answers’ (Axiom 2) • Spam-influenced: Various authorities contributing (Axiom 3) • Search-engine-friendly (Axiom 4)
Intranet • Autocratic: Reflects the view of the entity that it serves • Informative Content: Created to disseminate information, not to attract traffic (Axiom 1) • Targets a single ‘Right Answer’ (Axiom 2) • Spam-free: Small number of authorities for building content (Axiom 3) • Not search-engine-friendly (Axiom 4)
Two-phase Approach • Phase 1: Identify a variety of ranking functions based on heuristic and experimental analysis of intranet structure • Phase 2: Combine them through a Rank Aggregation Architecture
IBM’s Dataset • Unbiased: Findings may apply to other organizations
System Architecture • Crawler: Stores and produces structured data • Duplicate Elimination: Favorite representative from a group of similar pages • Inverted Indexing: 3 indices (Content, Title, Anchortext) • Global Ranking: 7 static lists (PageRank, Indegree, Discovery date, URL words, URL length, URL depth, Discriminator) • Query Runtime System • Result Markup and Presentation
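Two of the seven static lists, URL length and URL depth, can be illustrated with a minimal sketch. It assumes that shorter and shallower URLs rank higher, which the slide does not state; the function names and the `w3.example.com` URLs are hypothetical.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /a/b/c.html -> 3."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])


def static_rankings(urls):
    """Return two query-independent orderings of the same URLs.

    Assumption: shorter URLs and shallower paths are ranked first.
    Ties are broken alphabetically so the output is deterministic.
    """
    by_length = sorted(urls, key=lambda u: (len(u), u))
    by_depth = sorted(urls, key=lambda u: (url_depth(u), u))
    return {"url_length": by_length, "url_depth": by_depth}


# Hypothetical intranet pages.
pages = [
    "http://w3.example.com/hr/benefits/2005/dental.html",
    "http://w3.example.com/hr/",
    "http://w3.example.com/a-very-long-single-page-name.html",
]
lists = static_rankings(pages)
for name, ranking in lists.items():
    print(name, ranking)
```

Because these lists do not depend on the query, they can be computed once at indexing time and reused for every search.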
Rank Aggregation Architecture • Input: Multiple ranked lists from various heuristics • Output: Final ranked list minimizing the total ‘inversions’ with respect to the individual ranked lists • Plug-and-Play: Allows addition and removal of individual heuristics
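The idea of counting ‘inversions’ can be sketched as follows. The aggregation step here is a simple median-rank heuristic chosen for illustration; it is not necessarily the method used in the paper and it does not guarantee the minimum total. The sketch also assumes every list ranks the same set of pages. All names and the sample lists are hypothetical.

```python
from itertools import combinations
from statistics import median


def inversions(a, b):
    """Number of page pairs that rankings a and b order differently."""
    pos_a = {x: i for i, x in enumerate(a)}
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def total_inversions(candidate, ranked_lists):
    """Sum of inversions between a candidate and every input list."""
    return sum(inversions(candidate, r) for r in ranked_lists)


def aggregate(ranked_lists):
    """Order pages by their median position across the input lists.

    Heuristic stand-in only; ties are broken alphabetically.
    """
    items = ranked_lists[0]
    positions = {x: [r.index(x) for r in ranked_lists] for x in items}
    return sorted(items, key=lambda x: (median(positions[x]), x))


# Hypothetical per-heuristic rankings of four pages.
heuristics = {
    "pagerank": ["a", "b", "c", "d"],
    "indegree": ["b", "a", "c", "d"],
    "url_depth": ["a", "c", "b", "d"],
}
final = aggregate(list(heuristics.values()))
print(final, total_inversions(final, list(heuristics.values())))
```

Plug-and-play follows from the input format: adding or removing a heuristic only adds or removes one entry in `heuristics`, and `aggregate` is unchanged.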
Conclusion • Intranet and Internet possess different structures • Separating ranking functions helps select a combination of best heuristics