Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure
-- Martin Klein & Michael L. Nelson, Old Dominion University
“Moved” but not lost
• Reasons for “404”
  • Change in website structure
  • Original webpage relocated in the same website
  • Server/domain name issues
  • Original webpage captured by other websites
Rediscovering Missing Webpages
• Search-based solutions
  • URL
  • Lexical Signature (LS)
  • Title
  • Social bookmarking tags
  • Link Neighbourhood Lexical Signature (LNLS)
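To make the lexical signature idea concrete, here is a minimal sketch that assumes an LS is the handful of page terms with the highest TF-IDF score, which is the usual definition but is not spelled out on the slide. The function names, the add-one smoothing, the default of 5 terms, and the tiny background corpus are all illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus_texts is a hypothetical background collection used only to
    estimate document frequency; a real system would need a far larger one.
    """
    tf = Counter(tokenize(page_text))
    doc_sets = [set(tokenize(t)) for t in corpus_texts]
    n_docs = len(doc_sets)

    def idf(term):
        df = sum(1 for s in doc_sets if term in s)
        # Add-one smoothing (an assumption) keeps the ratio finite
        # for terms that never occur in the background corpus.
        return math.log((n_docs + 1) / (df + 1))

    scored = {term: count * idf(term) for term, count in tf.items()}
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    return ranked[:k]
```

The returned terms would then be joined with spaces and submitted to a search engine as the query for the missing page.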
Evaluation
• Corpus
  • 500 random samples from Open Directory Project
  • “Pretend” to be missing
• Search engines
  • Google/Yahoo/MSN
• Metric
  • Percentage of webpages rediscovered from the top-N search results (N=1, 2-10, 11-100)
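The metric can be sketched as a simple bucketing of result ranks. The snippet below is an illustration under stated assumptions: a page counts as rediscovered only if its URL appears verbatim in the result list (the slide does not say how URLs were matched), and the helper names `find_rank`, `bucket`, and `rediscovery_rates` are hypothetical.

```python
from collections import Counter

LABELS = ["top-1", "top-2-10", "top-11-100", "undiscovered"]


def find_rank(target_url, result_urls):
    """Return the 1-based rank of target_url in result_urls, or None.

    Exact string comparison is an assumption; real evaluations often
    normalise URLs before comparing them.
    """
    for rank, url in enumerate(result_urls, start=1):
        if url == target_url:
            return rank
    return None


def bucket(rank):
    """Map a rank (or None when the page was not found) to a bucket label."""
    if rank is None or rank > 100:
        return "undiscovered"
    if rank == 1:
        return "top-1"
    if rank <= 10:
        return "top-2-10"
    return "top-11-100"


def rediscovery_rates(ranks):
    """Return the percentage of pages that fall into each bucket."""
    if not ranks:
        raise ValueError("ranks must contain at least one entry")
    counts = Counter(bucket(r) for r in ranks)
    return {label: 100.0 * counts[label] / len(ranks) for label in LABELS}
```

Running `rediscovery_rates` over the ranks of all 500 sampled pages for one search engine and one query type would yield one row of the kind reported in the results slides.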
Results
• LS
  • Majority either rediscovered in top-10 or undiscovered
  • Yahoo!: 67.6% top-1, 7.5% top-2-10, 22% undiscovered
• Title
  • Similar distribution but with more webpages rediscovered
  • Google: 69.3% top-1, 8.1% top-2-10, 19.7% undiscovered
  • Unquoted better than quoted
• Tags and LNLS
  • Poor performance from both
Results
• Combining LS and Title
  • Better performance than any single method
  • Yahoo! uniformly outperforms the rest
    • 76.4% top-1, 7.8% top-2-10, 13.6% undiscovered
• Title analysis
  • Length of 3 to 6 words most frequent and well-performing
  • Further improvement by removing stopwords
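As a rough illustration of the title-based query with stopwords removed, the sketch below strips surrounding punctuation and drops common function words before building the query string. The stopword list is a small hypothetical sample and the `quoted` flag exists only to show both query forms; neither is taken from the slides.

```python
import string

# A deliberately tiny, hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "at", "by", "with"}


def title_query(title, quoted=False):
    """Build a search query from a page title with stopwords removed.

    quoted=True wraps the result in double quotes to request a phrase
    match; quoted=False leaves it as a plain bag-of-words query.
    """
    cleaned = [w.strip(string.punctuation) for w in title.split()]
    words = [w for w in cleaned if w and w.lower() not in STOPWORDS]
    query = " ".join(words)
    return f'"{query}"' if quoted else query
```

For example, `title_query("Welcome to the ODU Library - Home")` gives `Welcome ODU Library Home`, which falls within the 3 to 6 word range the slide reports as most frequent.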
Research Insights
• Common but non-trivial problem
• Simple methodology
• Detailed, multi-step evaluation