Quantitative Comparisons of Search Engine Results Mike Thelwall School of Computing and Information Technology, University of Wolverhampton (Wolverhampton, UK) Journal of the American Society for Information Science and Technology, 2008
Abstract • Search engines • Used to find information or web sites • Webometrics • Finding and measuring web-based phenomena • Comparing the application programming interfaces (APIs) of • Google, Yahoo!, Live Search • Webometric applications • hit count, number of URLs, number of domains, number of web sites, number of top-level domains
Search Engine and Web Crawlers • Three key operations: • Crawling: identifying, downloading and storing pages in a database • Results matching: the search engine identifies the pages in its database that match a user query.
Search Engine and Web Crawlers • Results ranking • A search engine arranges the matching URLs to maximize the probability that a relevant result appears in the first or second page, using signals such as: • Search-term occurrence frequency • Number of clicks
Research Objectives • Are there specific anomalies that make the hit count estimates (HCEs) of Google, Live Search or Yahoo! unreliable for particular values? • How consistent are Google, Live Search and Yahoo! in the number of URLs returned for a search, and which of them typically returns the most URLs? • How consistent are the search engines in terms of the spread of results (sites, domains and top-level domains), and which search engine gives the widest spread of results for a search?
Data • 1,587 words • Drawn from blogs, selected by word frequency • http://cybermetrics.wlv.ac.uk/paperdata/ • Searches in three engines • Google, Yahoo! and Live Search • Up to 1,000 result pages per search • Five webometric measures • hit count, number of URLs, number of domains, number of web sites, number of top-level domains
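Of the five measures above, the hit count comes from the engine's own estimate, while the other four can be counted from the list of returned URLs. A minimal Python sketch (the exact site/domain definitions used in the paper may differ; the crude "last two host labels" rule for domains is an illustrative assumption):

```python
from urllib.parse import urlparse

def webometric_measures(urls):
    """Count four of the five webometric measures from result URLs.
    (The fifth, the hit count estimate, is reported by the engine.)"""
    hosts = [urlparse(u).hostname or "" for u in urls]
    # Assumption: a "site" is approximated by the full hostname, and a
    # "domain" by the last two host labels; real registrable-domain
    # rules (e.g. for .ac.uk) are more involved.
    domains = {".".join(h.split(".")[-2:]) for h in hosts if h}
    tlds = {h.split(".")[-1] for h in hosts if h}
    return {
        "urls": len(set(urls)),
        "sites": len(set(hosts)),
        "domains": len(domains),
        "tlds": len(tlds),
    }

urls = [
    "http://www.wlv.ac.uk/research/",
    "http://cybermetrics.wlv.ac.uk/paperdata/",
    "http://example.com/page",
]
print(webometric_measures(urls))
```

A wider spread of results for the same query then shows up directly as larger `sites`, `domains` and `tlds` counts.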
Results - 1 • Hit count estimates Figure 2a,b,c. Hit count estimates of the three search engines compared (logarithmic scales, excluding data with zero values; r=0.80, 0.96, 0.83).
Results - 2 • Number of URLs returned Figure 3a,b,c. URLs returned by the three search engines compared (r=0.71, 0.68, 0.84).
Results - 3 • Number of domains returned Figure 4a,b,c. Domains returned by the three search engines compared (r=0.65, 0.69, 0.83).
Results - 4 • Number of sites returned Figure 5a,b,c. Sites returned by the three search engines compared (r=0.66, 0.69, 0.81).
Results - 5 • Number of TLDs returned Figure 6a,b,c. TLDs returned by the three search engines compared (r=0.74, 0.77, 0.84).
Results - 6 • Comparison within results
Conclusion • Google seems to be the most consistent in terms of the relationship between its HCEs and number of URLs returned. • Yahoo! is recommended if the objective is to get results from the widest variety of web sites, domains or TLDs.
Evaluating Search Engine Effects on Web-based Relatedness Measurement
Snippets • Six manifest records • snippets • hit count • number of URLs • number of domains • number of web sites • number of top-level domains
Dataset • WordSimilarity-353 Test Collection (TC-353) • TC353 Full (353 pairs) • TC353 Testing (153 pairs) • Three major search engines • Yahoo! • Google • Live Search • Five domains • general web search (web09) • .com • .edu • .net • .org
The Model • A web-based relatedness measure WebMetric(X, Y) quantifies the association of two objects X and Y: WebMetric(X, Y) = F(d(X, Y)) • where F is a transfer function and d is a dependency score • The dependency score d reflects the mutual dependency of X and Y on the web.
The Model • Given a search engine G and two objects X and Y • we employ two double-checking functions, fG(Y@X) and fG(X@Y), to estimate the dependence between X and Y; WebMetric(X, Y) is then computed from these two values.
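The slides give the combination of fG(Y@X) and fG(X@Y) only as an equation image, so the sketch below merely illustrates the idea: each double-checking function is estimated from hit counts, and the dependency score combines the two symmetrically. The hit-count estimator and the product combination are assumptions for illustration, not the paper's formula:

```python
def f_at(hits_xy, hits_x):
    """Double-checking function f_G(Y@X): the fraction of pages
    matching X that also match Y, estimated from hit counts
    (an assumed estimator, not the paper's definition)."""
    return hits_xy / hits_x if hits_x else 0.0

def dependency(hits_x, hits_y, hits_xy):
    """Symmetric dependency score d(X, Y): here, hypothetically,
    the product of the two double-checking functions."""
    return f_at(hits_xy, hits_x) * f_at(hits_xy, hits_y)

# Toy hit counts, invented for illustration:
print(dependency(hits_x=1000, hits_y=500, hits_xy=100))
```

If either term is zero (one word never co-occurs with the other), the score collapses to zero, which matches the intuition behind double-checking: both directions of dependence must hold.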
Figure 8. Behaviors of the Gompertz Curve and a Mapping Example
Experiments