This document explores how search queries extracted from web access logs can serve as metadata (tags) for web content. When a visitor arrives from a search engine, the HTTP referrer field contains the search query, so queries can be collected either from the access log or with JavaScript embedded in the pages. The resulting query tags are compared with Delicious tags on two datasets, Stanford Query Logs and Delicious@Stanford. The findings show that query logs provide tags for far more URLs than Delicious does, which makes them a promising new source of information about web content.
Tagging with Queries: How and Why? Ioannis Antonellis antonell@cs.stanford.edu Hector Garcia-Molina hector@cs.stanford.edu Jawed Karim jawed@cs.stanford.edu
Content on the Web • Back Link Text • Search queries • Page Text • Forward Link Text • CNN, Obama, Critics, news Stanford Infolab
How? • Basic observation: the HTTP referrer field contains the search query
How? • Basic observation: the HTTP referrer field contains the search query 1) Extract queries from the web access log
Web Access Log
a997c1950718d75c03f22ca8715e50b3 [28/Feb/2007:23:45:47 -0800] /group/svsa/cgi-bin/www/officers.php http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts
a64344ffd6638d0f6fb2a0284f98b28b [28/Feb/2007:23:45:49 -0800] /group/King/ "http://www.google.com.au/search?hl=en&q=Martin+Luther+King&meta="
413fa663474b2288c1661882e7e62aea [28/Feb/2007:23:46:02 -0800] /group/pandegroup/folding/results.html "http://www.google.com/search?sourceid=navclient-menuext&ie=UTF-8&q=RESULTS"
3d2edd4dfa7778da92875ee67a319433 [28/Feb/2007:23:46:03 -0800] /group/vpge/sgsi/entrepreneurship/ "http://www.google.com/search?hl=en&q=summer+institute+of+entrepreneurship"
ac49793239a6c490023e460fd4863a48 [28/Feb/2007:23:46:06 -0800] / "http://www.google.com/search?sourceid=navclient&hl=ko&ie=UTF-8&rlz=1T4SUNA_ko___KR209&q=stanford"
1c9893680
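Step 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes each log line has whitespace-separated fields in the order shown above (visitor hash, timestamp, requested path, referrer) and that the query sits in the `q` parameter of a Google referrer. The function names `extract_query` and `parse_log_line` are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, parse_qs


def extract_query(referrer: str) -> Optional[str]:
    """Return the search query carried in a referrer URL, or None."""
    # Some log entries wrap the referrer in double quotes; drop them.
    parts = urlsplit(referrer.strip('"'))
    # Assumption: only Google referrers are handled in this sketch.
    if "google." not in parts.netloc:
        return None
    # parse_qs decodes '+' and %XX escapes and skips blank values.
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


def parse_log_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (requested path, search query) for one log line, or None."""
    # Expected fields: hash, "[date:time", "zone]", path, referrer.
    fields = line.split()
    if len(fields) < 5:
        return None  # truncated or malformed entry
    path, referrer = fields[3], fields[4]
    query = extract_query(referrer)
    return (path, query) if query is not None else None
```

Applied to the second entry above, `parse_log_line` returns `('/group/King/', 'Martin Luther King')`. Truncated entries and requests without a search referrer yield `None`.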
How? • Basic observation: the HTTP referrer field contains the search query 1) Extract queries from the web access log 2) Embed JavaScript code in web pages that captures search queries
Embeddable code
How? • Basic observation: the HTTP referrer field contains the search query 1) Extract queries from the web access log 2) Embed JavaScript code in web pages to capture search queries • Convince the server administrator/page owner
Query tags
Information value of Query Tags WebBase • Datasets: • Stanford Query Logs: 360,000 URLs, 900,000 query tags • Delicious@Stanford: 3,000 URLs, 5,500 tags
Experiments - Summary • URL coverage • Query vs Delicious Tags • Query/Delicious Tags vs Page Text
URL coverage • Query logs provide tags for ~110 times more URLs than Delicious • 13% of Delicious URLs (380 URLs) are tagged only by Delicious
Query Tags • Query logs provide 42 query tags per URL on average
Delicious Tags • Delicious provides 3 tags per URL on average
Tags for common URLs • Query logs provide 250 query tags per URL on average for common URLs • Delicious provides 5 tags per URL on average for common URLs
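The per-URL averages on these slides can be computed along the following lines. This is only a sketch under stated assumptions: the input is a list of (URL, tag) pairs, repeated tags for the same URL are counted once, and the helper names `group_tags` and `average_tags_per_url` are hypothetical. The slides do not say whether duplicates were counted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_tags(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect the distinct tags observed for each URL."""
    tags_by_url: Dict[str, Set[str]] = defaultdict(set)
    for url, tag in pairs:
        tags_by_url[url].add(tag)
    return dict(tags_by_url)


def average_tags_per_url(tags_by_url: Dict[str, Set[str]]) -> float:
    """Mean number of distinct tags per URL (0.0 if there are no URLs)."""
    if not tags_by_url:
        return 0.0
    return sum(len(tags) for tags in tags_by_url.values()) / len(tags_by_url)
```

The "common URLs" are those that appear in both datasets, so the same helper can be applied to each dataset after restricting it to the URLs that have both query tags and Delicious tags.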
Query Tags vs Page Text • For every URL, 1 out of 3 query tags is not present in the page text
Delicious Tags vs Page Text • For every URL, 1 out of 2 Delicious tags is not present in the page text
Tags for common URLs • For common URLs, 1 out of 2 query/Delicious tags is not present in the page text
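The tag-versus-page-text comparison can be illustrated with a short Python sketch. The slides do not say how presence in the page text was tested, so the case-insensitive substring match below is an assumption, and `fraction_missing` is a hypothetical helper name.

```python
from typing import Iterable


def fraction_missing(tags: Iterable[str], page_text: str) -> float:
    """Fraction of tags that do not occur in the page text.

    Assumption: a tag counts as present when it appears as a
    case-insensitive substring of the page text.
    """
    tag_list = list(tags)
    if not tag_list:
        return 0.0
    text = page_text.lower()
    missing = sum(1 for tag in tag_list if tag.lower() not in text)
    return missing / len(tag_list)
```

Averaging this fraction over all URLs in a dataset gives a figure comparable to the "1 out of 3" and "1 out of 2" numbers reported on these slides.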
Conclusions Query tags: • Can be extracted in a distributed fashion • Are a promising new source of information • Can provide many new tags for a large fraction of the Web
Thank You! (DEMO) http://tags.stanford.edu