Explore the findings from studies conducted on web searching and retrieval, including user queries analysis, relevance feedback, search techniques, and query classification. Discover insights on term characteristics, session patterns, and user behaviors.
Amanda Spink: Analysis of Web Searching and Retrieval • Larry Reeve • INFO861 - Topics in Information Science • Dr. McCain - Winter 2004
Background • Amanda Spink • Self-described areas of work: • Information Retrieval • Web Retrieval • Human Information Behavior / Information Seeking • Medical Informatics • Ph.D. 1993 – Rutgers University • Thesis - Feedback in Information Retrieval • Studied under Tefko Saracevic
Background • Amanda Spink • Over 140 papers published • 5th in journal article production • 18th in citation production among U.S. IS faculty • Institute for Information Science – most highly cited paper in Web Retrieval: • Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web (2000)
Background • Amanda Spink • Associate Professor at University of Pittsburgh • School of Information Sciences • Prior faculty positions • Pennsylvania State University • School of Information Science & Technology • Web Research Group • University of North Texas • School of Library and Information Sciences
Background • Tefko Saracevic • Associate Dean • School of Communication, Information and Library Studies, Rutgers University • Related research • Test and Evaluation of IR systems • Relevance in Information Science • Analysis of web queries
Web Searching and Retrieval • Analyze user queries • Important for building future IR systems on Web • Focus on search terms • Failure analysis in query construction • Term Relevance Feedback (TRF) • Topics / Classification • Use of language
Studies Conducted • U.S. – Excite (www.excite.com) • “51K study” • 51,473 queries • 18,113 users • March 9, 1997 • “1M study” • 1,025,910 queries • 211,063 users • September 16, 1997
Studies Conducted • European - AllTheWeb.com • 1 million queries • 200,000 users • Logs from two days: • February 6, 2001 • May 28, 2002 • Most users from Norway and Germany
Studies Conducted • Issues with Web transaction logs • Where does a session start and end? • Temporal boundary – Spink found 15 min on average; others found 5 min, 12 min, 32 min, and 2 hours • Numerical boundary – 100 entries • How to eliminate non-individual users • Meta-search engines, other agents • No insight into the user's process
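The session-boundary problem can be illustrated with a minimal sketch. It assumes the 15-minute figure is applied as an inactivity cutoff and the 100-entry figure as a cap for flagging non-individual users; the slides do not say exactly how either was applied in the studies, and the function names `split_sessions` and `looks_automated` are hypothetical.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, gap=timedelta(minutes=15)):
    """Group one user's query timestamps into sessions.

    Assumption: a pause longer than `gap` starts a new session.
    The 15-minute default is taken from the slide; other studies
    would use 5, 12, or 32 minutes, or 2 hours.
    """
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


def looks_automated(timestamps, max_entries=100):
    """Assumption: users with more than `max_entries` log entries
    are treated as agents or meta-search engines, not individuals."""
    return len(timestamps) > max_entries


if __name__ == "__main__":
    start = datetime(1997, 9, 16, 12, 0)
    log = [start + timedelta(minutes=m) for m in (0, 5, 30, 32)]
    for i, session in enumerate(split_sessions(log), start=1):
        print(i, [t.strftime("%H:%M") for t in session])
```

Changing `gap` shows how sensitive the session count is to the chosen boundary, which is the difficulty the slide points to.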
Findings • Relevance Feedback • Advanced Search Techniques • Term Characteristics • Query Classification • American vs. European
Findings: Relevance Feedback • Term Relevance Feedback (TRF) rarely used • 51K study • 1,597 queries from 823 users (<5% of queries) • Those using TRF had longer sessions • Successful 60% of time • Implications: • Failure rate of 40% may be too high • IR designers could automatically perform TRF
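The "<5%" figure can be checked against the 51K study totals given on the Studies Conducted slide (51,473 queries, 18,113 users). The short calculation below is only an arithmetic check; the variable names are illustrative.

```python
# Figures from the slides (51K Excite study).
TOTAL_QUERIES = 51_473
TOTAL_USERS = 18_113
TRF_QUERIES = 1_597
TRF_USERS = 823


def share(part, whole):
    """Return part/whole as a fraction."""
    return part / whole


trf_query_share = share(TRF_QUERIES, TOTAL_QUERIES)
trf_user_share = share(TRF_USERS, TOTAL_USERS)

print(f"TRF queries as share of all queries: {trf_query_share:.1%}")
print(f"TRF users as share of all users:     {trf_user_share:.1%}")
```

Both shares come out below 5%, consistent with the slide.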
Findings: Relevance Feedback • Mediated searching • 11% of search terms come from TRF • 37% from users, 63% from mediators • 2/3 of TRF contributed positively
Findings: Relevance Feedback • Identified 6 session states • Initial Query, Modified Query, Next Page, • New Query, Relevance Feedback, Prev Query • Identified 4 session patterns • Using the 6 session states • Implication: IR designers should accommodate these states and patterns
Findings: Relevance Feedback • [Figure: Relevance Feedback Session Patterns]
Findings: Advanced Search Techniques • Includes: • Boolean operators • Modifiers +, - • Quotes (phrases) • Not often used by Web users, but used more in mediated search • Boolean <10%, Modifiers 9%, Phrases 6% • Used incorrectly: • Boolean: AND 50%, OR 28%, AND NOT 19% • Modifiers: 75% of the time • Phrases: 8% • Users and advanced techniques do not get along!
Findings: Advanced Search Techniques • Boolean, most common problems: • Not capitalizing AND • Confusing the ‘AND’ operator with the ‘and’ conjunction • e.g. Science and Technology • Science AND Technology • Modifiers, most common problems: • Modifiers are prefix operators, not mathematical infix • +news +weather rather than news+weather • No space required, as is required with Boolean
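The two error types above can be flagged with a small checker. This is a sketch under stated assumptions, not how the studies coded errors: it assumes operators must be upper-case and that `+` is only valid as a prefix. The function name `find_operator_problems` is hypothetical, and the check cannot tell whether a lowercase "and" was meant as an operator or as an ordinary word.

```python
import re

# Lowercase forms that a searcher may have intended as Boolean operators.
BOOLEAN_WORDS = {"and", "or", "not"}

# A '+' with a word character on both sides, e.g. news+weather.
INFIX_PLUS = re.compile(r"\w\+\w")


def find_operator_problems(query):
    """Return a list of human-readable warnings for one query string."""
    problems = []
    for token in query.split():
        if token in BOOLEAN_WORDS:
            problems.append(
                f"'{token}' is lowercase; use '{token.upper()}' if an operator is intended"
            )
        if INFIX_PLUS.search(token):
            problems.append(
                f"'{token}' uses '+' between terms; prefix each term instead"
            )
    return problems


if __name__ == "__main__":
    for q in ("Science and Technology", "Science AND Technology",
              "news+weather", "+news +weather"):
        print(q, "->", find_operator_problems(q))
```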
Findings: Term Characteristics • Terms per query • 1: 26.6%, 2: 31.5%, 3: 18.2%, >7: 1.8% • Mediated searching: 7-15 terms • Distribution of terms not quite Zipf: • Top terms account for 10% of all terms • Single-use terms account for 9% of all terms • Not understood why this occurs
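Statistics of this kind can be computed from a query log as sketched below. The example assumes whitespace tokenization and case-folding, which may differ from the studies' own term definition; the function names and the toy queries are illustrative only. Under a Zipf distribution, frequency is roughly proportional to 1/rank, so rank times frequency stays roughly constant.

```python
from collections import Counter


def terms_per_query_distribution(queries):
    """Map query length (number of terms) to its share of all queries."""
    lengths = Counter(len(q.split()) for q in queries)
    total = sum(lengths.values())
    return {n: count / total for n, count in sorted(lengths.items())}


def term_counts(queries):
    """Count occurrences of each case-folded term across all queries."""
    return Counter(term.lower() for q in queries for term in q.split())


def rank_frequency(queries):
    """Return (rank, term, frequency) rows, most frequent term first."""
    ranked = term_counts(queries).most_common()
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ranked, start=1)]


def single_use_share(queries):
    """Share of all term occurrences that belong to terms used only once."""
    counts = term_counts(queries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / total


if __name__ == "__main__":
    toy_log = ["news", "news weather", "free music download", "News"]
    print(terms_per_query_distribution(toy_log))
    print(rank_frequency(toy_log)[:3])
    print(single_use_share(toy_log))
```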
Findings: Query Classification • [Figure: Classification of queries based on Rutgers’ Web Classification]
Findings: Query Classification • What users are looking for is not what is on Web: • Distribution of content: • 83% Commercial, 6% Educational, 3% Health • Example: 10% of searches are for Health • Searchers find classifications understandable • IR system presentation design
Findings: American & European Searching • Commonalities: • Three or fewer terms • American: 80%, European 85% • Predominantly use English terms • Relevance judgments: less than 15 minutes viewing retrieved documents • Information seeking sessions short
Findings: American & European Searching • Differences • Categories • American: Entertainment, Sex, Commerce • European: People-places-things, Computers, Commerce • American searchers spent more time searching e-commerce sites than European counterparts • Did not examine: • Use of advanced techniques • Relevance feedback • First in initial set of studies?
Findings: Summary • Number of query terms is about 2 • TRF is not used often • Boolean operators and modifiers are not used often, and users have difficulty using them correctly • Users do not spend much time making relevance judgments • Term frequency distribution: a few terms are used often, many terms are used only once
Findings: Summary • Most users had single query only and did not follow up with successive queries • Average viewing of 2 pages • 50% did not access beyond first page; more than 75% did not go beyond 2 pages
Implications / Further Research • Improve use of advanced search techniques • UI changes, Venn Diagrams • Improve use of relevance feedback • Automatic generation of TRF results • Improve classification of results • UI changes, result overview • Improve understanding of language use • Adapt IR designs to language • Examine cultural differences • TRF, advanced search techniques (same or different)
Amanda Spink - Web Searching and Retrieval • Questions