Explore temporal sensitivity in web search queries to understand how results change over time. Utilize query correlation and Google Trends to analyze and classify queries. Evaluate peak detection and clustering methods for improved accuracy.
Implementing Query Classification HYP: End of Semester Update, prepared by Minh
Previously… • Web search queries: • Understand user goal • Broder (2002): • Queries are classified into 3 categories: • Informational • Navigational • Transactional
Previously… • Functional Faceted Web Query Classification • Ambiguity: Polysemous, General, Specific • Authority Sensitivity: Yes - No • Spatial Sensitivity: Yes - No • Temporal Sensitivity: Yes - No • Query’s 4-Tuple: <Am, Au, S, T> • 3 * 2 * 2 * 2 = 24 different combinations.
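The combination count on this slide can be checked by enumerating the tuples directly. This is only an illustrative sketch; the constant names are made up for the example.

```python
from itertools import product

# Facet values as listed on the slide; the variable names are illustrative.
AMBIGUITY = ("Polysemous", "General", "Specific")
YES_NO = ("Yes", "No")

# Every <Am, Au, S, T> tuple: ambiguity x authority x spatial x temporal.
facet_tuples = list(product(AMBIGUITY, YES_NO, YES_NO, YES_NO))

print(len(facet_tuples))  # 24
```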
Temporal Sensitivity • Definition: • A keyword is temporally sensitive if the results returned by querying it on a web search engine tend to change over time. • Example: • Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc. • Not temporally sensitive: video, buying car, etc.
Up-to-date Project Scope • Objective: to analyze the temporal sensitivity facet of web search queries. • Problem: find the temporal correlation between web queries
Web Query Histogram • Periodic queries: e.g. Champions League Final • Non-periodic queries: e.g. Liverpool
Queries Correlation • Correlation • Observation: 2 keywords are temporally related to each other
Proposed System Framework • Ask Google Trends for query’s histogram • Use histogram digitizer program (Plotparser by WeiHua) to get the numerical data • Query Correlation: • Calculate correlation coefficient between queries • Query classification
Queries Correlation: 1st attempt • Calculate Correlation coefficient: • Using data of 45 months: Jan 2004 until September 2007 • Calculate coefficient based on the entire histograms
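The slides do not say which correlation coefficient was used. Assuming it is the Pearson coefficient over two equal-length monthly series, a minimal sketch could look like this. The two histograms are hypothetical values, not data from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a flat histogram is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical 12-month histograms (illustrative only).
valentine = [1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
chocolate = [3, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6]

coef = pearson(valentine, chocolate)
print(round(coef, 3))
```

In the first attempt the same computation would run over all 45 monthly values of each histogram.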
Result classification: 1st attempt • Data of 15 different popular keywords, of which: • Periodic keywords: • Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Chrismas(!). • Related keywords: • PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami • All keywords are compared with each other based on the correlation coefficient of their histograms. • (15*14)/2 = 105 instances
Result classification: 1st attempt • Classification based on threshold method: • Statistical result: • Threshold value: 0.25
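A sketch of the threshold method, assuming a pair is labelled "related" when its coefficient reaches the threshold (the slides give only the threshold value, not the comparison rule). The keyword names and coefficients below are placeholders.

```python
from itertools import combinations

THRESHOLD = 0.25  # value reported for the 1st attempt

def classify(coefs, threshold=THRESHOLD):
    """Map each keyword pair to True (related) or False (not related)."""
    return {pair: coef >= threshold for pair, coef in coefs.items()}

# 15 placeholder keywords give (15 * 14) / 2 unordered pairs.
keywords = [f"kw{i}" for i in range(15)]
pairs = list(combinations(keywords, 2))
print(len(pairs))  # 105

# Hypothetical coefficients for two pairs.
labels = classify({("kw0", "kw1"): 0.31, ("kw0", "kw2"): 0.10})
print(labels)
```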
1st attempt Problems: • Very low threshold value • Only one feature used. • Using entire histogram, while some keywords are only temporally related to each other at some periods of time. • Example: Valentine – Chocolate (Correlation appears during February)
Queries Correlation: 2nd attempt • Interesting period: • Period in which two queries are highly related to each other • -> Segmentation (Clustering) problem
Clustering Using Simple K means • Algorithm to predict no. of clusters • Use WEKA to cluster the histogram
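The project used WEKA's Simple K-means. The sketch below is a stand-in written in plain Python, not the WEKA implementation. It assumes the clustered feature is the monthly query volume and that the months in the highest-volume cluster form the interesting periods. The number of clusters k must be supplied by the caller.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Lloyd's k-means on scalar values; returns (centroids, labels)."""
    if k < 2 or k > len(values):
        raise ValueError("k must be between 2 and len(values)")
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centroids evenly over the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members
                           else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Hypothetical monthly volumes (illustrative only).
volumes = [1, 1, 9, 8, 1, 1, 1, 2, 9, 1]
centroids, labels = kmeans_1d(volumes, k=2)
high = max(range(len(centroids)), key=lambda c: centroids[c])
interesting_months = [i for i, lab in enumerate(labels) if lab == high]
print(interesting_months)  # [2, 3, 8]
```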
Query Correlation: 2nd attempt • Periodic keyword detection: • Identify repeated patterns using correlation • A periodic query tends to have a high correlation coefficient on its repeated parts.
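One way to read "repeated pattern using correlation" is to cut the histogram into year-long segments and correlate consecutive segments. The slides do not give the exact procedure, so the 12-month period, the averaging, and the function name `periodicity_score` are all assumptions of this sketch.

```python
from math import sqrt

def pearson(x, y):
    """Pearson coefficient; returns 0.0 when either series is flat."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def periodicity_score(monthly, period=12):
    """Average coefficient between consecutive period-length segments."""
    segments = [monthly[i:i + period]
                for i in range(0, len(monthly) - period + 1, period)]
    coefs = [pearson(a, b) for a, b in zip(segments, segments[1:])]
    return sum(coefs) / len(coefs) if coefs else 0.0

# Hypothetical histograms: a yearly February spike vs. a single one-off spike.
yearly_event = [1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] * 3
one_off = [1] * 17 + [9] + [1] * 18

print(periodicity_score(yearly_event))
print(periodicity_score(one_off))
```

A score close to 1 suggests the same shape repeats every period; a score near 0 suggests it does not.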
Interesting Periods Projection • Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram
Result Classification: 2nd Attempt • Using the previous dataset • Related keywords are compared with each of the periodic keywords for correlation • Result: • Managed to raise the threshold value to 0.5
2nd attempt problems • K-means clustering does not guarantee correct detection of interesting periods: • K-means requires the number of clusters as an input • -> the algorithm implemented to determine the number of clusters failed to provide the correct value • Small training data set. • The threshold detector is too simple a method.
Queries Correlation: 3rd attempt • Need to find another way to identify interesting period. • Peak period: • Period in which there is a high peak in query volume • Peak detection problem: • Mapping and smoothing using convolution
Clustering using peak detection • Mapping:
Clustering using peak detection • Smoothing using convolution:
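The slides do not specify the smoothing kernel. As a sketch, the example below convolves the histogram with a three-point moving-average kernel (an assumption) and repeats the edge values so the output keeps the input length.

```python
def smooth(values, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Discrete convolution with `kernel`, same-length output.

    Indices that fall outside the series are clamped to the nearest edge.
    """
    half = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Convolution index: the kernel is flipped relative to the signal.
            idx = min(max(i + half - j, 0), n - 1)
            acc += weight * values[idx]
        out.append(acc)
    return out

# Hypothetical histogram with a single spike.
print(smooth([0, 0, 3, 0, 0]))
```

A wider kernel removes more noise but also flattens narrow peaks, so the kernel width trades off against the peak detection step that follows.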
Clustering using peak detection • Peak Detection: using simple slope-change algorithm to determine peaks and valleys • (with threshold value: mean)
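A minimal sketch of a slope-change detector: a peak is a point where the slope turns from rising to non-rising, a valley where it turns from falling to non-falling. It assumes the mean threshold applies to peaks only (peaks at or below the mean are dropped); the slide does not say how the threshold is applied.

```python
def find_peaks_and_valleys(values):
    """Return (peaks, valleys) as lists of indices into `values`."""
    threshold = sum(values) / len(values)  # threshold value: mean
    peaks, valleys = [], []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if prev < cur >= nxt and cur > threshold:
            peaks.append(i)      # slope changes from rising to non-rising
        elif prev > cur <= nxt:
            valleys.append(i)    # slope changes from falling to non-falling
    return peaks, valleys

# Hypothetical smoothed histogram (illustrative only).
series = [1, 2, 9, 2, 1, 1, 2, 1, 8, 1]
peaks, valleys = find_peaks_and_valleys(series)
print(peaks)    # [2, 8]
print(valleys)  # [4, 7]
```

The small bump at index 6 is ignored because its value (2) is below the series mean (2.8).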
Interesting Periods Projection • Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram, and vice versa
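Projection can be sketched as restricting both histograms to each interesting period and computing the coefficient there. The (start, end) representation with an exclusive end index, the helper name `projected_coefs`, and the data are assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson coefficient; returns 0.0 when either series is flat."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def projected_coefs(periods, hist_a, hist_b):
    """Coefficient of the two histograms inside each (start, end) period."""
    return [pearson(hist_a[s:e], hist_b[s:e]) for s, e in periods]

# Hypothetical 24-month histograms (illustrative only).
valentine = [1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] * 2
chocolate = [3, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6] * 2

# Interesting periods around each February peak, e.g. from peak detection.
periods = [(0, 4), (12, 16)]

whole = pearson(valentine, chocolate)
local = projected_coefs(periods, valentine, chocolate)
print(round(whole, 3), [round(c, 3) for c in local])
```

For the "vice versa" direction, the same function would be called with the periods detected on the other keyword's histogram.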
Result Classification: 3rd attempt • Use larger training data: • 47 popular keywords, of which: • 15 periodic keywords and 32 related keywords • Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef). • Data size: 15 * 32 = 480 instances
Result Classification: 3rd attempt • Apply Naïve Bayes Classifier (WEKA): • 6 features: • Average Coef from related keyword projection (AveRCoef) • Average Coef from periodic keyword projection (AvePCoef) • Overall Average Coef [= (AveRCoef+AvePCoef)/2] • Max Coef from related keyword projection (MaxRCoef) • Max Coef from periodic keyword projection (MaxPCoef) • Average Max Coef [= (MaxRCoef+MaxPCoef)/2 ]
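The six features follow directly from the two lists of per-period coefficients. In this sketch the function name `pair_features` and the input values are illustrative; the feature names are the ones on the slide.

```python
def pair_features(r_coefs, p_coefs):
    """Build the six features for one (related, periodic) keyword pair.

    r_coefs: coefficients from the related keyword's projection
    p_coefs: coefficients from the periodic keyword's projection
    """
    ave_r = sum(r_coefs) / len(r_coefs)
    ave_p = sum(p_coefs) / len(p_coefs)
    max_r = max(r_coefs)
    max_p = max(p_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }

# Hypothetical coefficients for one keyword pair.
features = pair_features([0.2, 0.6], [0.4, 0.8])
print(features)
```

Each of the 480 instances would yield one such feature vector plus a class label for the classifier.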
Result Classification: 3rd attempt • Statistical Result: • Confusion Matrix
Future attempt: Query Normalization • Search volumes tend to increase as the Internet becomes more popular • Histogram for the top 20 most popular keywords of all time:
Future attempt: Normalization • Histograms need to be normalized to remove this trend's effect! • Proposed action: • Subtract the time effect • Current problem: scaling adds further distortions. • -> histograms from Google have already been scaled. We have no information about the raw data.
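One possible reading of "subtract the time effect" is to fit a straight line to the histogram by least squares and subtract it. The linear trend model is an assumption of this sketch, and it does not address the scaling problem noted on the slide.

```python
def detrend(values):
    """Subtract the least-squares straight line from a series (len >= 2)."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (y - y_mean)
                for t, y in enumerate(values)) / denom
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(values)]

# Hypothetical histogram: steady growth plus one spike at index 3.
growing = [1, 2, 3, 9, 5, 6, 7, 8]
print([round(v, 2) for v in detrend(growing)])
```

After detrending, the residuals sum to zero and the spike stands out against a roughly flat baseline.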
Future attempt: From Periodic to Non-periodic • Find the correlation between two non-periodic queries. • Proposed problem: some keywords tend to be searched shortly after other keywords • Example: “tsunami” is usually searched after “earthquake” has been queried.
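A "searched after" relation could be probed by shifting one histogram against the other and keeping the lag with the highest coefficient. This is a sketch of that idea, not a method from the slides; `best_lag`, `max_lag`, and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson coefficient; returns 0.0 when either series is flat."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def best_lag(leader, follower, max_lag=3):
    """Return (lag, coef): the delay at which follower best tracks leader."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        if len(leader) - lag < 2:
            break
        # Compare leader at time t with follower at time t + lag.
        coef = pearson(leader[:len(leader) - lag], follower[lag:])
        if coef > best[1]:
            best = (lag, coef)
    return best

# Hypothetical equal-length histograms: tsunami echoes earthquake one step later.
earthquake = [0, 5, 0, 0, 1, 0, 7, 0, 0, 0]
tsunami = [0, 0, 5, 0, 0, 1, 0, 7, 0, 0]

lag, coef = best_lag(earthquake, tsunami)
print(lag)
```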
Future attempt: From Periodic to Non-Periodic • [Figure: query histograms for “earthquake” and “tsunami”]
Potential Applications • Result re-ranking: • Move more up-to-date results higher in the result list • Example: when a user searches for Beyonce around the time of the Grammys -> results related to the Grammys are ranked higher • Server buffering: • When a user queries Beyonce, web pages related to the Grammys are buffered on a local server, in the expectation that the user will eventually search for the Grammys.