This paper by Manish Gupta from Yahoo! HotJobs explores the prediction of click-through rates (CTR) for job listings, utilizing various machine learning models. It analyzes the CTR as the ratio of clicks to views and studies its implications on publisher revenue and ad exchange bidding. Using a dataset from over 40,000 job postings, the research employs models like linear regression and gradient-boosted decision trees to predict CTR based on numerous features including job title characteristics and click history. The results and methodologies discussed aim to refine CTR estimation for enhanced job listing performance.
Predicting Click Through Rate for Job Listings Manish Gupta Yahoo! HotJobs Jan 22, 2009
CTR and its applications • CTR = ratio of clicks (requests for the full description of an entity) to views of its reduced version • Used to rank results • Affects publisher revenue in pay-for-performance models • Informs bidding in ad exchanges • CTR trends can help detect click fraud
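The definition above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names, the tuple layout, and the click and view counts are all hypothetical.

```python
def click_through_rate(clicks: int, views: int) -> float:
    """Return clicks / views, or 0.0 when the listing was never viewed."""
    if clicks < 0 or views < 0:
        raise ValueError("counts must be non-negative")
    if views == 0:
        return 0.0
    return clicks / views


def rank_by_ctr(listings):
    """Sort (listing_id, clicks, views) tuples by CTR, highest first."""
    return sorted(
        listings,
        key=lambda item: click_through_rate(item[1], item[2]),
        reverse=True,
    )


# Hypothetical counts for three listings.
listings = [("job-a", 1, 100), ("job-b", 5, 100), ("job-c", 0, 0)]
for listing_id, clicks, views in rank_by_ctr(listings):
    print(listing_id, f"{click_through_rate(clicks, views):.2%}")
```

Returning 0.0 for an unviewed listing is a simplification chosen for the sketch; a production ranker would more likely fall back to a predicted CTR, which is the problem the rest of the deck addresses.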
CTR for new job listings • Avg CTR = 2.29% • MLE would have high variance
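To see why the maximum likelihood estimate (clicks divided by views) is noisy for a new listing, one can look at its standard error. The sketch below assumes each view is an independent Bernoulli trial with click probability equal to the 2.29% average; that binomial model is an assumption made here for illustration and is not stated on the slide.

```python
import math

AVG_CTR = 0.0229  # average CTR quoted on the slide


def mle_std_error(p: float, views: int) -> float:
    """Standard error of clicks/views under a binomial model (assumption)."""
    return math.sqrt(p * (1.0 - p) / views)


# Compare the estimator's spread with the quantity being estimated.
for views in (10, 100, 1_000, 10_000):
    se = mle_std_error(AVG_CTR, views)
    print(f"views={views:>6}  std_error={se:.4f}  relative_to_ctr={se / AVG_CTR:.2f}")
```

With only a handful of views the standard error exceeds the CTR itself, which is why the talk turns to features from similar and related jobs instead of the listing's own counts.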
Related work • Regelson and Fain • Estimate CTR using topic clusters (job categories) • Richardson et al. • Describe features for predicting CTR for ads • Our baseline: avg CTR for a test job (2.29%)
Refined problem definition • Ideal: predict CTR(job j, position p, user cluster u, context c) • Obstacles: data sparsity, huge feature vector • Refinement: predict CTR(job), then use the CTR versus position curve to predict CTR(job, position)
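One way to combine a per-job CTR with a CTR versus position curve is a multiplicative adjustment. The slide does not say how the two are combined, so the formula, the function name, and the curve values below are all assumptions made for illustration.

```python
def ctr_at_position(job_ctr: float, position: int, position_curve: dict) -> float:
    """Scale a position-independent CTR by the curve value at `position`.

    Assumption: CTR(job, position) = CTR(job) * curve[position] / mean(curve),
    so that averaging over all positions gives back CTR(job).
    """
    mean_curve = sum(position_curve.values()) / len(position_curve)
    return job_ctr * position_curve[position] / mean_curve


# Hypothetical aggregate CTR observed at result positions 1 to 3.
curve = {1: 0.05, 2: 0.03, 3: 0.02}
for pos in sorted(curve):
    print(pos, round(ctr_at_position(0.0229, pos, curve), 4))
```

This factorization treats the position effect as identical for every job, which keeps the feature vector small at the cost of ignoring job-specific position behavior.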
Data set • Used HotJobs data from Aug 11, 2008 to Aug 31, 2008 to predict CTR of jobs on Sep 1, 2008 • 40K jobs from 7K+ companies • 32K as train set and 8K as test set • Jobs have location, company name, category, creation date, posting date, optional position-wise click history, job source, title, snippet, and job description.
Different models • Weka: Linear Regression and SMOReg • Treenet: Gradient Boosted Decision Trees • Feature selection: • Weka: wrapper with evaluator=linear regression and search=GreedyStepwise • Treenet: Variable importance metrics
Features • Features from Similar Jobs (60) • CTR of jobs with the same title/company/state/city+state/category, and their cardinalities, posted in the past one/two weeks or all jobs, based on the click history of the past one/two/three weeks • Features from Related Jobs (288) • CTR_mn of related jobs with m = |A-B| and n = |B-A|, and their cardinalities (0 ≤ m, n ≤ 5), posted in the past one/two weeks or all jobs, based on the click history of the past one/two/three weeks
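The related-jobs feature can be sketched as follows. The slide does not define A and B; this sketch assumes they are the sets of lowercase title words of the target job and of a candidate job. It also assumes CTR_mn is the pooled CTR (total clicks over total views) of the matching jobs. All names and data are hypothetical.

```python
def title_words(title: str) -> set:
    """Lowercased whitespace tokens of a title (assumed tokenization)."""
    return set(title.lower().split())


def relatedness(title_a: str, title_b: str) -> tuple:
    """Return (m, n) = (|A-B|, |B-A|) for the two titles' word sets."""
    a, b = title_words(title_a), title_words(title_b)
    return len(a - b), len(b - a)


def related_ctr(target_title: str, history, m: int, n: int):
    """Pooled CTR and cardinality of jobs whose titles differ by exactly (m, n).

    `history` holds (title, clicks, views) tuples; CTR is None without views.
    """
    clicks = views = count = 0
    for title, job_clicks, job_views in history:
        if relatedness(target_title, title) == (m, n):
            clicks += job_clicks
            views += job_views
            count += 1
    return (clicks / views if views else None), count


# Hypothetical click history for three past jobs.
history = [
    ("Java Engineer", 3, 100),
    ("Python Developer", 1, 50),
    ("Java Developer Intern", 2, 200),
]
print(relatedness("Senior Java Developer", "Java Developer"))
print(related_ctr("Java Developer", history, 1, 1))
```

Evaluating `related_ctr` for every (m, n) pair with 0 ≤ m, n ≤ 5 and for each posting and click-history window would yield a feature grid of the kind the slide describes.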
Features • Job Title Features (11) • #words, #capitalized words, isAllCaps, hasHighPunct, hasLongWords, hasNumbers, vocabulary features • Daily CTR Features for past 3 weeks (21) • Other Features (10) • Job category, age, location specificity, job source, and job description page features • Other potential features • High-marketing-pitch words, brand value of company, spam feedback, seasonal variations
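Several of the job title features listed above are simple to compute from the title string. The sketch below is illustrative only: the slide gives no thresholds, so `punct_threshold` and `long_word_len` are hypothetical parameters, and the vocabulary features are omitted because the slide does not describe them.

```python
import string


def title_features(title: str, punct_threshold: int = 2, long_word_len: int = 12) -> dict:
    """Compute surface features of a job title (thresholds are assumptions)."""
    words = title.split()
    letters = [ch for ch in title if ch.isalpha()]
    punct_count = sum(ch in string.punctuation for ch in title)
    return {
        "num_words": len(words),
        "num_capitalized_words": sum(w[0].isupper() for w in words),
        "is_all_caps": bool(letters) and all(ch.isupper() for ch in letters),
        "has_high_punct": punct_count > punct_threshold,
        "has_long_words": any(len(w) >= long_word_len for w in words),
        "has_numbers": any(ch.isdigit() for ch in title),
    }


# Hypothetical titles: one shouting, one plain.
for title in ("URGENT!!! SALES REP NEEDED - $5000/WEEK", "Senior Software Engineer"):
    print(title, title_features(title))
```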
Experiments and results • Baseline: predict avg CTR for a test job (2.29%) • Predict the category-wise avg CTR (A) • Linear regression over 390 features (B) – uses only 142 regressors. • GBDT using Treenet over 390 features (C) – uses 300 regressors (at 256_600_0.01_100).
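The baseline and predictor (A) can be sketched as averages over the training set. This is a minimal sketch under assumptions: averages are taken over per-job CTRs rather than pooled clicks and views, unseen categories fall back to the global average, and RMSE stands in for the regression metrics, which the slides do not name. The data is made up.

```python
import math
from collections import defaultdict


def fit_category_means(train):
    """train: list of (category, ctr). Returns (global_mean, {category: mean})."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, ctr in train:
        sums[category] += ctr
        counts[category] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    return global_mean, {c: sums[c] / counts[c] for c in sums}


def predict(category, global_mean, category_means):
    """Category-wise average, falling back to the global average."""
    return category_means.get(category, global_mean)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical training and test jobs as (category, observed CTR).
train = [("sales", 0.01), ("sales", 0.03), ("it", 0.04)]
test = [("sales", 0.02), ("it", 0.05), ("legal", 0.03)]

global_mean, category_means = fit_category_means(train)
actual = [ctr for _, ctr in test]
baseline = [global_mean] * len(test)
by_category = [predict(cat, global_mean, category_means) for cat, _ in test]
print("baseline RMSE:", rmse(actual, baseline))
print("category-wise RMSE:", rmse(actual, by_category))
```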
Important features • Similar Jobs features • Same company, title, city+state using 1 week click history • Other features • Creation date, job description page size, date of update, posting date, job category • Related Jobs features • Related_11, related_12 jobs posted in past 1/3 weeks over 1/3 week click history
Pruning the feature set • Wrapper-based feature selection with linear regression and with Treenet’s variable importance (E): 11 features.
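Wrapper-based selection scores candidate feature subsets by training the model itself on each one. The sketch below shows a greedy forward search with an ordinary least squares evaluator in plain Python. It is not Weka's implementation: it scores subsets on a single holdout split, the stopping threshold `min_gain` is a hypothetical parameter, and the data is synthetic.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular (collinear features).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def holdout_mse(rows, y, subset, train_idx, test_idx):
    """Fit least squares (with intercept) on `subset`; return MSE on the holdout."""
    def design(i):
        return [1.0] + [rows[i][j] for j in subset]

    k = len(subset) + 1
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for i in train_idx:
        d = design(i)
        for p in range(k):
            xty[p] += d[p] * y[i]
            for q in range(k):
                xtx[p][q] += d[p] * d[q]
    w = solve(xtx, xty)  # normal equations: (X^T X) w = X^T y
    errors = [(y[i] - sum(wp * dp for wp, dp in zip(w, design(i)))) ** 2 for i in test_idx]
    return sum(errors) / len(errors)


def greedy_forward_selection(rows, y, train_idx, test_idx, min_gain=1e-12):
    """Add the feature that lowers holdout MSE most until no feature helps."""
    selected = []
    best = holdout_mse(rows, y, selected, train_idx, test_idx)  # intercept only
    remaining = set(range(len(rows[0])))
    while remaining:
        scores = {j: holdout_mse(rows, y, selected + [j], train_idx, test_idx)
                  for j in sorted(remaining)}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] <= min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best


# Synthetic data: the target depends only on feature 0.
rows = [[i % 7, (i * i) % 5, (i * 3) % 4] for i in range(40)]
y = [0.01 + 0.002 * r[0] for r in rows]
selected, mse = greedy_forward_selection(rows, y, list(range(30)), list(range(30, 40)))
print("selected features:", selected, "holdout MSE:", mse)
```

A cross-validated evaluator would be less sensitive to the particular holdout split than the single split used here.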
In absence of click history … • Linear regression with 369 features (F) – uses 187 regressors. • Treenet uses 282 regressors at 256_600_0.01_20 (G)
Analysis of regressor distribution • None of the feature sets alone helps!
Conclusion and future work • We built a machine learning system to predict CTR for job listings and presented our results using various regression metrics. • More features • Dyadic models to predict user-personalized CTR with (job feature vector, user feature vector) dyads • Automatic model updates to correct model drift