Introduction to Dating Competition COMP621U
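First National College Student Data Mining Invitational Competition (第一届全国大学生数据挖掘邀请赛)
• http://www.statmodelingcompetition.com
• March 22, 2011 ~ April 27, 2011
• Sponsors
  • 上海花千树信息科技有限公司 (Shanghai Huaqianshu Information Technology Co., Ltd.)
  • 世纪佳缘 (Jiayuan) http://www.jiayuan.com/
• Co-organizers
  • School of Management, University of Science and Technology of China
  • School of Statistics, Renmin University of China
  • 统计之都 (Capital of Statistics, COS) website
• Goal
  • Provide an intelligent member-recommendation algorithm for a large dating website aimed at marriage, improving recommendation accuracy and increasing site stickiness
• To submit at the defense: paper and source code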
Workflow (User A, User B)
• Step 1: the system recommends (“rec”) user B to user A
• Step 2: user A clicks (“click”) the photo of user B (or ignores it)
• Step 3: user A sends a message (“msg”) to user B (or ignores it)
Relevance score
• 2: “msg”
• 1: “click”
• 0: “rec”
Impact: recommendations can make a difference to people's whole lives
train.txt
• 8,599,012 lines
• 15,000 unique USER_ID_A
• 55,871 unique USER_ID_B
• 59,921 unique users (10,950 overlapped)
• Action counts (they sum to the 8,599,012 lines of train.txt): “rec”: 8,366,058 (97.29%); “click”: 184,291 (2.14%); “msg”: 48,663 (0.57%)
test.txt
• 3,311,076 lines
• 10,433 unique USER_ID_A
• 54,409 unique USER_ID_B
• 57,352 unique users (7,490 overlapped)
Open questions
• How to make use of “ROUND”? -> sequential information/constraint (?) -> only take the highest relevance (?)
• How to make use of “REC_TIMES” (in the last three months)?
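One of the options raised above, keeping only the highest relevance observed for each (user A, user B) pair, can be sketched as follows. This is a minimal illustration, not the competition's prescribed preprocessing: the `(user_a, user_b, action)` tuple layout and the toy IDs are assumptions, since the actual column layout of train.txt is not given on the slides.

```python
# Relevance scores as defined on the Workflow slide.
RELEVANCE = {"rec": 0, "click": 1, "msg": 2}


def highest_relevance(records):
    """Collapse repeated (user_a, user_b) records to their highest relevance.

    `records` is an iterable of (user_a, user_b, action) tuples. This tuple
    layout is an assumption, not the documented train.txt format.
    """
    best = {}
    for user_a, user_b, action in records:
        score = RELEVANCE[action]
        key = (user_a, user_b)
        # Keep the larger score if the pair was already seen.
        if score > best.get(key, -1):
            best[key] = score
    return best


# Hypothetical toy records: B1 shows up for A1 with three different actions.
sample = [
    ("A1", "B1", "rec"),
    ("A1", "B1", "click"),
    ("A1", "B1", "msg"),
    ("A1", "B2", "rec"),
    ("A2", "B1", "click"),
]
labels = highest_relevance(sample)
print(labels)
```

Taking the maximum discards the ordering information carried by “ROUND”, so it is the simpler of the two options and not necessarily the better one.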
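Overlap between train and test
• ALL USER-B (57,133): 2,724 only in train; 53,147 common; 1,262 only in test
• TRAIN-A: 15,000
• TEST-A: 7,546 also appear in TRAIN-A; the remaining 2,887 do not
• Pure CF can help those 7,546 TEST-A
• User profiles (?) for the other 2,887 TEST-A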
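The split of TEST-A into users that pure CF can serve and users that need profile information is a plain set operation. The sketch below uses hypothetical toy IDs; the only figures taken from the slides are the counts in the final consistency check.

```python
def split_test_users(train_a, test_a):
    """Split TEST-A into users seen in training and users that are new."""
    train_a = set(train_a)
    test_a = set(test_a)
    seen = test_a & train_a    # interaction history exists, so CF can be used
    unseen = test_a - train_a  # no history, only profile information
    return seen, unseen


# Hypothetical toy IDs, for illustration only.
train_ids = ["u1", "u2", "u3", "u4"]
test_ids = ["u3", "u4", "u5"]
seen, unseen = split_test_users(train_ids, test_ids)
print(sorted(seen), sorted(unseen))

# Consistency check on the counts reported on the slides.
slide_counts_ok = (7546 + 2887 == 10433) and (2724 + 53147 + 1262 == 57133)
print(slide_counts_ok)
```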
User Profile (profile_m.txt, profile_f.txt)
• Reduce to the problem of “learning to rank”
  • Extract a feature vector from each (user A, user B) pair
  • Extract the relevance score from the action (“msg”, “click”, “rec”)
• All users have profile information
• Male # vs. Female # is quite balanced
• ALL: Male: 344,552; Female: 203,843
• We can learn more about the data distribution
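The reduction to learning to rank described above can be sketched as building one (query, feature vector, label) row per recommended pair, with user A acting as the query. The profile fields `age`, `city`, and `education` are hypothetical placeholders: the real columns of profile_m.txt and profile_f.txt are not listed on the slides, and the three features are illustrative only.

```python
RELEVANCE = {"rec": 0, "click": 1, "msg": 2}


def pair_features(profile_a, profile_b):
    """Build a small feature vector for a (user A, user B) pair.

    The fields used here (age, city, education) are assumed, not taken
    from the competition's profile files.
    """
    return [
        float(profile_b["age"] - profile_a["age"]),              # signed age gap
        1.0 if profile_a["city"] == profile_b["city"] else 0.0,  # same city?
        float(profile_b["education"] - profile_a["education"]),  # education gap
    ]


def build_ranking_samples(records, profiles):
    """Turn (user_a, user_b, action) records into (query, features, label) rows."""
    rows = []
    for user_a, user_b, action in records:
        features = pair_features(profiles[user_a], profiles[user_b])
        rows.append((user_a, features, RELEVANCE[action]))
    return rows


# Hypothetical toy data.
profiles = {
    "A1": {"age": 30, "city": "Shanghai", "education": 3},
    "B1": {"age": 28, "city": "Shanghai", "education": 3},
    "B2": {"age": 35, "city": "Beijing", "education": 2},
}
records = [("A1", "B1", "msg"), ("A1", "B2", "rec")]
rows = build_ranking_samples(records, profiles)
for row in rows:
    print(row)
```

Grouping rows by user A mirrors the evaluation, which ranks the USER-Bs of each USER-A separately.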
Evaluation and Submission
• What to submit?
  • One line per USER-A: an ordered list of the corresponding USER-Bs
• Performance evaluation: NDCG@10
  • Average NDCG@10 over the 10,433 TEST USER-A
  • If NDCG@10 is comparable, NDCG@20 is also considered
  • If the NDCG performance is very similar, the committee will also consider other issues for real deployment
[Formula: NDCG, annotated with its cumulating gain and position discount terms]
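To make the metric concrete, here is a minimal NDCG@k sketch. The slide's exact formula did not survive, so the gain `2^rel - 1` and the `log2(position + 1)` discount used here are assumptions (a common textbook form) and may differ from the competition's official definition.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k positions.

    Assumed form: gain 2^rel - 1, discount log2(position + 1),
    with positions counted from 1.
    """
    return sum(
        (2 ** rel - 1) / math.log2(index + 2)
        for index, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the submitted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # the user has no "click" or "msg" at all
    return dcg_at_k(relevances, k) / ideal


def mean_ndcg(per_user_relevances, k=10):
    """Average NDCG@k over all test USER-As."""
    scores = [ndcg_at_k(rels, k) for rels in per_user_relevances]
    return sum(scores) / len(scores)


# Hypothetical example: relevance of each USER-B in the submitted order.
perfect = [2, 1, 0, 0]
swapped = [0, 2]
print(ndcg_at_k(perfect), ndcg_at_k(swapped))
print(mean_ndcg([perfect, swapped]))
```

The input to `ndcg_at_k` is the list of true relevance scores (2, 1, 0) read off in the order the submission ranks the USER-Bs.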
Discussion
• Learning to rank, CF (+content), association rule mining (since lots of features are categorical)
• Transductive (semi-supervised) learning
• More study of the data distributions of the training and test sets is needed (to check whether there is a significant mismatch)
• Temporal information/constraints
• One very important piece of information is missing: USER-B's photo
  • A latent factorization approach may help alleviate this a bit
• Is there some information we can crawl from http://www.jiayuan.com/ ?
• Shall we incorporate some prior knowledge as constraints (e.g. “门当户对”, i.e. matching family and social backgrounds)?
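The latent factorization idea mentioned in the discussion can be illustrated with a small matrix factorization trained by stochastic gradient descent. This is a generic sketch under stated assumptions, not a method taken from the slides: the hyperparameters, the squared-error objective, and the toy data are all illustrative, and separate factor tables are kept for the USER-A and USER-B roles.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def train_mf(ratings, n_factors=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit latent factors p[user_a] and q[user_b] by SGD on squared error.

    `ratings` is a list of (user_a, user_b, relevance) triples. All
    hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    p, q = {}, {}
    for user_a, user_b, _ in ratings:
        if user_a not in p:
            p[user_a] = [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
        if user_b not in q:
            q[user_b] = [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
    for _ in range(epochs):
        for user_a, user_b, rel in ratings:
            err = rel - dot(p[user_a], q[user_b])
            for f in range(n_factors):
                pa, qb = p[user_a][f], q[user_b][f]
                # Gradient step with L2 regularization on both factors.
                p[user_a][f] += lr * (err * qb - reg * pa)
                q[user_b][f] += lr * (err * pa - reg * qb)
    return p, q


def mse(ratings, p, q):
    """Mean squared error of the factor model on the given triples."""
    return sum((r - dot(p[a], q[b])) ** 2 for a, b, r in ratings) / len(ratings)


# Hypothetical toy relevance data (0 = rec, 1 = click, 2 = msg).
toy = [("A1", "B1", 2), ("A1", "B2", 0), ("A2", "B1", 1),
       ("A2", "B2", 2), ("A3", "B2", 1)]
p0, q0 = train_mf(toy, epochs=0)  # untrained factors, for comparison
p, q = train_mf(toy)
print(mse(toy, p0, q0), mse(toy, p, q))
```

Because the learned factors summarize behaviour rather than profile fields, they can partly stand in for signals, such as the photo, that are absent from the data.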
User vs. Product: e.g. KDDCUP2011, Netflix
User vs. User: e.g. Dating competition
(1) Recommend people to people (much higher social impact)
(2) The “like-minded” assumption in CF may not hold
(3) Proximity: asymmetric vs. symmetric (a new recommendation model is needed)
(4) The content information (e.g. user profile) is definitely very important
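Others
• Q: Can students from Hong Kong, Macau, and Taiwan take part?
• A: Yes, they are welcome.
• Q: How do I obtain the modeling dataset? May I pass the dataset on to others?
• A: The dataset may be used only for analysis and modeling in this competition, and only by registered online users. It may not be used for any other commercial purpose. For academic research or paper publication, please contact 上海花千树信息科技有限公司 to obtain authorization. The competition committee has no authority to grant authorization.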