
Introduction to Dating Competition

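Introduction to Dating Competition. COMP621U. The First National College Student Data Mining Invitational Competition (第一届全国大学生数据挖掘邀请赛). http://www.statmodelingcompetition.com March 22, 2011 ~ April 27, 2011. Sponsors: Shanghai Huaqianshu Information Technology Co., Ltd. (上海花千树信息科技有限公司) and Jiayuan (世纪佳缘) http://www.jiayuan.com/. Co-organizers: School of Management, University of Science and Technology of China; School of Statistics, Renmin University of China; the Capital of Statistics (统计之都, COS) website. Goal: provide an intelligent member-recommendation algorithm for a large marriage-oriented dating website, improving recommendation precision and increasing site stickiness.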



Presentation Transcript


  1. Introduction to Dating Competition COMP621U

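  2. The First National College Student Data Mining Invitational Competition (第一届全国大学生数据挖掘邀请赛)
     • http://www.statmodelingcompetition.com
     • March 22, 2011 ~ April 27, 2011
     • Sponsors
       • Shanghai Huaqianshu Information Technology Co., Ltd. (上海花千树信息科技有限公司)
       • Jiayuan (世纪佳缘) http://www.jiayuan.com/
     • Co-organizers
       • School of Management, University of Science and Technology of China
       • School of Statistics, Renmin University of China
       • The Capital of Statistics (统计之都, COS) website
     • Goal
       • Provide an intelligent member-recommendation algorithm for a large marriage-oriented dating website, improving recommendation precision and increasing site stickiness
     • To be submitted at the defense: paper and source code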

  3. Workflow (User A and User B)
     • Step 1: the system recommends (“rec”) user B to user A
     • Step 2: user A clicks (“click”) the photo of user B (or ignores it)
     • Step 3: user A sends a message (“msg”) to user B (or ignores)
     • Relevance score
       • 2: “msg”
       • 1: “click”
       • 0: “rec”
     • Impact: a recommendation can make a difference to someone's whole life
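The relevance scale on this slide can be sketched as a small lookup. The snippet below is a minimal illustration, not the competition's own code: the event tuple layout `(user_a, user_b, action)` and the function name `relevance_by_pair` are assumptions, and keeping only the highest score per pair is just one possible way to collapse repeated recommendations.

```python
# Relevance scale taken from the slide: "msg" = 2, "click" = 1, "rec" = 0.
RELEVANCE = {"rec": 0, "click": 1, "msg": 2}


def relevance_by_pair(events):
    """Keep the highest relevance seen for each (user_a, user_b) pair.

    `events` is an iterable of (user_a, user_b, action) tuples; this layout
    is a hypothetical one chosen for illustration.
    """
    best = {}
    for user_a, user_b, action in events:
        score = RELEVANCE[action]
        key = (user_a, user_b)
        # Only overwrite when the new action is stronger than what we have.
        if score > best.get(key, -1):
            best[key] = score
    return best


sample = [("a", "b", "rec"), ("a", "b", "msg"), ("a", "c", "click")]
labels = relevance_by_pair(sample)
```

Here user A was shown user B twice; the pair keeps the stronger "msg" label.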

  4. train.txt: 8,599,012 lines
     • 15,000 unique USER_ID_A
     • 55,871 unique USER_ID_B
     • 59,921 unique users (10,950 overlapped)
     test.txt: 3,311,076 lines
     • 10,433 unique USER_ID_A
     • 54,409 unique USER_ID_B
     • 57,352 unique users (7,490 overlapped)
     Action distribution (the counts sum to the 8,599,012 lines of train.txt)
     • “rec”: 8,366,058 (97.29%)
     • “click”: 184,291 (2.14%)
     • “msg”: 48,663 (0.57%)
     Open questions
     • How to make use of “ROUND”? As sequential information or a constraint (?), or only take the highest relevance (?)
     • How to make use of “REC_TIMES” (in the last three months)?
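The figures on this slide are internally consistent, which a few lines of arithmetic confirm. The sketch below only re-derives numbers already quoted on the slide; the variable and function names are illustrative.

```python
# Action counts quoted on the slide.
train_actions = {"rec": 8_366_058, "click": 184_291, "msg": 48_663}

# The three counts add up to the number of lines in train.txt.
total = sum(train_actions.values())

# Percentage share of each action, rounded to two decimals as on the slide.
shares = {action: round(100 * n / total, 2) for action, n in train_actions.items()}


def unique_users(n_a, n_b, overlap):
    """Inclusion-exclusion: users seen as A or B, counting the overlap once."""
    return n_a + n_b - overlap


train_unique = unique_users(15_000, 55_871, 10_950)
test_unique = unique_users(10_433, 54_409, 7_490)
```

The strong skew toward “rec” (label 0) is worth keeping in mind when choosing a ranking loss or sampling negatives.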

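  5. [Figure: Venn diagrams of user overlap between the training and test sets]
     • ALL USER-B (57,133): 2,724 only in train, 53,147 common (“Comm.”), 1,262 only in test
     • TRAIN-A: 15,000 users
     • TEST-A: 7,546 + 2,887 = 10,433 users; the diagram appears to show 7,546 of them overlapping TRAIN-A
     • Pure CF can help those 7,546 TEST-A
     • User profiles (?) for the other 2,887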

  6. User Profile (profile_m.txt, profile_f.txt)
     • Reduce to the problem of “learning to rank”
       • Extract a feature vector from each (user A, user B) pair
       • Extract the relevance score from the action (“msg”, “click”, “rec”)
     • All users have profile information
     • Male # vs. Female # is quite balanced
     • ALL: Male: 344,552; Female: 203,843
     • We can learn more about the data distribution
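A minimal sketch of the learning-to-rank reduction described above: each (user A, user B) pair becomes a feature vector, and the action becomes its label. The profile fields `age`, `city` and `education` are hypothetical; the transcript does not list the columns of profile_m.txt or profile_f.txt, and the three features are only placeholders for whatever the real profiles support.

```python
RELEVANCE = {"rec": 0, "click": 1, "msg": 2}  # scale from the workflow slide


def pair_features(profile_a, profile_b):
    """Build a feature vector for one (user A, user B) pair.

    The keys "age", "city" and "education" are assumed for illustration.
    """
    return [
        float(profile_b["age"] - profile_a["age"]),          # signed age gap
        1.0 if profile_a["city"] == profile_b["city"] else 0.0,
        1.0 if profile_a["education"] == profile_b["education"] else 0.0,
    ]


def training_rows(events, profiles):
    """Turn (user_a, user_b, action) events into (query_id, features, label).

    User A plays the role of the query, so rows sharing a user A are ranked
    against each other.
    """
    rows = []
    for user_a, user_b, action in events:
        features = pair_features(profiles[user_a], profiles[user_b])
        rows.append((user_a, features, RELEVANCE[action]))
    return rows


profiles = {
    "a": {"age": 30, "city": "Shanghai", "education": "BSc"},
    "b": {"age": 27, "city": "Shanghai", "education": "MSc"},
}
rows = training_rows([("a", "b", "click")], profiles)
```

Rows in this shape (query id, features, label) are what most learning-to-rank toolkits expect as input.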

  7. Evaluation and Submission
     • What to submit?
       • Each line (one per USER-A): an ordered list of the corresponding USER-Bs
     • Performance evaluation: NDCG@10
       • Average NDCG@10 over the 10,433 TEST USER-As
       • If NDCG@10 is comparable, NDCG@20 is also considered
       • The committee will also consider other issues for real deployment if the NDCG performance is very similar
     • [Formula: NDCG, with its cumulative gain and position discount terms labelled]
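The metric can be sketched as follows. The transcript does not preserve the competition's exact formula, so this uses one common formulation as an assumption: gain `2^rel - 1` and a `log2(position + 1)` discount with positions starting at 1. Scoring a user with no positive labels as 0.0 is also an assumption.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list.

    Assumed formulation: gain 2**rel - 1, discount log2(position + 1),
    with positions counted from 1 (so enumerate's index + 2).
    """
    return sum(
        (2 ** rel - 1) / math.log2(index + 2)
        for index, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    # Assumption: a list with no positive labels scores 0.0.
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(ranked_lists, k=10):
    """Average NDCG@k over users; each list holds labels in submitted order."""
    return sum(ndcg_at_k(r, k) for r in ranked_lists) / len(ranked_lists)


# Labels (2 = "msg", 1 = "click", 0 = "rec") in the order a submission ranks them.
perfect = ndcg_at_k([2, 1, 0])
swapped = ndcg_at_k([0, 2])
```

Passing `k=20` gives the tie-breaking NDCG@20 mentioned on the slide.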

  8. Discussion
     • Learning to rank, CF (+content), association rule mining (since lots of features are categorical)
     • Transductive (semi-supervised) learning
     • More study of the data distributions of the training and test sets is needed (whether there is a significant mismatch)
     • Temporal information/constraints
     • One very important piece of information is missing: USER-B’s photo
       • A latent factorization approach may help alleviate this a bit
     • Is there some information we can crawl from http://www.jiayuan.com/ ?
     • Shall we incorporate some prior knowledge as constraints (e.g. “门当户对”, i.e. matching family and social background)?

  9. User vs. Product (e.g. KDDCUP2011, Netflix) compared with User vs. User (e.g. Dating competition)
     (1) Recommend people to people (much higher social impact)
     (2) The “like-minded” assumption in CF may not hold
     (3) Proximity: asymmetric vs. symmetric (new recommendation model needed)
     (4) The content information (e.g. user profile) is definitely very important

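  10. Others
     • Q: Can students from Hong Kong, Macau and Taiwan take part?
     • A: Yes, they are welcome.
     • Q: How do I obtain the modeling dataset? May I pass the dataset on to others?
     • A: The dataset may only be used for analysis and modeling in this competition, and only by registered online users. It must not be used for any other commercial purpose. For academic research or paper publication, please contact Shanghai Huaqianshu Information Technology Co., Ltd. (上海花千树信息科技有限公司) to obtain authorization. The competition committee has no authority to grant it.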
