
Download Estimation for KDD Cup 2003


Presentation Transcript


  1. Download Estimation for KDD Cup 2003 • Janez Brank and Jure Leskovec • Jožef Stefan Institute, Ljubljana, Slovenia

  2. Task Description • Inputs: • Approx. 29000 papers from the “high energy physics – theory” area of arxiv.org • For each paper: • Full text (TeX file, often very messy) • Metadata in a nice, structured file (authors, title, abstract, journal, subject classes) • The citation graph (excludes citations pointing outside our dataset)

  3. Task Description • Inputs (continued): • For papers from 6 months (the training set, 1566 papers) • The number of times this paper was downloaded during its first two months in the archive • Problem: • For papers from 3 months (the test set, 678 papers), predict the number of downloads in their first two months in the archive • Only the 50 most frequently downloaded papers from each month will be used for evaluation!

  4. Our Approach • Textual documents have traditionally been treated as “bags of words” • The number of occurrences of each word matters, but the order of the words is ignored • Efficiently represented by sparse vectors • We extend this to include other items besides words (“bag of X”) • Most of our work was spent trying various features and adjusting their weight (more on that later) • Use support vector regression to train a linear model, which is then used to predict the download counts on test papers
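To make the "bag of X" idea concrete, here is a minimal sketch of the pipeline, assuming scikit-learn and scipy as stand-ins for the tooling actually used (the slides do not name it): text becomes a sparse count vector, non-text items become extra weighted columns, and a linear support vector regressor is trained on the combined matrix.

```python
# Minimal sketch of the "bag of X" + linear SVR pipeline.
# scikit-learn / scipy are assumptions; the slides do not say which
# SVM regression package was actually used.
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

# Toy stand-ins for the real abstracts and download counts.
abstracts = ["we study the holographic principle", "brane world models"]
downloads = [2927, 1351]

# Bag-of-words block: occurrence counts, word order ignored, sparse.
X_words = CountVectorizer().fit_transform(abstracts)

# Other items (here a hypothetical in-degree column) are appended as
# extra columns, down-weighted so they do not dominate the text block.
X_extra = csr_matrix([[0.005 * 10], [0.005 * 3]])
X = hstack([X_words, X_extra])

# Linear support vector regression on the combined representation.
model = LinearSVR().fit(X, downloads)
print(model.predict(X))
```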

  5. A Few Initial Observations • Our predictions will be evaluated on the 50 most downloaded papers from each month — about 20% of all papers from these months • It’s OK to be horribly wrong on other papers • Thus we should be optimistic, treating every paper as if it were in the top 20% • Maybe we should train the model using only the 20% most downloaded training papers • Actually, 30% usually works a little better • To evaluate a classifier, we look at the 20% most downloaded test papers

  6. Cross-Validation • Split the 1566 labeled papers into 10 folds • Train the model on the 30% most frequently downloaded papers from the 9 training folds (approx. 423 of approx. 1409 papers) • Evaluate on the 20% most frequently downloaded papers from the held-out fold (approx. 31 of approx. 157 papers) • Lather, rinse, repeat (10 times); report the average
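A sketch of this cross-validation loop, assuming numpy arrays `X` and `y` and a `make_model` factory (all hypothetical names), could look like this:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, make_model, train_frac=0.30, eval_frac=0.20):
    """10-fold CV as on the slide: train on the most downloaded 30% of
    the training folds, score on the most downloaded 20% of the held-out
    fold, report the average absolute error."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        # Keep only the 30% most frequently downloaded training papers.
        tr = train_idx[np.argsort(y[train_idx])[::-1]]
        tr = tr[: int(len(tr) * train_frac)]
        model = make_model().fit(X[tr], y[tr])

        # Evaluate on the 20% most frequently downloaded test papers.
        te = test_idx[np.argsort(y[test_idx])[::-1]]
        te = te[: int(len(te) * eval_frac)]
        errors.append(np.mean(np.abs(model.predict(X[te]) - y[te])))
    return np.mean(errors)
```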

  7. A Few Initial Observations • We are interested in the downloads within 60 days since inclusion in the archive • Most of the downloads occur within the first few days, perhaps a week • Most are probably coming from the “What’s new” page, which contains only: • Author names • Institution name (rarely) • Title • Abstract • Citations probably don’t directly influence downloads in the first 60 days • But they show which papers are good, and the readers perhaps sense this in some other way from the authors / title / abstract

  8. The Rock Bottom • The trivial model: always predict the average download count (computed on the training data) • Average download count: 384.2 • Average error: 152.5 downloads
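As a sanity check, the trivial model is a one-liner; the sketch below uses made-up counts rather than the real data:

```python
import numpy as np

def trivial_baseline(train_downloads, test_downloads):
    """Always predict the mean training download count (384.2 on the
    real training data) and report the average absolute error."""
    prediction = np.mean(train_downloads)
    return prediction, np.mean(np.abs(test_downloads - prediction))

# Toy illustration with made-up counts.
pred, err = trivial_baseline(np.array([100, 500, 300]), np.array([250, 400]))
print(pred, err)
```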

  9. Abstract • Abstract: use the text of the abstract and title of the paper in the traditional bag-of-words style • 19912 features • No further feature selection etc. • This part of the vector was normalized to unit length (Euclidean norm = 1) • Average error: 149.4
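One way to build this block (the exact tokenization is not given on the slide; CountVectorizer is an assumption) is a plain bag-of-words followed by row-wise unit-length scaling:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Bag-of-words over abstract + title, then scale each row of this
# block to unit Euclidean length.
texts = ["The holographic principle. We review the principle ...",
         "Brane New World. A tentative construction ..."]
X_abstract = normalize(CountVectorizer().fit_transform(texts), norm="l2")
```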

  10. Author • One attribute for each possible author • Preprocessing to tidy up the original metadata: Y.S. Myung and Gungwon Kang → myung-y, kang-g • xa is nonzero iff a is one of the authors of the paper x • This part is normalized to unit length • 5716 features • Average error: 146.4
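A guess at the normalization rule implied by the example above (surname plus first initial, lowercased); the real preprocessing is not spelled out on the slide:

```python
import re

def author_features(author_field):
    """Tidy a raw author string into canonical tokens such as 'myung-y'
    (surname plus first initial, lowercased). This is a guess at the
    normalization; the original preprocessing is not fully specified."""
    tokens = []
    for name in re.split(r"\s+and\s+|,", author_field):
        parts = name.strip().replace(".", " ").split()
        if len(parts) >= 2:
            tokens.append(f"{parts[-1].lower()}-{parts[0][0].lower()}")
    return tokens

print(author_features("Y.S. Myung and Gungwon Kang"))
# ['myung-y', 'kang-g']
```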

  11. Address • Intuition: people are more likely to download a paper if the authors are from a reputable institution • Admittedly, the “What’s new” page usually doesn’t mention the institution • Nor is it provided in the metadata; we had to extract it from TeX files (messy!) • Words from the address are represented using the bag-of-words model • But they get their own namespace, separate from the abstract and title words • This part of the vector is also normalized to unit length • Average error: 154.0 (worse than useless)
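A very rough sketch of what pulling address words out of the TeX might look like; the affiliation markers below are assumptions, not the authors' actual heuristics:

```python
import re

def address_words(tex_source):
    """Pull institution words out of messy TeX by looking for common
    affiliation markers and keeping the words that follow them.
    The marker names are assumptions for illustration only."""
    pattern = r"\\(?:address|affiliation|institute)\s*\{([^}]*)\}"
    words = []
    for match in re.findall(pattern, tex_source, flags=re.IGNORECASE):
        words.extend(re.findall(r"[A-Za-z]+", match.lower()))
    return words

print(address_words(r"\address{Jozef Stefan Institute, Ljubljana}"))
```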

  12. Abstract, Author, Address • We used Author + Abstract (“AA” for short) as the baseline for adding new features

  13. Using the Citation Graph • InDegree, OutDegree • These are quite large in comparison to the text-based features (average indegree = approx. 10) • We must use weighting, otherwise they will appear too important to the learner • InDegree is useful • OutDegree is largely useless (which is reasonable) • [Results chart: AA + InDegree]
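In practice this weighting just means scaling the raw degree column before appending it to the unit-length text blocks; a toy sketch (the helper name is made up, and 0.005 is the InDegree weight quoted on a later slide):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def weighted_degree_column(indegrees, weight=0.005):
    """Scale the raw in-degree so it does not swamp the unit-length
    text blocks (0.005 is the InDegree weight quoted on a later slide)."""
    return csr_matrix(weight * np.asarray(indegrees, dtype=float).reshape(-1, 1))

# X_aa stands in for the Author + Abstract block; a placeholder here.
X_aa = csr_matrix(np.eye(3))
X = hstack([X_aa, weighted_degree_column([10, 2, 25])])
```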

  14. Using the Citation Graph • InLinks = add one feature for each paper i; it will be nonzero in vector x iff the paper x is referenced by the paper i • Normalize this part of the vector to unit length • OutLinks = the same, nonzero iff x references i (results on next slide)
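A sketch of the InLinks construction (hypothetical helper; OutLinks is the same with the pair order swapped):

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def link_features(citations, papers):
    """Build the InLinks block: one column per paper i, nonzero in the
    row of paper x iff x is cited by i; each row is then normalized to
    unit length. `citations` is a list of (citing, cited) pairs."""
    index = {p: j for j, p in enumerate(papers)}
    rows = [index[cited] for citing, cited in citations]   # row = cited paper x
    cols = [index[citing] for citing, cited in citations]  # column = citing paper i
    X = csr_matrix(([1.0] * len(rows), (rows, cols)),
                   shape=(len(papers), len(papers)))
    return normalize(X, norm="l2")

# Toy usage: paper "b" cites paper "a".
print(link_features([("b", "a")], ["a", "b"]).toarray())
```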

  15. InDegree, InLinks, OutLinks

  16. Using the Citation Graph • Use HITS to compute a hub value and an authority value for each paper (two new features) • Compute PageRank and add this as a new feature • Bad: all links point backwards in time (unlike on the web) — PageRank accumulates in the earlier years • InDegree, Authority, and PageRank are strongly correlated, no improvement over previous results • Hub is strongly correlated with OutDegree, and is just as useless
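For illustration, both quantities can be computed with networkx (an assumption; the slides do not say which implementation was used):

```python
import networkx as nx

# Citation graph sketch: edges point from the citing paper to the cited paper.
G = nx.DiGraph([("p1", "p2"), ("p1", "p3"), ("p3", "p2")])

hubs, authorities = nx.hits(G)   # two features per paper
pagerank = nx.pagerank(G)        # one more feature per paper
print(authorities["p2"], pagerank["p2"])
```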

  17. Journal • The “Journal” field in the metadata indicates that the paper has been (or will be?) published in a journal • Present in about 77% of the papers • Already in standardized form, e.g. “Phys. Lett.” (never “Physics Letters”, “Phys. Letters”, etc.) • There are over 50 journals, but only 4 have more than 100 training papers • Papers from some journals are downloaded more often than from others: • JHEP 248, J. Phys. 104, global average 194 • Introduce one binary feature for each journal (+ one for “missing”)
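One way to build these binary features, using DictVectorizer purely as an illustrative stand-in:

```python
from sklearn.feature_extraction import DictVectorizer

# One binary feature per journal, plus one for papers whose metadata
# has no journal field.
journals = ["Phys. Lett.", "JHEP", None, "JHEP"]
records = [{"journal": j if j is not None else "missing"} for j in journals]
vec = DictVectorizer()
X_journal = vec.fit_transform(records)
print(vec.get_feature_names_out())  # e.g. ['journal=JHEP', 'journal=Phys. Lett.', ...]
```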

  18. Journal

  19. Miscellaneous Statistics • TitleCc, TitleWc: number of characters/words in the title • The most frequently downloaded papers have relatively short titles: The holographic principle (2927 downloads) Twenty Years of Debate with Stephen (1540) Brane New World (1351) A tentative theory of large distance physics (1351) (De)Constructing Dimensions (1343) Lectures on supergravity (1308) A Short Survey of Noncommutative Geometry (1246)

  20. Miscellaneous Statistics • Average error: 119.561 for weight = 0.02 • The model says that the number of downloads decreases by 0.96 for each additional letter in the title :-) • TitleWc is useless

  21. Miscellaneous Statistics • AbstractCc, AbstractWc: number of characters/words in the abstract • Both useless • Number of authors (useless) • Year (actually Year – 2000) • Almost useless (reduces error from 119.56 to 119.28)

  22. Clustering • Each paper was represented by a sparse vector (bag-of-words, using the abstract + title) • Use 2-means to split into two clusters, then split each of them recursively • Stop splitting if one of the two clusters would have < 600 documents • We ended up with 18 clusters • Hard to say if they’re meaningful (ask a physicist?) • Introduce one binary feature for each cluster (useless) • Also a feature (ClusDlAvg) to contain the average no. of downloads over all the training documents from the same cluster • Reduces error from 119.59 to 119.30
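A sketch of the recursive 2-means splitting described above, with the 600-document stopping rule (hypothetical helper built on scikit-learn's KMeans):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, min_size=600):
    """Recursively split with 2-means; keep a cluster unsplit if one of
    its two halves would fall below `min_size` documents."""
    clusters, queue = [], [np.arange(X.shape[0])]
    while queue:
        idx = queue.pop()
        if len(idx) < 2 * min_size:        # any split would violate the rule
            clusters.append(idx)
            continue
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        left, right = idx[labels == 0], idx[labels == 1]
        if min(len(left), len(right)) < min_size:
            clusters.append(idx)
        else:
            queue.extend([left, right])
    return clusters
```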

  23. Tweaking and Tuning • AA + 0.005 InDegree + 0.5 InLinks + 0.7 OutLinks + 0.3 Journal + 0.02 TitleCc/5 + 0.6 (Year – 2000) + 0.15 ClusDlAvg: 29.544 / 119.072 • The “C” parameter for SVM regression was fixed at 1 so far • C = 0.7, AA + 0.006 InDegree + 0.7 InLinks + 0.85 OutLinks + 0.35 Journal + 0.03 TitleCc/5 + 0.3 ClusDlAvg: 31.805 / 118.944 • This is the one we submitted
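The combination itself is just a weighted horizontal stacking of the per-group feature blocks; a toy sketch (hypothetical helper, with LinearSVR standing in for the SVM regression package actually used):

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.svm import LinearSVR

def combine(blocks, weights):
    """Stack per-feature-group blocks side by side, each scaled by its
    hand-tuned weight (the values quoted on the slide)."""
    return hstack([w * b for b, w in zip(blocks, weights)]).tocsr()

# Toy example: two blocks, weights 1.0 (AA) and 0.006 (InDegree).
X = combine([csr_matrix([[1.0, 0.0], [0.0, 1.0]]), csr_matrix([[10.0], [3.0]])],
            [1.0, 0.006])
model = LinearSVR(C=0.7)  # C was fixed at 1 initially, 0.7 in the submitted run
model.fit(X, [300, 150])
```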

  24. A Look Back…

  25. Conclusions • It’s a nasty dataset! • The best model is still disappointingly inaccurate • …and not so much better than the trivial model • Weighting the features is very important • We tried several other features (not mentioned in this presentation) that were of no use • Whatever you do, there’s still so much variance left • SVM learns well enough here, but it can’t generalize well • It isn’t the trivial sort of overfitting that could be removed simply by decreasing the C parameter in SVM’s optimization problem

  26. Further Work • What is it that influences readers’ decisions to download a paper? • We are mostly using things they can see directly: author, title, abstract • But readers are also influenced by their background knowledge: • Is X currently a hot topic within this community? (Will reading this paper help me with my own research?) • Is Y a well-known author? How likely is the paper to be any good? • It isn’t easy to catch these things, and there is a risk of overfitting
