
Automated Scientific Paper Classification






  1. Automated Scientific Paper Classification Linlin Jia

  2. Outline • Motivation • Related Work • Problem Setting • Basic Idea

  3. Motivation • Search and organize papers into the categories required by different needs • Improving the precision of Web searching • Community Information Management (DBLife / Libra / DBRef) • Personal Information Management • Paper-reviewer dispatch • Any application requiring paper organization or selective and adaptive document dispatching • Mining topic trends and key factors in the evolution of research

  4. Outline • Motivation • Related Work • Problem Setting • Basic Idea

  5. Related Work • Knowledge engineering (1960s) • Machine learning (since the 1990s) • Naive Bayes • K-nearest neighbors • Support vector machines • Maximum entropy • Neural networks • Decision trees • Similarity measures • Bag-of-words • Cosine • Okapi • Drawback of content-based methods
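The bag-of-words and cosine measures named on this slide can be illustrated with a minimal sketch. The whitespace tokenizer and the function names `bag_of_words` and `cosine_similarity` are assumptions made for illustration; the slides do not specify a tokenizer or a weighting scheme.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count term frequencies."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = bag_of_words("paper classification with support vector machines")
    d2 = bag_of_words("citation based paper classification")
    print(cosine_similarity(d1, d2))
```

Raw term counts are used here for brevity; a weighting such as TF-IDF or Okapi would replace the counts without changing the cosine step.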

  6. Related Work • Measures of the relationship between two documents (web pages/papers) • small1973 • Co-citation: papers A and B are associated because they are both cited by C, D, E and F. • Kessler1963 • Bibliographic coupling: A and B are related because they cite papers C, D, E and F. • Amsler1972 • Amsler: A and B are related if (1) A and B are cited by the same paper, or (2) A and B cite the same paper, or (3) A cites a third paper C that cites B. • DeanH1999 • Companion algorithm (extends HITS) • [Figures: citation graphs over papers A to I illustrating these measures]
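The three citation-based relations on this slide can be sketched as follows. The citation graph is assumed to be a dictionary mapping each paper to the set of papers it cites; that representation and the function names are illustrative choices, not taken from the slides.

```python
def citing_papers(cites, target):
    """Papers whose reference list contains `target`."""
    return {p for p, refs in cites.items() if target in refs}


def co_citation(cites, a, b):
    """Number of papers that cite both a and b."""
    return len(citing_papers(cites, a) & citing_papers(cites, b))


def bibliographic_coupling(cites, a, b):
    """Number of papers cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def amsler_related(cites, a, b):
    """True if any of the three conditions listed on the slide holds."""
    if co_citation(cites, a, b) > 0:             # (1) cited by the same paper
        return True
    if bibliographic_coupling(cites, a, b) > 0:  # (2) cite the same paper
        return True
    # (3) a cites a third paper c that cites b
    return any(b in cites.get(c, set()) for c in cites.get(a, set()))


if __name__ == "__main__":
    # C, D, E and F all cite both A and B, as in the co-citation example.
    cited_together = {p: {"A", "B"} for p in "CDEF"}
    print(co_citation(cited_together, "A", "B"))
    # A and B both cite C, D, E and F, as in the coupling example.
    coupled = {"A": set("CDEF"), "B": set("CDEF")}
    print(bibliographic_coupling(coupled, "A", "B"))
```

Here `amsler_related` returns only a yes/no answer, matching the slide's wording; a graded similarity score would need a normalization that the slides do not give.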

  7. Related Work • Hybrid methods • PMENBM03 • Combining link-based and content-based methods using a Bayesian network • CaladoCMZNG • Combining the decisions of linkage and text classifiers using a belief network strategy • Fusion of evidence • JoachimsCT2001 • Studies linear combinations of support vector machine kernel functions representing co-citation and textual information
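As a rough illustration of a linear combination of kernels, the sketch below forms a weighted sum of a text kernel matrix and a co-citation kernel matrix. The convex form with a single `weight` parameter, the function name `combine_kernels`, and the toy matrices are assumptions; the slides do not state the exact combination used in JoachimsCT2001.

```python
def combine_kernels(k_text, k_link, weight):
    """Return weight * k_text + (1 - weight) * k_link, entry by entry.

    Both inputs are square kernel (Gram) matrices over the same papers,
    given as lists of rows. With weight in [0, 1] the result is a convex
    combination, which is again a valid kernel matrix.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * t + (1.0 - weight) * l for t, l in zip(row_t, row_l)]
        for row_t, row_l in zip(k_text, k_link)
    ]


if __name__ == "__main__":
    k_text = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical text similarities
    k_link = [[1.0, 1.0], [1.0, 1.0]]   # hypothetical co-citation similarities
    print(combine_kernels(k_text, k_link, 0.5))
```

The combined matrix could then be passed to any SVM implementation that accepts a precomputed kernel.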

  8. Related Work • ZhangGFCFCC2004 • ZhangCFFGCC2005 • Non-linear similarity functions found through Genetic Programming techniques • VelosoMCGZ2006 • Rule-based combination • Drawbacks of the above methods • Low precision when the data set has low link density • Not multi-label • Only high-level categories • Need a big testing set

  9. Outline • Motivation • Related Work • Problem Setting • Basic Idea

  10. Problem Setting • Definition • C = {c1, c2, c3, …, cn} is a set of predefined categories. • D = {d1, d2, d3, …, dm} is a set of scientific papers. • Φ: D × C → {T, F} • The metadata of the papers is stored in a database. • The categories are not just symbolic labels; their meaning is available. • Some exogenous knowledge (i.e., data provided for classification purposes by an external source) is available. In particular, metadata such as publication date, document type, and publication source is assumed to be available.
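The definition Φ: D × C → {T, F} can be read as a yes/no decision for every (paper, category) pair, so a single paper may receive several labels. The sketch below shows that reading with a deliberately naive, hypothetical Φ based on keyword overlap; the keyword sets, the `title` field, and both function names are illustrative assumptions, not the method proposed in the slides.

```python
def make_keyword_phi(category_keywords):
    """Build a toy Phi: True when a paper's title shares a word with the
    keywords describing the category (the category's 'meaning')."""
    def phi(paper, category):
        words = set(paper["title"].lower().split())
        return bool(words & category_keywords[category])
    return phi


def labels_for(phi, paper, categories):
    """All categories c with Phi(paper, c) = T; may be empty or have several."""
    return [c for c in categories if phi(paper, c)]


if __name__ == "__main__":
    categories = ["databases", "machine learning"]
    keywords = {
        "databases": {"query", "index"},
        "machine learning": {"classification", "learning"},
    }
    phi = make_keyword_phi(keywords)
    paper = {"title": "Learning to index query logs"}
    print(labels_for(phi, paper, categories))
```

Evaluating Φ independently per category is what makes the setting multi-label, in contrast to assigning each paper exactly one category.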

  11. Outline • Motivation • Related Work • Problem Setting • Basic Idea

  12. Analysis • Shortcomings of existing work • Cannot interpret the results • Do not use network-based machine learning methods • Need a big data set and high link density • Extend the source • Authors with different backgrounds • Cross topics • Multi-label • Topic evolution • Time factor

  13. Basic Idea • Ci = <L, Di> • L: label • Di: a set of papers classified under L (known papers of user i and other papers in directories named L) • [Figure: papers d1 to d5 linked to categories c1 to c4; each category is a user directory in DBRef]

  14. Basic Idea • Step 1: extended content-based method • Extend the text content using CiteSeer to overcome the limitation of a small data set • Step 2: extended link-based method • Add extra links to overcome the limitation of a low-density data set • Step 3: combine

  15. Basic Idea • [Figure: graph over nodes A, B, C, D, E and F]

  16. Author Information • Social network (co-author network) • How to combine the social network and the citation network? • Method 1 • Compute the distance between P1(A,B,C,D,E) and P2(A,C,B,D,E) • Compute P(ci|dist)

  17. Time Information • MourãoRA2008 • How to express the effect of the temporal factor? • Does the temporal factor affect the result of the link-based method?

  18. Citation Text Information • CiteSeer • Citation text from papers external to our collection will be added

  19. Location Information • One word at different locations • Experiment: abstract • A word that occurs frequently should be deleted • Experiment: keywords/general terms • The main content of paper is exp. • One citation at different locations • Citing A in the introduction/background section • Citing A in the experiments section
