Bibliometric Impact Measures Leveraging Topic Analysis

Gideon Mann, David Mimno, Andrew McCallum
Computer Science Department, University of Massachusetts Amherst

Goal: Measure the impact of papers and research subfields.
Important for:
• Researchers understanding their own field.
• Libraries deciding which journals to purchase.
• Personnel committees deciding on hiring, promotion, awards.

Typical Impact Measures • Citation Count • Garfield’s Journal Impact Factor

Why are topical divisions useful in bibliometrics?
[Figure: citation counts for Biochemistry and molecular biology compared with Mathematics. Source: Journal Citation Reports (2004).]
Can you compare the tallest building in NY to the tallest building in Stamford, CT?


Why not use Journal as a proxy for Topic?
• Journals are not necessarily about one topic.
• Topics may not have their own journal.
• Open access publishing is on the rise.
• 5% of the 200 most-cited papers in CiteSeer are tech reports!
• Spidered web documents often do not include venue information.

This Paper / Talk Outline
• Discovering fine-grained, interpretable topics from text: Topical N-Grams, a phrase-discovering enhancement to LDA.
• 8 impact measures leveraging topics, with analysis on 1.5 million research papers and their citations: a quick tour of 8 impact measures with examples.
• Where did we get all this data from? An introduction to Rexa, a new sibling of CiteSeer, Google Scholar, etc.

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative process:
• For each document: sample a distribution over topics, θ (example: 70% Iraq war, 30% US election).
• For each word in the document: sample a topic, z (example: Iraq war), then sample a word from the topic, w (example: “bombing”).
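The generative process above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the two toy topics, the four-word vocabulary, the symmetric Dirichlet prior `alpha`, and all function names are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Sample an index in proportion to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Follow the LDA generative story for a single document."""
    theta = sample_dirichlet(alpha, rng)            # per-document topic distribution
    tokens = []
    for _ in range(n_words):
        z = sample_index(theta, rng)                # sample a topic for this word
        w = sample_index(topic_word_probs[z], rng)  # sample a word from that topic
        tokens.append((z, vocab[w]))
    return theta, tokens


# Hypothetical toy model: topic 0 stands for "Iraq war", topic 1 for "US election".
VOCAB = ["bombing", "troops", "ballot", "campaign"]
TOPIC_WORD_PROBS = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

rng = random.Random(0)
theta, tokens = generate_document(TOPIC_WORD_PROBS, VOCAB, [1.0, 1.0], 10, rng)
```

Each token keeps its topic label so the sampled document can be inspected alongside θ.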

Inference and Estimation
• Gibbs Sampling:
  • Easy to implement
  • Reasonably fast
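To show why Gibbs sampling is easy to implement, here is a sketch of the standard collapsed Gibbs sampler for LDA. The slide does not give the update rule, so the conditional used here (document-topic count times topic-word count, smoothed by hyperparameters `alpha` and `beta`) is the textbook form, and the function name, defaults, and toy corpus are assumptions.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                               # tokens assigned to topic k overall
    assignments = []

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            update(d, w, z, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all other assignments.
                update(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = z
                update(d, w, z, +1)
    return assignments, doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
DOCS = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2]]
assignments, doc_topic, topic_word = gibbs_lda(DOCS, n_topics=2, vocab_size=4, n_iters=50)
```

The whole sampler is three count tables and one resampling step per token, which is what makes it straightforward to scale up.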

Example topics induced from a large collection of text [Tennenbaum et al]
• JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
• SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
• BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
• FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
• STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
• MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
• DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
• WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER

Topics Modeling Multi-word Phrases
• Topics based only on unigrams are sometimes difficult to interpret.
• Topic discovery itself is confused because important meanings and distinctions are carried by phrases.

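Topical N-gram Model [Wang, McCallum 2005]
[Figure: graphical model with hyperparameters α, β, γ; a per-document topic distribution θ; per-token variables z1...z4, y1...y4, and w1...w4 inside a document plate D; and parameters σ, ψ, φ inside plates indexed by T and W.]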

Features of Topical N-Grams model • Easily trained by Gibbs sampling • Can run efficiently on millions of words • Topic-specific phrase discovery • “white house” has special meaning as a phrase in the politics topic, • ... but not in the real estate topic.

A Topic Comparison
• LDA: algorithms, algorithm, genetic, problems, efficient
• Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Topic Comparison
• LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
• Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
• Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Example Results on our Corpus
Corpus: Over 1.6 million titles & abstracts from CS papers. Use topic analysis to select a subset of AI: machine learning, NLP, robotics, vision, etc. Also have citation links.
[Tables: sample Topical N-gram topics; sample LDA topics.]

Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.

Impact Measures Leveraging Topics • Topical Citation count • Topical Impact factor • Topical Diffusion • Topical Diversity • Topical Half-life • Topical Precedence • Topical H-factor • Topical Transfer

Topical Citation Count

Impact Factor
Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
[Table: 2004 impact factors from JCR.]
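Substituting topic for journal in the definition above gives a topical impact factor. The sketch below assumes each paper carries a single topic label and a publication year; the data layout, paper ids, and function name are hypothetical, not taken from the talk.

```python
def topical_impact_factor(papers, citations, topic, year):
    """Citations made in `year` to the topic's papers from the two preceding
    years, divided by the number of those papers.

    papers:    dict mapping paper id -> (publication year, topic label)
    citations: iterable of (citing id, cited id) pairs
    """
    window = {year - 2, year - 1}
    targets = {pid for pid, (y, t) in papers.items() if t == topic and y in window}
    if not targets:
        return 0.0
    n_cites = sum(
        1
        for citing, cited in citations
        if cited in targets and citing in papers and papers[citing][0] == year
    )
    return n_cites / len(targets)


# Hypothetical toy data.
PAPERS = {
    "a": (2002, "RL"),
    "b": (2003, "RL"),
    "c": (2004, "RL"),
    "d": (2004, "NLP"),
    "e": (2003, "NLP"),
}
CITATIONS = [("c", "a"), ("d", "a"), ("d", "b"), ("e", "a"), ("c", "e")]

rl_2004 = topical_impact_factor(PAPERS, CITATIONS, "RL", 2004)    # 3 citations / 2 papers
nlp_2004 = topical_impact_factor(PAPERS, CITATIONS, "NLP", 2004)  # 1 citation / 1 paper
```

The citation from "e" to "a" is ignored for 2004 because the citing paper was published in 2003.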

Topical Impact Factor over time

Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100.
Problem: Relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
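A short sketch makes the brittleness concrete. It applies the diffusion formula with topics in place of journals, assuming one topic label per paper and counting every citing topic (including the topic itself); the names and toy data are hypothetical.

```python
def topical_diffusion(topics, citations, topic):
    """Distinct citing topics divided by total citations to the topic, times 100.

    topics:    dict mapping paper id -> topic label
    citations: iterable of (citing id, cited id) pairs
    """
    citing_topics = [topics[citing] for citing, cited in citations if topics[cited] == topic]
    if not citing_topics:
        return 0.0
    return 100.0 * len(set(citing_topics)) / len(citing_topics)


# Hypothetical toy data: "rare" is cited twice, "popular" four times,
# and both are cited by the same two topics.
TOPICS = {
    "r1": "rare", "p1": "popular",
    "x1": "IR", "x2": "IR", "y1": "NLP", "y2": "NLP",
}
CITATIONS = [
    ("x1", "r1"), ("y1", "r1"),
    ("x1", "p1"), ("x2", "p1"), ("y1", "p1"), ("y2", "p1"),
]

rare_diffusion = topical_diffusion(TOPICS, CITATIONS, "rare")        # 2 topics / 2 citations
popular_diffusion = topical_diffusion(TOPICS, CITATIONS, "popular")  # 2 topics / 4 citations
```

The rarely cited topic reaches the maximum score of 100 even though both topics are cited by exactly the same two topics.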

Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
[Table: topics ranked by Diffusion and by Diversity.]
• Diffusion: These are just the least cited topics!
• Diversity: Better at capturing broad end of impact spectrum
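Topic diversity can be sketched as the Shannon entropy of the citing-topic distribution. The slide does not state the logarithm base, so base 2 (bits) is an assumption here, as are the function name and the toy inputs.

```python
from collections import Counter
from math import log2


def topic_diversity(citing_topics):
    """Entropy (in bits) of the distribution of topics that cite a topic or paper.

    citing_topics: list with one topic label per incoming citation
    """
    total = len(citing_topics)
    if total == 0:
        return 0.0
    counts = Counter(citing_topics)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Hypothetical toy inputs.
one_topic = topic_diversity(["IR"] * 4)                                # all citations from one topic
two_topics = topic_diversity(["IR", "NLP", "IR", "NLP"])               # evenly split over two topics
four_topics = topic_diversity(["IR", "NLP", "Vision", "Robots"] * 2)   # evenly split over four topics
```

Unlike diffusion, entropy is bounded by the log of the number of citing topics, so a topic with only two citations cannot outscore one that is cited evenly by many topics.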

Broad Impact: Diversity, for papers
Topic Diversity: Entropy of the distribution of citing topics

Topical Longevity: Cited Half Life
Two views:
• Given a paper, what is the median age of citations to that paper?
• What is the median age of citations from current literature?
Collaborative Filtering is young, fast moving. Maximum Entropy looks further back, but is still producing new work. Neural Networks literature is aging.
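Both views reduce to a median over citation ages, where the age of a citation is the citing paper's year minus the cited paper's year. The sketch below assumes one topic label and one publication year per paper; the function names and toy data are hypothetical.

```python
from statistics import median


def cited_half_life(years, citations, paper):
    """View 1: median age of the citations received by one paper."""
    ages = [years[citing] - years[paper] for citing, cited in citations if cited == paper]
    return median(ages) if ages else None


def citing_half_life(years, topics, citations, topic, current_year):
    """View 2: median age of the citations made by a topic's current literature."""
    ages = [
        years[citing] - years[cited]
        for citing, cited in citations
        if topics[citing] == topic and years[citing] == current_year
    ]
    return median(ages) if ages else None


# Hypothetical toy data.
YEARS = {"p": 1995, "q": 1998, "r": 2000, "s": 2004, "t": 2004}
TOPICS = {"p": "NN", "q": "NN", "r": "NN", "s": "NN", "t": "CF"}
CITATIONS = [("q", "p"), ("r", "p"), ("s", "p"), ("s", "q"), ("s", "r"), ("t", "r")]

paper_view = cited_half_life(YEARS, CITATIONS, "p")                  # ages 3, 5, 9
nn_view = citing_half_life(YEARS, TOPICS, CITATIONS, "NN", 2004)     # ages 9, 6, 4
cf_view = citing_half_life(YEARS, TOPICS, CITATIONS, "CF", 2004)     # age 4
```

A lower value in the second view corresponds to a topic whose current papers cite recent work.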

Topical Precedence “Early-ness”
Within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
• Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
• Spectrographic study of vowel reduction, B. Lindblom (1963)
• Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
• Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
• Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Topical Precedence “Early-ness”
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval:
• On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
• Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
• Relevance feedback in information retrieval, Rocchio (1971)
• Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
• New experiments in relevance feedback, Ide (1971)
• Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
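The precedence query is a filter on citation count followed by a sort on year. A minimal sketch, assuming one topic label per paper; the paper ids, threshold, and function name are hypothetical.

```python
from collections import Counter


def topical_precedence(papers, citations, topic, n, k=5):
    """Return up to k of the earliest papers in a topic with more than n citations.

    papers:    dict mapping paper id -> (publication year, topic label)
    citations: iterable of (citing id, cited id) pairs
    """
    counts = Counter(cited for _, cited in citations)
    qualifying = [
        (year, pid)
        for pid, (year, t) in papers.items()
        if t == topic and counts[pid] > n
    ]
    return [pid for year, pid in sorted(qualifying)[:k]]


# Hypothetical toy data.
PAPERS = {
    "old": (1953, "speech"),
    "mid": (1974, "speech"),
    "new": (1990, "speech"),
    "ir": (1960, "ir"),
}
CITATIONS = [
    ("mid", "old"), ("new", "old"),
    ("new", "mid"), ("c1", "mid"), ("c2", "mid"),
    ("c1", "new"),
    ("c1", "ir"), ("c2", "ir"), ("c3", "ir"),
]

earliest_speech = topical_precedence(PAPERS, CITATIONS, "speech", n=1)
```

Raising n trades early papers for better-cited ones, which is why the threshold is left as a parameter.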

H-factor H = maximum number K for which you have K papers, each with at least K citations. ...for journals [Braun et al, 2005]
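The definition of H translates directly into a sort and a scan. The topical variant below assumes each paper has a single topic label, and how papers are restricted to a given year is left to the caller; the function names and toy data are hypothetical.

```python
def h_factor(citation_counts):
    """Largest K such that K papers each have at least K citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def topical_h_factor(paper_topics, citation_counts, topic):
    """H-factor over the papers assigned to one topic.

    paper_topics:    dict mapping paper id -> topic label
    citation_counts: dict mapping paper id -> number of citations received
    """
    counts = [citation_counts.get(pid, 0) for pid, t in paper_topics.items() if t == topic]
    return h_factor(counts)


# Hypothetical toy data.
PAPER_TOPICS = {"a": "HMM", "b": "HMM", "c": "HMM", "d": "SVM"}
CITATION_COUNTS = {"a": 5, "b": 2, "c": 2, "d": 9}

hmm_h = topical_h_factor(PAPER_TOPICS, CITATION_COUNTS, "HMM")  # counts 5, 2, 2
svm_h = topical_h_factor(PAPER_TOPICS, CITATION_COUNTS, "SVM")  # counts 9
```

A single heavily cited paper cannot raise H above 1, which is the property that distinguishes H from a raw citation count.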

Topical H-factor, Year 1990
H   Topic (topic ID)
12  Natural Language Parsing (16)
12  Neural Networks (173)
12  Speech Recognition (120)
11  Hidden Markov Models (21)
11  Genetic Algorithms (71)
11  Optical Flow (48)
10  Reinforcement Learning (83)
10  Computer Vision (49)
10  Mobile Robots (22)
9   Word Sense Disambiguation (118)
9   NLP (160)
8   Planning (35)
8   Markov Chain Monte Carlo (106)
8   Maximum Likelihood Estimators (40)
8   Genetic Algorithms (131)
7   Genetic Programming (61)

Topical H-factor, Year 1995
H   Topic (topic ID)
18  Computer Vision (49)
17  Speech Recognition (120)
15  Decision Trees (146)
15  Data Mining (176)
14  Hidden Markov Models (21)
14  Genetic Algorithms (71)
13  Markov Chain Monte Carlo (106)
13  IR And Queries (138)
12  Word Sense Disambiguation (118)
12  Web And VR (80)
12  Natural Language Parsing (16)
12  Bayesian Inference (110)
12  Reinforcement Learning (83)
12  Logic Programming (150)
12  Mobile Robots (22)
12  NLP (160)

Topical H-factor, Year 2001
H   Topic (topic ID)
15  Web Pages (129)
15  Ontologies (186)
13  SVMs (50)
13  Computer Vision (49)
13  Gene Expression (126)
13  Data Mining (176)
12  Dimensionality Reduction (29)
12  Question Answering (111)
12  Search Engines (132)
11  Natural Language Parsing (16)
11  Reinforcement Learning (83)
11  Web Services (184)
11  HCI (164)
10  Hidden Markov Models (21)
10  Word Sense Disambiguation (118)
10  IR And Queries (138)

Topical Transfer Transfer from Digital Libraries to other topics

Topical Transfer Citation counts from one topic to another. Map “producers and consumers”
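Transfer can be tabulated as a count of citations for every ordered pair of topics. In this sketch the cited topic is treated as the producer and the citing topic as the consumer; that reading, the single topic label per paper, and all names are assumptions.

```python
from collections import Counter


def topical_transfer(topics, citations):
    """Count citations for each (producer topic, consumer topic) pair.

    topics:    dict mapping paper id -> topic label
    citations: iterable of (citing id, cited id) pairs
    """
    transfer = Counter()
    for citing, cited in citations:
        transfer[(topics[cited], topics[citing])] += 1
    return transfer


# Hypothetical toy data.
TOPICS = {"a": "Digital Libraries", "b": "IR", "c": "IR", "d": "NLP"}
CITATIONS = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b"), ("d", "c")]

transfer = topical_transfer(TOPICS, CITATIONS)
from_digital_libraries = {
    consumer: n
    for (producer, consumer), n in transfer.items()
    if producer == "Digital Libraries"
}
```

Filtering the table on one producer gives the "transfer from Digital Libraries to other topics" view; filtering on one consumer gives the reverse.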

Rexa System Overview
Inputs: WWW, NSF grant DB
• Spider Web for PDFs: home-grown Java + MySQL (~1m PDF/day)
• Convert to text (with layout & format): enhanced ps2text (better word stitching, plus layout in XML)
• Extract metadata (title, authors, abstract, venue, citations; 14 fields in total): Conditional Random Fields (99% word accuracy)
• Reference resolution (of papers, authors & grants): discriminatively trained graph partitioning (competition-winning accuracy)
• Topic Analysis & other Data Mining
• Browsable Web Interface

IE from Research Papers [McCallum et al ‘99]
@article{ kaelbling96reinforcement,
  author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore",
  title = "Reinforcement Learning: A Survey",
  journal = "Journal of Artificial Intelligence Research",
  volume = "4",
  pages = "237-285",
  year = "1996",
}

(Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
[Figure: finite state model and graphical model views. Output sequence (FSM states) y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}: OTHER, PERSON, OTHER, ORG, TITLE, … Input sequence (observations) x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}: said, Jones, a, Microsoft, VP, …]
Wide-spread interest, positive experimental results in many applications:
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ‘04],…
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ‘04]
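For reference, the conditional probability that a linear-chain CRF is trained to maximize is usually written as below. The slide's own equation did not survive extraction, so this is the standard textbook form, and the symbols (weights λ_k, feature functions f_k, normalizer Z) are the conventional ones rather than a transcription of the slide.

```latex
p_{\Lambda}(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\!\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Here y is the output label sequence (for example OTHER, PERSON, OTHER, ORG, TITLE) and x is the observed word sequence.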

IE from Research Papers
Field-level F1:
• Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
• Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
• Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004]
↓ error 40% (Word-level accuracy is >99%)

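Previous Systems
[Figure: entity-relation diagram with the labels Research Paper and Cites.]

More Entities and Relations
[Figure: the diagram extended with the labels Person, Venue, University, Grant, Groups, and Expertise, alongside Research Paper and Cites.]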
