Text Mining and Bibliometric Indicators: A Pilot Study

Combining Full-text analysis & Bibliometric Indicatorsa pilot study Patrick Glenisson 1 Wolfgang Glänzel 1,2 Olle Persson 3 • Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium) • Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary) • Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

Introduction • Goal: mapping of scientific processes • Map of scientific papers • Characterization of emerging clusters • Extraction of new search keys • Using bibliometric as well as lexical indicators of ‘relatedness’ • Full-text analysis

Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

Data source • 19 full-text papers from: Scientometrics, Vol 30, Issue 3 (2004) •  special issue on 9th international conference on Scientometrics and Informetrics (Beijing, China) • Validation setup • Manual assignment in various classes ..

Data source

Research questions • Comparison text-basedmapping vs. expert classification • Extracted keywords • Comparison with bibliometric mapping

Methodology • Given a set of documents,

Methodology • Given a set of documents, • compute a representation, called index  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>

Methodology • Given a set of documents, • compute a representation, called index • to retrieve, summarize, classify or cluster them  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>

Methodology • Document processing • Remove punctuation & grammatical structure (‘Bag of words’ ) • Define a vocabulary • Identify Multi-word terms (e.g., tumor suppressor) (phrases) • Eliminate words low content (e.g., and, thus,.. ) (stopwords) • Map words with same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighing scheme and/or transformations (tf-idf,svd,..)

T 3 T 2 T 1 Similarity between documents  Salton’s cosine: vocabulary Methodology • Compute index of textual resources:

Results – Term statistics • 19 papers • 3610 withheld terms (including ~400 bigrams) • Distance Matrix (19x19) • Apply MDS • Apply Clustering

Results – MDS

Results – MDS Policy Mathematicalapproaches Webometrics

Cut-off k=4 ? • Optimal parameters ? • ‘Stability-based method’ • Quantified correspondence with expert assignments ? • ‘Rand index’ .. Results – Clustering • Hierarchical clusteringWard method

Rand index = 0.778 p-value (w.r.t to permuted data) < 10-3 ; significant Results – Peer evaluation Webometrics Mathematicalapproaches Policy

Results – Reference age Histograms per paper

Results – Reference age Histograms aggregated by expert class

Results – Ref Age vs. % Serial Scatter plot of Expert classes:Mean Reference Age vs. Percentage of Serials

Results – Term extraction • Calculation of seminal keywords for each article • Using TF-IDF weighting scheme • Normalized to norm 1 to accommodate for document length

Results – Full-text vs Abstract • Is a full-text analysis warranted • for term extraction ? • for mapping purposes ?

Less structure • Less overlap with expert classes: Full-text is an interesting sourcefor additional keywords and improved mapping Rand index = 0.6257 p-value = 0.464 ; not significant Results – Full-text vs Abstract

Conclusion • Keyword approach may be naïve • But applied in a systematic framework in combination with ‘right’ algorithms, it provides interesting clues • Complementary to bibliometric approaches • Weak indications towards benefits of using full-text articles • Future: extension of this pilot to larger samples

References • Bibliometrics; homepage Wolfgang Glänzel • http://www.steunpuntoos.be/wg.html • Bibliometrics; homepage Olle Persson • http://www.umu.se/inforsk/Staff/olle.htm • Text & Data mining; PhD thesis Patrick Glenisson • ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf • Optimal k in clustering; Stability method

Text Mining and Bibliometric Indicators: A Pilot Study