300 likes | 397 Vues
Explore the integration of full-text analysis and bibliometric indicators in scientific research. Extract new search keys, map scientific papers, and characterize emerging clusters using text-based relational analysis. Compare text-based mapping with expert classification and bibliometric mapping. Discover the methodology for processing documents, extracting terms, and computing textual resource indexes. Analyze results from term statistics, clustering, reference age histograms, and term extraction. Evaluate the benefits of using full-text analysis for keyword extraction and mapping. Consider future extensions of the study for larger samples.
E N D
Combining Full-text analysis & Bibliometric Indicatorsa pilot study Patrick Glenisson 1 Wolfgang Glänzel 1,2 Olle Persson 3 • Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium) • Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary) • Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)
Introduction • Goal: mapping of scientific processes • Map of scientific papers • Characterization of emerging clusters • Extraction of new search keys • Using bibliometric as well as lexical indicators of ‘relatedness’ • Full-text analysis
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Data source • 19 full-text papers from: Scientometrics, Vol 30, Issue 3 (2004) • special issue on 9th international conference on Scientometrics and Informetrics (Beijing, China) • Validation setup • Manual assignment in various classes ..
Research questions • Comparison text-basedmapping vs. expert classification • Extracted keywords • Comparison with bibliometric mapping
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Methodology • Given a set of documents,
Methodology • Given a set of documents, • compute a representation, called index <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>
Methodology • Given a set of documents, • compute a representation, called index • to retrieve, summarize, classify or cluster them <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>
Methodology • Document processing • Remove punctuation & grammatical structure (‘Bag of words’ ) • Define a vocabulary • Identify Multi-word terms (e.g., tumor suppressor) (phrases) • Eliminate words low content (e.g., and, thus,.. ) (stopwords) • Map words with same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighing scheme and/or transformations (tf-idf,svd,..)
T 3 T 2 T 1 Similarity between documents Salton’s cosine: vocabulary Methodology • Compute index of textual resources:
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Results – Term statistics • 19 papers • 3610 withheld terms (including ~400 bigrams) • Distance Matrix (19x19) • Apply MDS • Apply Clustering
Results – MDS Policy Mathematicalapproaches Webometrics
Cut-off k=4 ? • Optimal parameters ? • ‘Stability-based method’ • Quantified correspondence with expert assignments ? • ‘Rand index’ .. Results – Clustering • Hierarchical clusteringWard method
Rand index = 0.778 p-value (w.r.t to permuted data) < 10-3 ; significant Results – Peer evaluation Webometrics Mathematicalapproaches Policy
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Results – Reference age Histograms per paper
Results – Reference age Histograms aggregated by expert class
Results – Ref Age vs. % Serial Scatter plot of Expert classes:Mean Reference Age vs. Percentage of Serials
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Results – Term extraction • Calculation of seminal keywords for each article • Using TF-IDF weighting scheme • Normalized to norm 1 to accommodate for document length
Results – Full-text vs Abstract • Is a full-text analysis warranted • for term extraction ? • for mapping purposes ?
Less structure • Less overlap with expert classes: Full-text is an interesting sourcefor additional keywords and improved mapping Rand index = 0.6257 p-value = 0.464 ; not significant Results – Full-text vs Abstract
Conclusion • Keyword approach may be naïve • But applied in a systematic framework in combination with ‘right’ algorithms, it provides interesting clues • Complementary to bibliometric approaches • Weak indications towards benefits of using full-text articles • Future: extension of this pilot to larger samples
References • Bibliometrics; homepage Wolfgang Glänzel • http://www.steunpuntoos.be/wg.html • Bibliometrics; homepage Olle Persson • http://www.umu.se/inforsk/Staff/olle.htm • Text & Data mining; PhD thesis Patrick Glenisson • ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf • Optimal k in clustering; Stability method