
Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, Bart De Moor (frizo.janssens@esat.kuleuven.be)


Presentation Transcript


  1. Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis Frizo Janssens, Wolfgang Glänzel, Bart De Moor frizo.janssens@esat.kuleuven.be Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren

  2. Overview of the presentation • Introduction • General context & objectives • Clustering • Text mining framework • Bibliometrics, citation analysis • Hybrid (integrated) clustering • Linear combination • Fisher’s inverse chi-square method • Dynamic hybrid mapping of bioinformatics • Conclusions • Further research

  3. General context • Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining • Complementary views on the document set → other perceptions of similarity • Textual information: number of words in common • Citation networks, bibliometric properties • Goal: • Integrate text mining & bibliometrics (hybrid approach) • Better clustering and classification performance • Mapping the cognitive structure and dynamics of bioinformatics

  4. Agglomerative hierarchical clustering • Example: 20 ‘objects’ (Person 1 … Person 20; 10 women, 10 men?) described by features: hair color, length, interest in football • Some features have more discriminative power than others (?) • Distance matrix between objects (e.g., Euclidean) • Agglomerative hierarchical clustering with a ‘linkage’ criterion → binary tree, (hypothetical) dendrogram over P1 … P20, cut into 2 clusters
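The pipeline on this slide (feature vectors → Euclidean distance matrix → agglomerative merging with a linkage criterion → cut into 2 clusters) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the six `people` rows and their feature values are made up, the features are assumed to be already scaled to [0, 1], and average linkage is only one possible choice of linkage.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        return sum(d) / len(d)  # average linkage

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = cluster_distance(clusters[a], clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical objects: (length, hair color code, interest in football),
# all assumed to be scaled to [0, 1].
people = [
    (0.20, 0.1, 0.0),
    (0.30, 0.2, 0.1),
    (0.25, 0.9, 0.2),
    (0.80, 0.1, 0.9),
    (0.90, 0.8, 1.0),
    (0.85, 0.2, 0.8),
]

two_clusters = agglomerative(people, n_clusters=2)
print(sorted(sorted(c) for c in two_clusters))
```

In this toy data, hair color varies inside both groups, so the split is driven by the two more discriminative features.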

  5. Indexing in Vector Space Model • Digital documents (Doc 1, Doc 2, Doc 3, …, Doc n; example shown: ‘Towards Mapping Library and Information Science’) → text extraction (.txt) • Neglect structure, stop word removal, stemming, phrase detection, … • ‘Bags of words’ remain • ‘Indexing’, weighting (e.g., TF-IDF) → term-by-document matrix A (rows: vocabulary terms; columns: documents) • Similarity between documents = cosine of angle between vectors

     Term-by-document matrix A
               Doc 1   Doc 2   Doc 3   ...   Doc n
     Term 1    0.4     0.2     0       ...   0
     Term 2    0.1     0.55    0       ...   0
     Term 3    0.25    0       0.12    ...   0
     Term 4    0       0.16    0.24    ...   0.03
     ...       ...     ...     ...     ...   ...
     Term m    0       0.21    0       ...   0.42
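As a rough sketch of the indexing step, the snippet below builds TF-IDF vectors from bags of words and compares documents by the cosine of the angle between their vectors. The three token lists are invented and assumed to be already stop-word filtered and stemmed, and the weighting (raw term frequency times log(n/df)) is only one common TF-IDF variant, not necessarily the one used in the presentation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one TF-IDF vector per doc)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already preprocessed bags of words.
docs = [
    ["protein", "structur", "predict"],
    ["protein", "fold", "predict"],
    ["microarray", "gene", "express"],
]

vocab, vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing two terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
print(round(sim_01, 3), round(sim_02, 3))
```

Each column of the term-by-document matrix A corresponds to one entry of `vectors`; a text-based distance can then be taken as one minus the cosine similarity.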

  6. Bibliometrics and network analysis • Bibliographic coupling (figure: two documents, x and y)
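Bibliographic coupling relates two documents x and y through the references they both cite. The sketch below counts shared references and turns the count into a distance; the cosine-style normalization and the `1 - similarity` conversion are assumptions for illustration, and the reference identifiers are invented.

```python
import math

def coupling_strength(refs_x, refs_y):
    """Number of cited references that documents x and y have in common."""
    return len(set(refs_x) & set(refs_y))

def coupling_distance(refs_x, refs_y):
    """1 - normalized coupling (shared refs / sqrt(|Rx| * |Ry|)); an assumed normalization."""
    rx, ry = set(refs_x), set(refs_y)
    if not rx or not ry:
        return 1.0  # a document without references is coupled to nothing
    return 1.0 - len(rx & ry) / math.sqrt(len(rx) * len(ry))

# Hypothetical reference lists of two documents.
refs_x = ["r1", "r2", "r3", "r4"]
refs_y = ["r2", "r3", "r5", "r6"]

print(coupling_strength(refs_x, refs_y))  # shared references: r2 and r3
print(coupling_distance(refs_x, refs_y))
```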

  7. Hybrid (integrated) clustering • Integrate complementary information • Textual content • Citations • Other bibliometric indicators • Intermediate integration • Pairwise distances calculated in separate spaces • Incorporated before clustering

  8. Hybrid clustering: intermediate integration • Text-based distance matrix Dtext (documents × documents) • Distance matrix based on bibliometrics Dbibl (documents × documents): co-citation or bibliographic coupling • Integrated distance matrix Di (documents × documents), using: • Weighted linear combination • Fisher’s inverse chi-square method • Hierarchical clustering using: • Text-based distances • Distances based on co-citation or bibliographic coupling • Integrated distances • Internal validation: number of clusters? • Dendrogram • Silhouette curves • Silhouette plot • Stability diagram
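Silhouette values can be computed directly from any of the distance matrices above (text-based, bibliometric, or integrated), which is what makes them usable for internal validation. The sketch below uses the standard silhouette definition on a precomputed distance matrix; the 4 × 4 matrix and the two candidate labelings are made up, and at least two clusters are assumed.

```python
def silhouette_values(dist, labels):
    """Silhouette value per object from a precomputed distance matrix.

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Hypothetical distance matrix: documents 0-1 and 2-3 form two tight pairs.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]

good = silhouette_values(dist, [0, 0, 1, 1])
bad = silhouette_values(dist, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

Repeating this for several candidate numbers of clusters and plotting the mean silhouette value gives a silhouette curve of the kind listed on the slide.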

  9. Weighted linear combination (linco) • Di = α · Dtext + (1 − α) · Dbibl • Attractive, easy, and scalable • However, neglects differences in distributional characteristics! • Figure: histograms of mutual distances (<1) based on text (left) and BC (right) • Unequal or unfair contribution of data sources • Implicitly favoring text over bibliometric information or vice versa
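The linear combination Di = α · Dtext + (1 − α) · Dbibl is an element-wise weighted average of two distance matrices over the same documents. A minimal sketch, with invented 2 × 2 matrices and the hypothetical function name `linear_combination`:

```python
def linear_combination(d_text, d_bibl, alpha=0.5):
    """Element-wise alpha * d_text + (1 - alpha) * d_bibl for equally sized matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * t + (1.0 - alpha) * b for t, b in zip(row_t, row_b)]
        for row_t, row_b in zip(d_text, d_bibl)
    ]

# Hypothetical distance matrices for the same two documents.
d_text = [[0.0, 0.4], [0.4, 0.0]]
d_bibl = [[0.0, 1.0], [1.0, 0.0]]

d_integrated = linear_combination(d_text, d_bibl, alpha=0.5)
print(d_integrated)
```

The weight α is applied to raw distances, so if one source has most of its distances bunched in a narrow range, it contributes differently than α alone suggests; this is the distributional issue the slide warns about.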

  10. Fisher’s inverse chi-square method • ‘Omnibus statistic’ from statistical meta-analysis • Combine p-values from multiple sources • Freed from distributional differences • Avoids overcompensation of either data source

  11. Fisher’s inverse chi-square method • ‘Real’ text data (documents × terms) → distance matrix Dt (documents × documents) • ‘Real’ citation data (documents × citations) → distance matrix Dbc (documents × documents) • Randomize text data and citation data → cdf (cumulative share) of distances in the randomized data • p-value p1 of each real text distance and p-value p2 of each real citation distance w.r.t. the randomized cdf • Fisher’s omnibus: pi = −2 · log(p1^λ · p2^(1−λ)) • Integrated p-values (documents × documents) → Di
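The two steps on this slide, turning each real distance into a p-value against distances from randomized data and then combining the two p-values with pi = −2 · log(p1^λ · p2^(1−λ)), can be sketched as follows. The randomized distance samples are invented, and the `(rank + 1) / (n + 1)` smoothing is an assumption added here only to keep p-values away from zero; the transcript does not say how the combined statistic is mapped to the final Di, so that step is left out.

```python
import bisect
import math

def empirical_p_value(distance, randomized_sorted):
    """Share of randomized distances <= the real distance (smoothed to avoid 0)."""
    rank = bisect.bisect_right(randomized_sorted, distance)
    return (rank + 1) / (len(randomized_sorted) + 1)

def fisher_omnibus(p1, p2, lam=0.5):
    """Weighted Fisher statistic: -2 * log(p1**lam * p2**(1 - lam))."""
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))

# Hypothetical distances obtained from randomized text and citation data.
random_text = sorted([0.50, 0.60, 0.70, 0.80, 0.90])
random_bc = sorted([0.90, 0.95, 0.97, 0.99, 1.00])

# One document pair with real distances in both spaces.
p1 = empirical_p_value(0.55, random_text)
p2 = empirical_p_value(0.92, random_bc)
statistic = fisher_omnibus(p1, p2, lam=0.5)
print(p1, p2, statistic)
```

Because each distance is replaced by its rank against its own randomized distribution, text and citation evidence enter the combination on the same scale, whatever their raw distance histograms look like.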

  12. Fisher’s inverse chi-square method • Histogram of pairwise document distances for text and BC • Histogram of p-values for real data w.r.t. randomized datasets

  13. Conclusions from previous research • Text-only >> cited references • SVD greatly ameliorates results, especially for text (LSI) • Best performance: integration ! • Fisher's inverse chi-square • Significantly > text-only, link-only, & concatenation • No significant difference with linco’s when SVD • Generic, incorporate distances with highly dissimilar distributions • Weighted linco: good option if LSI is used • F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.

  14. Dynamic hybrid mapping of bioinformatics Total: 7401

  15. Number of clusters and LSI factors

  16. Number of clusters: stability diagram

  17. Number of clusters: link-based Silhouette values

  18. Dendrogram • 1. RNA structure prediction • 2. Protein structure prediction • 3. Systems biology & molecular networks • 4. Phylogeny & evolution • 5. Genome sequencing & assembly • 6. Gene/promoter/motif prediction • 7. Molecular DBs & annotation platforms • 8. Multiple sequence alignment • 9. Microarray analysis

  19. Dynamics

  20. Dynamic term networks

  21. Conclusions • Main contributions • Hybrid clustering (of bioinformatics) • Clustering and classification significantly improved • Generic: other application domains • Further Research • Fuzzy clustering • Semi-supervised clustering and active learning • Spectral clustering • Other matrix decompositions (e.g., NMF) • Multilinear (tensor) algebra • Mapping the world’s total yearly publication output • Detect emerging and converging clusters & hot topics • Science-technology interaction

