This tutorial covers network construction, ranking, clustering, topic modeling, and path finding using Matlab. It walks through procedures to convert bibliographic data into networks, perform citation matching, generate adjacency matrices, apply the PageRank algorithm, run clustering techniques, and use Matlab toolboxes for topic modeling and shortest-path tasks.
Informetric methods seminar Tutorial 2: Using Matlab for network construction, ranking, clustering, topic modeling, and path finding Erjia Yan
Contents • Network construction • Ranking • Clustering • Topic modeling • Path finding
From data to networks • Bibliographical data
Web of Science format • The paper-to-paper citation network is the base • Web of Science cited reference format: • First Author, Year of Publication, Abbreviated Journal Name, Volume Number, Beginning Page Number • Example: AANESTAD M, 2011, J STRATEGIC INF SYST, V20, P161 • All fields can be found in the “full record + cited references” download option • Some newer records also include a DOI; remove the DOI from the cited references to get a better match
Citation matching • For citing papers, extract these fields and format them into the Web of Science cited reference format. • Now the citing papers and the cited references have the same format • Using these two fields, construct an internal citation network that contains only those cited references that are themselves citing papers in the data set
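The matching step can also be sketched directly in Matlab instead of Access. The block below is a minimal illustration, not the tutorial's own procedure: all records, variable names, and the DOI value are hypothetical sample data, and the regular expression assumes the DOI appears as a trailing ", DOI ..." field.

```matlab
% Hypothetical sample data: fields of two citing papers in the data set.
authors = {'AANESTAD M', 'SMITH J'};
years   = [2011, 2009];
journal = {'J STRATEGIC INF SYST', 'MIS QUART'};
volume  = [20, 33];
page    = [161, 45];

% Format each citing paper as a Web of Science cited-reference string.
n = numel(authors);
keys = cell(n, 1);
for i = 1:n
    keys{i} = sprintf('%s, %d, %s, V%d, P%d', ...
        authors{i}, years(i), journal{i}, volume(i), page(i));
end

% Hypothetical citation pairs: index of the citing paper and the cited
% reference string as downloaded (the DOI shown here is made up).
citing = [2; 2];
cited  = {'AANESTAD M, 2011, J STRATEGIC INF SYST, V20, P161, DOI 10.0000/example'; ...
          'DOE A, 2001, INFORM SYST RES, V12, P1'};

% Drop a trailing DOI field (assumed layout) so the strings can match.
cited = regexprep(cited, ',\s*DOI\s.*$', '');

% Keep only the references that match a paper in the data set.
[inSet, target] = ismember(cited, keys);

% Internal citation network: C(i,j) = 1 means paper i cites paper j.
C = sparse(citing(inSet), target(inSet), 1, n, n);
```

Only the first reference survives the match, so the resulting network has a single link from paper 2 to paper 1. Real data will need extra cleaning (name variants, missing volume or page fields) that this sketch does not attempt.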
Procedures • If you can write an app for this, it would be great! • Otherwise, you can follow these instructions • Converting into • Use Access to construct the network • Create a table for the citing papers • Import the converted citation pairs into Access • Use a query to extract those pairs whose papers are in the table • Now you have the node info and the link info • Import both into Matlab
Adjacency matrices • Now we have paper-to-paper citation networks, but in order to construct, for instance, author-to-author citation or author co-citation networks, we need to use adjacency matrices. • In the paper-author matrix, rows are papers and columns are authors: a cell value of 1 at (i,j) indicates that paper i is written by author j
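One common way to derive author-level networks is by matrix multiplication. The sketch below assumes a paper-by-author matrix A and a paper-to-paper citation matrix C; both matrices and all variable names are hypothetical, and the co-citation counts use simple full counting, which is only one of several possible conventions.

```matlab
% Hypothetical paper-by-author matrix: A(i,j) = 1 if paper i is written by author j.
% Paper 1 is by authors 1 and 2, paper 2 by author 2, paper 3 by author 3.
A = sparse([1 1 2 3], [1 2 2 3], 1, 3, 3);

% Hypothetical paper citation matrix: C(i,j) = 1 if paper i cites paper j.
% Paper 2 cites paper 1; paper 3 cites papers 1 and 2.
C = sparse([2 3 3], [1 1 2], 1, 3, 3);

% Author-to-author citation: entry (a,b) counts citation links from
% papers written by author a to papers written by author b.
authorCit = A' * C * A;

% Paper co-citation: entry (j,k) counts papers that cite both j and k.
paperCoc = C' * C;
paperCoc = paperCoc - diag(diag(paperCoc));   % drop each paper's pairing with itself

% Author co-citation: aggregate paper co-citation counts to authors.
authorCoc = A' * paperCoc * A;
```

Diagonal entries of the author matrices count self-citation and self-co-citation; whether to keep or remove them depends on the analysis.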
Procedures • Convert into • Add to the beginning of the file • Use Txt2Pajek on the linkage file • Import the edge section of the .net file into Matlab • Select M(1:n,n+1:m), where m is the number of columns. The selection is our author-paper adjacency matrix
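The selection step can be sketched as follows. This is an illustration under stated assumptions: the edge list E is hypothetical, each row is taken to be (source vertex, target vertex, weight), and the first n vertex ids are assumed to belong to one node type and the remaining ids to the other. Which type ends up in the rows depends on the column order of the linkage file.

```matlab
% Hypothetical edge section of a .net file: [source, target, weight].
% Vertices 1..3 are one node type, vertices 4..5 the other.
E = [1 4 1;
     1 5 1;
     2 5 1;
     3 4 1];

n = 3;   % number of vertices of the first type (assumed known)
m = 5;   % total number of vertices, i.e. the column size of M

% Full square matrix over all vertices.
M = sparse(E(:,1), E(:,2), E(:,3), m, m);

% The off-diagonal block links the first node type to the second.
B = M(1:n, n+1:m);
```

Here B is a 3-by-2 matrix; its transpose gives the same relation with rows and columns swapped.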
PageRank • By David Gleich of Purdue University • http://www.mathworks.com/matlabcentral/fileexchange/11613-pagerank • pagerank(M,options) • options.c: the teleportation coefficient [double | {0.85}] • options.v: the personalization vector [vector | {uniform: 1/n}]
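To show what the two options control, here is a minimal power-iteration sketch in base Matlab. It is not the File Exchange implementation and may differ from it in conventions; the example network, the row-to-column link orientation, the dangling-node handling, and the stopping rule are all assumptions made for illustration.

```matlab
% Hypothetical directed network: M(i,j) = 1 means node i links to node j.
M = sparse([1 1 2 3 4], [2 3 3 1 3], 1, 4, 4);
n = size(M, 1);

c = 0.85;            % teleportation coefficient (plays the role of options.c)
v = ones(n, 1) / n;  % personalization vector (plays the role of options.v)

% Row-normalize M to get transition probabilities.
outdeg = full(sum(M, 2));
dangling = (outdeg == 0);          % nodes without outgoing links
outdeg(dangling) = 1;              % avoid division by zero
P = spdiags(1 ./ outdeg, 0, n, n) * M;

% Power iteration: follow links with probability c, otherwise jump
% according to v; mass on dangling nodes is also redistributed via v.
x = v;
for iter = 1:1000
    xNew = c * (P' * x) + c * sum(x(dangling)) * v + (1 - c) * v;
    delta = norm(xNew - x, 1);
    x = xNew;
    if delta < 1e-10
        break;
    end
end
```

Replacing the uniform v with a vector concentrated on a few nodes biases the ranking toward those nodes and their neighborhoods.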
Built-in functions • K-means • IDX = kmeans(X,k) • http://www.mathworks.com/help/stats/kmeans.html • Hierarchical clustering • http://www.mathworks.com/help/stats/hierarchical-clustering.html
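A short usage sketch of both built-in approaches, assuming the Statistics Toolbox is installed. The data are synthetic, and the choice of two clusters, five k-means replicates, and average linkage are illustrative assumptions rather than recommendations.

```matlab
rng(1);                                  % fix the seed so the run is repeatable
X = [randn(20, 2); randn(20, 2) + 8];    % hypothetical data: two well-separated groups
k = 2;

% K-means: IDX(i) is the cluster label of row i of X.
IDX = kmeans(X, k, 'Replicates', 5);

% Hierarchical clustering: build the tree, then cut it into k clusters.
Z = linkage(X, 'average');
T = cluster(Z, 'maxclust', k);
```

Cluster labels are arbitrary, so the two methods can agree on the grouping while numbering the groups differently.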
Modularity-based clustering • By MIT Strategic Engineering • http://strategic.mit.edu/downloads.php?page=matlab_networks • [modules,module_hist,Q] = newmangirvan(adj,k) • [groups_hist,Q]=newman_comm_fast(adj)
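The Q returned by these functions is the modularity of a partition. As a hedged illustration of what that number measures, the sketch below computes the standard Newman-Girvan modularity for a fixed partition in base Matlab; it does not call the MIT toolbox, and the network and group labels are hypothetical.

```matlab
% Hypothetical undirected network: two triangles joined by one edge.
edges = [1 2; 1 3; 2 3; 4 5; 4 6; 5 6; 3 4];
n = 6;
adj = sparse(edges(:,1), edges(:,2), 1, n, n);
adj = adj + adj';                 % make the matrix symmetric

g = [1 1 1 2 2 2]';               % candidate partition: one label per node

% Q = (1/2m) * sum_ij (A_ij - k_i*k_j/(2m)) * [g_i == g_j]
k = full(sum(adj, 2));            % node degrees
twoM = sum(k);                    % twice the number of edges
B = full(adj) - (k * k') / twoM;  % observed minus expected links
same = bsxfun(@eq, g, g');        % true where two nodes share a group
Q = sum(B(same)) / twoM;
```

A higher Q means more links fall within groups than a degree-preserving random network would predict; modularity-based methods search for the partition that maximizes it.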
VOSviewer clustering • By Nees van Eck and Ludo Waltman of Leiden University • http://www.vosviewer.com/relatedsoftware/ • A variant of the modularity-based clustering technique • [X, cluster_size, V] = VOS_clustering(A, P)
Matlab Topic Modeling Toolbox • By Mark Steyvers of University of California Irvine • http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm • Input: a bag-of-words representation containing the number of times each word occurs in a document.
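A bag-of-words representation can be built in base Matlab along these lines. The documents and variable names are hypothetical, tokenization is a plain split on spaces, and the exact input variables the toolbox expects should be checked against its own documentation.

```matlab
% Hypothetical toy corpus, one string per document.
docs = {'network ranking network', ...
        'topic model topic topic', ...
        'network topic'};

% Collect every token together with the index of its document.
tokens = {};
docIdx = [];
for d = 1:numel(docs)
    w = strsplit(docs{d}, ' ');
    tokens = [tokens, w];                        %#ok<AGROW>
    docIdx = [docIdx, d * ones(1, numel(w))];    %#ok<AGROW>
end

% Map each token to a vocabulary index.
[vocab, ~, wordIdx] = unique(tokens);

% Word-by-document count matrix: counts(w,d) = occurrences of word w in document d.
% sparse() sums repeated (w,d) pairs, which yields the counts.
counts = sparse(wordIdx(:), docIdx(:), 1, numel(vocab), numel(docs));
```

The token-level vectors wordIdx and docIdx carry the same information as the count matrix, in a form that some samplers take directly.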
Bioinformatics toolbox • http://www.mathworks.com/help/bioinfo/ref/graphshortestpath.html • [dist, path, pred] = graphshortestpath(G,S,T) • finds the shortest path from node S to node T in graph G • [dist] = graphallshortestpaths(G) • finds all shortest paths in graph G; dist is a distance matrix holding the shortest-path length for each pair of nodes
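A small usage sketch of both calls, assuming a Matlab release that still ships these Bioinformatics Toolbox functions. The weighted graph is hypothetical, and the toolbox treats the sparse matrix as directed by default.

```matlab
% Hypothetical directed, weighted graph: G(i,j) is the weight of the edge i -> j.
% Edges: 1->2 (1), 1->3 (4), 2->4 (2), 3->4 (1).
G = sparse([1 1 2 3], [2 3 4 4], [1 4 2 1], 4, 4);

% Shortest path from node 1 to node 4.
% dist is its length, path the node sequence, pred the predecessor of each node.
[dist, path, pred] = graphshortestpath(G, 1, 4);

% Shortest-path lengths between every pair of nodes.
% Unreachable pairs are reported as Inf.
D = graphallshortestpaths(G);
```

In this graph the route through node 2 (total weight 3) is shorter than the route through node 3 (total weight 5).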