DBconnect: Mining Research Community on DBLP Data

Osmar R. Zaïane, Jiyang Chen, Randy Goebel. Web Mining and Social Network Analysis Workshop, held in conjunction with ACM SIGKDD (SNA-KDD'07). Presenter: 吳建良

Outline • Community • Motivation • Understand research community, recommend collaborations • Proposed Approach • Rank the relevance with a random walk approach • DBconnect • A navigational system to investigate community relations • Conclusion

What is community? • In Graph Theory: • Densely connected groups of vertices, with sparser connection between groups • In Social Network Analysis: • Groups of entities that share similar properties or connect to each other via certain relations

Why is community important? • Interesting data with community structure: • Researcher collaboration, friendship network, WWW, Massive Multi-player on-line gaming, electronic communications… • Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

Motivation • Understand the research network between authors, conferences and topics (rank entities by relevance for given entities) • Find and recommend research collaborators for given authors • Explore the academic social network

Proposed Approach • Build bipartite graph in the author-conference space • Limitation of traditional bipartite graph model • Extend the bipartite model to include co-authorship information • Further extend the model to tripartite to include topic information • Use random walk with restart on such models

An example • Author publication records in conferences • a, b, c, d, e are authors • ac(3) means that authors a and c published three papers together in the KDD(y) conference

Bipartite model for conference-author social network • Weight(edge) = publishing frequency of an author in a certain conference • Limitation: fails to represent any co-authorships • To capture the co-author relations: • Add a link between a and c → misses the role of KDD • Connect the link between a and c to KDD → makes the random walk infeasible • Add additional nodes to represent each co-author relation → impractical, since there is a huge number of such relations

Extend the bipartite model to include co-authorship information • Add a virtual level of nodes (C') to replace the conference partition, and add direction to the edges • A nodes connect to their own split relation nodes with the original weight • C' nodes connect to all author nodes • If the A node and the C' node have a co-author relation → edge weight: co-author frequency × a parameter f • Otherwise, the edge keeps its original weight • Set f = k (k is the total number of authors of a conference) • [Figure: example graph with edge weights 3, 3f and 7]

Further extend the model to tripartite to include topic information • Research topic is an important component to differentiate any research community • Authors that attend the same conferences might work on various topics

Adding topic information • Very few conference proceedings have their table of contents included in DBLP • Table of contents include session titles • Extract relevant topics from DBLP • Use paper title, and find frequent co-locations in title text • Method • Manually select a list of stopwords to remove frequently used but non-topic-related words • Ex: Towards, Understanding, Approach, …

Adding topic information (cont.) • Count the frequency of every co-located pair of stemmed words • Select the top 1000 most frequent bi-grams as topics • Manually add several tri-grams • Ex: World Wide Web, Support Vector Machine, …
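The topic-extraction steps on the last two slides can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list beyond "Towards", "Understanding" and "Approach", the `crude_stem` helper (a stand-in for a real stemmer), and the sample titles are all hypothetical and not taken from the DBconnect system.

```python
from collections import Counter
import re

# Hypothetical stopword list; the slides only name "Towards",
# "Understanding" and "Approach" as examples.
STOPWORDS = {"towards", "understanding", "approach", "a", "an", "the",
             "of", "for", "in", "on", "and", "to", "with", "using"}

def crude_stem(word):
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def title_bigrams(title):
    """Return adjacent pairs of stemmed, non-stopword tokens in a title."""
    words = [w for w in re.findall(r"[a-z]+", title.lower())
             if w not in STOPWORDS]
    stems = [crude_stem(w) for w in words]
    return list(zip(stems, stems[1:]))

def top_bigrams(titles, k):
    """Count bi-grams over all titles and keep the k most frequent."""
    counts = Counter()
    for title in titles:
        counts.update(title_bigrams(title))
    return counts.most_common(k)

# Hypothetical paper titles, for illustration only.
titles = [
    "Mining Association Rules in Large Databases",
    "Towards Fast Association Rule Mining",
    "An Approach to Association Rule Discovery",
]
print(top_bigrams(titles, 2))
```

With k = 1000 and the full set of DBLP titles, the same counting step would give the bi-gram topic list; the manually added tri-grams would be appended afterwards.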

Random walk on DBLP social network • Problem to be solved: • Given an author node a ∈ A, compute a relevance score for each author b ∈ A • Simple example: conference-author network G with a 3×5 relational matrix M (3 conferences, 5 authors)

Random walk on DBLP social network (cont.) • Normalize M such that every column sums to 1: Q(M) = col_norm(M), Q(M^T) = col_norm(M^T) • Construct the adjacency matrix J of G after normalization

Random walk on DBLP social network (cont.) • Normalized adjacency matrix J of G: $J = \begin{pmatrix} 0 & Q(M) \\ Q(M^T) & 0 \end{pmatrix}$
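A minimal NumPy sketch of the normalization and of assembling J may help here. The block layout (conference nodes first, then author nodes, with Q(M) in the upper-right block and Q(M^T) in the lower-left block) is inferred from the matrix dimensions, and the entries of `M` are hypothetical: only the SIGMOD row is chosen to reproduce the probabilities quoted on the following slides, and the third conference name `C3` is a placeholder.

```python
import numpy as np

def col_norm(X):
    """Scale every column to sum to 1 (all-zero columns stay zero)."""
    X = np.asarray(X, dtype=float)
    sums = X.sum(axis=0, keepdims=True)
    sums[sums == 0] = 1.0
    return X / sums

def build_adjacency(M):
    """Assemble J for a bipartite graph with n conferences and m authors."""
    n, m = M.shape
    J = np.zeros((n + m, n + m))
    J[:n, n:] = col_norm(M)    # columns = authors: author -> conference
    J[n:, :n] = col_norm(M.T)  # columns = conferences: conference -> author
    return J

# Hypothetical publication counts: rows KDD, SIGMOD, C3; columns a..e.
M = np.array([[3, 1, 3, 0, 0],
              [0, 4, 0, 3, 2],
              [7, 0, 0, 0, 7]])
J = build_adjacency(M)
print(J.shape)        # one row and column per node
print(J.sum(axis=0))  # every column sums to 1
```

Because every column of J sums to 1, the product Ju redistributes the probability mass in u without creating or losing any.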

Random walk on DBLP social network (cont.) • A random walk on this graph moves from one node to one of its neighbors with a certain probability • Probability: the weight of the edge divided by the sum of the weights of all edges that connect to this node • Ex: if we start from node SIGMOD, then build u as the start vector • u is a one-column vector consisting of (3+5) elements, one per node • The element corresponding to SIGMOD is set to 1; all other elements are 0

Random walk on DBLP social network (cont.) • u = Ju • After step 1 of the first iteration, the random walk hits the author nodes with b = 1×0.44, d = 1×0.33, e = 1×0.22 • After step 2 of the first iteration, the chance that the random walk goes back to SIGMOD is 0.44×0.8 + 0.33×1 + 0.22×0.22 = 0.73, and the remaining 0.27 goes to the other two conference nodes
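The arithmetic of step 2 can be spelled out as below. The factors 0.8, 1 and 0.22 are read here as the probabilities that b, d and e, respectively, step back to SIGMOD; the figures are the rounded values from the slide.

```latex
\begin{aligned}
P(\text{back at SIGMOD}) &= 0.44 \times 0.8 + 0.33 \times 1 + 0.22 \times 0.22 \\
                         &= 0.352 + 0.33 + 0.0484 \\
                         &= 0.7304 \approx 0.73, \\
P(\text{other two conferences}) &= 1 - 0.73 = 0.27 .
\end{aligned}
```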

Random walk on DBLP social network (cont.) • After a few iterations, the vector will converge and give a stable score to every node • However, these scores are always the same no matter where the walk begins • Solved by random walk with restart • Given a restarting probability c • Use another vector v, in which the element corresponding to SIGMOD is set to 1 • In each random walk iteration, the walker goes back to the start node with restart probability c: u = (1−c)u + cv

Random walk on DBLP social network (cont.) • Random walk with restart algorithm (1)
Input: node α ∈ A, a bipartite graph model G, restarting probability c, convergence threshold ε.
Output: relevance score vector B for author nodes.
1. Compute the (n+m)×(n+m) adjacency matrix J of G. /* n conferences and m authors */
2. Initialize v_α = 0, set the element for α to 1: v_α(α) = 1.
3. While (Δu_α > ε): u_α = J u_α; u_α = (1 − c) u_α + c v_α
4. Set vector B = u_α(n+1 : n+m).
5. Return B.
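Algorithm (1) could be sketched in NumPy along the following lines. Several details are assumptions rather than part of the slides: the walk is initialized with u = v, Δu is measured as the L1 norm of the change between iterations, c = 0.15 is an arbitrary default, and the matrix `M` is hypothetical example data.

```python
import numpy as np

def random_walk_with_restart(J, start, n, c=0.15, eps=1e-10, max_iter=10_000):
    """Return relevance scores of the author nodes for a given start node.

    J     : column-normalized (n+m) x (n+m) adjacency matrix
    start : index of the start node alpha in J
    n     : number of conference nodes (they occupy indices 0..n-1)
    """
    v = np.zeros(J.shape[0])
    v[start] = 1.0                     # restart vector v_alpha
    u = v.copy()                       # assumed initialization: u = v
    for _ in range(max_iter):
        u_next = (1 - c) * (J @ u) + c * v
        delta = np.abs(u_next - u).sum()   # assumed convergence measure
        u = u_next
        if delta <= eps:
            break
    return u[n:]                       # B = scores of the m author nodes

# Hypothetical counts: 3 conferences (rows) x 5 authors a..e (columns).
M = np.array([[3, 1, 3, 0, 0],
              [0, 4, 0, 3, 2],
              [7, 0, 0, 0, 7]], dtype=float)
n, m = M.shape
Q = lambda X: X / X.sum(axis=0, keepdims=True)   # column normalization
J = np.block([[np.zeros((n, n)), Q(M)],
              [Q(M.T), np.zeros((m, m))]])

B_a = random_walk_with_restart(J, start=n + 0, n=n)  # start at author a
B_d = random_walk_with_restart(J, start=n + 3, n=n)  # start at author d
print(np.round(B_a, 3))
print(np.round(B_d, 3))
```

Unlike the plain walk, the two score vectors differ, because each run keeps returning to its own start node with probability c.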

Random walk on DBLP social network (cont.) • Extend the bipartite model into a directed bipartite graph G' = (C', A, E') • A has m author nodes, and C has n conference nodes • C' is generated based on C and has n*m nodes • Assume every node in C is split into m nodes • First generate an (n*m)×m matrix M for the directional edges from C' to A • Then form an m×(n*m) matrix N for the edges from A to C'
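One possible reading of the two weight matrices is sketched below. It is an interpretation, not the authors' implementation: the arrays `W` and `co` hold hypothetical data (only "ac(3)" in KDD comes from the example), the names `c2a` and `a2c` stand in for M and N, and f is taken per conference as the number of authors who published there.

```python
import numpy as np

# Hypothetical data: W[k, j] = number of papers of author j in conference k.
W = np.array([[3, 1, 3, 0, 0],    # KDD
              [0, 4, 0, 3, 2],    # SIGMOD
              [7, 0, 0, 0, 7]],   # placeholder third conference
             dtype=float)
n, m = W.shape

# co[k, i, j] = papers co-authored by authors i and j in conference k.
co = np.zeros((n, m, m))
co[0, 0, 2] = co[0, 2, 0] = 3     # "ac(3)": a and c, three papers in KDD

def build_directed_weights(W, co):
    """Return (c2a, a2c): weights for edges C' -> A and A -> C'."""
    n, m = W.shape
    c2a = np.zeros((n * m, m))    # plays the role of M, (n*m) x m
    a2c = np.zeros((m, n * m))    # plays the role of N, m x (n*m)
    for k in range(n):
        f = np.count_nonzero(W[k])        # authors of conference k
        for i in range(m):
            s = k * m + i                 # split node of conference k for author i
            a2c[i, s] = W[k, i]           # author -> own split node, original weight
            # Co-authors get co-author frequency * f; others keep the original weight.
            c2a[s] = np.where(co[k, i] > 0, co[k, i] * f, W[k])
    return c2a, a2c

c2a, a2c = build_directed_weights(W, co)
print(c2a[0])   # outgoing weights of the split node (KDD, a)
print(a2c[0])   # outgoing weights of author a
```

In this sketch the split node (KDD, a) points to c with weight 3f, while the remaining edges keep their original weights. Split nodes of authors who never published in a conference receive no incoming weight, so the walk does not reach them.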

Random walk on DBLP social network (cont.) • The adjacency matrix J of G' • Algorithm (2): the random walk with restart algorithm for the directed bipartite model

Random walk on DBLP social network (cont.) • Extend to the tripartite graph model G'' = (C, A, T, E'') • Assume n conferences, m authors and l topics in G'' • Three corresponding matrices: U (n×m), V (m×l) and W (n×l) • The adjacency matrices of G'' are constructed from these matrices after normalization

Random walk on DBLP social network (cont.) • Algorithm (3): the random walk with restart algorithm for the tripartite model

DBLP dataset • Publication data for conferences was downloaded from the DBLP website in July 2007 • It contains more than 300,000 authors, about 3,000 conferences and the 1,000 selected N-gram topics • The entire adjacency matrix becomes too big for the random walk to be efficient • Use the METIS algorithm to partition the large graph into ten subgraphs of about the same size

The DBconnect System • http://kingman.cs.ualberta.ca/research/demos/content/dbconnect/ • A navigational system to investigate the community connections and relations • Displaying researcher statistics from academic search engines • Providing lists of recommended entities to given authors, topics and conferences

The DBconnect System (cont.) • Academic Information • Conference contribution, earliest publication year and average number of publications per year • H-index is calculated based on information retrieved from Google Scholar • Approximate citation numbers • Related Conferences • Based on author-conference-topic model • Related Topics • Based on author-conference-topic model
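For reference, the H-index itself follows the standard definition: the largest h such that h papers have at least h citations each. The sketch below assumes the per-paper citation counts are already available; how DBconnect retrieves them from Google Scholar is not covered in the slides, and the sample counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one author's papers.
print(h_index([10, 8, 5, 4, 3]))
```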

The DBconnect System (cont.) • Co-authors • Co-author name and number of papers • Related Researchers • Based on the directed bipartite graph model • Recommended Collaborators • Based on author-conference-topic model • Co-authors' names are not shown here • The result implies that the given author shares similar topics and conference experiences with the listed researchers, hence the recommendation

The DBconnect System (cont.) • Recommended To • The recommendation is not symmetric • Author A may be recommended as a possible future collaborator to author B but not vice versa • Ex: Jiawei Han has been recommended as a collaborator for 6201 authors, but only a few of them are recommended as collaborators to him • The given author has been recommended to the listed authors • Symmetric Recommendations • The listed authors have been recommended to the given author as well

Conclusion • Extend a bipartite graph model to incorporate co-authorship • Propose a random walk with restart approach • Find related conferences, authors, and topics for a given entity • Present DBconnect system • Help explore the relational structure and discover implicit knowledge within the DBLP data collection
