
Author Name Disambiguation for Citations Using Topic and Web Correlation



Presentation Transcript


  1. Author Name Disambiguation for Citations Using Topic and Web Correlation

  2. Prior work • Supervised classification approaches: model each author's patterns from a set of training data. • Unsupervised classification approaches: ambiguous citations are clustered into groups of distinct authors by measuring the similarities between the attributes in the citations.

  3. Proposed Approach • Topic Correlation • Web Correlation • Pair-Wise Grouping Algorithm

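  4. Topic Correlation • Build a topic association network 1. Use the Apriori algorithm to construct a directed graph whose edge weights are confidence values (the result is a hypergraph). 2. Use a k-way hypergraph partitioning algorithm to decompose the hypergraph into clusters. 3. These clusters are called the topic association network; the strength of the correlation between research topics is the distance between the citations in this network.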
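The confidence weights produced in the Apriori step can be illustrated with a minimal sketch. It assumes each citation has already been reduced to a set of topic labels; the labels, the helper name `rule_confidence`, and the `min_support` cutoff are hypothetical. It computes only pairwise rule confidence, not the full Apriori frequent-itemset search and not the hypergraph partitioning.

```python
from itertools import permutations

def rule_confidence(transactions, min_support=2):
    """Return {(a, b): confidence of the rule a -> b} for topic pairs.

    confidence(a -> b) = count(a and b together) / count(a)
    Pairs seen together fewer than min_support times are dropped.
    """
    item_count = {}
    pair_count = {}
    for topics in transactions:
        for t in topics:
            item_count[t] = item_count.get(t, 0) + 1
        # Ordered pairs, because a -> b and b -> a have different confidence.
        for a, b in permutations(sorted(topics), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    return {
        (a, b): n / item_count[a]
        for (a, b), n in pair_count.items()
        if n >= min_support
    }

# Hypothetical topic sets, one per citation.
citations = [
    {"clustering", "databases"},
    {"clustering", "databases", "web"},
    {"clustering", "web"},
    {"databases"},
]
edges = rule_confidence(citations)  # directed, weighted edges between topics
```

Each key of `edges` is a directed edge between two topics and each value is its weight, which is the input a partitioning step would then work on.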

  5. Web Correlation • Use each title to query a search engine. • Filter the URLs of several digital libraries. • If two citations appear in the same URL, we use them as an instance of Web correlation.
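The three Web correlation steps can be sketched as follows. This is an illustration under stated assumptions: the search results are supplied as a plain dictionary rather than fetched from a real search engine, "filter" is read as discarding digital-library URLs, and the host list, the citation ids, and the function name `web_correlated_pairs` are all hypothetical.

```python
from itertools import combinations
from urllib.parse import urlparse

# Hypothetical list of digital-library hosts to discard.
DIGITAL_LIBRARY_HOSTS = {"dl.acm.org", "ieeexplore.ieee.org"}

def web_correlated_pairs(search_results):
    """search_results maps a citation id to the URLs returned for its title.

    Returns the set of (id_a, id_b) pairs that share at least one URL
    after digital-library URLs have been discarded.
    """
    kept = {
        cid: {u for u in urls
              if urlparse(u).hostname not in DIGITAL_LIBRARY_HOSTS}
        for cid, urls in search_results.items()
    }
    return {
        (a, b)
        for a, b in combinations(sorted(kept), 2)
        if kept[a] & kept[b]
    }

# Hypothetical query results for three citations.
results = {
    "c1": ["https://example.edu/~smith/pubs.html", "https://dl.acm.org/doi/1"],
    "c2": ["https://example.edu/~smith/pubs.html"],
    "c3": ["https://dl.acm.org/doi/1", "https://example.org/other"],
}
pairs = web_correlated_pairs(results)
```

Here `c1` and `c3` share only a digital-library URL, so they do not count as an instance of Web correlation, while `c1` and `c2` share a personal publication page and do.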

  6. Pair-Wise Grouping Algorithm • Generate pairs of citations by using similarity metrics • Use the training data to train a binary classifier • Apply the classifier to determine whether the pairs are matched • Combine the predicted results to group the citations into appropriate clusters • Filter out the pairs that would make the clusters sparse

  7. Pair-Wise Similarity Metrics • Similarity metrics for coauthor, title, and venue: 1. CSM 2. MSF • Similarity metric for topic correlation: TSM • Similarity metric for Web correlation: MNDF

  8. Binary Classifier • A binary classifier is used to learn the distribution of pair-wise vectors. • The pairs predicted as matched are used to build citation clusters (by constructing an undirected graph).
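Once the classifier has labelled pairs as matched, building the clusters amounts to finding the connected components of the undirected graph. A minimal sketch, assuming the classifier output is already available as a list of matched id pairs; the function name `cluster_matched_pairs` and the citation ids are hypothetical, and union-find is just one convenient way to compute the components.

```python
def cluster_matched_pairs(citation_ids, matched_pairs):
    """Group citations into clusters: one cluster per connected component
    of the undirected graph whose edges are the matched pairs."""
    parent = {c: c for c in citation_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for c in citation_ids:
        clusters.setdefault(find(c), set()).add(c)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical classifier output: pairs predicted to share an author.
ids = ["c1", "c2", "c3", "c4", "c5"]
matched = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
clusters = cluster_matched_pairs(ids, matched)
```

Note that `c1` and `c3` land in the same cluster even though the pair was never predicted directly, which is why a single wrong prediction can merge two authors.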

  9. Cluster Filter • A threshold is set for choosing which bridges should be removed. • A bridge is removed if the numbers of vertices in the two components it connects are both above the given threshold.
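The bridge rule can be sketched as below. This is one reading of the slide, not a confirmed implementation: a bridge is an edge whose removal disconnects the graph, and it is dropped when both sides it separates contain more vertices than the threshold. Each bridge is judged independently against the original graph; the function names and the example edges are hypothetical, and the graph is assumed to have no duplicate edges.

```python
from collections import defaultdict

def find_bridges(adj):
    """Return the bridges of an undirected simple graph (DFS low-link method)."""
    order, low, bridges = {}, {}, []

    def dfs(u, parent):
        order[u] = low[u] = len(order)
        for v in adj[u]:
            if v == parent:
                continue
            if v in order:
                low[u] = min(low[u], order[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > order[u]:  # no back edge from v's subtree past u
                    bridges.append((u, v))

    for node in list(adj):
        if node not in order:
            dfs(node, None)
    return bridges

def side_size(adj, start, blocked):
    """Count vertices reachable from start without crossing the blocked edge."""
    blocked = set(blocked)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if {u, v} == blocked or v in seen:
                continue
            seen.add(v)
            stack.append(v)
    return len(seen)

def filter_bridges(edges, threshold):
    """Return (kept_edges, removed_bridges) under the size rule above."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    removed = [
        (u, v) for u, v in find_bridges(adj)
        if side_size(adj, u, (u, v)) > threshold
        and side_size(adj, v, (u, v)) > threshold
    ]
    removed_sets = [set(r) for r in removed]
    kept = [e for e in edges if set(e) not in removed_sets]
    return kept, removed

# Two triangles joined by the bridge c-d, plus a pendant vertex g.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "d"), ("f", "g")]
kept, removed = filter_bridges(edges, threshold=2)
```

In this example the edge `c-d` separates two groups of three or more citations and is removed, while the bridge `f-g` stays because one of its sides is a single vertex.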

  10. Detecting Ambiguous Author Names in Crowdsourced Scholarly Data

  11. Prior Work • Name disambiguation has been cast into the problem of clustering a set of publications into profiles such that each profile corresponds to a single author.

  12. Name Variations and Citations • Extract the name variations from a collection of publications • Sort them by number of citations • Look at the percentage of the total citations that are attributed to the top name variations. (A high percentage suggests that the name is not ambiguous.)
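The percentage described on this slide is a simple ratio. A minimal sketch, assuming the citation counts per name variation have already been extracted; the function name `top_variation_share`, the `top_k` parameter, and the counts are hypothetical.

```python
def top_variation_share(citations_by_variation, top_k=1):
    """Fraction of all citations attributed to the top_k most-cited
    name variations. Returns 0.0 when there are no citations."""
    counts = sorted(citations_by_variation.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:top_k]) / total if total else 0.0

# Hypothetical citation counts for the variations of one author name.
variations = {"J. Smith": 120, "John Smith": 60, "J. A. Smith": 20}
share_top1 = top_variation_share(variations, top_k=1)
share_top2 = top_variation_share(variations, top_k=2)
```

A share close to 1 means a few variations account for nearly all citations, which the slide treats as a sign that the name is not ambiguous. How high the share must be is a threshold the slide does not specify.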

  13. Topic Consistency • Leverage the discipline tags crowdsourced from the users of the Scholarometer system • Detect different but related disciplines associated with an author name: • Map an author’s publications to topics, and measure the similarity between these topics. • Derive an author’s topic profile

  14. A brief survey of automatic methods for author name disambiguation

  15. Two problems • Synonyms: the same author may appear under distinct names • Polysems: distinct authors may have similar names.

  16. Proposed taxonomy

  17. Author Grouping Methods • Defining a similarity function: 1. Using predefined functions: the Levenshtein distance, Jaccard coefficient, cosine similarity, soft-TFIDF and others. 2. Learning a similarity function: use the training data to produce a similarity function S from R × R (R: the set of references) to {0, 1}, where 1 means that the two references refer to the same author and 0 means that they do not. 3. Exploiting graph-based similarity functions: create a coauthorship graph G = (V, E) for each ambiguous group. Each distinct coauthor name is represented by a single vertex, and the weight of an edge is related to the number of articles coauthored by the author names represented by its two vertices.
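Two of the predefined functions named on this slide are short enough to sketch directly. The implementations below are standard textbook versions of the Jaccard coefficient and the Levenshtein distance; the example names and coauthor sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

# Hypothetical comparisons between two references.
coauthor_sim = jaccard({"A. Gupta", "B. Lee"}, {"B. Lee", "C. Wong"})
name_dist = levenshtein("J. Smith", "J. Smyth")
```

Jaccard suits set-valued attributes such as coauthor lists, while the edit distance suits short strings such as name variants.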

  18. Author Grouping Methods • Clustering techniques: 1. Partitioning 2. Hierarchical agglomerative clustering 3. Density-based clustering 4. Spectral clustering

  19. Author assignment methods • Classification: assign the references to their authors using a supervised machine learning technique. • Clustering: use probabilistic techniques to determine the author in an iterative way to fit the model.

  20. Explored evidence • Citation information: the attributes directly extracted from the citations, such as author/coauthor names, work title, publication venue title, year, and so on. • Web information: Data retrieved from the web that is used as additional information about an author publication profile. • Implicit evidence: Evidence inferred from visible elements of attributes, such as the latent topics of a citation.

  21. Summary of characteristics: author grouping methods

  22. Summary of characteristics: author assignment methods

  23. Open challenges • Very little data in the citations • Very ambiguous cases -- ambiguous references will have coauthors who also have ambiguous names (especially Asian names) • Citations with errors • Efficiency • Different knowledge areas -- our focus is only on computer science • Incremental disambiguation • Author profile changes • New authors

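  24. pandasearch: research plan for the duplicate-name problem • Read the related papers and identify the solution best suited to the current problem. • Focus on implicit evidence and web information (especially scholars' personal homepages and CVs). • Work on both efficiency and accuracy, with the emphasis on accuracy. • Study the fundamentals of data mining and machine learning.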

  25. pandasearch: implementation plan for the duplicate-name problem • Type of approach: author grouping methods, learning a similarity function. • Explored evidence: citation information, web information, implicit evidence.
