
Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework


Presentation Transcript


  1. Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra JCDL’05

  2. Abstract • They consider the problem of ambiguous author names in bibliographic citations. • Scalable two-step framework • Reduce the number of candidates via blocking (four methods) • Measure the distance of two names via coauthor information (seven measures)

  3. Introduction • Citation records are important resources for academic communities. • Keeping citations correct and up-to-date has proved to be a challenging task at a large scale. • They focus on the problem of ambiguous author names. • It is difficult to get the complete list of publications for some authors. • e.g., “John Doe” published 100 articles, but the digital library keeps two separate purported author names, “John Doe” and “J. D. Doe”, each containing 50 citations.

  4. Problem • Problem definition: given a set of author-name strings, identify the name variants that refer to the same author. • The baseline approach: compare every pair of author names directly, which scales quadratically with the number of names.

  5. Solution • Rather than comparing each pair of author names to find similar names, they advocate a scalable two-step name disambiguation framework. • Partition all author-name strings into blocks • Visit each block and compare all possible pairs of names within the block
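The two-step framework can be sketched as follows (illustrative only; the blocking key and distance function here are placeholders, not the paper's actual methods):

```python
from collections import defaultdict

def two_step_disambiguation(names, block_key, distance, threshold):
    # Step 1: partition all author-name strings into blocks.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    # Step 2: compare all possible pairs of names, but only within a block.
    candidates = []
    for members in blocks.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if distance(members[i], members[j]) <= threshold:
                    candidates.append((members[i], members[j]))
    return candidates

# Placeholder key and distance: block on the last token, accept every in-block pair.
names = ["Jeffrey Ullman", "J. Ullman", "Dongwon Lee"]
pairs = two_step_disambiguation(names, lambda n: n.split()[-1].lower(),
                                lambda a, b: 0, threshold=0)
```

The point of step 1 is that the quadratic comparison in step 2 runs only inside each (much smaller) block, never across the whole name set.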

  6. Solution Overview

  7. Blocking (1/3) • The goal of step 1 is to put similar records into the same group by some criteria. • They examine four representative blocking methods • heuristics, token-based, n-gram, sampling

  8. Blocking (2/3) • Spelling-based heuristics • Group author names based on name spellings • Heuristics: iFfL, iFiL, fL, and their combination • iFfL (initial of the first name + full last name): e.g., “Jeffrey Ullman” and “J. Ullman” fall into the same block • Token-based • Author names sharing at least one common token are grouped into the same block • e.g., “Jeffrey D. Ullman” and “Ullman, Jason” share the token “Ullman”
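The two blocking keys above can be sketched like this (a simplified reading; real name parsing, e.g. "Last, First" order, is messier than shown):

```python
from collections import defaultdict

def iffl_key(name):
    # iFfL heuristic: initial of the First name + full Last name.
    # "Jeffrey Ullman" and "J. Ullman" both map to "j:ullman".
    tokens = name.replace(".", "").split()
    return tokens[0][0].lower() + ":" + tokens[-1].lower()

def token_blocks(names):
    # Token-based blocking: names sharing at least one common token share a block.
    blocks = defaultdict(set)
    for name in names:
        for token in name.replace(",", " ").replace(".", " ").lower().split():
            blocks[token].add(name)
    return blocks

key1 = iffl_key("Jeffrey Ullman")
key2 = iffl_key("J. Ullman")
blocks = token_blocks(["Jeffrey D. Ullman", "Ullman, Jason"])
```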

  9. Blocking (3/3) • N-gram • N = 4 • Produces the largest blocks: the number of author names put into the same block is the largest of the four methods. • e.g., “David R. Johnson” and “F. Barr-David” share a common 4-gram • Sampling • Sampling-based join approximation • Each token from all author names has a TF-IDF weight. • Each author name has its token weight vector. • All pairs of names with similarity of at least θ are put into the same block.
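A sketch of 4-gram blocking, assuming grams are taken over the name with punctuation and spaces stripped (the slides do not spell out the exact preprocessing):

```python
def ngrams(name, n=4):
    # Collapse the name to letters only, then take all contiguous n-grams.
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def same_block(a, b, n=4):
    # Two names land in the same block if they share at least one n-gram.
    return bool(ngrams(a, n) & ngrams(b, n))
```

Under this reading, "David R. Johnson" and "F. Barr-David" share the gram "davi", which is why n-gram blocking groups even names with little surface similarity and hence produces the largest blocks.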

  10. Measuring Distances • The goal of step 2 is, for each block, to identify top-k author names that are the closest. • Supervised method • Naïve Bayes Model, Support Vector Machine • Unsupervised method • String-based Distance, Vector-based Cosine Distance

  11. Supervised Methods (1) • Naïve Bayes Model Training: • The collection of coauthors of x is randomly split in half, and one half is used for training. • They estimate each coauthor’s conditional probability P(Aj|x) Testing:
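The training and scoring steps can be sketched like this (add-one smoothing is an assumption made here so unseen coauthors get nonzero probability; the paper's exact estimator may differ):

```python
import math
from collections import Counter

def train_coauthor_model(train_coauthors):
    # Estimate P(A_j | x) from the training half of x's coauthor list,
    # with add-one smoothing (an assumption, not necessarily the paper's choice).
    counts = Counter(train_coauthors)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen coauthors
    return lambda coauthor: (counts[coauthor] + 1) / (total + vocab)

def log_likelihood(prob, test_coauthors):
    # Score a candidate name by the log-probability of its coauthor list.
    return sum(math.log(prob(c)) for c in test_coauthors)

prob = train_coauthor_model(["Kang", "Kang", "Mitra"])
```

At test time, a candidate whose coauthors were frequently seen in training scores higher than one whose coauthors never appeared.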

  12. Supervised Methods (2) • Support Vector Machine • All coauthor information of an author in a block is transformed into a vector-space representation. • Author names in a block are randomly split: 50% is used for training and the other 50% for testing. • The SVM creates a maximum-margin hyperplane that separates the YES and NO training examples. • In testing, the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space. • Radial Basis Function kernel

  13. Unsupervised Methods (1) • String-based Distance • The distance between two author names is measured by the “distance” between their coauthor lists. • Two token-based string distances • Two edit-distance-based string distances
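One common representative of each family, sketched over coauthor data (the slides name only the two categories, so Jaccard and Levenshtein are stand-ins here, not necessarily the paper's exact four measures):

```python
def jaccard_distance(coauthors_a, coauthors_b):
    # Token-based: 1 - |A ∩ B| / |A ∪ B| over the two coauthor sets.
    a, b = set(coauthors_a), set(coauthors_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def edit_distance(s, t):
    # Edit-distance-based: classic Levenshtein dynamic program, O(|s|*|t|).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```

The O(|s|·|t|) cost per pair is what later makes edit-distance-style metrics slow at scale (see the Conclusion slide).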

  14. Unsupervised Methods(2) • Vector-based Cosine Distance • They model the coauthor lists as vectors in the vector space and compute the distances between the vectors. • They use the simple cosine distance.
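The cosine distance over coauthor-frequency vectors can be sketched as:

```python
import math
from collections import Counter

def cosine_distance(coauthors_a, coauthors_b):
    # Model each coauthor list as a term-frequency vector, then take
    # 1 - cosine similarity; identical lists give distance 0.
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(va[k] * vb[k] for k in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```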

  15. Experiment

  16. Data Sets • They gathered real citation data from four different domains. • DBLP, e-Print, BioMed, EconPapers • Different disciplines appear to have slightly different citation policies, and citation conventions also vary: • Number of coauthors per article • Use of the initial of the first name instead of the full name

  17. Artificial name variants • Given the large number of citations, it is neither possible nor practical to find a “real” solution set. • They pick the top-100 author names from Y according to their number of citations, and artificially generate 100 corresponding new name variants. • e.g., for “Grzegorz Rozenberg” with 344 citations and 114 coauthors in DBLP, they create a new name like “G. Rozenberg” or “Grzegorz Rozenbergg”. • The original 344 citations are split into halves, so each name carries 172 citations. • They test whether the algorithm is able to find the corresponding artificial name variant in Y.

  18. Artificial name variants • Error types, e.g., for “Ji-Woo K. Li”: • Abbreviation: “J. K. Li” • Name alternation: “Li, Ji-Woo K.” • Typo: “Ji-Woo K. Lee” or “Jee-Woo K. Li” • Contraction: “Jiwoo K. Li” • Omission: “Ji-Woo Li” • Combinations of the above • They measure how each error type affects the accuracy of name disambiguation.
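The deterministic error types above could be generated along these lines (illustrative only; random typo injection and the exact capitalization rules are not reproduced):

```python
def make_variants(name):
    tokens = name.split()
    first, last = tokens[0], tokens[-1]
    middle = tokens[1:-1]
    return {
        # Abbreviation: keep only the initial of the first name.
        "abbreviation": " ".join([first[0] + "."] + middle + [last]),
        # Name alternation: move the last name to the front.
        "alternation": last + ", " + " ".join([first] + middle),
        # Contraction: drop the hyphen in the first name (capitalization may
        # differ from the slides' example).
        "contraction": " ".join([first.replace("-", "")] + middle + [last]),
        # Omission: drop the middle name.
        "omission": first + " " + last,
    }

variants = make_variants("Ji-Woo K. Li")
```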

  19. Artificial name variants • Two error-injection schemes: • (1) mixed error types of abbreviation (30%), alternation (30%), typo (12% each in first/last name), contraction (2%), omission (4%), and combination (10%) • (2) abbreviation of the first name (85%) and typo (15%)

  20. Evaluation metrics • Scalability • Size of blocks generated in step 1 • Time it took to process both step 1 and 2 • Accuracy • They measured the accuracy of top-k.

  21. Scalability • The average # of authors in each block • Processing time for step 1 and 2

  22. Accuracy • Four blocking methods combined with seven distance metrics for all four data sets with k = 5. • The EconPapers data set is omitted.

  23. Conclusion • They compared various configurations (four blocking methods in step 1, seven distance metrics using “coauthor” information in step 2) against four data sets. • A combination of token-based or N-gram blocking (step 1) with SVM as a supervised method or the cosine metric as an unsupervised method (step 2) gave the best scalability/accuracy trade-off. • The accuracy of simple name-spelling-based heuristics was shown to be quite sensitive to the error types. • Edit-distance-based metrics such as Jaro or Jaro-Winkler proved to be inadequate for the large-scale name disambiguation problem because of their slow processing time.
