
Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation

AKT DTA Colloquium, January 23, 2006. Duncan McRae-Spencer.


Presentation Transcript


  1. Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation. AKT DTA Colloquium, January 23, 2006. Duncan McRae-Spencer.

  2. Name ambiguity is a problem for automated information extraction. Two problems:
     • Same name, different object: David L. Harris (Harvey Mudd College, formerly Stanford and MIT) and David L. Harris (Sandia Labs, Albuquerque).
     • Different name, same object: Professor Nick Jennings, Nicholas Jennings, N. R. Jennings.

  3. Existing Solutions:
     • By-hand disambiguation (e.g. DBLP). Problem: slow, labour-intensive.
     • Text and context processing: Li et al. (2005). Problem: deals with names within text, not document authors.
     • Metadata machine-learning techniques: Han et al. (2004, 2005). Problem: requires a known ‘canonical’ set, and 50% of the data is used in training.

  4. AKTiveAuthor: Linking together paper authors using metadata analysis.
     • Based on the following observation: people cite their own work. When they cite an author with a similar name, 95-98% of the time it is the same person.
     • Step one: initial clustering on last name.
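The initial clustering step could be sketched roughly as follows. This is only an illustration: the input format (a mapping from paper id to a list of author name strings), the function names `last_name` and `cluster_by_last_name`, and the naive surname extraction are all assumptions, not details given in the slides.

```python
from collections import defaultdict

def last_name(author: str) -> str:
    """Naive surname key (assumption): text before a comma, else the final token.

    Expects a non-empty name; multi-word surnames are not handled.
    """
    author = author.strip()
    if "," in author:
        surname = author.split(",", 1)[0]
    else:
        surname = author.split()[-1]
    return surname.strip().lower()

def cluster_by_last_name(papers):
    """Map each surname to the (paper_id, author) pairs that carry it."""
    clusters = defaultdict(list)
    for paper_id, authors in papers.items():
        for author in authors:
            clusters[last_name(author)].append((paper_id, author))
    return dict(clusters)

# Hypothetical input, using names that appear in the slides.
papers = {
    "p1": ["Nicholas Jennings", "David L. Harris"],
    "p2": ["Jennings, N. R."],
}
clusters = cluster_by_last_name(papers)
```

Here both Jennings entries land in the same name-cluster, which is the unit the later stages would work within.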

  5. Self-citation Analysis:
     • Within a name-cluster, test papers against each other.
     • Does paper A appear in the bibliography of paper B, or vice versa?
     • Iteratively use this approach to build groups of papers, each representing one real-world author.
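One way to picture the iterative grouping is as connected components over "cites or is cited by" links within a name-cluster. The sketch below uses a union-find structure for that; the slides do not say how the grouping is implemented, so the data layout (`bibliography[p]` as the set of paper ids cited by `p`) and the function name are assumptions.

```python
def group_by_self_citation(paper_ids, bibliography):
    """Group papers of one name-cluster that are linked by citations.

    bibliography[p] is assumed to be the set of paper ids that p cites.
    Returns a list of sets; each set stands for one presumed author.
    """
    ids = list(paper_ids)
    parent = {p: p for p in ids}

    def find(p):
        # Follow parent links to the group representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Test every pair in the cluster: does A cite B, or B cite A?
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if b in bibliography.get(a, set()) or a in bibliography.get(b, set()):
                union(a, b)

    groups = {}
    for p in ids:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())

# Hypothetical cluster: b cites a, c cites b, d is unconnected.
groups = group_by_self_citation(
    ["a", "b", "c", "d"],
    {"b": {"a"}, "c": {"b"}},
)
```

Because links are transitive in this sketch, `a`, `b` and `c` end up in one group even though `c` never cites `a` directly, while `d` stays on its own.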

  6. Co-authorship Analysis:
     • Standard approach in disambiguation (Han et al.) and social network analysis (AKT Ontocopi).
     • Use co-authorship relationships to further match the groups created in the self-citation stage.
     Source URL Analysis:
     • Extra linking provided using the ‘source URL’ metadata field.
     • Links papers by the same author on different subjects across one time period.
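The slides do not state the exact matching rule for these two stages. As a rough sketch, assume that two groups are joined whenever they share at least one value of some metadata field, where the values are co-author names in one pass and source URLs in another. The function name and data layout below are hypothetical.

```python
def merge_groups_sharing_values(groups, values):
    """Merge groups of papers that share at least one metadata value.

    groups: iterable of sets of paper ids (e.g. from the self-citation stage).
    values: dict mapping paper id -> set of values (co-author names, or
            source URLs); this layout is an assumption for illustration.
    """
    merged = []  # list of (paper_set, value_set); value sets stay disjoint
    for group in groups:
        papers = set(group)
        seen = set()
        for p in group:
            seen |= values.get(p, set())
        keep = []
        for other_papers, other_seen in merged:
            if seen & other_seen:
                # Shared co-author (or URL): absorb the earlier group.
                papers |= other_papers
                seen |= other_seen
            else:
                keep.append((other_papers, other_seen))
        keep.append((papers, seen))
        merged = keep
    return [papers for papers, _ in merged]

# Hypothetical data: p1 and p3 share the co-author "Smith".
result = merge_groups_sharing_values(
    [{"p1", "p2"}, {"p3"}, {"p4"}],
    {"p1": {"Smith"}, "p3": {"Smith", "Wu"}, "p4": {"Lee"}},
)
```

A real system would presumably be stricter than "any shared value", for instance by requiring that the candidate join also passes a plausibility test on the author names.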

  7. Sanity Check:
     • Before committing to a ‘join’ in any of the three stages, check whether it is obviously not the same person.
     • E.g. Norman L. Johnson and David E. Johnson (self-citation match).
     • E.g. Earl and Erik Johnson (co-authorship match).
     • E.g. Nicholas Jennings and N. Jennings: allowed.
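A sanity check of this kind might compare given names token by token, treating an initial as compatible with any full name that starts with the same letter. The rule below is an assumption chosen to reproduce the three examples on this slide; it is not taken from the AKTiveAuthor implementation, and it does not handle titles ("Professor") or nicknames ("Nick" versus "Nicholas").

```python
def tokens_compatible(a: str, b: str) -> bool:
    """An initial matches any name with the same first letter; full names must be equal."""
    a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def plausibly_same_person(name_a: str, name_b: str) -> bool:
    """Return False only when two 'First [Middle] Last' names clearly conflict."""
    *given_a, last_a = name_a.split()
    *given_b, last_b = name_b.split()
    if last_a.lower() != last_b.lower():
        return False
    # zip stops at the shorter list, so a missing middle name never blocks a join.
    return all(tokens_compatible(x, y) for x, y in zip(given_a, given_b))

# The three cases from the slide.
checks = [
    plausibly_same_person("Norman L. Johnson", "David E. Johnson"),  # conflicting first names
    plausibly_same_person("Earl Johnson", "Erik Johnson"),           # same initial, different names
    plausibly_same_person("Nicholas Jennings", "N. Jennings"),       # initial matches full name
]
```

Under these assumptions the first two pairs are rejected and the third is allowed, matching the behaviour described on the slide.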

  8. Metrics:
     • Essentially an information retrieval exercise.
     • Three measures, each per individual paper:
        • Precision: (number of relevant docs retrieved) / (number of docs retrieved).
        • Recall: (number of relevant docs retrieved) / (number of relevant docs overall).
        • F-measure: harmonic mean of Precision and Recall, used as a generic measure of IR success.
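The three measures can be computed directly from two sets. In the sketch below, `retrieved` is taken to be the set of papers the algorithm groups with a given paper, and `relevant` the set that the hand-disambiguated data assigns to the same author; that reading of "per individual paper" is an interpretation, and the function name is hypothetical.

```python
def precision_recall_f(retrieved: set, relevant: set):
    """Return (precision, recall, F-measure) for one paper's retrieved group."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical example: three papers grouped, all correct, one true paper missed.
p, r, f = precision_recall_f({"p1", "p2", "p3"}, {"p1", "p2", "p3", "p4"})
```

In this example precision is 1.0 and recall is 0.75, so the F-measure is 2 × 1.0 × 0.75 / 1.75, roughly 0.857.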

  9. Results:
     • Tested eight name-clusters, checking against by-hand disambiguated results.
     • Precision ranged from 0.991 to 1.000 (mean 0.997).
     • Recall ranged from 0.705 to 0.935 (mean 0.818).
     • F-measure ranged from 0.824 to 0.965 (mean 0.899).

  10. Analysis / Conclusions:
     • Precision is higher than recall, mainly due to the sanity check.
     • All three methods (self-citation, co-authorship and source URL analysis) are needed for best results.
     • Heavily-dominated name-clusters give the best results (e.g. Giles: 81.6% C. Lee Giles).
     • Large and small name-clusters perform equally well.

  11. Future Work:
     • Original purpose: citation graph services, e.g. ‘view my papers’, ‘count my citations’, ‘calculate my impact’.
     • Improving the disambiguation algorithm: institutional affiliation data, tightening up co-authorship, better initial clustering.

  12. Questions?
