CiteGraph: A Citation Network System for MEDLINE Articles and Analysis
Qing Zhang¹,², Hong Yu¹,³
¹ University of Massachusetts Medical School, Worcester, MA, USA
² University of Wisconsin-Milwaukee, Milwaukee, WI, USA
³ VA Central Massachusetts, Leeds, MA, USA
Outline • Introduction • Background • Method • Evaluation • Analysis CiteGraph, MedInfo 2013
Introduction
• Citation networks are important for
  • Information retrieval
  • Journal Impact Factor, h-index
• Co-authorship networks are important
• Few citation networks are available for research
• We built CiteGraph
Background
• Citation network analysis
  • Power-law distribution in citation networks
  • Article ranking: HITS and PageRank
  • Community structure of physics fields
  • Citation network tool for a given legal issue, built on a legal document citation network
• Co-authorship network analysis
  • Research collaboration patterns
  • Author authority: Erdős number
• Literature search
  • CiteSeerX, Google Scholar
The CiteGraph Data
Citation Network Example
Challenges
• (1) Yu, H and Lee M. 2006. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556.
• (2) Hong Yu and Minsuk Lee. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556. 2006.
• (3) Yu H, Lee H. 2006. Accessing Bioscience Images from Abstract Sentences. Bioinformatics: 22 (14), e547–e556.
Methods • Mapping between articles • Mapping articles to their PubMed IDs (PMIDs) • Author name disambiguation
Methods
• Two entities (for example, a citation and an article) are considered matched if two of the following matching tests succeed
• Title matching
  • the set of tokens in one title field is a subset of the tokens in the other, or
  • the number of tokens common to both fields is more than 80% of the size of the larger field
• Author list matching
  • the two lists of surnames map one-to-one, or
  • the surnames in one entity (the citation) are fully contained in the surname set of the other (the article)
• Journal name matching
  • stop words such as “of” are removed
  • two journal names are considered equivalent if the number of common initials is greater than 80% of the number of tokens in the longer name
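The matching rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CiteGraph implementation: the tokenizer, the stop-word list beyond "of", the record layout (`title`, `authors`, `journal`), all function names, and the reading of "two of the following" as "at least two" are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list; the slide names only "of" as an example.
STOP_WORDS = {"of", "the", "and", "in", "for", "a", "an", "on"}


def tokens(text):
    """Lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def titles_match(a, b, threshold=0.8):
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return False
    # Rule 1: one token set is a subset of the other.
    if ta <= tb or tb <= ta:
        return True
    # Rule 2: shared tokens exceed 80% of the larger token set.
    return len(ta & tb) > threshold * max(len(ta), len(tb))


def authors_match(citation_surnames, article_surnames):
    c = [s.lower() for s in citation_surnames]
    a = [s.lower() for s in article_surnames]
    if not c or not a:
        return False
    # One-to-one mapping, or citation surnames contained in the article's set.
    return sorted(c) == sorted(a) or set(c) <= set(a)


def journals_match(a, b, threshold=0.8):
    wa = [t for t in tokens(a) if t not in STOP_WORDS]
    wb = [t for t in tokens(b) if t not in STOP_WORDS]
    if not wa or not wb:
        return False
    # Count initials shared by both names (multiset intersection).
    common = Counter(t[0] for t in wa) & Counter(t[0] for t in wb)
    return sum(common.values()) > threshold * max(len(wa), len(wb))


def is_match(citation, article):
    """Matched when at least two of the three tests succeed (assumed reading)."""
    votes = [
        titles_match(citation["title"], article["title"]),
        authors_match(citation["authors"], article["authors"]),
        journals_match(citation["journal"], article["journal"]),
    ]
    return sum(votes) >= 2
```

With these rules, an abbreviated journal name such as "J Biol Chem" passes the journal test against "Journal of Biological Chemistry" because all three initials agree once "of" is dropped.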
Evaluation Results • Seven annotators were invited to annotate the citation mapping and PMID mapping results • Each annotator was presented with 20 matching results for each task
The CiteGraph Statistics • 1.65 M articles • 1.37 M authors • 6.35 M citations
The CiteGraph Statistics
• Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods. 2001 Dec;25(4):402-8.
• log y = 1.06 - 2.45 log x (p < 0.05, t-test)
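A line of the form log y = a + b log x is what an ordinary least-squares fit in log-log space produces. The sketch below shows such a fit. It assumes base-10 logarithms and strictly positive x and y values; the slide does not say which quantities x and y denote or which log base was used, and the function name `fit_loglog` is hypothetical.

```python
import math


def fit_loglog(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x).

    xs and ys must be strictly positive. Returns (intercept, slope).
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data generated from the reported line, used only to check the fit.
xs = list(range(1, 50))
ys = [10 ** 1.06 * x ** -2.45 for x in xs]
intercept, slope = fit_loglog(xs, ys)
```

A negative slope of this size means that the y value falls off steeply as x grows, which is the usual signature of a power-law distribution.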
The CiteGraph Statistics • Largest connected component: 1.27 million authors (92.7%) • Second largest connected component: 35 authors
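Component sizes like these can be computed from per-article author lists with a union-find structure. The following is a minimal sketch, assuming each article is given as a list of already disambiguated author identifiers; the function name and input layout are hypothetical, not taken from CiteGraph.

```python
from collections import Counter


def component_sizes(author_lists):
    """Sizes of connected components in the co-authorship graph, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for authors in author_lists:
        if not authors:
            continue
        first = authors[0]
        find(first)  # registers single-author articles too
        for other in authors[1:]:
            union(first, other)

    sizes = Counter(find(a) for a in list(parent))
    return sorted(sizes.values(), reverse=True)


sizes = component_sizes([["A", "B"], ["B", "C"], ["D", "E"], ["F"]])
```

Linking every co-author to the first author of an article is enough, because all authors of one article end up in the same component either way.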
The CiteGraph Statistics • Co-authorship spans range from 1 to 35 years • 83.7% of author pairs appear together only once
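Figures of this kind can be derived from publication years per author pair. The sketch below assumes a span is counted inclusively (last year minus first year plus one, so a single joint paper has a span of 1 year) and that each article is a `(year, authors)` tuple. Both the definition and the names are assumptions; the slides do not spell them out.

```python
from collections import defaultdict
from itertools import combinations


def pair_stats(articles):
    """Return (spans, fraction_once) for co-author pairs.

    spans maps each unordered author pair to its co-authorship span in years;
    fraction_once is the share of pairs with exactly one joint article.
    """
    years = defaultdict(list)
    for year, authors in articles:
        # sorted(set(...)) gives each unordered pair one canonical key
        for pair in combinations(sorted(set(authors)), 2):
            years[pair].append(year)
    spans = {pair: max(ys) - min(ys) + 1 for pair, ys in years.items()}
    once = sum(1 for ys in years.values() if len(ys) == 1)
    fraction_once = once / len(years) if years else 0.0
    return spans, fraction_once


spans, fraction_once = pair_stats([
    (2001, ["A", "B"]),
    (2005, ["B", "A", "C"]),
    (2006, ["D", "E"]),
])
```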
The CiteGraph Statistics * The largest component is excluded when calculating the statistics in the table. Its size is 1.27 million authors (92.7% of all authors).
Trends
Conclusion
• We created citation and co-authorship networks from biomedical full-text literature
• Our networks are accurate and large-scale, and they can benefit the biomedical text mining community
  • Article ranking
  • Research collaboration recommendation
  • Social network analysis
• The network database is available for download on request
Acknowledgement • National Institutes of Health grant 1R01GM095476 to Hong Yu • A start-up fund from the University of Massachusetts Medical School to Hong Yu • National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR000161