BibPro: A Citation Parser System

BibPro: A Citation Parser System

Introduction • Integrating bibliographical information • Metadata • author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc. • Citation string • Thousands of variations • More than 2,000 formats in Endnote • Citation Parsing Problem • automatically recognize individual fields from a given citation string • A template citation parser

Goal

Machine Learning • Condition Random Field • F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336. • Support Vector Machine • Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48. • Hidden Markov Model • K. Seymore, A. McCallum, R. Rosenfeld. Learning hiddenMarkov model structure for information extraction. AAAI-99Workshop on Machine Learning for Information Extraction, 1999, 37-42. • Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60. • Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.

Knowledge Base • A tree-like knowledge representation scheme that organizes the knowledge of reference concepts in a hierarchical fashion • Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006. • A knowledge base automatically constructed from an existing set of sample metadata records of a given area • E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: exible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215{224, Vancouver, BC, Canada, 2007. ACM.

Template Base • Keep citation style as a template • ParaCite http://paracite.eprints.org/ • I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548. • Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho, BibPro: A Citation Parser Based on Sequence Alignment Techniques, ainaw,pp.1175-1180, aina workshops 2008, 2008.

Key Idea • Encode a citation string into a template for BLAST • Keeping citation style information into a protein sequence • Utilizing bioinformatics sequence tools • BLAST • Using Domain Knowledge • Reserved word • Knowledge database (optional) • Blocking rule (common sense knowledge)

Question • How many symbols can be used in a protein sequence? 23 symbols used in BLAST • How many fields should be extracted from a citation? choose the most common used field • Which punctuation marks are treated as partition marks base on domain knowledge • How do we transform a citation into a protein sequence and retain its structure feature? define a encode table

Encoding Procedure

Encoding Table

Encoding Knowledge • A [AUTHOR] • Name abbreviation • T [TITLE] • Length of Blocking • L [VENUE] • booktitle(conference): Proceedings Proc Workshop Conf Conference Symposium Sympos Symp International Intern Annual Annu • journaltitle: Transactions Trans Journal • techtitle(thesis): Tech rep Rpt TR Master Masters Ph PhD Thesis thesis Dissertation dissertation • V [VOLUME] • volume: Volume volume Vol vol Vo vo • issue: Number number Nr nr No no NO Nos • P [PAGE] • page: pp page pages PP Page Pages pg PG

Encoding Knowledge • Y [DATE] • month: January February March April May June July August September October November December Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec Sept • year: 1900-2010 • F [EDITOR] • editor: eds Eds editors Editors editor Eds ED Ed ed edited • S [INSTITUTION] • institution: University Univ Department Dept Corporation • M [PUBLISHER] • publisher: Press Pub Publishers Inc Publications

Tokenizing and Encoding Citation M . Bianchini , P . Frasconi , and M . Gori , " Learning in multilayered networks used as autoassociators , " IEEE Transactions on Neural Networks , vol . 6 , pp . 512 -515 , March 1995 .

Blocking Mechanism • After encoding the citation, we can utilize semi-structured characteristic of citation by some special pattern of sequence • Using blocking rule to merge special pattern into a single unit • e.g ADXRA  A

Blocking(1/2)

Blocking(2/2) Index Form  ARGBRGLRVRPRYD (keep the blocking area information: start position and end position e.g. “A” start:0 end:11 )

Template Database • A record in the Template Database • A citation item with both citation string and metadata • Style Form • Index Form • Once the template database has been constructed, BibPro can provide the citation parsing service on-the-fly

Citation Style Template • Index Form (Unknown Answer) • Style Form (Known Answer)

Parsing (Template Matching)

Finding Citation Style Templates • Using Score mechanism • Finding similar citation style templates • Blast Score Matrix • which fields exist in query citation • the order of partition mark represent a citation style • Choose by IndexForm • Align query citation with citation style template • Score Matrix (dynamic programming) • Content Symbol map to Content Symbol • Partition Mark map to Partition Mark • Choose the most suitable citation style template according to alignment between IndexForm and StyleForm

Parsing (Alignment Extraction) (Query) Index Form ARGBRGLRVRPRYD |||:|||||||||| (Template) Style Form ARGTRGLRVRPRYD ARGTRGLRVRPRYD

System Architecture

Experiment • Dataset • INFOMAP Dataset • A total of 160,000 citation records were collected from digital libraries on the Web • Citation string data was generated for each of the six citation styles (APA, IEEE, ACM, MISQ, JMIS, and ISR) • Cora Dataset • 500 records with diversity citation style • Flux-CIM Dataset • 2000 HS-domain records • 300 CS-domain records

Experiment • Evaluation • Token-Level • A is the number of true positive tokens • B is the number of false negative tokens • C is the number of false positive tokens • D is the number of true negative tokens • Field-Level

INFOMAP Dataset Result

Cora Dataset Result

Flux-CIM HS Domain Dataset

Flux-CIM CS Domain Dataset

Conclusion • Parsing citation is a challenging problem • Diversity in citation formats • We present a template-based parser • Parser System http://csclws.iis.sinica.edu.tw:8080/input.jsp • Template Generator System http://csclws.iis.sinica.edu.tw:8080/tpin.jsp

Reference • F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336. • Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48. • K. Seymore, A. McCallum, R. Rosenfeld. Learning hiddenMarkov model structure for information extraction. AAAI-99Workshop on Machine Learning for Information Extraction, 1999, 37-42. • Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60. • Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006. • E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: exible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215{224, Vancouver, BC, Canada, 2007. ACM. • I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548. • Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho, BibPro: A Citation Parser Based on Sequence Alignment Techniques, ainaw,pp.1175-1180, aina workshops 2008, 2008. • Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008. • S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman. A basic local alignment search tool. J. Mol. Biol., 215, 1990, 403-410. • Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 1970, 443-453.

Thank You !

BibPro: A Citation Parser System