Improving Citation Extraction Using Regular Expressions and Pattern Recognition Techniques
This paper presents a novel method for extracting citations from web pages using regular expressions and citation patterns. We analyze existing citation structures and develop new patterns to enhance the extraction of author, title, and conference information. Our approach leverages known citation data to bootstrap the extraction system, improving accuracy and coverage from limited initial seed citations. We highlight challenges such as variations in titles and case sensitivity, and propose extensions for better boundary detection and more flexible pattern construction. Our findings demonstrate the effectiveness of our method in retrieving citations accurately from semi-structured text.
Presentation Transcript
Citation Extractor
Nguyen Bach, Sue Ann Hong, Ben Lambert
Extraction Task
• AuthorOf(Author, Paper)
• PublishedAt(Paper, Conference)
• IsPaper, IsAuthor, IsConference
• “Citation” = <Paper, Authors, Conference>
• “Pattern” = regular expression
Method Outline
[Diagram: Seed (e.g. 5 citations) → Citation DB → Query → Search (WIT) → Web pages (HTML, text) → Extract Patterns using known citations → Page-specific Patterns → Extract Citations using new patterns → Citations → Citation DB]
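The loop in the method outline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `bootstrap`, `fetch_pages`, `learn_patterns`, `apply_patterns`, and `max_rounds` are all hypothetical names, and the search step (WIT in the slides) is abstracted behind a caller-supplied function.

```python
def bootstrap(seeds, fetch_pages, learn_patterns, apply_patterns, max_rounds=3):
    """Grow a citation set from a few seeds (sketch; all callables are assumed).

    fetch_pages(citation)          -> pages that mention the citation
    learn_patterns(page, citation) -> page-specific patterns
    apply_patterns(page, patterns) -> citations found on that page
    """
    known = set(seeds)      # the citation DB
    frontier = set(seeds)   # citations not yet used as queries
    for _ in range(max_rounds):
        found = set()
        for citation in frontier:
            for page in fetch_pages(citation):
                # Patterns are learned per page from a citation we already know.
                patterns = learn_patterns(page, citation)
                found.update(apply_patterns(page, patterns))
        frontier = found - known
        if not frontier:
            break           # nothing new was extracted; stop early
        known |= frontier
    return known
```

Only newly found citations are used as queries in the next round, so the loop stops as soon as a round adds nothing to the citation DB.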
AUTHOR, AUTHOR: TITLE. CONF
4 Patterns:
• AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
• (A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
• (A-Za-z), (A-Za-z): TITLE. (A-Za-z)
• (A-Za-z), (A-Za-z): (A-Za-z). CONF
Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
Page: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/y/Yang:Qiang.html
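One way to read the four patterns: each keeps a single known field literal and turns every other field into a wildcard. The sketch below builds such regexes for the "AUTHOR, AUTHOR: TITLE. CONF" layout. It is an illustration under assumptions: `build_patterns` and `WILDCARD` are hypothetical names, the wildcard is approximated as "anything except a separator character", and the test line is made up.

```python
import re

# Assumption: the slide's "(A-Za-z)" wildcard is approximated here by
# "any run of characters that is not a field separator".
WILDCARD = r"([^:.,]+)"


def build_patterns(authors, title, conf):
    """Build one regex per known field of an 'A, A: TITLE. CONF' citation.

    Each regex keeps exactly one field literal (the anchor) and replaces
    every other field with a capturing wildcard.
    """
    fields = list(authors) + [title, conf]
    patterns = []
    for keep in range(len(fields)):
        parts = [
            re.escape(field) if i == keep else WILDCARD
            for i, field in enumerate(fields)
        ]
        *author_parts, title_part, conf_part = parts
        regex = (
            r",\s*".join(author_parts)
            + r"\s*:\s*" + title_part
            + r"\s*\.\s*" + conf_part
        )
        patterns.append(re.compile(regex))
    return patterns


seed_patterns = build_patterns(
    ["Xiaoyong Chai", "Qiang Yang"],
    "multiple-goal recognition from low-level signals",
    "AAAI 2005",
)
# A made-up line in the same layout, sharing only the second author.
line = "Jane Doe, Qiang Yang: An Example Title. EXAMPLE 2000"
match = seed_patterns[1].search(line)
print(match.groups() if match else None)
```

Because the literal field is matched exactly, this sketch is as brittle as a strict pattern can be: a change in case or punctuation in the anchor field is enough to make the pattern miss.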
Finding New Citations
[Figure: a web page with spans labeled AUTHOR, TITLE, and CONF, overlaid with the AUTHOR/TITLE/CONF patterns used to match new citations.]
System Spits Out…
• 6 seeds → 60 citations
• 36 of these are partial citations
  • "Theory and Algorithms for Plan Merging ", " Ming Li"
  • "The Expected Value of Hierarchical Problem-Solving ", " Fahiem Bacchus"
  • "Handling feature interactions in process-planning "
• 14 of these are partial strings
  • "On D "
  • "On t ", " John Tromp", " Elizabeth Sweedyk", " Umest Vazirani"
  • "An L ", " Ronan Sleep"
  • "To D "
• No new conferences (end-token)
Bootstrapping, Short-Lived
• Highly restrictive regexes
• No recovery
• The more seeds and variety, the better
• Stupid Little Things
  • Mis-capitalization
  • Variations in titles (‘-’ vs. ‘ ’)
  • Etc., etc., etc.
Extensions ~ Improvements
• Less strict string matching
  • Not case and punctuation sensitive
• Better boundary detection
  • Start/end tokens, HTML wrapper detection?
• Better pattern construction
  • e.g. n authors, not 2
• NER
  • Help find the right "window"
  • A source of ENTITY markers
  • Use like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
• Evaluation with DBLP?
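A minimal sketch of the "less strict string matching" idea, assuming it means comparing fields after removing case and punctuation differences. `normalize` and `loosely_equal` are hypothetical helper names, not part of the original system.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|_", " ", text)  # '-' and other punctuation become ' '
    return " ".join(text.split())


def loosely_equal(a, b):
    """Compare two strings while ignoring case and punctuation."""
    return normalize(a) == normalize(b)


print(loosely_equal(
    "Multiple-Goal Recognition from Low-Level Signals",
    "multiple goal recognition from low level signals",
))
```

Mapping punctuation to a space, not deleting it, keeps "low-level" and "low level" equal without merging them into "lowlevel".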
NER
• Baseline model (News corpus)
<ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
<ENAMEX_TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX_TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX_TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>
• Adapted model (News + citation corpus)
<ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX_TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.
<ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>
Lessons Learned (Another Boring Text Slide)
• Semi-structured text is surprisingly difficult to read
• Off-line training for wrappers and/or NER may help
• Need very high-confidence rules to ensure precision
• A continuously-running system needs robustness (internet/Google-failure, unexpected errors, …)
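For the robustness point, one common approach (an assumption here, not something the slides describe) is to wrap each network call in a retry with exponential backoff, so a transient search or fetch failure does not stop a long-running job. `with_retries` and its parameters are hypothetical names.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry.

    The last failure is re-raised so the caller can log it and skip the query.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would pass the search or page-fetch step as `fn`, for example `with_retries(lambda: fetch(url))`, where `fetch` stands for whatever download function the system uses.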