Terminology problems in literature mining and NLP

John MacMullen
SILS Bioinformatics Journal Club – Fall 2003

Assumptions of the paper
• “knowledge encoded in textual documents is organized around sets of domain-specific terms, which are used as a basis for sophisticated knowledge acquisition.” [938]
• “Terms represent the most important concepts in a domain and characterize documents semantically.” [939]
• “the basic problem is to recognize domain-specific concepts and to extract instances of specific relationships among them.” [938]

Current approaches to automatic term recognition (ATR)
• Morpho-syntactic feature identification
• Hybrid linguistic and statistical approaches
• Machine learning techniques

Problems
• Terms are ambiguous and vary in form; they are hardly ever mono-referential
• The lack of naming conventions (controlled vocabularies), the existence of acronyms, and the large, heterogeneous existing literature increase complexity

Context: Term variation problems in NLP

Terminology Processing Workflow
• Corpus: 2,082 MEDLINE abstracts related to ‘nuclear receptors’
• Source: Nenadic, Spasic & Ananiadou (2003), Fig. 1

ATR approach
• C-values (“termhoods”) [940]
  • Term frequency
  • “Frequency of occurrence as a substring of other candidate terms” (receptor)
  • “Number of candidate terms containing the given candidate term as a substring”
  • “Number of words contained in the candidate term”
• NC-values (“termhood estimations”) [940]
  • Includes context of candidate terms
  • “Frequency of co-occurrence with top-ranked context words”
  • NC-value = a linear combination of the C-value and context factors for each term
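
To make the C-value ingredients concrete, here is a minimal Python sketch. It assumes the commonly cited C-value formulation (Frantzi, Ananiadou & Mima), which the slide does not spell out: a candidate's frequency is discounted by the average frequency of the longer candidates it is nested in, then weighted by the log of its length in words. The function names, the toy frequencies, and the 0.8/0.2 weighting in `nc_value` are illustrative assumptions, not values taken from the paper or the slides.

```python
import math

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_values(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    `freq` maps candidate terms to their corpus frequency.
    """
    scores = {}
    for term, f in freq.items():
        # Longer candidate terms in which this candidate is nested.
        nesting = [t for t in freq if len(t) > len(term) and contains(t, term)]
        # Length weight; note that single-word candidates score 0 here.
        weight = math.log2(len(term))
        if nesting:
            nested_freq = sum(freq[t] for t in nesting)
            scores[term] = weight * (f - nested_freq / len(nesting))
        else:
            scores[term] = weight * f
    return scores

def nc_value(c_value, context_factor, weight=0.8):
    """Linear combination of C-value and a context factor (weight is assumed)."""
    return weight * c_value + (1 - weight) * context_factor

# Toy frequencies, made up for illustration only.
freq = {
    ("nuclear", "receptor"): 10,
    ("orphan", "nuclear", "receptor"): 4,
    ("nuclear", "receptor", "coactivator"): 2,
}
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy corpus, "nuclear receptor" occurs 10 times but is nested in two longer candidates with a combined frequency of 6, so its frequency is discounted by 6 / 2 = 3 before the length weight is applied.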

Clustering & Evaluation
• Clustering
  • CSL (contextual, syntactical, lexical)
  • Clustering implies underlying perspectives or queries
• Evaluation
  • Recall – the probability that a relevant item will be retrieved
  • Precision – the probability that a retrieved item will be relevant
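
The two evaluation measures above can be estimated from a set of extracted terms and a gold-standard set. The following is a small Python sketch; the function name and both term sets are hypothetical examples, not data from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # items that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical output of a term extractor and a hand-built gold standard.
extracted = {"nuclear receptor", "retinoic acid", "present study",
             "thyroid hormone receptor"}
gold = {"nuclear receptor", "retinoic acid", "thyroid hormone receptor",
        "estrogen receptor", "ligand binding domain"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four extracted terms are in the gold standard (precision 3/4), and three of the five gold terms were extracted (recall 3/5).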

Other questions
• Corpus construction: “a larger corpus does not have a proportionally higher number of acronyms” [942]. True?
• “All term variants are considered jointly for the calculation of termhood” [942]. What would happen if they weren’t?
• In what ways is the hybrid similarity measure corpus dependent? [942]

References
• Nenadic, G., Spasic, I., & Ananiadou, S. (2003). Terminology-driven mining of biomedical literature. Bioinformatics, 19(8), 938–943. http://bioinformatics.oupjournals.org/cgi/reprint/19/8/938
