Information Extraction from Biomedical Text Jerry R. Hobbs Artificial Intelligence Center SRI International
Introduction • Information Extraction: • Extract entities, relations, events • Capture structured information • Domain specific • Focus only on the relevant parts of the text • So far applied mainly to domains of economic and military interest • Here: the biomedical domain
Cascaded Finite-State Transducers • Processing is separated into several stages • FASTUS (Finite-State Automaton Text Understanding System) • Earlier stages: • Smaller linguistic objects • Domain-independent • Later stages: • Domain-dependent patterns
Cascaded Finite-State Transducers • Complex Words • Basic Phrases • Complex phrases • Domain Patterns • Merging Structures
Example gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
Target Database • Reaction Object attributes: • ID • Pathway • Enzyme • ... • Enzyme Object attributes: • ID • Name • Molecular-Weight • Subunit-Component • Subunit-Number
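The two target objects can be pictured as simple record types. The following is only a sketch: the field types, the `Optional` defaults, and the idea of linking a reaction to an enzyme by ID are assumptions made for illustration, not something the slides specify. The attributes elided on the slide are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enzyme:
    # Attribute names follow the slide; the types are assumed.
    id: str
    name: Optional[str] = None
    molecular_weight: Optional[int] = None
    subunit_component: Optional[str] = None  # left unfilled in this sketch
    subunit_number: Optional[int] = None


@dataclass
class Reaction:
    id: str
    pathway: Optional[str] = None
    enzyme: Optional[str] = None  # assumed to hold the ID of an Enzyme record


# Records filled by hand from the example passage.
enzyme = Enzyme(id="E1", name="gamma-Glutamyl kinase",
                molecular_weight=236000, subunit_number=6)
reaction = Reaction(id="R1", pathway="proline biosynthetic pathway",
                    enzyme=enzyme.id)
```

Slots default to `None` so that a partially filled record can be created from one sentence and completed later.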
Complex Words • Recognizes • multiword fixed phrases • proper names • Both are abundant in the biological domain • Uses a lexicon, or machine-learning and statistical methods • Example passage: gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
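A lexicon-based recognizer for complex words can be sketched as a longest-match lookup over tokens. The lexicon entries, the type labels (`ENZYME`, `ORGANISM`, and so on), and the tokenizer below are all hypothetical choices made for this illustration; a real system would draw on a much larger lexicon or a trained model.

```python
import re

# Hypothetical mini-lexicon: lowercased token tuples mapped to assumed labels.
LEXICON = {
    ("gamma-glutamyl", "kinase"): "ENZYME",
    ("escherichia", "coli"): "ORGANISM",
    ("proline", "biosynthetic", "pathway"): "PATHWAY",
    ("3,4-dehydroproline",): "COMPOUND",
}
MAX_LEN = max(len(key) for key in LEXICON)


def tokenize(text):
    # Keep internal hyphens and commas so "3,4-dehydroproline" stays one token.
    return re.findall(r"[A-Za-z0-9]+(?:[-,][A-Za-z0-9]+)*", text)


def find_complex_words(tokens):
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in LEXICON:
                found.append((" ".join(tokens[i:i + n]), LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return found


text = ("gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic "
        "pathway, was purified from an Escherichia coli strain resistant to "
        "the proline analog 3,4-dehydroproline.")
complex_words = find_complex_words(tokenize(text))
```

Trying the longest span first keeps "proline biosynthetic pathway" together instead of stopping at a shorter entry.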
Basic Phrases • Segments a sentence into noun groups, verb groups, and particles • Uses the Sager (1981) grammar
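Segmentation into basic phrases can be illustrated with a finite-state pattern over part-of-speech tags. This is a toy sketch, not the Sager grammar: the tag set, the hand-assigned tags, and the three patterns are assumptions chosen so the example stays small.

```python
import re

# Each part-of-speech tag is mapped to one character so that a regular
# expression can scan the tag sequence like a finite-state machine.
TAG_TO_CHAR = {"DT": "D", "JJ": "J", "NN": "N", "CD": "C",
               "VBD": "V", "RB": "R", "IN": "P"}

# NG: optional determiner, modifiers, one or more nouns, or a bare number.
# VG: optional adverbs followed by one or more verbs.
# PARTICLE: any other single token.
PHRASE_RE = re.compile(r"(?P<NG>D?[JC]*N+|C)|(?P<VG>R*V+)|(?P<PARTICLE>.)")


def chunk(tagged):
    codes = "".join(TAG_TO_CHAR.get(tag, "X") for _, tag in tagged)
    phrases = []
    for match in PHRASE_RE.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append((match.lastgroup, " ".join(words)))
    return phrases


# Tags assigned by hand for the first clause of the example sentence.
tagged = [("The", "DT"), ("enzyme", "NN"), ("had", "VBD"), ("a", "DT"),
          ("native", "JJ"), ("molecular", "JJ"), ("weight", "NN"),
          ("of", "IN"), ("236,000", "CD")]
phrases = chunk(tagged)
```

With these tags the clause splits into the noun group "The enzyme", the verb group "had", the noun group "a native molecular weight", the particle "of", and the noun group "236,000".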
Complex Phrases • Attaches appositives to their head noun groups • Attaches “of” prepositional phrases to their head noun groups
Complex Phrases • The structures built from basic and complex phrases correspond to entities and events
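Attaching an “of” prepositional phrase to its head noun group can be sketched as a single pass over the basic phrases. The `(kind, text)` phrase representation and the rule of attaching to the immediately preceding noun group are simplifying assumptions; appositives are not handled here.

```python
def attach_of_phrases(phrases):
    """Merge 'NG of NG' sequences into one complex noun group."""
    result, i = [], 0
    while i < len(phrases):
        kind, text = phrases[i]
        # Keep absorbing "of NG" while it directly follows a noun group.
        while (kind == "NG" and i + 2 < len(phrases)
               and phrases[i + 1] == ("PARTICLE", "of")
               and phrases[i + 2][0] == "NG"):
            text = f"{text} of {phrases[i + 2][1]}"
            i += 2
        result.append((kind, text))
        i += 1
    return result


# Basic phrases for the first clause of the example, written out by hand.
basic = [("NG", "The enzyme"), ("VG", "had"),
         ("NG", "a native molecular weight"),
         ("PARTICLE", "of"), ("NG", "236,000")]
complex_phrases = attach_of_phrases(basic)
```

After this step the clause has the shape noun group, verb group, noun group, which is the form the later domain patterns look for.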
Clause-Level Domain Patterns The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
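A clause-level domain pattern maps a recognized clause onto slots of the target object. The sketch below makes two simplifying assumptions: it matches regular expressions against the raw sentence instead of the phrase structures built by the earlier stages, and it stores the subunit weight in a nested dictionary under Subunit-Component. Both patterns are hypothetical and cover only this sentence shape.

```python
import re

SENTENCE = ("The enzyme had a native molecular weight of 236,000 and was "
            "apparently comprised of six identical 40,000-dalton subunits.")

NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

# Hypothetical patterns for two clause types.
WEIGHT_RE = re.compile(
    r"(?:had|has) an? (?:native )?molecular weight of ([\d,]+)")
SUBUNIT_RE = re.compile(
    r"(?:comprised|composed) of (\w+) identical ([\d,]+)-dalton subunits")


def to_int(number_text):
    return int(number_text.replace(",", ""))


def extract(sentence):
    record = {}
    match = WEIGHT_RE.search(sentence)
    if match:
        record["Molecular-Weight"] = to_int(match.group(1))
    match = SUBUNIT_RE.search(sentence)
    if match:
        count = match.group(1)
        # Unknown number words are kept as text rather than guessed.
        record["Subunit-Number"] = NUMBER_WORDS.get(count, count)
        record["Subunit-Component"] = {"Molecular-Weight": to_int(match.group(2))}
    return record


record = extract(SENTENCE)
```

For the example sentence this fills Molecular-Weight with 236000, Subunit-Number with 6, and a subunit weight of 40000.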
Merging Structures • First 4 levels: operate within a single sentence • This level: collects and combines information about one entity or relationship • Three Criteria: • The internal structure of the noun groups • The nearness along some metric • Consistency and compatibility of the two structures
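The third criterion, consistency and compatibility, can be sketched as a slot-by-slot check on two partial records. This illustration covers only that criterion; noun-group structure and nearness are not modeled, and the dictionary representation with `None` for unfilled slots is an assumption.

```python
def compatible(a, b):
    """Two structures are compatible if no slot filled in both disagrees."""
    return all(a[key] == b[key]
               for key in a.keys() & b.keys()
               if a[key] is not None and b[key] is not None)


def merge(a, b):
    """Return the combined structure, or None if the two conflict."""
    if not compatible(a, b):
        return None
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) is None:
            merged[key] = value
    return merged


# Partial structures from the two sentences of the example passage.
first = {"Name": "gamma-Glutamyl kinase", "Molecular-Weight": None}
second = {"Name": None, "Molecular-Weight": 236000, "Subunit-Number": 6}
enzyme = merge(first, second)

# A structure with a different weight must not be merged in.
subunit = {"Molecular-Weight": 40000}
conflict = merge(second, subunit)
```

Here `enzyme` combines the name from the first sentence with the weight and subunit count from the second, while `conflict` is `None` because the two weights disagree.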
Compile-Time Transformations • A simple Subject-Verb-Object pattern is expanded at compile time into its linguistic variants (passive, relative clauses, etc.)
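The idea of expanding one Subject-Verb-Object specification into several surface patterns can be sketched as follows. The function name, the placeholder syntax (`<ENZYME>`, `<REACTION>`), the verb "catalyze", and the particular set of variants are all hypothetical; the slides do not say which variants the system generates beyond passives and relative clauses.

```python
def expand_svo(subj, verb_3sg, verb_participle, obj):
    """Expand one S-V-O specification into several surface patterns."""
    return {
        "active": f"{subj} {verb_3sg} {obj}",
        "passive": f"{obj} is {verb_participle} by {subj}",
        "subject_relative": f"{subj} that {verb_3sg} {obj}",
        "reduced_relative": f"{obj} {verb_participle} by {subj}",
    }


# Hypothetical domain pattern: an enzyme catalyzes a reaction.
patterns = expand_svo("<ENZYME>", "catalyzes", "catalyzed", "<REACTION>")
```

The pattern author writes only the active form; the other variants come from the expansion step, so every domain pattern gains the same coverage without extra effort.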
Types of Specialized Domains • “noun-driven” approach • The type of an entity is highly predictive of its role in an event • Loose S-V-O patterns • “verb-driven” approach • The roles of entities in events cannot be predicted from their types • Tight S-V-O patterns
Limitations of IE Technology • MUC (1990): • Name recognition: ~95% recall and precision • Event recognition: ~60% recall and precision • Possible reasons: • The merging process • Only works with explicitly stated information • Common cases are covered, but what about the rare ones?