Information Extraction from Biomedical Text

Jerry R. Hobbs, Artificial Intelligence Center, SRI International


Presentation Transcript


  1. Information Extraction from Biomedical Text
     Jerry R. Hobbs, Artificial Intelligence Center, SRI International

  2. Introduction
     • Information Extraction:
       • Extract entities, relations, events
       • Capture structured information
       • Domain specific
       • Focus only on relevant parts
     • Mainly of economic and military interest?
     • Biomedical domain

  3. Cascaded Finite-State Transducers
     • Separate processing into several stages
     • FASTUS (Finite-State Automaton Text Understanding System)
     • Earlier stages:
       • Smaller linguistic objects
       • Domain independent
     • Later stages:
       • Domain-dependent patterns

  4. Cascaded Finite-State Transducers
     • Complex Words
     • Basic Phrases
     • Complex Phrases
     • Domain Patterns
     • Merging Structures
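The cascade idea can be illustrated with a minimal sketch in which each stage is a function from a token list to a token list, and the output of one stage is the input of the next. This is not FASTUS itself: the two stage functions, the `NG[...]` notation, and the word sets are assumptions made up for illustration, and only toy versions of the first two stages are shown.

```python
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]

# Hypothetical word sets for this sketch only.
DETERMINERS = {"a", "an", "the"}
BOUNDARIES = DETERMINERS | {"from", "of", "to", "was", "had", "and"}


def fuse_names(tokens: List[str]) -> List[str]:
    """Toy 'complex words' stage: fuse one hard-coded two-word name."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == ["Escherichia", "coli"]:
            out.append("Escherichia_coli")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def group_nouns(tokens: List[str]) -> List[str]:
    """Toy 'basic phrases' stage: a determiner opens a noun group
    that runs up to the next boundary word."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in DETERMINERS:
            j = i + 1
            while j < len(tokens) and tokens[j] not in BOUNDARIES:
                j += 1
            out.append("NG[" + " ".join(tokens[i:j]) + "]")
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out


def run_cascade(stages: List[Stage], tokens: List[str]) -> List[str]:
    """Feed each stage's output to the next stage."""
    for stage in stages:
        tokens = stage(tokens)
    return tokens


result = run_cascade([fuse_names, group_nouns],
                     "purified from an Escherichia coli strain".split())
print(result)
```

Because the later stage sees `Escherichia_coli` as a single token, it never has to know that the name was originally two words, which is the point of ordering the stages from smaller to larger linguistic objects.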

  5. Example
     gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.

  6. Target Database
     • Reaction Object:
       • Attributes: ID, Pathway, Enzyme, ..
     • Enzyme Object:
       • Attributes: ID, Name, Molecular-Weight, Subunit-Component, Subunit-Number
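One way to picture the target database is as two record types whose fields mirror the attribute names on the slide. The sketch below is an assumption about representation, not the actual schema: the field types, the `"E1"` and `"R1"` identifiers, and the choice to store the enzyme as a nested object are all invented for illustration, and the attributes elided on the slide ("..") are simply left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enzyme:
    id: str
    name: Optional[str] = None
    molecular_weight: Optional[int] = None
    subunit_component: Optional[str] = None  # type is an assumption
    subunit_number: Optional[int] = None


@dataclass
class Reaction:
    id: str
    pathway: Optional[str] = None
    enzyme: Optional[Enzyme] = None  # nesting is an assumption


# Values taken from the example abstract; the IDs are hypothetical.
enzyme = Enzyme(id="E1", name="gamma-Glutamyl kinase",
                molecular_weight=236_000, subunit_number=6)
reaction = Reaction(id="R1", pathway="proline biosynthetic pathway",
                    enzyme=enzyme)
print(reaction)
```

Defaulting every attribute except the ID to `None` reflects that a single sentence usually fills only part of a record; the remaining slots are filled when structures are merged later.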

  7. Complex Words
     gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
     • Recognizes:
       • Multiword fixed phrases
       • Proper names
     • Both are plentiful in the biological domain
     • Use a lexicon, or machine-learning and statistical methods
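The lexicon option for recognizing complex words can be sketched as a greedy longest-match lookup over the token stream. The three lexicon entries below are taken from the example abstract, but the lexicon itself, the function name, and the matching strategy are assumptions for illustration; the slides do not say how the lookup is implemented.

```python
from typing import List, Set, Tuple

# Hypothetical mini-lexicon of multiword terms.
LEXICON: Set[Tuple[str, ...]] = {
    ("gamma-Glutamyl", "kinase"),
    ("Escherichia", "coli"),
    ("proline", "biosynthetic", "pathway"),
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def mark_complex_words(tokens: List[str]) -> List[str]:
    """Replace each lexicon match with a single multiword token,
    preferring the longest match at every position."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in LEXICON:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword entry starts here; keep the token as is.
            out.append(tokens[i])
            i += 1
    return out


tokens = "the proline biosynthetic pathway of Escherichia coli".split()
marked = mark_complex_words(tokens)
print(marked)
```

A fixed lexicon only finds names it already contains, which is one reason the slide lists learned and statistical recognizers as the alternative.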

  8. Basic Phrases • Segment a sentence into noun groups, verb groups, and particles • Use Sager 1981 grammar

  9. Complex Phrases
     • Attach appositives to their head noun groups
     • Attach “of” prepositional phrases to their head noun groups
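Attaching "of" prepositional phrases to their head noun groups can be sketched as a pass over the output of the basic-phrase stage. The `(label, text)` phrase representation and the labels `NG`, `VG`, and `P` are assumptions of this sketch, and appositive attachment is not covered.

```python
from typing import List, Tuple

Phrase = Tuple[str, str]  # (label, text); labels are this sketch's own


def attach_of_phrases(phrases: List[Phrase]) -> List[Phrase]:
    """Fold every 'NG of NG' sequence into one complex noun group."""
    out: List[Phrase] = []
    i = 0
    while i < len(phrases):
        label, text = phrases[i]
        # Keep absorbing "of" + noun group while the head is a noun group.
        while (label == "NG" and i + 2 < len(phrases)
               and phrases[i + 1] == ("P", "of")
               and phrases[i + 2][0] == "NG"):
            text = f"{text} of {phrases[i + 2][1]}"
            i += 2
        out.append((label, text))
        i += 1
    return out


basic = [("NG", "The enzyme"), ("VG", "had"),
         ("NG", "a native molecular weight"), ("P", "of"),
         ("NG", "236,000")]
complex_phrases = attach_of_phrases(basic)
print(complex_phrases)
```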

  10. Complex Phrases • Structures of basic and complex phrases, entities and events

  11. Clause-Level Domain Patterns
      The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.

  12. Clause-Level Domain Patterns
      The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
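To give a feel for what a clause-level domain pattern extracts from this sentence, here is a simplified sketch that uses regular expressions over the raw string. A system like the one described would match over the phrases produced by earlier stages rather than over characters, so treat the two patterns, the `NUMBER_WORDS` table, and the output keys (including `subunit_weight`, which is not one of the slide's attribute names) as assumptions for illustration.

```python
import re
from typing import Dict, Optional

SENTENCE = ("The enzyme had a native molecular weight of 236,000 and was "
            "apparently comprised of six identical 40,000-dalton subunits.")

# Hypothetical lookup for spelled-out counts.
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

WEIGHT = re.compile(r"molecular weight of (?P<weight>\d{1,3}(?:,\d{3})*)")
SUBUNITS = re.compile(
    r"comprised of (?P<count>\w+) identical "
    r"(?P<unit>\d{1,3}(?:,\d{3})*)-dalton subunits")


def to_int(text: str) -> int:
    """Convert '236,000' to 236000."""
    return int(text.replace(",", ""))


def extract(sentence: str) -> Dict[str, Optional[int]]:
    """Fill whichever enzyme attributes the two patterns can find."""
    record: Dict[str, Optional[int]] = {}
    match = WEIGHT.search(sentence)
    if match:
        record["molecular_weight"] = to_int(match.group("weight"))
    match = SUBUNITS.search(sentence)
    if match:
        record["subunit_number"] = NUMBER_WORDS.get(match.group("count"))
        record["subunit_weight"] = to_int(match.group("unit"))
    return record


record = extract(SENTENCE)
print(record)
```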

  13. Merging Structures
      • First 4 levels: process within a single sentence
      • This level: collect and combine information for one entity or relationship
      • Three criteria:
        • The internal structure of noun groups
        • The nearness along some metric
        • Consistency and compatibility of the two structures
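The third criterion, consistency and compatibility, can be sketched as a slot-by-slot merge that refuses to combine two structures when any filled slot disagrees. This covers only that one criterion; the dictionary representation and the `merge` function are assumptions for illustration, and the noun-group and nearness criteria are not modeled.

```python
from typing import Any, Dict, Optional

Structure = Dict[str, Any]


def merge(a: Structure, b: Structure) -> Optional[Structure]:
    """Combine two partial structures, or return None on a conflict.

    A slot holding None counts as unfilled and never conflicts.
    """
    merged = dict(a)
    for key, value in b.items():
        if value is None:
            continue
        if merged.get(key) is not None and merged[key] != value:
            return None  # incompatible values for the same slot
        merged[key] = value
    return merged


# Partial structures from the two sentences of the example abstract.
first = {"name": "gamma-Glutamyl kinase", "molecular_weight": None}
second = {"name": None, "molecular_weight": 236000, "subunit_number": 6}
combined = merge(first, second)
print(combined)

# A structure describing the 40,000-dalton subunit must not be merged in.
subunit = {"molecular_weight": 40000}
print(merge(combined, subunit))
```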

  14. Compile-Time Transformations • Subject-Verb-Object pattern → linguistic variants (passive, relative clauses, etc.)
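The idea of expanding one Subject-Verb-Object pattern into its linguistic variants ahead of time can be sketched with simple string templates. The placeholder syntax (`<ENZYME>`, `<REACTION>`), the variant names, and the verb "catalyze" are all hypothetical; the slides do not show the actual pattern notation.

```python
from typing import Dict


def expand_svo(subject: str, past: str, participle: str,
               obj: str) -> Dict[str, str]:
    """Generate surface variants from a single S-V-O specification."""
    return {
        "active": f"{subject} {past} {obj}",
        "passive": f"{obj} was {participle} by {subject}",
        "subject_relative": f"{subject} that {past} {obj}",
        "object_relative": f"{obj} that {subject} {past}",
    }


variants = expand_svo("<ENZYME>", "catalyzed", "catalyzed", "<REACTION>")
for name, pattern in variants.items():
    print(f"{name}: {pattern}")
```

The pattern author writes only the active form; the other forms are derived once, when the patterns are compiled, rather than being recognized by extra logic at run time.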

  15. Types of Specialized Domains
      • “Noun-driven” approach
        • The type of an entity is highly predictive of its role in an event
        • Loose S-V-O patterns
      • “Verb-driven” approach
        • The role of the entities in events cannot be predicted from their type
        • Tight S-V-O patterns

  16. Limitations of IE Technology
      • MUC (1990):
        • Name recognition: ~95% recall and precision
        • Event recognition: ~60% recall and precision
      • Possible reasons:
        • The merging process
        • Only works with explicit information
        • Common cases are covered, but what about rare cases?
