The Artificial Intelligence Research Centre at the Russian Academy of Science in Pereslavl-Zalessky is focused on developing innovative information extraction (IE) systems. Our objective is to extract meaningful information from large texts for analytical purposes, generating structured data in specified formats. Our research addresses various applications including knowledge acquisition, query formulation, automatic summarization, and content visualization. Utilizing advanced tokenization, morphological, and microsyntactic analysis, our methodologies enhance the precision and relevance of information extraction tasks.
Artificial Intelligence Research Centre, Program Systems Institute, Russian Academy of Science, 152020 Pereslavl-Zalessky, Russia
INEX: Tools for Information Extraction
+7 48535 98065
inex@epk.botik.ru
Information extraction
Objective:
• extract meaningful information of a pre-specified type from (typically large amounts of) texts for further analytical purposes
Output:
• data structures of a pre-specified format (filled scenario templates)
Examples
• Sports report: <winner>, <loser>, <score>, <location>, <date>…
• Database of rental accommodation offers: <location>, <rental price>, <number of bedrooms>, <phone number>…
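A filled scenario template can be pictured as a record with one field per slot. The sketch below is only an illustration: the class name `RentalTemplate`, its field names and the sample values are assumptions made here, not part of INEX.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RentalTemplate:
    """Hypothetical scenario template for a rental listing (names are assumptions)."""
    location: Optional[str] = None
    rental_price: Optional[str] = None
    bedrooms: Optional[int] = None
    phone_number: Optional[str] = None

    def filled_slots(self) -> dict:
        # Keep only the slots that extraction managed to fill.
        return {slot: value for slot, value in asdict(self).items() if value is not None}


# A partially filled template: two slots found, two still empty.
listing = RentalTemplate(location="Pereslavl-Zalessky", bedrooms=2)
print(listing.filled_slots())
```

Empty slots stay `None`, so a later stage can tell which information is still missing.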
Possible IE application scenarios:
• inference of new information (knowledge acquisition)
• query formulation and answering in human-computer systems
• automatic generation of abstracts and summaries
• visualization of document content, etc.
The "Newsmaking" task
• <newsmaker>
• <type of newsmaker> (person or organization)
• <message>
• <type of message> (original, cited, a reference to another newsmaker)
Tokenisation & sentence segmentation
• Tokenisation: identification of words, punctuation marks, delimiters, special characters
• Sentence segmentation: recognition of sentence boundaries
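Both steps can be approximated with regular expressions. This is a minimal sketch and not the INEX tokeniser: the patterns `TOKEN_RE` and `SENT_END_RE` and the function names are assumptions, and the sentence splitter will wrongly break after abbreviations such as "Dr.".

```python
import re

# A token is a word (optionally with inner hyphens/apostrophes) or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# A sentence boundary: whitespace after . ! or ? that is followed by a capital letter.
SENT_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def tokenize(text):
    """Return the list of word and punctuation tokens in text."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split text into sentences at the (naive) boundaries defined above."""
    return [s for s in SENT_END_RE.split(text.strip()) if s]


print(split_sentences("INEX reads text. It fills templates! Done?"))
print(tokenize("The team won 3-1, didn't it?"))
```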
Morphological analysis
• maps every word-form of the input text to (a) canonical form(s)
• recognizes the word's morphological properties
Results are typically ambiguous.
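The ambiguity is easy to see with a toy lexicon lookup. The sketch below assumes a hand-written dictionary (`LEXICON`) with English entries chosen for illustration; a real analyser would use a full morphological dictionary or rules.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str      # canonical form
    pos: str        # part of speech
    features: str   # other morphological properties


# Toy lexicon (an assumption for illustration): word-form -> all possible analyses.
LEXICON = {
    "saw": [Analysis("see", "VERB", "past"), Analysis("saw", "NOUN", "sg")],
    "reports": [Analysis("report", "NOUN", "pl"), Analysis("report", "VERB", "3sg")],
    "minister": [Analysis("minister", "NOUN", "sg")],
}


def analyse(word_form):
    """Return every analysis of a word-form; unknown words get a placeholder."""
    return LEXICON.get(word_form.lower(), [Analysis(word_form, "UNKNOWN", "")])


for analysis in analyse("saw"):
    print(analysis)
```

The word-form "saw" yields two analyses with different lemmas, which is why a disambiguation step is needed later.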
Filtering
• reduces the text passed on to further processing to its potentially relevant portions
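One simple way to filter is to keep only sentences that contain a trigger word for the target scenario. This is a sketch under that assumption; the function name and the trigger list are hypothetical, and the source does not say which filtering criterion INEX uses.

```python
def filter_sentences(sentences, trigger_words):
    """Keep the sentences that contain at least one trigger word."""
    triggers = {w.lower() for w in trigger_words}
    relevant = []
    for sentence in sentences:
        # Crude word split; strip surrounding punctuation before comparing.
        words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
        if words & triggers:
            relevant.append(sentence)
    return relevant


sentences = [
    "The minister said taxes will fall.",
    "It rained all day.",
    "Analysts reported a loss.",
]
print(filter_sentences(sentences, ["said", "reported", "announced"]))
```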
Disambiguation
Can be performed as:
• a side effect of other processes (e.g., microsyntactic analysis)
• a stand-alone stage
Microsyntactic analysis
• identifies noun phrases (NP)
• identifies some regularly formed constructions (numbers, dates, personal proper names)
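Both bullets can be illustrated with shallow patterns. The sketch below assumes tokens already carry simplified part-of-speech tags (`DET`, `ADJ`, `NOUN`, ...) and treats a noun phrase as an optional determiner, any adjectives, then one or more nouns; the tag set, the date format and all names are assumptions for illustration.

```python
import re

# Regularly formed construction: dates such as "12 March 2003".
DATE_RE = re.compile(
    r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)\s+\d{4}\b"
)


def chunk_noun_phrases(tagged):
    """Find NPs of the shape DET? ADJ* NOUN+ in a list of (word, tag) pairs."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":    # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":   # one or more nouns
            k += 1
        if k > j:
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


tagged = [("the", "DET"), ("new", "ADJ"), ("prime", "ADJ"), ("minister", "NOUN"),
          ("visited", "VERB"), ("Moscow", "NOUN")]
print(chunk_noun_phrases(tagged))
print(DATE_RE.findall("Signed on 12 March 2003 in Moscow."))
```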
Macrosyntactic analysis
• identifies clause boundaries
• constructs the clause hierarchy within a sentence
Named entity recognizer
• identifies proper names
• assigns semantic features to certain items
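A common baseline for this step is a gazetteer lookup with longest-match-first search. The sketch below assumes such a lookup; the gazetteer entries, the labels and the function name are hypothetical and say nothing about how the INEX recognizer works internally.

```python
def tag_entities(tokens, gazetteer, max_len=4):
    """Return (entity text, semantic label) pairs found by longest-match lookup."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            label = gazetteer.get(candidate.lower())
            if label is not None:
                entities.append((candidate, label))
                i += length
                break
        else:
            i += 1  # no entity starts at this token
    return entities


gazetteer = {
    "program systems institute": "ORGANIZATION",
    "pereslavl-zalessky": "LOCATION",
}
tokens = "The Program Systems Institute is in Pereslavl-Zalessky".split()
print(tag_entities(tokens, gazetteer))
```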
Information extraction rules
• a domain knowledge representation formalism (scenario templates)
• a set of patterns to identify template elements in a text (covering the many possible ways to talk about the target event elements)
An IE pattern includes:
• a set of rules that define how to retrieve this pattern in a text
• a set of constraints that textual elements must satisfy to fit into a particular slot of the target template
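For the Newsmaking task, a single pattern might pair a retrieval rule with slot constraints as sketched below. Everything here is an assumption made for illustration: the regular expression, the list of reporting verbs and the slot names are not taken from INEX, and the rule covers only sentences of the form "X said that Y".

```python
import re

# Retrieval rule: capitalised name, one verb, "that", then the message.
PATTERN = re.compile(
    r"^(?P<newsmaker>[A-Z][\w.]*(?: [A-Z][\w.]*)*) "
    r"(?P<verb>\w+) that (?P<message>.+?)\.?$"
)

# Constraint: the verb must be a reporting verb (hypothetical list).
REPORTING_VERBS = {"said", "stated", "announced"}


def apply_pattern(sentence):
    """Return a partially filled template, or None if rule or constraint fails."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    if match.group("verb").lower() not in REPORTING_VERBS:
        return None
    return {"newsmaker": match.group("newsmaker"), "message": match.group("message")}


print(apply_pattern("Ivan Petrov said that the budget was approved."))
print(apply_pattern("Ivan Petrov hoped that it would rain."))
```

Covering the many ways a message can be attributed would take a set of such patterns, one per phrasing.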
Coreference resolver
• recognizes different occurrences of the same entity in a text
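A very rough approximation for proper names is to link two mentions when the words of one are a subset of the words of the other, so that "Petrov" joins "Ivan Petrov". This sketch assumes that heuristic only; it ignores pronouns and descriptions, and the function name is hypothetical.

```python
def group_mentions(mentions):
    """Cluster name mentions whose word sets contain one another."""
    clusters = []
    for mention in mentions:
        words = set(mention.lower().split())
        for cluster in clusters:
            # Compare against the first mention of each existing cluster.
            head = set(cluster[0].lower().split())
            if words <= head or head <= words:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])  # start a new entity
    return clusters


mentions = ["Ivan Petrov", "Petrov", "Anna Sidorova", "Sidorova", "Ivan Petrov"]
print(group_mentions(mentions))
```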
Merging partial results
• merging partially filled templates to produce a final, maximally filled template
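Merging can be sketched as filling each empty slot from the first partial template that has a value for it. The conflict policy used here (the earliest non-empty value wins) is an assumption chosen for simplicity, and the sample slot values are invented.

```python
def merge_templates(partials):
    """Combine partially filled templates (dicts) into one maximally filled dict."""
    merged = {}
    for partial in partials:
        for slot, value in partial.items():
            # Fill the slot only if it is still empty; earlier values win.
            if merged.get(slot) is None:
                merged[slot] = value
    return merged


partials = [
    {"winner": "Spartak", "score": None},
    {"score": "2-1", "location": "Moscow"},
    {"winner": "Dynamo", "date": None},
]
print(merge_templates(partials))
```

Slots that no partial template could fill remain `None` in the result.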