An overview of document databases for information management, including annotation formats such as SGML and XML. The slides treat language technologies (LT) as processes that add annotations to documents, from document-level categorisation and summarisation down to fine-grained information extraction. They then describe information retrieval as the matching of query representations against representations of information units, and close with research issues concerning relevance ranking, efficient indexing, and the representation of information content (ontologies).
Document Databases for Information Management • Gregor Erbach • FTW, Wien; DFKI, Saarbrücken; ETL, Tsukuba • erbach@ftw.at
IM = DM? • Is Information Management the same as Document Management? • No, because the relevant information may be distributed across several documents, or may only be a small part of a document • Then what is information management? • Extraction, storage, indexing and retrieval of information units contained in documents.
IM Applications • Document Retrieval • Routing • Question Answering • Factual Database Construction • Summarisation
Document Annotation • Document Annotation adds information to documents • Annotation Formats: SGML, XML, LaTeX, ... • Annotation Standards: HTML, NITF, TEI, CES, GDA, Map Task, TreeBank, DublinCore
Formal Properties of XML • Tree structures • nodes with attribute/value pairs • node content is a string which can contain XML trees • nodes can have identifiers • no type hierarchy
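The formal properties listed above can be illustrated with a small sketch using Python's standard `xml.etree.ElementTree` module. The sample markup, the element names (`s`, `np`) and the attribute names (`id`, `lang`, `type`) are assumptions made up for this illustration; they are not taken from any of the annotation standards named in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; tag and attribute names are illustrative only.
SAMPLE = (
    '<s id="s1" lang="en">'
    '<np id="np1" type="person">Gregor</np> works in '
    '<np id="np2" type="place">Wien</np>.'
    '</s>'
)

root = ET.fromstring(SAMPLE)


def describe(node, depth=0):
    """Return one line per node: tag, attribute/value pairs, and leading text."""
    lines = ["  " * depth + f"{node.tag} {dict(node.attrib)} text={node.text!r}"]
    for child in node:  # node content can itself contain XML trees
        lines.extend(describe(child, depth + 1))
    return lines


def find_by_id(tree, ident):
    """Look up a node by its identifier attribute, or return None."""
    for node in tree.iter():
        if node.get("id") == ident:
            return node
    return None


print("\n".join(describe(root)))
print(find_by_id(root, "np2").text)
```

Note that nothing in the tree says that a `place` and a `person` are both kinds of entity: the `type` values are plain strings, which is the "no type hierarchy" point.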
Language Technologies • Think of language technologies as processes that add annotations to documents, based on an analysis of the documents' linguistic content. • This point of view allows a uniform treatment of human-generated and LT-generated annotations.
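One way to picture this uniform treatment is a single annotation record that carries its origin as just another field. The following is a minimal sketch under assumptions: the `Annotation` record, its field names, and the toy tagger are hypothetical, and the regular expression merely stands in for a real language technology component.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset one past the end of the span
    label: str   # what the annotation says about the span
    source: str  # "human" or "lt"; both kinds share one representation


def toy_capitalised_word_tagger(text):
    """Stand-in for an LT component: marks capitalised words as 'name' candidates."""
    return [
        Annotation(m.start(), m.end(), "name", "lt")
        for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
    ]


text = "the institute in Wien cooperates with Tsukuba"
human = [Annotation(0, 13, "np", "human")]  # a hand-made annotation
annotations = sorted(human + toy_capitalised_word_tagger(text), key=lambda a: a.start)

for a in annotations:
    print(a.source, a.label, text[a.start:a.end])
```

Because both kinds of annotation have the same shape, later processing steps such as indexing or retrieval need not care which process produced them.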
Document-Level LT • Language Identification • Categorisation • Summarisation • All of these can also be applied to parts of documents.
Collection-Level LT • Clustering • Topic detection and tracking • Multi-document summarisation
Fine-Grained LT • Morphology • Part-of-speech tagging • (Shallow) parsing • Coreference resolution • Information extraction
LT and Document Annotation • [Diagram: (Annotated) Text Document → LT → Annotated Text Document]
Information Retrieval • Retrieval of information units in response to an information need • How is the information need stated (keywords, questions, examples)? • How is the information need represented? • How are information units represented? • How are the representations matched?
How are documents represented? • XML trees • index of word/phrase occurrences • index of relations (represented as feature structures) • The word, phrase, and relation indexes should have pointers to text locations.
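The idea of an index with pointers to text locations can be sketched as a positional inverted index. This is a minimal illustration for single words only, assuming a simple regular-expression tokeniser; the function name `build_index` and the sample documents are hypothetical, and a relation index would store feature structures in place of word strings.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to a list of (doc_id, character_offset) pointers."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((doc_id, match.start()))
    return index


docs = {
    "d1": "XML trees represent documents",
    "d2": "Documents are indexed",
}
index = build_index(docs)

# Each hit points back to the exact location in the source text.
for doc_id, offset in index["documents"]:
    print(doc_id, offset, docs[doc_id][offset:offset + len("documents")])
```

Keeping offsets rather than just document identifiers is what makes it possible to return an information unit smaller than a whole document.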
How are queries represented? • Words / phrases • relations (expressed as feature structures)
How are representations matched? • Unification • Apparent mismatches between query and representation can be resolved by relaxation of the query. • Inference by forward or backward chaining, as required.
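Unification and query relaxation can be sketched as follows. This is a simplified illustration, not the formalism used by the author: feature structures are assumed to be nested Python dicts with atomic leaf values, there is no structure sharing (reentrancy) and no type hierarchy, and relaxation is assumed to mean dropping one top-level feature from the query. The function names and the sample relation are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None if the two structures clash.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None  # clash somewhere below this feature
                result[key] = merged
            else:
                result[key] = value
        return result
    return a if a == b else None  # atoms unify only if they are equal


def match_with_relaxation(query, fact, relaxable):
    """Try plain unification first; on failure, drop one relaxable feature at a time.

    Returns (result, dropped_feature); dropped_feature is None if no relaxation was needed.
    """
    result = unify(query, fact)
    if result is not None:
        return result, None
    for feature in relaxable:
        if feature in query:
            relaxed = {k: v for k, v in query.items() if k != feature}
            result = unify(relaxed, fact)
            if result is not None:
                return result, feature
    return None, None


fact = {"pred": "work", "agent": {"name": "Erbach"}, "location": "Wien"}
query = {"pred": "work", "agent": {"name": "Erbach"}, "location": "Tsukuba"}

print(unify(query, fact))
print(match_with_relaxation(query, fact, relaxable=["location"]))
```

Reporting which feature was dropped lets a ranking step score relaxed matches below exact ones.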
Research Issues • Relevance ranking for feature-structure-based queries • Efficient indexing and matching of feature structures (fast unification) • Representing information content (ontologies) in the formalism