Document Databases for Information Management

Gregor Erbach • FTW, Wien • DFKI, Saarbrücken • ETL, Tsukuba • erbach@ftw.at

IM = DM? • Is Information Management the same as Document Management? • No, because the relevant information may be distributed across several documents, or may only be a small part of a document • Then what is information management? • Extraction, storage, indexing and retrieval of information units contained in documents.

IM Applications • Document Retrieval • Routing • Question Answering • Factual Database Construction • Summarisation

Document Annotation • Document annotation adds information to documents • Annotation Formats: SGML, XML, LaTeX, ... • Annotation Standards: HTML, NITF, TEI, CES, GDA, Map Task, TreeBank, Dublin Core

Formal Properties of XML • Tree structures • nodes with attribute/value pairs • node content is a string which can contain XML trees • nodes can have identifiers • no type hierarchy
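
The properties listed above can be made concrete with a small sketch. It uses Python's standard `xml.etree.ElementTree` module; the sample sentence and the element and attribute names (`s`, `np`, `type`) are invented for illustration and do not follow any of the annotation standards named earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; tag and attribute names are illustrative only.
SAMPLE = (
    '<s id="s1" lang="en">'
    '<np id="np1" type="person">Gregor</np> visited '
    '<np id="np2" type="place">Tsukuba</np>.'
    '</s>'
)

root = ET.fromstring(SAMPLE)


def describe(node, depth=0):
    """Return one line per node: tag, attribute/value pairs, leading text."""
    lines = ["  " * depth + f"{node.tag} {dict(node.attrib)} text={node.text!r}"]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


# Nodes can have identifiers: build a lookup table keyed on the 'id' attribute.
by_id = {n.get("id"): n for n in root.iter() if n.get("id") is not None}

# Node content is a string that can contain XML trees (mixed content):
# itertext() walks the text pieces of <s> and of its children in order.
full_text = "".join(root.itertext())

if __name__ == "__main__":
    print("\n".join(describe(root)))
    print(by_id["np2"].attrib)
    print(full_text)
```

Note that nothing in the tree says that a `person` and a `place` are both kinds of entity; that is the "no type hierarchy" point.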

Language Technologies • Think of language technologies as processes that add annotations to documents, based on an analysis of the documents' linguistic content. • This point of view allows a uniform treatment of human-generated and LT-generated annotations.
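
A minimal sketch of this view, under stated assumptions: annotations are stored as stand-off records (character offsets plus a label), and an LT component is any function that returns such records for a document. The names `Annotation`, `Document`, `annotate` and the toy `capitalised_word_tagger` are hypothetical and not taken from any existing system.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the span begins
    end: int      # character offset just past the span
    label: str
    source: str   # "human" or the name of the LT component


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def capitalised_word_tagger(doc):
    """Toy LT component: marks capitalised words as name candidates.

    Deliberately naive; it also marks sentence-initial words such as "The".
    """
    return [
        Annotation(m.start(), m.end(), "name-candidate", "capitalised_word_tagger")
        for m in re.finditer(r"\b[A-Z][a-z]+\b", doc.text)
    ]


def annotate(doc, *components):
    """Run each component and add its annotations to the document."""
    for component in components:
        doc.annotations.extend(component(doc))
    return doc


doc = Document("The meeting in Tsukuba was chaired by Erbach.")
# A human-generated annotation uses exactly the same record type.
doc.annotations.append(Annotation(4, 11, "event", "human"))
annotate(doc, capitalised_word_tagger)

labelled = [(doc.text[a.start:a.end], a.label, a.source) for a in doc.annotations]
```

Because both kinds of annotation share one record type, downstream indexing and retrieval code does not need to know where an annotation came from unless it chooses to inspect `source`.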

Document-Level LT • Language Identification • Categorisation • Summarisation (All of these can also be applied to parts of documents.)

Collection-Level LT • Clustering • Topic detection and tracking • Multi-document summarisation

Fine-Grained LT • Morphology • Part-of-speech Tagging • (shallow) parsing • coreference resolution • information extraction

LT and Document Annotation • [Diagram: (Annotated) Text Document -> LT -> Annotated Text Document]

Information Retrieval • Retrieval of information units in response to an information need • How is the information need stated (keywords, questions, examples)? • How is the information need represented? • How are information units represented? • How are the representations matched?

How are documents represented? • XML trees • index of word/phrase occurrences • index of relations (represented as feature structures) • The word, phrase and relation indexes should have pointers to text locations
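
As an illustration of an index with pointers to text locations, the following sketch builds a word-occurrence index over two invented documents. The function names and the `(doc_id, start, end)` pointer format are assumptions made for this example; a phrase or relation index could store the same kind of pointer.

```python
import re
from collections import defaultdict


def build_word_index(docs):
    """Map each lower-cased word to a list of (doc_id, start, end) pointers."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for m in re.finditer(r"\w+", text):
            index[m.group().lower()].append((doc_id, m.start(), m.end()))
    return index


# Hypothetical two-document collection.
docs = {
    "d1": "XML trees represent documents.",
    "d2": "Feature structures represent relations in documents.",
}
index = build_word_index(docs)


def lookup(word):
    """Follow the pointers back to the original text spans."""
    return [(d, docs[d][s:e]) for d, s, e in index.get(word.lower(), [])]


if __name__ == "__main__":
    print(index["xml"])
    print(lookup("Documents"))
```

Storing offsets rather than copies of the text keeps the index small and lets a retrieval result be shown in its original context.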

How are queries represented? • Words / phrases • relations (expressed as feature structures)

How are representations matched? • Unification • Apparent mismatches between query and representation can be resolved by relaxation of the query. • Inference by forward or backward chaining, as required.
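
Matching by unification with query relaxation can be sketched as follows. This is a simplified illustration, not the formalism of the talk: feature structures are plain nested dictionaries with atomic leaves (no variables, no structure sharing, and `None` is reserved to signal failure), and relaxation simply drops top-level features in a caller-supplied order. The feature names and values are invented.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts with atomic leaves.

    Returns the merged structure, or None if the two are incompatible.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atoms (or an atom against a dict) unify only if they are equal.
    return a if a == b else None


def match_with_relaxation(query, fact, droppable):
    """Try the full query first, then drop features from `droppable` in order."""
    relaxed = dict(query)
    dropped = []
    while True:
        result = unify(relaxed, fact)
        if result is not None:
            return result, dropped
        remaining = [f for f in droppable if f in relaxed]
        if not remaining:
            return None, dropped
        relaxed.pop(remaining[0])
        dropped.append(remaining[0])


# Hypothetical indexed relation and query.
fact = {"pred": "visit", "agent": {"name": "Erbach"}, "place": "Tsukuba"}
query = {"pred": "visit", "agent": {"name": "Erbach"}, "place": "Wien"}

result, dropped = match_with_relaxation(query, fact, droppable=["place"])
```

Here the full query fails on `place`, and the relaxed query succeeds; the list of dropped features could feed into relevance ranking, since a match that needed less relaxation is presumably a better one.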

Research Issues • Relevance ranking for feature-structure-based queries • Efficient indexing and matching of feature structures (fast unification) • Information content (ontologies) to be represented in the formalism
