This document explores the concepts of information retrieval systems (IRS), focusing on the importance of text and multimedia integration. Text serves as the primary means of data communication, whereas multimedia enhances engagement and interaction. The discussion includes the role of metadata, coding schemes like ASCII and Unicode, various text formats (PDF, RTF), and markup languages (HTML, XML). Additionally, it covers multimedia data types, their storage formats, and the standards for graphics and virtual reality applications.
WMES3103 : INFORMATION RETRIEVAL TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES
INTRODUCTION • Text - the main form of communicating data and information • Text is often supplemented with multimedia elements to make the contents of an IRS more attractive and interactive • A website that combines text and multimedia tends to attract more visitors than one that is text-only • In an IRS, text and multimedia are represented using special languages
Metadata • New concept in information – metadata • Information about how the data is organized, the data domains, and the relationship between them • Data about data • 2 types – descriptive and semantic
Descriptive metadata – metadata that describes a document or a unit of information • Commonly used descriptive metadata: • Authors • Date of publication • Source of publication • Length of document • Type of document
Metadata • Semantic metadata – characterizes the subject matter that can be obtained from the contents of the document • Subject headings • Keywords • LC (Library of Congress) code
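The two kinds of metadata can be pictured as fields on a single record. The following is a minimal sketch only: the class name `DocumentMetadata`, the helper `matches_keyword`, and every field value are hypothetical placeholders, not part of any particular IRS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentMetadata:
    # Descriptive metadata: facts about the document itself.
    authors: List[str]
    publication_date: str
    source: str
    length_pages: int
    doc_type: str
    # Semantic metadata: what the document is about.
    subject_headings: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    lc_code: str = ""


def matches_keyword(meta: DocumentMetadata, term: str) -> bool:
    """Case-insensitive lookup of a term among the semantic keywords."""
    return term.lower() in (k.lower() for k in meta.keywords)


# All values below are invented for illustration.
record = DocumentMetadata(
    authors=["A. Author"],
    publication_date="2000-01-01",
    source="Example Journal",
    length_pages=12,
    doc_type="article",
    subject_headings=["Information retrieval"],
    keywords=["information retrieval", "metadata"],
    lc_code="PLACEHOLDER",
)
```

A query such as `matches_keyword(record, "Metadata")` searches only the semantic fields, while the descriptive fields would typically be used for filtering (for example, by author or date).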
TEXT • With computers, text must be coded into binary digits • First coding schemes – EBCDIC (8 bits per symbol) and ASCII (7 bits per symbol) • ASCII was later extended to 8 bits to accommodate other languages, accents and diacritical marks • Oriental languages – Unicode – originally 16 bits per symbol
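The difference between the coding schemes shows up in how many bytes the same kind of text occupies. This is a small sketch using Python's built-in codecs; Latin-1 is chosen here as one example of an 8-bit extension of ASCII and UTF-16 as a 16-bit Unicode encoding, both of which are assumptions for illustration rather than schemes named in the slides.

```python
text_plain = "IR"       # characters within 7-bit ASCII
text_accented = "café"  # needs an 8-bit extension for the accented letter
text_cjk = "情報"        # needs Unicode

# 7-bit ASCII: one byte per character, every value below 128.
ascii_bytes = text_plain.encode("ascii")

# Plain ASCII cannot represent the accented character.
try:
    text_accented.encode("ascii")
    ascii_fails = False
except UnicodeEncodeError:
    ascii_fails = True

# An 8-bit extension (Latin-1 here) still uses one byte per character.
latin1_bytes = text_accented.encode("latin-1")

# A 16-bit Unicode encoding uses two bytes for each of these characters.
utf16_bytes = text_cjk.encode("utf-16-be")

print(len(ascii_bytes), len(latin1_bytes), len(utf16_bytes), ascii_fails)
```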
TEXT Formats • There is no single format for a text document • A good IRS should be able to retrieve information from any format • Initially, an IRS would convert each document to an internal format, but this had many disadvantages • Now, many new formats have been developed for document interchange
TEXT • RTF – Rich Text Format for word processing • PDF – Portable Document Format for displaying and printing documents • PostScript – powerful programming language for drawing • MIME – Multipurpose Internet Mail Extensions to encode e-mail • Files are often compressed – Compress (Unix), ARJ (PCs), ZIP • Convert binary files to ASCII text – uuencode/uudecode, binhex
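Compression and binary-to-text conversion can both be tried out with Python's standard library. This is a sketch under stated assumptions: `zlib` (DEFLATE, the method typically used inside ZIP archives) stands in for the compression tools listed above, and Base64 is shown next to uuencoding because it is the encoding commonly used with MIME. The sample data is arbitrary.

```python
import base64
import binascii
import zlib

# Repetitive text compresses well.
original_text = b"information retrieval " * 50
compressed = zlib.compress(original_text)
restored = zlib.decompress(compressed)

# Arbitrary binary data that is not safe to send as plain text.
payload = bytes(range(32))

# uuencode one line (binascii.b2a_uu accepts at most 45 bytes per call).
uu_line = binascii.b2a_uu(payload)
uu_decoded = binascii.a2b_uu(uu_line)

# Base64 produces ASCII-only output of any length.
b64_text = base64.b64encode(payload)
b64_decoded = base64.b64decode(b64_text)

print(len(original_text), "->", len(compressed), "bytes after compression")
print(uu_line.decode("ascii").rstrip("\n"))
print(b64_text.decode("ascii"))
```

Both encodings make the data larger, which is why files are usually compressed first and encoded second.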
MARKUP LANGUAGES • Markup = extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc. • Formal markup languages are more structured • Marks = tags - an initial and an ending tag surround the marked text • Standard metalanguage = SGML • New metalanguage for the Web = XML (eXtensible Markup Language) = subset of SGML • Most popular markup language used for the Web = HTML (HyperText Markup Language)
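The idea of tags surrounding the marked text can be illustrated by parsing a small XML fragment. This is a minimal sketch with Python's standard `xml.etree.ElementTree` module; the element names (`document`, `title`, `author`, `body`, `em`) and the `type` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# Each element has an initial tag and an ending tag around its content.
doc = """<document type="article">
  <title>Text and Multimedia Languages</title>
  <author>A. Author</author>
  <body>Markup <em>tags</em> surround the marked text.</body>
</document>"""

root = ET.fromstring(doc)

# Structure information: pick out fields by tag name.
title = root.findtext("title")
author = root.findtext("author")

# Attributes carry extra information about an element.
doc_type = root.get("type")

# Stripping the markup leaves the plain text that could be indexed.
body_text = "".join(root.find("body").itertext())

print(title, "|", author, "|", doc_type)
print(body_text)
```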
MULTIMEDIA • Applications that handle different types of digital data originating from distinct types of media • Text, sound, images, video • Digital data distinct and different in volume, format, and processing requirements • Different types of formats necessary for storing each type of media
MULTIMEDIA • Different formats used commonly on the Web and in digital libraries • Images • Audio • Moving Images • Textual Images • Graphics and Virtual Reality
IMAGES • XBM, BMP, PCX – direct representation of a bit-mapped (or pixel-based) image • GIF (Graphics Interchange Format) – includes compression and is good for black-and-white images or images with a small number of colours or gray levels (256) • JPEG (Joint Photographic Experts Group) – includes compression • TIFF (Tagged Image File Format) – used to exchange documents between different applications and different computer platforms • TGA (Truevision Targa image file) – associated with video game boards • Various other image formats
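Because each format stores its data differently, a system usually has to identify the format before it can decode an image. One common approach, sketched below, is to inspect the first few bytes of the file. The function name `sniff_image_format` is hypothetical, and only four of the formats above are covered.

```python
def sniff_image_format(header: bytes) -> str:
    """Guess an image format from the leading bytes of a file."""
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    if header[:3] == b"\xff\xd8\xff":
        return "JPEG"
    if header[:2] == b"BM":
        return "BMP"
    # TIFF starts with a byte-order mark: little-endian or big-endian.
    if header[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    return "unknown"


# Synthetic headers for illustration; real code would read them from a file.
samples = [
    b"GIF89a" + bytes(10),
    b"\xff\xd8\xff\xe0" + bytes(10),
    b"BM" + bytes(10),
    b"MM\x00*" + bytes(10),
    b"plain text",
]
for sample in samples:
    print(sniff_image_format(sample))
```

In practice the header would come from something like `open(path, "rb").read(8)`; relying on the file extension alone is less robust.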
AUDIO • Must be digitized before storage • AU, MIDI (standard format to interchange music between electronic instruments and computers), WAVE – for small pieces of digital audio • Audio libraries – RealAudio or CD formats • MOVING IMAGES (animation or moving pictures) • MPEG (Moving Picture Experts Group) – related to JPEG • Others – AVI, FLI, QuickTime
TEXTUAL IMAGES • Images that contain mainly typed or typeset text • Obtained by scanning the documents • For archival purposes • Saved as images but with further compression • Textual and non-textual parts are stored and compressed separately and, when needed, can be combined and displayed together
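Digitizing audio means sampling the signal at a fixed rate and storing each sample as a number. The sketch below generates a short tone and stores it in WAVE format using Python's standard `wave` module. The function name `synthesize_tone` and the chosen parameters (8000 samples per second, 16-bit mono, 440 Hz) are assumptions for illustration.

```python
import io
import math
import struct
import wave


def synthesize_tone(freq_hz: float = 440.0, seconds: float = 0.1,
                    rate: int = 8000) -> bytes:
    """Return the bytes of a mono 16-bit WAVE file containing a sine tone."""
    n_samples = round(rate * seconds)
    # Sample the sine wave and quantize each value to a signed 16-bit integer.
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(rate)   # samples per second
        wav.writeframes(frames)
    return buffer.getvalue()


wav_bytes = synthesize_tone()
print(len(wav_bytes), "bytes")
```

Writing `wav_bytes` to a file with a `.wav` extension should give a short playable tone; a higher sampling rate or sample width gives better quality at the cost of more storage.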
GRAPHICS AND VIRTUAL REALITY • 3-dimensional graphics found on Web • CGM (Computer Graphics Metafile) standard • Metafile = collection of elements • CGM standard specifies which elements are allowed to occur in which positions in a metafile • VRML (Virtual Reality Modeling Language) – file format for describing interactive 3D objects and worlds - universal interchange format for 3D graphics and multimedia - can be used for various applications
MULTIMEDIA DOCUMENTS MARKUP • HyTime = Hypermedia/Time-based Structuring Language – standard defined for multimedia documents markup • An SGML architecture that specifies the generic hypermedia structure of documents