This overview delves into the methods and technologies employed in storing and retrieving information presented as documents. It discusses the representation of the "real world" through data abstraction, emphasizing the distinction between the ectosystem (external factors beyond designer control) and endosystem (factors controlled by designers). Key concepts covered include the performance of systems, data compression techniques, and the transition from data to wisdom. Various coding methods for text compression, including Huffman coding and Ziv-Lempel algorithms, are also explored.
Overview Focus: Methods and technologies to store and retrieve information in the form of documents that contain text and that may also contain tables, diagrams and images • In any information system, the “real world” is represented by a collection of data abstracted from observations of the real world and made available to the system • Need, reality, data, query
Overview (cont.1) • Ectosystem: system factors that are not under the control of the designer • Endosystem: system factors that the designer can specify and control (e.g., algorithms) • Performance • Effectiveness • Efficiency • Economy
From Data to Wisdom • Data: impersonal, and equally available • Information: set of data matched to a need, personal, and time-dependent • Knowledge • Data, information, and rules • IR&S process description
Data Compression • Level of compression: character vs. word • Data model • Statistical: build statistical tables from a sample • Adaptive: starts with an a priori statistical distribution for the text symbols but modifies it as each character or word is encoded • Semi-static: start with a model for, say, Chapter 1, then modify it for a better fit to Chapter 2, and so on
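The difference between a fixed table built from a sample and an adaptive model can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `static_cost_bits` and `adaptive_cost_bits` are hypothetical, the a priori distribution is assumed to be uniform, and add-one smoothing is an assumption added so that unseen symbols keep a nonzero probability. The code reports the ideal code length (the sum of -log2 p over the symbols), not an actual bit stream.

```python
import math
from collections import Counter


def static_cost_bits(text, sample):
    """Ideal code length of `text` under a fixed table built from `sample`."""
    alphabet = set(sample) | set(text)
    counts = Counter(sample)
    # Assumption: add-one smoothing, so symbols missing from the sample
    # still receive a nonzero probability.
    total = len(sample) + len(alphabet)
    return sum(-math.log2((counts[ch] + 1) / total) for ch in text)


def adaptive_cost_bits(text, alphabet):
    """Ideal code length of `text` when the model is updated after each symbol."""
    # Assumption: the a priori distribution is uniform (one count per symbol).
    counts = {ch: 1 for ch in alphabet}
    total = len(counts)
    bits = 0.0
    for ch in text:
        bits += -math.log2(counts[ch] / total)  # cost under the current model
        counts[ch] += 1                         # then adapt the model
        total += 1
    return bits


if __name__ == "__main__":
    print(adaptive_cost_bits("aaaa", "ab"))   # adaptive model learns that 'a' dominates
    print(static_cost_bits("aaaa", "abab"))   # fixed table from a balanced sample
```

A decoder can mirror the adaptive updates because it sees the same symbols in the same order, so no table has to be transmitted; a fixed table, by contrast, must be known to both sides in advance.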
Types of Codes for Text Compression • Huffman: static, binary tree • Ziv-Lempel: adaptive, identifies each text segment the first time it appears and then points back to it when it occurs again • Arithmetic: adaptive, the text stream is identified by a number that represents the statistical distribution of the symbols, which is modified as the text is encoded
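The first two code types listed above can be illustrated with a short sketch. The names `huffman_codes`, `lz78_encode`, and `lz78_decode` are hypothetical, and the Ziv-Lempel part assumes the LZ78 variant, where "pointing back" means emitting the index of a previously seen phrase; the slides do not say which variant is meant. Arithmetic coding is omitted here.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a static Huffman table {symbol: bit string} from symbol counts."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)    # two lightest subtrees
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def lz78_encode(text):
    """Emit (index of longest known phrase, next character) pairs."""
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending a known phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # first appearance: register it
            phrase = ""
    if phrase:
        output.append((dictionary[phrase], ""))  # flush a trailing known phrase
    return output


def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases, out = [""], []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


if __name__ == "__main__":
    print(huffman_codes("aaaabbc"))
    print(lz78_encode("abab"))
```

In this sketch the Huffman table is computed once from the whole text and then stays fixed, whereas the LZ78 dictionary grows while the text is being encoded, which is the static versus adaptive contrast drawn in the bullets.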