Text Languages J. H. Wang Mar. 4, 2008
The Retrieval Process (figure) • Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database • Data labels: user need, logical view, query, user feedback, retrieved docs, ranked docs
Text Languages (Ch. 6) • Metadata • Text • Markup Languages • Multimedia
Introduction • Text • Main form of communicating knowledge • Document • Loosely defined; denotes a single unit of information • Can be any physical unit • a file • an email • a Web page
Introduction • Document • Syntax and structure • Semantics • Information about itself (metadata)
Introduction • Document syntax • Implicit, or expressed in a language (e.g., TeX) • Powerful languages: easier to parse, but difficult to convert to other formats • Open languages are better (for interchange) • Semantics of natural-language text are not easy for a computer to understand • Trend: languages that provide information on structure, format, and semantics while being readable by both humans and computers (e.g., SGML)
Introduction • New applications are pushing for formats in which information can be represented independently of style • Style: defined by the author, but the reader may decide part of it • Style can include treatment of other media
Metadata • “Data about the data” • e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc. • Descriptive Metadata • Author, source, length • Dublin Core Metadata Element Set • Semantic Metadata • Characterizes the subject matter within the document contents • MEDLINE
Metadata • MARC (Machine Readable Cataloging Record)
100 0020 1 $aHagler, Ronald.
245 0074 14$aThe bibliographic...
250 0012 $a3rd. Ed.
260 0052 $aChicago :$bALA, $c1997
Metadata • Metadata information on Web documents • Cataloging, content rating, property rights, digital signatures • New standard: Resource Description Framework (RDF) • Description of Web resources to facilitate automated processing of information • Nodes and attached attribute/value pairs • Metadescription of non-textual objects • Keywords can be used to search the objects
Metadata • RDF Example
<RDF:RDF>
  <RDF:Description RDF:HREF = "page.html">
    <DC:Creator> John Smith </DC:Creator>
    <DC:Title> John’s Home Page </DC:Title>
  </RDF:Description>
</RDF:RDF>
Metadata • RDF Schema Example
Text • Text coding in bits • EBCDIC, ASCII • Initially, 7 bits. Later, 8 bits • Unicode • 16 bits, to accommodate oriental languages
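To make the bit widths on this slide concrete, here is a minimal sketch using Python's built-in codecs. The sample characters are arbitrary choices for illustration, and `utf-16-be` stands in for the 16-bit Unicode encoding the slide mentions (an assumption about which encoding is meant).

```python
# ASCII: every code point fits in 7 bits, stored in one byte.
ascii_bytes = "A".encode("ascii")
fits_in_7_bits = ascii_bytes[0] < 2 ** 7

# 16-bit Unicode: the same letter takes two bytes in UTF-16 (big-endian, no BOM).
utf16_latin = "A".encode("utf-16-be")

# A CJK character from the Basic Multilingual Plane also fits in one 16-bit unit.
utf16_cjk = "語".encode("utf-16-be")

print(len(ascii_bytes), len(utf16_latin), len(utf16_cjk))
```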
Text • Formats • No single format exists • IR system should retrieve information from different formats • Past: IR systems convert the documents • Today: IR systems use filters
Text • Formats • Formats for document interchange (RTF) • Formats for displaying (PDF, PostScript) • Formats for encoding email (MIME) • Compressed files • Uuencode/uudecode, binhex
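Text • Information Theory • Amount of information is related to the distribution of symbols in the document • Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $\sigma$ is the number of distinct symbols and $p_i$ is the probability of the i-th symbol • Definition of entropy depends on the probabilities of each symbol • Text models are used to obtain those probabilities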
Text • Example – Entropy • 001001011011
Text • Example – Entropy • 111111111111
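The two example strings can be checked with a short sketch. It assumes the simplest text model, in which each symbol's probability is its observed relative frequency in the string; the function name `entropy` is only illustrative.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol, using observed symbol frequencies."""
    n = len(text)
    probabilities = [count / n for count in Counter(text).values()]
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a negative zero.
    return sum(p * math.log2(1 / p) for p in probabilities)


print(entropy("001001011011"))  # six 0s and six 1s
print(entropy("111111111111"))  # a single repeated symbol
```

Under this model the balanced string carries 1 bit per symbol, while the constant string carries none.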
Text • Modeling Natural Language • Symbols: either separate words or belong to words • Symbols are not uniformly distributed • binomial model • Dependency on previous symbols • k-order Markovian model • We can take words as symbols
Text • Modeling Natural Language • Word distribution inside documents • Zipf’s Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word • Real data fits better with $\theta$ between 1.5 and 2.0
Text • Modeling Natural Language • Example – word distribution (Zipf’s Law) • V=1000, $\theta$ = 2 • Most frequent word: n=300 • 2nd most frequent: n=76 • 3rd most frequent: n=33 • 4th most frequent: n=19
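A rough sketch of the Zipf example, assuming the count of the i-th most frequent word is simply the top count divided by i raised to the exponent (written θ here, taken as 2). The helper `zipf_counts` is hypothetical. The slide's second value (76) differs slightly from what this simplification gives, and the sketch does not try to reproduce how the slide derived it.

```python
def zipf_counts(top_count: float, theta: float, ranks: int) -> list:
    """Predicted occurrence counts for ranks 1..ranks under Zipf's Law."""
    return [top_count / (i ** theta) for i in range(1, ranks + 1)]


counts = zipf_counts(top_count=300, theta=2, ranks=4)
for rank, count in enumerate(counts, start=1):
    print(rank, round(count, 2))
```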
Text • Modeling Natural Language • Skewed distribution – stopwords • Distribution of words in the documents • binomial distribution • Poisson distribution
Text • Modeling Natural Language • Number of distinct words (vocabulary) • Heaps’ Law: $V = K n^{\beta} = O(n^{\beta})$, where n is the text size in words • The set of different words is bounded by a constant, but the limit is too high to matter in practice
Text • Modeling Natural Language • Heaps’ Law example • K between 10 and 100, $\beta$ is less than 1 • Example: n=400000, $\beta$ = 0.5 • K=25, V=15811 • K=35, V=22135
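The Heaps' Law example can be reproduced with a few lines, assuming the usual form V = K · n^β, with n the text size in words. The function name is illustrative, and the results are truncated to integers to match the slide's figures.

```python
def heaps_vocabulary(n: int, k: float, beta: float) -> float:
    """Estimated vocabulary size for a text of n words under Heaps' Law."""
    return k * n ** beta


for k in (25, 35):
    print(k, int(heaps_vocabulary(400_000, k, 0.5)))
```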
Text • Modeling Natural Language • Length of the words • defines total space needed for vocabulary • Heaps’ Law: length increases logarithmically with text size • In practice, a finite-state model is used • Space has p=0.2 • Space cannot appear twice consecutively • There are 26 letters
Text • Similarity Models • Distance Function • Should be symmetric and satisfy the triangle inequality • Hamming Distance • Number of positions that have different characters • Example: reverse, receive
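A minimal sketch of the Hamming distance as defined on this slide, applied to the slide's own pair of words. Raising an error on unequal lengths is a choice made for this sketch.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("reverse", "receive"))  # differs at positions 3, 5 and 6
```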
Text • Similarity Models • Edit (Levenshtein) Distance • Minimum number of operations (insertions, deletions, substitutions) needed to make strings equal • Example: survey, surgery • Superior for modeling syntactic errors • Extensions: weights, transpositions, etc.
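The edit distance can be sketched with the standard dynamic-programming recurrence, assuming unit cost for insertions, deletions, and substitutions (none of the weighted or transposition extensions). The function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("survey", "surgery"))  # substitute v -> g, insert r
```

Keeping only two rows of the table keeps memory proportional to the length of the second string.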
Text • Similarity Models • Longest Common Subsequence (LCS) • Example: survey, surgery; LCS: surey • Documents: lines as symbols (diff in Unix) • time consuming • similar lines • Fingerprints • Visual tools
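A sketch of the longest common subsequence using the textbook dynamic-programming table followed by a backtrack. This is one common way to compute it, not necessarily what Unix diff does internally.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("survey", "surgery"))
```

Passing lists of lines instead of strings would need only a change in how the result is joined, which is the idea behind comparing documents line by line.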
Markup Languages • Markup: formatting actions, structure information, text semantics, attributes, … • Tags: formatting commands • SGML: standard metalanguage for markup • XML: a subset of SGML • HTML: an instance of SGML
SGML • Standard Generalized Markup Language (ISO 8879) • A description of the document structure • The text marked with tags which describe the structure • DTD (Document Type Definition) • Does not define the semantics (meaning, presentation, and behavior) • Tags: denoted by angle brackets (<tag>)
Output specifications are often added to SGML documents • DSSSL (Document Style Semantics and Specification Language), FOSI (Formatting Output Specification Instance)
HTML • HyperText Markup Language • Created in 1992, version 4.0 in 1997 • CSS (Cascading Style Sheets) were introduced in 1997 to create visual effects • SGML: generic; it’s possible to define your own formats, handle large and complex documents, and manage large information repositories • These features are not needed for Web applications
XML • eXtensible Markup Language • It allows human-readable semantic markup that is also machine-readable • It enables automatic authoring, parsing, and processing of networked data • XSL (Extensible Stylesheet Language) • XML counterpart of CSS • XLL (Extensible Linking Language) • Defines different types of links • Recent uses: MathML, SMIL, RDF, …
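As a small illustration of markup that is both human-readable and machine-readable, the sketch below parses an XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`page`, `creator`, `title`) are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: the tag names describe meaning, not presentation.
doc = """<page>
  <creator>John Smith</creator>
  <title>John's Home Page</title>
</page>"""

root = ET.fromstring(doc)
# Collect each child element's tag and text into a dictionary.
fields = {child.tag: child.text for child in root}
print(fields)
```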
Multimedia • Images • Bit-mapped: XBM, BMP, PCX, PNG • Compressed: GIF, JPEG, TIFF • Audio • AU, MIDI, WAVE • Video • MPEG, AVI, …
Summary • Text is the main form of communicating knowledge • Documents have syntax, structure and semantics • Metadata: information about data • Formats of text • Modeling Natural Language • Entropy • Distribution of symbols • Similarity