Hypertext (1)

Hypertext (1) • Historically, text is sequential: read from beginning to end • Hypertext is non-sequential, with internal links from one part to another • The word "hypertext" was coined by Ted Nelson in 1963 and first published in 1965. • First hypertext project, Xanadu, named for Coleridge’s magical world.

Hypertext (2) Links in hypertext give access to: • topics or information directly related to the current idea • notes, such as footnotes or endnotes • explanations of special words or phrases • biographical information about people behind the current idea

Claims about Hypertext • Represents large body of information organized into numerous fragments • Fragments relate to one another • User needs only a small fraction of the fragments at any time • Exists only in cooperation with the reader • Is a legitimate literary concept

Claims about Hypertext (2) • Integrates three technologies • Publishing (as a book publisher would) • Computing (as the infrastructure) • Broadcasting (over a computer network) • Depends on computer environment for high-speed transitions between nodes • Modelled by network ADT

Using Hypertext • Browser, or hypertext engine: a computer-based system that allows links to be followed easily • Navigation aids: parts of the user interface that provide a sense of location and direction • Notation: a convenient way of specifying links as a hypertext author

WWW as a Hypertext System • Browser: Netscape, for example • Navigational aids: • Forward, back, home • History list • Colored anchors • Consistent titles • Notation: HTML

Network ADT • Model of hypertext • Similar to tree ADT, but allows cycles • Links have an explicit direction, capturing the idea of going forward and going back

Network ADT (2) • Definition: A network is a collection of nodes and links between pairs of nodes such that • Each link has a direction. • Each node is reachable from any other node. However, the path is not necessarily unique. • No node is linked to itself. • There are no duplicate links in the same direction.

Network ADT (3) • Observations: • There is no hierarchy; all nodes are considered the same. (In a tree, the root is special.) • Links have direction, but reverse travel is possible. (One can go backwards on a link, or forwards on a link that goes in the opposite direction.) • Cycles are allowed.
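The definition and observations above can be sketched as a small Python class. This is a minimal illustration, not the deck's own implementation: the class name `Network`, its method names, and the sample node names are all hypothetical. It also assumes that "reachable" is judged with reverse travel allowed, as the observations suggest.

```python
from collections import deque


class Network:
    """Directed network: no self-links, no duplicate links in one direction."""

    def __init__(self):
        self.out_links = {}  # node -> set of nodes it links to
        self.in_links = {}   # node -> set of nodes that link to it

    def add_node(self, node):
        self.out_links.setdefault(node, set())
        self.in_links.setdefault(node, set())

    def add_link(self, src, dst):
        if src == dst:
            raise ValueError("no node may be linked to itself")
        self.add_node(src)
        self.add_node(dst)
        if dst in self.out_links[src]:
            raise ValueError("duplicate link in the same direction")
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)

    def neighbors(self, node):
        # Assumption: reverse travel is allowed, so a link can be
        # followed in either direction.
        return self.out_links[node] | self.in_links[node]

    def is_connected(self):
        """True if every node is reachable from every other node."""
        if not self.out_links:
            return True
        start = next(iter(self.out_links))
        seen = {start}
        queue = deque([start])
        while queue:
            for nxt in self.neighbors(queue.popleft()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return len(seen) == len(self.out_links)


net = Network()
net.add_link("home", "intro")
net.add_link("intro", "glossary")
net.add_link("glossary", "home")   # a cycle, which a tree would forbid
net.add_link("intro", "home")      # opposite direction, so not a duplicate
```

Storing both outgoing and incoming links makes "going back" as cheap as "going forward", at the cost of keeping two dictionaries in step.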

Directed Graphs • Both networks and rooted trees are examples of connected directed graphs, sometimes called digraphs. • Formally, a digraph is a set of nodes and a set of links joining ordered pairs of nodes. The link (A,B) that joins A to B is different from the link (B,A) that joins B to A.

Navigation in Sequential Text • Low level: • Punctuation • Fonts • Separation into sentences and paragraphs • High level: • Chapters, sections, subsections • Table of contents • Index

Navigation in Sequential Text (2) • Page layout • Page numbers • Running heads • Displayed text

Navigating in Hypertext • Issues: • Where am I? Have I been here before? When? • How did I get here? • Where can I go? • Anchors (or links) • Implicit anchors (or links): clipboard, glossary, calculator • Computed links: next train • Back • Forward • Home

Navigating in Hypertext (2) • Within a node: • Save to disk • Print • Annotate • Scroll • Zoom

Navigating in Hypertext (3) • User interface support • Give power to the users through • short response time • low cognitive load • path clues, perhaps decaying over time • Follow a path forward or backward • Return to a node
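The navigation questions raised in these slides (Where am I? Have I been here before? How did I get here?) and the Back, Forward, and Home operations can be sketched with two stacks. This is one common design, assumed here for illustration; the class `History` and its method names are hypothetical, and real browsers differ in detail.

```python
class History:
    """Back/forward/home navigation over visited nodes."""

    def __init__(self, home):
        self.home = home
        self.current = home
        self.back_stack = []
        self.forward_stack = []
        self.visited = {home}

    def follow(self, node):
        """Follow a link to a node; this discards the forward path."""
        self.back_stack.append(self.current)
        self.forward_stack.clear()
        self.current = node
        self.visited.add(node)

    def back(self):
        if self.back_stack:
            self.forward_stack.append(self.current)
            self.current = self.back_stack.pop()
        return self.current

    def forward(self):
        if self.forward_stack:
            self.back_stack.append(self.current)
            self.current = self.forward_stack.pop()
        return self.current

    def go_home(self):
        if self.current != self.home:
            self.follow(self.home)
        return self.current

    def been_here(self, node):
        """Have I been here before?"""
        return node in self.visited

    def path(self):
        """How did I get here? The trail ending at the current node."""
        return self.back_stack + [self.current]


h = History("home")
h.follow("intro")
h.follow("glossary")
h.back()       # now at "intro"
h.forward()    # now at "glossary" again
```

Treating Home as an ordinary link keeps Back meaningful after the jump; a system could equally reset the trail, which is a design choice rather than a rule.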

Text Markup • Unified view of text and hypertext presentation • Foundation of all word processors • Describes all electronic manuscripts by • separating logical elements • specifying processing functions for these elements

Text Markup (2) • Originated by William Tunnicliffe (Sept. 1967), in talk advocating separating information content of document from format • Control formatting with embedded codes

Generalized Markup • Goal: allow editing, formatting, and retrieval systems to share documents • Devised by Goldfarb, Mosher, Lorie at IBM, 1969 • Formally defined • document types • explicit nested element structure • generic identifier associated with each element

SGML • Standard Generalized Markup Language • First draft standard, 1980 • ISO 8879, 1986 • Based on the ADT tree • Allows the description of a document, considered as a tree, to be embedded in the file containing the document

Functions of SGML • Tags documents in a formal language • Describes internal logical structures • Links files with an addressing scheme • Acts as a database language for text • Accommodates multimedia and hypertext • Provides a grammar for style sheets • Allows coded text reuse in surprising ways

Functions of SGML (2) • Represents documents independent of computing platform • Provides a standard for transferring documents among platforms and applications • Acts as a metalanguage for document types • Represents hierarchies • Extends to accommodate new document types

Generic Identifiers • Tagging vs. formatting • Tagging shows document structure • Formatting describes document display • Example: A paragraph is a sequence of closely connected sentences and can be delimited by a tag. A paragraph can be displayed with either • initial indenting or not • extra separation or not
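The paragraph example can be made concrete with a short Python sketch: the content is held as a list of paragraphs (the structure), and a separate function decides how to display them. The function `render` and its parameters are hypothetical names chosen for this illustration.

```python
import textwrap

# Tagged content: structure only, no display decisions.
paragraphs = [
    "Hypertext is non-sequential text.",
    "Links join one fragment to another.",
]


def render(paragraphs, indent=False, extra_separation=False, width=40):
    """Lay out paragraphs according to a chosen display style."""
    first = "    " if indent else ""
    blocks = [
        textwrap.fill(p, width=width, initial_indent=first)
        for p in paragraphs
    ]
    separator = "\n\n" if extra_separation else "\n"
    return separator.join(blocks)


indented = render(paragraphs, indent=True)           # initial indenting
spaced = render(paragraphs, extra_separation=True)   # blank line between
```

The same `paragraphs` list feeds both layouts, which is the point of separating tagging from formatting.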

Generic Identifiers (2) • Syntax • Beginning: < identifier > • End: </ identifier > • Attribute list, with assigned values, may follow identifier

Generic Identifiers (3) • Typical identifiers: • p paragraph • q quotation • ol numbered (ordered) list • ul unnumbered list • li list item • b bold face • i italics
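Because begin and end tags nest, tagged text can be read back into a tree. The sketch below uses Python's standard `html.parser` module as a stand-in; it is an HTML parser, not a full SGML parser, and the class `TreeBuilder` and function `outline` are hypothetical names for this illustration.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build nested (identifier, attributes, children) tuples from tags."""

    def __init__(self):
        super().__init__()
        self.root = ("document", {}, [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, dict(attrs), [])
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element with this identifier.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][2].append(data.strip())


def outline(node, depth=0):
    """Return the element structure as indented lines, ignoring text."""
    tag, _attrs, children = node
    lines = ["  " * depth + tag]
    for child in children:
        if isinstance(child, tuple):
            lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed('<ol start="3"><li>first <b>bold</b> item</li><li>second</li></ol>')
builder.close()
print("\n".join(outline(builder.root)))
```

The attribute list that follows an identifier (here `start="3"`) is kept as a dictionary on the node, separate from the element's children.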

Display of Text • ASCII codes for printing characters carry no information about display • Printed or displayed characters are described by their font.

Fonts • Fonts come in families: groups of fonts with similar design characteristics. • A font is a set of displayed characters in a particular design. To describe a font, we specify: • The font face, or type face, which is the design of the font. • The size, measured in points, which is the height of representative characters. • The appearance: bold, italic, underline, outline, shadow, small cap, redline, strikeout, etc.

Fonts (2) • Font families include standard modifications of a base font, such as italics and bold, to change the appearance. (This family is Times New Roman.) • Some families are sans serif, without the cross strokes accentuating the ends of the main strokes.

Fonts (3) • Typical examples of fonts are • Times New Roman • Arial • Century Schoolbook • Lucida Calligraphy • Verdana

Fonts (4) • The size of this font is 32 points • This is 54 points • This is 24 points • There are 72.27 traditional printer's points per inch (desktop publishing rounds this to exactly 72)
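Point sizes convert to physical sizes by simple division. The sketch below assumes the 72.27 points-per-inch figure from the slide as the default and offers the 72-per-inch desktop-publishing point as an alternative; the function names and the `dpi` parameter are hypothetical.

```python
PRINTER_POINTS_PER_INCH = 72.27   # traditional printer's point (slide value)
DTP_POINTS_PER_INCH = 72.0        # desktop-publishing point (assumption)


def points_to_inches(points, points_per_inch=PRINTER_POINTS_PER_INCH):
    """Convert a size in points to inches."""
    return points / points_per_inch


def points_to_pixels(points, dpi, points_per_inch=PRINTER_POINTS_PER_INCH):
    """Convert a size in points to device pixels at a given resolution."""
    return points_to_inches(points, points_per_inch) * dpi


for size in (32, 54, 24):
    print(size, "points =", round(points_to_inches(size), 3), "inches")
```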

Fonts (5) To render a character in a font, one must know • The computer code (ASCII) of the character • The font name and properties Then the computer creates the glyph that represents the character in the specified font.

Fonts (6) In the process, the computer uses the • Baseline: the invisible line on which characters are aligned. • x-height: the actual height of the character x • Kerning: spacing between two letters. Note that in printing “wo” the “o” slides under the “w” to form and locate the glyph

Input devices for text • Keyboard • Scanning with optical character recognition • Hand printed • Hand written (cursive) • Machine printed • Voice recognition • Pen-based

Input errors • Human-based, e.g. • Typographic • Poor writing • Machine dependent • Small typeface differences: O vs. D • Limits of technology • Pre-existing errors

Automatic error correction • Goal: match the error rate of keyboard input with 98% OCR accuracy plus automatic correction • Automatic correction also helpful in: • Computer-aided authoring • Communication enhancement for disabled • Natural language responses • Database interaction • Example: MS Word AutoCorrect

Automatic spelling correction • Three increasingly difficult tasks: • Non-word detection: string in text not in dictionary • Isolated word correction: thier automatically becomes their • Context-dependent correction: here automatically becomes hear

MS Word AutoCorrect

General spelling correction • Can allow human intervention, e.g. choose the correct spelling from a list of candidates • No general-purpose, context-dependent correction tool exists yet.

Issues for spelling correction • Type of input device • Focus on adjacent keys: b vs. n • Focus on similar shapes: O vs. D • Interactive vs. automatic correction • How many choices are reasonable? (One for automatic correction.) • How accurate should guesses be? • Proper choice of dictionary

Proper Dictionary

Word list choice • Use a lexicon: a word list appropriate to a particular topic • As opposed to a dictionary: a comprehensive list of words • Include provision for adding new words

Word list choice: Example 1 • Compare NY Times news wire text with Webster’s 7th Collegiate Dictionary • 8 million words in news wire text: • only 36% in dictionary • only 39% of dictionary words used in text

Example 1 (continued) • Of text words not in dictionary • 1/4 inflected forms (change in case, gender, tense) • 1/4 proper names • 1/6 hyphenated forms • 1/12 misspellings • 1/4 unresolved by investigators (new words, etc.) • How to handle proper names?

Example 2 • Corpus of 22 million words from a variety of genres • Effect of changing lexicon from 50,000 to 60,000 words? • Eliminated 1348 false rejections (words that are now included in the lexicon) • Created 23 false acceptances (misspellings that now occur in the lexicon and are therefore treated as correctly spelled)

Unintentionally correct spellings • Misuse of word: there for their, to for too • Typo: from for form • Quote from Mozart: I’ll see you in five minuets

Issues in detection • Given document as a sequence of words, lexicon as ordered list of words, report all document words not in lexicon, but: • How to handle upper case letters? • How to handle suffixes and prefixes? • What definition of word to use?

Issues in detection (2) • Upper case: Change all to lower case • Handles first word of sentence and proper names that are words: Bob Brown • Confuses: DEC (ok), Dec (abbreviation), dec (misspelling) • Must put back capitalization
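Non-word detection with the lower-casing approach can be sketched as follows. The regular expression is only one possible answer to "what definition of word to use?", and the function name `find_nonwords` and the tiny lexicon are hypothetical. Positions are kept alongside the original words so that capitalization can be put back afterwards.

```python
import re


def find_nonwords(text, lexicon):
    """Report words whose lower-case form is not in the lexicon.

    Returns (position, original_word) pairs so the original
    capitalization is available when reporting or correcting.
    """
    misses = []
    # Assumption: a word is a run of letters, optionally with one
    # internal apostrophe (e.g. "don't").
    for match in re.finditer(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        word = match.group()
        if word.lower() not in lexicon:
            misses.append((match.start(), word))
    return misses


lexicon = {"bob", "brown", "saw", "their", "dog", "in", "the", "park"}
text = "Bob Brown saw thier dog in the park."
print(find_nonwords(text, lexicon))
```

The weakness noted on the slide shows up directly: once "dec" is in the lexicon so that DEC is accepted, the lower-case misspelling "dec" is accepted too.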

Types of errors • From keyboard input, 80% of misspellings are a single instance of: • Insertion • Deletion • Substitution, especially nearby keys • Transposition • Few errors occur in first letter • Mostly, length is same or changes by 1

Suggestion Strategies • Words with same first letter first • Order rest by change in length
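The four single-error types and the ordering strategy combine into a simple suggestion routine. The sketch below generates every string one insertion, deletion, substitution, or transposition away, keeps those found in the lexicon, and ranks them. It reads the strategy as "same first letter first, then smallest change in length", which is one interpretation; the function names and sample lexicon are hypothetical.

```python
import string


def single_edits(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in letters}
    insertions = {a + c + b for a, b in splits for c in letters}
    return (deletions | transpositions | substitutions | insertions) - {word}


def suggest(word, lexicon):
    """Rank lexicon words one edit away from the given word."""
    candidates = single_edits(word) & lexicon
    return sorted(
        candidates,
        key=lambda c: (
            c[:1] != word[:1],           # same first letter sorts first
            abs(len(c) - len(word)),     # then smallest change in length
            c,                           # alphabetical tie-break
        ),
    )


lexicon = {"their", "there", "three", "tier", "thief", "shier"}
print(suggest("thier", lexicon))
```

For automatic correction only the first suggestion would be used; an interactive tool could show the whole list.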

Types of errors (2) • Improper spacing: run-ons or splits • Significant unsolved problem • Cognitive • recieve for receive; procede for proceed • conspiricy for conspiracy; mispell for misspell • Phonetic • abiss for abyss; nacherly for naturally

Spelling Rules • I before E except after C • Exceed, succeed, and proceed end in -ceed. All others end in -cede, except supersede
