This overview, led by Dr. Joseph DiVerdi, explores XML and XHTML character sets, which are essential for digital communication in diverse languages. It discusses the concept of characters and glyphs, explains how various character sets (like ISO-8859-1 and Unicode) accommodate different languages, and highlights the significance of encoding in international and multi-lingual applications. The session emphasizes the necessity of specifying encoding to ensure proper representation of characters, thereby enhancing global compatibility of web content.
Overview of XML & XHTML Instructor: Joseph DiVerdi, Ph.D., MBA
Character Sets • A Brief Digression...
Character Sets • Character • A Unit of a Written Language System (ay, bee, see, dee, eff, gee, aych, eye) • Glyph • An Actual Printed or Displayed Character (= a b c 5 , $ ó)
Character Sets • A Character May Associate With Several Glyphs • Close Quote - " or » • A Glyph May Correspond to Several Characters • Comma - Pause in Sentence or Decimal Indicator • In Certain Languages
Character Sets • Each Character Is Assigned • A Specific Numeric Value • Number of Characters in a Character Set • Limited by the Bit-depth of Its Encoding • 8-Bit Encoded Character Set - 256 characters • 16-Bit Encoded Character Set - 65,536 characters • HTML v2.0 & v3.2 are based on ISO 8859-1 • 8-Bit Character Set • AKA Latin-1
Character Sets • ISO-8859-1 Character Set • 8-Bit Depth • First 128 Values From US-ASCII
Numeric Value   Glyph   Description
13              CR      carriage return
48              0       digit zero
65              A       uppercase aye
94              ^       caret
177             ±       plus-or-minus
191             ¿       inverted question mark
255             ÿ       lowercase wye w/umlaut
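The table can be checked with a short Python sketch. It assumes Python's built-in "latin-1" codec as the implementation of ISO-8859-1; the helper name `latin1_glyph` is hypothetical and only for illustration.

```python
# Look up numeric values in ISO-8859-1 (Python codec name: "latin-1").
# latin1_glyph is a hypothetical helper name, not part of any library.

def latin1_glyph(value: int) -> str:
    """Return the character that ISO-8859-1 assigns to an 8-bit value."""
    if not 0 <= value <= 255:
        raise ValueError("ISO-8859-1 only defines values 0 through 255")
    return bytes([value]).decode("latin-1")

# Printable rows from the table. Note that "A" sits at 65 (64 is "@").
for value in (48, 65, 94, 177, 191, 255):
    print(value, latin1_glyph(value))

# The first 128 values agree with US-ASCII.
ascii_matches = all(
    bytes([v]).decode("latin-1") == bytes([v]).decode("ascii")
    for v in range(128)
)
print("first 128 values match US-ASCII:", ascii_matches)
```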
Character Sets (continued) • Common 8-bit character sets
ISO 8859-1   Latin-1
ISO 8859-5   Cyrillic
ISO 8859-6   Arabic
ISO 8859-7   Greek
ISO 8859-8   Hebrew
SHIFT_JIS    Japanese (multi-byte, built from 8-bit units)
EUC_JP       Japanese (multi-byte, built from 8-bit units)
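Because each of these sets assigns its own characters to the upper 128 values, the same byte can display as a different glyph depending on which set is assumed. A minimal Python sketch, using the standard codec names for three of the sets listed above (the byte value 225 is an arbitrary choice for illustration):

```python
# One byte value, interpreted under three different 8-bit character sets.
raw = bytes([0xE1])  # numeric value 225, chosen arbitrarily

charsets = ("iso-8859-1", "iso-8859-7", "iso-8859-5")
decoded = {charset: raw.decode(charset) for charset in charsets}

for charset, glyph in decoded.items():
    # Latin-1 gives a-acute, Greek gives alpha, Cyrillic gives es.
    print(charset, glyph, f"U+{ord(glyph):04X}")
```

This is why a document that does not say which character set it uses can show up as the wrong letters on someone else's screen.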
Uses of Character Sets
Language    Code   Character Sets
French      fr     iso-8859-1
Greek       el     iso-8859-7
Hebrew      iw     iso-8859-8 (iw is the older language code; he is the current one)
Hungarian   hu     iso-8859-2
Icelandic   is     iso-8859-1
Italian     it     iso-8859-1
Japanese    ja     shift_jis, iso-2022-jp, euc-jp
Romanian    ro     iso-8859-2
Russian     ru     koi8-r, iso-8859-5
Serbian     sr     iso-8859-5
Slovak      sk     iso-8859-2
Spanish     es     iso-8859-1
Turkish     tr     iso-8859-9
Ukrainian   uk     iso-8859-5
Character Sets (continued) • 256 Characters are Sufficient • For Certain Languages • Insufficient for Others • Japanese (kanji) • Chinese • Korean • Vietnamese • Hence the Need For • 16-Bit Encoded Character Sets
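The limit is easy to see in practice. The sketch below tries to encode a French word and a Japanese word in ISO-8859-1, then in Shift_JIS; `fits_in` is a hypothetical helper name, and the sample words are assumptions chosen for illustration.

```python
# Test whether every character of a string exists in a given character set.
# fits_in is a hypothetical helper name, not a library function.

def fits_in(text: str, charset: str) -> bool:
    """Return True if every character in text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

french = "fran\u00e7ais"     # "francais" with a cedilla
kanji = "\u6f22\u5b57"       # the two-character word "kanji"

print(fits_in(french, "iso-8859-1"))  # the 256 Latin-1 slots cover French
print(fits_in(kanji, "iso-8859-1"))   # no slots left for kanji
print(fits_in(kanji, "shift_jis"))    # a Japanese character set covers them

# Shift_JIS spends more than one byte on each of these characters.
print(len(kanji), "characters ->", len(kanji.encode("shift_jis")), "bytes")
```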
Character Sets • 16-Bit Encoded Character Sets • Two Contiguous Bytes Represent One Character • 65,536 Possible Characters in One Set • Unicode Was Designed as a 16-bit Character Set (Later Versions Extend Beyond 65,536 Characters) • Developed by the Unicode Consortium • Practically Identical to ISO 10646-1 • First 256 Slots Allocated to ISO 8859-1 • Backwards Compatible (woo-hoo!)
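Both points on this slide can be checked in a few lines of Python. The slide does not name a specific 16-bit encoding, so the sketch assumes UTF-16 (big-endian) as one example of storing a character in two contiguous bytes.

```python
# 1. The first 256 Unicode slots carry the same characters as ISO 8859-1:
#    decoding Latin-1 byte v always yields the Unicode character numbered v.
backwards_compatible = all(
    ord(bytes([v]).decode("latin-1")) == v for v in range(256)
)
print("first 256 slots match ISO 8859-1:", backwards_compatible)

# 2. Two contiguous bytes represent one character.
#    UTF-16 big-endian is an assumed example encoding, not named in the slides.
e_acute = "\u00e9"                      # numeric value 233 in both sets
two_bytes = e_acute.encode("utf-16-be")
print(two_bytes.hex(), len(two_bytes))  # a zero byte followed by the Latin-1 value
```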
Character Sets • A Brief Digression... • Bottom Line • Specify Your Encoding As Required • Important For International Applications • Multi-Lingual Applications • There, now you know about it.
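In XML and XHTML, the encoding is specified in the XML declaration on the first line of the document. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` parser; the element name `greeting` and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

body = "<greeting>\u00bfQu\u00e9 tal?</greeting>"  # hypothetical sample content

# With the declaration, the parser knows the bytes are ISO-8859-1.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?>' + body
root = ET.fromstring(declared.encode("iso-8859-1"))
print(root.text)

# Without it, an XML parser assumes UTF-8, and these Latin-1 bytes are
# not valid UTF-8, so parsing fails.
try:
    ET.fromstring(body.encode("iso-8859-1"))
    undeclared_ok = True
except ET.ParseError as err:
    undeclared_ok = False
    print("parse failed:", err)
```

The same bytes are readable or unreadable depending only on whether the encoding was declared.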