Understanding XML and XHTML Character Sets: Encoding and Internationalization

Overview of XML & XHTML Instructor: Joseph DiVerdi, Ph.D., MBA

Character Sets • A Brief Digression...

Character Sets • Character • A Unit of a Written Language System (e.g., ay, bee, see, dee, eff, gee, aych, eye) • Glyph • An Actual Printed or Displayed Shape of a Character (e.g., a b c 5 , $ ó)

Character Sets • A Character May Associate With Several Glyphs • Close Quote - " or » • A Glyph May Correspond to Several Characters • Comma - Pause in Sentence or Decimal Indicator • In Certain Languages

Character Sets • Each Character Is Assigned • A Specific Numeric Value • Number of Characters in a Character Set • Limited by the Bit-depth of Its Encoding • 8-Bit Encoded Character Set - 256 characters • 16-Bit Encoded Character Set - 65,536 characters • HTML v2.0 & v3.2 are based on ISO 8859-1 • 8-Bit Character Set • AKA Latin-1
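The bit-depth limit above is plain arithmetic: n bits give 2^n distinct values. A minimal sketch of that calculation follows; the helper name `charset_capacity` is made up for illustration and does not come from the slides.

```python
def charset_capacity(bits: int) -> int:
    """Return how many distinct numeric values fit in `bits` bits."""
    return 2 ** bits


# 7 bits is US-ASCII; 8 and 16 bits are the two depths named on the slide.
for bits in (7, 8, 16):
    print(f"{bits}-bit encoding: {charset_capacity(bits):,} possible characters")
```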

Character Sets • ISO-8859-1 Character Set • 8-Bit Depth • First 128 Values From US-ASCII

  Numeric Value   Glyph   Description
  13              CR      carriage return
  48              0       digit zero
  65              A       uppercase ay
  94              ^       caret
  177             ±       plus-or-minus
  191             ¿       inverted question mark
  255             ÿ       lowercase wye with diaeresis (umlaut)
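To check entries like these by hand, one option is to decode single byte values with Python's built-in `iso-8859-1` codec. This is only a sketch; the helper name `latin1_glyph` is hypothetical.

```python
def latin1_glyph(value: int) -> str:
    """Decode one numeric value (0-255) as an ISO-8859-1 character."""
    return bytes([value]).decode("iso-8859-1")


# A few printable values from the table above.
for value in (48, 94, 177, 191, 255):
    print(value, latin1_glyph(value))

# The first 128 values decode identically under US-ASCII and ISO-8859-1.
shared = all(
    bytes([v]).decode("ascii") == latin1_glyph(v) for v in range(128)
)
print("first 128 values match US-ASCII:", shared)
```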

Character Sets (continued) • Common Character Sets

  8-Bit
  ISO 8859-1   Latin-1
  ISO 8859-5   Cyrillic
  ISO 8859-6   Arabic
  ISO 8859-7   Greek
  ISO 8859-8   Hebrew

  Multi-Byte (More Than 8 Bits per Character)
  Shift_JIS    Japanese
  EUC-JP       Japanese

Uses of Character Sets

  Language    Code            Character Sets
  French      fr              iso-8859-1
  Greek       el              iso-8859-7
  Hebrew      iw (now he)     iso-8859-8
  Hungarian   hu              iso-8859-2
  Icelandic   is              iso-8859-1
  Italian     it              iso-8859-1
  Japanese    ja              shift_jis, iso-2022-jp, euc-jp
  Romanian    ro              iso-8859-2
  Russian     ru              koi8-r, iso-8859-5
  Serbian     sr              iso-8859-5
  Slovak      sk              iso-8859-2
  Spanish     es              iso-8859-1
  Turkish     tr              iso-8859-9
  Ukrainian   uk              iso-8859-5
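The practical consequence of having many 8-bit sets is that one numeric value means different things depending on which set is in force. The sketch below decodes the same byte under several of the ISO 8859 sets listed above; the byte 0xE1 and the helper name `decode_byte` are arbitrary choices for illustration.

```python
CHARSETS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7", "iso-8859-8"]


def decode_byte(value: int, charset: str) -> str:
    """Interpret a single byte value according to the named character set."""
    return bytes([value]).decode(charset)


# Same byte, different character in each set.
for charset in CHARSETS:
    print(f"{charset}: {decode_byte(0xE1, charset)}")
```

Reading a document with the wrong set therefore produces plausible-looking but wrong characters rather than an error.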

Character Sets (continued) • 256 Characters are Sufficient • For Certain Languages • Insufficient for Others • Japanese (kanji) • Chinese • Korean • Vietnamese • Hence the Need For • 16-Bit Encoded Character Sets
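A rough way to see the 256-character limit in practice is to ask whether a piece of text can be encoded in a given set at all. The helper `fits_in` below is a hypothetical name, and the sample strings are chosen only for illustration.

```python
def fits_in(text: str, charset: str) -> bool:
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


samples = {"French": "café", "Japanese": "日本語"}
for language, text in samples.items():
    for charset in ("iso-8859-1", "shift_jis"):
        print(f"{language} {text!r} fits in {charset}: {fits_in(text, charset)}")
```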

Character Sets • 16-Bit Encoded Character Sets • Two Contiguous Bytes Represent One Character • 65,536 Possible Characters in One Set • Unicode Began as a 16-Bit Character Set • Since Extended Beyond 65,536 Characters (Code Points up to U+10FFFF) • Encoded as UTF-8, UTF-16, or UTF-32 • Developed by the Unicode Consortium • Kept in Step With ISO/IEC 10646 • First 256 Slots Allocated to ISO 8859-1 • Backwards Compatible (woo-hoo!)
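The backwards-compatibility claim can be checked directly: each ISO 8859-1 value should decode to the Unicode character with the same number, and each such character should occupy two bytes in a 16-bit encoding. The sketch below uses Python's `utf-16-be` codec as the 16-bit encoding; the helper names are hypothetical.

```python
def same_slot_in_unicode(value: int) -> bool:
    """True if ISO-8859-1 value `value` maps to the Unicode code point `value`."""
    return ord(bytes([value]).decode("iso-8859-1")) == value


def utf16_units(ch: str) -> bytes:
    """Encode one character as big-endian UTF-16 with no byte-order mark."""
    return ch.encode("utf-16-be")


print("first 256 slots match:", all(same_slot_in_unicode(v) for v in range(256)))

# Each of these Latin-1 characters becomes a zero byte followed by its old value.
for ch in "A±ÿ":
    print(ch, utf16_units(ch).hex())
```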

Character Sets • A Brief Digression... • Bottom Line • Specify Your Encoding As Required • Important For International Applications • Multi-Lingual Applications • There, now you know about it.
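In XML and XHTML, one common place to specify the encoding is the XML declaration at the top of the document. The following sketch uses Python's standard `xml.etree.ElementTree` module to write and re-read a small document; the element name `greeting` and the helper `serialize` are invented for the example, and `xml_declaration=True` assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def serialize(text: str, charset: str) -> bytes:
    """Build a one-element document and encode it with an explicit declaration."""
    root = ET.Element("greeting")  # hypothetical element name
    root.text = text
    return ET.tostring(root, encoding=charset, xml_declaration=True)


data = serialize("¿Qué tal?", "iso-8859-1")
print(data)

# The parser reads the declared encoding and recovers the original text.
print(ET.fromstring(data).text)
```

Without the declaration, an XML parser generally assumes UTF-8 or UTF-16, so ISO-8859-1 bytes such as 0xBF would be rejected or misread.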
