
Using regular expressions to handle non-ASCII text


Presentation Transcript


  1. Using regular expressions to handle non-ASCII text

  2. A motivating example

  3. Program which puts data into database • Create a simple MySQL table • Write a program which accepts a string from a form and appends it to the database table • We will use it on the next few slides

  4. First interaction • We use the form to submit the string Fred is here • Checking the database shows that the string was correctly stored

  5. Second interaction • We use the form to submit the string ‘Fred is here’ said Tom • The program claims it handled the string correctly • But, checking the database shows that the slanted apostrophes look funny • The problem stems from the way the slanted apostrophes are encoded • The confusion is because ’ is not a standard ASCII character • It is not the same as the basic apostrophe ' which is a standard ASCII character
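The difference between the two apostrophes can be checked directly. A short Python snippet (Python is not used in these slides; this is just an illustrative aside) shows that the slanted apostrophe lies outside the ASCII range:

```python
# The slanted apostrophe is the Unicode character U+2019;
# the straight apostrophe is the ASCII character 27 (hex).
curly = '\u2019'   # ’
straight = "'"

print(hex(ord(curly)))     # 0x2019 - far outside the 7-bit ASCII range
print(hex(ord(straight)))  # 0x27   - plain ASCII

try:
    curly.encode('ascii')
except UnicodeEncodeError:
    print('the slanted apostrophe cannot be encoded as ASCII')
```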

  6. Third interaction • The problem is even worse if we are developing a website to support customers who use languages besides English • Suppose we use the form to submit the Chinese string 我是爱尔兰人 • The program claims it handled the string correctly • But, checking the database shows something strange • The Chinese characters have been converted to HTML entity numbers

  7. Fourth interaction • Actually, the treatment of Chinese is not as bad as what happens when we use the program to handle other Latin-script languages • Suppose we use the form to submit the Polish word znaków • The program claims it handled the string correctly • But, checking the database shows something strange about the way the letter ó is handled

  8. An interlude • To see the root of the problem, we need to understand how characters are handled • We will return to the use of regular expressions in website programming, but first we must look at character encoding

  9. Character encoding

  10. A file containing a Polish word (part 1) • Let's use Notepad to create a new file containing the Polish word znaków (which means symbols, signs or characters)

  11. A file containing a Polish word (part 2) • Notepad allows us to save the file in different formats which it calls • ANSI, • Unicode, • Unicode big-endian, • UTF-8

  12. Comparing the formats • We can use XVI32 to examine the different files • The ANSI file contains 6 bytes • The so-called Unicode file contains 14 bytes • although Microsoft call the format used in this file 'Unicode', the proper name for the format is UTF-16LE, where LE means 'little-endian' • The so-called Unicode big-endian file also contains 14 bytes • the proper name for the format used in this file is UTF-16BE, where BE means 'big-endian' • The UTF-8 file contains 10 bytes • ANSI was developed for English script • UTF-16LE, UTF-16BE and UTF-8 are implementations of an approach called Unicode, which was developed to support all language scripts • Let's examine these four formats

  13. The ANSI format

  14. Viewing the ANSI file in XVI32 • The file contains 6 bytes, one for each character • The English characters z, n, a, k and w are encoded using ASCII codes - byte values in the range 00 to 7F • But the code for ó is based on an extension to ASCII called Windows-1252, which uses byte values in the range 80 to FF • thus, ó is represented as F3 • Extensions to ASCII which use values 00 to FF for various purposes are often called "code pages" and Windows-1252 (also known as Microsoft Windows Latin-1) is often called CP-1252. • By the way, Windows-1252 is often confused with a similar, but slightly different, character code, ISO 8859-1 (a.k.a. ISO Latin-1)
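The byte values described above are easy to verify if Python is to hand: its `cp1252` codec implements the Windows-1252 code page (an illustrative aside, not part of the original slides):

```python
# Encode the Polish word "znaków" using the Windows-1252 code page
# (the format Notepad calls "ANSI" on Western-European systems).
data = 'znaków'.encode('cp1252')

print(len(data))      # 6 - one byte per character
print(data.hex(' '))  # 7a 6e 61 6b f3 77 - ó is the single byte F3
```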

  15. Code pages • The CP-1252 or Microsoft Windows Latin-1 "code page" is only one of many different ways of using byte values in the range 80 through FF • Different code pages support different languages. • CP-1251, for example, uses byte values 80 through FF for Cyrillic, the alphabet used in Russian, Bulgarian, Serbian, Macedonian, Kazakh, Kyrgyz, Tajik, ... • When I lived in Thailand, the computers all used CP-874; this supports the Thai alphabet • In CP-874, the byte value which represents ó in CP-1252 actually represents the symbol ๓ (the Thai numeral for three - it is pronounced 'saam') • Using different code pages was OK when files generated in one culture were never used outside that culture • But it's no good when a file generated in a country whose computers use one code page is opened in a country where computers use another code page • It is also a problem when one needs to deal with different languages in one document • This motivated the development of Unicode

  16. Unicode

  17. Code points • Unicode is an abstract code • as we shall see later, Unicode can be implemented in various ways • In Unicode, each symbol is represented by an abstract code point • A code point is usually written in the form U+ followed by a sequence of hex digits, for example U+007A • The U+ is actually meant to remind us of the set union symbol, ⊎, referring to the fact that Unicode is meant to be a union of character sets • Unicode provides enough code points for 1,114,112 symbols • However, most of these code points are still unused • which is why its promoters are reasonably confident that it will always provide enough code points to support all symbols likely to be developed • or, at least, all symbols developed by members of our species!

  18. Planes • Unicode is intended to cope with all symbols existing or likely to be developed • It provides enough code points for 1,114,112 symbols • This huge set of code points is divided into 17 "planes", each of which contains 65,536 (2^16) code points • Plane 0, the Basic Multilingual Plane (BMP), contains code points for almost (but not quite) all symbols used in current languages • Plane 1, the Supplementary Multilingual Plane (SMP), contains historic scripts (hieroglyphs, cuneiform, Minoan Linear B), musical notation, mathematical alpha-numerics, emoticons and game symbols (playing cards, dominoes). • Plane 2, the Supplementary Ideographic Plane (SIP), is used for some Chinese, Japanese and Korean symbols that are not in Plane 0 • Planes 3-13 are still unused • Plane 14, the Supplementary Special-purpose Plane (SSP), contains special-purpose non-graphical characters • Planes 15 and 16, the Supplementary Private Use Areas, are available for use by entities outside the Unicode Consortium
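The arithmetic above can be checked in a couple of lines of Python (illustrative aside; Python's `chr()` enforces exactly this code space):

```python
# 17 planes of 2**16 code points each give the full Unicode code space.
print(17 * 2**16)   # 1114112

# The highest valid code point is U+10FFFF.
print(hex(ord(chr(0x10FFFF))))  # 0x10ffff - a Plane-16 private-use code point
try:
    chr(0x110000)               # one past the end of the code space
except ValueError:
    print('0x110000 is not a valid code point')
```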

  19. Writing code points • Code points in the Basic Multilingual Plane (BMP) are written as U+ followed by four hex digits • for example, the code point for the letter z is written as U+007A • Code points outside the BMP are written using U+ followed by five or six hex digits, as required, • for example, the LANGUAGE-TAG character in Plane 14 is written as U+E0001 • while one private-use character in Plane 16 is written as U+10FFFD

  20. Blocks • Within the Basic Multilingual Plane, code points are grouped in contiguous ranges called blocks • Each block has its own unique and descriptive name • Example blocks: • Basic Latin, • Latin-1 Supplement, • Greek and Coptic, • Cyrillic, • Armenian, • Hebrew, • Arabic, • Arabic Supplement, • Tibetan, • Ogham • Blocks contain contiguous code points but may be of different sizes • The Basic Latin block contains 128 code points, the Cyrillic block 256, the Armenian block only 96 and the Ogham block only 32

  21. Where to find details of these blocks • Unicode.org maintains a list of all blocks at http://www.unicode.org/charts/ • Clicking on a block name gives you a PDF file for the block • For example, the next slide shows the PDF file for the Ogham block

  22. Example PDF file for a Unicode block • The PDF file for a Unicode block gives the following information for each symbol in the block • a picture of the symbol • its code point • a descriptive name for the symbol
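The descriptive names printed in those charts are also available programmatically; for instance, Python's standard `unicodedata` module exposes them (again, an illustrative aside):

```python
import unicodedata

# Each assigned code point has the descriptive name shown in the charts.
print(unicodedata.name('\u1681'))  # OGHAM LETTER BEITH (from the Ogham block)
print(unicodedata.name('\u00f3'))  # LATIN SMALL LETTER O WITH ACUTE
```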

  23. Backward compatibility • Unicode was designed to be compatible with ASCII • Thus, the Basic Latin block contains all 128 ASCII standard characters • Each ASCII code maps directly to a Unicode code point in this block • For example, the letter z, whose ASCII code is 7A, has the code point U+007A • The letter n, whose ASCII code is 6E, has the code point U+006E • And so on

  24. Backward compatibility (contd.) • The Latin-1 Supplement block also contains 128 code points • Some, but not all, of these code points are similar to the codes in the Windows-1252 (Microsoft Windows Latin-1) code page • Those code points in the Latin-1 Supplement block which do map directly to Windows-1252 codes include the code points for Latin letters with accents and other common diacritical marks such as umlauts • Thus, the accented letter ó, which has the Windows-1252 code of F3, has the Unicode code point U+00F3

  25. Implementations of Unicode • Unicode is an abstract code • Various implementations include • UTF-32 • UTF-16 • UTF-8

  26. UTF-32 • UTF-32 is a fixed-length encoding of Unicode • Every code point is directly encoded using 32 bits, or four bytes
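A quick Python check of the fixed four-bytes-per-code-point rule (illustrative aside, using the little-endian variant so no byte order mark is written):

```python
# UTF-32 spends exactly four bytes on every code point.
data = 'znaków'.encode('utf-32-le')  # little-endian variant, no BOM

print(len(data))          # 24 - six characters x four bytes
print(data[:4].hex(' '))  # 7a 00 00 00 - 'z' (U+007A), low byte first
```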

  27. UTF-16 • UTF-16 is a 16-bit encoding of Unicode • Unlike UTF-32, it is a variable-length encoding • code points are encoded with one or two 16-bit code-units, • that is, in UTF-16 a code point is encoded as either two or four bytes

  28. UTF-8 • Like UTF-16, UTF-8 is a variable-length encoding • It uses a different number of bytes for different code points • Code points for the most common characters, the English letters, are represented as single bytes • Less common characters are represented as two bytes, • Rarer characters are represented as three or more bytes • This means that, for text dominated by ASCII characters, UTF-8 is the most space-efficient representation of Unicode • We will see it in more detail later

  29. Examining the Notepad formats • To put some flesh on this, let's examine the various formats in which Notepad stores the small file we saw earlier

  30. The so-called Unicode format in Notepad • As we shall see, this format is actually a form of UTF-16 • Its proper name is UTF-16LE

  31. Viewing the "Unicode" file in XVI32 (part 1) • The file contains 14 bytes • The first two bytes contain a byte order mark (BOM), which will be explained on a later slide • Then, each character is encoded as two bytes

  32. Viewing the "Unicode" file in XVI32 (part 2) • The byte order mark is stored at the start of a file to tell programs whether the file is written in little-endian or big-endian format • The Unicode code point for the byte order mark is U+FEFF • Note that the BOM is actually stored in our file as FF FE • FF FE is the little-endian version of FE FF, so it tells us that the file is stored in little-endian format • So we know that each character in the rest of the file is encoded in little-endian format

  33. Viewing the "Unicode" file in XVI32 (part 3) • The first two bytes of the file, the BOM, tell us that the file is stored in little-endian format • Then, each character is encoded as two bytes, in little-endian format • The Unicode code point for z is U+007A • But, because the file is in little-endian format, the code point for z is stored in the file as 7A 00 • The Unicode code point for n is U+006E but this little-endian file stores it as 6E 00 • And so on for the other characters
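The 14 bytes of Notepad's "Unicode" file can be reproduced with Python's `codecs` module (illustrative aside; `BOM_UTF16_LE` is the FF FE mark):

```python
import codecs

# Notepad's "Unicode" format: the BOM (FF FE) followed by UTF-16LE data.
data = codecs.BOM_UTF16_LE + 'znaków'.encode('utf-16-le')

print(len(data))      # 14 bytes, matching the file seen in XVI32
print(data.hex(' '))  # ff fe 7a 00 6e 00 61 00 6b 00 f3 00 77 00
```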

  34. Viewing the "Unicode" file in XVI32 (part 4) • Unicode was designed to be compatible with ASCII, so the Basic Latin block contains all 128 ASCII standard characters, each ASCII code mapping directly to a Unicode code point • z, whose ASCII code is 7A, has the code point U+007A and appears as 7A 00 in this little-endian file • n, whose ASCII code is 6E, has the code point U+006E and appears as 6E 00 in this little-endian file • and so on for a, k and w

  35. Viewing the "Unicode" file in XVI32 (part 5) • The Latin-1 Supplement block also contains 128 code points • Some, but not all, of these code points are similar to the codes in the Windows-1252 (Microsoft Windows Latin-1) code page • The Windows-1252 codes for common Latin letters with accents or other diacritical marks do map directly to code points in the Latin-1 Supplement block • Thus, ó, which has the Windows-1252 code F3, has the code point U+00F3 and appears as F3 00 in this little-endian file

  36. Big-endian Unicode

  37. Viewing the Unicode big-endian file in XVI32 • The proper name for this format is UTF-16BE • The file has 14 bytes • The first two bytes contain the byte order mark and, then, each of the six characters is encoded as two bytes • The fact that the byte order mark, U+FEFF, is stored as FE FF tells us that the file is in big-endian format • Thus, the code point for z, U+007A, is stored as 00 7A; the code point for n, U+006E, is stored as 00 6E; and so on
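The same check works for the big-endian file (illustrative aside):

```python
import codecs

# The big-endian variant: BOM FE FF, then each code point high byte first.
data = codecs.BOM_UTF16_BE + 'znaków'.encode('utf-16-be')

print(len(data))      # 14
print(data.hex(' '))  # fe ff 00 7a 00 6e 00 61 00 6b 00 f3 00 77
```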

  38. UTF-8

  39. UTF-8 • A UTF-8 file represents characters using a space-efficient representation of Unicode code points • It uses a different number of bytes for different code points • Code points for the most common characters, the English letters, are represented as single bytes, • Less common characters are represented as two bytes, • Rarer characters are represented as three or more bytes

  40. UTF-8 (contd.) • Single-byte codes are used for the Unicode points U+0000 through U+007F • Thus, the UTF-8 codes for these characters are exactly the same as the corresponding ASCII codes. • As we shall see, these single-byte codes can be easily distinguished from the first bytes of multi-byte codes • The high-order bit of the single-byte codes is always 0 • As we shall see, the high-order bit in the first byte of a multi-byte code is always 1

  41. UTF-8 (contd.) • Each of the first 128 characters in UTF-8 needs only one byte • This covers all ASCII (English) characters • Each of the next 1,920 characters needs two bytes • This covers the remainder of almost all Latin-derived alphabets • It also covers the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Maldivian alphabets • It also covers the so-called Combining Diacritical Marks, which can be used to construct new letters as well as providing an alternative way of specifying the standard accented letters that are already covered above • The remaining characters in the 65,536-character Basic Multilingual Plane (which contains nearly all characters used in living languages) need three bytes each • Four bytes are needed for characters in the other Unicode planes, • these include less-common characters in the Chinese, Japanese and Korean scripts as well as various historic scripts and mathematical symbols
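These byte counts can be confirmed with Python (illustrative aside; the sample characters are mine, chosen one from each range):

```python
# UTF-8 byte counts for characters from different parts of Unicode.
samples = {
    'z': 1,    # ASCII (U+007A)
    'ó': 2,    # Latin-1 Supplement (U+00F3)
    'ж': 2,    # Cyrillic (U+0436)
    '€': 3,    # elsewhere in the Basic Multilingual Plane (U+20AC)
    '😀': 4,   # outside the BMP (U+1F600)
}
for ch, expected in samples.items():
    assert len(ch.encode('utf-8')) == expected
    print(f'U+{ord(ch):04X} -> {expected} byte(s)')
```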

  42. UTF-8 (contd.) • As seen before, the high-order bit of a single-byte code is always 0 • A multi-byte code consists of a leading byte and one or more continuation bytes. • The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position. • The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence • so the length of the sequence can be determined without examining the continuation bytes. • The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. • The high-order bits go in the leading byte, lower-order bits in succeeding continuation bytes. • The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point. • We shall see an example of multi-byte coding in a later slide
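The bit rules above can be captured in a toy encoder for code points up to U+FFFF (a hypothetical sketch for illustration; real code should simply call `str.encode('utf-8')`):

```python
def utf8_encode(cp: int) -> bytes:
    """Toy UTF-8 encoder for code points up to U+FFFF, following the
    leading-byte / continuation-byte rules described above."""
    if cp <= 0x7F:      # one byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x7A).hex())    # 7a     - z, same as its ASCII code
print(utf8_encode(0xF3).hex())    # c3b3   - ó
print(utf8_encode(0xFEFF).hex())  # efbbbf - the byte order mark
```

Checking the results against Python's built-in codec confirms the bit manipulation.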

  43. Viewing the UTF-8 file in XVI32 (part 1) • As we shall see later, the first three bytes in this file, EF BB BF, form a "byte order mark", although this mark is not needed or recommended by the Unicode standard • The next four bytes, 7A 6E 61 6B, look like ASCII codes - they contain the single-byte UTF-8 encodings of U+007A, U+006E, U+0061 and U+006B (znak) • The next two bytes C3 B3, contain a two-byte encoding of ó, as explained in the next slide • The last byte contains a single-byte encoding of U+0077 (w)

  44. Viewing the UTF-8 file in XVI32 (part 2) • The two-byte UTF-8 encoding for ó, which has the code point U+00F3, is as follows • Since there are two bytes in the code, the leading byte is of the form 110x xxxx and the continuation byte has the form 10xx xxxx • U+00F3 has the following bits, 0000 0000 1111 0011 • The significant bits in this code point are 1111 0011 • There is room for 6 bits in the continuation byte, so it can contain the six low-order bits 11 0011, so this byte becomes 1011 0011, which is B3 • The two high-order bits, 11, will be placed in the leading byte • But there is room for 5 bits in the leading byte so these two bits must be padded with three high-order 0s • So the leading byte becomes 110 00011, that is 1100 0011, which is C3 • So the UTF-8 code for ó is C3 B3

  45. Viewing the UTF-8 file in XVI32 (part 3) • We can now see that the first three bytes in this file, EF BB BF, are the UTF-8 encoding of the Unicode byte order mark U+FEFF • Since there are three bytes in the code, the leading byte is of the form 1110 xxxx and each of the two continuation bytes has the form 10xx xxxx • So the bytes are 1110 xxxx 10xx xxxx 10xx xxxx • All bits in the U+FEFF code point are significant: 1111 1110 1111 1111 • There is room for 6 bits in the last byte, so it can contain the six lowest-order bits 11 1111, so this byte becomes 1011 1111, which is BF • The next six bits, 1110 11, will be placed in the middle byte, so this becomes 1011 1011, which is BB • The leading byte gets the highest-order bits, becoming 1110 1111, which is EF • So the UTF-8 code for the byte order mark is EF BB BF
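Python handles this BOM for us: its `utf-8-sig` codec writes EF BB BF on encoding and strips it on decoding (illustrative aside):

```python
import codecs

# Notepad's UTF-8 files begin with the encoded BOM.
data = 'znaków'.encode('utf-8-sig')

print(data.hex(' '))  # ef bb bf 7a 6e 61 6b c3 b3 77 - 10 bytes in all
print(data == codecs.BOM_UTF8 + 'znaków'.encode('utf-8'))  # True
print(data.decode('utf-8-sig'))  # znaków - the BOM is stripped again
```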

  46. Let's check our understanding by considering some other languages

  47. A web page in Hebrew • Consider this page http://www.haaretz.co.il/news/politics/1.2151492 • Let's copy the first word in the headline, ראש • (It's pronounced 'rosh' and means head, leader, boss, chief)

  48. Let's save this word in a UTF-8 file • Start a new document in Notepad • Paste the word we have just copied • And save the file using the UTF-8 format

  49. Inspecting the UTF-8 • Open the file with XVI32 • The first three bytes, EF BB BF, are familiar • They are the UTF-8 encoding of the Unicode code point for the byte order mark

  50. Inspecting the UTF-8 (contd.) • There are six remaining bytes in the file, D7 A8 D7 90 D7 A9 • So we suspect there are two bytes per character, but let's check • Look at the first byte, D7, in binary format: 1101 0111 • It must be a leading byte in a multi-byte code, because its first bit is a 1 • Indeed, it must be the first byte in a two-byte code, because its first three bits are 110 • So the first character in the file has a two-byte UTF-8 code, D7 A8 • Let's compute the Unicode code point and see the character
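The computation can be checked by letting Python decode the six bytes from the file (illustrative aside):

```python
# Decode the six bytes found in the file as three two-byte UTF-8 codes.
data = bytes.fromhex('d7 a8 d7 90 d7 a9')
word = data.decode('utf-8')

print(word)                               # ראש
print([f'U+{ord(c):04X}' for c in word])  # ['U+05E8', 'U+05D0', 'U+05E9']
```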
