Unicode Migration for Multilingual Database Applications

Migration of a 4GL and Relational Database to Unicode
Tex Texin, International Product Manager

Presentation Goals • Outline Migration Steps • Describe Design Considerations • Leverage Existing Double-byte Implementation • Describe Impact on 4GL and Report Formats

PROGRESS Application Development Suite • Powerful tools for the rapid creation of distributed business applications • Creates character, GUI, or web-based clients from common source • Host-based, client-server, or n-tier distribution on a variety of platforms • Scalable, robust, and open RDBMS • International, double-byte enabled

Possible Configuration Options [Diagram; labels: GUI Client, Client-Server, Web-based Client, Host-based Character Client, Database Server, Progress Database, Optional n-tier Application Server, Other Database]

Why do our customers need Unicode? • Many do not. However: • Multinationals deploy across regions with incompatible character sets, yet they must share data between them. • Programs are distributed worldwide with one container of text in many languages. • Certain applications require multilingual databases, e.g. translation systems and web-based applications.

The Existing Architecture • 1.5M lines of C code • 0.3M lines of 4GL code • Double-byte enabled • CJK, 9 double-byte charsets supported • 2-byte only, no 3 or 4-byte • No shift-sequenced charsets • DBE changes earmarked, easy to find • 4 years, 3 developers, 2 QA

Estimated cost of implementing UCS-2 was very big! • Changing to 16-bit text units affects almost every source module • Largest cost is separating char variables based on usage for text or binary data • Use 16-bit null terminators, ignore 8-bit nulls: “A” ⇒ 0041, 0000; “Ā” ⇒ 0100, 0000 • Pointer arithmetic (advance 2 bytes) • Sizing (bytes or characters) • New API to use new WIDE TEXT datatype
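The null-terminator point can be illustrated with a minimal sketch. The helper names (`ucs2_units`, `c_strlen`, `ucs2_strlen`) are hypothetical and only mimic what C code would do; big-endian byte order is an assumption made to match the slide's notation.

```python
def ucs2_units(text: str) -> bytes:
    """Encode BMP text as big-endian 16-bit units plus a 16-bit terminator."""
    return text.encode("utf-16-be") + b"\x00\x00"


def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count bytes up to the first 8-bit zero."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


def ucs2_strlen(buf: bytes) -> int:
    """Count 16-bit units up to the first aligned 16-bit zero."""
    for i in range(0, len(buf) - 1, 2):
        if buf[i:i + 2] == b"\x00\x00":
            return i // 2
    return len(buf) // 2


a = ucs2_units("A")             # bytes 00 41 00 00
a_macron = ucs2_units("\u0100")  # bytes 01 00 00 00

# An 8-bit scan stops at the first zero byte, which is inside the text.
print(c_strlen(a), c_strlen(a_macron))
# A 16-bit scan steps two bytes at a time and finds the real terminator.
print(ucs2_strlen(ucs2_units("A\u0100")))
```

Every routine that walks a string with an 8-bit terminator test or a one-byte pointer step would need the second form, which is why the change touches almost every module.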

Product requirements for a multilingual version • Minimize cost for application migration • Minimize cost for application upgrade • Minimize support cost • One executable! • Maintain user-definable character sets • Add UTF-8 as just another character set • UTF-8 algorithms are compatible with other charsets

Scaled down multilingual proposal: UTF-8 implementation • Implement UTF-8 as 3-byte character set • Leverage & extend double-byte enabling • Places to change are already earmarked • Restrict to composed characters for now • Restrict to no surrogates • Supports all the markets we are in • UTF-8-enable 4GL and RDBMS first • Provides multilingual logic and storage • Java and other client technologies coming

Architecture changes: UTF-8-enabling the string library • N-byte enable character and string functions • GetNextChar, GetPreviousChar • GetCharacterSize (table-based) • Modified IsFirstByte • New GetColumnLength • New datatype normalized “BIG” char • Minor algorithm changes for efficiency • Find Character
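A rough sketch of how N-byte stepping functions of this kind might look follows. The function names echo the slide, but the signatures, the size table, and the error handling are assumptions for illustration, not the product's actual API. The sketch assumes well-formed UTF-8 of at most 3 bytes per character and does not validate tail bytes.

```python
# Character size indexed by the high nibble of the lead byte.
# 0 marks a tail byte (0x8-0xB) or a 4-byte lead (0xF), which this
# 3-byte-only sketch does not support.
CHAR_SIZE = [1] * 8 + [0] * 4 + [2, 2, 3, 0]


def get_character_size(lead: int) -> int:
    """Table-based size lookup from the lead byte."""
    return CHAR_SIZE[lead >> 4]


def is_first_byte(b: int) -> bool:
    """True unless the byte is a UTF-8 tail byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) != 0x80


def get_next_char(buf: bytes, pos: int) -> int:
    """Return the byte offset of the character after the one at pos."""
    size = get_character_size(buf[pos])
    if size == 0:
        raise ValueError(f"offset {pos} is not a supported lead byte")
    return pos + size


def get_previous_char(buf: bytes, pos: int) -> int:
    """Return the byte offset of the character before offset pos."""
    pos -= 1
    while pos > 0 and not is_first_byte(buf[pos]):
        pos -= 1
    return pos


buf = "a\u00e9\u65e5".encode("utf-8")  # 1-byte, 2-byte, and 3-byte characters
pos = 0
while pos < len(buf):
    nxt = get_next_char(buf, pos)
    print(pos, buf[pos:nxt].decode("utf-8"))
    pos = nxt
```

Because tail bytes are self-identifying in UTF-8, stepping backward needs no scan from the start of the string, unlike some double-byte charsets where a byte value can be either a lead or a tail byte.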

Architecture changes: UTF-8-enabling character tables • String libraries use character tables • Alphanumeric, Lead-byte, Tail-byte • Upper, lower case (700+ characters) • New property ColumnCount • New table formats • Old architecture presumed 256 byte table • Now organized by range lists and trie • Update table compiler & allow hex entry
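One way a range-list table might replace a flat 256-entry table is sketched below for a column-count property. The table contents, the `column_count` name, and the default value are illustrative assumptions; the ranges are a small sample, not a complete property table, and the trie variant is not shown.

```python
import bisect

# (first code point, last code point, column count), sorted and non-overlapping.
# Illustrative sample only.
COLUMN_RANGES = [
    (0x0020, 0x007E, 1),  # ASCII printable
    (0x00A0, 0x00FF, 1),  # Latin-1 Supplement
    (0x3040, 0x30FF, 2),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, 2),  # CJK Unified Ideographs
    (0xFF61, 0xFF9F, 1),  # Halfwidth Katakana
]
_STARTS = [first for first, _, _ in COLUMN_RANGES]


def column_count(cp: int, default: int = 1) -> int:
    """Binary-search the range list for the code point's display columns."""
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _, last, cols = COLUMN_RANGES[i]
        if cp <= last:
            return cols
    return default


for ch in ["A", "\u00e9", "\u30ab", "\u65e5", "\uff76"]:
    print(f"U+{ord(ch):04X}", column_count(ord(ch)))
```

A range list stays small because properties tend to be constant over long runs of code points, whereas a flat table indexed by code point would need tens of thousands of entries per property.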

Architecture changes: UTF-8-enabling sorting • How to sort multilingual data? • Binary sort used for double-byte data • With UTF-8, Europe is 2-byte, CJK 3-byte • Solution: • Binary sort on server • Client uses native sort • Bump key length limit for UTF-8 • Next phase will be enhanced sort
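The behavior of a binary sort on UTF-8 data can be checked with a short sketch. The sample words are arbitrary; the property shown (byte order of UTF-8 matches code point order) is a general feature of the encoding, not something specific to this product.

```python
words = ["zebra", "\u00e9t\u00e9", "\u65e5\u672c", "Apple", "\u00c5ngstr\u00f6m"]

# Binary sort: compare the raw UTF-8 bytes, as a server-side index might.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Code point sort: compare the Unicode scalar values.
by_code_point = sorted(words, key=lambda w: [ord(c) for c in w])

# UTF-8 preserves code point order, so the two orderings agree.
print(by_bytes == by_code_point)

# The order is stable and cheap, but not linguistic:
# "zebra" sorts before the accented words.
print(by_bytes)

# Key sizing: the same number of characters needs more bytes in UTF-8.
for ch in ["e", "\u00e9", "\u65e5"]:
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```

This is why a binary sort is a workable first phase on the server while a culturally correct order is left to the client's native sort, and why a key length limit expressed in bytes has to grow.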

Architecture changes: Character conversion algorithms • Existing, user-definable conversions • Single-byte character set table maps • Double-byte Shift-JIS - EUCJIS algorithm • New table-driven automated conversions • Single-byte to UTF-8, and back • Double-byte to UCS-2 and back • UTF-8 - UCS-2 • Trie for speed and memory optimization • Requires significant QA for data integrity
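A table-driven single-byte to UTF-8 conversion, with the reverse direction and a round-trip check, might be sketched as follows. All function names are hypothetical. Python's built-in `cp1252` codec stands in for a compiled conversion table, and a dictionary stands in for the trie; both are assumptions made to keep the sketch short.

```python
def build_tables(codec: str):
    """Build a 256-entry forward table and a reverse map for one charset."""
    to_utf8 = []
    from_utf8 = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            to_utf8.append(None)  # byte is undefined in this charset
            continue
        seq = ch.encode("utf-8")
        to_utf8.append(seq)
        from_utf8[seq] = byte
    return to_utf8, from_utf8


def single_to_utf8(data: bytes, to_utf8) -> bytes:
    """Convert single-byte text to UTF-8 with one table lookup per byte."""
    out = bytearray()
    for b in data:
        seq = to_utf8[b]
        if seq is None:
            raise ValueError(f"byte 0x{b:02X} has no mapping")
        out += seq
    return bytes(out)


def utf8_to_single(data: bytes, from_utf8) -> bytes:
    """Convert UTF-8 back; assumes well-formed input of at most 3 bytes per character."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        lead = data[pos]
        size = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3
        seq = data[pos:pos + size]
        if seq not in from_utf8:
            raise ValueError(f"no mapping for bytes {seq.hex()}")
        out.append(from_utf8[seq])
        pos += size
    return bytes(out)


to_utf8, from_utf8 = build_tables("cp1252")
source = b"caf\xe9 \x80"  # e-acute and the euro sign in cp1252
converted = single_to_utf8(source, to_utf8)
print(converted.hex(" "))
print(utf8_to_single(converted, from_utf8) == source)
```

A round-trip comparison over every defined byte of every supported charset is one simple form the data-integrity QA could take.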

Architecture changes: Impact on the 4GL user • 4GL is character set independent • Almost all functions are character-based • 3 functions require optional byte-basing • Length, Substring, Overlay • Options: Byte, Character • Add new option: Column • Format (Picture) Phrase • “XXXX” has different meaning for UTF-8
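The difference between the Byte, Character, and Column options can be sketched in Python rather than 4GL. The `length` function and its `unit` parameter are hypothetical stand-ins, not 4GL syntax, and the column rule (two columns for East Asian wide or fullwidth characters, one otherwise) is a simplifying assumption.

```python
import unicodedata


def length(text: str, unit: str = "character") -> int:
    """Measure text in characters, UTF-8 bytes, or display columns."""
    if unit == "character":
        return len(text)
    if unit == "byte":
        return len(text.encode("utf-8"))
    if unit == "column":
        return sum(
            2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
            for c in text
        )
    raise ValueError(f"unknown unit: {unit}")


text = "A\u00e9\u65e5"  # Latin letter, accented letter, CJK ideograph
for unit in ("character", "byte", "column"):
    print(unit, length(text, unit))
```

For this three-character string the three measures all differ, which suggests why a four-position format such as “XXXX” can no longer be read as a single fixed size under UTF-8.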

Status • Functioning Well • Going to second beta • Implemented with very low cost • Performance is OK • Metrics not yet available • Testing is most significant cost • Reviewing all character set properties • Evaluating all conversions

Futures • For the Progress International Team: Multilingual Clients; Enhanced Character Folding; Enhanced Sorting • For Progress Customers: Deployment of multilingual databases; Worldwide access to these databases; Worldwide deployment of multi-language applications

Conclusions • Migration can be achieved in phases • Migration through UTF-8 can be low cost • Double-byte applications can migrate easily to UTF-8 • Asian users can integrate with other languages now • Non-English users can integrate with Asian languages now

Any questions?
