Siebel CRM Unicode Conversion 2 – The DBA Perspective

Siebel CRM Unicode Conversion 2 – The DBA Perspective Brian HitchcockOCP 8, 8i, 9i DBA Sun Microsystems brian.hitchcock@sun.com brhora@aol.com DCSIT Technical Services DBA www.brianhitchcock.net Brian Hitchcock November 11, 2004 Page 1

CRM Unicode Conversion • Three separate presentations • 1) The overall conversion process • What we had, what we wanted, how to get there • Issues that come up during conversion • 2) Multi-byte data in the existing CRM db • What’s the issue, how did it happen • A general method to find and fix this problem • 3) The actual conversion • What really happened • Issues that came up and how they were resolved • Focus on DBA issues, not Siebel application

How Did I Get Involved? • Sleeping in a meeting… • Heard someone say • “We told the users to stop entering Japanese into the CRM system but we aren’t sure they stopped” • Woke up, said • “I’ve done that before…” • See “Case of the Missing Kanji” • Don’t wake up in meetings…

What’s The Issue? • Existing Siebel CRM system • Oracle 8.1.7.4 • Single-byte character set (WE8ISO8859P1) • Interface systems • Multi-byte character set(s) (UTF8) • Handle data between single,multi-byte apps • Want to convert to Unicode • Siebel, database, interfaces all should be UTF8 • Eliminate interface systems

Users What We Had Amer 8859P1 Emea Apac UTF8 UTF8 Tcustdb Apac Custdb Apac UTF8 UTF8 Tcustdb Emea Custdb Emea Custdb Amer Siebel CRM WE8ISO8859P1 8859P1 Ordering System Oracle Db WE8ISO8859P1

Users What We Wanted UTF8 Amer Emea Apac UTF8 Custdb Apac UTF8 Custdb Emea Custdb Amer Siebel CRM WE8ISO8859P1 UTF8 Ordering System Oracle Db AL32UTF8

What We Wanted • All data in one database • All languages • Unicode • Eliminate interface systems • Reduce support costs • Support increased CRM functionality • All data in one place • Supports new business functionality

Multi-byte Data In Source Db? • Source db is WE8ISO8859P1 • Single-byte character set • Doesn’t support multi-byte characters • That’s the official story • The reality is somewhat different • What, if any multi-byte data is in source db? • How to determine correct character set? • How to find, how to fix? • Japanese, Chinese, others?

But Wait, There’s More… • Not just multi-byte data to look for • Non-p1 character data also • Non multi-byte character data • Could be WE P1 (western European) • German, Italian, French etc. • Could be WE Pn • Polish, Greek, Russian etc. • How to find?

How Polish Was Handled • Use separate app that sends polish (P2) to CRM database • Stored in P1 db • Triggers move this polish data to TWCD • Triggers in TWCD • Know that it’s polish (P2) • Convert to UTF8 and send to WCD db • Therefore, multiple languages in Siebel P1 db

What’s the Problem? • Character data from multiple languages • Stored in oracle db • Db configured for P1 • P1 supports multiple WE languages • Does not support polish, Russian, etc. • Need to find all such character data • Non-p1 can be • Single-byte (polish, Russian, etc.) • Multi-byte (Japanese, Chinese, etc.)

Single-byte Character Sets • All Pn (8859-1, 8859-2, etc.) character sets • Share same range of byte codes, 0 to 255 • Above 0xA1 (decimal 161) • Same byte codes represent different characters • Example • WE8ISO8859P1 (8859-1) • Byte code 0xA3 (decimal 163) is character £ • EE8ISO8859P2 (8859-2) • Same byte code, 0xA3 is character Ł

Finding Non-p1 Char Data? • Logically • Examine db design, Siebel docs, figure out which tables designed to store language specific (local language) data • Some column (country code) in these tables to tell you which country data is from • Determine correct character set for data from each country • Convert these tables manually to AL32UTF8 as part of overall Unicode conversion process

Not Good • Want general method • No need to analyze the meaning of existing data • Need automated way to find all non-P1 char data • Can’t do it • No general way to determine if char data is P1 or P2 or Pn • As shown before, byte code 0xa3 (decimal 163) • Character £ in P1 • Character Ł in P2

Good • But, can find non-ASCII data in general • And then find multi-byte character data • Use separate approach to find non-P1 • Use PL/SQL code • Examine every table • Examine every column that holds character data • Determine which rows if any are ASCII • Rows that aren’t ASCII are ‘suspect’ • Identify tables that have any non-ASCII character data

Why Look For ASCII? • Character data that is ASCII • Only 7 bits used to encode character • 8th bit of every byte is 0 • For non-ASCII, 8th byte is set • WE8ISO8859Pn • Multi-byte, Japanese, Chinese, etc. • By eliminating all tables that are ASCII • No need to ask are they P1, P2, Pn or multi-byte • Greatly reduces the task

How To Find Non-ASCII? • Use SQL function convert • Convert a given column to ASCII character set • Compare resulting string with original • If original string is all ASCII • Will match converted string • If not a match • Column value is non-ASCII • Could be WE8ISO8859Pn • Could be multi-byte

Example Finding Non-ASCII • in WE8ISO8859P1 databasecreate table Psycho_Acircle (text VARCHAR2(100));insert into Psycho_Acircle values (chr(197)||'BCDE');insert into Psycho_Acircle values ('ABCDE');select * from Psycho_Acircle;TEXT-----ÅBCDEABCDEselect convert(text,'US7ASCII','WE8ISO8859P1') from Psycho_Acircle;CONVERT(TEXT,'US7ASCII','WE8ISO8859P1')---------------------------------------?BCDEABCDE ÅBCDE is not the same as ?BCDE

Not Included • Did not scan • LONG datatype columns • CLOB datatype columns • Didn’t have any in schema • PL/SQL code in database • Dev team determined this wasn’t needed

Scripts Strategy • Eliminate as much as possible • Identify all ASCII only tables • Left with set of non-ASCII tables • For remaining tables • Find likely Japanese character data • Verify it is Japanese • Copy to separate table • Remove from non-ASCII tables • Repeat for other languages • How to identify byte patterns for each language?

PL/SQL scripts • Scripts used • Scan_Table_1_Gen_Column_Info.sql • Scan_Table_2_Gen_Nonascii_rows_Info.sql • Scan_Table_3_Gen_NonasciiTables_NoLong.sql • Scan_Table_4_Gen_NonasciiTables_NonasciiCols_Only.sql • Scan_Table_5_Gen_NonasciiTables_YesLong.sql • Scan_Table_6_Gen_NA_EUCJP_info_sql_col_info.sql • Scan_Table_7_Gen_NA_EUCJP_Tables.sql • Scan_Table_8_Gen_NA_EUCJP_2_rows_info.sql

Scripts • Each script generates table(s) • Output of each script stored in table(s) • Next script uses tables • Lots of intermediate data stored • Helped develop scripts • Each script simpler • Provided extra output for developers, analysts to help them verify results • Is this data really Japanese?

What Does Each Script Do? • Scan_Table_1_Gen_Column_Info.sql • Scans all tables in a schema • Creates two tables • Table_Gen_Info • Info on all tables • Table_Column_Info • Info on character columns • Which contain any non-ASCII strings • Doesn’t include LONG columns • Can’t use SQL functions on LONG datatype

What Does Each Script Do? • Scan_Table_2_Gen_Nonascii_rows_Info.sql • Use table Table_Column_Info • Examine tables with non-ASCII character data • Creates two tables • Table_NonAscii_info • Number of rows, columns with non-ASCII data • Table_NonAscii_SQL • SQL to extract non-ASCII data from each table • Useful for developers, analysts to extract data from other environments

What Does Each Script Do? • Scan_Table_3_Gen_NonasciiTables_NoLong.sql • Use tables table_gen_info, table_nonascii_sql • Create copies of tables that have non-ASCII data • Copies contain only the non-ASCII rows • Have all character columns of original table • Helps identify which country data is from • Creates tables as select * from <tablename> • Doesn’t work on tables with LONG column • Tables named NONASCII_<tablename>

What Does Each Script Do? • Scan_Table_4_Gen_NonasciiTables_NonasciiCols_Only.sql • Similar to third (previous) script • Table copies only contain columns that have non-ASCII data • Does handle tables with LONG column • Creates tables of form NA_CO_<tablename> • Set of tables containing all non-ASCII data in the schema

What Does Each Script Do? • Scan_Table_5_Gen_NonasciiTables_YesLong.sql • Creates copies of tables having non-ASCII data • Copy tables have all char columns of base table • Only copies tables that have LONG column • Companion to third script • Deals with tables that have LONG column • Tables named NONASCII_<tablename> • Now have complete set of tables • Have all non-ASCII char columns of base tables

Katakana, Hiragana? • How to find Japanese character data? • Look at hex dump of character data and see lots of ¥_¥ and ¤_¤ • The byte code of ¥ is A4, ¤ is A5 • Many Japanese transliterated terms (company names) start with these bytes • Typical of EUCJP character set • Find rows that contain '%¥_¥%' or '%¤_¤%‘ • repeated ¥ or ¤ means EUCJP more likely • Verify that these rows are indeed Japanese

What Does Each Script Do? • Scan_Table_6_Gen_NA_EUCJP_info_sql_col_info.sql • For table copies with non-ASCII columns only • Look for specific pattern of '%¥_¥%' • Or '%¤_¤%‘ • Creates tables • Table_NA_EUCJP_Info • Table_NA_EUCJP_SQL • Table_NA_EUCJP_COL_INFO

6th Script • What does each table contain? • Table_NA_EUCJP_Info • Number of EUCJP rows in each non-ASCII table • Table_NA_EUCJP_SQL • SQL to extract EUCJP rows • Table_NA_EUCJP_COL_INFO • Number of EUCJP rows in each column

What Does Each Script Do? • Scan_Table_7_Gen_NA_EUCJP_Tables.sql • Create two copies of each table that has EUCJP • Contain rows that have EUCJP • First table, all char columns • Second, only EUCJP columns • Tables created have names • EUCJP_<tablename> • ECUJP_CO_<tablename>

After 7th Script • We have identified EUCJP rows • In non-ASCII tables • Copied these rows to separate tables • Delete these rows from the non-ASCII tables • As we identify rows from a specific char set • Remove them from the non-ASCII tables • Smaller and smaller set of unknown rows

What Does Each Script Do? • Scan_Table_8_Gen_NA_EUCJP_2_rows_info.sql • Find rows containing ¥ or ¤ • Could be Japanese • Could be WE

Results • For each script • Time to run • Output • %of total db that is non-ASCII • Demonstrates power of this approach • No attempt to speed up • Only need to scan once, no need for speed • Copy prod data to separate environment • Run scripts there, develop the SQL to correctly convert the non-ASCII data as needed • Apply to prod as part of Unicode conversion

Results • Scripts run against copy of production db • Database • 25Gb total, but 13Gb free space • 12Gb of actual data to scan • (be skeptical when people tell you they support multi-terabyte dbs, size of actual data counts) • Scripts create tables in the same schema they run in

Results • Script 1 – 2hours • Scanned 12Gb of data • 2483 tables, 63138 columns • Created two tables • Table_gen_info • Table_column_info

1st Script Results SQL> select * from Table_Gen_Info where rownum <=10; TABLENAME NUMROWS NUMCOLS NUMCHARCOLS NUMCLOBCOLS NUMLONGCOLS ------------------------------ ---------- ---------- ----------- ----------- ----------- ACCNT_STAT 15775 5 3 0 0 AMER_AR_OWNER 1085497 7 6 0 0 AMER_AR_T 1060 3 2 0 0 APAC_AR_OWNER 2770 6 6 0 0 AR_ADMIN 5578 35 31 0 0 AR_CON 3573 22 17 0 0 AR_STAT 88652 7 5 0 0 AUDIT_TABLE 53301 29 26 0 0 CONT_CREATED 515126 2 2 0 0 CON_CREATED 184744 2 2 0 0

1st Script Results SQL> select * from Table_Column_Info where rownum <=20; TABLENAME NUMROWS NUMCHARCOLS CHARCOLNUM CHARCOLNAME NUMNONASCIIROWS ------------------------------ ---------- ----------- ---------- ------------ --------------- ACCNT_STAT 15775 3 1 WCD 0 ACCNT_STAT 15775 3 2 STATUS 0 ACCNT_STAT 15775 3 3 R4_STATUS 0 ... ... ... AR_ADMIN 5578 31 1 R4_ID 0 AR_ADMIN 5578 31 2 R4_SR_NUM 0 AR_ADMIN 5578 31 3 X_DESC 72 20 rows selected. SQL>

2nd Script Results • 12 minutes • 68 tables that have non-ASCII char data • 68 SQL statements • Overall • We have 12Gb of data • 68/2483 tables have any non-ASCII char data • Only 3% of the tables • But they’re some of the biggest tables • Schema analysis much easier on 68 tables

2nd Script results SQL> select * from Table_NonAscii_Info where rownum <= 10; TABLENAME NUMROWS NUMNONASCIIROWS NUMCOLS NUMNONASCIICOLS ------------------------------ ---------- --------------- ---------- --------------- AR_ADMIN 5578 692 35 6 AR_CON 3573 107 22 3 AUDIT_TABLE 53301 17 29 1 CX_S_ADDR_ORG_XM 69470 275 19 5 C_ACCOUNT 17897 1114 20 1 C_ACT 6562 933 21 6 C_ADDRESS 25590 5490 28 6 C_AR 88638 3760 26 6 C_CONTACT 52574 10401 20 3 C_OPTY 2139 119 25 4

2nd Script Results SQL> select * from Table_NonAscii_SQL where rownum <= 10; TABLENAME LENGTHNONASCIISQL ------------------------------ ----------------- NONASCIISQL -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- AR_ADMIN 445 select count(*) from AR_ADMIN where 1=0 or X_DESC != CONVERT (X_DESC, 'US7ASCII', 'WE8ISO8859P1') or LAST_NAME != CONVERT (LAST_NAME, 'US7ASCII', 'WE8ISO8859P1') or FST_NAME != CONVERT (FST_NAME, 'US7 ASCII', 'WE8ISO8859P1') or ACCOUNT != CONVERT (ACCOUNT, 'US7ASCII', 'WE8ISO8859P1') or OWNER_LAST_NAME != CONVERT (OWNER_LAST_NAME, 'US7ASCII', 'WE8ISO8859P1') or R3_CREATED_LAST_NAME != CONVERT (R3_C REATED_LAST_NAME, 'US7ASCII', 'WE8ISO8859P1') AR_CON 233 select count(*) from AR_CON where 1=0 or OWNER_LAST != CONVERT (OWNER_LAST, 'US7ASCII', 'WE8ISO8859P1') or OWNER_FST != CONVERT (OWNER_FST, 'US7ASCII', 'WE8ISO8859P1') or R3_X_NOTES != CONVERT (R3_X_N OTES, 'US7ASCII', 'WE8ISO8859P1') AUDIT_TABLE 100 select count(*) from AUDIT_TABLE where 1=0 or FIELD2 != CONVERT (FIELD2, 'US7ASCII', 'WE8ISO8859P1')

3rd Script Results • 10 minutes • Create copies of non-ASCII tables • Copies contain all character columns • LONG columns not included • Creates 65 tables SQL> select table_name from user_tables where table_name like 'NONASCII%' and table_name not like '%_ORIG‘ and rownum <= 5; TABLE_NAME ------------------------------ NONASCII_AR_ADMIN NONASCII_AR_CON NONASCII_AUDIT_TABLE NONASCII_CX_S_ADDR_ORG_XM NONASCII_C_ACCOUNT

4th Script Results • 7 minutes • Create copies of non-ASCII tables • Copies contain only non-ASCII columns • Creates 68 tables SQL> select table_name from user_tables where table_name like 'NA_CO_%‘ and rownum <= 5; TABLE_NAME ------------------------------ NA_CO_AR_ADMIN NA_CO_AR_CON NA_CO_AUDIT_TABLE NA_CO_CX_S_ADDR_ORG_XM NA_CO_C_ACCOUNT

5th Script Results • 1 minute • Create copies of non-ASCII tables • Copies contain all character columns • LONG column included • Creates 3 tables • only 3 non-ASCII tables have LONG column TABLE_NAME ------------------------------ NONASCII_EIM_ACCNT_DTL NONASCII_EIM_OPTY_DTL NONASCII_S_CS_QUEST_LANG

6th Script Results • 27 minutes • Scan non-ASCII tables • Find '%¥_¥%' or '%¤_¤%‘ • Very likely EUCJP character set • Create three tables • Table_NA_EUCJP_Info (68 tables) • Table_NA_EUCJP_SQL (5 tables) • TABLE_NA_EUCJP_COL_INFO (213 columns) • 5 tables have EUCJP character data

6th Script Results SQL> select * from Table_NA_EUCJP_Info where rownum <= 10; TABLENAME NUM_NONASCII_ROWS NUM_NA_EUCJP_ROWS NUM_NONASCII_COLS NUM_NA_EUCJP_COLS ------------------------------ ----------------- ----------------- ----------------- ----------------- NA_CO_AR_ADMIN 5578 9 6 1 NA_CO_AR_CON 3573 4 3 1 NA_CO_AUDIT_TABLE 53301 0 1 0 NA_CO_CX_S_ADDR_ORG_XM 69470 0 5 0 NA_CO_C_ACCOUNT 17897 0 1 0 NA_CO_C_ACT 6562 0 6 0 NA_CO_C_ADDRESS 25590 0 6 0 NA_CO_C_AR 88638 0 6 0 NA_CO_C_CONTACT 52574 0 3 0 NA_CO_C_OPTY 2139 0 4 0

6th Script Results SQL> select * from Table_NA_EUCJP_SQL; TABLENAME LEN_NA_EUCJP_SQL ---------------- ---------------- NA_EUCJP_SQL -------------------------------------------------------------------------------------------------------- NA_CO_AR_ADMIN 91 select count(*) from NA_CO_AR_ADMIN where 1=0 or X_DESC like '%¥_¥%' or X_DESC like '%¤_¤%' NA_CO_AR_CON 97 select count(*) from NA_CO_AR_CON where 1=0 or R3_X_NOTES like '%¥_¥%' or R3_X_NOTES like '%¤_¤%' NA_CO_S_ADDR_ORG 97 select count(*) from NA_CO_S_ADDR_ORG where 1=0 or COMMENTS like '%¥_¥%' or COMMENTS like '%¤_¤%' NA_CO_S_CONTACT 142 select count(*) from NA_CO_S_CONTACT where 1=0 or COMMENTS like '%¥_¥%' or COMMENTS like '%¤_¤%' or X_DEPT like '%¥_¥%' or X_DEPT like '%¤_¤%' NA_CO_S_SRV_REQ 200 select count(*) from NA_CO_S_SRV_REQ where 1=0 or X_NOTES like '%¥_¥%' or X_NOTES like '%¤_¤%' or X_DESC like '%¥_¥%' or X_DESC like '%¤_¤%' or X_EMAIL_NOTES like '%¥_¥%' or X_EMAIL_NOTES like '%¤_¤%'

6th Script Results SQL> select * from TABLE_NA_EUCJP_COL_INFO where rownum <=10; TABLENAME NUMNONASCIIROWS NUMNACOLS NACOLNUM NAEUCJPCOLNAME NUMNAEUCJPROWS ------------------- --------------- ---------- ---------- ------------------------------ -------------- NA_CO_AR_ADMIN 5578 6 1 X_DESC 9 NA_CO_AR_ADMIN 5578 6 2 LAST_NAME 0 NA_CO_AR_ADMIN 5578 6 3 FST_NAME 0 NA_CO_AR_ADMIN 5578 6 4 ACCOUNT 0 NA_CO_AR_ADMIN 5578 6 5 OWNER_LAST_NAME 0 NA_CO_AR_ADMIN 5578 6 6 R3_CREATED_LAST_NAME 0 NA_CO_AR_CON 3573 3 1 OWNER_LAST 0 NA_CO_AR_CON 3573 3 2 OWNER_FST 0 NA_CO_AR_CON 3573 3 3 R3_X_NOTES 4 NA_CO_AUDIT_TABLE 53301 1 1 FIELD2 0

7th Script Results • 6 minutes • Create two copies of each EUCJP tables • First copy has all character columns of table • Second copy has only the EUCJP columns • Tables named • EUCJP_<tablename> • EUCJP_CO_<tablename>

7th Script Results SQL> select table_name from user_tables where table_name like 'EUCJP_%' minus select 2 table_name from user_tables where table_name like 'EUCJP_CO_%'; TABLE_NAME ------------------------------ EUCJP_AR_ADMIN EUCJP_AR_CON EUCJP_S_ADDR_ORG EUCJP_S_CONTACT EUCJP_S_SRV_REQ SQL> select table_name from user_tables where table_name like 'EUCJP_CO_%'; TABLE_NAME ------------------------------ EUCJP_CO_AR_ADMIN EUCJP_CO_AR_CON EUCJP_CO_S_ADDR_ORG EUCJP_CO_S_CONTACT EUCJP_CO_S_SRV_REQ

Siebel CRM Unicode Conversion 2 – The DBA Perspective