Exploring Oracle PDB Snapshot Clone: Low-Level Details and Use Case at Deutsche Bank

Oracle snapshot clone PDB low level details and Deutsche Bank use case

About me • 15+ years experience in Oracle databases development/administrating. OCE Oracle SQL. • MSSQL Server and Sybase experience. • Used to have a good Java background  • Author of a couple well-known web projects (kontramarka.ru -1st version, sonystyle.ru - 1st version, …) • My blog: https://dmitryremizov.wordpress.com/

Agenda • Our Continues Database Testing approach. • Low level technical details of chosen technology.

Why?

Our motivation • We have quite big and complex database/s with hundreds or even thousands of objects. • Our unit/integration tests are very sensitive to data in various dictionaries, “static data”, etc. • We don’t want to spend substantial amount of money on additional infrastructure.

Before state • Java in memory H2 database was used in some casesor • Temporary Oracle schema were created from scratch • In both cases each single unit test created its own set of objects

Issues with what we had before • Limited support of Oracle syntax • Schema mismatch with a real Oracle schema.

Requirements • Full support of Oracle syntax for current Oracle version. • Ability to keep in sync with a real schema.

NFRs (none functional requirements) • We need to spawn a new “test bed” very quickly (seconds not minutes). • Dictionaries doesn’t mean it is something small (50 mln unique instruments – shares, forwards, futures, etc. • It should work relatively fast as we have hundreds of use cases to test.

Oracle PDB snapshot clone is our technology of choice • Taking all those functional and non-functional requirements into account we have chosen Oracle snapshot clone technology. • Benefits: • A real Oracle database with full Oracle syntax support. • You can create a new Oracle database in a couple of seconds. • You won’t consume much additional disk space for that.

Some sample syntax for PDB clone • CREATE PLUGGABLE DATABASE CLN3 FROM ORCL SNAPSHOT COPYPATH_PREFIX = '/u01/app/oracle/oradata/orcl12c/cln3'FILE_NAME_CONVERT = • ('/u01/app/oracle/oradata/orcl12c/orcl','/u01/app/oracle/oradata/orcl12c/cln3')NOLOGGING;

CREDITS • This investigation and all further activity was mainly inspired by this post • https://oraganism.wordpress.com/2015/10/02/oracle-12c-pdb-snapshot-copy-on-regular-filesystem/ • by Jean-Christophe DAUCHY. • There were many contradictory messages about this feature in the past (even including official Oracle documentation). • See: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/refrn/CLONEDB.html

Oracle PDB snapshot clone overview • That is a feature (or technology) you can use within “container” database. • This is about ability to create a new PDB based on some template PDB. • You have to have some other requirements/prerequisites to be satisfied (but not much)

Let’s start to look deeper What is under the hood

NFRs actually motivated me to investigate chosen technology deeper.

Filesystem level details • Prerequisites: • Oracle 12.2c (potentially a bit earlier) • clone_db=true • Almost any Linux filesystem (sparse files support is required) • “Template PDB” in read only mode • ALTER PLUGGABLE DATABASE ORCL CLOSE; • ALTER PLUGGABLE DATABASE ORCL OPEN READ ONLY; • See: https://dmitryremizov.wordpress.com/2018/11/11/pdb-snapshot-copy-filesystem-level-details/

Filesystem level details 2 CREATE PLUGGABLE DATABASE CLN3 FROM ORCL SNAPSHOT COPYPATH_PREFIX = '/u01/app/oracle/oradata/orcl12c/cln3'FILE_NAME_CONVERT = ('/u01/app/oracle/oradata/orcl12c/orcl','/u01/app/oracle/oradata/orcl12c/cln3')NOLOGGING; Let’s compare our target and source/template datafiles on OS level: • lltotal 2072252-r–r—–. 1 oracle oinstall7938048Oct 27 09:54 APEX_1941389856444596.dbf-r–r—–. 1 oracle oinstall1237327872Oct 27 09:54 sysaux01.dbf-r–r—–. 1 oracle oinstall 367009792 Oct 27 09:54 system01.dbf-rw-r—–. 1 oracle oinstall 67117056 Oct 27 09:31 temp01.dbf-r–r—–. 1 oracle oinstall 398467072 Oct 27 09:54 undotbs01.dbf-r–r—–. 1 oracle oinstall99622912 Oct 27 09:54 users01.dbf • lltotal 160-rw-r—–. 1 oracle oinstall7938048 Nov 11 03:17 APEX_1941389856444596.dbf-rw-r—–. 1 oracle oinstall1237327872 Nov 11 03:17 sysaux01.dbf-rw-r—–. 1 oracle oinstall 367009792 Nov 11 03:17 system01.dbf-rw-r—–. 1 oracle oinstall 67117056 Nov 11 03:17 temp01.dbf-rw-r—–. 1 oracle oinstall 398467072 Nov 11 03:17 undotbs01.dbf-rw-r—–. 1 oracle oinstall99622912 Nov 11 03:17 users01.dbf At 1st glance they look the same but this totalblock stuff.

A real physical space occupation • du -h `ls` • 16K APEX_1941389856444596.dbf • 16K sysaux01.dbf • 40K system01.dbf • 56K temp01.dbf • 16K undotbs01.dbf • 16K users01.dbf

There is a miracle underneath • Filesystem level utilities can help us xfs_bmap(8) - Linux man page Name xfs_bmap - print block mapping for an XFS file Synopsis xfs_bmap [ -adlpv ] [ -nnum_extents ] file Description xfs_bmap prints the map of disk blocks used by files in an XFS filesystem. The map lists each extent used by the file, as well as regions in the file that do not have any corresponding blocks (holes). Each line of the listings takes the following form:extent: [startoffset..endoffset]: startblock..endblock Holes are marked by replacing the startblock..endblock with hole. All the file offsets and disk blocks are in units of 512-byte blocks, no matter what the filesystem's block size is. See: https://linux.die.net/man/8/xfs_bmap/

inode level investigation shows a clear difference • xfs_bmap /u01/app/oracle/oradata/orcl12c/orcl/users01.dbf • xfs_bmap /u01/app/oracle/oradata/orcl12c/cln3/users01.dbf • /u01/app/oracle/oradata/orcl12c/cln3/users01.dbf:0: [0..15]: 36847536..368475511: [16..31]: 36847624..368476392: [32..194575]: hole • /u01/app/oracle/oradata/orcl12c/orcl/users01.dbf:0: [0..15]: 33829848..338298631: [16..10255]: 36450240..364604792: [10256..12815]: 39722952..397255113: [12816..30735]: 39725928..397438474: [30736..51215]: 39743872..397643515: [51216..71695]: 39764616..397850956: [71696..74255]: 39785160..397877197: [74256..97295]: 39787728..398107678: [97296..120335]: 39813864..398369039: [120336..145935]: 39859168..3988476710: [145936..156175]: 39884784..3989502311: [156176..161295]: 39510400..3951551912: [161296..168975]: 39493656..3950133513: [168976..194575]: 36901952..36927551

You can see that only first 2 blocks(8k) of the user01.dbf file (remember:for xfs_bimap - disk blocks are in units of 512-byte blocks ) have mappings to real disk blocks. Other blocks of the file are kind of virtual. • This lead us to an idea thatLinux sparse file technology is used behind the scene. So no real disk space is occupied by the files, only inode contains information about the file blocks “holes”. • 2 blocks allocation can be confirmed on the database side as well: • SELECT v.snapshotfilename, v.clonefilename, v.blocks_allocated FROM V$CLONEDFILE v WHERE clonefilename='/u01/app/oracle/oradata/orcl12c/cln3/users01.dbf';

DML case and filesystem level details UPDATE hr.departments set department_name= department_name || ' department' WHERE department_id=60; • xfs_bmap /u01/app/oracle/oradata/orcl12c/cln3/users01.dbf • /u01/app/oracle/oradata/orcl12c/cln3/users01.dbf:0: [0..15]: 36847536..368475511: [16..31]: 36847624..368476392: [32..147663]: hole3: [147664..147679]: 36872464..368724794: [147680..194575]: hole • du -h users01.dbf24K users01.dbf You can see that only one more real 8K (147679 – 147664+ 1 = 16* 512 B = 8K) disk block has appeared somewhere in the middle of the file

DDL case and filesystem level details CREATE TABLE HR.H_BIG AS SELECT * FROM dba_objects; ALTER SYSTEM FLUSH BUFFER_CACHE; • xfs_bmap /u01/app/oracle/oradata/orcl12c/cln3/users01.dbf • /u01/app/oracle/oradata/orcl12c/cln3/users01.dbf:0: [0..15]: 36847536..368475511: [16..31]: 36847624..368476392: [32..63]: 36880496..368805273: [64..147663]: hole4: [147664..147679]: 36872464..368724795: [147680..150175]: hole6: [150176..150191]: 36872976..368729917: [150192..154239]: hole8: [154240..154287]: 36880224..368802719: [154288..154495]: 36872768..36872975 • ----- omitted some lines27: [186368..186399]: 36880592..3688062328: [186400..204815]: 36881712..3690012729: [204816..207375]: 36874768..3687732730: [207376..215055]: 36946144..36953823 You can see that many ranges of real/physical blocks have appeared in “cloned” datafile. However a real occupied disk size are still much less than “declared” one. du -h users01.dbf16Musers01.dbf ls -lh users01.dbf-rw-r—–. 1 oracle oinstall108M Nov 11 16:40 users01.dbf

It looks like segments have the following structure of “Swiss cheese” So you can have necessary block/s either from left of right side

HOW DO YOU DECIDE? What block you need ?

Keep mapping somewhere • Something else

Physical reads trace • 2 cases: regular PDB and “cloned” PDB to be investigated and compared. (on syscall level) • Method I used: • Attach GDB to some Oracle user process (on server), like: • gdb attach <pid> • Set a break point on pread64 function and examine some input parameters of that function, like:break pread64commandsp $rdi --- $rdi – file descriptorx/4s $rsi -- buffer to read data top $rdx -- number of bytes to readend • ssize_tpread(intfd, void *buf, size_tcount, off_toffset); See: https://linux.die.net/man/2/pread64https://wiki.osdev.org/System_V_ABI#x86-64

“Physical reads trace” results • Regular PDB • “Cloned” PDB • Breakpoint 1, 0x00007fe8b7a8df40 in pread64 () from /lib64/libpthread.so.0 • $55 = 259 — ../pdb1/users01.dbf$56 = 8192 • (gdb) c • Continuing. • Breakpoint 1, 0x00007fe8b7a8df40 in pread64 () from /lib64/libpthread.so.0 • $57 = 259— ../pdb1/users01.dbf • $58 = 40960 • Breakpoint 1, 0x00007fe5a840df40 in pread64 () from /lib64/libpthread.so.0$107 = 263 — ../cln1/users01.dbf)$108 = 8192 amount of byte to read • (gdb) cContinuing.Breakpoint 1, 0x00007fe5a840df40 in pread64 () from /lib64/libpthread.so.0$109 = 264 –../orcl/users01.dbf$110 = 8192(gdb) cContinuing.Breakpoint 1, 0x00007fe5a840df40 in pread64 () from /lib64/libpthread.so.0$111 = 263../cln1/users01.dbf$112 = 40960 • (gdb) cContinuing.Breakpoint 1, 0x00007fe5a840df40 in pread64 () from /lib64/libpthread.so.0$113 = 264 –../orcl/users01.dbf$114 = 40960 On the left side just an usual PDB (PDB1) on the right side “fresh cloned” PDB (CLN1) created from ORCL “template” DB. It still doesn’t have any specific blocks by itself.

File descriptor decipher • Actually it is easy on Linux systems to find out such information, you can go to /proc/<pid>/fdand issue a “ls –la” command

Interpretation You see - you read as twice as more physical blocks for “cloned” PDB case.

“Physical reads trace” results(case2) Here I decided to create a “brand new” segment in my “cloned” PDB, that segment should not contain any blocks from “parent” PDB • Regular PDB • “Cloned” PDB • Breakpoint 1, 0x00007fe8b7a8df40 in pread64 () from /lib64/libpthread.so.0 • $55 = 259 — ../pdb1/users01.dbf$56 = 8192 • (gdb) c • Continuing. • Breakpoint 1, 0x00007fe8b7a8df40 in pread64 () from /lib64/libpthread.so.0 • $57 = 259— ../pdb1/users01.dbf • $58 = 40960 • Breakpoint 1, 0x00007f8f65435f40 in pread64 () from /lib64/libpthread.so.0 • $259 = 263— ../cln1/users01.dbf$260 = 8192 • (gdb) • Continuing. • Breakpoint 1, 0x00007f8f65435f40 in pread64 () from /lib64/libpthread.so.0 • $261 = 263— ../cln1/users01.dbf • $262 = 8192 • (gdb)

Interpretation No need to read “parent”, you read exactly the same amount of blocks

“Physical reads trace” results(case3) Here I have created a cloned PDB,updated several records in itand compare with also cloned PDB but with zero “own” blocks • “Cloned” PDB untouched (case 1) • “Cloned” PDB - mixed • Breakpoint 3, 0x00007f243ed45f40 in pread64 () from /lib64/libpthread.so.0$17 = 263 — that is file descriptor (../cln2/users01.dbf)$18 = 8192Breakpoint 3, 0x00007f243ed45f40 in pread64 () from /lib64/libpthread.so.0$19 = 264 –../orcl/users01.dbf$20 = 8192Breakpoint 3, 0x00007f243ed45f40 in pread64 () from /lib64/libpthread.so.0$21 = 263 — (../cln2/users01.dbf)$22 = 40960Breakpoint 3, 0x00007f243ed45f40 in pread64 () from /lib64/libpthread.so.0$23 = 264 –../orcl/users01.dbf$24 = 40960 • Breakpoint 3, 0x00007f487b31df40 in pread64 () from /lib64/libpthread.so.0$9 = 261 cln1/users01.dbf $10 = 8192Breakpoint 3, 0x00007f487b31df40 in pread64 () from /lib64/libpthread.so.0$11 = 262 orcl/users01.dbf $12 = 8192Breakpoint 3, 0x00007f487b31df40 in pread64 () from /lib64/libpthread.so.0$13 = 261 cln1/users01.dbf $14 = 40960Breakpoint 3, 0x00007f487b31df40 in pread64 () from /lib64/libpthread.so.0$15 = 262 orcl/users01.dbf $16 = 32768

Interpretation 3 in this case 2nd IO lead to reading 40960 bytes from cloned dbdatafile/s and then reads missing 32768 bytes from “parent” dbdatafile/s.

General idea • I am again set up my gdb tracing session but in this time have added a “back trace” command: • break pread64commandsp $rdip $rdxbt11end • Looks like after each PIO Oracle goes to checkdata facility and if some data is missing - gives another physical read call. See: https://dmitryremizov.wordpress.com/2018/10/30/snapshot-cloning-and-physical-io-part-4-algorithm-how-does-it-work/

Some tracing example • Breakpoint 5, 0x00007fb781ddbf40 in pread64 () from /lib64/libpthread.so.0$171 = 261— that is file descriptor (../cln1/users01.dbf)$172 = 65536 — read bytes #0 0x00007fb781ddbf40 in pread64 () from /lib64/libpthread.so.0#1 0x0000000010f35410 in skgfqio ()#2 0x0000000010e08b74 in ksfd_skgfqio ()#3 0x0000000010df59c2 in ksfd_io ()#4 0x000000000d65a3c3 in ksfdss_read ()#5 0x0000000010df4f8f in ksfdread ()#6 0x0000000010dbc5f9 in kcfrbd1 ()#7 0x00000000015e0899 in kcbz_table_scan_read () • Breakpoint 5, 0x00007fb781ddbf40 in pread64 () from /lib64/libpthread.so.0$173 = 262— that is file descriptor (../orcl/users01.dbf)$174 = 65536 read bytes #0 0x00007fb781ddbf40 in pread64 () from /lib64/libpthread.so.0#1 0x0000000010f35410 in skgfqio ()#2 0x0000000010e08b74 in ksfd_skgfqio ()#3 0x0000000010df59c2 in ksfd_io ()#4 0x0000000010df4f46 in ksfdread ()#5 0x000000000d65dea7 in ksfdss_checkdata ()#6 0x000000000d65a44f in ksfdss_read ()#7 0x0000000010df4f8f in ksfdread ()#8 0x0000000010dbc5f9 in kcfrbd1 ()#9 0x00000000015e0899 in kcbz_table_scan_read()

Let’s reformulate a bit.The general idea of I/O on “snapshot clone” PDB 1st attempt 222nd attempt sparse file

New statistics for “PDB clone” case • SELECT s.statistic#, name FROM v$statname s WHERE name like ‘% snap %’;

Methodology used for stats investigation • I detected an address where my session stats array is located(described here:https://dmitryremizov.wordpress.com/2018/11/06/where-session-stats-live-12-2c/ • Set up watchpoints on values change.

Some sample output • Breakpoint 3, 0x00007fb6f8cc6f40 in pread64 () from /lib64/libpthread.so.0 --read $18 = 261 --read “clone datafile”$19 = 40960 • Breakpoint 3, 0x00007fb6f8cc6f40 in pread64 () from /lib64/libpthread.so.0$20 = 262 - --read “base datafile”$21 = 40960 • Hardware watchpoint 6: *(*(0x000000006A8D9078+0x820) +8*108)Old value = 266New value = 2670x000000000d661f5c in ksfdss_update_stats()$22 = "108 physical read snap IO requests base“ —stats updates • Hardware watchpoint 8: *(*(0x000000006A8D9078+0x820) +8*110)Old value = 266New value = 2670x000000000d65e02d in ksfdss_checkdata()$23 = "110 physical read snap IO requests no data" —stats updates

PDB clone stats explained

What we have understood so far • A new database creation process is very fast and scales as O(1) in respect of template database size. • Performance of new “cloned” PDB should not be very bad. • “Fresh” data is read the same way as usual PDB • Worst case scenario can lead to 2 times degradation • Remember about buffer cache. • So we expect that performance degrades not so drastically.

A real life?

Initial idea:Each unit test spawns its own “thin” database

BUT

“Serialization issue” -12c • If you try to run simultaneously several/many PDBs creation you will see that it does not scale well. • Creation time increases almost linearly with a number of databases to create. • So it seems we have some resource we are syncing on.

What we are waiting on?

waitprof.sql result(Tanel’s Poder script) % Total Total Event Distinct Avg time SID STATE EVENT P1 P2 P3 SEQ# Time Timems Events ms/Event ------- ------- ----------------------------- ------- ----------------- -- ---- ------ --------- ------ ---- 45 WAITING library cache lock lock address= 37.02 11221.5651 11221.565 0000000085E25FC0 45 WAITING SQL*Net message from client #bytes= 1 25.45 7714.986 1 7714.986 45 WAITING enq: CF - contention 0= 0 8.05 2438.955 48 50.812 45 WAITING rdbmsipc reply = 0 7.81 2365.847 9 262.872 See: https://github.com/tanelpoder/tpt-oracle/blob/master/waitprof.sql

Majority of time (~10 sec) we were waiting on a “library cache lock” • lock address is “0000000085E25FC0” • We can omit “SQL*Net message from client” as it is only our measurement method artifact.

Detection of “lock object” x$kgllk – potentially stands for “kernel global lock” and lists all library object locks (held and requested) • SELECT • lob.kglobtyp, • lob.kglnaobjobject_name, • lob.kglhdadr, • lob.kglhdnsd, • lob.kglobtyd, • lk.kgllkmodlock_mode_held, • lk.kgllkreqlock_mode_req • FROM • sys.x$kgloblob, • sys.x$kgllklk, • sys.v$sessionses • WHERE ses.sid = YOU_SID_HERE • and lob.kglhdadr = lk.kgllkhdl • and lk.kgllkreq != 0 • and lk.kgllkuse = ses.saddr x$kglob – potentially stands for “kernel global object” and lists all library cache objects

Exploring Oracle PDB Snapshot Clone: Low-Level Details and Use Case at Deutsche Bank