Cisco CallManager Database Replication

Vajrender (Sunny) Akkera
Richardson, Texas CUCM Team

Agenda
• CallManager Database Architecture
• DB Replication Flow Diagram
• What could possibly break DB replication
• How to verify if DB Replication is broken
• Troubleshooting Database Replication issues
• Replication Logs
• Closing

DB Architecture : Install/Upgrade
In 5.0 and 5.1
• The publisher upgrade migrates data to the new version prior to reboot.
• The subscriber starts replication setup after it is upgraded and rebooted.
• Replication setup pushes data from the publisher to the subscriber. The subscriber’s local database is ready for failover only after replication is complete.

DB Architecture : Install/Upgrade
In 6.X +
• The publisher upgrade migrates data and performs an ontape (Informix utility) backup prior to reboot to the new version.
• The subscriber upgrade gets the publisher ontape backup via SFTP and restores that data to the subscriber. (This brings the subscriber’s data close in content to the publisher’s, which is imperative for services that read data locally.) The subscriber starts replication setup after the upgrade and reboot.
• Replication setup audits the data and pushes differences between the publisher and subscriber to the subscriber. Change notification is sent to the local services for each change. The local database is ready before replication is complete. The replication setup timeout can be set from the CLI, for example “utils dbreplication setrepltimeout 900” (15 minutes).
• User Facing Features (listed on a later slide) are backed up locally on all servers prior to upgrade and reboot, and restored after reboot, so that any changes made by users during the upgrade are not lost.

DB Architecture CallManager 5.X

DB Architecture CallManager 6.X

User Facing Features (UFF)
This data can be written into the local DB:
• Call Forward All (CFA)
• Message Waiting Indication (MWI)
• Privacy Enable/Disable
• Do Not Disturb Enable/Disable (DND)
• Extension Mobility Login (EM)
• Monitor (for future use, currently no updates at the user level)
• Hunt Group Logout
• Device Mobility
• CTI CAPF status for end users and application users
• Credential hacking and authentication

DB Architecture: Replication 6.X
• Replication is now fully meshed. A change on any server gets propagated to every other server.
• Only UFF data is writeable on a subscriber, so that is the only data that will replicate from a subscriber.
• Logically, most data is still hub-and-spoke from a replication perspective, since most data is still only updateable on the publisher.
• Replication queues on the subscriber are now used.
• Perfmon counters for replication are now used on subscribers.
• Replication now impacts data availability and change notification.

DB Architecture: Replication 5.X

DB Architecture: Replication 6.X

DB Replication Flow Diagram

Steps to DB Replication
These steps are done automatically by the replication scripts when the system is installed. When we run “utils dbreplication reset all”, these steps are done again.
1. Define the publisher. This sets it up to start replicating.
2. Define the template on the publisher and realize it. This tells the publisher which tables to replicate.
3. Define each subscriber.
4. Realize the template on each subscriber. This tells the subscribers which tables they will get/send data for.
5. Synchronize the data using cdr sync.
When we look at the log files, we see output from steps 3, 4, and 5. Each subscriber defines itself, but the realize and sync steps show up in the ‘dbl_repl_output_Broadcast_*.log’ file. There may be one subscriber, or many, in the "batch".

What could possibly break Replication
• Connectivity between nodes
• Host files mismatch
• Communication on UDP port 8500 is not in phase 2
• DNS not configured properly (forward/reverse lookup)
• NTP not reachable
• ‘A Cisco DB’ and ‘A Cisco DB Replicator’ not reachable
• Dbmon hung/stopped

DB Replication Troubleshooting
• How do we tell if replication is broken
• Commands to diagnose and fix replication
• If you can’t fix it, which trace files to collect

How to tell if Replication is broken?
• Replication failure alert
• Replication state counter not in a good state (can be watched proactively)
• CLI for replication status shows suspect tables or missing servers
• CM Database Status Report under Unified Reporting

How to tell if Replication is broken?
What the replication state counter means:
0 = Initialization
1 = Number of replicates is not correct (older systems)
2 = Replication is good
3 = Replication is bad
4 = Replication setup did not succeed (this meaning is for 5.1.3 and all 6.X versions)
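The counter values above can be turned into a small lookup for monitoring scripts. This is a minimal sketch: the meanings are copied from the slide, while the names `REPLICATE_STATE`, `describe_state`, and `is_healthy` are illustrative and not part of any Cisco tool.

```python
# Meanings copied from the slide; the names below are illustrative only.
REPLICATE_STATE = {
    0: "Initialization",
    1: "Number of replicates is not correct",
    2: "Replication is good",
    3: "Replication is bad",
    4: "Replication setup did not succeed",
}


def describe_state(value: int) -> str:
    """Return the slide's description for a Replicate_State value."""
    return REPLICATE_STATE.get(value, f"Unknown state {value}")


def is_healthy(value: int) -> bool:
    """Only state 2 is described as good on the slide."""
    return value == 2


if __name__ == "__main__":
    for state in range(5):
        print(state, describe_state(state), "healthy" if is_healthy(state) else "check")
```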


How to tell if Replication is broken?
show perf query class "Number of Replicates Created and State of Replication"
admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :
 - Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount -> Number of Replicates Created = 348
    ReplicateCount -> Replicate_State = 2
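If the CLI output is captured to text, the counter values can be pulled out with a short parser. The sketch below assumes the output keeps the "-> name = value" layout shown above; `parse_perf_counters` is a hypothetical helper, not a Cisco utility.

```python
import re

# Sample text copied from the CLI output above.
SAMPLE = """\
==>query class :
 - Perf class (Number of Replicates Created and State of Replication) has instances and values:
    ReplicateCount -> Number of Replicates Created = 348
    ReplicateCount -> Replicate_State = 2
"""

# Assumption: every counter line has the form "<instance> -> <name> = <integer>".
COUNTER = re.compile(r"->\s*(.+?)\s*=\s*(\d+)\s*$", re.MULTILINE)


def parse_perf_counters(text: str) -> dict[str, int]:
    """Map each counter name to its integer value."""
    return {name: int(value) for name, value in COUNTER.findall(text)}


if __name__ == "__main__":
    counters = parse_perf_counters(SAMPLE)
    print(counters["Replicate_State"])
```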


Troubleshooting Steps
• Verify Connectivity
• Verify Host Files
• Connectivity on UDP port 8500
• Verify NTP reachability and Network Validity
• DB Replication Commands

Troubleshooting : Verify Connectivity
utils network connectivity
This command can take up to 3 minutes to complete. Continue (y/n)?y
Running test, please wait ...
.
Network connectivity test with the publisher completed successfully.
Note: This command can be run only on the subscribers.
utils network host <hostname/ipaddress>
• Verifies DNS resolution.
utils network ping <hostname/ipaddress>
• Helps verify connectivity between nodes.

Troubleshooting : Verify Host Files
• /etc/hosts
• /etc/services
• /home/informix/.rhosts
• /usr/local/cm/db/informix/etc/sqlhosts

Troubleshooting : Verify Host Files
admin:show tech network hosts
-------------------- show platform network --------------------
/etc/hosts File:
#This file was generated by the /etc/hosts cluster manager.
#It is automatically updated as nodes are added, changed, removed from the cluster.
127.0.0.1 localhost
14.128.62.3 CM613
14.128.62.6 CM613SUB

Troubleshooting : Verify Host Files
admin:show tech dbstateinfo
Database State Info
Output is in cm/trace/dbl/showtechdbstateinfo20593.out
admin:file view activelog cm/trace/dbl/showtechdbstateinfo20593.out
(Hit ‘e’ to go to the end of the file)
#SQL Hosts:
g_hdr group - - i=1
g_cm613_ccm6_1_3_1000_16 group - - i=2
cm613_ccm6_1_3_1000_16 onsoctcp CM613 cm613_ccm6_1_3_1000_16 g=g_cm613_ccm6_1_3_1000_16 b=32767
g_cm613sub_ccm6_1_3_1000_16 group - - i=3
cm613sub_ccm6_1_3_1000_16 onsoctcp CM613SUB cm613sub_ccm6_1_3_1000_16 g=g_cm613sub_ccm6_1_3_1000_16 b=32767
# .rhosts:
localhost
CM613
CM613SUB
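A host file mismatch can be spotted by checking that every hostname used in sqlhosts and .rhosts also appears in /etc/hosts. The sketch below works on text copied from the outputs above; the function names are illustrative, and it assumes server entries in sqlhosts use the `onsoctcp` protocol with the hostname in the third column, as in the sample.

```python
# Sample text copied from the CLI outputs above.
HOSTS = """\
127.0.0.1 localhost
14.128.62.3 CM613
14.128.62.6 CM613SUB
"""
SQLHOSTS = """\
g_hdr group - - i=1
g_cm613_ccm6_1_3_1000_16 group - - i=2
cm613_ccm6_1_3_1000_16 onsoctcp CM613 cm613_ccm6_1_3_1000_16 g=g_cm613_ccm6_1_3_1000_16 b=32767
g_cm613sub_ccm6_1_3_1000_16 group - - i=3
cm613sub_ccm6_1_3_1000_16 onsoctcp CM613SUB cm613sub_ccm6_1_3_1000_16 g=g_cm613sub_ccm6_1_3_1000_16 b=32767
"""
RHOSTS = "localhost\nCM613\nCM613SUB\n"


def hosts_file_names(text: str) -> set[str]:
    """Hostnames listed after the IP address on each /etc/hosts line."""
    names: set[str] = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        names.update(fields[1:])
    return names


def sqlhosts_names(text: str) -> set[str]:
    """Hostnames of server entries (assumed: protocol column is onsoctcp)."""
    names: set[str] = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "onsoctcp":
            names.add(fields[2])
    return names


def rhosts_names(text: str) -> set[str]:
    """One hostname per non-empty, non-comment line."""
    return {ln.strip() for ln in text.splitlines() if ln.strip() and not ln.startswith("#")}


def unresolved(hosts: str, sqlhosts: str, rhosts: str) -> set[str]:
    """Names used by the database files but missing from /etc/hosts."""
    return (sqlhosts_names(sqlhosts) | rhosts_names(rhosts)) - hosts_file_names(hosts)


if __name__ == "__main__":
    print(unresolved(HOSTS, SQLHOSTS, RHOSTS) or "host files are consistent")
```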


Troubleshooting : Data Access Failure (IPSec)
admin:utils firewall list
…
ACCEPT tcp -- CM613SUB anywhere tcp dpt:cm613_ccm6_1_3_1000_16
ACCEPT udp -- CM613SUB anywhere udp dpt:1500
ACCEPT tcp -- CM613SUB anywhere tcp dpt:1501
ACCEPT udp -- CM613SUB anywhere udp dpt:1501
…
• The example above is from a pub (CM613) where CM613SUB is the sub. The sub should have similar entries for the pub. If it does not, the cause is probably a network issue.
• Check the ipsec tunnel status from the CLI. Ensure all servers in the cluster have good status (TCP and ACCEPT on port 1500, with each entry named by server). Otherwise, check the Cluster Manager logs:
  - file list activelog platform/log/clustermgr*
  - file view activelog platform/log/clustermgr00000002.log
Example:
06/14/2010 23:22:03.009 clm|HMAC_SHA1 match failed IP(14.128.62.6)| (Failed)
03/25/2010 06:52:39.864 clm|hostname: CM613SUB state POLICY_INJECTED| (Success)

Troubleshooting : Data Access Failure (IPSec)
// Cluster Manager Log (file list activelog platform/log/clustermgr*)
03/25/2010 06:52:24.547 clm|exec'ing: /root/.security/drf/setdrfdetails.sh
03/25/2010 06:52:24.636 clm|Binding to /usr/local/platform/conf/clm/unix_socket
03/25/2010 06:52:24.636 clm|creating 2 state machines
03/25/2010 06:52:24.637 clm|succeeded to create sm for: CM613SUB
03/25/2010 06:52:24.637 clm|exec'ing: sudo /root/.security/ipsec/disable_ipsec.sh --desthostName=CM613SUB --op=delete
03/25/2010 06:52:26.215 clm|hostname: CM613SUB state INITIATOR|
03/25/2010 06:52:26.356 clm|exec'ing: /etc/init.d/iptables start
03/25/2010 06:52:27.340 clm|ignoring initiation from other side peer hostname(CM613SUB)
03/25/2010 06:52:33.804 clm|exec'ing: /etc/init.d/iptables start
03/25/2010 06:52:35.750 clm|for initator(CM613SUB): entering the policy injected state
03/25/2010 06:52:39.864 clm|hostname: CM613SUB state POLICY_INJECTED
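When a Cluster Manager log is long, it can help to reduce it to the last reported state per peer and any HMAC failures. The following is a rough sketch based only on the log lines shown in these slides; the helper names are hypothetical, and the patterns assume the "clm|hostname: <name> state <STATE>" and "match failed IP(<address>)" layouts seen above.

```python
import re

# Log lines copied from the slides (two separate excerpts combined for illustration).
LOG = """\
03/25/2010 06:52:26.215 clm|hostname: CM613SUB state INITIATOR|
03/25/2010 06:52:35.750 clm|for initator(CM613SUB): entering the policy injected state
03/25/2010 06:52:39.864 clm|hostname: CM613SUB state POLICY_INJECTED
06/14/2010 23:22:03.009 clm|HMAC_SHA1 match failed IP(14.128.62.6)|
"""

STATE = re.compile(r"clm\|hostname: (\S+) state (\w+)")
HMAC_FAILURE = re.compile(r"match failed IP\(([\d.]+)\)")


def last_states(log: str) -> dict[str, str]:
    """Return the most recent state logged for each peer hostname."""
    states: dict[str, str] = {}
    for host, state in STATE.findall(log):
        states[host] = state  # later lines overwrite earlier ones
    return states


def hmac_failures(log: str) -> list[str]:
    """Return the IP addresses named in 'match failed' lines."""
    return HMAC_FAILURE.findall(log)


if __name__ == "__main__":
    print(last_states(LOG))
    print(hmac_failures(LOG))
```

A peer whose last state is POLICY_INJECTED matches the success example on the slide; any address returned by `hmac_failures` matches the failure example.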

Troubleshooting : Data Access Failure (IPSec)
admin:utils network capture port 8500
Executing command with options:
size=128 count=1000 interface=eth0 src= dest= port=8500 ip=
22:09:10.479943 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:10.481232 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:15.474954 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:15.475677 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)
• Verify the communication is in phase 2 in both directions (pub->sub, sub->pub). If you have multiple nodes in the cluster, all the nodes must be in ‘phase 2’ with every other node in the cluster.
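The "phase 2 in both directions" check can be automated on saved capture text. This is a minimal sketch that assumes the capture lines keep the format shown above; `phase2_pairs` and `is_bidirectional` are illustrative names.

```python
import re

# Capture lines copied from the slide.
CAPTURE = """\
22:09:10.479943 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:10.481232 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:15.474954 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:15.475677 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)
"""

PACKET = re.compile(r"(\S+)\.8500 > (\S+)\.8500: isakmp: phase (\d)")


def phase2_pairs(capture: str) -> set[tuple[str, str]]:
    """Return (source, destination) pairs seen with isakmp phase 2 traffic."""
    return {(src, dst) for src, dst, phase in PACKET.findall(capture) if phase == "2"}


def is_bidirectional(pairs: set[tuple[str, str]], a: str, b: str) -> bool:
    """True when phase 2 traffic was seen from a to b and from b to a."""
    return (a, b) in pairs and (b, a) in pairs


if __name__ == "__main__":
    pairs = phase2_pairs(CAPTURE)
    print(is_bidirectional(pairs, "CM613", "CM613SUB"))
```

For a multi-node cluster, the same check would be repeated for every pair of nodes.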

Troubleshooting : Verify NTP reachability and Network Validity
admin:utils diagnose test
Log file: /var/log/active/platform/log/diag4.log
Starting diagnostic test(s)
===========================
test - disk_space : Passed (available: 849 MB, used: 4998 MB)
skip - disk_files : This module must be run directly and off hours
test - service_manager : Passed
test - tomcat : Passed
test - validate_network : Passed
test - system_info : Passed (Collected system information in diagnostic log)
test - ntp_reachability : Passed
test - ntp_clock_drift : Passed
test - ntp_stratum : Passed
Diagnostics Completed
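To pick out anything that did not pass in a saved copy of this output, a short filter is enough. The sketch below assumes each result line has the form "test - <name> : <result>" as shown above; the failing line in the usage example is hypothetical, since the slide only shows passing results.

```python
import re

RESULT = re.compile(r"^(test|skip) - (\S+)\s*: (.*)$")


def not_passed(output: str) -> list[str]:
    """Return names of tests that ran but did not report 'Passed'."""
    names = []
    for line in output.splitlines():
        match = RESULT.match(line.strip())
        if match and match.group(1) == "test" and not match.group(3).startswith("Passed"):
            names.append(match.group(2))
    return names


if __name__ == "__main__":
    sample = (
        "test - disk_space : Passed (available: 849 MB, used: 4998 MB)\n"
        "skip - disk_files : This module must be run directly and off hours\n"
        "test - ntp_reachability : Failed\n"  # hypothetical failure line for illustration
        "test - ntp_stratum : Passed\n"
    )
    print(not_passed(sample))
```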


DB Replication Commands
utils dbreplication status
• This command displays the status of database replication by comparing the database content of the subscribers to the publisher. It indicates whether the servers in the cluster are connected and whether the data is in sync.
• This command can be run on all nodes of a cluster.
utils dbreplication stop
• This command stops replication setup on the local server.
• Run it on the respective nodes before running ‘repair’ or ‘reset’.

DB Replication Commands
utils dbreplication repair
• This command repairs database replication.
• Run it when “utils dbreplication status” shows the servers as connected but a few tables out of sync.
Syntax: utils dbreplication repair {all | hostname}
utils dbreplication reset
• This command resets and restarts database replication.
• It can be used to tear down and rebuild replication when the system has not set up properly.
• Ensure no cdr process is running by using the show process search cdr command.
Syntax: utils dbreplication reset {all | hostname}

DB Replication Commands
utils dbreplication setrepltimeout
Syntax: utils dbreplication setrepltimeout timeout
timeout - The new database replication timeout, in seconds. The value range is 300 to 3600.
• The default database replication timeout is 5 minutes (value of 300).
• When the first subscriber requests replication with the pub, this timer is set.
• When the timer expires, the first sub plus other subs that requested replication within that time period begin data replication with the pub in a "batch".
• For large clusters, you can use the command to increase the default timeout value, so more subs are included in the batch.
• This timer should be set on the publisher after the publisher has been upgraded and booted up on the upgraded partition, but before the first sub has been switched over to the new release. Then, when the first sub requests replication, the pub sets the timer based on this new value.
Note: It is recommended that you restore this value to the default of 300 (5 minutes) once the entire cluster is upgraded successfully and the subs have successfully set up replication.
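When the timeout is set from a script or runbook generator, the documented range can be checked before the command is issued. This is a small sketch; the function name is illustrative, and it only builds the command string from the syntax given above.

```python
# Range and default taken from the slide above.
MIN_TIMEOUT = 300
MAX_TIMEOUT = 3600
DEFAULT_TIMEOUT = 300


def setrepltimeout_command(seconds: int) -> str:
    """Build the CLI command, rejecting values outside 300..3600 seconds."""
    if not MIN_TIMEOUT <= seconds <= MAX_TIMEOUT:
        raise ValueError(
            f"timeout must be between {MIN_TIMEOUT} and {MAX_TIMEOUT} seconds, got {seconds}"
        )
    return f"utils dbreplication setrepltimeout {seconds}"


if __name__ == "__main__":
    print(setrepltimeout_command(900))              # 15 minutes, for a large cluster
    print(setrepltimeout_command(DEFAULT_TIMEOUT))  # restore the default afterwards
```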

DB Replication Commands
utils dbreplication runtimestate
• This command helps to make sure the publisher is able to communicate with the DBLRPC service (also known as the Database Replicator) on all the subscribers. Verify the RPC column.
• Typically run before running the ‘reset’ command.
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: Replication status command started at: 2010-05-13-15-53
Replication status command COMPLETED 427 tables checked out of 427
No Errors or Mismatches found.
DB Version: ccm7_1_3_10000_11
Number of replicated tables: 427
Cluster Detailed View from PUB (2 Servers):
                            PING          REPLICATION  REPL.  DBver&   REPL.  REPLICATION SETUP
SERVER-NAME  IP ADDRESS     (msec)  RPC?  STATUS       QUEUE  TABLES   LOOP?  (RTMT) & details
-----------  ------------   ------  ----  -----------  -----  -------  -----  -----------------
Publisher    14.128.62.72   0.063   Yes   Connected    0      match    N/A    (3) PUB Setup Completed
subscriber   14.128.62.73   0.384   Yes   Connected    0      match    N/A    (3) Setup Completed

DB Replication Commands
utils dbreplication clusterreset
• This command can be used to debug database replication, but should only be used if "utils dbreplication reset all" has previously been tried and has failed to restart replication on the cluster.
• This command tears down and rebuilds replication for the entire cluster.
• After using this command, each sub needs to be rebooted.
• Once the subs have been rebooted, you must go to the pub and issue the CLI command "utils dbreplication reset all".
Syntax: utils dbreplication clusterreset
utils dbreplication dropadmindb
• This command drops the Informix syscdr database on any server in the cluster.
• You should run this command only if database replication reset or cluster reset fails and replication cannot be restarted.
Syntax: utils dbreplication dropadmindb

DB Replication Command : Example
utils dbreplication status
• Good Status
• Check the output to be sure each server is connected and no tables are suspect.
• The status should list all the subscribers as connected at the top of the file.
SERVER           ID STATE  STATUS    QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm   2 Active Local         0
g_bldr_ccm5_ccm   3 Active Connected     0 Sep 6 16:27:15

DB Replication Command : Example
• Bad Status – Servers out of Sync
• If the RTMT counter value for replication state is 2 or 3 for all nodes of the cluster, then replication is set up.
• A replication state of 3 means that a few tables are out of sync.
• Run ‘utils dbreplication repair’ to clear this issue (see the repair command under DB Replication Commands).
SERVER           ID STATE  STATUS    QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm   2 Active Local         0
g_bldr_ccm5_ccm   3 Active Connected     0 Sep 6 16:27:15
---------- Suspect Replication Summary ----------
For table: ccmdbtemplate_bldr_ccm4_ccm_1_27_processnode replication is suspect for node(s): g_bldr_ccm5_ccm
For table: ccmdbtemplate_bldr_ccm4_ccm_1_34_replicationdynamic replication is suspect for node(s): g_bldr_ccm5_ccm
-------------------------------------------------

DB Replication Command : Example
• Bad Status – Replication not set up properly
• One or more servers show a "Quiescent" or "Dropped" status.
• This would typically show a replicate state of 0 or 4.
• Run ‘utils dbreplication reset’ to clear this issue.
SERVER           ID STATE  STATUS    QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm   2 Active Local         0
g_bldr_ccm5_ccm   3 Active Dropped     636 Sep 10 14:01:20
Possible causes:
• A communications problem or network error (publisher and subscriber cannot talk).
• IPSec is broken for the node.
• One or more ports required by the database are not open on the firewall.
• Host files not set up properly.
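The three status examples (good, out of sync, not set up) map to a simple decision: Dropped or Quiescent servers call for a reset, suspect tables call for a repair. The sketch below encodes that rule for saved status output. It assumes one server per line and group names that start with `g_`, as in the samples; the function names are illustrative.

```python
def parse_servers(text: str) -> list[dict]:
    """Extract server name, status, and queue from the server table rows."""
    servers = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: data rows look like "g_<name> <id> <state> <status> <queue> ...".
        if len(fields) >= 5 and fields[0].startswith("g_") and fields[1].isdigit():
            servers.append({"server": fields[0], "status": fields[3], "queue": int(fields[4])})
    return servers


def suggest_action(text: str) -> str:
    """Apply the rule of thumb from the slides to a status output."""
    statuses = {row["status"] for row in parse_servers(text)}
    if statuses & {"Dropped", "Quiescent"}:
        return "utils dbreplication reset"
    if "is suspect for node(s)" in text:
        return "utils dbreplication repair"
    return "no action"


if __name__ == "__main__":
    dropped = (
        "g_bldr_ccm4_ccm 2 Active Local 0\n"
        "g_bldr_ccm5_ccm 3 Active Dropped 636 Sep 10 14:01:20\n"
    )
    print(suggest_action(dropped))
```

This only suggests a starting point; the possible causes listed above (network, IPSec, firewall, host files) still need to be checked before a reset.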

Replication Logs
From the Publisher
• file get activelog cm/log/informix/dbl_repl*.log
• file get activelog cm/trace/dbl/dbl_repl*.log
• file get activelog cm/log/informix/ccm.log*
• file get activelog cm/trace/dbl/sdi/dbmon*.txt
From the Subscribers
• file get activelog cm/log/informix/ccm.log*
• file get activelog cm/trace/dbl/sdi/dbmon*.txt
Download the following unified reports
• Database Status
• Cluster Overview
• Replication Debug

Replication Logs
admin:file list activelog /cm/trace/dbl date det
15 Jun,2010 10:45:17 <dir> dblj
15 Jun,2010 10:45:17 <dir> ncsj
15 Jun,2010 10:45:17 <dir> sdi
19 Nov,2009 18:53:44 1,847 dbl_repl_cdr_define_subscriber_ccm7_1_3_10000_11-2009_11_19_18_53_21.log
19 Nov,2009 18:59:57 299,786 dbl_repl_cdr_Broadcast_2009_11_19_18_58_44.log
19 Nov,2009 18:59:57 1,261 dbl_repl_output_Broadcast_2009_11_19_18_58_44.log

Replication Logs : Sample Define
# cat dbl_repl_cdr_define_nw104a_196-2007_09_24_16_43_13.log
passed dbname [ccm6_1_0_9901_391]
dbname passed[ccm6_1_0_9901_391] local_dbname [ccm6_1_0_9901_391]
-------Inside deleteQuiescent-------
subscriber name: g_nw104a_196_ccm
sucmd to execute [su -c 'cdr list serv > /tmp/cdr_list_serv_local_quiescent' - informix]
-------Exiting deleteQuiescent-------
sucmd_err [su -c 'ulimit -c 0;cdr err --zap' - informix ]
Executing [su -c 'ulimit -c 0;cdr define server --connect=nw104a_196_ccm --idle=0 --init --sync=g_nw104a_212_ccm g_nw104a_196_ccm --ats=/var/log/active/cm/log/informix/ats --ris=/var/log/active/cm/log/informix/ris;' - informix]
After Executing [su -c 'ulimit -c 0;cdr define server --connect=nw104a_196_ccm --idle=0 --init --sync=g_nw104a_212_ccm g_nw104a_196_ccm --ats=/var/log/active/cm/log/informix/ats --ris=/var/log/active/cm/log/informix/ris;' - informix]
---------------START-------------
-------Inside getServCountonpublisher-------
sucmd to execute [su -c 'cdr list serv > /tmp/cdr_list_serv_local' - informix]
---Inside-------locateFailure servcount_on_publisher is [1] sleeptime is[10]
SERVER            ID STATE  STATUS    QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_nw104a_196_ccm  17 Active Local         0
g_nw104a_212_ccm   2 Active Connected     0 Sep 24 16:43:20
Count on node [g_nw104a_196_ccm] is [1] count_on_publisher [1]
-------LocateFailure-------Returning-------
-------servcount_on_publisher is [1]-------
-------END-------------
sucmd [su -c 'ulimit -c 0;cdr err -a' - informix >> /usr/local/cm/db/cdr_err_define.out 2>&1]
size of cdr_err.out is [64]

Replication Logs : Sample Define
In the above dbl_repl_cdr_define_nw104a_196-2007_09_24_16_43_13.log output:
• Servers show Local or Connected, which is good.
• The size of cdr_err.out is [64], which is good.

Replication : Sample dbl_repl_output_broadcast
[root@nw104a-212 dbl]# cat dbl_repl_output_Broadcast_2007_09_24_16_59_57.log
sucmd [su -c 'ulimit -c 0;cdr err --zap' - informix >> /var/log/active/cm/trace/dbl/dbl_repl_cdr_Broadcast_2007_09_24_16_59_57.log 2>&1]
Starting Broadcast RT...(g_nw104a_196_ccm g_nw104a_198_ccm g_nw104a_199_ccm g_nw104a_201_ccm g_nw104a_202_ccm g_nw104a_200_ccm g_nw104a_203_ccm g_nw104a_205_ccm g_nw104a_206_ccm g_nw104a_194_ccm g_nw104a_208_ccm g_nw104a_209_ccm )
sucmd [su -c 'ulimit -c 0;cdr realize template ccmdbtemplate g_nw104a_196_ccm g_nw104a_198_ccm g_nw104a_199_ccm g_nw104a_201_ccm g_nw104a_202_ccm g_nw104a_200_ccm g_nw104a_203_ccm g_nw104a_205_ccm g_nw104a_206_ccm g_nw104a_194_ccm g_nw104a_208_ccm g_nw104a_209_ccm ' - informix >> /var/log/active/cm/trace/dbl/dbl_repl_cdr_Broadcast_2007_09_24_16_59_57.log 2>&1]
realizeclockstart [1190671197.81]
Time taken to do realize template [116.477200985]
cmd[rm -f /usr/local/cm/db/cdr_err_realize.out]
sucmd [su -c 'ulimit -c 0;cdr err -a' - informix >> /usr/local/cm/db/cdr_err_realize.out 2>&1]
size of cdr_err.out is [64]
Before cdr check
sucmd [su -c 'ulimit -c 0;cdr err --zap' - informix >> /var/log/active/cm/trace/dbl/dbl_repl_cdr_Broadcast_2007_09_24_16_59_57.log 2>&1]
sucmd [su -c 'ulimit -c 0; cdr check replicateset -m g_nw104a_212_ccm -s ccmdbtemplate -e delete -R g_nw104a_196_ccm g_nw104a_198_ccm g_nw104a_199_ccm g_nw104a_201_ccm g_nw104a_202_ccm g_nw104a_200_ccm g_nw104a_203_ccm g_nw104a_205_ccm g_nw104a_206_ccm g_nw104a_194_ccm g_nw104a_208_ccm g_nw104a_209_ccm --firetrigger=follow' - informix >> /var/log/active/cm/trace/dbl/dbl_repl_cdr_Broadcast_2007_09_24_16_59_57.log 2>&1]
Time taken to do cdr check[2038.29240179]
cmd[rm -f /usr/local/cm/db/cdr_check.out]
sucmd [su -c 'ulimit -c 0;cdr err -a' - informix >> /usr/local/cm/db/cdr_check.out 2>&1]
size of cdr_check.out is [64]

Replication : Sample dbl_repl_output_broadcast
In the above output file, you need to look for:
• A successful realize
• A successful sync or check
• size of cdr_check.out is [64], which is good
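The "size of ... is [64]" check can be scripted against a downloaded log. This is a minimal sketch: it relies only on the slides' statement that a size of 64 is good, and the helper names are illustrative.

```python
import re

SIZE_LINE = re.compile(r"size of (\S+) is \[(\d+)\]")
GOOD_SIZE = 64  # per the slides, a size of 64 indicates no errors were recorded


def error_file_sizes(log: str) -> list[tuple[str, int]]:
    """Return (file name, size) for every 'size of ... is [N]' entry in the log."""
    return [(name, int(size)) for name, size in SIZE_LINE.findall(log)]


def looks_clean(log: str) -> bool:
    """True when at least one size entry exists and all of them equal 64."""
    sizes = error_file_sizes(log)
    return bool(sizes) and all(size == GOOD_SIZE for _, size in sizes)


if __name__ == "__main__":
    excerpt = (
        "Time taken to do realize template [116.477200985]\n"
        "size of cdr_err.out is [64]\n"
        "Time taken to do cdr check[2038.29240179]\n"
        "size of cdr_check.out is [64]\n"
    )
    print(error_file_sizes(excerpt), looks_clean(excerpt))
```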

Q and A
