
Backup And Recovery Requirements

Backup And Recovery Requirements. Routine backups must have minimal impact on the development environment. VOBs must be locked for a minimal amount of time during backup. Routine backups must capture relevant data in a way that can be quickly and accurately recovered.


Presentation Transcript


  1. Backup And Recovery Requirements • Routine backups must have minimal impact on the development environment • VOBs must be locked for a minimal amount of time during backup • Routine backups must capture relevant data in a way that can be quickly and accurately recovered • Data validation is required prior to backing up data to tape • All relevant data must be backed up at the same time (i.e. registry, configuration, VOB storage)

  2. Backup And Recovery Requirements (Continued) • Recovery time must minimize impact on developers - a typical VOB server has 80 to 90 VOBs and 100 - 200 GB of storage (hub servers 130 – 160 VOBs) • Typical recovery scenario (takes a week): • Restore data from backup media, i.e. tape (days!) • Data validation on restored data, i.e. checkvob & dbcheck (days! 10-15 GB VOBs with 3-4 GB databases) • Sync replicas to get changes since the last backup (this alone takes about 8 - 12 hours) • Reset client machines (rebooting required?) • Minimize downtime during recovery -- it needs to be minutes/hours, not days or weeks

  3. Warm High Availability (WHA) Configuration • Aspects of the WHA Implementation: • Using SAN technology • Snapshots to minimize VOB locks • Specialized ClearCase configuration • Currently implemented only on VOB servers; view servers could be set up the same way • Now some details!

  4. WHA Configuration (Continued) • Using SAN technology • Any server can dynamically control any storage device, allowing for quick fail-over of VOB servers • Use of a “shadow” disk as the initial backup medium • Snapshots to minimize VOB locks • Limits VOB lock times to less than 2 minutes • Specialized ClearCase configuration • Allows fail-over to a new server with no required changes to the ClearCase registry and configuration • More details later!

  5. WHA Configuration (Continued) • Hardware configuration • SAN configuration • ClearCase configuration

  6. WHA Configuration (Continued) • Hardware configuration • Sun Solaris (UNIX) servers • SAN storage appliance – currently about 5 - 6 TB of ClearCase storage (VOBs and Views) in San Diego • Each VOB server has primary disk storage plus 2 “shadow images” of the VOB storage (3 copies on disk) • Large servers: 16 GB RAM, 4 CPUs, gigabit network and a 2 Gb interface to the storage device • We have implemented WHA on all our VOB servers, large and small

  7. WHA Configuration (Continued) • SAN configuration • Many-to-many connectivity between servers and storage locations • Dynamic control of storage locations • Accommodates snapshots and shadow images (where dbcheck is run) • Using 2 shadow images, taken one day apart • The older one has successfully passed dbcheck and is being (or has been) dumped to tape • The newer one is undergoing dbcheck • Always have a validated copy of all necessary data on disk for restoration

  8. WHA Configuration (Continued) • ClearCase configuration • Currently using ClearCase 4.2 • When implementing a recovery, NO ClearCase configuration changes are required (i.e. registry) • Back up ALL relevant data at the same time • VOB data and /var/adm/atria are located on the same disk location • A DNS alias is used instead of the real host name for the ClearCase license server • Use logical vs. physical VOB storage locations when registering • A DNS alias is used for VOB servers (the VOB server can change by moving the alias)

  9. WHA Configuration (Continued) • ClearCase configuration (continued) • Use logical vs. physical VOB storage locations when registering -- the path to the VOB storage must be the same, independent of host and storage location • Create links to the VOB storage, for example: • /local/mnt (this mount point always exists and is always shared) • Use links to create the logical → physical mapping; you need unique logical paths for all VOB storage within the same region:
/local/mnt/VOBSA → /net/dnsalias/local/mnt2/vobs
/local/mnt/VOBSB → /net/dnsalias/local/mnt3/vobs
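A minimal shell sketch of creating those links on the VOB server, assuming “dnsalias” stands for the server’s DNS alias exactly as in the mapping above:

# Create the logical storage roots once on the server (paths from the example above)
ln -s /net/dnsalias/local/mnt2/vobs /local/mnt/VOBSA   # VOBSA -> physical storage on mnt2
ln -s /net/dnsalias/local/mnt3/vobs /local/mnt/VOBSB   # VOBSB -> physical storage on mnt3

After a cut-over, the same commands are rerun on the new server so the registered logical paths keep resolving.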

  10. WHA Configuration (Continued) • ClearCase configuration (continued) • Once the links are created, register and tag (mkvob, mkreplica…). You must use the fully qualified method: -host <dns alias of VOB server> -hpath <the linked path, not the physical path> -gpath <the global and linked path> • Never use the real host name or real physical path!! • To switch servers: restore data, move the host alias, create the links, stop and start ClearCase • The clients and view servers must reacquire the new VOB storage mount points, so restart ClearCase or reboot the clients

  11. WHA Configuration (Continued) • ClearCase configuration (continued) • Example -- /vobs/bsc • Host name is cyclone and the VOB storage location is: /local/mnt2/vobs/bsc.vob (physical) /local/mnt/VOBS/bsc.vob (logical) • DNS alias: cyclone == edbvobA • Register and tag /vobs/bsc using the DNS alias and the logical link instead of the physical storage location: /net/edbvobA/local/mnt/VOBS/bsc.vob -vs- /net/cyclone/local/mnt2/vobs/bsc.vob
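Put together, creating and tagging a VOB with the fully qualified method from slide 10 and the /vobs/bsc values above would look roughly like this (a hedged sketch, not the exact command run at the site; mkreplica would be used instead of mkvob for a replicated VOB):

cleartool mkvob -tag /vobs/bsc \
    -host edbvobA \
    -hpath /local/mnt/VOBS/bsc.vob \
    -gpath /net/edbvobA/local/mnt/VOBS/bsc.vob \
    /local/mnt/VOBS/bsc.vob

Only the alias and the linked (logical) paths appear; cyclone and /local/mnt2 never show up in the registry, which is what lets the alias move to another host later.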

  12. WHA Configuration (Continued) • ClearCase configuration (cont) • Example of lsvob (2 VOB servers, 3 storage locations):
* /vobs/mgw/msf_erab /net/mother/local/mnt/VOBSA/mgw/msf_erab.vob public
* /vobs/mgw/msf_eedn /net/mother/local/mnt/VOBSA/mgw/msf_eedn.vob public
* /vobs/mgw/msf_etm /net/mother/local/mnt/VOBSA/mgw/msf_etm.vob public
* /vobs/cello/ose /net/mother/local/mnt/VOBSC/cello/ose.vob public
* /vobs/ewu/perl /net/stepmother/local/mnt/VOBSB/ewu/perl.vob public
* /vobs/ewu/freeware /net/stepmother/local/mnt/VOBSB/ewu/freeware.vob public
* /vobs/stre/det /net/stepmother/local/mnt/VOBSB/stre/det.vob public

  13. WHA Configuration (Continued) • ClearCase configuration (continued) • A DNS alias is used for VOB servers (the VOB server can change by moving the alias) • The registered path and host are always the same no matter which physical host is the VOB server! • Always use the alias, for MultiSite as well. Machines can come and go, but the VOB server host name is always the same • There is both a Rational and a Sun white paper documenting this configuration and setup: http://www.rational.com/media/partners/sun/Ericsson_final.pdf

  14. Backup Process • All setup is completed and WHA is implemented • Lock VOBs (less than 2 minutes) • We use Sun Instant Image™ to snapshot the VOB storage partition • Both the VOB storage and /var/adm/atria are located here (we also have trigger scripts and …) • The snapshot goes to shadow1 • another disk partition; it could be a totally different disk • Shadow2 passed data validation with “dbcheck” yesterday and is being dumped to tape
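A minimal sketch of the lock/snapshot/unlock sequence. The loop over lsvob -s and the placeholder snapshot_to_shadow1 are assumptions: a real script would lock only the VOBs served by this host and call the site's Sun Instant Image command.

#!/bin/sh
# Lock every VOB so the databases are quiescent during the snapshot
for VOB in `/usr/atria/bin/cleartool lsvob -s`
do
    /usr/atria/bin/cleartool lock vob:$VOB
done
# Snapshot the partition holding the VOB storage and /var/adm/atria onto shadow1.
# snapshot_to_shadow1 is a placeholder for the Sun Instant Image invocation used on site.
snapshot_to_shadow1 /local/mnt2
# Unlock as soon as the snapshot completes -- total lock time stays under 2 minutes
for VOB in `/usr/atria/bin/cleartool lsvob -s`
do
    /usr/atria/bin/cleartool unlock vob:$VOB
done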

  15. Backup Process (Continued) • Once the backup to shadow1 is complete, “dbcheck” is started for data validation • Once data validation is successful -- and it’s a new backup day -- shadow1 becomes shadow2, shadow2 becomes shadow1, and it starts all over • If an error is found during dbcheck we take immediate corrective action – we keep the validated copy on disk (shadow2) while we check out the production data • There is ALWAYS a “good copy” on the shadow2 disk!
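The rotation logic, sketched in shell with placeholders. run_dbcheck and promote_shadow1_to_shadow2 are hypothetical names standing in for the site's dbcheck pass over each VOB database on the shadow and for the swap of the two shadow images; they are not ClearCase commands.

#!/bin/sh
# shadow1 = last night's snapshot, shadow2 = previous, already-validated snapshot
if run_dbcheck /shadow1            # placeholder: dbcheck every VOB database found under /shadow1
then
    # Validation passed: shadow1 becomes the new "known good" shadow2
    # and the old shadow2 is reused for the next snapshot.
    promote_shadow1_to_shadow2     # placeholder for the site's relabel/remount step
else
    # Keep shadow2 untouched as the good copy and investigate the production VOBs.
    echo "dbcheck failed on shadow1 -- keeping shadow2, checking production data"
fi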

  16. Recovery Process • Typical recovery scenario: • Get another server or fix the broken one – you have to give it the same server hostname or change the ClearCase registry information! • Restore data from backup tape (100 - 200 GB, 2+ days) • Do data validation, checkvob and dbcheck (2+ days) • Restore the replica (MultiSite users) for 80+ VOBs; this takes at least 8 – 12 hours • Clean up clients – typically a crash means NFS/MVFS is messed up – REBOOT! • Is that it? I wish it were! Developers can’t work! • WHA recovery scenario?

  17. Recovery Process (Continued) • WHA recovery scenario - get another server or fix the broken one • ANY server can act as the new VOB server. Of course, using an existing VOB/View server would degrade performance • Get VOBs on-line and back in service as fast as possible; WHA means I can “cut over” to another server again later!

  18. Recovery Process (Continued) • WHA recovery scenario - Get another server or fix broken one (cont) • STEPS (same for any WHA cut-over): • Move the DNS alias to new server • create the links (links for /var/adm/atria and VOB physical storage locations from /local/mnt/VOBS?) • Since /var/adm/atria was backed up with the VOB storage, they are in sync • Just turn ClearCase off/on and – NEW VOB SERVER!
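A sketch of the cut-over commands on the new server, assuming the production (or shadow) storage partition is already reachable over the SAN. The device name, the /var/adm/atria location on the partition, and the alias edbvobA are illustrative; the DNS alias itself is moved in the name service, which is not shown here.

#!/bin/sh
# 1. The DNS alias (e.g. edbvobA) has already been pointed at this host.
# 2. Mount the partition holding the VOB storage and the /var/adm/atria copy.
mount /dev/dsk/c1t0d0s6 /local/mnt2              # device name is illustrative
# 3. Recreate the logical -> physical links the VOBs were registered with.
ln -s /net/edbvobA/local/mnt2/vobs /local/mnt/VOBSA
ln -s /local/mnt2/var_adm_atria /var/adm/atria   # location of the registry copy is site-specific
# 4. Restart ClearCase -- this host is now the VOB server.
/etc/init.d/atria stop
/etc/init.d/atria start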

  19. Recovery Process (Continued) • WHA recovery scenario - Restore data from backup tape, 100 - 200 GB • Not 2+ days • We don’t go to tape unless we’ve had a real disaster! • We don’t do a “restore”; we have 2 copies on disk! • Use shadow1 if data validation is complete or the confidence level is high – shadow2 is only 24-48 hrs old • Mount the shadow disk on the new VOB server (the SAN makes this easy)

  20. Recovery Process (Continued) • WHA recovery scenario - Restore data from backup tape (cont) • Create the links to the VOB physical storage location • Much faster than transferring 100 – 200 GB of data from tape! • 15 minutes MAX!

  21. Recovery Process (Continued) • WHA recovery scenario - Do data validation, checkvob and dbcheck • Not 2+ days • This takes a “very” long time (100-200 GB of VOBs, some with 4-6 GB databases) • Checkvob and dbcheck are run on all servers monthly • Daily successful dbcheck runs on the shadow disk give high confidence

  22. Recovery Process (Continued) • WHA recovery scenario - Do data validation, checkvob and dbcheck (cont) • If shadow1 has completed dbcheck, use it; if not, use shadow2 • NO time is spent on data validation during recovery because it was done during the backup phase! • We would like checkvob and other data validation utilities that can be run on off-line VOBs!

  23. Recovery Process (Continued) • WHA recovery scenario - Restore replica • MultiSite is heavily used, with internal syncing every 30 minutes – checked-in changes will be available in another replica since the shadow image snapshot was taken! • Get the changes made since the snapshot from other replicas • By default, restorereplica wants to sync with ALL replicas (NOT all 30-40 we have) • **CAREFUL**

  24. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • With lots of VOBs (80+), this will still take at least 8 – 12 hours even against only 2-4 replicas • Must get update packets (that have the changes since the backup) from the other replicas • See the example commands on the next slides!

  25. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Example of commands:
mt restorereplica (default requires updates from all replicas)
OR
mt restorereplica replica:ewuhub_bscng_aim replica:ewucth_bscng_aim replica:ewubo_bscng_aim
** MUST INCLUDE THE REPLICA THAT WAS THE LAST REPLICA THAT WAS EXPORTED TO JUST BEFORE THE CRASH!! – NEED TO AVOID DIVERGENCE IN THE VOB REPLICAS!
* Check via lsepoch; make sure the replica with the record of the most changes that took place in the restored replica is included! (mt lsepoch ewuhub_bscng_aim@/vobs/bscng/aim)

  26. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) **WARNINGS – POSSIBLE DIVERGENCE** ** MUST INCLUDE THE REPLICA THAT WAS THE LAST REPLICA THAT WAS EXPORTED TO JUST BEFORE THE CRASH!! – NEED TO AVOID DIVERGENCE IN THE VOB REPLICAS! • Check for the latest replica synced to: • lsepoch • lshistory

  27. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Check via lsepoch; make sure the replica with the record of the most changes that took place in the restored replica is included! • With ClearCase 4.x you can use -actual to query remote replicas

  28. Recovery Process (continued) • WHA recovery scenario - Restore replica (continued) • Check via lsepoch • EXAMPLE: the restored replica is ewucello_bscng_aim
mt lsepoch -actual ewuhub_bscng_aim@/vobs/bscng/aim
oid:834d7251.f24c11d4.a4df.00:01:80:b8:c7:b4=450831 (ewucello_bscng_aim)
mt lsepoch -actual ewucth_bscng_aim@/vobs/bscng/aim
oid:834d7251.f24c11d4.a4df.00:01:80:b8:c7:b4=450745 (ewucello_bscng_aim)

  29. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Example of commands – to find the last replica exported to. This is not trivial; you have to check each replica you have been syncing with:
Example: mt lsreplica -invob /vobs/nmis
Replicas (14): boulder_nmis, bscclassic_nmis, cbssw_nmis, edbbsc_nmis, edbbsm_nmis, edbspe_nmis, edbtetra_nmis, ewubo_nmis, ewucth_nmis, ewuhub_nmis, ewustre_nmis, ramstest_nmis, servicenet_nmis, streit2_nmis
These replicas are the only ones the restored replica syncs with: boulder_nmis, bscclassic_nmis, ewubo_nmis, ewucth_nmis, ewuhub_nmis

  30. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) Example (cont): /vobs/nmis (must do lshistory at each remote replica site!)
cleartool lshistory replica:boulder_nmis
cleartool lshistory replica:bscclassic_nmis
cleartool lshistory replica:ewubo_nmis
cleartool lshistory replica:ewucth_nmis
cleartool lshistory replica:ewuhub_nmis
Example results:
12-Jun.15:55 root import sync from replica "bscclassic_nmis" to replica "ewuhub_nmis"
• Review the output of the above commands to see which was the last replica to be sent an export sync packet

  31. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Now run the restorereplica command with the appropriate replica(s) identified! (we use ALL replicas we sync with, but not replicas we never sync with)
mt restorereplica replica:boulder_nmis replica:bscclassic_nmis \
   replica:ewubo_nmis replica:ewucth_nmis replica:ewuhub_nmis
• Now send export packets to those replicas and have them send packets with their changes back. The VOB stays locked until the replica you are restoring gets update packets from each! • Once all changes have been processed by the restored replica, you can unlock the VOBs and go to the next step
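A hedged sketch of that exchange using the standard MultiSite syncreplica command (the slides abbreviate multitool as "mt"). The replica names reuse the /vobs/nmis example; the name of the restored replica, ewucello_nmis, is illustrative, and store-and-forward shipping is assumed to be already configured.

# At the restored site: send export packets to each replica named in restorereplica
multitool syncreplica -export -fship ewuhub_nmis@/vobs/nmis
multitool syncreplica -export -fship ewucth_nmis@/vobs/nmis

# At each remote site: generate and ship an update packet back to the restored replica
multitool syncreplica -export -fship ewucello_nmis@/vobs/nmis

# Back at the restored site: import the update packets as they arrive
multitool syncreplica -import -receive

Once update packets from every replica listed in restorereplica have been imported, the restore completes and the VOB can be unlocked.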

  32. Recovery Process (Continued) • WHA recovery scenario - Clean up clients • Typically a crash means NFS/MVFS is messed up • The easiest way to get clients and servers working properly is to – REBOOT • To try to clean up clients without a reboot, see the basic script on the next page

  33. Recovery Process (Continued) • WHA recovery scenario - Clean up clients (continued) • Script
#!/bin/sh -x
# Kill processes holding the view root and each VOB mount point
/usr/sbin/fuser -uck /view
for VOB in `/usr/atria/bin/cleartool lsvob -s`
do
    /usr/sbin/fuser -uck $VOB > /dev/null 2>&1
done
# Unmount all VOBs, then any leftover /local/mnt NFS mounts
/usr/atria/bin/cleartool umount -all > /dev/null 2>&1
for MNT in `df | grep local/mnt | grep -v "/dev/dsk" | cut -f1 -d "("`
do
    umount $MNT > /dev/null 2>&1
done
# Remove the stale VOB mount-point directories and stop ClearCase
rm -r /vobs/*
/etc/init.d/atria stop

  34. Recovery Process (Continued) • WHA Restore completed! But developers can’t work! • Build issues – need error handling in build scripts • VOBs and Views may have been created or deleted since the backup: • Created since backup - storage exists without entry in registry • Deleted since backup - registry entry exists without storage • FIRST – MAKE SURE ALL VOB AND VIEW SERVER PROCESSES HAVE BEEN KILLED – this eliminates lots of potential problems (stop and restart ClearCase on all systems)

  35. Recovery Process (Continued) • Build issues • Case #1: VOBs that have been restored HAVE references to DO’s • DO’s physically exist in VOB (no problem) • DO’s exist in view (ref count = 1) (again no problem) • DO’s references exist in VOBs, but the DO data DOES NOT exist anymore (maybe removed since backup by rmview or rmdo) • Case #2: VOBs that have been restored DO NOT have references to DO’s that exist • DO’s exist in a single view, reference count == 1, reference in the view but not the VOBs • DO’s were promoted so references exist in multiple views (ref count > 1) – but not in the VOBs

  36. Recovery Process (Continued) • Build issues – Case # 1 • VOBs that have been restored HAVE references to DO’s • DO’s references exist in VOBs, but the DO data DOES NOT exist anymore • maybe removed since backup by rmview or rmdo

  37. Recovery Process (Continued) • Build issues – Case #1 (continued) • Since DO pointers exist in the restored VOB, these DO’s are considered during configuration lookup of builds. Results in Warnings! But it does rebuild the DO’s
clearmake -C sun -f /vobs/wds/build/include/Makefile.if -e
clearmake: Warning: Unable to evaluate derived object "libimc.a.1@@07-Nov.19:10.220156" in VOB directory "/vobs/bscng/ccl/imc/imc_if/lib.sp750@@"
** recoverview does NOT clean this up, you just keep getting warnings! We created a script to clean this up, but you might be able to just ignore the messages!

  38. Recovery Process (Continued) • Build issues – Case #1 (continued) • If the view has been deleted, an ERROR message will be generated (scripts need error handling)
>>> (clearmake): Build evaluating lib1.a
>>> (clearmake): Build evaluating one.o
No candidate in current view for "one.o"
>>> (clearmake): Shopping for DO named "one.o" in VOB directory "/vobs/stre/do_test/.@@"
>>> (clearmake): Evaluating heap derived object "one.o@@05-Jun.12:24.74"
>>> clearmake: Error: Unable to find view by uuid:5b997e3d.78b711d6.ad2c.00:01:80:b6:87:eb, last known at "lime:/tmp/do3.vws".
>>> clearmake: Error: Unable to contact View - ClearCase object not found
>>> clearmake: Warning: View "lime:/tmp/do3.vws" unavailable - This process will not contact the view again for 60 minutes. NOTE: Other processes may try to contact the view.
>>> clearmake: Warning: Unable to evaluate derived object "one.o@@05-Jun.12:24.74" in VOB directory "/vobs/stre/do_test/.@@"

  39. Recovery Process (Continued) • Build issues – Case #2 • VOBs that have been restored DO NOT have references to DO’s that exist • DO’s exist in a single view, reference count == 1, reference in the view but not the VOBs • DO’s were promoted so references exist in multiple views (ref count > 1) – but not in the VOBs

  40. Recovery Process (Continued) • Build issues – Case #2 (continued) • DO’s exist in a single view, reference count == 1, reference in the view but not the VOBs • DO’s were promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • Recoverview can be used to clean this up; it needs to be run in each view with a problem. It moves stranded DO’s to the view’s .s/lost+found: recoverview -vob <vob uuid> -tag <view tag>
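For example, against the do2 view from the later slides this would look roughly as follows (a hedged sketch; <vob-family-uuid> is a placeholder taken from the describe output, since the uuid shown in slide 38 is a view uuid, not a VOB uuid):

# Find the VOB family uuid of the VOB whose DO references are stranded
cleartool describe -long vob:/vobs/stre/do_test
# Run recoverview against each affected view (do2 is the view tag from the example slides)
cleartool recoverview -vob <vob-family-uuid> -tag do2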

  41. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • DO’s promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • *careful, view server processes have not been terminated!
lime /vobs/stre/do_test 53 ct setview do2
lime /vobs/stre/do_test 51 ct ls -l
view private object .cmake.state
version Makefile@@/main/1 Rule: element * /main/LATEST
derived object four.o [no config record]
derived object lib1.a [no config record]
dir version lost+found@@/main/0 Rule: element * /main/LATEST
derived object one.o [no config record]
derived object three.o [no config record]
derived object two.o [no config record]

  42. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • DO’s promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • *view server processes have been terminated!
lime /vobs/stre/do_test 52 ct ls
.cmake.state
Makefile@@/main/1 Rule: /main/LATEST
cleartool: Error: Trouble looking up element "four.o" in directory ".".
cleartool: Error: Trouble looking up element "lib1.a" in directory ".".
lost+found@@/main/0 Rule: /main/LATEST
cleartool: Error: Trouble looking up element "one.o" in directory ".".
cleartool: Error: Trouble looking up element "three.o" in directory ".".
cleartool: Error: Trouble looking up element "two.o" in directory ".".

  43. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • DO’s promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • *view server processes have been terminated!
> ls -l
./one.o: No such file or directory
./two.o: No such file or directory
./three.o: No such file or directory
./four.o: No such file or directory
./lib1.a: No such file or directory

  44. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • With a proper shutdown of the view server process, ClearCase automatically purges the references and enters log messages in /var/adm/atria/view_log:
06/12/02 10:54:44 view_server(24163): Warning: Cover object mother:/local/mnt2/workspace/vobs/stre/do_test.vbs:336e07d7.7e2b11d6.b659.00:01:80:b6:87:eb for 0x8000000a not found in VOB: ClearCase object not found
06/12/02 10:54:44 view_server(24163): Warning: Cover object mother:/local/mnt2/workspace/vobs/stre/do_test.vbs:336e07df.7e2b11d6.b659.00:01:80:b6:87:eb for 0x80000007 not found in VOB: ClearCase object not found
06/12/02 10:54:44 view_server(24163): Warning: Cover object
06/12/02 10:54:53 view_server(24163): Warning: Vob stale 0x8000000d: Purging

  45. Recovery Process (Continued) • VOBs and Views may have been created or deleted since the backup: • VOBs or Views created since backup - storage exists without entry in registry • VOBs or Views deleted since backup - registry entry exists without storage • At least the registry is in sync with the data that was restored • ClearCase configuration and VOB storage on same device, gets backed up at the same time!

  46. Recovery Process (Continued) • VOBs and Views may have been created or deleted since the backup (continued): • You can use rgy_check to help clean this up: /usr/atria/etc/rgy_check -vobs (or -views) • It helps if you have standard storage locations for VOBs and Views, so you know where to look • Sometimes you just need to wait for users to complain! Remember those “error/warning” messages! • Views are supposed to be temporary working space, right?
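A hedged sketch of that audit and the kind of cleanup each finding leads to. The tag, uuid and storage paths below are placeholders; rgy_check only reports the mismatches, and the commented cleartool commands are the usual follow-up, to be run only after checking each reported entry by hand.

# Report registry entries and storage locations that no longer match
/usr/atria/etc/rgy_check -vobs
/usr/atria/etc/rgy_check -views
# VOB deleted after the backup (registry entry with no storage): drop the stale entry
# cleartool rmtag -vob /vobs/obsolete
# cleartool unregister -vob -uuid <uuid-reported-by-rgy_check>
# VOB created after the backup (storage with no registry entry): re-register and re-tag it,
# using the DNS alias and the logical (linked) path exactly as on slide 10
# cleartool register -vob -host edbvobA -hpath /local/mnt/VOBSA/new.vob /net/edbvobA/local/mnt/VOBSA/new.vob
# cleartool mktag -vob -tag /vobs/new -host edbvobA -gpath /net/edbvobA/local/mnt/VOBSA/new.vob /local/mnt/VOBSA/new.vob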
