
Backup And Recovery Requirements

Backup And Recovery Requirements. Routine backups must have minimal impact on the development environment. VOBs must be locked for a minimal amount of time during backup. Routine backups must capture relevant data in a way that can be quickly and accurately recovered.


Presentation Transcript


  1. Backup And Recovery Requirements • Routine backups must have minimal impact on the development environment • VOBs must be locked for a minimal amount of time during backup • Routine backups must capture relevant data in a way that can be quickly and accurately recovered • Data validation is required prior to backing up data to tape • All relevant data must be backed up at the same time (i.e. registry, configuration, VOB storage)

  2. Backup And Recovery Requirements (Continued) • Recovery time must minimize impact on developers - a typical VOB server has 80 to 90 VOBs and 100 - 200 GB of storage (hub servers 130 – 160 VOBs) • Typical recovery scenario (takes a week): • Restore data from backup media, i.e. tape (days!) • Data validation on restored data, i.e. checkvob & dbcheck (days! 10-15 GB VOBs with 3-4 GB databases) • Sync replicas to get changes since the last backup (this alone takes about 8 - 12 hours) • Reset client machines (rebooting required?) • Minimize downtime during recovery -- it needs to be minutes/hours, not days or weeks

  3. Warm High Availability (WHA) Configuration • Aspects of the WHA Implementation: • Using SAN technology • Snapshots to minimize VOB locks • Specialized ClearCase configuration • Currently implemented only on VOB servers; view servers could be set up the same way • Now some details!

  4. WHA Configuration (Continued) • Using SAN technology • Any server can dynamically control any storage device, allowing for quick fail-over of VOB servers • Use of a “shadow” disk as the initial backup medium • Snapshots to minimize VOB locks • Limits VOB lock times to less than 2 minutes • Specialized ClearCase configuration • Allows fail-over to a new server with no required changes to the ClearCase registry and configuration • More details later!

  5. WHA Configuration (Continued) • Hardware configuration • SAN configuration • ClearCase configuration

  6. WHA Configuration (Continued) • Hardware configuration • Sun Solaris (UNIX) servers • SAN storage appliance – currently about 5 - 6 TB of ClearCase storage (VOBs and Views) in San Diego • Each VOB server has primary disk storage plus 2 “shadow images” of the VOB storage (3 copies on disk) • Large servers: 16 GB RAM, 4 CPUs, gigabit network and a 2 Gb interface to the storage device • We have implemented WHA on all our VOB servers, large and small

  7. WHA Configuration (Continued) • SAN configuration • Many-to-many connectivity between servers and storage locations • Dynamic control of storage locations • Accommodates snapshots and shadow images (where dbcheck is run) • Using 2 shadow images, taken one day apart • The older one has successfully passed dbcheck and is being (or has been) dumped to tape • The newer one is undergoing dbcheck • Always have a validated copy of all necessary data on disk for restoration

  8. WHA Configuration (Continued) • ClearCase configuration • Currently using ClearCase 4.2 • When implementing a recovery, NO ClearCase configuration changes are required (i.e. registry) • Back up ALL relevant data at the same time • VOB data and /var/adm/atria are located on the same disk location • A DNS alias is used instead of the real host name for the ClearCase license server • Use logical vs. physical VOB storage locations when registering • A DNS alias is used for VOB servers (the VOB server can change by moving the alias)

  9. WHA Configuration (Continued) • ClearCase configuration (continued) • Use logical vs. physical VOB storage locations when registering -- the path to the VOB storage must be the same, independent of host and storage location • Create links to the VOB storage, for example: • /local/mnt (this mount point always exists and is always shared) • Use links to create the logical → physical mapping; you need unique logical paths for all VOB storage within the same region:
/local/mnt/VOBSA → /net/dnsalias/local/mnt2/vobs
/local/mnt/VOBSB → /net/dnsalias/local/mnt3/vobs
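A minimal shell sketch of creating those links on the VOB server, assuming “dnsalias” stands for the server’s DNS alias exactly as in the mapping above:

# Create the logical storage roots once on the server (paths from the example above)
ln -s /net/dnsalias/local/mnt2/vobs /local/mnt/VOBSA   # VOBSA -> physical storage on mnt2
ln -s /net/dnsalias/local/mnt3/vobs /local/mnt/VOBSB   # VOBSB -> physical storage on mnt3

After a cut-over, the same commands are rerun on the new server so the registered logical paths keep resolving.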

  10. WHA Configuration (Continued) • ClearCase configuration (continued) • Once the links are created, register and tag (mkvob, mkreplica…). You must use the fully qualified method: -host <dns alias of VOB server> -hpath <the linked path, not the physical path> -gpath <the global and linked path> • Never use the real host name or real physical path!! • To switch servers: restore data, move the host alias, create the links, stop and start ClearCase • The clients and view servers must reacquire the new VOB storage mount points, so restart ClearCase or reboot the clients

  11. WHA Configuration (Continued) • ClearCase configuration (continued) • Example -- /vobs/bsc • Host name is cyclone and the VOB storage location is: /local/mnt2/vobs/bsc.vob (physical) /local/mnt/VOBS/bsc.vob (logical) • DNS alias: cyclone == edbvobA • Register and tag /vobs/bsc using the DNS alias and the logical link instead of the physical storage location: /net/edbvobA/local/mnt/VOBS/bsc.vob -vs- /net/cyclone/local/mnt2/vobs/bsc.vob
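Put together, creating and tagging a VOB with the fully qualified method from slide 10 and the /vobs/bsc values above would look roughly like this (a hedged sketch, not the exact command run at the site; mkreplica would be used instead of mkvob for a replicated VOB):

cleartool mkvob -tag /vobs/bsc \
    -host edbvobA \
    -hpath /local/mnt/VOBS/bsc.vob \
    -gpath /net/edbvobA/local/mnt/VOBS/bsc.vob \
    /local/mnt/VOBS/bsc.vob

Only the alias and the linked (logical) paths appear; cyclone and /local/mnt2 never show up in the registry, which is what lets the alias move to another host later.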

  12. WHA Configuration (Continued) • ClearCase configuration (cont) • Example of lsvob (2 VOB servers, 3 storage locations):
* /vobs/mgw/msf_erab /net/mother/local/mnt/VOBSA/mgw/msf_erab.vob public
* /vobs/mgw/msf_eedn /net/mother/local/mnt/VOBSA/mgw/msf_eedn.vob public
* /vobs/mgw/msf_etm /net/mother/local/mnt/VOBSA/mgw/msf_etm.vob public
* /vobs/cello/ose /net/mother/local/mnt/VOBSC/cello/ose.vob public
* /vobs/ewu/perl /net/stepmother/local/mnt/VOBSB/ewu/perl.vob public
* /vobs/ewu/freeware /net/stepmother/local/mnt/VOBSB/ewu/freeware.vob public
* /vobs/stre/det /net/stepmother/local/mnt/VOBSB/stre/det.vob public

  13. WHA Configuration (Continued) • ClearCase configuration (continued) • A DNS alias is used for VOB servers (the VOB server can change by moving the alias) • The registered path and host are always the same no matter which physical host is the VOB server! • Always use the alias, for MultiSite as well. Machines can come and go, but the VOB server host name is always the same • There is both a Rational and a Sun white paper documenting this configuration and setup: http://www.rational.com/media/partners/sun/Ericsson_final.pdf

  14. Backup Process • All setup is completed and WHA is implemented • Lock VOBs (less than 2 minutes) • We use Sun Instant Image™ to snapshot the VOB storage partition • Both the VOB storage and /var/adm/atria are located here (we also have trigger scripts and …) • The snapshot goes to shadow1 • another disk partition; it could be a totally different disk • Shadow2 passed data validation with “dbcheck” yesterday and is being dumped to tape
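A minimal sketch of the lock/snapshot/unlock sequence. The loop over lsvob -s and the placeholder snapshot_to_shadow1 are assumptions: a real script would lock only the VOBs served by this host and call the site's Sun Instant Image command.

#!/bin/sh
# Lock every VOB so the databases are quiescent during the snapshot
for VOB in `/usr/atria/bin/cleartool lsvob -s`
do
    /usr/atria/bin/cleartool lock vob:$VOB
done
# Snapshot the partition holding the VOB storage and /var/adm/atria onto shadow1.
# snapshot_to_shadow1 is a placeholder for the Sun Instant Image invocation used on site.
snapshot_to_shadow1 /local/mnt2
# Unlock as soon as the snapshot completes -- total lock time stays under 2 minutes
for VOB in `/usr/atria/bin/cleartool lsvob -s`
do
    /usr/atria/bin/cleartool unlock vob:$VOB
done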

  15. Backup Process (Continued) • Once the backup to shadow1 is complete, “dbcheck” is started for data validation • Once data validation is successful -- and it’s a new backup day -- shadow1 becomes shadow2, shadow2 becomes shadow1, and it starts all over • If an error is found during dbcheck we take immediate corrective action – we keep the validated copy on disk (shadow2) while we check out the production data • There is ALWAYS a “good copy” on the shadow2 disk!
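The rotation logic, sketched in shell with placeholders. run_dbcheck and promote_shadow1_to_shadow2 are hypothetical names standing in for the site's dbcheck pass over each VOB database on the shadow and for the swap of the two shadow images; they are not ClearCase commands.

#!/bin/sh
# shadow1 = last night's snapshot, shadow2 = previous, already-validated snapshot
if run_dbcheck /shadow1            # placeholder: dbcheck every VOB database found under /shadow1
then
    # Validation passed: shadow1 becomes the new "known good" shadow2
    # and the old shadow2 is reused for the next snapshot.
    promote_shadow1_to_shadow2     # placeholder for the site's relabel/remount step
else
    # Keep shadow2 untouched as the good copy and investigate the production VOBs.
    echo "dbcheck failed on shadow1 -- keeping shadow2, checking production data"
fi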

  16. Recovery Process • Typical recovery scenario: • Get another server or fix the broken one – you have to give it the same server hostname or change the ClearCase registry information! • Restore data from backup tape (100 - 200 GB, 2+ days) • Do data validation, checkvob and dbcheck (2+ days) • Restore the replica (MultiSite users) for 80+ VOBs; this takes at least 8 – 12 hours • Clean up clients – typically a crash means NFS/MVFS is messed up – REBOOT! • Is that it? I wish it were! Developers can’t work! • WHA recovery scenario?

  17. Recovery Process (Continued) • WHA recovery scenario - get another server or fix the broken one • ANY server can act as the new VOB server. Of course, using an existing VOB/View server would degrade performance • Get VOBs on-line and back in service as fast as possible; WHA means I can “cut over” to another server again later!

  18. Recovery Process (Continued) • WHA recovery scenario - Get another server or fix broken one (cont) • STEPS (same for any WHA cut-over): • Move the DNS alias to new server • create the links (links for /var/adm/atria and VOB physical storage locations from /local/mnt/VOBS?) • Since /var/adm/atria was backed up with the VOB storage, they are in sync • Just turn ClearCase off/on and – NEW VOB SERVER!
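A sketch of the cut-over commands on the new server, assuming the production (or shadow) storage partition is already reachable over the SAN. The device name, the /var/adm/atria location on the partition, and the alias edbvobA are illustrative; the DNS alias itself is moved in the name service, which is not shown here.

#!/bin/sh
# 1. The DNS alias (e.g. edbvobA) has already been pointed at this host.
# 2. Mount the partition holding the VOB storage and the /var/adm/atria copy.
mount /dev/dsk/c1t0d0s6 /local/mnt2              # device name is illustrative
# 3. Recreate the logical -> physical links the VOBs were registered with.
ln -s /net/edbvobA/local/mnt2/vobs /local/mnt/VOBSA
ln -s /local/mnt2/var_adm_atria /var/adm/atria   # location of the registry copy is site-specific
# 4. Restart ClearCase -- this host is now the VOB server.
/etc/init.d/atria stop
/etc/init.d/atria start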

  19. Recovery Process (Continued) • WHA recovery scenario - Restore data from backup tape, 100 - 200 GB • Not 2+ days • We don’t go to tape unless we’ve had a real disaster! • We don’t do a “restore”; we have 2 copies on disk! • Use shadow1 if data validation is complete or the confidence level is high – shadow2 is only 24-48 hrs old • Mount the shadow disk on the new VOB server (the SAN makes this easy)

  20. Recovery Process (Continued) • WHA recovery scenario - Restore data from backup tape (cont) • Create the links to the VOB physical storage location • Much faster than transferring 100 – 200 GB of data from tape! • 15 minutes MAX!

  21. Recovery Process (Continued) • WHA recovery scenario - Do data validation, checkvob and dbcheck • Not 2+ days • This takes a “very” long time (100-200 GB of VOBs, some with 4-6 GB databases) • Checkvob and dbcheck are run on all servers monthly • Daily successful dbcheck runs on the shadow disk give high confidence

  22. Recovery Process (Continued) • WHA recovery scenario - Do data validation, checkvob and dbcheck (cont) • If shadow1 has completed dbcheck, use it; if not, use shadow2 • NO time is spent on data validation during recovery because it was done during the backup phase! • We would like checkvob and other data validation utilities that can be run on off-line VOBs!

  23. Recovery Process (Continued) • WHA recovery scenario - Restore replica • MultiSite is heavily used, with internal syncing every 30 minutes – checked-in changes will be available in another replica since the shadow image snapshot was taken! • Get the changes made since the snapshot from other replicas • By default, restorereplica wants to sync with ALL replicas (NOT all 30-40 we have) • **CAREFUL**

  24. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • With lots of VOBs (80+), this will still take at least 8 – 12 hours even against only 2-4 replicas • Must get update packets (that have the changes since the backup) from the other replicas • See the example commands on the next slides!

  25. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Example of commands:
mt restorereplica (default requires updates from all replicas)
OR
mt restorereplica replica:ewuhub_bscng_aim replica:ewucth_bscng_aim replica:ewubo_bscng_aim
** MUST INCLUDE THE REPLICA THAT WAS THE LAST REPLICA THAT WAS EXPORTED TO JUST BEFORE THE CRASH!! – NEED TO AVOID DIVERGENCE IN THE VOB REPLICAS!
* Check via lsepoch; make sure the replica with the record of the most changes that took place in the restored replica is included! (mt lsepoch ewuhub_bscng_aim@/vobs/bscng/aim)

  26. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) **WARNINGS – POSSIBLE DIVERGENCE** ** MUST INCLUDE THE REPLICA THAT WAS THE LAST REPLICA THAT WAS EXPORTED TO JUST BEFORE THE CRASH!! – NEED TO AVOID DIVERGENCE IN THE VOB REPLICAS! • Check for the latest replica synced to: • lsepoch • lshistory

  27. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Check via lsepoch; make sure the replica with the record of the most changes that took place in the restored replica is included! • With ClearCase 4.x you can use -actual to query remote replicas

  28. Recovery Process (continued) • WHA recovery scenario - Restore replica (continued) • Check via lsepoch • EXAMPLE: the restored replica is ewucello_bscng_aim
mt lsepoch -actual ewuhub_bscng_aim@/vobs/bscng/aim
oid:834d7251.f24c11d4.a4df.00:01:80:b8:c7:b4=450831 (ewucello_bscng_aim)
mt lsepoch -actual ewucth_bscng_aim@/vobs/bscng/aim
oid:834d7251.f24c11d4.a4df.00:01:80:b8:c7:b4=450745 (ewucello_bscng_aim)

  29. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Example of commands – to find the last replica exported to. This is not trivial; you have to check each replica you have been syncing with:
Example: mt lsreplica -invob /vobs/nmis
Replicas (14): boulder_nmis, bscclassic_nmis, cbssw_nmis, edbbsc_nmis, edbbsm_nmis, edbspe_nmis, edbtetra_nmis, ewubo_nmis, ewucth_nmis, ewuhub_nmis, ewustre_nmis, ramstest_nmis, servicenet_nmis, streit2_nmis
These replicas are the only ones the restored replica syncs with: boulder_nmis, bscclassic_nmis, ewubo_nmis, ewucth_nmis, ewuhub_nmis

  30. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) Example (cont): /vobs/nmis (must do lshistory at each remote replica site!)
cleartool lshistory replica:boulder_nmis
cleartool lshistory replica:bscclassic_nmis
cleartool lshistory replica:ewubo_nmis
cleartool lshistory replica:ewucth_nmis
cleartool lshistory replica:ewuhub_nmis
Example results:
12-Jun.15:55 root import sync from replica "bscclassic_nmis" to replica "ewuhub_nmis"
• Review the output of the above commands to see which was the last replica to be sent an export sync packet

  31. Recovery Process (Continued) • WHA recovery scenario - Restore replica (continued) • Now run the restorereplica command with the appropriate replica(s) identified! (we use ALL replicas we sync with, but not replicas we never sync with)
mt restorereplica replica:boulder_nmis replica:bscclassic_nmis \
   replica:ewubo_nmis replica:ewucth_nmis replica:ewuhub_nmis
• Now send export packets to those replicas and have them send packets with their changes back. The VOB stays locked until the replica you are restoring gets update packets from each! • Once all changes have been processed by the restored replica, you can unlock the VOBs and go to the next step
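A hedged sketch of that exchange using the standard MultiSite syncreplica command (the slides abbreviate multitool as "mt"). The replica names reuse the /vobs/nmis example; the name of the restored replica, ewucello_nmis, is illustrative, and store-and-forward shipping is assumed to be already configured.

# At the restored site: send export packets to each replica named in restorereplica
multitool syncreplica -export -fship ewuhub_nmis@/vobs/nmis
multitool syncreplica -export -fship ewucth_nmis@/vobs/nmis

# At each remote site: generate and ship an update packet back to the restored replica
multitool syncreplica -export -fship ewucello_nmis@/vobs/nmis

# Back at the restored site: import the update packets as they arrive
multitool syncreplica -import -receive

Once update packets from every replica listed in restorereplica have been imported, the restore completes and the VOB can be unlocked.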

  32. Recovery Process (Continued) • WHA recovery scenario - Clean up clients • Typically a crash means NFS/MVFS is messed up • The easiest way to get clients and servers working properly is to – REBOOT • To try to clean up clients without a reboot, see the basic script on the next page

  33. Recovery Process (Continued) • WHA recovery scenario - Clean up clients (continued) • Script
#!/bin/sh -x
# Kill processes holding the view root and each VOB mount point
/usr/sbin/fuser -uck /view
for VOB in `/usr/atria/bin/cleartool lsvob -s`
do
    /usr/sbin/fuser -uck $VOB > /dev/null 2>&1
done
# Unmount all VOBs, then any leftover /local/mnt NFS mounts
/usr/atria/bin/cleartool umount -all > /dev/null 2>&1
for MNT in `df | grep local/mnt | grep -v "/dev/dsk" | cut -f1 -d "("`
do
    umount $MNT > /dev/null 2>&1
done
# Remove the stale VOB mount-point directories and stop ClearCase
rm -r /vobs/*
/etc/init.d/atria stop

  34. Recovery Process (Continued) • WHA Restore completed! But developers can’t work! • Build issues – need error handling in build scripts • VOBs and Views may have been created or deleted since the backup: • Created since backup - storage exists without entry in registry • Deleted since backup - registry entry exists without storage • FIRST – MAKE SURE ALL VOB AND VIEW SERVER PROCESSES HAVE BEEN KILLED – this eliminates lots of potential problems (stop and restart ClearCase on all systems)

  35. Recovery Process (Continued) • Build issues • Case #1: VOBs that have been restored HAVE references to DO’s • DO’s physically exist in VOB (no problem) • DO’s exist in view (ref count = 1) (again no problem) • DO’s references exist in VOBs, but the DO data DOES NOT exist anymore (maybe removed since backup by rmview or rmdo) • Case #2: VOBs that have been restored DO NOT have references to DO’s that exist • DO’s exist in a single view, reference count == 1, reference in the view but not the VOBs • DO’s were promoted so references exist in multiple views (ref count > 1) – but not in the VOBs

  36. Recovery Process (Continued) • Build issues – Case # 1 • VOBs that have been restored HAVE references to DO’s • DO’s references exist in VOBs, but the DO data DOES NOT exist anymore • maybe removed since backup by rmview or rmdo

  37. Recovery Process (Continued) • Build issues – Case #1 (continued) • Since DO pointers exist in the restored VOB, these DO’s are considered during configuration lookup of builds. Results in Warnings! But it does rebuild the DO’s
clearmake -C sun -f /vobs/wds/build/include/Makefile.if -e
clearmake: Warning: Unable to evaluate derived object "libimc.a.1@@07-Nov.19:10.220156" in VOB directory "/vobs/bscng/ccl/imc/imc_if/lib.sp750@@"
** recoverview does NOT clean this up, you just keep getting warnings! We created a script to clean this up, but you might be able to just ignore the messages!

  38. Recovery Process (Continued) • Build issues – Case #1 (continued) • If the view has been deleted, an ERROR message will be generated (scripts need error handling)
>>> (clearmake): Build evaluating lib1.a
>>> (clearmake): Build evaluating one.o
No candidate in current view for "one.o"
>>> (clearmake): Shopping for DO named "one.o" in VOB directory "/vobs/stre/do_test/.@@"
>>> (clearmake): Evaluating heap derived object "one.o@@05-Jun.12:24.74"
>>> clearmake: Error: Unable to find view by uuid:5b997e3d.78b711d6.ad2c.00:01:80:b6:87:eb, last known at "lime:/tmp/do3.vws".
>>> clearmake: Error: Unable to contact View - ClearCase object not found
>>> clearmake: Warning: View "lime:/tmp/do3.vws" unavailable - This process will not contact the view again for 60 minutes. NOTE: Other processes may try to contact the view.
>>> clearmake: Warning: Unable to evaluate derived object "one.o@@05-Jun.12:24.74" in VOB directory "/vobs/stre/do_test/.@@"

  39. Recovery Process (Continued) • Build issues – Case #2 • VOBs that have been restored DO NOT have references to DO’s that exist • DO’s exist in a single view, reference count == 1, reference in the view but not the VOBs • DO’s were promoted so references exist in multiple views (ref count > 1) – but not in the VOBs

  40. Recovery Process (Continued) • Build issues – Case #2 (continued) • DO’s exist in a single view, reference count == 1, reference in the view but not the VOBs • DO’s were promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • Recoverview can be used to clean this up; it needs to be run in each view with a problem. It moves stranded DO’s to the view’s .s/lost+found: recoverview -vob <vob uuid> -tag <view tag>
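For example, against the do2 view from the later slides this would look roughly as follows (a hedged sketch; <vob-family-uuid> is a placeholder taken from the describe output, since the uuid shown in slide 38 is a view uuid, not a VOB uuid):

# Find the VOB family uuid of the VOB whose DO references are stranded
cleartool describe -long vob:/vobs/stre/do_test
# Run recoverview against each affected view (do2 is the view tag from the example slides)
cleartool recoverview -vob <vob-family-uuid> -tag do2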

  41. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • DO’s promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • *careful, view server processes have not been terminated!
lime /vobs/stre/do_test 53 ct setview do2
lime /vobs/stre/do_test 51 ct ls -l
view private object .cmake.state
version Makefile@@/main/1 Rule: element * /main/LATEST
derived object four.o [no config record]
derived object lib1.a [no config record]
dir version lost+found@@/main/0 Rule: element * /main/LATEST
derived object one.o [no config record]
derived object three.o [no config record]
derived object two.o [no config record]

  42. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • DO’s promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • *view server processes have been terminated!
lime /vobs/stre/do_test 52 ct ls
.cmake.state
Makefile@@/main/1 Rule: /main/LATEST
cleartool: Error: Trouble looking up element "four.o" in directory ".".
cleartool: Error: Trouble looking up element "lib1.a" in directory ".".
lost+found@@/main/0 Rule: /main/LATEST
cleartool: Error: Trouble looking up element "one.o" in directory ".".
cleartool: Error: Trouble looking up element "three.o" in directory ".".
cleartool: Error: Trouble looking up element "two.o" in directory ".".

  43. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • DO’s promoted so references exist in multiple views (ref count > 1) – but not in the VOBs • *view server processes have been terminated!
> ls -l
./one.o: No such file or directory
./two.o: No such file or directory
./three.o: No such file or directory
./four.o: No such file or directory
./lib1.a: No such file or directory

  44. Recovery Process (Continued) • Build issues – Case #2.2 (continued) • With a proper shutdown of the view server process, ClearCase automatically purges the references and enters log messages in /var/adm/atria/view_log:
06/12/02 10:54:44 view_server(24163): Warning: Cover object mother:/local/mnt2/workspace/vobs/stre/do_test.vbs:336e07d7.7e2b11d6.b659.00:01:80:b6:87:eb for 0x8000000a not found in VOB: ClearCase object not found
06/12/02 10:54:44 view_server(24163): Warning: Cover object mother:/local/mnt2/workspace/vobs/stre/do_test.vbs:336e07df.7e2b11d6.b659.00:01:80:b6:87:eb for 0x80000007 not found in VOB: ClearCase object not found
06/12/02 10:54:44 view_server(24163): Warning: Cover object
06/12/02 10:54:53 view_server(24163): Warning: Vob stale 0x8000000d: Purging

  45. Recovery Process (Continued) • VOBs and Views may have been created or deleted since the backup: • VOBs or Views created since backup - storage exists without entry in registry • VOBs or Views deleted since backup - registry entry exists without storage • At least the registry is in sync with the data that was restored • ClearCase configuration and VOB storage on same device, gets backed up at the same time!

  46. Recovery Process (Continued) • VOBs and Views may have been created or deleted since the backup (continued): • You can use rgy_check to help clean this up: /usr/atria/etc/rgy_check -vobs (or -views) • It helps if you have standard storage locations for VOBs and Views, so you know where to look • Sometimes you just need to wait for users to complain! Remember those “error/warning” messages! • Views are supposed to be temporary working space, right?
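A hedged sketch of that audit and the kind of cleanup each finding leads to. The tag, uuid and storage paths below are placeholders; rgy_check only reports the mismatches, and the commented cleartool commands are the usual follow-up, to be run only after checking each reported entry by hand.

# Report registry entries and storage locations that no longer match
/usr/atria/etc/rgy_check -vobs
/usr/atria/etc/rgy_check -views
# VOB deleted after the backup (registry entry with no storage): drop the stale entry
# cleartool rmtag -vob /vobs/obsolete
# cleartool unregister -vob -uuid <uuid-reported-by-rgy_check>
# VOB created after the backup (storage with no registry entry): re-register and re-tag it,
# using the DNS alias and the logical (linked) path exactly as on slide 10
# cleartool register -vob -host edbvobA -hpath /local/mnt/VOBSA/new.vob /net/edbvobA/local/mnt/VOBSA/new.vob
# cleartool mktag -vob -tag /vobs/new -host edbvobA -gpath /net/edbvobA/local/mnt/VOBSA/new.vob /local/mnt/VOBSA/new.vob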
