HA architectures using open source virtualization and networking technologies
Josep Vidal Canet, Universitat de València
18 November 2008, 13:00-13:30, Valencia, Spain
Motivation
• There are many factors that can cause data unavailability, corruption or even loss:
• Hardware failures
• Natural disasters such as lightning, earthquakes, flooding or fire
• Wars or terrorist attacks, like those of September 11th, in which a great deal of corporate data was destroyed
• Every time a computer system that runs a business stops, for whatever reason, money is lost
• Example: you want to buy a book from an online book store
• However, at that moment the system is no longer working, so you decide to buy it from a competitor's store
• What I am presenting here is a highly available distributed computer architecture that keeps running even when some of its components fail - for example, due to a natural disaster
Problem analysis
• Usually, most architectures are designed using a layered approach
• 3 tiers: Web + Application Logic + Data
• Each tier should be designed to meet SLA (Service Level Agreement) goals
• Availability: 99.xxx %
• Performance: 95 % of accesses in < 1 second
• Nowadays SLAs are forcing organizations to deploy HA architectures in order to avoid downtime
Example of IS Architecture (UV)
[Diagram: UV web clusters, application clusters and data sources - webmail (post, mailboxes), J2EE applications (accounting, research, DW, Secretaria Virtual, Replica), Virtual Classroom, www.uv.es, web/disc servers, virtual disc, library, monitor - fronted by a Level 4 switch / pound workload balancer.]
Web + Application Tiers
• It is easy to design an architecture that meets SLA goals
• Main reason: it is easy to clone systems and balance work between them
• The modification rate is low -> data only changes when an application is installed or updated
• The balancer is the SPoF (Single Point of Failure)
• An active/passive approach can be used to avoid unavailability
• Let's see an example
General Architecture
[Diagram: Level 4 switch / pound workload balancer (with automatic failover) in front of Linux/Apache web servers; WAS grid cluster application servers (webges01..webges11) on AIX pSeries with session persistence, a JDBC connection pool (db2jd, plugin-cfg, MaxConnect) and automatic session recovery for a down JVM; data servers: open databases (Replica, Multiplica, Complica, Implica, Explica), LDAP server1/server2, CICS servers (MaxTasks, thread limit) reached via CTG, and DB2 on OS/390 (z/890).]
Active/Passive Web balancer
• Automatic failover using a public IP + soft ARP
• About 12 seconds of unavailability in the case of a primary balancer failure
• Failure detection with Heartbeat
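The takeover just described maps onto a classic Heartbeat v1 configuration: the standby node acquires the public IP and announces it via ARP. A minimal sketch - the host names, interface and virtual IP below are invented for the example, and the timers are chosen to roughly match the ~12 s failover mentioned above:

```
# /etc/ha.d/ha.cf -- sketch; node names are assumptions
keepalive 2          # heartbeat interval (seconds)
deadtime 12          # declare peer dead after 12 s of silence
node lb1 lb2

# /etc/ha.d/haresources -- lb1 normally owns the public IP;
# on failure, lb2 takes it over and sends gratuitous ARP
lb1 IPaddr::192.168.0.10/24/eth0
```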
Application Tier
• Runtime = WebSphere Application Server (WAS): JSPs, Servlets, EJBs
• Hardware: pSeries (POWER5/POWER6)
Data Tier for Web Applications
[Diagram: open databases and DB2 on OS/390 (z/890), accessed through a JDBC connection pool (db2jd) and CICS servers (MaxTasks, thread limit) via CTG.]
• It is more difficult to design an IS without a SPoF
• Main reasons:
• Data is persistent
• The modification rate is high
• Databases need a lot of resources -> multicore servers + high I/O bandwidth
• IS are complex and heterogeneous: they consist of databases, text files, batch processes, etc.
Data Tier for Web Applications / Information Systems
Available solutions:
• Deploy an SSI (Single System Image) architecture
• IBM SYSPLEX
• Not mature enough: we'll tell you about our experiences with DB2 + OpenSSI
• Use clusterable databases: shared storage = SPoF
DB running with OpenSSI
• OpenSSI: a comprehensive clustering solution offering a full, highly available SSI environment for Linux
• Goals for OpenSSI clusters include availability, scalability and manageability, built from standard servers
• Open source
• Can run databases
• No SPoF using DRBD + Heartbeat + CFS
• Problems with process migration
SSI at UV
db2inst2@ssi:~$ db2start
07-28-2008 17:14:04    0   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.
db2inst2@ssi:~$ db2 connect to replica
   Database Connection Information
 Database server        = DB2/LINUX 8.1.0
 SQL authorization ID   = DB2INST2
 Local database alias   = REPLICA
db2inst2@ssi:~$ db2 "select count(*) from sysibm.systables"
1
-----------
        436
  1 record(s) selected.
db2inst2@ssi:~$ cluster -v
1: UP
2: UP
Data Tier for DB-based Information Systems
• Using open source virtualization and networking technologies, it is possible to deploy geographically distributed architectures with automatic failover that achieve low downtimes in the face of contingencies
• The UV has deployed variations (active/passive, active/active) of such architectures in order to guarantee good response times and maximize availability for its DB-based information systems
Data Tier Architecture using Open Source
• Ideal solution: SSI. Not possible yet
• Using virtualization & networking technologies, we can design distributed, fault-tolerant systems where physical resources are virtualized & replicated far away (over IP)
• Proposed architecture components:
• Physical resources: IP & storage (SAN) networks, physical servers
• Logical resources:
• Virtualization software (Xen)
• Distributed Replicated Block Device (DRBD)
• Automatic failover (Heartbeat)
HA Architecture components (Data Tier)
[Diagram: LUNs (Logical Unit Numbers) in an active/passive architecture.]
Active/Passive HA architecture
• Using DRBD, we build a reliable mirror between the disks of the primary and secondary disk arrays
• On top of this mirror, we define a Xen VM in which the DB system runs
• By default, this VM runs on the CPUs of the primary site and modifies the data stored on the LUNs of the primary disk array
• DRBD uses a standard IP network to keep the primary and secondary disks synchronized
• In the event of a contingency, Heartbeat detects the unavailability and migrates the Xen VM from the primary site to the computational resources available at the secondary site
• To facilitate automatic DB recovery after a system crash, additional configuration is needed:
• DB2: enable AUTORESTART, LOGRETAIN, ...
• Oracle: enable ARCHIVELOG mode
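The mirror in the first bullet is described by a DRBD resource definition. A minimal sketch in DRBD 0.7-era syntax - the backing disks, IP addresses and port are invented for the example; the host names and /dev/drbd1 device come from later slides:

```
# /etc/drbd.conf -- sketch; disks, IPs and port are assumptions
resource r1 {
  protocol C;              # synchronous replication: a write is acked
                           # only once it is on disk at both sites
  on bresca {
    device    /dev/drbd1;
    disk      /dev/sdh1;   # LUN on the primary disk array
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on colmena {
    device    /dev/drbd1;
    disk      /dev/sdh1;   # LUN on the secondary disk array
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```

Protocol C trades some write latency for the guarantee that a committed transaction survives the loss of the primary array.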
Active/Passive HA architecture: XEN + DRBD + Heartbeat
[Diagram: active/passive deployment for the Secretaria Virtual application.]
Final situation after a failure
• Heartbeat detects the failure & proceeds to restart the VM on the secondary HW resources
• During the system restart, DB2 may perform a crash recovery
System components
[Diagram: active VM bancuv3 on Bresca (Xen dom0), network RAID-1 over /dev/drbd1, primary disk array.]
db2inst1@bancuv3:~$ db2 connect to josep
   Database Connection Information
 Database server        = DB2/LINUX 8.1.0
 SQL authorization ID   = DB2INST1
 Local database alias   = JOSEP
bresca:~# xm list
Name       ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0    0      1209      2  r-----  66538.2
bancuv3    13      1024      1  -b----   3286.8
bresca:~# more /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@bresca, 2007-07-03 11:57:21
 0: cs:Unconfigured
 1: cs:Connected st:Primary/Secondary ld:Consistent
    ns:31544386 nr:0 dw:36278 dr:31835131 al:95 bm:3848 lo:0 pe:0 ua:0 ap:0
bresca:~# dmesg
qla2xxx 0000:03:01.1: Configure NVRAM parameters...
  Vendor: STK  Model: FLEXLINE 380  Rev: 0619
  Type: Direct-Access  ANSI SCSI revision: 05
SCSI device sdh: drive cache: write through w/ FUA
 sdh: sdh1 sdh2
DRBD Disk Synchronization after a server failure
colmena:/etc# more /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@bresca, 2007-07-03 11:57:21
 0: cs:Unconfigured
 1: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:1044903 dw:1044903 dr:0 al:0 bm:3023 lo:37 pe:1222 ua:37 ap:0
    [>...................] sync'ed:  2.2% (46235/47256)M
    finish: 0:17:00 speed: 46,160 (52,228) K/sec
• DRBD uses host-based replication (sync & async) to keep the local & remote disks up to date
• Be careful with failures of the primary system while the secondary node is synchronizing
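Resynchronization progress like the above can be scraped from /proc/drbd for monitoring. A small sketch - the helper name and the snapshot file are invented for the example; the field layout follows the DRBD 0.7 output shown on this slide:

```shell
#!/bin/sh
# drbd_sync_pct: extract the sync percentage from a saved
# /proc/drbd snapshot (DRBD 0.7 "sync'ed: NN.N%" field).
drbd_sync_pct() {
  grep -o "sync'ed: *[0-9.]*%" "$1" | grep -o '[0-9.]*'
}

# Sample snapshot with the values shown on this slide
cat > /tmp/drbd_status.txt <<'EOF'
 1: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    [>...................] sync'ed:  2.2% (46235/47256)M
EOF

drbd_sync_pct /tmp/drbd_status.txt   # prints 2.2
```

A cron job could alert if the percentage stalls, which matters given the warning above about primary failures during resynchronization.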
XEN
[Diagram: Bresca (Xen dom0) and Colmena (Xen dom0), each with FC access to its own disk array (Deco-garden primary, Brico-mania secondary, 10 km apart), joined by a network RAID-1 (/dev/drbd1) over IP - DRBD (Distributed Replicated Block Device).]
• Virtual machine configuration: /etc/xen/bancuv3.cfg
• The VM disk is a network mirror between two remote FC disks:
....
# Disk device(s).
root = '/dev/sda1 ro'
disk = [ 'phy:/dev/drbd1,sda1,w', 'phy:/dev/sdh2,sda2,w' ]
....
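Filling in the elided parts, a complete domU config of this shape might look as follows - the kernel path, memory size and vif line are assumptions for illustration; only the disk and root lines come from the slide:

```
# /etc/xen/bancuv3.cfg -- sketch; kernel, memory and vif are assumptions
kernel = '/boot/vmlinuz-2.6-xen'
memory = 1024
name   = 'bancuv3'
vif    = [ 'bridge=xenbr0' ]
# sda1 is the DRBD mirror (survives a site failure);
# sda2 is a plain local FC partition
disk   = [ 'phy:/dev/drbd1,sda1,w', 'phy:/dev/sdh2,sda2,w' ]
root   = '/dev/sda1 ro'
```

Because the VM only ever sees /dev/sda1, it is unaware of which site's physical disks currently back it.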
XEN + DRBD + Heartbeat
[Diagram: the same two sites, with Heartbeat running between Bresca and Colmena over the network RAID-1.]
• At this point the VM uses only virtual resources, so it does not depend on the underlying HW
• As the VM disk is a network mirror, it can be run on either system
• Finally, we add Heartbeat for failure detection & recovery
• In the event of a failure, Heartbeat migrates the VM to the available resources (secondary site)
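Heartbeat v1 can treat the VM itself as a resource via a small start/stop wrapper. A non-runnable sketch - the resource script name matches the ha-log on a later slide, but the service IP and the r1 resource name are assumptions:

```
# /etc/ha.d/haresources -- bresca normally runs the service IP + the VM
bresca IPaddr::192.168.0.20/23/eth0 bancuv3

# /etc/ha.d/resource.d/bancuv3 -- minimal wrapper (sketch):
#!/bin/sh
case "$1" in
  start) drbdadm primary r1 && xm create /etc/xen/bancuv3.cfg ;;
  stop)  xm shutdown -w bancuv3; drbdadm secondary r1 ;;
esac
```

On failover, the surviving node promotes its DRBD side to primary and boots the VM from the mirrored disk.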
Failure detection & recovery
more /var/log/ha-log
heartbeat: 2008/08/27_10:43:59 info: Received shutdown notice from 'bresca'.
heartbeat: 2008/08/27_10:43:59 info: Acquiring resource group: bresca 22.214.171.124/23/eth0:5 bancuv3
heartbeat: 2008/08/27_10:43:59 info: Running /etc/ha.d/resource.d/IPaddr 126.96.36.199/23/eth0:5 start
heartbeat: 2008/08/27_10:44:00 info: /sbin/ifconfig eth0:5:0 188.8.131.52 netmask 255.255.254.0 broadcast 184.108.40.206
heartbeat: 2008/08/27_10:44:00 info: Sending Gratuitous Arp for 220.127.116.11 on eth0:5:0 [eth0]
heartbeat: 2008/08/27_10:44:00 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-18.104.22.168 eth0 22.214.171.124 auto 126.96.36.199 ffffffffffff
heartbeat: 2008/08/27_10:44:00 info: Running /etc/ha.d/resource.d/bancuv3 start
heartbeat: 2008/08/27_10:44:04 info: all HA resource acquisition completed (standby).
heartbeat: 2008/08/27_10:44:04 info: Standby resource acquisition done [all].
Active/Passive HA architecture
• Additional system tuning is needed to improve recovery times:
• Use a journaling filesystem (xfs, ext3, etc.)
• To facilitate automatic DB recovery after a system crash, additional configuration is needed (in the case of DB2: AUTORESTART, LOGRETAIN, DB2_USE_PARALLEL_RECOVERY, ...)
• Database recovery can be a time-consuming task (not deterministic)
• The drawback of this architecture is that the secondary site's computational resources sit idle waiting for a failure of the primary site
• A better alternative consists of balancing the execution of DB instances or other VMs between both sites
• In the event of a contingency at one site, the VMs are migrated from the affected site to the available one
• VM migration consists of stopping the VM on the affected site and starting it on the available resources
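The DB2 settings mentioned above can be applied from the instance owner's shell. A sketch - the database name JOSEP comes from an earlier slide; verify the exact parameter names against your DB2 release:

```
# Crash recovery starts automatically at the first connect
db2 update db cfg for JOSEP using AUTORESTART ON

# Keep logs for roll-forward recovery instead of circular logging
db2 update db cfg for JOSEP using LOGRETAIN RECOVERY

# Registry variable: replay logs in parallel during crash recovery
db2set DB2_USE_PARALLEL_RECOVERY=YES
db2stop && db2start     # instance restart to pick up the registry change
```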
Active/Active HA architecture (II)
bresca:~# xm list
Name       ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0    0      1209      2  r-----  66538.2
bancuv3    13      1024      1  -b----   3286.8
webges06    6       256      1  -b----     83.6
colmena:~# xm list
Name       ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0    0      2575      4  r-----    492.2
rac2        1      1024      1  -b----    458.3
webges05    2       256      1  -b----     69.7
Active/Active HA architecture (III)
• In the event of a failure, Heartbeat migrates the VMs to the resources available at that point in time
• The load of the surviving site will be increased (x2)
Active/Active HA architecture (IV)
[Diagram: VMs MV1-MV4 spread across Xen servers, each backed by its own distributed RAID-1 block device (/dev/drbd4, ...) over FC storage.]
• After a failure at one of the two sites, the load of the surviving site will be increased by a factor of two
• We will be up & running, but with worse response times
• Once the primary site's HW & SW resources have been recovered, the load is redistributed automatically
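Since DRBD 0.7 allows only one primary per resource, redistributing a DRBD-backed VM is a cold move rather than a live migration. A sketch using the host and VM names from earlier slides (the r1 resource name is an assumption):

```
# On the site currently carrying the doubled load:
xm shutdown -w bancuv3          # stop the VM, wait for clean shutdown
drbdadm secondary r1            # release the mirror

# On the recovered site:
drbdadm primary r1              # take over the mirror
xm create /etc/xen/bancuv3.cfg  # boot the VM on its home resources
```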
What we have learned
• To deploy HA architectures for DB-based information systems that automatically detect and recover from errors in the runtime (HW, SW) needed to run corporate applications
• To select the HA architecture for databases (active/passive, active/active, single system image) that best fits the business's SLA
• To automate the major steps involved in detecting and recovering from errors in a given component of the DB runtime
• To configure & use the open-source tools (Xen, Heartbeat, DRBD, OpenSSI) needed to implement high availability architectures