
Massively Parallel Database Dump and Reload



Presentation Transcript


  1. Massively Parallel Database Dump and Reload By Frank Spaniak US Tsubaki 301 East Marquardt Drive Wheeling IL 60609 FrankSpaniak@ustsubaki.com

  2. Introduction
UST - Senior Systems Administrator; InterParking (GE); WWI - President, Software and Management Consulting.
Progress since 1984 (version 3.2j). Basic before it was called Visual Basic (1980, yikes!).
Some current / past clients and project involvement:
• General Parking (now InterParking)
• Remanufacturing Division of Ford
• National Safety Council
• Americania De Finzas (chain of banks in Mexico)
• Board of Directors, Artisan Corp

  3. Tsubakimoto Corporation (or my beloved workplace)
• Founded in 1917
• The most comprehensive product line in the industry
• Power transmission products
• Network of sales & service outlets in 70 countries spanning the globe
• Consolidated net sales of over $900 million
• Major affiliates in the US, Canada, Europe, Singapore, Taiwan and Australia
• Vital production plants in Japan, United States, Canada, Taiwan, Singapore and Korea
• Employs more than 4,500 people globally
• Using QAD since 1991
• Originally a VMS shop until about 6 years ago

  4. The Setting (or My Inheritance, before taxes)
HP 9000 N-class (Earth): 4 x 350 MHz processors, 6 GB memory, HP-UX 11.0
350-400 users, of which 250 are on a wide area network
29 different databases, including QAD and several custom databases
Running QAD 9.0 SP3 (V8) and eB2 SP4 (V9)
The largest QAD database was 24 GB (48 extents)
24x6½ operation over 13 locations and plants in North America

  5. What the lack of a dump/load caused (or why we kept working late)
Unable to perform a lot of archiving and deleting. Couldn't archive/delete even a single day's invoices without the process crashing; 72 hours of processing to delete one day's work (that's a whole other story). We had 9 years of payables records to wade through.
Nightly Corvu processing was starting at 3 am and finishing about 3 pm the next day.
Full MRP was 1 hour 20 minutes.
Month-end processing congestion was unbearable; it was taking 3 days to do.
Delays on sub-systems (AIM, shipping manifesting) were 50-70 seconds, seriously hurting production.
Freedom data collection download took 8 hours over the WAN to complete.
It was clear to all the users that processing times were becoming unworkable.

  6. And now for some history (or Gameboy anyone?)
Several QAD databases were consolidated in late 2003, but no dump and reload was done.
A dump/load was attempted about Dec. 2003 (it had been 1.5 years since the prior dump and load), but it was unsuccessful.
As of July 2006, it had been 3½ years since the last dump and load.
Changes in production planning closed the window of work from long weekends to just 12 hours at the outside; 8 hours was the target.

  7. What's a DBA to do? (or is it Friday yet?)
Several options were considered to improve conditions:
• Shipping a backup of the database to another site to perform the dump/load on, because it was a faster box: 24 hours and a lot of $$.
• Building a box at the HP service center in Chicago and bringing a tape there to do the dump and reload: 12+ hours and security problems.
These options all fell outside the 8-12 hour window.
The constraints that the IT department operated under:
• Small time window for the actual dump and load process: < 8 hours; the shorter the better.
• The process had to be provable before running on production.
• Expenses were tight, so we could not purchase software to do this.

  8. Initial Assessment (or did someone paint a target on me?)
Several tests had been performed over the course of several weeks to gain metrics on the time needed for a dump and load. Since exclusive access was in short supply, we had to perform testing in pieces on a test database, then add the times together.
The time to dump/load the main production database was approximately 4 days with exclusive access to the machine (binary dump and load), consistent with previous attempts.
No other system in the house at that time had the space / horsepower to do the dump and load.

  9. Initial Assessment (or did someone paint a target on me?)
Found that we had an overall database scatter factor of 5.5; many tables were much worse.
Approximately 97 million records totaling 24 GB.
We would try to do as much AR archiving and deleting as possible before the dump and reload.
Bringing the databases down for any extended period of time, for all the plants and processes, was a significant coordinated effort; like herding cats.

  10. Initial Assessment (continued) (or Yes they did)
Over this process, we purchased a replacement for Earth: an HP RP 4440 (Samson), 8 x 1 GHz processors, 16 GB memory, HP-UX 11i, 2 x 170 GB mirrored internal drives, and 0.5 TB of external storage over gigabit fiber to our SAN (EVA).
This went a long way toward being able to test and prevent many sins from happening.

  11. Summary of Systems
Earth: HP 9000 N-class, 4 x 350 MHz, 6 GB memory
Samson: RP 4440, 6 x 1 GHz, 16 GB memory
SAN: 2.1 TB

  12. The Action Plan (or Who could we find to blame?)
Using Samson as the testing computer:
• Performed a dump and reload with the new box; it was still taking over 8 hours for the main database.
• Performed the dump/load for the testing system and put the users on it for several weeks to verify the data, then started working on speeding up the process.
• While the dump and reload was going on, prepped the systems on Earth by building the empty extents to restore the backups into.
• Streamlined the dump and reload to about 7 hours through automation and some parallel operation, automating several of the commands to dump and reload the db. Very little wiggle room for problems.

  13. Discoveries made (or You gotta love those computer toys)
The SAN drives were significantly faster than the internal drives: roughly 480 MB/sec transfer speed internally vs. 1000 MB/sec over to the SAN. Since the db was so fragmented, the machine was sitting in I/O wait during the dump while the fiber line to the SAN was relatively quiet. The question became: how can we make this run at top speed consistently?
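A quick way to confirm this sort of imbalance (not shown in the presentation; the mount points and sizes below are examples only) is to time a raw sequential write against each device and watch the I/O-wait column while a dump is running:

# time a 1 GB sequential write to the internal drives, then to the SAN
timex dd if=/dev/zero of=/internal/ddtest bs=1024k count=1024
timex dd if=/dev/zero of=/san/ddtest bs=1024k count=1024
# watch CPU vs. %wio while a dump thread runs (HP-UX sar, 5-second samples)
sar -u 5 60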

  14. Required UNIX commands and setup (or Say what?)
A reasonably current ksh or sh, and the commands pwd, wc and head.
The environment is set up in bin/set_progress.sh, which sets the DLC and PROPATH variables, sets the RUN_DIR and DONE_DIR variables, and updates the PATH for Progress and the local bin directory.
Startup is via dumpctl.sh, which verifies that the environment settings are what they are supposed to be and then runs the application (a sketch of such a wrapper follows below).
A server must be running on the database to be dumped.
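Only the role of dumpctl.sh is described on the slide, so the following is a minimal ksh sketch of such a startup wrapper, not the original script; the control database name ("dumpctl") and the _progres session invocation are assumptions:

#!/usr/bin/ksh
#
# dumpctl.sh - hypothetical startup wrapper (reconstruction)
#
# record where we started; set_progress.sh expects BASEDIR from its caller
export BASEDIR=$(pwd)
# establish the Progress environment and marker directories
. ${BASEDIR}/bin/set_progress.sh
# set_progress.sh reports any problem through XERROR; stop here if it is set
if [ -n "${XERROR}" ]
then
    echo "dumpctl.sh: ${XERROR}"
    exit 1
fi
# run the character-mode control application against the dump control database
${DLC}/bin/_progres dumpctl -p /prod/dump/dump_control.p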

  15. Lay of the Land (or Playtime in UNIX)
Dump Control DB Application: /prod/dump/dump_control.p
Script Directory: /prod/dump/bin
  set_progress.sh - sets environment
  dmp_get_done.sh - looks for processes that are done (see the sketches below)
  dmp_get_rec.sh - gets 1 (one) done file marker (see the sketches below)
  dmp_start_thread.sh - starts 1 (one) dump thread
Run Directory: /tmp/run - holds run markers
Done Directory: /tmp/done - holds done markers
Target DB
Target Directory: /backups/dump
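set_progress.sh and dmp_start_thread.sh appear on the next two slides; dmp_get_done.sh and dmp_get_rec.sh are only named here, so these are minimal ksh sketches of what they plausibly do under the marker-file scheme above, not the original scripts:

bin/dmp_get_done.sh (sketch):
#!/usr/bin/ksh
#
# report how many dump threads have finished, by counting the done markers
#
. ./bin/set_progress.sh
ls ${DONE_DIR} 2>/dev/null | wc -l

bin/dmp_get_rec.sh (sketch):
#!/usr/bin/ksh
#
# hand back one finished table to the control program and consume its marker
#
. ./bin/set_progress.sh
DONE=$(ls ${DONE_DIR} 2>/dev/null | head -1)
if [ -n "${DONE}" ]
then
    rm ${DONE_DIR}/${DONE}
    echo ${DONE%.done}
fi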

  16. bin/set_progress.sh :
#!/usr/bin/ksh
#
# Establish Progress location and environment used by app (fjs)
#
# temp dirs to store the run and done markers while running
export RUN_DIR=/tmp/run
export DONE_DIR=/tmp/done
# Progress
export DLC=/dlc/83b22
export PROPATH=$BASEDIR:$DLC
# check dirs are valid
[ ! -d ${RUN_DIR} ] && export XERROR="Run Marker directory $RUN_DIR is not valid"
[ ! -d ${DONE_DIR} ] && export XERROR="Done Marker directory $DONE_DIR is not valid"
[ ! -d ${DLC} ] && export XERROR="$DLC directory is not valid"
# add Progress and the local bin dir to path
# BASEDIR is passed from startup script
export PATH=$DLC:$DLC/bin:$BASEDIR/bin:$PATH

  17. bin/dmp_start_thread.sh :
#!/usr/bin/ksh
#
# start dump thread (fjs)
#
# Parameters are
#  1 db name to dump
#  2 filename to dump
#  3 dump directory
. ./bin/set_progress.sh
$DLC/bin/proutil ${1} -C dump ${2} ${3} > /dev/null 2>&1
# indicate file done
mv ${RUN_DIR}/${2}.run ${DONE_DIR}/${2}.done
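The fan-out itself is driven by the Progress control program (dump_control.p), which is not reproduced in the transcript. Purely to illustrate the run/done marker mechanics, a ksh-only dispatcher that keeps a fixed number of dump threads going might look like this; the database path, the table.list input file, and the thread count are assumptions, and this is not the author's dump_control.p:

#!/usr/bin/ksh
#
# hypothetical dispatcher sketch - NOT the author's dump_control.p
#
export BASEDIR=$(pwd)
. ${BASEDIR}/bin/set_progress.sh
DB=/db/prod/proddb          # assumed path of the database to dump
DUMPDIR=/backups/dump       # target directory from the Lay of the Land slide
MAXTHREADS=40               # slide 24 shows 40 dumps running at once

for TBL in $(cat table.list)                 # assumed file: one table name per line
do
    # throttle: wait until a slot frees up before starting another thread
    while [ $(ls ${RUN_DIR} 2>/dev/null | wc -l) -ge ${MAXTHREADS} ]
    do
        sleep 5
    done
    touch ${RUN_DIR}/${TBL}.run              # run marker the thread will move when done
    ${BASEDIR}/bin/dmp_start_thread.sh ${DB} ${TBL} ${DUMPDIR} &
done
wait                                         # block until every background dump returns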

  18. Main Screen

  19. #1 Database Group And Tables

  20. #1 Adding a new group

  21. #1 Adding a new group (continued) Based on the database information entered, this will read the schema into the control group. If you select Yes, it will read in the table names and set the default dump location for each table.

  22. #1 Adding a new group If Yes, a list of all the database tables is displayed and you can select a table to override its dump location.

  23. #2 Doing the Database Dump When you select #2 Run Dump Procedure, you are prompted for the database group to run. Once the group is selected, a recap screen is shown and the data dump proceeds.

  24. #2 The Database Dump Running 40 data dumps running at once! Be careful of bus bandwidth saturation! YMMV

  25. #3 & 4 Generate Load and Index rebuild script • These have no display output and simply generate scripts • Load-script.sh for the database load • Index-rebuild.sh for the index rebuild phase • This is done for V8 Compatibility they do not have to be run is sequence; the load and index scripts can be generated before hand to put in automated procedures.

  26. The Result (or How I Got a Day off work)
Reduced the database dump time to 45 minutes, with load and index rebuild at 1:35; a little over 4.5 hours end to end, down from 7. An overall gain of about 40%.
Nightly Corvu processing finishing about 5 am (2 hours).
Full MRP was reduced to about 5 minutes per site.
Month-end processing was finishing by mid-afternoon of the first day.
Support sub-system response is 1-3 seconds.
Freedom download taking 3 hours over the WAN.
Eliminated 12 extents and still left 5 empty ones.

  27. The Result - continued (or How I Got a Day off work … with pay)
• With the reduced processing time we were further able to reduce the database size with archiving and deleting, leaving more time for:

  28. The Result (or What I did on my day off ) Golf And Beer

  29. The Result (or This space for rent ) Questions ?
