
OpenVMS Distributed Lock Manager Performance




Presentation Transcript


  1. OpenVMS Distributed Lock Manager Performance Session ES-09-U Keith Parris HPQ

  2. Background • VMS system managers have traditionally looked at performance in 3 areas: • CPU • Memory • I/O • But in VMS clusters, what may appear to be an I/O bottleneck can actually be a lock-related issue

  3. Overview • VMS keeps some lock activity data that no existing performance management tools look at • Locking statistics and lock-related symptoms can provide valuable clues in detecting disk, adapter, or interconnect saturation problems

  4. Overview • The VMS Lock Manager does an excellent job under a wide variety of conditions to optimize locking activity and minimize overhead, but: • In clusters with identical nodes running the same applications, remastering can sometimes happen too often • In extremely large clusters, nodes can “gang up” on lock master nodes and overload them • Locking activity can contribute to: • CPU 0 saturation in Interrupt State • Spinlock contention (Multi-Processor Synchronization time) • We’ll look at methods for detecting these types of problems, and at solutions to them

  5. Topics • Available monitoring tools for the Lock Manager • How to map VMS symbolic lock resource names to real physical entities • Lock request latencies • How to measure lock rates

  6. Topics • Lock mastership, and why one might care about it • Dynamic lock remastering • How to detect and prevent lock mastership thrashing • How to find the lock master node for a given resource tree • How to force lock mastership of a given resource tree to a specific node

  7. Topics • Lock queues, their causes, and how to detect them • Examples of problem locking scenarios • How to measure pent-up remastering demand

  8. Monitoring tools • MONITOR utility • MONITOR LOCK • MONITOR DLOCK • MONITOR RLOCK (in VMS 7.3 and above; not 7.2-2) • MONITOR CLUSTER • MONITOR SCS • SHOW CLUSTER /CONTINUOUS • DECamds / Availability Manager • DECps (Computer Associates’ Unicenter Performance Management for OpenVMS, earlier Advise/IT)
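As a quick usage sketch (interval value is only illustrative), the lock-related MONITOR classes can be watched live with commands along these lines:
  $ MONITOR DLOCK /INTERVAL=5    ! Distributed lock management statistics
  $ MONITOR RLOCK /INTERVAL=5    ! Dynamic lock remastering statistics (VMS 7.3 and above)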

  9. Monitoring tools • ANALYZE/SYSTEM • New SHOW LOCK qualifiers for VMS 7.2 and above: • /WAITING • Displays only the waiting lock requests (those blocked by other locks) • /SUMMARY • Displays summary data and performance counters • New SHOW RESOURCE qualifier for VMS 7.2 and above: • /CONTENTION • Displays resources which are under contention
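A short interactive sketch of how these qualifiers might be combined when hunting for contention (interpretation of the output is up to the reader):
  $ ANALYZE/SYSTEM
  SDA> SHOW RESOURCE/CONTENTION   ! Resources currently under contention
  SDA> SHOW LOCK/WAITING          ! Lock requests blocked by other locks
  SDA> SHOW LOCK/SUMMARY          ! Summary data and performance counters
  SDA> EXIT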

  10. Monitoring tools • ANALYZE/SYSTEM • New SDA extension LCK for lock tracing in VMS 7.2-2 and above • SDA> LCK !Shows help text with command summary • Can display various additional lock manager statistics: • SDA> LCK STATISTIC !Shows lock manager statistics • Can show busiest resource trees by lock activity rate: • SDA> LCK SHOW ACTIVE !Shows lock activity • Can trace lock requests: • SDA> LCK LOAD !Load the debug execlet • SDA> LCK START TRACE !Start tracing lock requests • SDA> LCK STOP TRACE !Stop tracing • SDA> LCK SHOW TRACE !Display contents of trace buffer • Can even trigger remaster operations: • SDA> LCK REMASTER !Trigger a remaster operation

  11. Mapping symbolic lock resource names to real entities • Techniques for mapping resource names to lock types • Common prefixes: • SYS$ for the VMS executive • F11B$ for the XQP (file system) • RMS$ for Record Management Services • See Appendix H in the Alpha V1.5 Internals and Data Structures Manual (IDSM), or Appendix A in the Alpha V7.0 version

  12. Resource names • Example: XQP File Serialization Lock • Resource name format is • “F11B$s” {Lock Basis} • Parent lock is the Volume Allocation Lock “F11B$v” {Lock Volume Name} • Calculate File ID from Lock Basis • Lock Basis is RVN and File Number from File ID (ignoring Sequence Number), packed into 1 longword • Identify disk volume from parent resource name

  13. Resource names • Identifying file from File ID • Look at file headers in Index File to get filespec: • Can use DUMP utility to display file header (from Index File) • $ DUMP /HEADER /IDENTIFIER=(file_id) /BLOCK=COUNT=0 disk:[000000]INDEXF.SYS • Follow directory backlinks to determine directory path • See example procedure FILE_ID_TO_NAME.COM • (or use LIB$FID_TO_NAME routine to do all this, if sequence number can be obtained)
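As an illustration, using the file from the later example (File ID [70,19569,0] on volume SS1; the values are only for demonstration), the header could be dumped with:
  $ DUMP /HEADER /IDENTIFIER=(70,19569,0) /BLOCK=COUNT=0 DISK$SS1:[000000]INDEXF.SYS
The header display includes the file name and the directory back link needed to build the full path.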

  14. Resource names • Example: RMS lock tree for an RMS indexed file: • Resource name format is • “RMS$” {File ID} {Flags byte} {Lock Volume Name} • Identify filespec using File ID • Flags byte indicates shared or private disk mount • Pick up disk volume name • This is the volume label as of the time the disk was mounted • Sub-locks are used for buckets and records within the file

  15. Internal Structure of an RMS Indexed File

  16. RMS Data Bucket Contents {Diagram: a data bucket containing multiple data records}

  17. RMS Indexed File Bucket and Record Locks • Sub-locks of RMS File Lock • Have to look at Parent lock to identify file • Bucket lock: • 4 bytes: VBN of first block of the bucket • Record lock: • 8 bytes (6 on VAX): Record File Address (RFA) of record

  18. Locks and File I/O • Lock requests and data transfers for a typical RMS indexed file I/O (prior to 7.2-1H1): 1) Lock & get root index bucket 2) Lock & get index buckets for any additional index levels 3) Lock & get data bucket containing record 4) Lock record 5) For writes: write data bucket containing record Note: Most data reads may be avoided thanks to RMS global buffer cache

  19. Locks and File I/O • Since all indexed I/Os access Root Index Bucket, contention on lock for Root Index Bucket of hot file can be a bottleneck • Lookup by Record File Address (RFA) avoids index lookup on 2nd and subsequent accesses to a record

  20. Lock Request Latencies • Latency depends on several things: • Directory lookup needed or not • Local or remote directory node • $ENQ or $DEQ operation • Local or remote lock master • If remote, type of interconnect

  21. Directory Lookups • This is how VMS finds out which node is the lock master • Only needed for 1st lock request on a particular resource tree on a given node • Resource Block (RSB) remembers master node CSID • Basic conceptual algorithm: Hash resource name and index into lock directory vector, which has been created based on LOCKDIRWT values

  22. Lock Request Latencies • Local requests are fastest • Remote requests are significantly slower: • Code path ~20 times longer • Interconnect also contributes latency • Total latency up to 2 orders of magnitude higher than local requests

  23. Lock Request Latency • Client process on same node: 4-6 microseconds {Diagram: client process and lock master on the same node}

  24. Lock Request Latency • Client across CI star coupler: 440 microseconds {Diagram: client node and lock master node connected through a CI star coupler, with attached storage}

  25. Lock Request Latencies

  26. How to measure lock rates • VMS keeps counters of lock activity for each resource tree • but not for each of the sub-resources • So you can see the lock rate for an RMS indexed file, for example • but not for individual buckets or records within that file • SDA extension LCK can trace all lock requests if needed

  27. Identifying busiest lock trees in the cluster with a program • Measure lock rates based on RSB data: • Follow chain of root RSBs from LCK$GQ_RRSFL listhead via RSB$Q_RRSFL links • Root RSBs contain counters: • RSB$W_OACT: Old activity field (average lock rate per 8 second interval) • Divide by 8 to get per-second average • RSB$W_NACT: New activity (locks so far within current 8-second interval) • Transient value, so not as useful

  28. Identifying busiest lock trees in the cluster with a program • Look for non-zero OACT values: • Gather resource name, master node CSID, and old-activity field • Do this on each node • Summarize data across the cluster • See example procedure LOCK_ACTV.COM and program LCKACT.MAR • Or, for VMS 7.2-2 and above: • SDA> LCK SHOW ACTIVE • Note: Per-node data, not cluster-wide summary
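A minimal per-node collection sketch for VMS 7.2-2 and above (run from a command procedure on each cluster member so the line after ANALYZE/SYSTEM is fed to SDA; the cluster-wide summary still has to be merged separately, e.g. with a procedure like LOCK_ACTV.COM):
  $ ANALYZE/SYSTEM
  LCK SHOW ACTIVE    ! Busiest resource trees, this node only
  EXIT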

  29. Lock Activity Program Example
  0000002020202020202020203153530200004C71004624534D52 RMS$F.qL...SS1 ...
  RMS lock tree for file [70,19569,0] on volume SS1
  File specification: DISK$SS1:[DATA8]PDATA.IDX;1
  Total: 11523
  *XYZB12 6455
   XYZB11 746
   XYZB14 611
   XYZB15 602
   XYZB23 564
   XYZB13 540
   XYZB19 532
   XYZB16 523
   XYZB20 415
   XYZB22 284
   XYZB18 127
   XYZB21 125
  * Lock Master Node for the resource
  {This is a fairly hot file. Here the lock master node is optimal.}

  30. Lock Activity Program Example
  0000002020202032454C494653595302000000D3000C24534D52 RMS$.......SYSFILE2 ...
  RMS lock tree for file [12,211,0] on volume SYSFILE2
  File specification: DISK$SYSFILE2:[SYSFILE2]SYSUAF.DAT;5
  Total: 184
   XYZB16 75
   XYZB20 48
   XYZB23 41
   XYZB21 16
   XYZB19 2
  *XYZB15 1
   XYZB13 1
   XYZB14 0
   XYZB12 0
  {This reflects user logins, process creations, password changes, and such. Note the poor lock master node selection here (XYZB16 would be optimal).}

  31. Example: Application (re)opens file frequently • Symptom: High lock rate on File Access Arbitration Lock for application data file • Cause: BASIC program re-executing OPEN command for a file; BASIC dutifully closes and then re-opens file • Fix: Modify BASIC program to execute OPEN statement only once at image startup time

  32. Lock Activity Program Example
  00000016202020202020202031505041612442313146 F11B$aAPP1 ....
  Files-11 File Access Arbitration lock for file [22,*,0] on volume APP1
  File specification: DISK$APP1:[DATA]XDATA.IDX;1
  Total: 50
  *XYZB15 8
   XYZB21 7
   XYZB16 7
   XYZB19 6
   XYZB20 6
   XYZB23 6
   XYZB18 5
   XYZB13 3
   XYZB12 1
   XYZB22 1
   XYZB14 1
  {This shows where the application is apparently opening (or re-opening) this particular file 50 times per second.}

  33. Lock Mastership (Resource Mastership) concept • One lock master node is selected by VMS for a given resource tree at a given time • Different resource trees may have different lock master nodes

  34. Lock Mastership (Resource Mastership) concept • Lock master remembers all locks on a given resource tree for the entire cluster • Each node holding locks also remembers the locks it is holding on resources, to allow recovery if lock master node dies

  35. Lock Mastership • Lock mastership node may change for various reasons: • Lock master node goes down -- new master must be elected • VMS may move lock mastership to a “better” node for performance reasons • LOCKDIRWT imbalance found, or • Activity-based Dynamic Lock Remastering • Lock Master node no longer has interest

  36. Lock Remastering • Circumstances under which remastering occurs, and does not: • LOCKDIRWT values • VMS tends to remaster to node with higher LOCKDIRWT values, never to node with lower LOCKDIRWT • Shifting initiated based on activity counters in root RSB • PE1 parameter being non-zero can prevent movement or place threshold on lock tree size • Shift if existing lock master loses interest

  37. Lock Remastering • VMS rules for dynamic remastering decision based on activity levels: • assuming equal LOCKDIRWT values • 1) Must meet general threshold of 80 lock requests so far (LCK$GL_SYS_THRSH) • 2) New potential master node must have at least 10 more requests per second than current master (LCK$GL_ACT_THRSH)

  38. Lock Remastering • VMS rules for dynamic remastering: • 3) Estimated cost to move (based on size of lock tree) must be less than estimated savings (based on lock rate) • except if new master meets criteria (2) for 3 consecutive 8-second intervals, cost is ignored • 4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)

  39. Lock Remastering • VMS rules for dynamic remastering: • 5) If PE1 on the current master has a negative value, remastering trees off the node is disabled • 6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
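As a purely hypothetical illustration of these rules: suppose node A currently masters a tree and generates 100 lock requests per second against it, while node B generates 115 per second. The general 80-request threshold is met and B exceeds A by more than 10 per second, so rules 1 and 2 are satisfied. If the tree is very large, rule 3 may still veto the move on cost grounds; but if B stays at least 10 per second ahead for 3 consecutive 8-second intervals, the cost test is waived and the tree moves to B (assuming equal LOCKDIRWT values and PE1 settings on node A that permit the move).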

  40. Lock Remastering • Implications of dynamic remastering rules: • LOCKDIRWT must be equal for lock activity levels to control choice of lock master node • PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node • RSB stores lock activity counts, so even high activity counts can be lost if the last lock is DEQueued on a given node and thus the RSB gets deallocated

  41. Lock Remastering • Implications of dynamic remastering rules: • With two or more large CPUs of equal size running the same application, lock mastership “thrashing” is not uncommon: • 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second • Whichever new node becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload

  42. Lock Remastering • Lock mastership thrashing results in user-visible delays • Lock operations on a tree are stalled during a remaster operation • Locks and resources were sent one per SCS message • Remastering large lock trees could take a long time • e.g. 10 to 50 seconds for a 15K-lock tree, prior to 7.2-2 • Improvement in VMS version 7.2-2 and above gives a very significant performance gain • by using 64 Kbyte block data transfers instead of sending 1 SCS message per RSB or LKB

  43. How to Detect Lock Mastership Thrashing • Detection of remastering activity • MONITOR RLOCK in 7.3 and above (not 7.2-2) • SDA> SHOW LOCK/SUMMARY in 7.2 and above • Change of mastership node for a given resource • Check message counters under SDA: • SDA> EXAMINE PMS$GL_RM_RBLD_SENT • SDA> EXAMINE PMS$GL_RM_RBLD_RCVD • Counts which increase suddenly by a large amount indicate remastering of large tree(s) • SENT: Off of this node • RCVD: Onto this node • See example procedures WATCH_RBLD.COM and RBLD.COM
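A rough sketch of a periodic check along the lines of WATCH_RBLD.COM (the interval is arbitrary; run as a command procedure so the EXAMINE lines are passed to SDA):
  $ LOOP:
  $ ANALYZE/SYSTEM
  EXAMINE PMS$GL_RM_RBLD_SENT    ! Rebuild messages sent: trees moving off this node
  EXAMINE PMS$GL_RM_RBLD_RCVD    ! Rebuild messages received: trees moving onto this node
  EXIT
  $ WAIT 00:01:00
  $ GOTO LOOP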

  44. How to Prevent Lock Mastership Thrashing • Thrashing is unlikely when the nodes are not evenly matched; any of the following asymmetries tends to prevent it: • Unbalanced node power • Unequal workloads • Unequal values of LOCKDIRWT • Non-zero values of PE1

  45. How to find the lock master node for a given resource tree • 1) Take out a Null lock on the root resource using $ENQ • VMS does directory lookup and finds out master node • 2) Use $GETLKI to identify the current lock master node’s CSID and the lock count • If the local node is the lock master, and the lock count is 1 (i.e. only our NL lock), there’s no interest in the resource now

  46. How to find the lock master node for a given resource tree • 3) $DEQ to release the lock • 4) Use $GETSYI to translate the CSID to an SCS Nodename • See example procedure FINDMASTER_FILE.COM and program FINDMASTER.MAR, which can find the lock master node for RMS file resource trees
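Step 4 can even be done from DCL once the CSID is known; a small sketch with a made-up CSID value:
  $ CSID = %X00010023                       ! Hypothetical CSID obtained via $GETLKI
  $ WRITE SYS$OUTPUT F$GETSYI("NODENAME",,CSID)
Steps 1 through 3 need a small program (such as the example FINDMASTER.MAR), since $ENQ, $GETLKI, and $DEQ have no direct DCL equivalents.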

  47. Controlling Lock Mastership • Lock Remastering is a good thing • Maximizes the number of lock requests which are local (and thus fastest) by trying to move lock mastership of a tree to the node with the most activity on that tree • So why would you want to wrest control of lock mastership away from VMS? • Spread lock mastership workload more evenly across nodes to help avoid saturation of any single lock master node • Provide best performance for a specific job by guaranteeing local locking for its files

  48. How to force lock mastership of a resource tree to a specific node • 3 ways to induce VMS to move a lock tree: 1) Generate a lot of I/Os • For example, run several copies of a program that rapidly accesses the file 2) Generate a lot of lock requests • without the associated I/O operations 3) Generate the effect of a lot of lock requests without actually doing them • by modifying VMS’ data structures

  49. How to force lock mastership of a resource tree to a specific node • We’ll examine: • 1) Method using documented features • thus fully supported • 2) Method modifying VMS data structures

  50. Controlling Lock Mastership Using Supported Methods • To move a lock tree to a particular node (non-invasive method): • Assume PE1 non-zero on all nodes to start with • 1) Set PE1 to 0 on existing lock master node to allow dynamic lock remastering of tree off that node • 2) Set PE1 to negative value (or small positive value) on target node to prevent lock tree from moving off of it afterward
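Since PE1 is a dynamic SYSGEN parameter, both steps can be done on a running system; a sketch of step 1 on the current lock master node (step 2 on the target node is the same except for the value, e.g. -1):
  $ RUN SYS$SYSTEM:SYSGEN
  SYSGEN> USE ACTIVE
  SYSGEN> SET PE1 0
  SYSGEN> WRITE ACTIVE
  SYSGEN> EXIT
Note that such changes are lost at reboot unless also written to CURRENT (and captured in MODPARAMS.DAT for AUTOGEN).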
