Session OS-23: Monitoring and Controlling the VMS Lock Manager

Presentation Transcript


  1. Session OS-23: Monitoring and Controlling the VMS Lock Manager. Keith Parris, Wednesday, 9 May 2001

  2. Background • VMS system managers have traditionally looked at performance in 3 areas: • CPU • Memory • I/O • But in VMS clusters, what may appear to be an I/O bottleneck can actually be a lock-related issue

  3. Overview • VMS keeps some lock activity data that no existing performance tools look at • Locking statistics and lock-related symptoms can provide valuable clues in detecting disk, adapter, or interconnect saturation problems

  4. Overview • The VMS Lock Manager does an excellent job under a variety of conditions to optimize locking activity and minimize overhead, but: • In clusters with identical nodes running the same applications, remastering can sometimes happen too often • In extremely large clusters, nodes can “gang up” on lock master nodes and overload them • Locking activity can contribute to CPU saturation • Particularly CPU 0 in Interrupt State

  5. Topics • Available monitoring tools for the Lock Manager • How to map VMS symbolic lock resource names to real physical entities • Lock request latencies • How to measure lock rates

  6. Topics • Lock mastership, and why one might care about it • Dynamic lock remastering • How to detect and prevent lock mastership thrashing • How to find the lock master node for a given resource tree • How to force lock mastership of a given resource tree to a specific node

  7. Topics • Lock queues, their causes, and how to detect them • Examples of problem locking scenarios • How to measure pent-up remastering demand

  8. Monitoring tools • MONITOR utility • MONITOR LOCK • MONITOR DLOCK • MONITOR RLOCK (new for VMS 7.3) • MONITOR CLUSTER • MONITOR SCS • SHOW CLUSTER /CONTINUOUS • DECamds / Availability Manager • DECps (Advise/IT)
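For example, to watch local and distributed lock rates side by side (a minimal usage sketch; the interval value is arbitrary, and MONITOR RLOCK requires VMS 7.3):

    $ MONITOR LOCK, DLOCK /INTERVAL=5
    $ MONITOR RLOCK    ! VMS 7.3 and later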

  9. Monitoring tools • ANALYZE/SYSTEM • New SHOW LOCK qualifiers for VMS 7.2: • /WAITING • Displays only the waiting lock requests (those blocked by other locks) • /SUMMARY • Displays summary data and performance counters • New SHOW RESOURCE qualifier for VMS 7.3: • /CONTENTION • Displays resources which are under contention • New SDA extension LCK for lock tracing in 7.3 (as-yet undocumented)
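A typical interactive session using the qualifiers above might look like this (a sketch with output omitted; SHOW RESOURCE /CONTENTION requires VMS 7.3):

    $ ANALYZE/SYSTEM
    SDA> SHOW LOCK /WAITING
    SDA> SHOW LOCK /SUMMARY
    SDA> SHOW RESOURCE /CONTENTION
    SDA> EXIT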

  10. Mapping symbolic lock resource names to real entities • Techniques for mapping resource names to lock types • Common prefixes: • SYS$ for the VMS executive • F11B$ for the XQP (file system) • RMS$ for Record Management Services • See Appendix H of the Alpha V1.5 Internals and Data Structures Manual (IDSM), or Appendix A in the Alpha V7.0 version

  11. Resource names • Example: XQP File Serialization Lock • Resource name format is • “F11B$s” {Lock Basis} • Parent lock is the Volume Allocation Lock “F11B$v” {Lock Volume Name} • Calculate File ID from Lock Basis • Lock Basis is RVN and File Number from File ID (ignoring Sequence Number), packed into 1 longword • Identify disk volume from parent resource name
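As a concrete illustration, here is a minimal C sketch that unpacks a Lock Basis longword into its components. It assumes the packing described above, with the low 24 bits holding the file number (including the NMX extension byte) and the high byte holding the RVN; verify the exact layout against the IDSM appendix for your VMS version.

    #include <stdio.h>

    int main(void)
    {
        /* Lock Basis longword as it appears in an F11B$ resource name dump;
           this example value (file number 22, RVN 0) matches slide 31 */
        unsigned int lock_basis = 0x00000016;

        unsigned int file_num = lock_basis & 0x00FFFFFF;  /* NUM plus NMX extension */
        unsigned int rvn      = lock_basis >> 24;         /* relative volume number */

        /* The Sequence Number is not part of the Lock Basis, hence the '*' */
        printf("File ID [%u,*,%u]\n", file_num, rvn);
        return 0;
    }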

  12. Resource names • Identifying file from File ID • Look at file headers in Index File to get filespec: • Can use DUMP utility to display file header (from Index File) • $ DUMP /HEADER /IDENTIFIER=(file_id) /BLOCK=COUNT=0 disk:[000000]INDEXF.SYS • Follow directory backlinks to determine directory path • See example procedure FILE_ID_TO_NAME.COM • (or use LIB$FID_TO_NAME routine to do all this, if sequence number can be obtained)

  13. Resource names • Example: RMS lock tree for an RMS indexed file: • Resource name format is • “RMS$” {File ID} {Flags byte} {Lock Volume Name} • Identify filespec using File ID • Flags byte indicates shared or private disk mount • Pick up disk volume name • This is the label as of the time the disk was mounted • Sub-locks are used for buckets and records within the file

  14. Internal Structure of an RMS Indexed File

  15. RMS Data Bucket Contents [diagram: a data bucket containing a series of data records]

  16. RMS Indexed File Bucket and Record Locks • Sub-locks of RMS File Lock • Have to look at Parent lock to identify file • Bucket lock: • 4 bytes: VBN of first block of the bucket • Record lock: • 8 bytes (6 on VAX): Record File Address (RFA) of record
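Decoding the sub-lock resource names follows directly from the sizes above. A minimal C sketch, assuming the Alpha sizes, little-endian byte order, and the classic RFA layout of a 4-byte VBN followed by a 2-byte record ID (the remaining Alpha bytes being padding); the byte values are invented for illustration:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Bucket lock: resource name is the 4-byte VBN of the bucket */
        unsigned char bucket_resnam[4] = { 0x09, 0x00, 0x00, 0x00 };
        unsigned int vbn;
        memcpy(&vbn, bucket_resnam, sizeof vbn);
        printf("Bucket lock: VBN %u\n", vbn);

        /* Record lock: resource name is the record's RFA */
        unsigned char record_resnam[8] =
            { 0x09, 0x00, 0x00, 0x00, 0x32, 0x00, 0x00, 0x00 };
        unsigned int rfa_vbn;
        unsigned short rfa_id;
        memcpy(&rfa_vbn, record_resnam, 4);
        memcpy(&rfa_id, record_resnam + 4, 2);
        printf("Record lock: RFA (%u,%u)\n", rfa_vbn, rfa_id);
        return 0;
    }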

  17. Locks and File I/O • Lock requests and data transfers for a typical RMS indexed file I/O (prior to 7.2-1H1): 1) Lock & get root index bucket 2) Lock & get index buckets for any additional index levels 3) Lock & get data bucket containing record 4) Lock record 5) For writes: write data bucket containing record • Note: Most data reads may be avoided thanks to the RMS global buffer cache

  18. Locks and File I/O • Since all indexed I/Os access Root Index Bucket, contention on lock for Root Index Bucket of hot file can be a bottleneck • Lookup by Record File Address (RFA) avoids index lookup on 2nd and subsequent accesses to a record

  19. Lock Request Latencies • Latency depends on several things: • Directory lookup needed or not • Local or remote directory node • $ENQ or $DEQ operation • Local or remote lock master • If remote, type of interconnect

  20. Directory Lookups • This is how VMS finds out which node is the lock master • Only needed for 1st lock request on a particular resource tree on a given node • Resource Block (RSB) remembers master node CSID • Basic conceptual algorithm: Hash resource name and index into lock directory vector, which has been created based on LOCKDIRWT values
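Conceptually, the lookup amounts to the code below. This is an illustrative sketch only, not the actual VMS algorithm: the hash function, vector size, and CSID values are all invented stand-ins; the one faithful idea is that each node appears in the vector in proportion to its LOCKDIRWT, so higher-weighted nodes field more directory lookups.

    #include <stdio.h>
    #include <string.h>

    #define DIRVEC_SIZE 8   /* illustrative; the real vector is sized from LOCKDIRWT values */

    /* Directory vector of node CSIDs; each node appears in proportion
       to its LOCKDIRWT (here node ...01 twice, ...02 four times, etc.) */
    static const unsigned int dir_vector[DIRVEC_SIZE] =
        { 0x10001, 0x10001, 0x10002, 0x10002,
          0x10002, 0x10002, 0x10003, 0x10003 };

    /* Stand-in hash; VMS uses its own hash function */
    static unsigned int hash_resnam(const char *resnam, size_t len)
    {
        unsigned int h = 0;
        for (size_t i = 0; i < len; i++)
            h = h * 31 + (unsigned char)resnam[i];
        return h;
    }

    int main(void)
    {
        const char *resnam = "RMS$example_root_resource";
        unsigned int dir_csid =
            dir_vector[hash_resnam(resnam, strlen(resnam)) % DIRVEC_SIZE];
        printf("Directory node CSID: %08X\n", dir_csid);
        return 0;
    }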

  21. Lock Request Latencies • Local requests are fastest • Remote requests are significantly slower: • Code path ~20 times longer • Interconnect also contributes latency • Total latency up to 2 orders of magnitude higher than local requests

  22. Lock Request Latency • Client process on same node: 4-6 microseconds [diagram: client and lock master on the same node]

  23. Storage Lock Request Latency • Client across CI star coupler: 440 microseconds [diagram: client node and lock master node connected through a CI star coupler]

  24. Lock Request Latencies

  25. How to measure lock rates • VMS keeps counters of lock activity for each resource tree • but not for each of the sub-resources • So you can see the lock rate for an RMS indexed file, for example • but not for individual buckets or records within that file

  26. Identifying heaviest-used lock trees in the cluster • Measure lock rates based on RSB data: • Follow chain of root RSBs from LCK$GQ_RRSFL listhead via RSB$Q_RRSFL links • Root RSBs contain counters: • RSB$W_OACT: Old activity field (average lock rate per 8 second interval) • Divide by 8 to get per-second average • RSB$W_NACT: New activity (locks so far within current 8-second interval) • Transient value, so not as useful
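The conversion is simple arithmetic; a tiny C sketch (the sampled value is invented, and reading the real RSB fields requires kernel-mode access, as in the LCKACT.MAR example referenced on the next slide):

    #include <stdio.h>

    int main(void)
    {
        unsigned short oact = 92;   /* RSB$W_OACT sampled from a root RSB (example) */

        /* OACT is an average over an 8-second interval, so divide by 8 */
        printf("Average lock rate: %.1f requests/second\n", oact / 8.0);
        return 0;
    }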

  27. Identifying heaviest-used lock trees in the cluster • Look for non-zero OACT values: • Gather resource name, master node CSID, and old-activity field • Do this on each node • Summarize data across the cluster • See example procedure LOCK_ACTV.COM and program LCKACT.MAR • Or, for VMS 7.3: SDA> LCK SHOW ACTIVITY

  28. Lock Activity Example
  0000002020202020202020203153530200004C71004624534D52 RMS$F.qL...SS1 ...
  RMS lock tree for file [70,19569,0] on volume SS1
  File specification: DISK$SS1:[DATA8]PDATA.IDX;1
  Total: 11523
  Node     Locks/sec
  *XYZB12  6455
  XYZB11   746
  XYZB14   611
  XYZB15   602
  XYZB23   564
  XYZB13   540
  XYZB19   532
  XYZB16   523
  XYZB20   415
  XYZB22   284
  XYZB18   127
  XYZB21   125
  (* = Lock Master Node for the resource)
  {This is a fairly hot file. Here the lock master node is optimal.}

  29. Lock Activity Example
  0000002020202032454C494653595302000000D3000C24534D52 RMS$.......SYSFILE2 ...
  RMS lock tree for file [12,211,0] on volume SYSFILE2
  File specification: DISK$SYSFILE2:[SYSFILE2]SYSUAF.DAT;5
  Total: 184
  Node     Locks/sec
  XYZB16   75
  XYZB20   48
  XYZB23   41
  XYZB21   16
  XYZB19   2
  *XYZB15  1
  XYZB13   1
  XYZB14   0
  XYZB12   0
  {This reflects user logins, process creations, password changes, and such. Note the poor lock master node selection here (XYZB16 would be optimal).}

  30. Example: Application (re)opens file frequently • Symptom: High lock rate on the File Access Arbitration Lock for an application data file • Cause: BASIC program re-executing the OPEN statement for a file; BASIC dutifully closes and then re-opens the file • Fix: Modify the BASIC program to execute the OPEN statement only once, at image startup time

  31. Lock Activity Example
  00000016202020202020202031505041612442313146 F11B$aAPP1 ....
  Files-11 File Access Arbitration lock for file [22,*,0] on volume APP1
  File specification: DISK$APP1:[DATA]XDATA.IDX;1
  Total: 50
  Node     Locks/sec
  *XYZB15  8
  XYZB21   7
  XYZB16   7
  XYZB19   6
  XYZB20   6
  XYZB23   6
  XYZB18   5
  XYZB13   3
  XYZB12   1
  XYZB22   1
  XYZB14   1
  {This shows the application apparently opening (or re-opening) this particular file 50 times per second.}

  32. Lock Mastership (Resource Mastership) concept • One lock master node is selected by VMS for a given resource tree at a given time • Different resource trees may have different lock master nodes

  33. Lock Mastership (Resource Mastership) concept • Lock master remembers all locks on a given resource tree for the entire cluster • Each node holding locks also remembers the locks it is holding on resources, to allow recovery if lock master node dies

  34. Lock Mastership • Lock mastership node may change for various reasons: • Lock master node goes down -- new master must be elected • VMS may move lock mastership to a “better” node for performance reasons • LOCKDIRWT imbalance found, or • Activity-based Dynamic Lock Remastering • Lock Master node no longer has interest

  35. Lock Remastering • Circumstances under which remastering occurs, and does not: • LOCKDIRWT values • VMS tends to remaster to node with higher LOCKDIRWT values, never to node with lower LOCKDIRWT • Shifting initiated based on activity counters in root RSB • PE1 parameter being non-zero can prevent movement or place threshold on lock tree size • Shift if existing lock master loses interest

  36. Lock Remastering • VMS rules for dynamic remastering decision based on activity levels: • assuming equal LOCKDIRWT values • 1) Must meet general threshold of 10 lock requests per second (LCK$GL_SYS_THRSH) • 2) New potential master node must have at least 10 more requests per second than current master (LCK$GL_ACT_THRSH)

  37. Lock Remastering • VMS rules for dynamic remastering: • 3) Estimated cost to move (based on size of lock tree) must be less than estimated savings (based on lock rate) • except that if the new master meets criterion (2) for 3 consecutive 8-second intervals, the cost is ignored • 4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)

  38. Lock Remastering • VMS rules for dynamic remastering: • 5) If PE1 on the current master has a negative value, remastering trees off the node is disabled • 6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
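Pulling rules 1 through 6 together, the decision logic can be sketched as below. This is a conceptual rendering, not VMS source: the thresholds LCK$GL_SYS_THRSH (10), LCK$GL_ACT_THRSH (10), and LCK$GL_RM_QUOTA (5) come from the slides above, while the structure and field names are invented for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    #define SYS_THRSH 10   /* LCK$GL_SYS_THRSH: minimum requests/second (rule 1) */
    #define ACT_THRSH 10   /* LCK$GL_ACT_THRSH: required margin over master (rule 2) */
    #define RM_QUOTA   5   /* LCK$GL_RM_QUOTA: concurrent remasters per node (rule 4) */

    typedef struct {
        unsigned candidate_rate;    /* candidate node's lock rate on this tree */
        unsigned master_rate;       /* current master's lock rate on this tree */
        unsigned consecutive_wins;  /* 8-second intervals candidate led by ACT_THRSH */
        unsigned move_cost;         /* estimated cost to move (tree size based) */
        unsigned move_savings;      /* estimated savings (lock rate based) */
        unsigned active_remasters;  /* remaster operations in progress */
        unsigned tree_size;         /* locks in the tree */
        int      pe1;               /* PE1 parameter on the current master */
    } remaster_state;

    static bool should_remaster(const remaster_state *s)
    {
        if (s->candidate_rate < SYS_THRSH)                   return false; /* rule 1 */
        if (s->candidate_rate < s->master_rate + ACT_THRSH)  return false; /* rule 2 */
        if (s->consecutive_wins < 3 &&
            s->move_cost >= s->move_savings)                 return false; /* rule 3 */
        if (s->active_remasters >= RM_QUOTA)                 return false; /* rule 4 */
        if (s->pe1 < 0)                                      return false; /* rule 5 */
        if (s->pe1 > 0 && s->tree_size >= (unsigned)s->pe1)  return false; /* rule 6 */
        return true;
    }

    int main(void)
    {
        remaster_state s = { 120, 40, 0, 5000, 9600, 1, 15000, 0 };
        printf("Remaster? %s\n", should_remaster(&s) ? "yes" : "no");
        return 0;
    }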

  39. Lock Remastering • Implications of dynamic remastering rules: • LOCKDIRWT must be equal for lock activity levels to control choice of lock master node • PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node • RSB stores lock activity counts, so even high activity counts can be lost if the last lock is DEQueued on a given node and thus the RSB gets deallocated

  40. Lock Remastering • Implications of dynamic remastering rules: • With two or more nodes of equal CPU power running the same application, it is easy to get lock mastership thrashing: • 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second • Whichever node newly becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload

  41. Lock Remastering • Lock mastership thrashing results in user-visible delays • Lock operations on a tree are stalled during a remaster operation • Locks are basically sent one per message • Remastering large lock trees can take a long time • e.g. 10-50 seconds for a tree of 15,000 locks • Changes in VMS version 7.3 give a 3x-9x performance improvement • by using 64-Kbyte block data transfers instead of 1 message per RSB or LKB

  42. How to Detect Lock Mastership Thrashing • Detection of remastering activity • Check message counters in SDA: • SDA> EXAMINE PMS$GL_RM_RBLD_SENT • SDA> EXAMINE PMS$GL_RM_RBLD_RCVD • Counts which increase suddenly by a large amount indicate remastering of large tree(s) • SENT: Off of this node • RCVD: Onto this node • See example procedures WATCH_RBLD.COM and RBLD.COM • Change of mastership node • SDA> SHOW LOCK/SUMMARY in 7.2 and above • MONITOR RLOCK in 7.3

  43. How to Prevent Lock Mastership Thrashing • Introduce an asymmetry so that one node consistently wins: • Unbalanced node power • Unequal workloads • Unequal values of LOCKDIRWT • Non-zero values of PE1

  44. How to find the lock master node for a given resource tree • 1) Take out a Null lock on the root resource using $ENQ • VMS does directory lookup and finds out master node • 2) Use $GETLKI to identify the current lock master node’s CSID and the lock count • If the local node is the lock master, and the lock count is 1 (i.e. only our NL lock), there’s no interest in the resource now

  45. Finding the lock master node for a given resource • 3) $DEQ to release the lock • 4) Use $GETSYI to translate the CSID to an SCS Nodename • See example procedure FINDMASTER_FILE.COM and program FINDMASTER.MAR
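Steps 1 through 4 can also be coded directly against the system services. The following is a minimal C sketch, not the FINDMASTER.MAR example: the resource name is a placeholder, the item-code choices should be checked against the $GETLKI and $GETSYI documentation for your VMS version, and error handling is abbreviated.

    #include <stdio.h>
    #include <descrip.h>
    #include <lckdef.h>
    #include <lkidef.h>
    #include <syidef.h>
    #include <ssdef.h>
    #include <starlet.h>

    typedef struct {               /* standard VMS item-list entry */
        unsigned short buflen, itmcod;
        void *bufadr;
        unsigned short *retlen;
    } item_t;

    int main(void)
    {
        $DESCRIPTOR(resnam, "EXAMPLE_ROOT_RESOURCE");   /* placeholder name */
        struct { unsigned short status, reserved; unsigned int lkid; } lksb = {0};
        unsigned int mstcsid = 0, status;
        char nodename[16];
        unsigned short nodelen = 0;

        /* 1) NL lock on the root resource; this forces the directory lookup */
        status = sys$enqw(0, LCK$K_NLMODE, (void *)&lksb, 0, &resnam,
                          0, 0, 0, 0, 0, 0, 0);
        if (!(status & 1) || !(lksb.status & 1)) return status;

        /* 2) Ask $GETLKI for the master node's CSID (a lock-count item
           could be added here to check for other interest in the tree) */
        item_t lki_items[] = {
            { sizeof mstcsid, LKI$_MSTCSID, &mstcsid, 0 },
            { 0, 0, 0, 0 }
        };
        status = sys$getlkiw(0, &lksb.lkid, lki_items, 0, 0, 0, 0);
        if (!(status & 1)) return status;

        /* 3) Release the probe lock */
        sys$deq(lksb.lkid, 0, 0, 0);

        /* 4) Translate the CSID to an SCS node name */
        item_t syi_items[] = {
            { sizeof nodename - 1, SYI$_NODENAME, nodename, &nodelen },
            { 0, 0, 0, 0 }
        };
        status = sys$getsyiw(0, &mstcsid, 0, syi_items, 0, 0, 0);
        if (!(status & 1)) return status;

        nodename[nodelen] = '\0';
        printf("Lock master: %s (CSID %08X)\n", nodename, mstcsid);
        return SS$_NORMAL;
    }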

  46. Controlling Lock Mastership • Why wrest control of lock mastership from VMS? • Spread lock mastership workload more evenly across nodes to help avoid saturation of any single lock master node • Provide best performance for a specific job by guaranteeing local locking for its files

  47. How to force lock mastership of a resource tree to a specific node • 3 ways to induce VMS to move a lock tree: 1) Generate a lot of I/Os • For example, run several copies of a program that rapidly accesses the file 2) Generate a lot of lock requests • without the associated I/O operations 3) Generate the effect of a lot of lock requests without actually doing them • by modifying VMS’ data structures

  48. How to force lock mastership of a lock tree to a specific node • We’ll examine: • 1) Method using documented features, thus fully supported • 2) Method modifying VMS data structures, thus unsupported

  49. Controlling Lock Mastership Using Supported Methods • To move a lock tree to a particular node (non-invasive method): • Assume PE1 non-zero on all nodes to start with • 1) Set PE1 to 0 on existing lock master node to allow dynamic lock remastering of tree off that node • 2) Set PE1 to negative value (or small positive value) on target node to prevent lock tree from moving off of it afterward

  50. Controlling Lock Mastership Using Supported Methods • 3) On target node, take out a Null lock on root resource • 4) Take out a sub-lock of the parent Null lock, and then repeatedly convert it between Null and some other mode • Check periodically to see if tree has moved yet (using $GETLKI) • 5) Once tree has moved, free locks • 6) Set PE1 back to original value on former master node
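Steps 3 and 4 as a minimal C sketch (resource names are placeholders, the iteration count is arbitrary, and the periodic $GETLKI check is only indicated in a comment; see the find-master sketch earlier for how that call looks):

    #include <descrip.h>
    #include <lckdef.h>
    #include <ssdef.h>
    #include <starlet.h>

    typedef struct { unsigned short status, reserved; unsigned int lkid; } lksb_t;

    int main(void)
    {
        $DESCRIPTOR(root_resnam, "EXAMPLE_ROOT_RESOURCE");  /* placeholder */
        $DESCRIPTOR(sub_resnam,  "EXAMPLE_SUB_RESOURCE");   /* placeholder */
        lksb_t root = {0}, sub = {0};

        /* Step 3: Null lock on the root resource of the tree */
        sys$enqw(0, LCK$K_NLMODE, (void *)&root, 0, &root_resnam,
                 0, 0, 0, 0, 0, 0, 0);

        /* Step 4: sub-lock under the root, converted repeatedly between
           NL and PW; each conversion counts as lock activity on the tree */
        sys$enqw(0, LCK$K_NLMODE, (void *)&sub, 0, &sub_resnam,
                 root.lkid, 0, 0, 0, 0, 0, 0);
        for (int i = 0; i < 100000; i++) {
            sys$enqw(0, LCK$K_PWMODE, (void *)&sub, LCK$M_CONVERT,
                     0, 0, 0, 0, 0, 0, 0, 0);
            sys$enqw(0, LCK$K_NLMODE, (void *)&sub, LCK$M_CONVERT,
                     0, 0, 0, 0, 0, 0, 0, 0);
            /* periodically: $GETLKI on root.lkid to see if the tree moved */
        }

        /* Step 5: free the locks once the tree has moved */
        sys$deq(sub.lkid, 0, 0, 0);
        sys$deq(root.lkid, 0, 0, 0);
        return SS$_NORMAL;
    }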
