Lock management

Versions of Service Manager (SM) prior to 9.31 used a locking mechanism that relied on multicasting to request and obtain a lock on a resource. This mechanism was implemented using a Peer Lock in the JGroups toolkit. However, the implementation had several issues, which the new locking mechanism described in this document addresses.

Issues in Service Manager Multicast Communication

Service Manager’s previous multicast implementation suffers from the following limitations:

• All nodes must be contacted by the node that is requesting the resource. This leads to high overhead per request.

• All nodes must give approval to a request, regardless of whether a node is using a resource or not.

• A node that does not respond represents a single point of failure for the system, even if that node has no other connection to the resource or the requesting node.

• If one node does not respond, the request must be re-issued by the originating node, which increases overhead even further.

• Nodes that are slow to respond will eventually be removed from the node cluster.

• Scalability is very poor.

• The potential for netstorms is high (wherein every node is attempting to request permission from every other node, leading to n^2 requests).

New Locking Mechanism Overview

The new locking mechanism consists of a record entry for each locked resource in a database table. The new Lock table (for exclusive locks) and LockShared table (for shared locks) have been created to house these records. See the following tables for details on the structure and fields of the Lock and LockShared tables.

Note: The two tables differ in only two respects: the value in the TYPE field, and the primary key, which for the LockShared table is a combination of the LOCKID, PID, TID, and IP fields.

The Lock Table
Field         | Type         | Null | Key | Default | Extra
--------------|--------------|------|-----|---------|------
LOCKID        | varchar(600) | N    | PK  |         | LockID is a hex value of ResourceName. This ensures that the ResourceName field can be either case-sensitive or case-insensitive.
RESOURCENAME  | varchar(200) | N    |     |         | Logical lock ID
TYPE          | char(1)      |      |     |         | Exclusive / Shared lock
PID           | float        |      |     |         | The Process ID that holds the lock.
TID           | float        |      |     |         | The Thread ID that holds the lock.
RADTHREADID   | float        |      |     |         | The Rad Thread ID that holds the lock.
SESSIONID     | float        |      |     |         | The session ID that holds the lock.
REASON        | VARCHAR2(60) |      |     |         | Application private
USER          | VARCHAR2(60) |      |     |         | SM login user (for example, "falcon")
HOSTNAME      | VARCHAR2(60) |      |     |         | The host that holds the lock.
IP            | VARCHAR2(60) |      |     |         | The IP address of the host that holds the lock.
DEVICENAME    | VARCHAR2(60) |      |     |         | The device that holds the lock. For the bg scheduler thread, the devicename is "SYSTEM".
LOCKAT        | datetime     |      |     |         | When the lock was obtained.
STARTAT       | datetime     |      |     |         | When the lock was requested.
RETRYCOUNT    | float        |      |     |         | Number of times the lock was requested.
HEARTBEAT     | float        |      |     |         | Updated periodically by the lock holder.
SUSPECTED     | float        |      |     |         | Indicates another node suspects the lock holder has failed.
SYSRESTRICTED | CHAR(1)      |      |     |         | Indicates no modification allowed for end users
SYSMODCOUNT   | float        |      |     |         |
SYSMODUSER    | VARCHAR(60)  |      |     |         |
SYSMODTIME    | datetime     |      |     |         |

The LockShared Table
Field         | Type         | Null | Key | Default | Extra
--------------|--------------|------|-----|---------|------
LOCKID        | varchar(600) | N    | PK  |         | LockID is a hex value of ResourceName. This ensures that the ResourceName field can be either case-sensitive or case-insensitive.
RESOURCENAME  | varchar(200) | N    |     |         | Logical lock ID
TYPE          | char(1)      | N    | PK  |         | Shared lock
PID           | float        |      | PK  |         | The Process ID that holds the lock.
TID           | float        |      | PK  |         | The Thread ID that holds the lock.
RADTHREADID   | float        |      |     |         | The Rad Thread ID that holds the lock.
SESSIONID     | float        |      |     |         | The session ID that holds the lock.
REASON        | VARCHAR2(60) |      |     |         | Application private
USER          | VARCHAR2(60) |      |     |         | SM login user (for example, "falcon")
HOSTNAME      | VARCHAR2(60) |      |     |         | The host that holds the lock.
IP            | VARCHAR2(60) |      |     |         | The IP address of the host that holds the lock.
DEVICENAME    | VARCHAR2(60) |      |     |         | The device that holds the lock. For the bg scheduler thread, the devicename is "SYSTEM".
LOCKAT        | datetime     |      |     |         | When the lock was obtained.
STARTAT       | datetime     |      |     |         | When the lock was requested.
RETRYCOUNT    | float        |      |     |         | Number of times the lock was requested.
HEARTBEAT     | float        |      |     |         | Updated periodically by the lock holder.
SUSPECTED     | float        |      |     |         | Indicates another node suspects the lock holder has failed.
SYSRESTRICTED | CHAR(1)      |      |     |         | Indicates no modification allowed for end users
SYSMODCOUNT   | float        |      |     |         |
SYSMODUSER    | VARCHAR(60)  |      |     |         |
SYSMODTIME    | datetime     |      |     |         |

Lock Behavior

Locks may be either shared locks or exclusive locks. A shared lock allows multiple nodes to read the data from a resource at the same time. An exclusive lock may be obtained only after all shared locks on the resource have been released by the nodes that hold them.

Exclusive Lock

The process by which a node obtains an exclusive lock is as follows:

  1. A node that requests an exclusive lock tries to insert a record for the resource into the Lock table. The outcome of the insert shows whether the resource is available.
  2. If there is no record for this resource in the Lock table, the resource is available: the insert succeeds and the node obtains the lock.
  3. If there is already a record for this resource in the Lock table, the node fails to obtain the lock.

To unlock an exclusive lock, the corresponding lock record is removed from the Lock table.
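
The exclusive lock steps above can be sketched as an insert guarded by the primary key. This is a minimal illustration, not Service Manager's actual implementation: it uses an in-memory SQLite database in place of the SM database, keeps only a few of the Lock table columns, and assumes a TYPE value of 'E' for exclusive locks. The function names are hypothetical.

```python
import sqlite3


def lock_id(resource_name):
    """Hex-encode the resource name, following the LOCKID column description."""
    return resource_name.encode("utf-8").hex()


def create_lock_table(conn):
    # Reduced column set (assumption); the real Lock table has many more fields.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Lock ("
        " LOCKID TEXT PRIMARY KEY,"
        " RESOURCENAME TEXT NOT NULL,"
        " TYPE TEXT,"
        " PID INTEGER)"
    )


def acquire_exclusive(conn, resource_name, pid):
    """Return True if the lock record was inserted, False if one already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Lock (LOCKID, RESOURCENAME, TYPE, PID)"
                " VALUES (?, ?, 'E', ?)",  # 'E' is an assumed TYPE value
                (lock_id(resource_name), resource_name, pid),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected the insert: the resource is already locked.
        return False


def release_exclusive(conn, resource_name, pid):
    """Unlock by deleting the holder's record from the Lock table."""
    with conn:
        conn.execute(
            "DELETE FROM Lock WHERE LOCKID = ? AND PID = ?",
            (lock_id(resource_name), pid),
        )
```

Because the availability check and the acquisition are a single insert, two nodes that race for the same resource cannot both succeed; the database's primary key constraint decides the winner.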

Shared Lock

The process by which a node obtains a shared lock is as follows:

  1. A node that requests a shared lock first tries to insert a record for the resource into the Lock table to see whether the resource is available.
  2. If there is no record for this resource in the Lock table, the node inserts a record into the Lock table and proceeds to step 4.
  3. If there is a record in the Lock table, the TYPE field of that record is checked. If the TYPE field is “Exclusive,” the lock request is rejected. If the TYPE field is “Shared,” the node proceeds to step 4.
  4. The node tries to insert a record into the LockShared table. If the insertion is successful, the shared lock request is granted. If the insertion fails, the shared lock request is rejected.

To unlock a shared lock, the corresponding lock record is removed from the LockShared table. The corresponding record is then removed from the Lock table as well, unless other shared lock records remain for the same resource.

Note: There is no mechanism to escalate a Shared lock to an Exclusive lock. To obtain an exclusive lock on a resource, all shared locks must be released.
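
The shared lock flow can be sketched in the same way. The example below is an illustration under stated assumptions rather than the product's code: SQLite stands in for the SM database, the tables are reduced to a few columns, the TYPE values 'S' and 'E' are assumed, and the LockShared primary key is limited to LOCKID, PID, and TID. All function names are hypothetical.

```python
import sqlite3

# Reduced schema (assumption): only the columns the sketch needs.
SCHEMA = """
CREATE TABLE Lock (
    LOCKID TEXT PRIMARY KEY,
    RESOURCENAME TEXT NOT NULL,
    TYPE TEXT
);
CREATE TABLE LockShared (
    LOCKID TEXT NOT NULL,
    RESOURCENAME TEXT NOT NULL,
    PID INTEGER NOT NULL,
    TID INTEGER NOT NULL,
    PRIMARY KEY (LOCKID, PID, TID)
);
"""


def lock_id(resource_name):
    return resource_name.encode("utf-8").hex()


def acquire_shared(conn, resource_name, pid, tid):
    """Follow steps 1 to 4; return True if the shared lock is granted."""
    lid = lock_id(resource_name)
    with conn:
        try:
            # Steps 1 and 2: claim the resource in the Lock table as shared.
            conn.execute("INSERT INTO Lock VALUES (?, ?, 'S')", (lid, resource_name))
        except sqlite3.IntegrityError:
            # Step 3: a record exists, so inspect its TYPE.
            row = conn.execute(
                "SELECT TYPE FROM Lock WHERE LOCKID = ?", (lid,)
            ).fetchone()
            if row is None or row[0] != "S":
                return False  # exclusive lock held (or record vanished): reject
        try:
            # Step 4: register this holder in LockShared.
            conn.execute(
                "INSERT INTO LockShared VALUES (?, ?, ?, ?)",
                (lid, resource_name, pid, tid),
            )
        except sqlite3.IntegrityError:
            return False
    return True


def release_shared(conn, resource_name, pid, tid):
    """Remove the holder's record, then the Lock record if no holders remain."""
    lid = lock_id(resource_name)
    with conn:
        conn.execute(
            "DELETE FROM LockShared WHERE LOCKID = ? AND PID = ? AND TID = ?",
            (lid, pid, tid),
        )
        remaining = conn.execute(
            "SELECT COUNT(*) FROM LockShared WHERE LOCKID = ?", (lid,)
        ).fetchone()[0]
        if remaining == 0:
            conn.execute("DELETE FROM Lock WHERE LOCKID = ? AND TYPE = 'S'", (lid,))
```

The Lock record acts as a marker that the resource is in shared use, while each LockShared record identifies one holder. The marker is removed only when the last holder releases.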

Lock Retry, Timeout, and Heartbeat

Each process that requires a lock has a dedicated LockHandler thread that handles all lock-related operations. When a process needs to execute a lock or unlock operation, it places a request in a queue. The LockHandler reads the queue and attempts to insert the appropriate records into the Lock or LockShared tables. The LockHandler also returns the response to the process that requested the lock.

The LockHandler retries a “wait” lock request every two seconds until the lock attempt succeeds. If a “no-wait” lock request is specified, the LockHandler reattempts lock acquisition immediately.
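
The queue-and-retry arrangement for “wait” requests might look roughly like the following. This is a simplified sketch with hypothetical names: `try_lock` stands for whatever routine attempts the database insert, the handler processes one request at a time, and “no-wait” handling is left out because the text above does not fully specify it.

```python
import queue
import threading
import time


class LockHandler(threading.Thread):
    """Hypothetical per-process handler that serves lock requests from a queue."""

    def __init__(self, try_lock, retry_interval=2.0):
        super().__init__(daemon=True)
        self.try_lock = try_lock              # callable(resource) -> bool
        self.retry_interval = retry_interval  # two seconds, per the text above
        self.requests = queue.Queue()

    def request_lock(self, resource):
        """Called by the owning process; the reply queue receives the response."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((resource, reply))
        return reply

    def stop(self):
        self.requests.put(None)  # sentinel that ends the run loop

    def run(self):
        while True:
            item = self.requests.get()
            if item is None:
                break
            resource, reply = item
            attempts = 1
            # "Wait" semantics: keep retrying until the lock is obtained.
            while not self.try_lock(resource):
                time.sleep(self.retry_interval)
                attempts += 1
            reply.put(attempts)  # respond with the number of attempts made
```

A caller would start the handler once, call `request_lock("some resource")`, and block on `reply.get()` until the lock is granted.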

It is possible that a node may obtain a lock and then fail to release it. For example, the node could fail, or a network problem may prevent communication between the nodes. To prevent other nodes from waiting indefinitely for an unresponsive lock owner, the following heartbeat mechanism has been implemented:

  • When a process requests a lock and finds that the lock is already held and the heartbeat value has not been updated, it sets the “Suspected” flag of the lock record in the database table to 1.
  • The lock owner updates the heartbeat and checks the “Suspected” flag of the lock record in the database table every 60 seconds.
  • If the “Suspected” flag is set to 1, the lock owner sets the “Suspected” flag back to 0 and then increments the heartbeat.
  • If the lock owner fails to reset the “Suspected” flag and update the heartbeat, the waiting process periodically rechecks the “Suspected” flag. After 10 minutes (the default value specified in the deadnodelocktimeout parameter), the waiting process forcibly removes the lock by deleting the lock record from the database table.
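
The heartbeat exchange can be sketched as two routines, one run by the lock owner and one by a waiting process. This is an illustrative sketch only: SQLite stands in for the SM database, the table is reduced to three columns, time is passed in as plain seconds so the logic is easy to follow, and the function names and return shape are assumptions.

```python
import sqlite3

DEAD_NODE_LOCK_TIMEOUT = 10 * 60  # seconds; default deadnodelocktimeout of 10 minutes


def owner_tick(conn, lid):
    """Run by the lock owner every 60 seconds: clear SUSPECTED, bump HEARTBEAT."""
    with conn:
        conn.execute(
            "UPDATE Lock SET SUSPECTED = 0, HEARTBEAT = HEARTBEAT + 1"
            " WHERE LOCKID = ?",
            (lid,),
        )


def waiter_check(conn, lid, last_heartbeat, stale_since, now):
    """Run periodically by a waiting process.

    Returns (heartbeat, stale_since, lock_gone).
    """
    row = conn.execute(
        "SELECT HEARTBEAT FROM Lock WHERE LOCKID = ?", (lid,)
    ).fetchone()
    if row is None:
        return None, None, True  # the lock record no longer exists
    heartbeat = row[0]
    if heartbeat != last_heartbeat:
        return heartbeat, now, False  # owner is alive; restart the clock
    with conn:
        if now - stale_since >= DEAD_NODE_LOCK_TIMEOUT:
            # Owner has been silent for the whole timeout: remove the lock.
            conn.execute(
                "DELETE FROM Lock WHERE LOCKID = ? AND HEARTBEAT = ?",
                (lid, heartbeat),
            )
            return heartbeat, stale_since, True
        # Heartbeat unchanged: flag the owner as suspected.
        conn.execute("UPDATE Lock SET SUSPECTED = 1 WHERE LOCKID = ?", (lid,))
    return heartbeat, stale_since, False
```

The delete is conditioned on the heartbeat value the waiter last saw, so an owner that recovers and updates the heartbeat at the last moment does not lose its lock.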

New Parameter

The new locking mechanism introduces the deadnodelocktimeout parameter. This parameter specifies the amount of time that must elapse before a process forcibly removes a lock from the Lock or LockShared table. By default, this parameter is set to 10, which indicates that 10 minutes must elapse before a record is forcibly removed. Ten minutes is also the minimum value for this parameter.

This parameter is specified in the sm.ini file. Changing the value of this parameter does not require a restart of the server.
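
As a rough illustration of the default and minimum described above, the following sketch reads the parameter from sm.ini-style text. It assumes one `name:value` entry per line and assumes that a value below the minimum is raised to 10; neither the file syntax nor that fallback behavior is confirmed by this document, and the function name is hypothetical.

```python
DEFAULT_MINUTES = 10  # default stated above
MINIMUM_MINUTES = 10  # minimum stated above


def dead_node_lock_timeout(ini_text):
    """Return the effective timeout in minutes from sm.ini-style text."""
    minutes = DEFAULT_MINUTES
    for line in ini_text.splitlines():
        line = line.strip()
        if line.startswith("#") or ":" not in line:
            continue  # skip comments and lines without a name:value pair
        name, _, value = line.partition(":")
        if name.strip() == "deadnodelocktimeout":
            minutes = int(value.strip())
    # Assumption: values below the minimum are treated as the minimum.
    return max(minutes, MINIMUM_MINUTES)
```

For example, text containing `deadnodelocktimeout:30` would yield a 30-minute timeout under these assumptions, while text without the entry would yield the default.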