Troubleshooting the HA Configuration

This section includes the following topics:

  • Common High Availability Configuration Mistakes
  • Configuration Issues with RHCS 6
  • HA Resource Testing
  • Re-Enable NNMi for High Availability after All Cluster Nodes are Unconfigured
  • NNMi Does Not Start Correctly Under High Availability
  • Changes to NNMi Data are Not Seen after Failover
  • nmsdbmgr Does Not Start after High Availability Configuration
  • NNMi Runs Correctly on Only One High Availability Cluster Node (Windows)
  • Disk Failover Does Not Occur
  • Shared Disk is Not Accessible (Windows)
  • Shared Disk Does Not Contain Current Data
  • Shared Disk Files Are Not Found by the Secondary Node after Failover
  • Error: Wrong Number of Arguments
  • Resource Hosting Subsystem Process Stops Unexpectedly (Windows Server)
  • Product Startup Times Out (Windows WSCS 2008)
  • Log Files on the Active Cluster Node Are Not Updating
  • Cannot Start the NNMi HA Resource Group on a Particular Cluster Node

Common High Availability Configuration Mistakes

Some common High Availability (HA) configuration mistakes are listed here:

  • Incorrect disk configuration

    • VCS: If a resource cannot be probed, the configuration is incorrect in some way. If a disk cannot be probed, the disk might no longer be accessible by the operating system.
    • Test the disk configuration manually and confirm against HA documentation that the configuration is appropriate.
  • The disk is in use and cannot be activated for the HA resource group.

    Always verify that the disk is not already activated before starting the HA resource group.

  • WSFC: Bad network configuration

    If network traffic is flowing across multiple NICs, RDP sessions fail when you activate programs that consume a large amount of network bandwidth, such as the NNMi ovjboss process.

  • Some HA products do not automatically restart at boot time.

    Review the HA product documentation for information about how to configure automatic restart on boot up.

  • Adding NFS or other file system access directly through the operating system (the HA resource group configuration should manage this access).
  • Being in the shared disk mount point during a failover or offlining of the HA resource group.

    HA kills any processes that prevent the shared disk from being unmounted.

  • Reusing the HA cluster virtual IP address as the HA resource group virtual IP address (this works on one system but not on the other)
  • Timeouts that are too short. If the product is misbehaving, the HA product might time out the HA resource and cause a failover.

    WSFC: In Failover Cluster Management, check the value of the Time to wait for resource to start setting. NNMi sets this value to 15 minutes. You can increase the value.

  • Not using maintenance mode

    Maintenance mode was created for debugging HA failures. If you attempt to bring a resource group online on a system and it fails over shortly afterwards, use maintenance mode to keep the resource group online so that you can see what is failing.

  • Not reviewing cluster logs (cluster logs can show many common mistakes).

Configuration Issues with RHCS 6

It is possible for the /etc/cluster/cluster.conf file versions to differ between the two systems in an HA environment if the ricci service is down or has been intentionally disabled. Therefore, monitor the cluster.conf file regularly to ensure that the file versions are synchronized.

If the cluster.conf file versions are not synchronized, you may experience problems when you attempt to do any of the following:

  • apply changes to cluster.conf
  • unconfigure a resource group
  • start the cluster
  • use the clustat command
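
As a quick check, you can compare the config_version attribute in the cluster.conf file on both nodes; matching version numbers indicate that the files are synchronized. The following is a minimal sketch (<node1> and <node2> are placeholder hostnames):

    # Compare the cluster.conf version on both nodes (run from either node).
    ssh <node1> grep config_version /etc/cluster/cluster.conf
    ssh <node2> grep config_version /etc/cluster/cluster.conf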

HA Resource Testing

This section describes the general approach for testing the resources that you will place into the NNMi HA resource group. This testing identifies hardware configuration problems. It is recommended that you perform this testing before configuring NNMi to run under High Availability (HA). Note the configuration values that generate positive results, and use these values when performing the complete configuration of the NNMi HA resource group.

For specific details regarding any of the commands listed here, see the most recent documentation for your HA product.

To test HA resources, follow these steps:

  1. If necessary, start the HA cluster.
  2. (Windows only) Verify that the following virtual IP addresses have been defined for the HA cluster:

    • A virtual IP address for the HA cluster
    • A virtual IP address for each HA resource group

    Neither of these IP addresses should be used elsewhere.

  3. Add an HA resource group to the HA cluster.

    Use a non-production name, such as test, for this HA resource group.

  4. Test the connection to the HA resource group:

    1. Add the virtual IP address and corresponding virtual hostname for the resource group as a resource to the HA resource group.

      Use the values that you will later associate with the NNMi HA resource group.

    2. Fail over from the active cluster node to the passive cluster node to verify that the HA cluster correctly fails over.
    3. Fail over from the new active cluster node to the new passive cluster node to verify failback.
    4. If the resource group does not fail over correctly, log on to the active node, and then verify that the IP address is properly configured and accessible. Also verify that no firewall blocks the IP address (see the verification sketch after this procedure).
  5. Configure the shared disk as described in Configuring a SAN or a Physically Connected Disk.
  6. Test the connection to the shared disk:

    1. Add the shared disk as a resource to the HA resource group as described in Moving the Shared Disk into the NNMi HA Resource Group.
    2. Fail over from the active cluster node to the passive cluster node to verify that the HA cluster correctly fails over.
    3. Fail over from the new active cluster node to the new passive cluster node to verify failback.
    4. If the resource group does not fail over correctly, log on to the active node, and then verify that the disk is mounted and available.
  7. Keep a record of the commands and inputs that you used to configure the shared disk. You might need this information when configuring the NNMi HA resource group.
  8. Remove the resource group from each node:

    1. Remove the IP address entry.
    2. Offline the resource group, and then remove the resource group from the node.

    At this point, you can use the NNMi-provided tools to configure NNMi to run under HA.
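
If the virtual IP address does not behave as expected in step 4, you can verify the address manually on Linux; the following is a sketch (<virtual_IP> is a placeholder, and the interface name varies by system):

    # Confirm that the virtual IP address is bound to an interface on the active node.
    ip addr show | grep <virtual_IP>

    # Confirm that the address answers from another host on the network.
    ping -c 3 <virtual_IP>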

Re-Enable NNMi for High Availability after All Cluster Nodes are Unconfigured

When all NNMi High Availability (HA) cluster nodes have been unconfigured, the ov.conf file no longer contains any mount point references to the NNMi shared disk.

To re-create the mount point reference without overwriting the data on the shared disk, follow these steps on the primary node:

  1. If NNMi is running, stop it:

    ovstop -c

  2. Reset the reference to the shared disk:

    • Windows:

      %NnmInstallDir%\misc\nnm\ha\nnmhadisk.ovpl NNM -setmount <HA_mount_point>
    • Linux:

      $NnmInstallDir/misc/nnm/ha/nnmhadisk.ovpl NNM -setmount <HA_mount_point>
  3. In the ov.conf file, verify the entries related to HA mount points.
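
For example, on Linux the ov.conf file typically resides under $NnmDataDir/shared/nnm/conf; assuming that default location, you can review the HA-related entries with a simple filter (a sketch):

    grep -E 'HA_|NNM_INTERFACE' $NnmDataDir/shared/nnm/conf/ov.conf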

NNMi Does Not Start Correctly Under High Availability

When NNMi does not start correctly, you must determine whether the problem is a hardware issue with the virtual IP address or the disk, or some form of application failure. During this debugging process, put the system in maintenance mode without the NORESTART keyword (a sketch for creating and removing the maintenance file follows these steps).

  1. On the active node in the HA cluster, disable HA resource group monitoring by creating the following maintenance file:

    • Windows: %NnmDataDir%\hacluster\<resource_group>\maintenance
    • Linux: $NnmDataDir/hacluster/<resource_group>/maintenance
  2. Start NNMi:

    ovstart

  3. Verify that NNMi started correctly:

    ovstatus -c

    All NNMi services should show the state RUNNING. If this is not the case, troubleshoot the process that does not start correctly.

  4. After completing your troubleshooting, delete the maintenance file:

    • Windows: %NnmDataDir%\hacluster\<resource_group>\maintenance
    • Linux: $NnmDataDir/hacluster/<resource_group>/maintenance
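
In its basic form, the maintenance file can be empty; its presence is what disables monitoring (the NORESTART keyword, when used, is placed inside the file). For example, you can create and later remove the file as follows (a sketch; replace <resource_group> with your resource group name):

    # Windows: create the file, then delete it when troubleshooting is complete.
    type nul > %NnmDataDir%\hacluster\<resource_group>\maintenance
    del %NnmDataDir%\hacluster\<resource_group>\maintenance

    # Linux: create the file, then delete it when troubleshooting is complete.
    touch $NnmDataDir/hacluster/<resource_group>/maintenance
    rm $NnmDataDir/hacluster/<resource_group>/maintenance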

Changes to NNMi Data are Not Seen after Failover

The NNMi configuration points to a different system than where NNMi is running. To fix the problem, verify that the ov.conf file has appropriate entries for the following items:

  • NNM_INTERFACE=<virtual_hostname>
  • HA_RESOURCE_GROUP=<resource_group>
  • HA_MOUNT_POINT=<HA_mount_point>
  • NNM_HA_CONFIGURED=YES
  • HA_POSTGRES_DIR=<HA_mount_point>/NNM/dataDir/shared/nnm/databases/Postgres
  • HA_EVENTDB_DIR=<HA_mount_point>/NNM/dataDir/shared/nnm/eventdb
  • HA_CUSTOMPOLLER_DIR=<HA_mount_point>/NNM/dataDir/shared/nnm/databases/custompoller
  • HA_NNM_LOG_DIR=<HA_mount_point>/NNM/dataDir/log
  • HA_JBOSS_DATA_DIR=<HA_mount_point>/NNM/dataDir/nmsas/NNM/data
  • HA_LOCALE=C
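
You can also read these values back through the cluster configuration, using the same command shown later in this section for HA_MOUNT_POINT (a sketch; repeat for each variable of interest):

    nnmhaclusterinfo.ovpl -config NNM -get NNM_INTERFACE
    nnmhaclusterinfo.ovpl -config NNM -get HA_RESOURCE_GROUP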

nmsdbmgr Does Not Start after High Availability Configuration

This situation usually occurs when NNMi is started after running the nnmhaconfigure.ovpl command without first running the nnmhadisk.ovpl command with the -to option. In this case, the HA_POSTGRES_DIR entry in the ov.conf file specifies the location of the embedded database on the shared disk, but this location is not available to NNMi.

To fix this problem, follow these steps:

  1. On the active node in the High Availability (HA) cluster, disable HA resource group monitoring by creating the following maintenance file:

    • Windows: %NnmDataDir%\hacluster\<resource_group>\maintenance
    • Linux: $NnmDataDir/hacluster/<resource_group>/maintenance
  2. Copy the NNMi database to the shared disk:

    • Windows:

      %NnmInstallDir%\misc\nnm\ha\nnmhadisk.ovpl NNM -to <HA_mount_point>
    • Linux:

      $NnmInstallDir/misc/nnm/ha/nnmhadisk.ovpl NNM -to <HA_mount_point>

    Caution To prevent database corruption, run this command (with the -to option) only one time. For information about alternatives, see Re-Enable NNMi for High Availability after All Cluster Nodes are Unconfigured.

  3. Start the NNMi HA resource group:

    • Windows:

      %NnmInstallDir%\misc\nnm\ha\nnmhastartrg.ovpl NNM <resource_group>
    • Linux:

      $NnmInstallDir/misc/nnm/ha/nnmhastartrg.ovpl NNM <resource_group>
  4. Start NNMi:

    ovstart

  5. Verify that NNMi started correctly:

    ovstatus -c

    All NNMi services should show the state RUNNING.

  6. After completing your troubleshooting, delete the maintenance file:

    • Windows: %NnmDataDir%\hacluster\<resource_group>\maintenance
    • Linux: $NnmDataDir/hacluster/<resource_group>/maintenance

NNMi Runs Correctly on Only One High Availability Cluster Node (Windows)

The Windows operating system requires two different virtual IP addresses, one for the High Availability (HA) cluster and one for the HA resource group.

If the virtual IP address of the HA cluster is the same as that of the NNMi HA resource group, NNMi only runs correctly on the node associated with the HA cluster IP address.

To correct this problem, change the virtual IP address of the HA cluster to a unique value for the network.

Disk Failover Does Not Occur

This situation can happen when the operating system does not support the shared disk. Review the HA product, operating system, and disk manufacturer documentation to determine whether these products can all work together.

When disk failover does not occur, NNMi does not start on the failover node. Most likely, nmsdbmgr fails because the HA_POSTGRES_DIR directory does not exist. Verify that the shared disk is mounted and that the appropriate files are accessible.
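
On Linux, for example, you can check both conditions with standard commands (a sketch; <HA_mount_point> is a placeholder):

    # Verify that the shared disk is mounted at the expected mount point.
    mount | grep <HA_mount_point>

    # Verify that the embedded database directory exists on the shared disk.
    ls <HA_mount_point>/NNM/dataDir/shared/nnm/databases/Postgres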

Shared Disk is Not Accessible (Windows)

The command nnmhaclusterinfo.ovpl -config NNM -get HA_MOUNT_POINT returns nothing.

The drive of the shared disk mount point must be fully specified (for example, S:\) during HA configuration.

To correct this problem, run the nnmhaconfigure.ovpl command on each node in the HA cluster. Fully specify the drive of the shared disk mount point.
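
After reconfiguring, you can confirm that the mount point was stored; with a fully specified drive, the command returns the drive (for example, S:\) instead of nothing:

    nnmhaclusterinfo.ovpl -config NNM -get HA_MOUNT_POINT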

Shared Disk Does Not Contain Current Data

Responding to the nnmhaconfigure.ovpl command question about disk type with the text none bypasses the code for setting the disk-related variables in the ov.conf file. To fix this situation, follow the procedure in Prepare the Shared Disk Manually in High Availability Environments.

Shared Disk Files Are Not Found by the Secondary Node after Failover

The most common cause of this situation is that the nnmhadisk.ovpl command was run with the -to option when the shared disk was not mounted. In this case, the data files are copied to the local disk, so the files are not available on the shared disk.

To fix this problem, follow these steps:

  1. On the active node in the High Availability (HA) cluster, disable HA resource group monitoring by creating the following maintenance file:

    • Windows: %NnmDataDir%\hacluster\<resource_group>\maintenance
    • Linux: $NnmDataDir/hacluster/<resource_group>/maintenance
  2. Log on to the active node, and then verify that the disk is mounted and available.
  3. Stop NNMi:

    ovstop

  4. Copy the NNMi database to the shared disk:

    • Windows:

      %NnmInstallDir%\misc\nnm\ha\nnmhadisk.ovpl NNM -to <HA_mount_point>
    • Linux:

      $NnmInstallDir/misc/nnm/ha/nnmhadisk.ovpl NNM -to <HA_mount_point>

    Caution To prevent database corruption, run this command (with the -to option) only one time. For information about alternatives, see Re-Enable NNMi for High Availability after All Cluster Nodes are Unconfigured.

  5. Start the NNMi HA resource group:

    • Windows:

      %NnmInstallDir%\misc\nnm\ha\nnmhastartrg.ovpl NNM <resource_group>
    • Linux:

      $NnmInstallDir/misc/nnm/ha/nnmhastartrg.ovpl NNM <resource_group>
  6. Start NNMi:

    ovstart

  7. Verify that NNMi started correctly:

    ovstatus -c

    All NNMi services should show the state RUNNING.

  8. After completing your troubleshooting, delete the maintenance file:

    • Windows: %NnmDataDir%\hacluster\<resource_group>\maintenance
    • Linux: $NnmDataDir/hacluster/<resource_group>/maintenance

Error: Wrong Number of Arguments

The name of the product Perl module is a required parameter to most of the NNMi High Availability (HA) configuration commands.

  • For NNMi, use the value NNM.
  • To determine what value to use for an NNM iSPI, see the documentation for that NNM iSPI.
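
For example, the product Perl module name appears as the first argument in commands such as the following (placeholders as used throughout this section):

    nnmhadisk.ovpl NNM -to <HA_mount_point>
    nnmhastartrg.ovpl NNM <resource_group>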

Resource Hosting Subsystem Process Stops Unexpectedly (Windows Server)

Starting a High Availability (HA) cluster resource on a computer running the Windows Server operating system stops the Resource Hosting Subsystem (Rhs.exe) process unexpectedly.

For information about this known problem, see the Microsoft Support web site article The Resource Hosting Subsystem (Rhs.exe) process stops unexpectedly when you start a cluster resource in Windows Server, which is available from http://support.microsoft.com/kb/978527.

Tip Always run the NNMi resource in a separate resource monitor (rhs.exe) specific to the resource group.

Product Startup Times Out (Windows WSCS 2008)

After upgrading to NNMi 10.30, if the app resource (<resource>-app) in the Failover Cluster Manager changes from "Pending" to "Failed", there might be a timeout issue. If this situation occurs, do the following:

  1. Use the cluster log /gen command to generate the cluster.log file.
  2. Open the log located in the following directory:

    C:\Windows\cluster\reports\cluster.log
  3. If you see an error in the cluster.log file similar to the following, you have a DeadlockTimeout issue:

    ERR [RHS] Resource <resource-name>-APP handling deadlock. Cleaning current operation.

    DeadlockTimeout is the total time allowed for failover when the agent might be blocked; PendingTimeout applies to either the online or the offline operation. The DeadlockTimeout default value is 45 minutes (2,700,000 milliseconds), and the PendingTimeout default value is 30 minutes (1,800,000 milliseconds).

    You can change the DeadlockTimeout and the PendingTimeout values. For example, to set a DeadlockTimeout of 75 minutes and a PendingTimeout of 60 minutes, you can run the following commands:

    cluster res "<resource group>-APP" /prop DeadlockTimeout=4500000
    cluster res "<resource group>-APP" /prop PendingTimeout=3600000

    See your High Availability vendor documentation for more information.
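
After setting new values, you can confirm them; running the cluster res command with /prop and no assignment lists the current resource property values (a sketch):

    cluster res "<resource group>-APP" /prop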

Log Files on the Active Cluster Node Are Not Updating

This situation is normal. It occurs because the log files have been redirected to the shared disk.

For NNMi, review the log files in the location specified by HA_NNM_LOG_DIR in the ov.conf file.
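
For example, you can retrieve the redirected log location and then list the files there (a sketch, using the same command pattern shown elsewhere in this section):

    nnmhaclusterinfo.ovpl -config NNM -get HA_NNM_LOG_DIR
    ls <HA_mount_point>/NNM/dataDir/log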

Cannot Start the NNMi HA Resource Group on a Particular Cluster Node

If the nnmhastartrg.ovpl or nnmhastoprg.ovpl command does not correctly start, stop, or switch the NNMi HA resource group, review the following information:

  • MSFC:

    • In Failover Cluster Management, review the state of the NNMi HA resource group and underlying resources.
    • Review the Event Viewer log for any errors.
  • VCS:

    • Run /opt/VRTSvcs/bin/hares -state to review the resource state.
    • For failed resources, review the /var/VRTSvcs/log/<resource>.log file for the resource that is failing. Resources are referenced by the agent type, for example: IP*.log, Mount*.log, and Volume*.log.

If you cannot locate the source of the problem, you can manually start the NNMi HA resource group by using the HA product commands:

  1. Mount the shared disk.
  2. Assign the virtual host to the network interface:

    • MSFC:

      • Start Failover Cluster Management.
      • Expand the resource group.
      • Right-click <resource_group>-ip, and then click Bring Online.
    • VCS: /opt/VRTSvcs/bin/hares -online <resource_group>-ip -sys <local_hostname>
    • RHCS: Run /usr/sbin/cmmodnet to add the IP address.
  3. Start the NNMi HA resource group. For example:

    • Windows:

      %NnmInstallDir%\misc\nnm\ha\nnmhastartrg.ovpl NNM -start <resource_group>
    • Linux:

      $NnmInstallDir/misc/nnm/ha/nnmhastartrg.ovpl NNM -start <resource_group>

The return code 0 indicates that NNMi started successfully.

The return code 1 indicates that NNMi did not start correctly.
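
For example, on Linux you can capture the return code immediately after running the command (a sketch):

    $NnmInstallDir/misc/nnm/ha/nnmhastartrg.ovpl NNM -start <resource_group>
    echo $?    # 0 = started successfully; 1 = did not start correctly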