Configure health checks
An OMi deployment typically consists of the OMi servers, one or more connected servers, and the nodes monitored by Operations Agents. The agents enable you to collect event data, discover topology, and run actions. When you monitor nodes using agents, it is important to check that the agent is running correctly, and that the server and agent can communicate with each other.
Health checks are especially important in agent installations that facilitate the integration of data from other management systems (for example, OpsCx, SiteScope, or ArcSight Logger). If the agent on such a system fails, the event flow from the integration also stops.
By default, health checking (of the type Agent & Server) is enabled for all agents monitored by OMi and, also by default, agents send heartbeat events only to their primary manager.
Learn More
Agents send heartbeat events to the server at regular intervals. If the server does not receive an event from an agent within the configured interval (for example, because the agent is not running), it waits for a grace period before creating an event that indicates a problem with the agent's health. The grace period ensures that temporary delays (for example, delays caused by network latency or by the agent buffering events) do not prematurely generate an agent problem event.
The agent problem event enters the event pipeline and is processed like all other events. The health indicator of the event is set to "down", which influences the health status of the related agent CI. When the server receives events again from the agent, it creates an agent up event, correlates the agent up event with the previously generated agent problem event, and closes the agent problem event automatically. The health indicator of the agent up event changes the status of the related agent CI back to "up".
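The life cycle described above (problem event sets the CI to "down", a later agent up event closes it and sets the CI back to "up") can be pictured with a small state model. This is only an illustrative sketch: the class `AgentHealth` and its method names are hypothetical and are not part of OMi.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentHealth:
    """Hypothetical in-memory model of the health state of one agent CI."""
    status: str = "up"
    open_problem_events: List[str] = field(default_factory=list)
    closed_problem_events: List[str] = field(default_factory=list)

    def on_agent_problem_event(self, event_id: str) -> None:
        # The health indicator of the problem event sets the agent CI to "down".
        self.open_problem_events.append(event_id)
        self.status = "down"

    def on_events_received_again(self) -> bool:
        """Return True if an agent up event was created and correlated."""
        if not self.open_problem_events:
            return False  # nothing to correlate, the agent was never "down"
        # The agent up event closes the earlier problem events automatically.
        self.closed_problem_events.extend(self.open_problem_events)
        self.open_problem_events.clear()
        self.status = "up"
        return True
```

In this model, the status only returns to "up" when events arrive again, which mirrors the behavior described in the text.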
If the server does not receive an event from an agent within the configured interval, the server can generate an agent problem event or, if Agent & Server checking is configured, can actively check the status of the agent before generating an agent problem event.
The server attempts to make two checks using HTTP connections to the agent. The first check opens a socket connection to verify that the control daemon (ovcd) and communication broker (ovbbccb) are running.
The second check attempts to get the following additional information from the message subagent (opcmsga):
- Message agent status
- Certificate status
- Control daemon status
- Whether the agent is buffering events for this server
- Whether the agent is buffering events for other servers
After the server has received the information from the agent, it generates an agent problem event and adds the agent output to the event. The agent problem event is always generated, even if the server check reports a running agent. The status of the agent changes to up only when the server receives events again from the agent.
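The decision flow after a missed heartbeat can be sketched as follows. The function name, the dictionary keys, and the `probe_agent` callback are assumptions made for illustration; the sketch does not implement the actual HTTP checks and only shows that the problem event is created regardless of what the active check reports.

```python
from typing import Callable, Dict, Optional


def handle_missed_heartbeat(
    check_type: str,
    probe_agent: Optional[Callable[[], str]] = None,
) -> Dict[str, Optional[str]]:
    """Build the agent problem event created once the grace period has passed."""
    event: Dict[str, Optional[str]] = {
        "type": "agent_problem",
        "health_indicator": "down",
        "agent_output": None,
    }
    if check_type == "Agent & Server" and probe_agent is not None:
        # The server contacts the agent first and attaches whatever it reports.
        # The event is still generated, even if the agent appears to be running.
        event["agent_output"] = probe_agent()
    return event
```

With `check_type` set to "Agent Only", the event is returned without any agent output because the server does not contact the agent.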
The purpose of agent heartbeat events is to indicate to the server that the agent can send events and that communication with the server is possible. The arrival of events on the server is enough information for the server to assume an agent is healthy. The heartbeat events themselves are therefore designed to be as small as possible in order to reduce the load on the network. Unlike other agent events, heartbeat events are internal events that do not enter the event pipeline. They are therefore not subjected to any event pipeline processing activities (for example, event storm suppression).
The gateway server then passes the events on to the data processing server where they are evaluated. If health checking is disabled on the server (either globally or individually for single agents), the gateway server immediately discards all heartbeat events without further processing. The data processing server then discards the heartbeat events after storing an in-memory record of each received heartbeat event.
Operations Agents 12.00 and later send heartbeat events depending on the value of the configuration variable OPC_HB_MSG_INTERVAL. If this variable is set to a number of seconds greater than 0, the agent sends heartbeat events at that interval. If the variable is not set or is set to 0, no heartbeat events are sent.
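As a minimal sketch of the rule above, the following helper interprets a raw OPC_HB_MSG_INTERVAL value. The function name is hypothetical, and treating negative values like 0 is an assumption that the source does not state.

```python
from typing import Optional


def heartbeat_interval_seconds(raw_value: Optional[str]) -> Optional[int]:
    """Return the heartbeat interval in seconds, or None if no heartbeats are sent."""
    if raw_value is None or raw_value.strip() == "":
        return None  # variable not set: no heartbeat events
    seconds = int(raw_value)
    # 0 disables heartbeats; negative values are treated the same way (assumption).
    return seconds if seconds > 0 else None
```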
Operations Agents earlier than 12.00 require a scheduled task policy template to generate heartbeat events. The policy template is included in the OOTB Contents for OMi content pack and is automatically assigned to monitored nodes and OpsCx, SiteScope, or ArcSight connected servers with an agent version earlier than 12.00.
The server starts expecting heartbeat events from an agent when the agent meets the prerequisites (listed in Prerequisites) and the agent's heartbeat configuration is set to Default (meaning that the defaults from the infrastructure settings are applied) or to Custom (manually changed by a user).
For Operations Agents 12.00 and later, the server then sets or changes the OPC_HB_MSG_INTERVAL variable on the agent to the specified number of seconds and the agent starts sending heartbeat events to the server. Agents earlier than 12.00 send heartbeat events after the scheduled task policy template has been successfully deployed.
The server does not check the health of agents that do not meet the health check prerequisites or of agents that have health checking disabled (Off).
New agents are eligible for health checking as soon as they meet the prerequisites. Whether the server actually starts health checking for these agents depends on the agent health check configuration (Default, Custom, or Off).
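The eligibility rules above can be condensed into one predicate. This is a sketch only: the function and parameter names are hypothetical, and the three configuration values are taken from the text.

```python
def server_checks_agent(
    meets_prerequisites: bool,
    configuration: str,
    globally_enabled: bool = True,
) -> bool:
    """Return True if the server expects heartbeat events from the agent."""
    if configuration not in {"Default", "Custom", "Off"}:
        raise ValueError(f"unknown health check configuration: {configuration!r}")
    # Health checking needs the global switch, the prerequisites, and a
    # per-agent configuration other than Off.
    return globally_enabled and meets_prerequisites and configuration != "Off"
```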
By default, agents send heartbeat events only to their primary event receiver. When the primary event receiver changes, the agent begins to send heartbeat events to the new primary event receiver. The new event receiver evaluates the heartbeat events only after the agent meets the prerequisites listed in Prerequisites.
The old event receiver continues to expect heartbeat events from the agent and will generate an agent problem event. You must therefore disable health checking on the old server, either for the switched agents only or globally.
In mixed environments with agents connected to Operations Manager (OM) servers and the OM servers forwarding events to OMi, health checking is performed by the OM servers using the built-in OM health check functionality.
See also How to switch the primary event receiver of an Agent managed by OMi and Health checking problems in manager-of-manager environments.
Tasks
OMi can only check the health of Operations Agents that meet the following prerequisites:
- The Operations Agent must be installed on the monitored node.
- The monitored node must be represented as a node CI in the RTSM.
- The Operations Agent must be represented as an Operations-agent CI in the RTSM.
- The Operations Agent CI must be related to a CI representing the OMi server. This ensures that the OMi server checks the health of only those agents that it actually monitors (that is, for which it is the primary manager).
The server automatically creates the required CIs and CI relations in the RTSM when an agent is connected to OMi. The CIs and CI relations are created after the certificate requests have been granted, at the latest after 24 hours, or after the agent has been restarted (reboot of the agent system or ovc -restart). It may therefore take up to 24 hours for the CIs and CI relations to be created in the RTSM.
Alternatively, create the agent to server CI relation manually using the Create Managed By OMi Relationship icon in the monitored nodes toolbar.
The following task configures health checking for one or more systems with an Operations Agent installed.
- Open the properties of the monitored nodes that you want to configure:
  - Navigate to Monitored Nodes:
    Administration > Setup and Maintenance > Monitored Nodes
    Alternatively, click Monitored Nodes.
  - Make sure a node filter (from any of the filter categories) is selected in the Node Views browser. Alternatively, select a node group.
  - Select the node and click Edit. The Monitored Node Properties dialog box opens. Click the Health Check tab.
    To configure multiple nodes at once, hold down the Ctrl or Shift key while selecting them. Then click Edit. The Edit Health Check Configuration dialog box opens.
- The values for Health Check Configuration indicate the current health check configuration of the agent:
  - Off disables health checking for the system. The agent continues to send heartbeat events but the gateway server discards them. No further processing takes place.
  - Default means that the default settings from the infrastructure settings are used.
  - Custom enables you to override the default settings.
- In Health Check Type, configure the type of health check you want to perform:
  - Agent Only configures the agent to send heartbeat events at a regular interval.
  - Agent & Server configures the server to actively check the health of the agent if a heartbeat event does not arrive within the configured timeout.
- Set the Agent Heartbeat Interval. This is the interval at which the agent sends heartbeat events to the server.
  The server creates a deployment job each time the Agent Heartbeat Interval is changed. For more information about deployment jobs, see Deployment Jobs.
- Set the Agent Heartbeat Grace Period. This is the period that the server allows before generating an agent problem event or, if Agent & Server checking is enabled, before contacting the agent.
  The server expects to receive a heartbeat event from the agent within the time period defined by the Agent Heartbeat Interval plus the Agent Heartbeat Grace Period.
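The relationship between the two settings in the last steps can be written as a simple deadline check. This is a sketch under assumptions: the function name is hypothetical, times are plain seconds, and whether the real server treats the exact deadline as overdue is not stated in the source.

```python
def heartbeat_overdue(
    last_heartbeat_s: float,
    now_s: float,
    interval_s: float,
    grace_period_s: float,
) -> bool:
    """Return True once the interval plus the grace period has fully elapsed."""
    deadline_s = last_heartbeat_s + interval_s + grace_period_s
    return now_s > deadline_s


# Illustrative values only: a 30-minute interval with a 10-minute grace period.
print(heartbeat_overdue(0, 2000, interval_s=1800, grace_period_s=600))  # False
print(heartbeat_overdue(0, 2401, interval_s=1800, grace_period_s=600))  # True
```

A longer grace period makes the check more tolerant of delays but also postpones the detection of a failed agent.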
The following task configures health checking globally for all Operations Agents that are managed by OMi.
When a new monitored node is added, the agent inherits the default health check settings from the infrastructure settings.
- Open the Infrastructure Settings Manager:
  Administration > Setup and Maintenance > Infrastructure Settings
  Alternatively, click Infrastructure Settings.
- Select the context Monitoring Automation and scroll to the Monitoring Automation - Health Check Settings table.
- Make sure health checking is enabled on the server. Set the value of Enable Health Check to true.
- Optional. Modify the Default Agent Heartbeat Interval. This is the interval at which the agent sends heartbeat events to the server.
  Each time you change the agent heartbeat interval, a deployment job is created for all nodes that are configured to use the default settings from the infrastructure settings. For more information about deployment jobs, see Deployment Jobs.
- Optional. Modify the Default Grace Period. This is the period that the server allows before generating an agent problem event or, if Agent & Server checking is enabled, before contacting the agent.
- Optional. Configure the Default Health Check Type. Agent Only configures the agent to send heartbeat events at a regular interval. When the server detects that a heartbeat event is missing, it waits for the grace period to pass before generating an event. In addition to configuring the agent to send heartbeat events, Agent & Server configures the server to contact the agent when heartbeat events are not arriving and the grace period has passed.
- Optional. Change Enable Health Check on New Agents. By default, health checking is enabled for new agents. To disable health checking for new agents, change this setting to false.
- Optional. Configure additional advanced server settings such as the logging of agent up events and the severity of the event that the server generates when it detects a problem with the agent health.
Troubleshooting
When you configure the agent heartbeat interval in the infrastructure settings or in the node properties, the configuration variable OPC_HB_MSG_INTERVAL is set on the agent system.
You can verify that the agent heartbeat interval is set correctly on the agent system by running an opr-agt -get_config_var query on the server.
Example:
opr-agt -username myU -password myPwd -node_list "node1.example.com" -get_config_var eaagt:OPC_HB_MSG_INTERVAL
Note: The heartbeat interval on the agent is configured in seconds.
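If you script this check, the command line from the example can be assembled as an argument list. The sketch below only builds and prints the command. It uses just the options shown in the example, the helper name is hypothetical, and it does not run opr-agt or make any assumption about the output format.

```python
import shlex
from typing import List


def build_get_config_var_command(
    username: str,
    password: str,
    node: str,
    variable: str = "eaagt:OPC_HB_MSG_INTERVAL",
) -> List[str]:
    """Assemble the opr-agt query from the example as an argument list."""
    return [
        "opr-agt",
        "-username", username,
        "-password", password,
        "-node_list", node,
        "-get_config_var", variable,
    ]


command = build_get_config_var_command("myU", "myPwd", "node1.example.com")
print(shlex.join(command))
```

Passing the list to `subprocess.run` on the server would avoid shell quoting issues, but how the result is returned and parsed depends on your installation.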
The agent sends heartbeat events at the interval configured for OPC_HB_MSG_INTERVAL. The server, however, expects to receive heartbeat events at the interval configured for the Default Agent Heartbeat Interval setting in the infrastructure settings or node properties. If the interval set on the agent is longer than the interval set on the server, the server expects to receive more heartbeat events than the agent actually sends, and generates an agent problem event or initiates a health check although the node is up.
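The mismatch can be expressed as a comparison of the two intervals. The sketch below is an assumption-based illustration: the function name is hypothetical, and the optional grace period reflects the tolerance described in the configuration task rather than anything stated in this paragraph.

```python
def agent_interval_too_long(
    agent_interval_s: int,
    server_interval_s: int,
    grace_period_s: int = 0,
) -> bool:
    """Return True if a healthy agent would still miss the server's deadline."""
    return agent_interval_s > server_interval_s + grace_period_s


# Illustrative values: the agent sends every 30 minutes, the server expects 10.
print(agent_interval_too_long(1800, 600))  # True
```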
The next time you modify the health check interval in the infrastructure settings or node properties, OPC_HB_MSG_INTERVAL on the agent is overwritten with the settings from the server. It is therefore recommended that you do not manually change the OPC_HB_MSG_INTERVAL setting directly on the agent system. Instead, use OMi to configure the Agent Heartbeat Interval.
For more information about the Operations Agent configuration variables, see the Operations Agent Help.
HPE recommends that you do not modify the out-of-the-box scheduled task policy template that configures Operations Agents with versions earlier than 12.00 to send heartbeat events at a regular interval. You can modify and save the policy template, and thereby create a new policy template version. However, OMi by default always auto-assigns the out-of-the-box policy template.
It is also not recommended to tune the interval parameter of the scheduled task policy. Tuning the interval parameter only updates the heartbeat interval on the agent, not the interval at which the server expects to receive heartbeat events. Also, changing the agent heartbeat interval in the monitored nodes properties overwrites any tuned interval parameters. To change the agent heartbeat interval for a single agent, change the properties of the monitored node.
OMi must know the version of the Operations Agent in order to deploy the corresponding health check configuration to the agent (OPC_HB_MSG_INTERVAL setting or scheduled task policy template). The server uses the following methods to obtain the agent version:
- Tries to retrieve the agent version from the Operations Agent CI in the RTSM.
- Connects to the agent and checks the value of OPC_INSTALLED_VERSION.
- If the above checks fail or do not provide the requested information, the server assumes the agent version to be earlier than 12.00 and deploys the scheduled task policy template.
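The fallback order above can be sketched as follows. All names are hypothetical, the version strings are assumed to be dot-separated numbers, and `query_agent` stands in for the lookup of OPC_INSTALLED_VERSION on the agent.

```python
from typing import Callable, Optional, Tuple

POLICY_TEMPLATE = "scheduled task policy template"
CONFIG_VARIABLE = "OPC_HB_MSG_INTERVAL"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '12.00' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def choose_heartbeat_deployment(
    version_from_ci: Optional[str],
    query_agent: Callable[[], Optional[str]],
) -> str:
    """Pick the heartbeat configuration method based on the agent version."""
    # First the CI in the RTSM, then the agent itself.
    version = version_from_ci or query_agent()
    if not version:
        return POLICY_TEMPLATE  # unknown version: assume earlier than 12.00
    try:
        parsed = parse_version(version)
    except ValueError:
        return POLICY_TEMPLATE  # unreadable version: same fallback
    return CONFIG_VARIABLE if parsed >= (12, 0) else POLICY_TEMPLATE
```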
If the monitored node properties do not enable you to configure health checking, check that health checking is globally enabled in the infrastructure settings and that the node meets the prerequisites for health checking listed in Prerequisites.
Tip: Make sure that the agent is managed by this OMi server and then create the agent to server CI relation manually using the Create Managed By OMi Relationship icon in the monitored nodes toolbar.
By default, agents send heartbeat events only to their primary manager and only the primary manager checks the health of an agent; that is, only the primary manager expects to receive the heartbeat events from an agent. In Manager-of-Manager (MoM) environments, the agent evaluates the rules defined in the flexible management policy to determine the target manager of the heartbeat event.
Health check problems can occur, for example when the agent does not send events because of a misconfiguration, when the agent sends the events to the wrong server, or when a server does not notice a failed agent or agent connection. Use the following troubleshooting tips to determine the cause of the problem and to identify the solution:
- Symptom 1: The server generates an agent problem event even though the agent is running and the connection works.
  - Problem 1: The agent is not sending any heartbeat events because the OPC_HB_MSG_INTERVAL variable is not set or set to the wrong interval.
    Solution 1: Verify that the agent heartbeat interval is set correctly on the agent system. For details, see OPC_HB_MSG_INTERVAL configuration variable for Agents 12.00 and later.
  - Problem 2: The agent is not sending any heartbeat events because the scheduled task policy template is not deployed or the wrong heartbeat interval is set.
    Solution 2: Verify that the policy template is deployed and that the agent heartbeat interval is set correctly on the agent system.
  - Problem 3: The agent sends heartbeat events to another server.
    Solution 3: Check the flexible management policy on the agent. If the server does not receive the heartbeat events because it is not the primary manager, add event target rules to the policy that send the heartbeat events to the server expecting the events.
    Heartbeat events have the following event attributes:

    Event Attribute | Value |
    ---|---|
    Title | Heart Beat |
    Event Key | 8c72e1fa-b1f1-4def-8c7e-71ecee643351 |

    See also Event Target Rules.
  - Problem 4: The variable OPC_BACKUP_MGRS does not include the server that expects the heartbeat events. This variable is only available with Operations Agents 11.11 and later.
    Solution 4: Include the server that performs the health checking in the variable OPC_BACKUP_MGRS.
    For more information about the Operations Agent configuration variables, see the Operations Agent Help.
- Symptom 2: The server does not send agent problem events although the agent is down or the connection is broken.
  - Problem 1: The server does not expect to receive heartbeat events from the agent because the agent does not meet the health check prerequisites.
    Solution 1: Make sure the agent meets the prerequisites described in Prerequisites.
  - Problem 2: Health checking may be disabled globally on the server or individually for the agent.
    Solution 2: Check the Monitoring Automation infrastructure setting Enable Health Check and make sure that health check is enabled. Check the properties of the node that is not sending heartbeat events and make sure health check is enabled. See also How to configure health checks for individual agents.
- Symptom 3: Multiple servers are configured to perform health checking but it works for only one.
  - Problem 1: The agents send the heartbeat events only to the primary manager or to the target manager defined in the event target rules of the flexible management policy.
    Solution 1: Add all health checking servers to the agent configuration variable OPC_BACKUP_MGRS and set the OPC_BACKUP_MGRS_FAILOVER_ONLY variable to false. This ensures that all servers defined as backup managers always receive all events, including the heartbeat events. The variables are only available with Operations Agents 11.11 and later.
    For more information about the Operations Agent configuration variables, see the Operations Agent Help.
  - Problem 2: The monitored node does not meet the prerequisites for health checking. In particular, the relation between the agent and the server CI may not be set on all servers.
    Solution 2: Make sure that RTSM synchronization does not delete the relation between the agent and the server CI. See also Prerequisites.
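When you review event target rules for Problem 3 of Symptom 1, it can help to picture how a heartbeat event is recognized by the attributes listed there. The sketch below is illustrative only: the dictionary keys `title` and `key` are hypothetical field names, and real matching is done by the conditions in the flexible management policy, not by Python code.

```python
HEARTBEAT_TITLE = "Heart Beat"
HEARTBEAT_EVENT_KEY = "8c72e1fa-b1f1-4def-8c7e-71ecee643351"


def is_heartbeat_event(event: dict) -> bool:
    """Return True if the event carries the heartbeat title and event key."""
    return (
        event.get("title") == HEARTBEAT_TITLE
        and event.get("key") == HEARTBEAT_EVENT_KEY
    )
```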
OMi by default logs the processing of heartbeat events on the gateway and data processing servers. The log information is placed in the following log files:
- Gateway server: <OMi_HOME>/log/wde/opr-heartbeat.log
- Data processing server: <OMi_HOME>/log/opr-backend/opr-heartbeat.log
The default log level is INFO. You can change the log level in the following files:
- Gateway server: <OMi_HOME>/conf/core/Tools/log4j/wde/opr-heartbeat.properties
- Data processing server: <OMi_HOME>/conf/core/Tools/log4j/opr-backend/opr-heartbeat.properties