Service Resource Management

To manage service resources, the NetEye team has developed several scripts, which are provided with every NetEye installation. These scripts are wrappers around the PCS and DRBD APIs; their use is showcased in section Adding a Service Resource to a Cluster. Commands that are useful for troubleshooting a NetEye Cluster are introduced in section Cluster Management Commands.

Adding a Service Resource to a Cluster

Service resources can be added by copying an existing template, located under the /usr/share/neteye/cluster/templates/ directory, to a suitable location, editing the copy, and finally passing it to a script.

For example, consider the Services-core-nats-server.conf.tpl template.

{
    "volume_group": "vg00",
    "ip_pre" : "192.168.47",
    "Services": [
        {
            "name": "nats-server",
            "ip_post": "48",
            "drbd_minor": 23,
            "drbd_port": 7810,
            "folder": "/neteye/shared/nats-server/",
            "collocation_resource": "cluster_ip",
            "size": "1024"
        }
    ]
}

Copy it, then edit it.

cluster# cd /usr/share/neteye/cluster/templates/
cluster# cp Services-core-nats-server.conf.tpl /tmp/Services-core-nats-server.conf
cluster# vi /tmp/Services-core-nats-server.conf

Hint

You can copy the edited file to any other location, to be used for reference or in case you need to change settings at any point in the future.

In the file, make sure to change the following values to match your infrastructure network:

  • ip_pre: the corporate network address of the node (i.e., the first three octets).

  • ip_post: the IP address of the node (only the last octet).
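
To see how the two values relate, the following Python sketch joins ip_pre and ip_post into a full IPv4 address and checks that the result is well formed. That the address is built by joining the two values with a dot is an assumption drawn from the descriptions above; the function name service_addresses is hypothetical and is not part of NetEye.

```python
import ipaddress
import json

# Sample configuration mirroring the template shown earlier.
SAMPLE = """
{
    "volume_group": "vg00",
    "ip_pre": "192.168.47",
    "Services": [
        {
            "name": "nats-server",
            "ip_post": "48",
            "drbd_minor": 23,
            "drbd_port": 7810,
            "folder": "/neteye/shared/nats-server/",
            "collocation_resource": "cluster_ip",
            "size": "1024"
        }
    ]
}
"""


def service_addresses(config):
    """Return {service name: full IPv4 address} for every service in config."""
    addresses = {}
    for service in config["Services"]:
        # Assumption: the full address is ip_pre and ip_post joined by a dot.
        candidate = f"{config['ip_pre']}.{service['ip_post']}"
        # ipaddress.ip_address raises ValueError if the result is malformed.
        addresses[service["name"]] = str(ipaddress.ip_address(candidate))
    return addresses


print(service_addresses(json.loads(SAMPLE)))
```

A typo such as an ip_post of "480" makes ipaddress.ip_address raise a ValueError, so the mistake surfaces before the file is handed to the cluster script.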

Once done, make sure that the JSON file you saved is syntactically valid, for example by using the jq utility:

cluster# jq . /tmp/Services-core-nats-server.conf

A valid file will be displayed on screen; if the file contains a syntax error, an explanatory message will instead provide a hint to fix the problem. Some possible messages are shown next.

parse error: Expected separator between values at line 7, column 21

parse error: Objects must consist of key:value pairs at line 12, column 10

Note

Even if multiple errors are present in the file, only one error message is shown at a time, so rerun jq after each fix until it prints the whole content of the file instead of an error message: this proves that the file contains valid JSON.
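
If jq is not available, the same check can be sketched with Python's standard json module. This is only an illustrative alternative, not part of the NetEye tooling: the helper name check_json is hypothetical, and the wording of Python's error messages differs from the jq messages shown above.

```python
import json


def check_json(text):
    """Return None if text is valid JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # Like jq, the parser stops at the first error it encounters.
        return f"{err.msg} at line {err.lineno}, column {err.colno}"
    return None


good = '{"volume_group": "vg00", "Services": []}'
bad = '{"volume_group": "vg00" "Services": []}'  # comma missing between the pairs

print(check_json(good))
print(check_json(bad))
```

To check the edited file itself, read it first and pass its content to the helper, for example check_json(open("/tmp/Services-core-nats-server.conf").read()).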

Finally, let the cluster pick up the changes and configuration.

cluster# cd /usr/share/neteye/scripts/cluster
cluster# ./cluster_service_setup.pl -c /tmp/Services-core-nats-server.conf

Cluster Management Commands

The most important commands for checking the status of a (NetEye) Cluster and troubleshooting problems are:

  • drbdmon, a small utility to monitor the DRBD devices and connections in real time

  • drbdadm, DRBD’s primary administration tool

  • pcs, used to manage a cluster, verify its resources, constraints, fencing devices and much more

Hint

You can find more information about all their functionalities and sub-commands in their respective manual pages: drbdmon, drbdadm, and pcs.

In the remainder, we show some typical uses of these commands, starting from the simplest one.

cluster# drbdmon

As its name implies, this command monitors what is happening in DRBD and shows a lot of information about the DRBD status in real time. Within the interface, any resource highlighted in red is in a degraded state and therefore needs to be inspected and fixed. Press p to show only problematic resources.

The next command is the Swiss army knife of DRBD and is used to carry out all configuration, tuning, and management of a DRBD infrastructure. The most important option of the drbdadm command is -d (long option: --dry-run): with it, the command goes through the same steps as it would without the option, but makes no changes to the system. Always use this option before making any change to the configuration, to check for possible problems and unexpected side effects.

The command itself has a lot of options and subcommands, extensively described in the above-mentioned man page. Within a NetEye Cluster, the most used subcommand is perhaps

cluster# drbdadm --dry-run adjust all

This command checks the content of the configuration file and synchronises the configuration on all nodes. As given, the command only shows what would happen; remove the --dry-run option to actually run it and make changes.
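
The habit of running the dry run first and the real command second can be wrapped in a small helper. The Python sketch below is only an illustration: the function names adjust_command and run_adjust are hypothetical, and run_adjust only works on a node where drbdadm is installed. The command-line arguments are exactly those shown above.

```python
import subprocess


def adjust_command(dry_run=True):
    """Build the argument list for 'drbdadm adjust all'; dry_run adds --dry-run."""
    command = ["drbdadm"]
    if dry_run:
        command.append("--dry-run")
    command += ["adjust", "all"]
    return command


def run_adjust(dry_run=True):
    """Run the command on a cluster node; raises CalledProcessError on failure."""
    return subprocess.run(adjust_command(dry_run), check=True)


# Print both variants without executing anything.
print(" ".join(adjust_command()))
print(" ".join(adjust_command(dry_run=False)))
```

Because dry_run defaults to True, changes are made only when the caller explicitly passes dry_run=False.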

The third command, pcs, is the main tool to manage the corosync/pacemaker stack of a cluster. Like drbdadm, it has a number of subcommands and options.

cluster# pcs status

This command prints the current status of the NetEye Cluster, its nodes, and its resources, and allows you to check whether there are any ongoing issues.

In the output, right above the Full list of resources, all the nodes (if any) are shown along with their state: Online, Offline, and Standby are the most common.

The presence of Offline nodes, that is, nodes disconnected from the cluster or even shut down, is usually a sign of an ongoing problem and requires a quick reaction. Indeed, the only legitimate situation in which a node can be Offline is a planned reboot (for example, for a kernel update or a hardware upgrade).

On the other hand, nodes should be in Standby state only during updates: if this is not the case, it is worth checking that node for problems.

If any resource in the list of resources is marked as Stopped, some log entries for each stopped service appear below the list, right above the Daemon status. While these logs should suffice to hint at the reason for the resource being stopped, you can check the full status and log files with the commands systemctl status <resource name> and journalctl -u <resource name>.

Other useful subcommands of pcs are:

cluster# pcs property list

This command returns some information about the cluster; its output is similar to the following snippet:

Cluster Properties:
cluster-infrastructure: corosync
cluster-name: NetEye
dc-version: 1.1.23-1.el7_9.1-9acf116022
have-watchdog: false
last-lrm-refresh: 1648467995
stonith-enabled: false
Node Attributes:
neteye02.neteyelocal: standby=on

The important points here are:

  • stonith-enabled: false. This value should always be true; a value of false, like in the example, means that cluster fencing has been disabled. This should happen only during maintenance windows; otherwise an immediate inspection is required, because running without fencing may result in a split-brain situation. Note that fencing must always be configured on a cluster before starting any resource.

  • neteye02.neteyelocal: standby=on. The node is in Standby status, meaning that it cannot host any running services or resources, but it still votes in the quorum.
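
These two checks can also be automated. The Python sketch below parses output shaped like the snippet above and reports disabled fencing and nodes in standby. It assumes that the output follows exactly that layout (section headers ending in a colon, followed by key: value lines), which may vary between pcs versions; the function names parse_properties and find_warnings are hypothetical.

```python
SAMPLE_OUTPUT = """\
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: NetEye
dc-version: 1.1.23-1.el7_9.1-9acf116022
have-watchdog: false
last-lrm-refresh: 1648467995
stonith-enabled: false
Node Attributes:
neteye02.neteyelocal: standby=on
"""


def parse_properties(output):
    """Split the output into {section name: {key: value}}."""
    sections = {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if not value:
            # A line such as "Cluster Properties:" starts a new section.
            current = sections.setdefault(key, {})
        elif current is not None:
            current[key] = value
    return sections


def find_warnings(sections):
    """Return human-readable warnings for the two points discussed above."""
    warnings = []
    properties = sections.get("Cluster Properties", {})
    # Only an explicit "false" is flagged; a missing property is not.
    if properties.get("stonith-enabled") == "false":
        warnings.append("fencing is disabled (stonith-enabled is false)")
    for node, attributes in sections.get("Node Attributes", {}).items():
        if "standby=on" in attributes.split():
            warnings.append(f"{node} is in standby")
    return warnings


for warning in find_warnings(parse_properties(SAMPLE_OUTPUT)):
    print(warning)
```

On a cluster node, the text to parse could be obtained by capturing the output of pcs property list, for example with subprocess.run and capture_output=True.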

See also

Fencing is described in great detail in NetEye’s blog post Configuring Fencing on Dell Servers.

cluster# pcs constraint

Returns a list of all active constraints on the cluster.

cluster# pcs resource show [cluster_ip]

This command shows all the configured resources; if the parameter cluster_ip is added, it shows only the Cluster IP resource.

See also

For more information, troubleshooting options, and debugging commands, you can refer to Red Hat’s Reference Documentation for Pacemaker and high availability, in particular Chapters 3. The pcs CLI, 9.7. Displaying fencing devices, and 10.3. Displaying Configured Resources.