Troubleshooting¶
The Update and Upgrade procedures can stop for a variety of reasons. This section collects the most frequent cases and provides guidelines to resolve the issue and continue the procedures.
In some cases you might want to inspect the logs of the various commands that have been executed.
All the logs are stored under /neteye/local/os/log/neteye_command/.
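For example, to find the most recent log file and follow it (a minimal sketch, assuming the standard log location above and a root shell; the file name is a placeholder):
# ls -lt /neteye/local/os/log/neteye_command/ | head -n 5
# tail -f /neteye/local/os/log/neteye_command/<most-recent-file>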
If you find a problem that is not covered on this page, please refer to the official channels (sales, consultant, or support portal) for help and directions on how to proceed.
A check fails¶
In this case, an informative message will point out the check that failed, allowing you to inspect and fix the problem.
For example, if the exit message is similar to the following one, you need to manually install the latest updates.
"Found updates not installed"
"Example: icingacli, version 2.8.2_neteye1.82.1"
Then, after the updates are installed, you can run the command again and it will start over its tasks.
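A minimal sketch of installing the pending updates manually, assuming yum is the package manager in use on your NetEye node (the package name comes from the example message above; adapt it to the packages actually reported):
# yum update icingacli
or simply install all pending updates:
# yum update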
An .rpmnew and/or .rpmsave file is found¶
This can happen when some of the installed packages have been customised. Check section Migrate .rpmsave and .rpmnew Files for directions on how to proceed. Once done, remember to run neteye update again.
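To locate such files, you can search the filesystem (a minimal sketch; adjust the search paths to your installation):
# find /etc /neteye -name '*.rpmnew' -o -name '*.rpmsave'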
A cluster resource has not been created¶
During a NetEye Cluster upgrade, it can happen that new cluster resources need to be created before running the neteye install script. Creation of a resource must be done manually, and directions can be found in section 4. Additional Tasks of the Cluster Upgrade from 4.44 to 4.45.
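For illustration only, creating a systemd-based cluster resource with pcs typically looks like the following sketch; the resource and group names here are placeholders, so always use the exact definitions given in the upgrade guide:
# pcs resource create my_service systemd:my-service --group my_resource_group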
A health check is failing¶
…during the update/upgrade procedure¶
The NetEye update and upgrade commands run all the deep health checks to ensure that the NetEye installation is healthy before running the update or upgrade procedure. It might happen, however, that one of the checks fails, thus preventing the procedure from completing successfully.
Hence, to solve the problem manually, you should follow the directions that can be found in section The NetEye Health Check.
Once the issue is solved, the NetEye update/upgrade commands can be run again.
…after the finalization procedure¶
After the finalization procedure has successfully ended, you might notice in the Problems View that some health check fails and is in state WARNING. The reason is that you are using some module that needs to be migrated, because a breaking change has been introduced in the release.
Hence, you should go to the Problems View and check which health check is failing. There you will also find instructions for the correct migration of the module, which in almost all cases amounts to enabling an option: the actual migration will then be executed manually.
How to check the NetEye Cluster status¶
Run the following cluster command:
# pcs status
and please ensure that:
Only the last (N) node MUST be active
All cluster resources are marked “Started” on the last (N) node
All cluster services under “Daemon Status” are marked active/enabled on the last (N) node
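For a view focused on the resources alone, you can also run the following (a minimal sketch; the output format may vary slightly between pcs versions):
# pcs status resources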
How to check DRBD status¶
Check whether the DRBD status is OK by using the drbdmon command, which updates the DRBD status in real time.
See also
Section 4.2 of DRBD’s official documentation contains information and details about the possible statuses.
https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-check-status
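As an alternative to the live monitor, a one-shot snapshot of the status can be obtained with the following command (assuming DRBD 9, the version covered by the guide above):
# drbdadm status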
Elasticsearch Cluster upgrade¶
To guarantee an efficient yet reliable upgrade of the nodes in the Elasticsearch cluster, NetEye adopts a strategy that upgrades nodes in parallel when possible, in order to save time. To troubleshoot potential issues during the upgrade, it is important to understand how the procedure works.
Parallel Upgrade of Elasticsearch Nodes¶
Step 1: Group Nodes by Role
Nodes are organized into logical groups based on their roles:
Master Nodes Group
Nodes with the master role.
Data Nodes Group
Nodes with the data role (excluding master).
Data Tier Groups
Nodes with tier roles: hot, warm, cold, frozen.
If a node has multiple tier roles, a combined group is created.
Example: a node with cold and frozen roles is placed in a group named cold+frozen. All nodes with either cold or frozen roles are included in this group.
Each node belongs to only one group.
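To see which roles each node holds, and therefore which group it will fall into, you can query the cluster before the upgrade (a sketch using the standard Elasticsearch _cat API; adjust the URL, and add certificates or credentials as required by your secured setup):
# curl -s 'https://localhost:9200/_cat/nodes?v&h=name,node.role'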
Step 2: Upgrade Sequence
We upgrade the groups in the following order to maintain cluster health:
Data Tier Groups
Nodes are upgraded in parallel, but only one node per group at a time.
For example:
One node from hot
One node from warm
One node from cold+frozen
Data Nodes
Nodes are upgraded sequentially, one node at a time.
Master Nodes
Nodes are upgraded sequentially, one node at a time.
Waiting for Shard Relocation¶
Updating or upgrading Elasticsearch requires restarting the service for the changes to take effect. During this process, the shards allocated on the node being restarted are temporarily unassigned until the node is back online.
To ensure that upgrading node X does not cause shards to become completely unavailable, the procedure by default waits until there are no unassigned shards whose replica is allocated on node X before proceeding with its upgrade.
Note
If a shard has no replicas (i.e., it is a primary shard without any replicas), it will become unavailable during the upgrade of the node hosting it.
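To check whether any of your indices have no replicas, and would therefore be briefly unavailable, you can list the replica count per index (same assumptions about URL and authentication as in the sketch above):
# curl -s 'https://localhost:9200/_cat/indices?v&h=index,rep'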
By default, each node waits up to one hour for shard relocation to complete before continuing with the upgrade. If relocation is not completed within this time frame, the procedure fails with an error, allowing you to investigate the issue.
In installations with large volumes of data, relocation may take longer. In such cases, you may choose to increase the waiting time or to skip the relocation check entirely. Refer to the following sections for instructions.
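If the procedure fails because relocation did not complete in time, you can inspect which shards are still unassigned and why (a sketch using standard Elasticsearch APIs; adjust URL and authentication to your setup):
# curl -s 'https://localhost:9200/_cat/shards?v' | grep UNASSIGNED
# curl -s 'https://localhost:9200/_cluster/allocation/explain?pretty'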
Customize maximum relocation waiting time¶
You can customize the maximum waiting time for shard relocation by specifying two parameters when launching the update or upgrade command: the number of retries and the seconds between each retry.
For example, to set a maximum waiting time of two hours (120 retries × 60 seconds between retries = 7200 seconds):
neteye# (nohup neteye update --extra-vars '{"es_status_wait_retries":120,"es_status_wait_seconds_between_retries":60}' &) && tail --retry -f nohup.out
neteye# (nohup neteye upgrade --extra-vars '{"es_status_wait_retries":120,"es_status_wait_seconds_between_retries":60}' &) && tail --retry -f nohup.out
Skipping relocation wait¶
If shard availability during the upgrade is not required in your installation, you can skip the relocation wait using the skip_es_status_to_wait parameter:
neteye# (nohup neteye update --extra-vars '{"skip_es_status_to_wait":true}' &) && tail --retry -f nohup.out
neteye# (nohup neteye upgrade --extra-vars '{"skip_es_status_to_wait":true}' &) && tail --retry -f nohup.out
Waiting for a particular cluster status¶
If the default behavior of waiting for shard relocation is not suitable for your installation, you can configure the procedure to wait for a specific cluster status before upgrading each node.
For example, to wait until the cluster reaches green status:
neteye# (nohup neteye update --extra-vars '{"es_status_to_wait": "green"}' &) && tail --retry -f nohup.out
neteye# (nohup neteye upgrade --extra-vars '{"es_status_to_wait": "green"}' &) && tail --retry -f nohup.out
Note
Supported values are: green and yellow.
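To check the current status of the cluster before choosing a value, you can query the health API (same assumptions about URL and authentication as above):
# curl -s 'https://localhost:9200/_cluster/health?pretty'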