
Outages

An outage is a period of time during which a Monitored Object (i.e., a host or a service) or a Business Process is in a non-available state. In the simplest case, an outage is the period between two consecutive available statuses during which an event that negatively affects the Availability takes place. Only HARD events count: they are the only states that affect Availability negatively and that are used for the calculation of outages.

Each Outage has an associated Duration, which is the total amount of time during which the Monitored Object or Business Process was in a non-available state.

During each calculation period, the sum of the duration of all outages plus the duration of Available statuses must equal the duration of the calculation period itself.
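The constraint above can be illustrated with a minimal Python sketch. The function name `available_time` and the sample values are hypothetical and are not part of NetEye; the sketch only shows that the available time is whatever remains of the calculation period once all outage durations are subtracted.

```python
from datetime import datetime, timedelta
from typing import Iterable


def available_time(period_start: datetime,
                   period_end: datetime,
                   outage_durations: Iterable[timedelta]) -> timedelta:
    """Return the time spent in an Available status within one calculation period.

    Hypothetical helper: it assumes the outage durations have already been
    computed for this calculation period.
    """
    period = period_end - period_start
    total_outage = sum(outage_durations, timedelta())
    if total_outage > period:
        raise ValueError("outages cannot last longer than the calculation period")
    return period - total_outage


# Example: a calculation period covering January 2020 with two outages.
start, end = datetime(2020, 1, 1), datetime(2020, 2, 1)
outages = [timedelta(hours=2), timedelta(minutes=30)]
available = available_time(start, end, outages)

# Outage time plus available time adds up to the whole calculation period.
assert sum(outages, timedelta()) + available == end - start
```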

Outages appear as part of each SLM report, if configured. You can choose to show either only the number of outages or a detailed list of outages for each monitored object in the report. It is also possible to include an Outage Annotation for every outage; if defined, it is rendered in the SLM report. More information can be found in the Outage configuration section.

Whenever the availability is below the required target availability, the monitored host or service is highlighted. You can drill down on any calculation period with an Availability below 100% to open a new page that presents a detailed report of the related events.

If an Operational Time has been defined in an SLA, the definition and computation of an outage change slightly. In this case, Outages are recorded only when they fall within an Operational Time; a state of non-availability that takes place outside an Operational Time is called an Unavailability Period. In other words, an Unavailability Period is an interval of time during which a service, host, or Business Process is not available; it becomes an Outage only if it falls at least partially within an Operational Time.

In the remainder of this section, we show examples of how Outages are calculated with Operational Time and how they are handled within and across Calculation Periods.

Records of Outages with Operational Time

This section enumerates all the basic cases that are taken into account for the calculation of outages, by considering a scenario with one Calculation Period composed of two Operational Time Ranges: the first from 8AM to 9AM and the second from 10AM to 11AM.

More complex scenarios can be reproduced by suitably combining the following cases.

Throughout this section, in all the diagrams and in the text, OT stands for Operational Time, and the OT row shows all the Operational Time Ranges. UnPs are Unavailability Periods, i.e., the intervals during which a host or service is unavailable.

Unavailability Periods outside the Operational Time

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1---1   2---2
          |       |
          |       \-> 1 outage of 30 minutes
          \-> no outage

In this example the Unavailability Period 1 [7:20AM-7:40AM] starts and ends outside any Operational Time Range, so it causes no outage.

The Unavailability Period 2 [8AM-8:30AM], instead, falls completely inside the first Operational Time Range, so it causes an outage that starts at 8AM and ends at 8:30AM, with a duration of 30min.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 8:30AM*

  • Outage duration: 30min

Unavailability Periods starting or ending outside Operational Times

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs      1---------------------1
                |
                \-> 1 outage of 1 hour

The UnP 1 [7:30AM-9:30AM] starts before the OT range [8AM-9AM] and ends after the OT range [8AM-9AM].

Since Outages exist only during the Operational Time, the UnP causes an Outage that starts at 8AM and ends at 9AM, i.e., at the beginning and at the end of the OT range [8AM-9AM].

The duration of the Outage is 1h, which is the amount of Operational Time affected by the Outage.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 9AM

  • Outage duration: 1h

Unavailability Periods across different Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs      1--------------------------------1
                |
                \-> 1 single outage of 90 minutes

The UnP 1 [7:30AM-10:30AM] starts before the first OT range [8AM-9AM], and ends during the second OT range [10AM-11AM].

This means that the Unavailability Period causes an Outage that starts at 8AM (the start of OT range [8AM-9AM]) and ends at 10:30AM, after the second OT range starts. The duration of the Outage is 1h 30min, since the Outage affected the full OT range [8AM-9AM] and the first 30min of the OT range [10AM-11AM].

Reported Outage:

  • Outage start: 8AM

  • Outage end: 10:30AM*

  • Outage duration: 1h 30min

Temporary Uptimes between Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1---------------------1   2-----------------2
                            ↓       ↓
                       1 single outage of 2 hours

In this case the UnP 1 [7:20AM-9:20AM] entirely covers the OT range [8AM-9AM] and the UnP 2 [9:40AM-11:30AM] entirely covers the subsequent OT range [10AM-11AM].

The result is a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h. Only one Outage is reported because the Monitored Object or Business Process was continuously unavailable during the Operational Time: the short interval during which it was available [9:20AM-9:40AM] lies completely outside the OT and is therefore disregarded.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 11AM

  • Outage duration: 2h

Temporary Uptimes during Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1----------------1      2----------2
                |                  |
                |                  \-> 1 outage of 30 minutes
                \-> 1 outage of 50 minutes

Here the UnP 1 [7:20AM-8:50AM] ends just before the end of the first OT range [8AM-9AM], and the UnP 2 [9:30AM-10:30AM] overlaps with the beginning of the second OT range [10AM-11AM].

This causes two different Outages, because during both Operational Time Ranges the Monitored Object or Business Process changes status (here, it becomes available). The first Outage starts at 8AM and ends at 8:50AM, with a duration of 50min. The second Outage starts at 10AM and ends at 10:30AM, with a duration of 30min.

1st Reported Outage:

  • Outage start: 8AM

  • Outage end: 8:50AM*

  • Outage duration: 50min

2nd Reported Outage:

  • Outage start: 10AM

  • Outage end: 10:30AM*

  • Outage duration: 30min

Unavailability Periods starting or ending on OT range start or end

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1------------------1          2-----------------2
                          ↓            ↓
                       1 single outage of 2 hours

If an Unavailability Period starts or ends at exactly the same instant at which an OT range starts or ends, the start or end of the Unavailability Period is always considered as happening outside the OT range. This case is very similar to the first case presented, with the only difference that two events coincide with the start or end of an Operational Time Range.

The UnP 1 [7:20AM-9AM] ends at exactly the same moment as the first OT range [8AM-9AM], while UnP 2 [10AM-11:50AM] starts at exactly the same moment as the second OT range [10AM-11AM].

This causes a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h.

This happens because the end of UnP 1 [7:20AM-9AM] is considered as happening after the first OT range [8AM-9AM] and the start of UnP 2 [10AM-11:50AM] is considered as happening before the second OT range [10AM-11AM]. So the Monitored Object or Business Process was continuously unavailable for the whole Operational Time.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 11AM

  • Outage duration: 2h
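The cases above can be summarised in a short Python sketch. It is not NetEye's implementation: the names `hm`, `overlap` and `compute_outages` are hypothetical, times are simplified to minutes since midnight, and the Operational Time Ranges and Unavailability Periods are assumed to be non-overlapping. Each UnP is clipped to the OT ranges, and two consecutive clipped pieces are merged into one Outage when no Operational Time lies between them, that is, when the object was never available during Operational Time.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in minutes since midnight


def hm(hours: int, minutes: int = 0) -> int:
    """Convert a time of day to minutes since midnight."""
    return hours * 60 + minutes


def overlap(a: Interval, b: Interval) -> int:
    """Length in minutes of the overlap between two intervals (0 if none)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def compute_outages(ot_ranges: List[Interval],
                    unps: List[Interval]) -> List[Tuple[int, int, int]]:
    """Return (start, end, duration) for every Outage, in chronological order."""
    # Clip each Unavailability Period to each OT range. Pieces of zero length
    # are dropped, so a UnP that only touches an OT boundary is treated as
    # lying outside that range.
    pieces = sorted(
        (max(u[0], o[0]), min(u[1], o[1]))
        for u in unps for o in ot_ranges
        if overlap(u, o) > 0
    )
    outages: List[List[int]] = []
    for start, end in pieces:
        if outages:
            gap = (outages[-1][1], start)
            # Merge when the gap contains no Operational Time at all.
            if sum(overlap(gap, o) for o in ot_ranges) == 0:
                outages[-1][1] = end
                outages[-1][2] += end - start
                continue
        outages.append([start, end, end - start])
    return [(s, e, d) for s, e, d in outages]


# The two Operational Time Ranges used in all the diagrams of this section.
OT_RANGES = [(hm(8), hm(9)), (hm(10), hm(11))]

# "Temporary Uptimes during Operational Time ranges": two separate Outages.
print(compute_outages(OT_RANGES, [(hm(7, 20), hm(8, 50)), (hm(9, 30), hm(10, 30))]))
# → [(480, 530, 50), (600, 630, 30)]
```

Note that the duration counts only the Operational Time covered by the Outage, so an Outage reported from 8AM to 10:30AM has a duration of 90 minutes rather than 150.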

Outages and Calculation Periods

The previous cases consider only Operational Time Ranges: the interval between two consecutive Time Ranges is disregarded, so an Unavailability Period that spans multiple OT Ranges is considered a single Outage.

In this section, we define how Outages are considered when Calculation Periods are taken into account.

The simplest case, when one Unavailability Period (and therefore the Outage) is entirely included in a Calculation Period, results in a single Outage reported in that Calculation Period.

However, when an Unavailability Period spans two Calculation Periods, the Outage is split into two Outages, one for the first Calculation Period and one for the second, even though Calculation Periods are by default contiguous time intervals. The same applies when multiple Calculation Periods are involved, as the following diagram shows. Here, CPs are Calculation Periods and OUT are the recorded Outages.

    Jan 2020   Feb        Mar       Apr       May      June
     |----------|----------|----------|---------|--------|
CPs  |----------|----------|----------|---------|--------|
UnPs   1-------------1         2----------------------2
OUT    |---O1---|-O2-|         |--O3--|---O4----|--O5-|

In this example, although there are only two UnPs, five Outages will be reported, one for each month in which the Monitored Object or Business Process was not available. In summary (note that the actual Duration in the report will be in HH:MM:SS; here we use days for simplicity), the first Unavailability Period results in:

Reported Outage (January):

  • Outage start: 7th of January

  • Outage end: 31st of January

  • Outage duration: 25 days

Reported Outage (February):

  • Outage start: 1st of February

  • Outage end: 14th of February

  • Outage duration: 14 days

while the second Unavailability Period results in:

Reported Outage (March):

  • Outage start: 10th of March

  • Outage end: 31st of March

  • Outage duration: 22 days

Reported Outage (April):

  • Outage start: 1st of April

  • Outage end: 30th of April

  • Outage duration: 30 days

Reported Outage (May):

  • Outage start: 1st of May

  • Outage end: 19th of May

  • Outage duration: 19 days
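The split across Calculation Periods can be sketched in Python as follows. This is only an illustration under stated assumptions: Calculation Periods are taken to be calendar months, durations are counted in whole days with both the first and the last day included (as in the simplified summary above), and the function name `split_by_month` is hypothetical.

```python
import calendar
from datetime import date, timedelta
from typing import List, Tuple


def split_by_month(start: date, end: date) -> List[Tuple[date, date, int]]:
    """Split the inclusive range [start, end] into one Outage per calendar month.

    Returns a list of (outage_start, outage_end, duration_in_days) tuples.
    """
    outages = []
    current = start
    while current <= end:
        # Last day of the month that contains `current`.
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = date(current.year, current.month, last_day)
        piece_end = min(month_end, end)
        outages.append((current, piece_end, (piece_end - current).days + 1))
        # Continue from the first day of the next Calculation Period.
        current = piece_end + timedelta(days=1)
    return outages


# First Unavailability Period of the diagram: 7th of January to 14th of February.
for outage_start, outage_end, days in split_by_month(date(2020, 1, 7), date(2020, 2, 14)):
    print(outage_start, outage_end, days)
# 2020-01-07 2020-01-31 25
# 2020-02-01 2020-02-14 14
```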