User Guide

Service Level Management

Data about availability and resource consumption produced by NetEye’s monitoring features can be used to automatically generate various types of reports that show the past behaviour of the services, hosts, and systems controlled by NetEye. This section introduces the concepts used by Service Level Management, presents the types of reports available, describes the algorithms used to create the reports, and finally shows how to schedule the reports.

Concepts

The Service Level Management Module (SLM) allows you to set up contracts between an SLM customer and the service provider, with the purpose of sending a periodic report to the customer.

There are two types of contracts available:

  • The Availability contract measures the availability of hosts and services within a given period of time, to verify whether the level of availability required by the customer has been met. An availability contract is defined by a customer, an SLA type and a set of monitored objects.

  • The Resource Contract shows a dashboard with diagrams that show the load of the monitored objects within a given time range. Dashboards are Grafana-based and need to be created for each contract. A resource contract is defined by a customer and a set of diagrams.

While setting up a contract is quite straightforward and is described in the dedicated sections below, a few points need to be understood correctly in order to avoid possible problems. They are described in the next section, which you can use as a reference.

Important concepts

  • Users, customers, and their permissions

    When configuring a new object in the SLM module, keep in mind the role that permissions play in the management of customers. NetEye users with access to the SLM module can see and assign to a customer only the roles that they themselves belong to.

    For example, if a user Jake has roles B and C, then he can only see and assign roles B or C to a customer. The only exception is users with Role Administrative access in NetEye, who can assign every role.

    This affects both Availability and Resource contracts: if the SLM Module contains contracts involving a role that is not assigned to a user, that user will not see them.

    It is therefore important to assign appropriate roles to each user of the SLM Module, to allow them to create and manage the contracts of their customers.

  • Module permissions

    While the roles of users and customers determine their access to contracts, the module permissions are also relevant and need to be clearly understood.

    • Full Module Access. It is essentially a shortcut that enables all the permissions below. A user with full module access can see all the content of the module related to their role.

    • General Module Access. This permission only gives the ability to load the module configuration and provides only View access to Event Adjustment. It is mandatory for enabling the following permissions, and it is also necessary to enable the SLM extension of the Reporting module. To give Add/Edit/Delete permissions you need to enable slm/admin.

    • slm/admin. With this permission it is possible to view and edit everything that the user’s role allows them to see.

    • slm/report-adjustment-override. Granting this permission allows the user to modify the Consider Event Adjustments field in the Reporting module, provided the Reporting module’s SLM extensions have been enabled.

Note

Users with slm/admin or slm/report-adjustment-override permissions but without General Module Access can neither see nor access the SLM module (or the SLM extensions of the Reporting module). This is the expected behaviour: General Module Access is required to load the configuration of the SLM module and activate the extensions.

  • Operational Time explained

    The operational time does not need to indicate a single contiguous extent of time. For instance, it may be defined as “Business Hours” (i.e., “Monday through Friday, 9:00AM to 5:00PM”), which would exclude evening and early morning hours. You would construct such a Time Period in Director by specifying each individual contiguous Time Range separately, e.g. first “monday 9:00AM to 5:00PM”, then “tuesday 9:00AM to 5:00PM”, etc.

    When calculating availability, the monitored object’s initial state is valid until the first state change event (if one exists) during that Time Range, and the last state change event occurring in the Time Range is valid until the end of the Time Range. Thus given the “Business Hours” example above where the initial state of a service is OK on Monday at 9:00AM, and a single state change event of type CRITICAL occurs at 4:30PM, then the resulting availability will be 7 and a half hours of OK and 30 minutes of CRITICAL.

  • How Downtime affects calculation

    A downtime is a scheduled period in which a host or service is not available. Suppose we have an overall operational time of 10 seconds where a series of state change events result in:

    • 1 second where the host is DOWN

    • 7 seconds with the host in an OK state

    • 2 more seconds where the host is DOWN

    Let’s also assume that the first second of DOWN was unexpected, while the final 2 seconds of this period were scheduled downtime.

    • If the Downtime box is checked, the availability will be calculated as (OK + DOWNTIME)/OPERATIONAL TIME = (7s + 2s)/10s, therefore 9/10 or 90%.

    • If instead the box is not checked, the availability will be: OK/OPERATIONAL TIME = 7s / 10s, therefore 7/10 or 70%.
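The two calculations above (state durations within an Operational Time range, and the effect of the Downtime box) can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this section, not NetEye's implementation; the function names `state_durations` and `availability` and the chosen date are assumptions made for the example.

```python
from datetime import datetime, timedelta

def state_durations(range_start, range_end, initial_state, events):
    """Sum the time spent in each state within one contiguous Time Range.

    `events` is a list of (timestamp, new_state) tuples. The initial state is
    valid until the first event; the last event is valid until the range ends.
    """
    durations = {}
    state, cursor = initial_state, range_start
    for timestamp, new_state in sorted(events):
        if not (range_start <= timestamp < range_end):
            continue  # events outside the Time Range are ignored in this sketch
        durations[state] = durations.get(state, timedelta()) + (timestamp - cursor)
        state, cursor = new_state, timestamp
    durations[state] = durations.get(state, timedelta()) + (range_end - cursor)
    return durations

def availability(ok_seconds, downtime_seconds, operational_seconds, consider_downtime):
    """Availability in percent; scheduled downtime counts as available if requested."""
    available = ok_seconds + (downtime_seconds if consider_downtime else 0)
    return 100.0 * available / operational_seconds

# "Business Hours" example: OK at 9:00AM, one CRITICAL event at 4:30PM.
# (1 January 2024 is a Monday; the date itself is arbitrary.)
monday = state_durations(
    datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 17, 0), "OK",
    [(datetime(2024, 1, 1, 16, 30), "CRITICAL")],
)
# monday["OK"] is 7h 30min, monday["CRITICAL"] is 30min

# 10-second example: 7s OK, 2s scheduled downtime, 1s unexpected DOWN.
with_downtime = availability(7, 2, 10, consider_downtime=True)      # 90.0
without_downtime = availability(7, 2, 10, consider_downtime=False)  # 70.0
```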

Availability

This section describes all the necessary steps to create new availability contracts.

Defining SLA Types

An SLA is a commitment between a service provider and a client, defining particular aspects of a service. Within the SLM module, an SLA type can be associated with a customer contract and defines limits for metrics to be guaranteed by the service provider as well as the exact temporal boundaries during which the metrics must be guaranteed.

When the SLM module is first launched, the SLA Type panel is focused, displaying a row for each configured SLA type. Additional panels allow you to define Contracts and Customers respectively. Search functionality is available for all three panels, but the text being searched will be restricted to the Name and Contract columns in the first two panels.

Note

Before you can successfully set up SLA types and contracts, you need to have defined a few other objects in Director, namely one or more TimePeriods (as Operational Time) and a filter expression (as Object Filter). An Object Filter defines the host(s) or service(s) to which the contract applies; in other words, the availability required by the customer will be calculated on these hosts or services. Examples of valid input for the Object Filter are: host_name=*neteye* or service_description=jenkins*.
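To get a feel for which objects a wildcard expression such as host_name=*neteye* selects, the matching can be approximated with Python's standard `fnmatch` module. This is a rough sketch only: the real Object Filter is evaluated by Icinga and supports more than a single `column=pattern` pair, and its case sensitivity may differ. The `matches` helper and the host names are made up for the example.

```python
from fnmatch import fnmatchcase

def matches(filter_expression, obj):
    """Check one `column=pattern` expression against a dict of object attributes."""
    column, pattern = filter_expression.split("=", 1)
    return fnmatchcase(str(obj.get(column, "")), pattern)

# Hypothetical monitored hosts.
hosts = [
    {"host_name": "neteye-master"},
    {"host_name": "my-neteye-sat"},
    {"host_name": "db01"},
]
selected = [h["host_name"] for h in hosts if matches("host_name=*neteye*", h)]
# selected == ["neteye-master", "my-neteye-sat"]
```

A filter that selects no object at all would leave the contract without anything to report on, so it is worth checking the result before saving.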

Before you can create an SLA contract to see the availability of monitored objects for a customer, you must first define the parameters for the SLAs you intend to use. You can do this at SLM / SLA Types, using the following options:

  • Name: The name of this SLA type (e.g., “Gold level”)

  • Description: A more user-friendly description of the SLA type

  • Operational Time: The exact time(s) during which all elements necessary for a monitored object to function properly should in fact be in operation. The operational time is precisely defined by a TimePeriod object in Director. This field lets you select either a pre-defined Timeperiod object or one that you have created at Icinga Director / Timeperiods / Timeperiods.

  • Calculation Period: The unit of time over which the data will be aggregated into service level reports. For instance, if you want an availability report for the current year, you might want it broken down into “Monthly” or “Weekly” subsections.

  • Availability %: The target percentage of SLA availability for the calculation period. For hosts, only Down states have a negative impact on availability. For services instead, both Critical and Unknown (but not Warning) will decrease availability.

  • Downtime: When this box is checked, the scheduled downtimes of monitored objects will be taken into account for any related availability calculations. When downtime is in effect, the related monitored object is considered available, regardless of its actual state during that period. Once the scheduled downtime ends, the object’s state will be reset to the value of its most recent state change.
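The rule for which states reduce availability can be written down compactly. The following sketch assumes that the time spent in each state has already been summed up (in seconds) and ignores scheduled downtime; the names `NEGATIVE_STATES` and `availability_percent` are illustrative and not part of the SLM module.

```python
# States that decrease availability, per the "Availability %" option above.
NEGATIVE_STATES = {
    "host": {"down"},
    "service": {"critical", "unknown"},  # warning does not count against availability
}

def availability_percent(object_type, seconds_by_state):
    """Share of time (in percent) not spent in a negative state."""
    total = sum(seconds_by_state.values())
    unavailable = sum(
        seconds
        for state, seconds in seconds_by_state.items()
        if state in NEGATIVE_STATES[object_type]
    )
    return 100.0 * (total - unavailable) / total

# A service observed for 8 hours of operational time.
service_day = {"ok": 21600, "warning": 3600, "critical": 1800, "unknown": 1800}
service_availability = availability_percent("service", service_day)  # 87.5
```

The resulting percentage is what gets compared with the Availability % target of the SLA type.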

At the moment the only supported TimePeriod values (i.e., Ranges) are exact dates and names of weekdays. There is also currently no support for excluded ranges and included ranges.

More precise definitions of Calculation Period, Availability, Downtime, and other terms can be found in the glossary, while an example of the algorithm on which the SLM is based is shown in How the availability is calculated.

The following table defines each calculation period more precisely:

Calculation Period    Unit of Time
daily                 from 00:00 to 24:00 of that same day
weekly_sunday         from 00:00 on Sunday to 24:00 on Saturday
weekly_monday         from 00:00 on Monday to 24:00 on Sunday
monthly_1             from 00:00 on the first day of one month to 24:00 on the last day of the same month
monthly_2             from 00:00 on the first day of one month to 24:00 on the last day of the subsequent month
monthly_3             from 00:00 on the first day of one month to 24:00 on the last day of the third month
monthly_4             from 00:00 on the first day of one month to 24:00 on the last day of the fourth month
monthly_6             from 00:00 on the first day of one month to 24:00 on the last day of the sixth month
monthly_12            from 00:00 on the first day of one month to 24:00 on the last day of the twelfth month

Note

The “last day” of a month may be the 28th or 29th for February, 30th or 31st otherwise.
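As an illustration of the monthly_N definitions, the first and last day of such a period can be computed with the standard library. The helper `monthly_period` is a hypothetical name used only for this sketch; the period runs from 00:00 on the first returned day to 24:00 on the second.

```python
import calendar
from datetime import date

def monthly_period(first_day, months):
    """Return (first day, last day) of a monthly_N period starting on `first_day`.

    `first_day` is expected to be the first day of a month and `months` is N.
    """
    # Position of the period's last month, counted in months since year 0.
    last_index = first_day.year * 12 + (first_day.month - 1) + (months - 1)
    year, month_index = divmod(last_index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]  # handles leap years
    return first_day, date(year, month_index + 1, last_day)

monthly_2 = monthly_period(date(2020, 1, 1), 2)    # 1 Jan 2020 .. 29 Feb 2020
monthly_12 = monthly_period(date(2021, 1, 1), 12)  # 1 Jan 2021 .. 31 Dec 2021
```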

Creating an SLA Customer

To provide the ability to group contracts by customer, the SLM module allows you to define specific customers that you can then associate with one or more SLA/Resource contracts (e.g., because each contract concerns different hosts or services of the customer, or different levels of availability).

To add a new customer for SLM, go to SLM / Customers and enter values for the following:

  • Name: The name of this customer

  • Description: A longer description for the customer

  • Role: The roles assigned to this customer, which give a set of permissions to access the resources to be reported. The Role is used only for Resource contracts, not for SLA contracts.

  • Report Template: The report template assigned to this customer, which will be used when generating the SLM report. If no template is selected, the default template will be used.

Creating SLA Contracts

Once you have defined an SLA type, you can begin creating Service Level Agreement contracts. Click on SLM / Contracts and enter appropriate values for the following options:

  • Name: The name of this contract

  • Description: A more user-friendly description of the contract

  • Customer: You can set in the Customer tab the customer whose monitored objects (typically hosts and services) will be included in the availability report. Only customers with the same role as the logged in user are displayed.

  • SLA Type: The type of SLA you defined in the section above

  • Consider Event Adjustments: This checkbox should be set by an administrator if you want to allow event adjustments to be considered when generating a report.

  • Objects Type: Can be set to host or service to include only host or only service objects, respectively, in the current Contract, or to all to consider both hosts and services.

  • Objects Filter: A set of monitored objects determined by an Icinga filter expression. It is important to check that the filter expression actually returns at least one monitored object.

  • Monitoring Views: Depending on the choice of Objects Type, the count of hosts, services, or both that match the object filter is shown here in parentheses. A click on the Hosts link or Services link will take you to the related Monitoring Overview.

Outages

An outage is a period of time during which a Monitored Object (i.e., a host or a service) or a Business Process is in a non-available state. In the simplest case, an outage is defined as the period of time between two consecutive available statuses, during which an event takes place, that influences the Availability; the states that affect Availability negatively and that are used for the calculation of outages are HARD events only.

Each Outage has an associated Duration, which is the total amount of time during which the Monitored Object or Business Process was in a non-available state.

During each calculation period, the sum of the duration of all outages plus the duration of Available statuses must equal the duration of the calculation period itself.

Outages appear as part of each SLM report, if configured. You can choose to show only the number of outages, or a detailed list of outages for each monitored object in the report.

Whenever the availability is below the required target availability, the monitored host or service is highlighted. You can drill down on any calculation period with an Availability below 100%, to a new page, presenting a detailed report of the related events.

If an Operational Time has been defined in an SLA, the definition and computation of an outage change slightly. In this case, Outages are recorded only when they fall within an Operational Time; when the state of non-availability takes place outside an Operational Time, it is called an Unavailability Period. In other words, an Unavailability Period is an interval of time during which a service, host, or Business Process is not available, and it becomes an Outage if it falls at least partially within an Operational Time.

In the remainder of this section, we show examples of how Outages are calculated with Operational time and how they are considered within and across Calculation Periods.

Records of Outages with Operational Time

This section enumerates all the basic cases that are taken into account for the calculation of outages, by considering a scenario with one Calculation Period composed of two Operational Time Ranges: the first from 8AM to 9AM and the second from 10AM to 11AM.

More complex scenarios can be reproduced by suitably combining the following cases.

Throughout this section, in all the diagrams and in the text, OT stands for Operational Time and reports all the Operational Time Ranges, while UnPs represent Unavailability Periods, i.e., the intervals during which a host or service is unavailable.

Unavailability Periods outside the Operational Time

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1---1   2---2
          |       |
          |       \-> 1 outage of 30 minutes
          \-> no outage

In this example the Unavailability Period 1 [7:20AM-7:40AM] starts and ends outside any Operational Time Range, so it causes no outage.

The Unavailability Period 2 [8AM-8:30AM] instead, falls completely inside the first Operational Time Range, so it causes an outage that starts at 8AM and ends at 8:30AM, with a duration of 30min.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 8:30AM

  • Outage duration: 30min

Unavailability Periods starting or ending outside Operational Times

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs      1---------------------1
                |
                \-> 1 outage of 1 hour

The UnP 1 [7:30AM-9:30AM] starts before the OT range [8AM-9AM] and ends after the OT range [8AM-9AM].

Since Outages exist only during the Operational Time, the UnP causes an Outage that starts at 8AM, the beginning of the OT range [8AM-9AM], and ends at 9AM, the end of that same range.

The duration of the Outage is 1h, which is the amount of Operational Time affected by the Outage.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 9AM

  • Outage duration: 1h

Unavailability Periods across different Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs      1--------------------------------1
                |
                \-> 1 single outage of 90 minutes

The UnP 1 [7:30AM-10:30AM] starts before the first OT range [8AM-9AM], and ends during the second OT range [10AM-11AM].

This means that the Unavailability Period causes an Outage that starts at 8AM (the start of OT range [8AM-9AM]) and ends at 10:30AM, after the second OT range starts. The duration of the Outage is 1h 30min, since the Outage affected the full OT range [8AM-9AM] and the first 30min of the OT range [10AM-11AM].

Reported Outage:

  • Outage start: 8AM

  • Outage end: 10:30AM

  • Outage duration: 1h 30min

Temporary Uptimes between Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1---------------------1   2-----------------2
                            ↓       ↓
                       1 single outage of 2 hours

In this case the UnP 1 [7:20AM-9:20AM] entirely covers the OT range [8AM-9AM] and the UnP 2 [9:40AM-11:30AM] entirely covers the subsequent OT range [10AM-11AM].

The result is a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h. Only one Outage is reported because the Monitored Object or Business Process was continuously unavailable during the Operational Time: the short interval during which it was available [9:20AM-9:40AM] lies completely outside the OT and is therefore disregarded.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 11AM

  • Outage duration: 2h

Temporary Uptimes during Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1----------------1      2----------2
                |                  |
                |                  \-> 1 outage of 30 minutes
                \-> 1 outage of 50 minutes

Here the UnP 1 [7:20AM-8:50AM] ends just before the end of the first OT range [8AM-9AM], and the UnP 2 [9:30AM-10:30AM] overlaps with the beginning of the second OT range [10AM-11AM].

This causes 2 different Outages, because during both Operational Time Ranges the Monitored Object or Business Process changes status (here, it becomes available). The first Outage starts at 8AM and ends at 8:50AM, with a duration of 50min. The second Outage starts at 10AM and ends at 10:30AM, with a duration of 30min.

1st Reported Outage:

  • Outage start: 8AM

  • Outage end: 8:50AM

  • Outage duration: 50min

2nd Reported Outage:

  • Outage start: 10AM

  • Outage end: 10:30AM

  • Outage duration: 30min

Unavailability Periods starting or ending on OT range start or end

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1------------------1          2-----------------2
                          ↓            ↓
                       1 single outage of 2 hours

If an Unavailability Period starts or ends at exactly the same instant as an OT range starts or ends, then the start or end of the Unavailability Period will always be considered as happening outside the OT range. This case is very similar to the first case presented, with the only difference that two events coincide with the start or end of an Operational Time Range.

The UnP 1 [7:20AM-9AM] ends exactly in the same moment when the first OT range [8AM-9AM] ends, while UnP 2 [10AM-11:50AM] starts exactly in the same moment when the second OT range [10AM-11AM] starts.

This causes a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h.

This happens because the end of UnP 1 [7:20AM-9AM] is considered as happening after the first OT range [8AM-9AM] and the start of UnP 2 [10AM-11:50AM] is considered as happening before the second OT range [10AM-11AM]. So the Monitored Object or Business Process was continuously unavailable for the whole Operational Time.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 11AM

  • Outage duration: 2h
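The cases above can be reproduced with a short sketch. It is not the algorithm used by the SLM module, only one way to obtain the same results: each Unavailability Period is clipped to the Operational Time ranges, and two clipped pieces are merged into one Outage when no Operational Time at all lies between them. Times are expressed as minutes since midnight, intervals are (start, end) tuples, and all function names are made up for the example. Unavailability Periods are assumed not to overlap each other.

```python
def hm(hour, minute=0):
    """Minutes since midnight, to keep the example free of date handling."""
    return hour * 60 + minute

def overlap(a, b):
    """Intersection of two (start, end) intervals, or None if it is empty."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None

def operational_minutes(interval, ot_ranges):
    """Amount of Operational Time contained in `interval`."""
    total = 0
    for ot in ot_ranges:
        part = overlap(interval, ot)
        if part:
            total += part[1] - part[0]
    return total

def outages(unps, ot_ranges):
    """Return a list of (start, end, duration) tuples, one per Outage."""
    # 1. Keep only the parts of each UnP that fall inside an OT range.
    pieces = []
    for unp in unps:
        for ot in ot_ranges:
            part = overlap(unp, ot)
            if part:
                pieces.append(part)
    pieces.sort()
    # 2. Merge consecutive pieces if no Operational Time separates them,
    #    i.e. the object was never available while it was supposed to be.
    merged = []
    for start, end in pieces:
        if merged and operational_minutes((merged[-1][1], start), ot_ranges) == 0:
            prev_start, _, prev_duration = merged[-1]
            merged[-1] = (prev_start, end, prev_duration + (end - start))
        else:
            merged.append((start, end, end - start))
    return merged

OT = [(hm(8), hm(9)), (hm(10), hm(11))]

# UnP across different OT ranges: one Outage, 8AM-10:30AM, 90 minutes.
across = outages([(hm(7, 30), hm(10, 30))], OT)
# Temporary uptime during an OT range: two Outages of 50 and 30 minutes.
during = outages([(hm(7, 20), hm(8, 50)), (hm(9, 30), hm(10, 30))], OT)
```

Note that the duration counts only the Operational Time affected, which is why the 8AM-10:30AM Outage lasts 90 minutes rather than 150.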

Outages and Calculation Periods

The previous cases take into account Operational Time Ranges, during which the interval between two consecutive Time Ranges is disregarded, so an Unavailability Period that spans multiple OT Ranges is considered as a single Outage.

In this section, we define how Outages are considered when Calculation Periods are taken into account.

The simplest case, when one Unavailability Period (and therefore the Outage) is entirely included in a Calculation Period, results in a single Outage reported in that Calculation Period.

However, when an Unavailability Period spans two Calculation Periods, the Outage will be split into two Outages, one for the first Calculation Period and one for the second, even though Calculation Periods are by default contiguous time intervals. The same applies when multiple Calculation Periods are involved, as the following diagram shows. Here, CPs are Calculation Periods and OUT are the recorded Outages.

    Jan 2020   Feb        Mar       Apr       May      June
     |----------|----------|----------|---------|--------|
CPs  |----------|----------|----------|---------|--------|
UnPs   1-------------1         2----------------------2
OUT    |---O1---|-O2-|         |--O3--|---O4----|--O5-|

In this example, although there are only two UnPs, five Outages will be reported, one for each month in which the Monitored Object or Business Process was not available. In summary (note that the actual Duration in the report will be in HH:MM:SS, here we use days for simplicity), the first Unavailability Period results in:

Reported Outage (January):

  • Outage start: 7th of January

  • Outage end: 31st of January

  • Outage duration: 25 days

Reported Outage (February):

  • Outage start: 1st of February

  • Outage end: 14th of February

  • Outage duration: 14 days

while the second Unavailability Period results in:

Reported Outage (March):

  • Outage start: 10th of March

  • Outage end: 31st of March

  • Outage duration: 22 days

Reported Outage (April):

  • Outage start: 1st of April

  • Outage end: 30th of April

  • Outage duration: 30 days

Reported Outage (May):

  • Outage start: 1st of May

  • Outage end: 19th of May

  • Outage duration: 19 days
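The splitting of an Unavailability Period at Calculation Period boundaries can be sketched as follows, assuming monthly Calculation Periods and whole-day granularity as in the example above (a real report works with timestamps and HH:MM:SS durations). The function name `split_by_month` is an assumption made for this illustration.

```python
import calendar
from datetime import date, timedelta

def split_by_month(first_day, last_day):
    """Split an inclusive date range into one (start, end, days) tuple per month."""
    parts = []
    start = first_day
    while start <= last_day:
        days_in_month = calendar.monthrange(start.year, start.month)[1]
        month_end = date(start.year, start.month, days_in_month)
        end = min(month_end, last_day)
        parts.append((start, end, (end - start).days + 1))  # both days included
        start = end + timedelta(days=1)
    return parts

# First Unavailability Period of the diagram: 7 January to 14 February 2020.
first_unp = split_by_month(date(2020, 1, 7), date(2020, 2, 14))
# -> one Outage of 25 days in January and one of 14 days in February
```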

Event Adjustment

The Event Adjustment feature allows Service Owners and Service Level Managers to retroactively add events. An Event Adjustment is the action of adding to a monitored host or service an event, over a given period of time, that did not actually take place. This proves useful in situations where, for instance, the Operational team has forgotten to schedule Downtime in advance: this oversight would lead to a wrong calculation of the availability, so adding an Event Adjustment that covers the unscheduled downtime fixes the problem. It is worth noting that Event Adjustments will affect the availability calculations for hosts and services, and potentially affect whether or not the service provider is satisfying the target SLA specified in the customer contract.

How It Works

The Event Adjustment feature is an extension of the SLM module, and can only be performed by a NetEye user with a certain privilege level. All inserted event adjustments are stored in a dedicated database table in order to ensure that the data (i.e., the existing timeline) cannot be manipulated. Event adjustments are taken into account during Availability calculations without any additional intervention.

Each adjustment must be applied to a monitored object. The following table indicates which types of event adjustments can be applied to each type of monitored object.

Table 20 Available event types for host and services

Monitored Object Type    Event Type
host                     up, down, downtime
service                  ok, warning, critical, unknown, downtime

Multiple event adjustments on the same host or service cannot overlap, except if the event type of one of the adjustments is downtime. If a new or modified event does overlap the time bounds of an existing adjustment, NetEye will report an error and the new event adjustment will not be processed.
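The two rules above (allowed event types per object type, and no overlap unless one of the adjustments is a downtime) can be expressed as a small validation sketch. The dictionary layout and the function `can_add` are hypothetical and do not reflect the module's database schema; the sketch also assumes that two adjustments that merely touch (one ends exactly when the other starts) do not overlap.

```python
from datetime import datetime

ALLOWED_EVENT_TYPES = {
    "host": {"up", "down", "downtime"},
    "service": {"ok", "warning", "critical", "unknown", "downtime"},
}

def can_add(new, existing):
    """Return True if adjustment `new` may be stored next to the `existing` ones.

    Each adjustment is a dict with the keys:
    object_type, object, start, end, event_type.
    """
    if new["event_type"] not in ALLOWED_EVENT_TYPES[new["object_type"]]:
        return False
    for other in existing:
        same_object = (other["object_type"], other["object"]) == (
            new["object_type"], new["object"])
        overlapping = new["start"] < other["end"] and other["start"] < new["end"]
        involves_downtime = "downtime" in (new["event_type"], other["event_type"])
        if same_object and overlapping and not involves_downtime:
            return False
    return True

stored = [{
    "object_type": "host", "object": "web01", "event_type": "up",
    "start": datetime(2024, 3, 1, 10), "end": datetime(2024, 3, 1, 12),
}]
overlapping_down = dict(stored[0], event_type="down",
                        start=datetime(2024, 3, 1, 11), end=datetime(2024, 3, 1, 13))
overlapping_downtime = dict(overlapping_down, event_type="downtime")

rejected = can_add(overlapping_down, stored)      # False: overlaps, no downtime involved
accepted = can_add(overlapping_downtime, stored)  # True: one adjustment is a downtime
```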

Resource

In order to configure a resource contract, you need to have defined a customer as described in the corresponding section above. You also need to have set up suitable dashboard(s) in Grafana that should be linked to the contract.

A user can define Resource Contracts for their resources in order to generate SLM Resource reports. To add a new Resource Contract for SLM, go to SLM / Resource Contracts and enter appropriate values for the following options:

  • Name: The name of this contract

  • Description: A more user-friendly description of the contract

  • Customer: You can set in the Customer tab the customer whose analytical dashboard in Grafana will be included in the resource report. Customers will appear in the dropdown if they have Analytics Module access and an assigned role in common with the SLM user.

  • Dashboard: An analytical dashboard (static) of a customer in Grafana

To prevent a user from creating very large resource reports, there is a restriction in place, that sets the maximum number of panels that can be included in the report. A user cannot add a new Resource Contract, if the number of panels in the selected dashboard exceeds the limit.

This limit can be increased manually by updating the dashboard_panel_size value in the SLM Module config file /neteye/shared/icingaweb2/conf/modules/slm/config.ini

Please be warned that increasing this limit will lead to a proportional decrease in performance.

Configuration

SLA Reports

The SLM module is compatible with Icinga’s Reporting module. One use of the data provided via SLM is for creating availability reports for the monitored objects included in each customer contract.

To create an availability report, you will need to:

  • Configure one or more customers, SLA types, and contracts in the SLM module

  • Create a new report in the Reporting module and set the following fields, which are all compulsory:

    • Name: A name that uniquely identifies the report

    • Timeframe: The value selected here defines the span of time covered by the report. It must be longer than the calculation period defined in the SLA Type for which the report is generated, otherwise the report will be empty (see next section).

    • Report: Set this to SLM Report, whereupon the form will add the following fields

    • Show Outages List: Show Outages in the report

    • Show Outages Count: Set the maximum number of outages to show per calculation period

    • Show Outages List Limit: When choosing this option, only the number of outages defined in the previous option will be shown in the HTML output.

    • Customers: Choose the customer you want to create the report for

    • Contract: Choose the contract corresponding to the data to be processed in the report. In the Contract dropdown, you will be able to select only those contracts where the Calculation Period is defined to be smaller than the selected Timeframe. This ensures that the report will have an appropriate number of pages.

    • Consider Event Adjustments: With this drop-down, users with the appropriate permission can choose whether or not to consider user-defined event adjustments in this report. There are three possible choices:

      • <Yes/No> (inherited from “< ContractName >”): This option contains the Consider Event Adjustment flag value (i.e. Yes or No) with the contract name from which the value is inherited.

      • Yes: Override that value, forcing event adjustments to be considered

      • No: Override that value, forcing event adjustments to be ignored

After you click on “Create Report”, the report will appear in the list of available reports.

Within each report, you can read the details related to the selected contract and its monitored objects. This information is typically divided into hosts and services, and represents their percentage of availability. The availability of a monitored object will be green if it is above the threshold defined in its SLA Type, and red if not. In addition, all monitored objects that did not record any events during the reporting period will be listed separately.

You should ensure that the filter expression used in the Objects Filter field on the Contract tab returns at least one monitored object (e.g., at least one host or service).

Invalid Report Configurations

While the SLM module does not allow users to create incorrect report configurations, there are circumstances in which reports may seemingly contain wrong data, namely when the report is empty or very large. The reasons behind these two cases, along with solutions, are explained in the next sections.

Configurations Leading to Empty Reports

If the report’s time frame and the contract’s calculation period aren’t compatible, the report generated will be empty. This can happen when:

  • The Calculation Period is greater than the Time Frame. For example, setting the calculation period to monthly_12 and defining the time frame to be from 01.01.2019 to 01.06.2019. This would be like trying to fit 12 months inside 6 months.

  • The time frame doesn’t contain at least one entire and valid calculation period. For instance, when you define a report with a monthly calculation period, while the time frame is defined to start on 02.07.2019 and finish on 29.08.2019. Here, neither the time within July nor the time within August represents a complete month.
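Both conditions boil down to counting how many complete calculation periods fit inside the time frame: if the count is zero, the report is empty. The sketch below does this for monthly_N periods. It assumes that periods start on the first day of the first month that begins on or after the start of the time frame, and that both ends of the time frame are inclusive days; how NetEye actually aligns periods is not stated here, and the function names are made up for the example.

```python
from datetime import date

def add_months(first_of_month, months):
    """First day of the month that lies `months` months after `first_of_month`."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    year, month_index = divmod(index, 12)
    return date(year, month_index + 1, 1)

def complete_periods(frame_start, frame_end, months):
    """Count monthly_N periods lying entirely inside the inclusive time frame."""
    # First month boundary on or after the start of the time frame.
    start = date(frame_start.year, frame_start.month, 1)
    if start < frame_start:
        start = add_months(start, 1)
    count = 0
    while True:
        next_start = add_months(start, months)
        # The period's last day is the day before `next_start`.
        if (next_start - frame_end).days > 1:
            break
        count += 1
        start = next_start
    return count

# monthly_12 inside 01.01.2019 - 01.06.2019: no complete period.
too_long = complete_periods(date(2019, 1, 1), date(2019, 6, 1), 12)   # 0
# monthly_1 inside 02.07.2019 - 29.08.2019: neither July nor August is complete.
misaligned = complete_periods(date(2019, 7, 2), date(2019, 8, 29), 1)  # 0
# monthly_1 inside 01.01.2019 - 31.03.2019: three complete months.
valid = complete_periods(date(2019, 1, 1), date(2019, 3, 31), 1)       # 3
```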

If you find you have created a report definition matching one of these cases, you can fix it with one of the following solutions:

  • Make the time frame defined for the specific report longer

  • Select a different SLA Type in the contract form, with a smaller calculation period

  • Select a smaller calculation period in the definition of the SLA type associated with the contract used for creating the report

Configurations Leading to Very Large Reports

If the combination of a report’s time frame and the contract’s calculation period would lead to a number of calculation period slots higher than a pre-determined limit, it strongly implies that the report produced would have an excessive number of pages.

NetEye attempts to avoid this situation by preventing a user from creating very large reports. In general, reports consisting of hundreds of pages are not useful. However, should you wish to override this upper bound for the allowed number of calculation periods, you can change this limit in the SLM module’s configuration page (Configuration / Modules / slm / Configuration) with the field Maximum report size. If you do so, please note that increasing this limit will lead to a proportional decrease in performance.

Event Adjustments

The Event Adjustment Web Interface

Whether event adjustments will be considered or not during report generation can be set with the flag “Consider Event Adjustments” on the associated SLA Contract. This will be taken as the default behavior for any reports produced for that contract. The value selected for this flag may be overridden in the Reporting module by users who are granted the permissions described below.

Viewing Permissions

By default, only users with the admins role can access the SLM module. Non-admin users who need to view an SLM report need special permissions to access or modify this flag or the other items (i.e., Contracts, Customers, and SLA Types) in the SLM module. To grant these permissions, create a role (under Configuration / Authentication / Roles) with a suitable permission over the SLM module, such as report-adjustment-override or admin. Enabling only the General Module Access permission grants only view access to SLM Event Adjustments.

Before you can set any of these permissions for a non-admin user, you must first enable the General Module Access permission for the SLM module.

Creating an Event Adjustment

To add a new event adjustment, go to SLM / Event Adjustments and enter values for the following options:

  • Object Type: Type of the monitoring object, can be host or service.

  • Host Name: Name of the host to which to attach the event.

  • Service Description: Name of the service, running on the host given in the Host Name field, to which to attach the event

  • Description: Title for the event

  • Start: Timestamp for the starting point of the event (YYYY-MM-DD hh:mm:ss)

  • End: Timestamp for the ending point of the event (YYYY-MM-DD hh:mm:ss)

  • Event Type: State for the event. The event type must be one of the values in Table 20 that are available for the monitored object given in the Host Name or Service Description field.

Alternatively, Event Adjustments can be created starting from any existing Icinga 2 event in Monitoring, via its Event Details page.

Advanced Search Filter

You can search event adjustments with the search filter, which supports searching by host name, service, description, start/end time, or event type.

The Event Adjustment CLI Command

The event adjustment feature is also available through the CLI. Below you can find detailed descriptions of the available commands and their parameters.

Note

All timestamps in the commands below must be in the format YYYY-MM-DD hh:mm:ss and be expressed in your local timezone as set in PHP during the initial configuration.

Create

The create command lets you create a new event adjustment for a particular Monitored Object (host or service).

Usage:

# icingacli slm adjustments create [parameters]

Available Parameters:

--host-name

(mandatory) Name of the host to which to attach the event

--service-description

(optional) Name of the service, running on host $host-name, to which to attach the event

--description

(mandatory) Title for the event

--start

(mandatory) Timestamp for the starting point of the event

--end

(mandatory) Timestamp for the ending point of the event

--event-type

(optional) State for the event; if not explicitly defined, it is automatically set to downtime. The event type must be one of the values in the table above that are available for the monitored object given in the $host-name or $service-description parameters.

Example

I want to create an event adjustment for the host “my-host” to specify that this host was up and running yesterday night from 1 AM to 3 AM.

I then execute the following command:

icingacli slm adjustments create --host-name="my-host" --description="Wrong host state" --start="2019-12-18 01:00:00"  --end="2019-12-18 03:00:00" --event-type="up"

List

The list command lets you see existing event adjustments. The output will be a JSON object.

Usage:

# icingacli slm adjustments list

Edit

The edit command lets you alter the starting or ending time of an existing event adjustment given that adjustment’s ID, which you can obtain from the list command. Note that you cannot change the host or service name filters this way; you will have to delete and recreate the event adjustment. Also, if neither the --start nor the --end parameter is included, the event adjustment will not be changed.

Usage:

# icingacli slm adjustments edit [parameters]

Available Parameters:

--id

(mandatory) The ID of the adjustment to change

--start

(optional) Timestamp for the starting point of the event

--end

(optional) Timestamp for the ending point of the event

--event-type

(optional) State for the event. The event type must be one of the values in Table 20 above that are available for the monitored object pointed to in the event adjustment.

Delete

The delete command lets you delete an existing event adjustment given that adjustment’s ID, which you can obtain from the list command.

Usage:

# icingacli slm adjustments delete [parameters]

Available Parameters:

--id

(mandatory) The ID of the adjustment to delete

Resource Reports

To create an SLM resource report, you will need to:

  • Configure one or more customers and resource contracts in the SLM module

  • Create a new report in the Reporting module and set the following fields, which are all compulsory:

    • Name: A name that uniquely identifies the report

    • Timeframe: The value selected here defines the period of time the report will cover.

    • Report: Set this to SLM Resource Report, whereupon the form will add the following fields.

    • Customers: Choose the customer for which you want to create the report. In the Customers drop-down menu, you will be able to select only customers who either have access to the Analytics Module or have a role in common with the SLM user.

    • Resource Contracts: Choose the contract, linked to the analytics dashboard for a selected customer.

After you click on “Create Report”, the report will appear in the list of available reports.

Within each report, you can read the details related to the selected contract. The SLM Resource Reports will contain all the static panels including panels inside the rows of a Grafana dashboard linked to the selected resource contract.

Advanced Topics

API

The availability calculation process is a complex multi-step algorithm that involves two major phases. The first phase consists of identifying the time periods over which the availability calculations must be performed. The second phase calculates the availability for each of those periods identified in the previous phase.

Phase One - Identify the Availability Periods

In order to explain the details of the availability calculations, we will use here a simplified request model that differs from the implementations in the following respects:

  • Represents dates in a human readable format instead of in epoch milliseconds

  • Ignores the timezone

  • Uses a simplified event format

Consider the following pseudo request.

Note

The format of this request is not valid for the SLM module as the real dates should be in Unix time format. We also explicitly omit the array of events for the moment.

{
    "output_format": "json",
    "time_period": {
        "ranges": {
            "monday": "08:00-18:00",
            "tuesday": "08:00-12:00"
        }
    },
    "time_zone": "Europe/Rome",
    "calculation_period": {
        "type": "weekly",
        "start": "monday"
    },
    "time_range": {
        "from": "Friday 1st January 2010 at 00:00 am",
        "to": "Monday 25th January 2010 at 00:00 am"
    },
    "initial_state": "0",
    "consider_downtime" : false,
    "consider_event_adjustments" : false,
    "events": [],
    "last_hard_states": [],
    "downtimes": [],
    "event_adjustments": [],
    "target_availability": 99.5,
    "expected_monitored_objects": [],
    "locale": "en_US"
}

In the following section we will focus principally on understanding how the calculation periods are determined. To identify them, three elements from the request are used:

  1. The Time Frame in time_range

  2. The calculation_period

  3. The ranges from the time_period

In our example request above, the time_range is:

  • From: Friday, January 1st 2010 at 00:00am

  • To: Monday, January 25th 2010 at 00:00 am

Consequently, the events taken into account are all the events that occur within the time_range. In order to determine the availability during the initial period of the time range, SLM reports, for each monitored object, take into account also the last event occurring before the start of the time_range.

Once we have identified the time_range, we use the calculation_period to split the range into the desired calculation blocks. The final availability will be calculated for each one of these blocks independently.

The calculation_period of our request is set to weekly/monday. This means that we want availability aggregated by groups of seven days starting on Mondays. As noted above, our time_range spans from January 1st to January 25th, for a total of 25 days. If we split it into blocks of one week, we will obtain 5 availability periods:

  1. From Friday, January 1st to Sunday, January 3rd

  2. From Monday, January 4th to Sunday, January 10th

  3. From Monday, January 11th to Sunday, January 17th

  4. From Monday, January 18th to Sunday, January 24th

  5. Monday, January 25th

Since only full calculation periods are taken into account, the first and last blocks will be discarded since they last less than a full week. Consequently, the final calculation will include only three periods for a total of 21 days.
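The splitting step can be illustrated with a short Python sketch. It is not the SLM implementation: it assumes naive datetimes (no time zone handling, as in the simplified model above), treats the end of the time_range as exclusive, and uses a hypothetical helper name, full_weeks.

```python
from datetime import datetime, timedelta

def full_weeks(start: datetime, end: datetime, first_weekday: int = 0):
    """Return the (from, to) pairs of complete weeks inside [start, end).

    first_weekday follows datetime.weekday(): 0 is Monday.
    """
    # Move forward to the first midnight that falls on the chosen weekday.
    cursor = datetime(start.year, start.month, start.day)
    if cursor < start:
        cursor += timedelta(days=1)
    while cursor.weekday() != first_weekday:
        cursor += timedelta(days=1)
    weeks = []
    # Keep only blocks that fit entirely before the end of the range.
    while cursor + timedelta(days=7) <= end:
        weeks.append((cursor, cursor + timedelta(days=7)))
        cursor += timedelta(days=7)
    return weeks

for week_from, week_to in full_weeks(datetime(2010, 1, 1), datetime(2010, 1, 25)):
    print(week_from.date(), "->", week_to.date())
```

Partial blocks at either end of the range never enter the result, which mirrors the discarding rule described above.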

The last step is to exclude from each of the availability periods the times that do not match the requested time_period ranges:

  1. Monday from 08:00 to 18:00

  2. Tuesday from 08:00 to 12:00

If we apply this to the three availability periods in the previous step, we will arrive at the definitive calculation blocks:

  • The first availability period will include the aggregated statistics of:

    • Monday, January 4th from 08:00 to 18:00

    • Tuesday, January 5th from 08:00 to 12:00

  • The second will include the aggregated statistics of:

    • Monday, January 11th from 08:00 to 18:00

    • Tuesday, January 12th from 08:00 to 12:00

  • The third and final one will include the aggregated statistics of:

    • Monday, January 18th from 08:00 to 18:00

    • Tuesday, January 19th from 08:00 to 12:00
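The filtering by time_period ranges can be sketched in Python as follows. This is only an illustration under stated assumptions: the week is assumed to start at midnight on a Monday, the range strings follow the "HH:MM-HH:MM" form used in the requests of this section (several ranges per day separated by commas), and the function name windows is hypothetical.

```python
from datetime import datetime, timedelta

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

def windows(week_start: datetime, ranges: dict[str, str]):
    """Expand a 'ranges' mapping into concrete (from, to) windows for one week."""
    result = []
    for day, spec in ranges.items():
        day_start = week_start + timedelta(days=DAYS.index(day))
        for part in spec.split(","):
            begin, finish = part.split("-")
            begin_h, begin_m = map(int, begin.split(":"))
            finish_h, finish_m = map(int, finish.split(":"))
            result.append((
                day_start + timedelta(hours=begin_h, minutes=begin_m),
                day_start + timedelta(hours=finish_h, minutes=finish_m),
            ))
    return sorted(result)

ranges = {"monday": "08:00-18:00", "tuesday": "08:00-12:00"}
for w_from, w_to in windows(datetime(2010, 1, 4), ranges):
    print(w_from, "->", w_to)
```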

Phase Two - Calculate the Availabilities for each period

In the preceding section, we determined the required availability periods. Now, let’s create a sequence of monitoring events and see how they impact the availability calculations.

Like we did in the previous section, for the purpose of explaining this phase, we will make some assumptions:

  • All the events originate from the same host.

  • The calculations are performed in hours instead of milliseconds.

  • The downtimes are ignored (see section How the Downtime Calculations Work for downtime calculation).

Here is our example sequence of events:

  1. Thursday, January 7th at 00:00 -> hard_state DOWN

  2. Monday, January 11th at 10:00 -> hard_state UP

  3. Monday, January 11th at 20:00 -> hard_state DOWN

  4. Tuesday, January 12th at 10:00 -> hard_state UP

  5. Tuesday, January 12th at 11:00 -> hard_state DOWN

  6. Wednesday, January 13th at 00:00 -> hard_state UP

First, let’s calculate the availability for the first period, which includes:

  • Monday, January 4th from 08:00 to 18:00

  • Tuesday, January 5th from 08:00 to 12:00

Since the first event is after Monday the 4th at 08:00, we do not know the state of the host at the beginning of the period. Consequently, we will use the initial_state field, which represents the state of the host at the beginning of the report when it cannot be determined by analyzing the events in the request.

In our case this value is 0 (i.e., UP) and, as the initial_state is always of type hard_state, from the beginning of the period, the host is in state hard_up. The initial_state always defaults to 0 in the reports generated via NetEye GUI.

The state of the host never changes during the first period, because there are no events to alter it. Thus we have:

  • 10 hours of state hard_up on Monday, January 4th from 08:00 to 18:00

  • 4 hours of state hard_up on Tuesday, January 5th from 08:00 to 12:00

In summary, the availability results for the first period are:

  • Total time: 14 hours

  • hard_up time: 14 hours (100% of the total time)

The second availability period includes:

  • Monday, January 11th from 08:00 to 18:00

  • Tuesday, January 12th from 08:00 to 12:00

The only event preceding the period takes place on Thursday, January 7th at 00:00 and sets the host state to hard_down. So on Monday, January 11th at 08:00 the state is hard_down.

At 10:00 on Monday, January 11th we receive a hard_up event, and the host state becomes hard_up.

At 20:00 on the same day, outside the time_period ranges, a hard_down event is received. It does not fall within any calculation block, but it determines the state of the host at the start of the next block, on Tuesday, January 12th at 08:00.

Subsequently, we have two events on Tuesday, January 12th, the first of which sets the state to hard_up at 10:00, and the second of which sets the state to hard_down one hour later.

So, we have:

  • 2 hours of state hard_down on Monday, January 11th from 08:00 to 10:00

  • 8 hours of state hard_up on Monday, January 11th from 10:00 to 18:00

  • 2 hours of state hard_down on Tuesday, January 12th from 08:00 to 10:00

  • 1 hour of state hard_up on Tuesday, January 12th from 10:00 to 11:00

  • 1 hour of state hard_down on Tuesday, January 12th from 11:00 to 12:00

In summary, the availability results for the second period are:

  • Total time: 14 hours

  • hard_up time: 9 hours (about 64.3% of the total time)

  • hard_down time: 5 hours (about 35.7% of the total time)

The third availability period includes:

  • Monday, January 18th from 08:00 to 18:00

  • Tuesday, January 19th from 08:00 to 12:00

During this time there are no events, so the state for the entire period is that of the last event received before the time period. Since this is the one received on Wednesday, January 13th at 00:00, which sets the state to hard_up, the availability results for the third period are:

  • Total time: 14 hours

  • hard_up time: 14 hours (100% of the total time)
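The per-period calculation walked through in this phase can be reproduced with a small Python sketch. It follows the simplifying assumptions listed at the start of this phase (one host, hours instead of milliseconds, downtimes ignored) and is not the actual SLM algorithm; the function name hours_per_state and the state labels UP and DOWN are illustrative.

```python
from datetime import datetime

def hours_per_state(windows, events, initial_state="UP"):
    """Return the hours spent in each state inside the given windows.

    windows: list of (from, to) datetimes for one calculation period.
    events:  time-sorted list of (timestamp, state) hard state changes.
    """
    totals = {}
    for w_from, w_to in windows:
        # State at the window start: last event at or before it,
        # otherwise the initial state.
        state = initial_state
        for ts, new_state in events:
            if ts <= w_from:
                state = new_state
        cursor = w_from
        # Split the window at every event that falls inside it.
        for ts, new_state in events:
            if w_from < ts < w_to:
                elapsed = (ts - cursor).total_seconds() / 3600
                totals[state] = totals.get(state, 0) + elapsed
                cursor, state = ts, new_state
        elapsed = (w_to - cursor).total_seconds() / 3600
        totals[state] = totals.get(state, 0) + elapsed
    return totals

# Events from the example sequence above.
events = [
    (datetime(2010, 1, 7, 0), "DOWN"),
    (datetime(2010, 1, 11, 10), "UP"),
    (datetime(2010, 1, 11, 20), "DOWN"),
    (datetime(2010, 1, 12, 10), "UP"),
    (datetime(2010, 1, 12, 11), "DOWN"),
    (datetime(2010, 1, 13, 0), "UP"),
]
second_period = [
    (datetime(2010, 1, 11, 8), datetime(2010, 1, 11, 18)),
    (datetime(2010, 1, 12, 8), datetime(2010, 1, 12, 12)),
]
print(hours_per_state(second_period, events))
```

Note that events outside the windows, such as the one on Monday at 20:00, contribute no time of their own but still decide the state at the start of the next window.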

Warning

The Icinga2 IDO retention policy settings may affect the SLM availability Reports. Please see Interaction with IDO retention policy for more details.

How the Downtime Calculations Work

When calculating availability, the SLM module can optionally take into account planned downtimes for hosts and services. You can enable this behavior by using the consider_downtime key in the calculation request, and setting it to true, then adding a downtimes array. We will refer to the following example request:

{
   "output_format": "json",
   "time_period": "{ ... }",
   "time_zone": "Europe/Rome",
   "calculation_period": "{ ... }",
   "time_range": {
        "from": "1555000000000",
        "to": "1561000000000"
    },
   "initial_state": "0",
   "consider_downtime": true,
   "consider_event_adjustments": false,
   "events": [
       {
           "host_name": "host1.example.com",
           "service_description": "disk_agent",
           "timestamp": "1560000000000",
           "state": "0",
           "type": "hard_state"
       },
       {
           "host_name": "host1.example.com",
           "service_description": "disk_agent",
           "timestamp": "1560000500000",
           "state": "2",
           "type": "dt_end"
       },
       {
           "host_name": "host1.example.com",
           "service_description": "disk_agent",
           "timestamp": "1560001000000",
           "state": "2",
           "type": "soft_state"
       }
   ],
   "last_hard_states": [
       {
           "host_name": "host1.example.com",
           "service_description": "disk_agent",
           "timestamp": "1560000000000",
           "state": "0"
       }
   ],
   "downtimes":[
       {
            "depth": 2,
            "host_name": "host1.example.com",
            "service_description": "disk_agent"
       }
   ],
   "event_adjustments": [],
   "target_availability": 99.5,
   "expected_monitored_objects": [
        {
            "host_name": "host1.example.com",
            "service_description": "disk_agent"
        }
   ],
   "locale": "en_US"
}

Due to the fact that the consider_downtime property above is true, the availability calculation will take into account any listed downtimes.

During the calculation, important data is provided in the downtimes field of the request. This data is a list of the downtime status at the beginning of the time_range for each monitored host or service.

As downtimes can be nested, the depth value represents their nesting level (see the table and the paragraphs below for an example of nested downtimes). A depth value of zero means that the object is not in downtime.

In the request above, for example, we know that:

  • At the epoch instant 1555000000000 (which corresponds to time_range.from) the service disk_agent on host ‘host1.example.com’ was in planned downtime.

  • In addition, we know that the downtime depth was 2.

If a service or host is not included in the downtimes list, it will by default be considered as not in downtime.

Consider this hypothetical ordered sequence of events for a given host to clarify how the states and downtime depth are calculated:

Table 21 Downtime Event Sequence Example

Time Instant | Received Event State Type | Received Event State | Calculated Availability State | Downtime Depth
1000         | hard_state                | 0                    | operative.hard_up             | 0
2000         | dt_start                  | 0                    | in_downtime.hard_up           | 1
3000         | dt_start                  | 0                    | in_downtime.hard_up           | 2
4000         | hard_state                | 2                    | in_downtime.hard_down         | 2
5000         | dt_end                    | 0                    | in_downtime.hard_down         | 1
6000         | hard_state                | 0                    | in_downtime.hard_up           | 1
7000         | dt_end                    | 0                    | operative.hard_up             | 0

The explanation for the data sequence above is:

  • From time 1000 to 2000: The downtime depth is zero, so the host is not in downtime and we have 1000ms of availability state hard_up in the operative mode.

  • From 2000 to 3000: A downtime period is starting, so the depth is increased by 1 and the host is now in planned downtime. So we have 1000ms of availability state hard_up but in the in_downtime mode.

  • From 3000 to 4000: Another downtime period is starting, so the depth is now increased to 2 and the host is still in downtime. We have an additional 1000ms of availability state hard_up in the in_downtime mode.

  • From 4000 to 5000: We received a hard_state 2. The downtime depth does not change and the host is still in planned downtime. We have now 1000ms of availability state in the in_downtime mode, but the calculated state is now hard_down.

  • From 5000 to 6000: We received dt_end, so the downtime depth is decreased by 1 but the host is still in planned downtime. We have an additional 1000ms of availability state hard_down in the in_downtime mode.

  • From 6000 to 7000: We received hard_state 0 and the host is still in planned downtime. We have 1000ms of availability state hard_up in the in_downtime mode.

  • After 7000: We received dt_end, so the downtime depth is decreased by 1 and the host is no longer in downtime because the depth value is now zero. From this moment on, the calculated status will be state hard_up in the operative mode.
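The depth bookkeeping described in this walkthrough can be sketched as a small state machine in Python. This is an illustration only, not the SLM code: the function name replay is hypothetical, and the mapping of state codes to names covers just the two codes that appear in Table 21.

```python
def replay(events, depth=0, state="hard_up"):
    """Replay (time, type, state_code) events.

    Returns a list of (time, availability_state, downtime_depth) tuples,
    one per event, describing the situation from that instant onwards.
    """
    names = {0: "hard_up", 2: "hard_down"}  # assumption: codes used in Table 21
    timeline = []
    for time, kind, code in events:
        if kind == "dt_start":
            depth += 1                      # one more nested downtime
        elif kind == "dt_end":
            depth = max(depth - 1, 0)       # leave one nesting level
        elif kind == "hard_state":
            state = names[code]             # only hard states change the state
        mode = "in_downtime" if depth > 0 else "operative"
        timeline.append((time, f"{mode}.{state}", depth))
    return timeline

sequence = [
    (1000, "hard_state", 0), (2000, "dt_start", 0), (3000, "dt_start", 0),
    (4000, "hard_state", 2), (5000, "dt_end", 0), (6000, "hard_state", 0),
    (7000, "dt_end", 0),
]
for row in replay(sequence):
    print(row)
```

The starting depth argument plays the role of the depth value supplied in the downtimes field of the request.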

When the consider_downtime property is instead false, then:

  1. The availability states are always considered as operative. Consequently, no time will be accounted as in_downtime.

  2. Any received dt_start and dt_end events will be reported in the events.skipped section of the response. For instance, the above event sequence example would have generated this skipped report:

{
    "events": {
        "skipped": {
          "dt_start": 2,
          "dt_end": 2
        },
        "unknown": 0
    }
}
How the Event Adjustments Calculations Work

When calculating availability, the SLM module can optionally take into account event adjustments for hosts and services. You can enable this behavior by setting the consider_event_adjustments flag in the calculation request to true, and then adding an event_adjustments array at the top level of the input as in this example:

{
   "output_format": "json",
   "time_period": "{ ... }",
   "time_zone": "Europe/Rome",
   "calculation_period": "{ ... }",
   "time_range": {
        "from": "1555000000000",
        "to": "1561000000000"
    },
   "initial_state": "0",
   "consider_downtime": false,
   "consider_event_adjustments": true,
   "events": [
       { }
   ],
   "downtimes":[],
   "event_adjustments": [
      {
        "host_name": "test-logmanager",
        "service_description": null,
        "start": "1559661092000",
        "end": "1559661093000",
        "event_type": "up"
      },
      {
        "host_name": "docker-jenkins-node-01",
        "service_description": null,
        "start": "1559661092000",
        "end": "1559661096000",
        "event_type": "down"
      }
   ],
   "target_availability": 99.5,
   "expected_monitored_objects": [],
   "locale": "en_US"
}

Because the consider_event_adjustments flag in the above example is set to true, the availability calculation will take into account any adjustments listed in the event_adjustments field of the request.

An event adjustment sets the status of a monitoring object to a specific value over a given time range, regardless of the events that actually happened within that time range.

After the event adjustment end instant, the status of the calculation will be the one of the last event received before that instant.

For instance, in the example above, the status of the host test-logmanager will be set to hard up between the instants 1559661092000 and 1559661093000, even if the events array were to contain entries indicating a different status.

Event adjustments cannot overlap or be nested.
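The override behavior can be pictured with the following Python sketch, which works on a timeline already split into (from, to, state) segments. It is a simplified model under that assumption, not the SLM implementation, and the helper names apply_adjustment and overlaps are hypothetical.

```python
def apply_adjustment(segments, adj_start, adj_end, adj_state):
    """Force adj_state over [adj_start, adj_end) in a list of segments.

    segments: list of (from, to, state) tuples that do not overlap.
    """
    result = []
    for seg_from, seg_to, state in segments:
        if seg_to <= adj_start or seg_from >= adj_end:
            result.append((seg_from, seg_to, state))      # untouched
            continue
        if seg_from < adj_start:
            result.append((seg_from, adj_start, state))   # part before
        if seg_to > adj_end:
            result.append((adj_end, seg_to, state))       # part after
    result.append((adj_start, adj_end, adj_state))
    return sorted(result)

def overlaps(adjustments):
    """Return True if any two (start, end) adjustments overlap."""
    ordered = sorted(adjustments)
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

timeline = [(0, 25, "up"), (25, 100, "down")]
print(apply_adjustment(timeline, 20, 30, "up"))
```

In the usage line, the state after the adjustment ends is "down", which is the state set by the last event (at instant 25) before the end of the adjustment.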

Interaction with IDO retention policy

The Icinga 2 IDO DB retention policy settings may affect the SLM availability Reports. This happens because the availability calculations of the SLM module rely on the state change events present in the Icinga 2 IDO DB, and the retention policy may delete older events that are still relevant for determining the availability of a Host or Service.

Moreover, there is the possibility that the outcome of an SLM Report will differ depending on the time it was generated, since the Icinga 2 IDO DB retention policy continuously removes events older than the defined age in days. For this reason, please configure the days in the Icinga 2 IDO DB retention policy, with a value sufficient to avoid impacting any of your SLM Reports. You can set this policy under Configuration ‣ Modules ‣ neteye ‣ Configuration.

Additionally, in case some events are missing from the history, remember that you can always create proper Event Adjustments to fix the problem.

To overcome some issues deriving from the deletion of old Icinga 2 events, the SLM availability calculation accepts, as an optional parameter, an array of last_hard_states taken from the Icinga 2 runtime attributes of hosts and services (see the tables Runtime Attributes of Hosts and Services in the Icinga 2 documentation for details). This information complements the events array and indicates the last hard state of each Monitored Object. The parameter is useful when the state change events of a monitored object are no longer available, but the information about the last hard state is still stored in the Monitored Object status.

The following are the hard states considered by the SLM module:

  • for hosts: UP and DOWN

  • for services: OK, WARNING, CRITICAL, and UNKNOWN

The last_hard_state of a monitored object is only taken into account when no other events of type hard state are passed for that monitored object.

Consider as an example the following request:

{
   "output_format": "json",
   "time_period": "{ ... }",
   "time_zone": "Europe/Rome",
   "calculation_period": "{ ... }",
   "time_range": {
        "from": "1555000000000",
        "to": "1561000000000"
    },
   "initial_state": "0",
   "consider_downtime": false,
   "consider_event_adjustments": true,
   "events": [
       {
           "host_name": "host1.example.com",
           "service_description": null,
           "timestamp": "1560000000000",
           "state": "0",
           "type": "hard_state"
       }
   ],
   "last_hard_states": [
       {
           "host_name": "host1.example.com",
           "service_description": null,
           "timestamp": "1560000000000",
           "state": "0"
       },
       {
           "host_name": "host2.example.com",
           "service_description": null,
           "timestamp": "1550000000000",
           "state": "0"
       }
   ],
   "downtimes":[],
   "event_adjustments": [],
   "target_availability": 99.5,
   "expected_monitored_objects": [],
   "locale": "en_US"
}

The last_hard_state timestamp of the host host1.example.com will be discarded since a hard_state event type for the host host1.example.com is already present in the events array.

Conversely, since no hard events are present for host host2.example.com, the last_hard_state of host2.example.com will be treated as an actual hard event and will affect the availability calculation. This last hard state makes it possible to compute the availability of host2.example.com even though no event for that host is present in events.

In this example, the status of the host host2.example.com will be considered hard up from the instant 1550000000000 onwards.
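The selection rule for last_hard_states can be expressed in a few lines of Python. This is a sketch of the rule as described above, not the SLM code; the function name effective_events is hypothetical, and the dictionaries use the same keys as the example request.

```python
def effective_events(events, last_hard_states):
    """Merge last_hard_states into events where no hard_state event exists.

    A last hard state is used only for monitored objects that have no
    event of type hard_state in the events list.
    """
    have_hard = {
        (e["host_name"], e["service_description"])
        for e in events
        if e["type"] == "hard_state"
    }
    merged = list(events)
    for lhs in last_hard_states:
        key = (lhs["host_name"], lhs["service_description"])
        if key not in have_hard:
            # Treat the last hard state as an actual hard_state event.
            merged.append({**lhs, "type": "hard_state"})
    return sorted(merged, key=lambda e: int(e["timestamp"]))

events = [
    {"host_name": "host1.example.com", "service_description": None,
     "timestamp": "1560000000000", "state": "0", "type": "hard_state"},
]
last_hard_states = [
    {"host_name": "host1.example.com", "service_description": None,
     "timestamp": "1560000000000", "state": "0"},
    {"host_name": "host2.example.com", "service_description": None,
     "timestamp": "1550000000000", "state": "0"},
]
for event in effective_events(events, last_hard_states):
    print(event["host_name"], event["timestamp"], event["type"])
```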

Example Scenarios

To help you understand how the SLM availability computation works together with the IDO retention policy, please have a look at the following scenarios.

Note

For the sake of simplicity in these scenarios we will not take into account the Operational Times and Calculation Periods.

All scenarios depicted below show a service that changes to an OK status, but the same reasoning applies to any Monitored Object transitioning to any hard state mentioned in the previous section.

Scenario 1

In Scenario 1, for the Monitored Object in consideration we only have 1 state change (A changes to OK), occurring outside the retention period. This state change is then not available in the state history. Instead, the last_hard_state is always available, hence this information is used to compute the availability of the Monitored Object throughout the Report Time Frame.

  • The reported availability is therefore: 100%

Fig. 140 Scenario 1

Scenario 2

In Scenario 2, we have one state change (A, Hard Critical) occurring before the retention period and one state change (B, OK) occurring within the retention period.

Since A occurred before the retention period, and last_hard_state does not represent that state change, it is not possible to determine with certainty the status of the Monitored Object before B. In the absence of data, SLM assumes by default that the Monitored Object was available during the initial period.

  • The reported availability is therefore: 100%

Fig. 141 Scenario 2

Scenario 3

In Scenario 3, we have two state changes (A, Hard Critical, and B, OK, in this order) occurring before the retention period, but within the Report Time Frame.

A is no longer available, so, thanks to the last_hard_state we only know that the Monitored Object, from B onwards, was available. Again, since we do not know the status before B, SLM assumes that the Monitored Object was available.

  • The reported availability is therefore: 100%

Fig. 142 Scenario 3

Scenario 4

In Scenario 4, we have one state change (A, Hard Critical) occurring before the retention period but within the Report Time Frame, and another (B, OK) occurring within the retention period and within the Report Time Frame.

A is no longer available during the computation, while B is, because it occurred within the retention period and because the last_hard_state corresponds to it.

Since in the computation we only know that the Monitored Object was available after state change B, SLM assumes that before state change B the Monitored Object was available.

  • The reported availability is therefore: 100%

Fig. 143 Scenario 4

Scenario 5

In Scenario 5, we have state change A (Hard Critical) occurring before the retention period and before the Report Time Frame, and Downtime B (which has a duration of 25% of the Report Time Frame) occurring before the retention period but within the Report Time Frame.

Thanks to the fact that last_hard_state corresponds to A, A will be taken into account for the availability calculation, despite occurring outside the retention period. Also Downtime B will be taken into account for the availability calculation because the IDO retention policy is not applied on Downtimes in NetEye 4.

  • The reported availability is therefore: 25%

Fig. 144 Scenario 5

Scenario 6

In Scenario 6, the Monitored Object considered was created (A) after the end of the Report Time Frame. Hence also its last_hard_state (B) occurred after the end of the Report Time Frame.

In this case, the SLM availability calculation discards the state change B because it occurs after the Report Time Frame and so it has no effect on the availability calculation. Since there is no relevant event for the duration of the Report Time Frame, SLM cannot say anything about the availability of the Monitored Object.

  • SLM will report: No events associated with this monitored object

Fig. 145 Scenario 6

Ignored events

Some types of events are ignored by the calculation algorithm. This happens when the event belongs to one of the following categories:

  • skipped: any occurrence of the following event types that have no impact on availability:

    • notify

    • comment

    • comment_deleted

    • ack

    • ack_deleted

    • dt_comment

    • dt_comment_deleted

    • flapping

    • flapping_deleted

    • dt_start (only if consider_downtime is set to false)

    • dt_end (only if consider_downtime is set to false)

  • unknown: any event of a type not known to the backend
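As a rough illustration of these categories, the Python sketch below counts ignored events by type. The set of known event types (hard_state and soft_state) is an assumption based only on the types that appear in this section, the function name count_ignored is hypothetical, and the returned structure simply mimics the events.skipped and events.unknown fields of the response shown earlier.

```python
from collections import Counter

SKIPPED_ALWAYS = {
    "notify", "comment", "comment_deleted", "ack", "ack_deleted",
    "dt_comment", "dt_comment_deleted", "flapping", "flapping_deleted",
}
DOWNTIME = {"dt_start", "dt_end"}
KNOWN = {"hard_state", "soft_state"}  # assumption: types seen in this section

def count_ignored(event_types, consider_downtime):
    """Count skipped and unknown events for a list of event type names."""
    skipped, unknown = Counter(), 0
    for event_type in event_types:
        if event_type in SKIPPED_ALWAYS:
            skipped[event_type] += 1
        elif event_type in DOWNTIME:
            if not consider_downtime:
                skipped[event_type] += 1   # downtimes ignored on request
        elif event_type not in KNOWN:
            unknown += 1
    return {"events": {"skipped": dict(skipped), "unknown": unknown}}

types = ["hard_state", "dt_start", "dt_start", "hard_state",
         "dt_end", "hard_state", "dt_end"]
print(count_ignored(types, consider_downtime=False))
```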

An End-to-End Calculation Example

As mentioned at the beginning of the chapter, the /api/availability_calculation_full endpoint receives HTTP POST requests in JSON format. In this section we show an example of how the availability is calculated, using the following valid request body:

{
    "output_format": "json",
    "time_period": {
        "display_name": "Time period test name",
        "object_name": "timeperiod_test",
        "ranges": {
            "monday": "08:00-18:00",
            "tuesday": "08:00-12:00,13:00-18:00"
        }
    },
    "time_zone": "Europe/Rome",
    "calculation_period": {
        "type": "weekly",
        "start": "monday"
    },
    "time_range": {
        "from": "1555000000000",
        "to": "1561000000000"
    },
    "initial_state": "0",
    "consider_downtime" : false,
    "consider_event_adjustments": false,
    "events": [
        {
            "host_name": "host1.example.com",
            "service_description": "disk_agent",
            "timestamp": "1559720464000",
            "state": "0",
            "type": "hard_state"
        },
        {
            "host_name": "host1.example.com",
            "service_description": null,
            "timestamp": "1559661092000",
            "state": "0",
            "type": "hard_state"
        }
    ],
    "downtimes": [],
    "event_adjustments": [],
    "target_availability": 99.5,
    "expected_monitored_objects": [
        {
            "host_name": "host1.example.com",
            "service_description": "disk_agent"
        },
        {
            "host_name": "host1.example.com",
            "service_description": null
        }
    ],
    "locale": "en_US"
}

This request triggers a calculation that:

  1. Takes into account only those states on Mondays between 08:00 and 18:00, and on Tuesdays between 08:00 and 12:00 and between 13:00 and 18:00.

  2. Parses all the dates and times with Europe/Rome as the time_zone.

  3. Aggregates the result by week, with the first day of the week set to Monday, according to the calculation_period.

  4. Performs the calculations for the time_range going from 1555000000000 (Thursday, April 11th, 2019, 16:26:40 UTC) to 1561000000000 (Thursday, June 20th, 2019, 03:06:40 UTC), both expressed as Unix time in milliseconds.

  5. Defaults to state 0, as set by initial_state, if a state cannot be determined, for instance because there are no events in the indicated time range.

  6. Excludes all downtime events (including dt_start and dt_end) from the calculations, since consider_downtime is false.

  7. Excludes all event adjustments, since consider_event_adjustments is false.

  8. Processes a list of two events, which occurred at Unix times 1559720464000 and 1559661092000.
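As a cross-check of item 1, the following minimal sketch adds up the time covered by the ranges of the time_period in one week. The helper name range_ms is hypothetical and is not part of the SLM module; the sketch only assumes the "HH:MM-HH:MM" format, with multiple intervals separated by commas, used in the request above.

```python
def range_ms(spec: str) -> int:
    """Return the milliseconds covered by one day's range specification,
    for example "08:00-12:00,13:00-18:00"."""
    total_minutes = 0
    for interval in spec.split(","):
        start, end = interval.strip().split("-")
        start_h, start_m = map(int, start.split(":"))
        end_h, end_m = map(int, end.split(":"))
        total_minutes += (end_h * 60 + end_m) - (start_h * 60 + start_m)
    return total_minutes * 60 * 1000


# The ranges from the example request body.
ranges = {
    "monday": "08:00-18:00",
    "tuesday": "08:00-12:00,13:00-18:00",
}

weekly_ms = sum(range_ms(spec) for spec in ranges.values())
print(weekly_ms)  # 68400000
```

The result corresponds to 19 hours per week (10 on Monday, 9 on Tuesday) and matches the TOTAL value of 68400000 reported for each weekly period in the example response.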

The Service Level Management module GUI automatically converts dates into Unix times. If you are using the REST API directly, you will need to convert the desired time into epoch milliseconds yourself. You can get the current time in milliseconds, or any desired time by using the -d parameter, with the following commands:

# date +%s%3N      # get the current time in milliseconds
1561031493613
# TZ=Europe/Rome date -d "2019-06-20 08:01:55" +%s%3N   # get a given local time in milliseconds
1561010515000
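The same conversion can be done in a script. The sketch below is only an illustration: the names to_epoch_ms, from_epoch_ms and CEST are hypothetical, and the fixed UTC+2 offset is an assumption that holds for Europe/Rome during summer time only (a real script should use a proper time zone database).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption: Europe/Rome is at UTC+2 (CEST) on the dates used here.
CEST = timezone(timedelta(hours=2))


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to Unix time in milliseconds."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_ms(ms: int) -> datetime:
    """Convert Unix time in milliseconds to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


print(to_epoch_ms(datetime(2019, 6, 20, 8, 1, 55, tzinfo=CEST)))  # 1561010515000
print(from_epoch_ms(1555000000000).isoformat())  # 2019-04-11T16:26:40+00:00
```

The first call reproduces the value 1561010515000 shown above, while the second one decodes the from value of the time_range used in the example request.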

Event types and states are defined by Icinga and described in the Icinga reference documentation, section Monitoring Basics for Hosts and Services.

To test an availability request, either construct one or copy and paste the request above into a shell variable (e.g., export DATA='<paste>'), then call the REST API with curl like this:

curl -Ss -X POST -H "Content-Type:application/json" --data "$DATA" http://slmd.neteyelocal:4949/api/availability_calculation_full
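If you prefer to send the request from a script, the curl call can be sketched with the Python standard library as follows. The function names build_request and post_availability are hypothetical, and the URL is simply the one used in the curl command above; adapt it to your environment.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
SLMD_URL = "http://slmd.neteyelocal:4949/api/availability_calculation_full"


def build_request(payload: dict, url: str = SLMD_URL) -> urllib.request.Request:
    """Prepare an HTTP POST request carrying the payload as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_availability(payload: dict, url: str = SLMD_URL, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as response:
        return json.load(response)
```

Calling post_availability with the request body shown earlier, loaded as a dictionary, should return the response as a Python dictionary, provided that the endpoint is reachable from the host running the script.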

The resulting (unformatted) JSON response should be similar to the one shown here:

{
  "monitored_objects": [
    {
      "host_name": "host1.example.com",
      "service_description": null,
      "calculation_periods": [
        {
          "from": 1555279200000,
          "to": 1555884000000,
          "states_ms": {
            "OPERATIVE": {
              "HARD_UP": 68400000,
              "HARD_DOWN": 0,
              "HARD_UNREACHABLE": 0,
              "SOFT_DOWN": 0,
              "SOFT_UNREACHABLE": 0
            },
            "IN_DOWNTIME": {
              "HARD_UP": 0,
              "HARD_DOWN": 0,
              "HARD_UNREACHABLE": 0,
              "SOFT_DOWN": 0,
              "SOFT_UNREACHABLE": 0
            },
            "TOTAL": 68400000
          }
        },
        { "..." : "..."  }
      ]
    },
    {
      "host_name": "host1.example.com",
      "service_description": "disk_agent",
      "calculation_periods": [
        {
          "from": 1555279200000,
          "to": 1555884000000,
          "states_ms": {
            "OPERATIVE": {
              "HARD_OK": 68400000,
              "HARD_WARNING": 0,
              "HARD_CRITICAL": 0,
              "HARD_UNKNOWN": 0,
              "SOFT_WARNING": 0,
              "SOFT_CRITICAL": 0,
              "SOFT_UNKNOWN": 0
            },
            "IN_DOWNTIME": {
              "HARD_OK": 0,
              "HARD_WARNING": 0,
              "HARD_CRITICAL": 0,
              "HARD_UNKNOWN": 0,
              "SOFT_WARNING": 0,
              "SOFT_CRITICAL": 0,
              "SOFT_UNKNOWN": 0
            },
            "TOTAL": 68400000
          }
        },
        { "..." : "..."  },
        {
          "from": 1560117600000,
          "to": 1560722400000,
          "states_ms": {
            "OPERATIVE": {
              "HARD_OK": 68400000,
              "HARD_WARNING": 0,
              "HARD_CRITICAL": 0,
              "HARD_UNKNOWN": 0,
              "SOFT_WARNING": 0,
              "SOFT_CRITICAL": 0,
              "SOFT_UNKNOWN": 0
            },
            "IN_DOWNTIME": {
              "HARD_OK": 0,
              "HARD_WARNING": 0,
              "HARD_CRITICAL": 0,
              "HARD_UNKNOWN": 0,
              "SOFT_WARNING": 0,
              "SOFT_CRITICAL": 0,
              "SOFT_UNKNOWN": 0
            },
            "TOTAL": 68400000
          }
        }
      ]
    }
  ],
  "events": {
      "skipped": {
        "ack": 1,
        "ack_deleted": 12
      },
      "unknown": 1
  }
}

Remember that the response includes only the calculation periods that fall entirely within the requested time_range; therefore, in the above response some days at the beginning and at the end of the time_range are excluded, because they are not part of a full week that starts on a Monday.
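To illustrate which weeks are reported, the sketch below lists the full Monday-to-Monday weeks contained in the time_range of the example request. It is not the algorithm used by the SLM backend: the function name full_weeks is hypothetical, and the fixed UTC+2 offset is an assumption that is valid only because the whole example range lies within summer time for Europe/Rome.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MS = timedelta(milliseconds=1)

# Assumption: Europe/Rome stays at UTC+2 (CEST) for the whole example range.
CEST = timezone(timedelta(hours=2))


def full_weeks(from_ms: int, to_ms: int, tz: timezone = CEST) -> list:
    """Return (from, to) pairs in milliseconds for every week that starts
    on a Monday at midnight and lies entirely inside [from_ms, to_ms)."""
    start = (EPOCH + from_ms * MS).astimezone(tz)
    end = (EPOCH + to_ms * MS).astimezone(tz)

    # First midnight at or after the start of the range...
    cursor = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if cursor < start:
        cursor += timedelta(days=1)
    # ...then move forward to the next Monday (weekday() == 0).
    cursor += timedelta(days=(7 - cursor.weekday()) % 7)

    weeks = []
    while cursor + timedelta(days=7) <= end:
        week_end = cursor + timedelta(days=7)
        weeks.append(((cursor - EPOCH) // MS, (week_end - EPOCH) // MS))
        cursor = week_end
    return weeks


weeks = full_weeks(1555000000000, 1561000000000)
print(len(weeks))  # 9
print(weeks[0])    # (1555279200000, 1555884000000)
print(weeks[-1])   # (1560117600000, 1560722400000)
```

The first and last pairs match the from and to values of the first and last calculation periods shown in the example response.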

The events field of the response contains a report of the events that the backend has skipped on purpose or was not able to process because they are of an unknown type. In this example, there are 13 skipped events (1 of type ack and 12 of type ack_deleted) and one event of unknown type. Detailed information about unknown events can be found in the application logs.

Warning

If you construct a request by hand, note that even small mistakes will cause the SLM service to reject it. For instance, formatting the hour as 8:00 instead of 08:00 will return an error. Additionally, errors in the timestamps will typically lead to an empty set of results like this: {"periods":[], "events": { "skipped": {}, "unknown": 0 }}
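A simple local check can catch the formatting mistake mentioned above before the request is sent. The sketch below is an assumption-based example: it only verifies the two-digit "HH:MM-HH:MM" form and does not reproduce the actual validation rules of the SLM service; the names invalid_ranges and is_empty_result are hypothetical.

```python
import re

# Two-digit hours and minutes on both sides of the dash, e.g. "08:00-18:00".
INTERVAL_RE = re.compile(r"\d{2}:\d{2}-\d{2}:\d{2}")


def invalid_ranges(ranges: dict) -> list:
    """Return a list of "day: interval" strings that are not in HH:MM-HH:MM form."""
    problems = []
    for day, spec in ranges.items():
        for interval in spec.split(","):
            interval = interval.strip()
            if not INTERVAL_RE.fullmatch(interval):
                problems.append(f"{day}: {interval}")
    return problems


def is_empty_result(response: dict) -> bool:
    """Detect the empty result shown in the warning above."""
    return response.get("periods") == []


print(invalid_ranges({"monday": "8:00-18:00", "tuesday": "08:00-12:00,13:00-18:00"}))
# ['monday: 8:00-18:00']
```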

Implementation Notes

  • All timestamps used throughout the SLM module are expressed as Unix time in milliseconds.

  • The from field of a time range is inclusive, while the to field is exclusive.
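In other words, time ranges are half-open intervals. A minimal sketch of the resulting membership test, using the hypothetical helper name in_time_range and the time_range of the example request, is:

```python
def in_time_range(timestamp_ms: int, from_ms: int, to_ms: int) -> bool:
    """Check membership in the half-open interval [from_ms, to_ms)."""
    return from_ms <= timestamp_ms < to_ms


FROM_MS, TO_MS = 1555000000000, 1561000000000

print(in_time_range(FROM_MS, FROM_MS, TO_MS))        # True: from is inclusive
print(in_time_range(TO_MS, FROM_MS, TO_MS))          # False: to is exclusive
print(in_time_range(1559720464000, FROM_MS, TO_MS))  # True: an event from the example
```

With this convention, two consecutive periods in which the to value of the first equals the from value of the second never overlap.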

Jobs

Report Scheduling Job

The systemd service icinga-reporting.service is in charge of scheduling the reports configured in the Reporting module.

Therefore, if you need to check the actions performed to schedule your reports, you can inspect the logs of this service, for example with:

journalctl -u icinga-reporting.service -f

The icinga-reporting.service service is bound to the php-fpm.service, in such a way that on NetEye Cluster environments the service will run only on one node.

Note

To debug failures of report scheduling jobs, check /neteye/shared/icingaweb2/log/icinga-reporting.log, which contains the complete logs of the errors that occurred during execution.

Report generation

Resource Report

The SLM Resource Report is generated through icingacli using the dedicated user neteye_grafana_readonly, which is authenticated with a JWT token. All required configurations are performed automatically during neteye_secure_install and must not be modified by the user.

The following configurations are performed:

  • A role neteye_grafana_read_only_role is added to /neteye/shared/icingaweb2/conf/roles.ini

  • The JWT token is generated in /neteye/shared/icingaweb2/conf/modules/analytics/jwt-tokens/neteye_grafana_readonly.jwt alongside a key pair. The public key can be found in /neteye/shared/icingaweb2/conf/modules/neteye/jwt-keys/neteye-jwt.pub

  • A backend is added to /neteye/shared/icingaweb2/conf/authentication.ini to allow, by default, NetEye logins using JWT tokens validated with the aforementioned public key.

Note

During the Resource Report generation, some temporary users (named neteye_report_temporary_XXXXXXXX) are created in the ITOA Module. They are part of the process and are removed as soon as the process completes.