User Guide

Concepts

The Service Level Management (SLM) module lets you set up contracts between an SLM customer and the service provider, with the purpose of sending periodic reports to the customer.

There are two types of contracts available:

  • The Availability contract measures the availability of hosts and services within a given period of time, to verify whether the level of availability required by the customer has been met. An availability contract is defined by a customer, an SLA type and a set of monitored objects.

  • The Resource contract provides a dashboard with diagrams that show the load of the monitored objects within a given time range. Dashboards are Grafana-based and need to be created for each contract. A resource contract is defined by a customer and a set of diagrams.

While setting up a contract is quite straightforward and is described in the dedicated sections below, a few points must be understood correctly to avoid possible problems. They are described in the next section, which you can use as a reference.

Important concepts

  • Users, customers, and their permissions

    When configuring new objects in the SLM module, keep in mind that permissions determine how customers can be managed. NetEye users with access to the SLM module can see and assign to a customer only the roles that they themselves belong to.

    For example, if a user Jake has roles B and C, then he can only see and assign roles B or C to a customer. The only exception is for users with Role Administrative access in NetEye, who can assign every role.

    This affects both Availability and Resource contracts: if the SLM module contains contracts involving a role that is not assigned to a user, that user will not see them.

    It is therefore important to assign appropriate roles to each user of the SLM module, so that they can create and manage the contracts of their customers.

  • Module permissions

    Besides the roles of users and customers, which determine access to contracts, the module permissions are also relevant and need to be clearly understood.

    • Full Module Access. This is essentially a shortcut that enables all the permissions below. A user with full module access can see all the content of the module related to their role.

    • General Module Access. This permission only gives the ability to load the module configuration and provides View-only access to Event Adjustment. It is mandatory for enabling the following permissions, and it is also necessary to enable the SLM extension of the Reporting module. To give Add/Edit/Delete permissions you need to enable slm/admin.

    • slm/admin. With this permission it is possible to view and edit everything that the user’s role allows them to see.

    • slm/report-adjustment-override. Granting this permission allows the user to modify the Consider Event Adjustments field in the Reporting module, provided the Reporting module’s SLM extensions have been enabled.

Note

Users with the slm/admin or slm/report-adjustment-override permission but without General Module Access can neither see nor access the SLM module (or the SLM extensions of the Reporting module). This is the expected behaviour: General Module Access is required to load the configuration of the SLM module and to activate the extensions.
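The role and permission rules described above can be summarized in a minimal Python sketch. It is an illustration only, not NetEye code: the function names and the two labels used for General and Full Module Access are assumptions made for this example. Only slm/admin is a permission name taken from this guide.

```python
# Illustrative sketch only; not the NetEye implementation.
# The two labels below are placeholders: this guide does not state the
# internal permission strings for General and Full Module Access.
GENERAL_ACCESS = "General Module Access"
FULL_ACCESS = "Full Module Access"


def can_open_slm(permissions: set[str]) -> bool:
    """The module loads only with General (or Full) Module Access."""
    return GENERAL_ACCESS in permissions or FULL_ACCESS in permissions


def can_edit_slm(permissions: set[str]) -> bool:
    """Add/Edit/Delete needs slm/admin on top of General Module Access."""
    if FULL_ACCESS in permissions:
        return True
    return GENERAL_ACCESS in permissions and "slm/admin" in permissions


def assignable_roles(user_roles: set[str], all_roles: set[str],
                     is_role_admin: bool = False) -> set[str]:
    """Roles a user can see and assign to a customer."""
    return set(all_roles) if is_role_admin else user_roles & all_roles


# Jake has roles B and C, so he can only assign B or C to a customer.
print(sorted(assignable_roles({"B", "C"}, {"A", "B", "C"})))  # ['B', 'C']
# slm/admin without General Module Access does not open the module.
print(can_open_slm({"slm/admin"}))  # False
```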

  • Operational Time explained

    The operational time does not need to indicate a single contiguous extent of time. For instance, it may be defined as “Business Hours” (i.e., “Monday through Friday, 9:00AM to 5:00PM”), which would exclude evening and early morning hours. You would construct such a Time Period in Director by specifying each individual contiguous Time Range separately, e.g. first “monday 9:00AM to 5:00PM”, then “tuesday 9:00AM to 5:00PM”, etc.

    When calculating availability, the monitored object’s initial state is valid until the first state change event (if one exists) during that Time Range, and the last state change event occurring in the Time Range is valid until the end of the Time Range. Thus, in the “Business Hours” example above, if the initial state of a service is OK on Monday at 9:00AM and a single state change event of type CRITICAL occurs at 4:30PM, the resulting availability for that day will be 7 and a half hours of OK and 30 minutes of CRITICAL.

  • How Downtime affects calculation

    A downtime is a scheduled period in which a host or service is not available. Suppose we have an overall operational time of 10 seconds where a series of state change events result in:

    • 1 second where the host is DOWN

    • 7 seconds with the host in an OK state

    • 2 more seconds where the host is DOWN

    Let’s also assume that the first second of DOWN was unexpected, while the final 2 seconds were scheduled downtime.

    • If the Downtime box is checked, the availability will be calculated as (OK + DOWNTIME)/OPERATIONAL TIME = (7s + 2s)/10s, therefore 9/10 or 90%.

    • If instead the box is not checked, the availability will be: OK/OPERATIONAL TIME = 7s / 10s, therefore 7/10 or 70%.
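The two calculations above, as well as the “Business Hours” example, can be reproduced with a short Python sketch. The function name and the data layout (a list of duration, state, scheduled-downtime tuples) are assumptions made for illustration; the set of states counted as available follows the description of the Availability % option (for services, Warning does not decrease availability).

```python
# Illustrative sketch only; not the algorithm shipped with the SLM module.
AVAILABLE_STATES = {"OK", "UP", "WARNING"}


def availability(intervals, downtime_box_checked):
    """intervals: (seconds, state, in_scheduled_downtime) tuples that
    together cover the whole operational time."""
    operational_time = sum(seconds for seconds, _, _ in intervals)
    available = 0
    for seconds, state, in_downtime in intervals:
        if state in AVAILABLE_STATES:
            available += seconds
        elif in_downtime and downtime_box_checked:
            # Scheduled downtime counts as available when the box is checked.
            available += seconds
    return available / operational_time


# 1s unexpected DOWN, 7s OK, 2s DOWN during scheduled downtime.
timeline = [(1, "DOWN", False), (7, "OK", False), (2, "DOWN", True)]
print(availability(timeline, downtime_box_checked=True))   # 0.9
print(availability(timeline, downtime_box_checked=False))  # 0.7

# Business Hours on Monday: OK from 9:00AM, CRITICAL from 4:30PM to 5:00PM.
monday = [(7.5 * 3600, "OK", False), (0.5 * 3600, "CRITICAL", False)]
print(availability(monday, downtime_box_checked=False))    # 0.9375
```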

Availability

This section describes all the necessary steps to create new availability contracts.

Defining SLA Types

An SLA is a commitment between a service provider and a client, defining particular aspects of a service. Within the SLM module, an SLA type can be associated with a customer contract and defines limits for metrics to be guaranteed by the service provider as well as the exact temporal boundaries during which the metrics must be guaranteed.

When the SLM module is first launched, the SLA Type panel is focused, displaying a row for each configured SLA type. Additional panels let you define Contracts and Customers, respectively. Search functionality is available for all three panels, but the text being searched will be restricted to the Name and Contract columns in the first two panels.

Note

Before you can successfully set up SLA types and contracts, you need to have defined a few other objects in Director, namely one or more TimePeriods (as Operational Time) and a filter expression (as Object Filter). An Object Filter defines the host(s) or service(s) that the contract applies to; in other words, the availability required by the customer will be calculated on these hosts or services. Examples of valid input for the Object Filter are: host_name=*neteye* or service_description=jenkins*.
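To get a feeling for which objects the two example filters select, the following Python sketch emulates the * wildcard with the standard fnmatch module. This is only a stand-in for illustration: the real filter is evaluated by Icinga, whose syntax also supports operators and combinations that are not modeled here, and the helper name matches and the sample host names are invented for the example.

```python
from fnmatch import fnmatchcase


def matches(filter_expr: str, obj: dict[str, str]) -> bool:
    """Evaluate a single 'attribute=pattern' expression against one object."""
    attribute, pattern = filter_expr.split("=", 1)
    return fnmatchcase(obj.get(attribute, ""), pattern)


hosts = [
    {"host_name": "neteye-master"},
    {"host_name": "my-neteye-02"},
    {"host_name": "db01"},
]
print([h["host_name"] for h in hosts if matches("host_name=*neteye*", h)])
# ['neteye-master', 'my-neteye-02']
```

A filter that matches no object in such a dry run is a hint that the expression should be checked before it is used in a contract.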

Before you can create an SLA contract to see the availability of monitored objects for a customer, you must first define the parameters for the SLAs you intend to use. You can do this at SLM / SLA Types, using the following options:

  • Name: The name of this SLA type (e.g., “Gold level”)

  • Description: A more user-friendly description of the SLA type

  • Operational Time: The exact time(s) during which all elements necessary for a monitored object to function properly should in fact be in operation. The operational time is precisely defined by a TimePeriod object in Director. This field lets you select either a pre-defined Timeperiod object or one that you have created at Icinga Director / Timeperiods / Timeperiods.

  • Calculation Period: The unit of time over which the data will be aggregated into service level reports. For instance, if you want an availability report for the current year, you might want it broken down into “Monthly” or “Weekly” subsections.

  • Availability %: The target percentage of SLA availability for the calculation period. For hosts, only Down states have a negative impact on availability. For services instead, both Critical and Unknown (but not Warning) will decrease availability.

  • Downtime: When this box is checked, the scheduled downtimes of monitored objects will be taken into account for any related availability calculations. When downtime is in effect, the related monitored object is considered available, regardless of its actual state during that period. Once the scheduled downtime ends, the object’s state will be reset to the value of its most recent state change.

At the moment the only supported TimePeriod values (i.e., Ranges) are exact dates and names of weekdays. There is also currently no support for excluded or included ranges.

More precise definitions of Calculation Period, Availability, Downtime, and other terms can be found in the glossary, while an example of the algorithm on which the SLM is based is shown in How the availability is calculated.

The following table defines each calculation period more precisely:

Calculation Period   Unit of Time
daily                from 00:00 to 24:00 of that same day
weekly_sunday        from 00:00 on Sunday to 24:00 on Saturday
weekly_monday        from 00:00 on Monday to 24:00 on Sunday
monthly_1            from 00:00 on the first day of one month to 24:00 on the last day of the same month
monthly_2            from 00:00 on the first day of one month to 24:00 on the last day of the subsequent month
monthly_3            from 00:00 on the first day of one month to 24:00 on the last day of the third month
monthly_4            from 00:00 on the first day of one month to 24:00 on the last day of the fourth month
monthly_6            from 00:00 on the first day of one month to 24:00 on the last day of the sixth month
monthly_12           from 00:00 on the first day of one month to 24:00 on the last day of the twelfth month

Note

The “last day” of a month may be the 28th or 29th for February, and the 30th or 31st otherwise.
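As a rough illustration of the calculation periods, the Python sketch below returns the first and last day of the period that contains a given date. The function name period_bounds is hypothetical, and the sketch assumes that multi-month periods are aligned to January (so that, for example, monthly_3 yields January to March, April to June, and so on); this guide does not state how such periods are aligned.

```python
import calendar
from datetime import date, timedelta


def period_bounds(day: date, period: str) -> tuple[date, date]:
    """Return (first day, last day) of the calculation period containing day.
    Each period runs from 00:00 of the first day to 24:00 of the last day."""
    if period == "daily":
        return day, day
    if period in ("weekly_monday", "weekly_sunday"):
        # date.weekday(): Monday is 0, Sunday is 6.
        if period == "weekly_monday":
            offset = day.weekday()
        else:
            offset = (day.weekday() + 1) % 7
        start = day - timedelta(days=offset)
        return start, start + timedelta(days=6)
    if period.startswith("monthly_"):
        months = int(period.split("_")[1])  # 1, 2, 3, 4, 6 or 12
        # Assumption: blocks of months are aligned to January.
        first_month = ((day.month - 1) // months) * months + 1
        last_month = first_month + months - 1
        last_day = calendar.monthrange(day.year, last_month)[1]
        return date(day.year, first_month, 1), date(day.year, last_month, last_day)
    raise ValueError(f"unknown calculation period: {period}")


# February 2020 has 29 days.
print(period_bounds(date(2020, 2, 10), "monthly_1"))
# (datetime.date(2020, 2, 1), datetime.date(2020, 2, 29))
```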

Creating an SLA Customer

To provide the ability to group contracts by customer, the SLM module allows you to define specific customers that you can then associate with one or more SLA/Resource contracts (e.g., because each contract concerns different hosts or services of the customer, or different levels of availability).

To add a new customer for SLM, go to SLM / Customers and enter values for the following:

  • Name: The name of this customer

  • Description: A longer description for the customer

  • Role: The roles assigned to this customer, which give a set of permissions to access the resources to be reported. The Role is used only for Resource contracts, not for SLA contracts.

  • Report Template: The report template assigned to this customer, which will be used when generating the SLM report. If no template is selected, the default template will be used.

Creating SLA Contracts

Average Availability provides aggregated statistics to support the verification of SLAs for customers with more complex contract definitions. Specifically, every report includes by default the measure of the Average Availability of all hosts and services in the time period. To enable Average Availability, tick the Include Average Availability checkbox as described below. You can override this setting by changing the same option in the report creation section; see SLA Reports.

Once you have defined an SLA type, you can begin creating Service Level Agreement contracts. Click on SLM / Contracts and enter appropriate values for the following options:

  • Name: The name of this contract

  • Description: A more user-friendly description of the contract. This description can be displayed in the report and supports formatting through GitHub Flavored Markdown

  • Customer: The customer, set in the Customer tab, whose monitored objects (typically hosts and services) will be included in the availability report. Only customers with the same role as the logged-in user are displayed.

  • SLA Type: The type of SLA you defined in the section above

  • Consider Event Adjustments: This checkbox should be set by an administrator if you want to allow event adjustments to be considered when generating a report.

  • Include Average Availability: This checkbox should be set if you want to include average availability when generating a report.

  • Render Contract Description in Report: This checkbox should be set if you want to insert the contract description in the rendered report.

  • Objects Type: Can be set to host or service to include only host or only service objects, respectively, in the current Contract, or to all to consider both hosts and services.

  • Objects Filter: A set of monitored objects determined by an Icinga filter expression. It is important to check that the filter expression actually returns at least one monitored object.

  • Monitoring Views: Depending on the chosen Objects Type, the count of hosts, services, or both that match the object filter is shown here in parentheses. Clicking the Hosts link or the Services link takes you to the related Monitoring Overview.

Outages

An outage is a period of time during which a Monitored Object (i.e., a host or a service) or a Business Process is in a non-available state. In the simplest case, an outage is the period of time between two consecutive available statuses during which an event that influences the Availability takes place. The states that affect Availability negatively, and that are used for the calculation of outages, are those of HARD events only.

Each Outage has an associated Duration, which is the total amount of time during which the Monitored Object or Business Process was in a non-available state.

During each calculation period, the sum of the duration of all outages plus the duration of Available statuses must equal the duration of the calculation period itself.

Outages appear as part of each SLM report, if configured. You can choose to show only the number of outages, or a detailed list of outages for each monitored object in the report. It is also possible to add an Outage Annotation to each outage; if defined, it will be rendered in the SLM report. More information can be found in the Outage configuration section.

Whenever the availability is below the required target availability, the monitored host or service is highlighted. You can drill down on any calculation period with an Availability below 100% to open a new page with a detailed report of the related events.

If an Operational Time has been defined in an SLA, the definition and computation of an outage change slightly. In this case, Outages are recorded only when they fall within an Operational Time; when the state of non-availability takes place outside an Operational Time, it is called an Unavailability Period. In other words, an Unavailability Period is an interval of time during which a service, host, or Business Process is not available, and it becomes an Outage if it falls at least partially within an Operational Time.

In the remainder of this section, we show examples of how Outages are calculated with Operational time and how they are considered within and across Calculation Periods.

Records of Outages with Operational Time

This section enumerates all the basic cases that are taken into account for the calculation of outages, by considering a scenario with one Calculation Period composed of two Operational Time Ranges: the first from 8AM to 9AM and the second from 10AM to 11AM.

More complex scenarios can be reproduced by suitably combining the following cases.

Throughout this section, in all the diagrams and in the text, OT stands for Operational Time and reports all the Operational Time Ranges, while UnPs represent Unavailability Periods, i.e., the intervals during which a host or service is unavailable.

Unavailability Periods outside the Operational Time

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1---1   2---2
          |       |
          |       \-> 1 outage of 30 minutes
          \-> no outage

In this example the Unavailability Period 1 [7:20AM-7:40AM] starts and ends outside any Operational Time Range, so it causes no outage.

Unavailability Period 2 [8AM-8:30AM], instead, falls completely inside the first Operational Time Range, so it causes an outage that starts at 8AM and ends at 8:30AM, with a duration of 30min.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 8:30AM

  • Outage duration: 30min

Unavailability Periods starting or ending outside Operational Times

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs      1---------------------1
                |
                \-> 1 outage of 1 hour

The UnP 1 [7:30AM-9:30AM] starts before the OT range [8AM-9AM] and ends after it.

Since Outages exist only during the Operational Time, the UnP causes an Outage that starts at 8AM, the beginning of the OT range [8AM-9AM], and ends at 9AM, the end of the same OT range.

The duration of the Outage is 1h, which is the amount of Operational Time affected by the Outage.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 9AM

  • Outage duration: 1h

Unavailability Periods across different Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs      1--------------------------------1
                |
                \-> 1 single outage of 90 minutes

The UnP 1 [7:30AM-10:30AM] starts before the first OT range [8AM-9AM], and ends during the second OT range [10AM-11AM].

This means that the Unavailability Period causes an Outage that starts at 8AM (the start of OT range [8AM-9AM]) and ends at 10:30AM, after the second OT range starts. The duration of the Outage is 1h 30min, since the Outage affected the full OT range [8AM-9AM] and the first 30min of the OT range [10AM-11AM].

Reported Outage:

  • Outage start: 8AM

  • Outage end: 10:30AM

  • Outage duration: 1h 30min

Temporary Uptimes between Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1---------------------1   2-----------------2
                            ↓       ↓
                       1 single outage of 2 hours

In this case the UnP 1 [7:20AM-9:20AM] entirely covers the OT range [8AM-9AM] and the UnP 2 [9:40AM-11:30AM] entirely covers the subsequent OT range [10AM-11AM].

The result is a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h. Only one Outage is reported because, during the Operational Time, the Monitored Object or Business Process was continuously unavailable: the short interval during which it was available [9:20AM-9:40AM] lies completely outside the OT and is therefore disregarded.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 11AM

  • Outage duration: 2h

Temporary Uptimes during Operational Time ranges

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OT              |----------|          |---------|
UnPs    1----------------1      2----------2
                |                  |
                |                  \-> 1 outage of 30 minutes
                \-> 1 outage of 50 minutes

Here the UnP 1 [7:20AM-8:50AM] ends just before the end of the first OT range [8AM-9AM], and the UnP 2 [9:30AM-10:30AM] overlaps with the beginning of the second OT range [10AM-11AM].

This causes 2 different Outages, because during both Operational Time Ranges the Monitored Object or Business Process changes status (here, it becomes available). The first Outage starts at 8AM and ends at 8:50AM, with a duration of 50min. The second Outage starts at 10AM and ends at 10:30AM, with a duration of 30min.

1st Reported Outage:

  • Outage start: 8AM

  • Outage end: 8:50AM

  • Outage duration: 50min

2nd Reported Outage:

  • Outage start: 10AM

  • Outage end: 10:30AM

  • Outage duration: 30min

Unavailability Periods starting or ending on OT range start or end

    7AM        8AM        9AM       10AM      11AM     12PM
     |----------|----------|----------|---------|--------|
OTs             |----------|          |---------|
UnPs    1------------------1          2-----------------2
                          ↓            ↓
                       1 single outage of 2 hours

If an Unavailability Period starts or ends at exactly the same instant as an OT range starts or ends, then the start or end of the Unavailability Period is always considered as happening outside the OT range. This case is very similar to the Temporary Uptimes between Operational Time ranges case, with the only difference that two events coincide with the start or end of an Operational Time Range.

The UnP 1 [7:20AM-9AM] ends at exactly the same moment as the first OT range [8AM-9AM] ends, while UnP 2 [10AM-11:50AM] starts at exactly the same moment as the second OT range [10AM-11AM] starts.

This causes a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h.

This happens because the end of UnP 1 [7:20AM-9AM] is considered as happening after the first OT range [8AM-9AM] and the start of UnP 2 [10AM-11:50AM] is considered as happening before the second OT range [10AM-11AM]. So the Monitored Object or Business Process was continuously unavailable for the whole Operational Time.

Reported Outage:

  • Outage start: 8AM

  • Outage end: 11AM

  • Outage duration: 2h
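The cases above can be reproduced with a small Python sketch. It is one possible way to obtain the reported results, not necessarily how the SLM module computes them: each Unavailability Period is first clipped to the Operational Time ranges, and consecutive pieces are then merged unless the object was available for some operational time in between. All names (Outage, hm, compute_outages) are invented for the example, times are expressed as minutes since midnight, and Unavailability Periods are assumed not to overlap each other.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start: int      # minutes since midnight
    end: int        # minutes since midnight
    duration: int   # operational minutes affected


def hm(hour: int, minute: int = 0) -> int:
    """Convert a time of day to minutes since midnight."""
    return hour * 60 + minute


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def compute_outages(ot_ranges, unps):
    """ot_ranges, unps: lists of non-overlapping (start, end) pairs."""
    # 1. Keep only the parts of each UnP that fall inside an OT range.
    segments = sorted(
        (max(u_start, o_start), min(u_end, o_end))
        for u_start, u_end in unps
        for o_start, o_end in ot_ranges
        if max(u_start, o_start) < min(u_end, o_end)
    )
    # 2. Merge consecutive segments unless the object was available
    #    for some operational time between them.
    outages = []
    for start, end in segments:
        if outages:
            last = outages[-1]
            available = sum(overlap(last.end, start, o_start, o_end)
                            for o_start, o_end in ot_ranges)
            if available == 0:
                last.end = end
                last.duration += end - start
                continue
        outages.append(Outage(start, end, end - start))
    return outages


OT = [(hm(8), hm(9)), (hm(10), hm(11))]

# Temporary Uptimes between Operational Time ranges: one outage of 120 minutes.
print(compute_outages(OT, [(hm(7, 20), hm(9, 20)), (hm(9, 40), hm(11, 30))]))
# Temporary Uptimes during Operational Time ranges: 50 and 30 minutes.
print(compute_outages(OT, [(hm(7, 20), hm(8, 50)), (hm(9, 30), hm(10, 30))]))
```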

Outages and Calculation Periods

The previous cases take into account Operational Time Ranges only: the interval between two consecutive Time Ranges is disregarded, so an Unavailability Period that spans multiple OT Ranges is considered as a single Outage.

In this section, we define how Outages are considered when Calculation Periods are taken into account.

The simplest case, in which one Unavailability Period (and therefore the Outage) is entirely included in a Calculation Period, results in a single Outage reported in that Calculation Period.

However, when an Unavailability Period spans two Calculation Periods, the Outage is split into two Outages, one for the first Calculation Period and one for the second, even though Calculation Periods are by default contiguous time intervals. The same applies when multiple Calculation Periods are involved, as the following diagram shows. Here, CPs are Calculation Periods and OUT are the recorded Outages.

    Jan 2020   Feb        Mar       Apr       May      June
     |----------|----------|----------|---------|--------|
CPs  |----------|----------|----------|---------|--------|
UnPs   1-------------1         2----------------------2
OUT    |---O1---|-O2-|         |--O3--|---O4----|--O5-|

In this example, although there are only two UnPs, five Outages will be reported, one for each month in which the Monitored Object or Business Process was not available. In summary (note that the actual Duration in the report will be in HH:MM:SS; here we use days for simplicity), the first Unavailability Period results in:

Reported Outage (January):

  • Outage start: 7th of January

  • Outage end: 31st of January

  • Outage duration: 25 days

Reported Outage (February):

  • Outage start: 1st of February

  • Outage end: 14th of February

  • Outage duration: 14 days

while the second Unavailability Period results in:

Reported Outage (March):

  • Outage start: 10th of March

  • Outage end: 31st of March

  • Outage duration: 22 days

Reported Outage (April):

  • Outage start: 1st of April

  • Outage end: 30th of April

  • Outage duration: 30 days

Reported Outage (May):

  • Outage start: 1st of May

  • Outage end: 19th of May

  • Outage duration: 19 days
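The splitting of an Unavailability Period across Calculation Periods can be sketched in Python as follows. The sketch assumes a monthly Calculation Period and, like the example above, works with whole days counted inclusively (7th to 31st of January is 25 days); the function name split_by_month is invented for illustration.

```python
import calendar
from datetime import date, timedelta


def split_by_month(start: date, end: date):
    """Split the inclusive day range [start, end] into one
    (outage start, outage end, duration in days) tuple per calendar month."""
    outages = []
    current = start
    while current <= end:
        days_in_month = calendar.monthrange(current.year, current.month)[1]
        month_end = date(current.year, current.month, days_in_month)
        piece_end = min(end, month_end)
        outages.append((current, piece_end, (piece_end - current).days + 1))
        current = piece_end + timedelta(days=1)
    return outages


# First Unavailability Period of the example: 7th of January to 14th of February.
for outage_start, outage_end, days in split_by_month(date(2020, 1, 7), date(2020, 2, 14)):
    print(outage_start, outage_end, days)
# 2020-01-07 2020-01-31 25
# 2020-02-01 2020-02-14 14
```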

Event Adjustment

The Event Adjustment feature allows Service Owners and Service Level Managers to retroactively add events. An Event Adjustment is the action of adding to a monitored host or service an event, covering a given period of time, that did not actually take place. This proves useful in situations where, for instance, the Operational team has forgotten to schedule Downtime in advance: this oversight would lead to a wrong calculation of the availability, and adding an Event Adjustment that covers the unscheduled downtime fixes the problem. Note that Event Adjustments affect the availability calculations for hosts and services, and can therefore affect whether the service provider satisfies the target SLA specified in the customer contract.

How It Works

The Event Adjustment feature is an extension of the SLM module, and event adjustments can only be made by a NetEye user with a certain privilege level. All inserted event adjustments are stored in a dedicated database table in order to ensure that the data (i.e., the existing timeline) cannot be manipulated. Event adjustments are taken into account during Availability calculations without any additional intervention.

Each adjustment must be applied to a monitored object. The following table indicates which types of event adjustments can be applied to each type of monitored object.

Table 15 Available event types for host and services

Monitored Object Type   Event Type
host                    up, down, downtime
service                 ok, warning, critical, unknown, downtime

Multiple event adjustments on the same host or service cannot overlap, except if the event type of one of the adjustments is downtime. If a new or modified event does overlap the time bounds of an existing adjustment, NetEye will report an error and the new event adjustment will not be processed.
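The two rules above (allowed event types per object type, and no overlap unless a downtime is involved) can be expressed as a short validation sketch in Python. The class and function names are hypothetical and the sketch is not the validation code used by NetEye; it also assumes that adjustments which merely touch (one ends exactly when the other starts) do not count as overlapping.

```python
from dataclasses import dataclass
from datetime import datetime

ALLOWED_EVENT_TYPES = {
    "host": {"up", "down", "downtime"},
    "service": {"ok", "warning", "critical", "unknown", "downtime"},
}


@dataclass(frozen=True)
class Adjustment:
    object_name: str
    object_type: str   # "host" or "service"
    event_type: str
    start: datetime
    end: datetime


def check_adjustment(new: Adjustment, existing: list[Adjustment]) -> None:
    """Raise ValueError if `new` violates the rules described above."""
    if new.event_type not in ALLOWED_EVENT_TYPES[new.object_type]:
        raise ValueError(f"{new.event_type!r} is not valid for a {new.object_type}")
    for other in existing:
        same_object = ((other.object_name, other.object_type)
                       == (new.object_name, new.object_type))
        overlaps = new.start < other.end and other.start < new.end
        downtime_involved = "downtime" in (new.event_type, other.event_type)
        if same_object and overlaps and not downtime_involved:
            raise ValueError("overlaps an existing event adjustment")


existing = [Adjustment("web01", "service", "ok",
                       datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 10, 0))]
# A downtime may overlap an existing adjustment, so this call does not raise.
check_adjustment(Adjustment("web01", "service", "downtime",
                            datetime(2020, 1, 6, 9, 30), datetime(2020, 1, 6, 10, 30)),
                 existing)
```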

Resource

In order to configure a resource contract, you need to have defined a customer as described in the corresponding section above. You also need to have set up the Grafana dashboard(s) that should be linked to the contract.

A user can define Resource Contracts for their resources in order to generate SLM Resource reports. To add a new Resource Contract for SLM, go to SLM > Resource Contracts and enter appropriate values for the following options:

  • Name: The name of this contract

  • Description: A more user-friendly description of the contract

  • Customer: The customer, set in the Customer tab, whose analytical dashboard in Grafana will be included in the resource report. Customers appear in the dropdown only if they have Analytics Module access and an assigned role in common with the SLM user.

  • Dashboard: An analytical dashboard (static) of a customer in Grafana

To prevent users from creating very large resource reports, there is a limit on the maximum number of panels that can be included in a report. A user cannot add a new Resource Contract if the number of panels in the selected dashboard exceeds this limit.

This limit can be increased manually by updating the dashboard_panel_size value in the SLM module config file /neteye/shared/icingaweb2/conf/modules/slm/config.ini

Please be warned that increasing this limit will lead to a proportional decrease in performance.
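Before changing the limit, it can be useful to check the value currently configured. The following Python sketch looks for the dashboard_panel_size key in the text of an INI file using the standard configparser module. It assumes that the file uses the usual [section] / key = value layout; since this guide does not state which section holds the key, the sketch searches all sections, and the section name and value in the sample text are purely hypothetical.

```python
import configparser
from typing import Optional


def find_panel_limit(ini_text: str) -> Optional[int]:
    """Return the first dashboard_panel_size found in any section, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    for section in parser.sections():
        if parser.has_option(section, "dashboard_panel_size"):
            raw = parser.get(section, "dashboard_panel_size")
            return int(raw.strip("\"' "))  # tolerate a quoted value
    return None


# Hypothetical sample; section name and value are not taken from NetEye.
sample = """
[example_section]
dashboard_panel_size = "20"
"""
print(find_panel_limit(sample))  # 20
```

To inspect the real file, read its content (for example with open(path).read()) and pass the text to find_panel_limit.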