Concepts¶
The Service Level Management Module (SLM) lets you set up contracts between an SLM customer and the service provider, with the purpose of sending a periodic report to the customer.
There are two types of contracts available:
The Availability contract measures the availability of hosts and services within a given period of time, to verify whether the level of availability required by the customer has been met. An availability contract is defined by a customer, an SLA type and a set of monitored objects.
The Resource Contract shows a dashboard with diagrams that show the load of the monitored objects within a given time range. Dashboards are Grafana-based and need to be created for each contract. A resource contract is defined by a customer and a set of diagrams.
While setting up a contract is quite straightforward, and is described in the dedicated sections below, a few points must be understood correctly in order to avoid possible sources of problems. They are described in the next section, which you can use for reference.
Important concepts¶
Users, customers, and their permissions
When configuring new objects in the SLM module, keep in mind the part that permissions play in the management of customers: NetEye users with access to the SLM module can see and assign to a customer only the roles that they themselves belong to.
For example, if a user Jake has roles B and C, then he can only see and assign roles B or C to a customer. The only exception is for users with Role Administrative access in NetEye, who can assign every role.
This affects both Availability and Resource contracts: if the SLM module contains contracts involving a role that is not assigned to a user, that user will not see them.
It is therefore important to assign appropriate roles to each user of the SLM module, to allow them to create and manage the contracts of their customers.
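The role rules above can be summarised in a small Python sketch. It is only an illustration and not how NetEye implements the check: the function names, the `has_role_admin_access` flag and the idea of passing roles as plain sets are assumptions.

```python
def assignable_roles(user_roles, all_roles, has_role_admin_access=False):
    """Roles that a user can see and assign to a customer."""
    if has_role_admin_access:
        # Users with Role Administrative access can assign every role.
        return set(all_roles)
    # Everyone else is limited to the roles they belong to.
    return set(user_roles) & set(all_roles)


def can_see_contract(user_roles, contract_roles):
    """A contract is hidden if it involves a role the user does not have."""
    return set(contract_roles) <= set(user_roles)


jake_roles = {"B", "C"}
all_roles = {"A", "B", "C"}
jake_assignable = assignable_roles(jake_roles, all_roles)
admin_assignable = assignable_roles(jake_roles, all_roles, has_role_admin_access=True)
```

Here Jake can assign only roles B and C, and a contract involving role A stays hidden from him.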
Module permissions
Besides the roles of users and customers, which determine access to contracts, the module permissions are also relevant and need to be clearly understood.
Full Module Access. It is essentially a shortcut that enables all the permissions below. Users with full module access can see all the content of the module related to their role.
General Module Access. This permission only gives the ability to load the module configuration and provides View-only access to Event Adjustment. This permission is mandatory for enabling the following permissions. It is also necessary to enable the SLM extension of the Reporting module. To give Add/Edit/Delete permissions you need to enable slm/admin.
slm/admin. With this permission it is possible to view and edit everything that the user’s role allows them to see.
slm/report-adjustment-override. Granting this permission allows the user to modify the Consider Event Adjustments field in the Reporting module, provided the Reporting module’s SLM extensions have been enabled.
Note
Users with slm/admin or slm/report-adjustment-override permissions but without General Module Access can neither see nor access the SLM module (or the SLM extensions of the Reporting module). This is the expected behaviour: General Module Access is required to load the configuration of the SLM module and activate the extensions.
Operational Time explained
The operational time does not need to indicate a single contiguous extent of time. For instance, it may be defined as “Business Hours” (i.e., “Monday through Friday, 9:00AM to 5:00PM”), which would exclude evening and early morning hours. You would construct such a Time Period in Director by specifying each individual contiguous Time Range separately, e.g. first “monday 9:00AM to 5:00PM”, then “tuesday 9:00AM to 5:00PM”, etc.
When calculating availability, the monitored object’s initial state is valid until the first state change event (if one exists) during that Time Range, and the last state change event occurring in the Time Range is valid until the end of the Time Range. Thus given the “Business Hours” example above where the initial state of a service is OK on Monday at 9:00AM, and a single state change event of type CRITICAL occurs at 4:30PM, then the resulting availability will be 7 and a half hours of OK and 30 minutes of CRITICAL.
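The rule above can be illustrated with a short Python sketch. It is not the SLM implementation: the function name `state_durations` and the representation of events as `(timestamp, state)` pairs are assumptions made for illustration, and it handles a single contiguous Time Range only.

```python
from datetime import datetime, timedelta


def state_durations(range_start, range_end, initial_state, events):
    """Return {state: timedelta} for one contiguous Time Range.

    events: (timestamp, new_state) pairs sorted by time, all inside the range.
    """
    durations = {}
    current_state, cursor = initial_state, range_start
    for timestamp, new_state in events:
        # The current state is valid until the next state change event.
        durations[current_state] = durations.get(current_state, timedelta()) + (timestamp - cursor)
        current_state, cursor = new_state, timestamp
    # The last state is valid until the end of the Time Range.
    durations[current_state] = durations.get(current_state, timedelta()) + (range_end - cursor)
    return durations


monday = datetime(2024, 1, 1)  # an arbitrary Monday
result = state_durations(
    monday.replace(hour=9),
    monday.replace(hour=17),
    "OK",
    [(monday.replace(hour=16, minute=30), "CRITICAL")],
)
```

For the “Business Hours” example, `result` holds 7 hours 30 minutes for OK and 30 minutes for CRITICAL.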
How Downtime affects calculation
A downtime is a scheduled period in which a host or service is not available. Suppose we have an overall operational time of 10 seconds where a series of state change events result in:
1 second where the host is DOWN
7 seconds with the host in an OK state
2 more seconds where the host is DOWN
Let’s also assume that the first second of DOWN state was unexpected, while the final 2 seconds of this period were scheduled downtime.
If the Downtime box is checked, the availability will be calculated as (OK + DOWNTIME)/OPERATIONAL TIME = (7s + 2s)/10s, therefore 9/10 or 90%.
If instead the box is not checked, the availability will be: OK/OPERATIONAL TIME = 7s / 10s, therefore 7/10 or 70%.
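The two formulas can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the function and parameter names are illustrative and not part of the SLM module.

```python
def availability(ok_seconds, downtime_seconds, operational_seconds, count_downtime):
    """Availability as a fraction of the operational time."""
    available = ok_seconds
    if count_downtime:
        # Downtime box checked: scheduled downtime counts as available time.
        available += downtime_seconds
    return available / operational_seconds


with_downtime = availability(7, 2, 10, count_downtime=True)
without_downtime = availability(7, 2, 10, count_downtime=False)
print(f"{with_downtime:.0%} / {without_downtime:.0%}")
```

Note that the unexpected first second of DOWN state never counts as available, whether or not the box is checked.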
Availability¶
This section describes all the necessary steps to create new availability contracts.
Defining SLA Types¶
An SLA is a commitment between a service provider and a client, defining particular aspects of a service. Within the SLM module, an SLA type can be associated with a customer contract and defines limits for metrics to be guaranteed by the service provider as well as the exact temporal boundaries during which the metrics must be guaranteed.
When the SLM module is first launched, the SLA Type panel is focused, displaying a row for each configured SLA type. Additional panels allow you to define Contracts and Customers respectively. Search functionality is available for all three panels, but the text being searched will be restricted to the Name and Contract columns in the first two panels.
Note
Before you can successfully set up SLA types and contracts, you need to have defined a few other objects in Director, namely one or more TimePeriods (as Operational Time) and a filter expression (as Object Filter). An Object Filter defines the host(s) or service(s) covered by the contract; in other words, the availability required by the customer will be calculated on these hosts or services. Examples of valid input for the Object Filter are: host_name=*neteye* or service_description=jenkins*.
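To get a feeling for what the two example filters select, the following Python sketch approximates them with shell-style wildcards. It is a deliberate simplification: real Icinga filter expressions support more operators and combinations, the case sensitivity of the real matching is not covered here, and the `matches` helper and the sample object names are made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(filter_expression, monitored_object):
    """Evaluate a single 'attribute=pattern' expression against a dict."""
    attribute, pattern = filter_expression.split("=", 1)
    return fnmatchcase(monitored_object.get(attribute, ""), pattern)


hosts = [{"host_name": "neteye-master"}, {"host_name": "db01"}]
services = [{"service_description": "jenkins-build"}, {"service_description": "ssh"}]

matched_hosts = [h["host_name"] for h in hosts if matches("host_name=*neteye*", h)]
matched_services = [
    s["service_description"]
    for s in services
    if matches("service_description=jenkins*", s)
]
```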
Before you can create an SLA contract to see the availability of monitored objects for a customer, you must first define the parameters for the SLAs you intend to use. You can do this at
, using the following options:
Name: The name of this SLA type (e.g., “Gold level”)
Description: A more user-friendly description of the SLA type
Operational Time: The exact time(s) during which all elements necessary for a monitored object to function properly should in fact be in operation. The operational time is precisely defined by a TimePeriod object in Director. This field lets you select either a pre-defined TimePeriod object or one that you have created at
.
Calculation Period: The unit of time over which the data will be aggregated into service level reports. For instance, if you want an availability report for the current year, you might want it broken down into “Monthly” or “Weekly” subsections.
Availability %: The target percentage of SLA availability for the calculation period. For hosts, only Down states have a negative impact on availability. For services instead, both Critical and Unknown (but not Warning) will decrease availability.
Downtime: When this box is checked, the scheduled downtimes of monitored objects will be taken into account for any related availability calculations. When downtime is in effect, the related monitored object is considered available, regardless of its actual state during that period. Once the scheduled downtime ends, the object’s state will be reset to the value of its most recent state change.
At the moment the only supported TimePeriod values (i.e., Ranges) are exact dates and names of weekdays. There is also currently no support for excluded ranges and included ranges.
More precise definitions of Calculation Period, Availability, Downtime, and other terms can be found in the glossary, while an example of the algorithm on which the SLM module is based is shown in How the availability is calculated.
The following table defines each calculation period more precisely:
Calculation Period | Unit of Time |
---|---|
daily | from 00:00 to 24:00 of that same day |
weekly_sunday | from 00:00 on Sunday to 24:00 on Saturday |
weekly_monday | from 00:00 on Monday to 24:00 on Sunday |
monthly_1 | from 00:00 on the first day of one month to 24:00 on the last day of the same month |
monthly_2 | from 00:00 on the first day of one month to 24:00 on the last day of the subsequent month |
monthly_3 | from 00:00 on the first day of one month to 24:00 on the last day of the third month |
monthly_4 | from 00:00 on the first day of one month to 24:00 on the last day of the fourth month |
monthly_6 | from 00:00 on the first day of one month to 24:00 on the last day of the sixth month |
monthly_12 | from 00:00 on the first day of one month to 24:00 on the last day of the twelfth month |
Note
The “last day” of a month may be the 28th or 29th for February, and the 30th or 31st otherwise.
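As an illustration of the monthly_N rows, the following Python sketch computes the first and last day of such a period. The helper `monthly_period` is hypothetical, and the month in which a multi-month period begins is supplied by the caller, because how the SLM module anchors those periods is not specified here.

```python
from datetime import date, timedelta


def monthly_period(any_day_in_first_month, months):
    """Return (first_day, last_day) of a period spanning `months` months."""
    start = any_day_in_first_month.replace(day=1)
    # Count months from year 0, then move forward by the period length.
    index = start.year * 12 + (start.month - 1) + months
    next_period_start = date(index // 12, index % 12 + 1, 1)
    # The period ends on the day before the next period starts.
    return start, next_period_start - timedelta(days=1)


two_months = monthly_period(date(2020, 1, 15), 2)
full_year = monthly_period(date(2020, 1, 1), 12)
february = monthly_period(date(2021, 2, 10), 1)
```

The last day is derived from the calendar, so leap years are handled without special cases: `two_months` ends on 29 February 2020, while `february` ends on 28 February 2021.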
Creating an SLA Customer¶
To provide the ability to group contracts by customer, the SLM module allows you to define specific customers that you can then associate with one or more SLA/Resource contracts (e.g., because each contract concerns a different set of the customer’s hosts or services, or a different level of availability).
To add a new customer for SLM, go to
and enter values for the following:
Name: The name of this customer
Description: A longer description for the customer
Role: The roles assigned to this customer, which give a set of permissions to access the resources to be reported. The Role will be used only for resource contracts, not SLA contracts.
Report Template: The report template assigned to this customer, which will be used when generating the SLM report. If no template is selected, the default template will be used.
Creating SLA Contracts¶
Average Availability provides aggregated statistics to support the verification of SLA for customers with more complex contract definitions. Specifically, every report includes by default the measure of the Average Availability of all hosts and services in the time period. In order to enable Average Availability, you should tick the Include Average Availability checkbox as described below. You can override this setting by changing the same variable in the report creation section–see SLA Reports.
Once you have defined an SLA type, you can begin creating Service Level Agreement contracts. Click on
and enter appropriate values for the following options:
Name: The name of this contract
Description: A more user-friendly description of the contract. This description can be displayed in the report and supports formatting through GitHub Flavored Markdown
Customer: In the Customer tab you can set the customer whose monitored objects (typically hosts and services) will be included in the availability report. Only customers with the same role as the logged-in user are displayed.
SLA Type: The type of SLA you defined in the section above
Consider Event Adjustments: This checkbox should be set by an administrator if you want to allow event adjustments to be considered when generating a report.
Include Average Availability: This checkbox should be set if you want to include average availability when generating a report.
Render Contract Description in Report: This checkbox should be set if you want to insert the contract description in the rendered report.
Objects Type: Can be set to host or service to include only host or only service objects, respectively, in the current contract, or to all to consider both hosts and services.
Objects Filter: A set of monitored objects determined by an Icinga filter expression. It is important to check that the filter expression actually returns at least one monitored object.
Monitoring Views: Depending on the chosen Objects Type, the count of hosts, services, or both, that match the object filter is shown here in parentheses. Clicking the Hosts link or Services link will take you to the related Monitoring Overview.
Outages¶
An outage is a period of time during which a Monitored Object (i.e., a host or a service) or a Business Process is in a non-available state. In the simplest case, an outage is defined as the period of time between two consecutive available statuses, during which an event takes place, that influences the Availability; the states that affect Availability negatively and that are used for the calculation of outages are HARD events only.
Each Outage has an associated Duration, which is the total amount of time during which the Monitored Object or Business Process was in a non-available state.
During each calculation period, the sum of the duration of all outages plus the duration of Available statuses must equal the duration of the calculation period itself.
Outages appear as part of each SLM report, if configured: you can choose to show only the number of outages, or a detailed list of outages for each monitored object in the report. It is also possible to include an Outage Annotation for every outage, which will then be rendered in the SLM report, if defined. More information can be found in the Outage configuration section.
Whenever the availability is below the required target availability, the monitored host or service is highlighted. You can drill down on any calculation period with an Availability below 100%, to a new page, presenting a detailed report of the related events.
If an Operational Time has been defined in an SLA, the definition and computation of an outage change slightly. In this case, Outages are recorded only when they fall within an Operational Time; a state of non-availability, wherever it takes place, is called an Unavailability Period. In other words, an Unavailability Period is an interval of time during which a service, host, or Business Process is not available, and it becomes an Outage if it falls at least partially within an Operational Time.
In the remainder of this section, we show examples of how Outages are calculated with Operational time and how they are considered within and across Calculation Periods.
Records of Outages with Operational Time¶
This section enumerates all the basic cases that are taken into account for the calculation of outages, by considering a scenario with one Calculation Period composed of two Operational Time Ranges: the first from 8AM to 9AM and the second from 10AM to 11AM.
More complex scenarios can be reproduced by suitably combining the following cases.
Throughout this section, in all the diagrams and in the text, OT stands for Operational Time and reports all the Operational Time Ranges, while UnPs represent Unavailability Periods, i.e., the intervals during which a host or service is unavailable.
Unavailability Periods outside the Operational Time
      7AM        8AM        9AM        10AM      11AM     12PM
      |----------|----------|----------|---------|--------|
OT               |----------|          |---------|
UnPs     1---1   2---2
           |       |
           |       \-> 1 outage of 30 minutes
           \-> no outage
In this example the Unavailability Period 1 [7:20AM-7:40AM] starts and ends outside any Operational Time Range, so it causes no outage.
The Unavailability Period 2 [8AM-8:30AM] instead falls completely inside the first Operational Time Range, so it causes an outage that starts at 8AM and ends at 8:30AM, with a duration of 30min.
Reported Outage:
Outage start: 8AM
Outage end: 8:30AM
Outage duration: 30min
Unavailability Periods starting or ending outside Operational Times
      7AM        8AM        9AM        10AM      11AM     12PM
      |----------|----------|----------|---------|--------|
OT               |----------|          |---------|
UnPs       1---------------------1
                      |
                      \-> 1 outage of 1 hour
The UnP 1 [7:30AM-9:30AM] starts before and ends after the OT range [8AM-9AM].
Since Outages exist only during the Operational Time, the UnP causes an Outage that starts at 8AM, the beginning of the OT range [8AM-9AM], and ends at 9AM, the end of the same OT range.
The duration of the Outage is 1h, which is the amount of Operational Time affected by the Outage.
Reported Outage:
Outage start: 8AM
Outage end: 9AM
Outage duration: 1h
Unavailability Periods across different Operational Time ranges
      7AM        8AM        9AM        10AM      11AM     12PM
      |----------|----------|----------|---------|--------|
OT               |----------|          |---------|
UnPs       1--------------------------------1
                           |
                           \-> 1 single outage of 90 minutes
The UnP 1 [7:30AM-10:30AM] starts before the first OT range [8AM-9AM], and ends during the second OT range [10AM-11AM].
This means that the Unavailability Period causes an Outage that starts at 8AM (the start of OT range [8AM-9AM]) and ends at 10:30AM, after the second OT range starts. The duration of the Outage is 1h 30min, since the Outage affected the full OT range [8AM-9AM] and the first 30min of the OT range [10AM-11AM].
Reported Outage:
Outage start: 8AM
Outage end: 10:30AM
Outage duration: 1h 30min
Temporary Uptimes between Operational Time ranges
      7AM        8AM        9AM        10AM      11AM     12PM
      |----------|----------|----------|---------|--------|
OT               |----------|          |---------|
UnPs     1---------------------1   2-----------------2
                    ↓                       ↓
                    1 single outage of 2 hours
In this case the UnP 1 [7:20AM-9:20AM] entirely covers the OT range [8AM-9AM] and the UnP 2 [9:40AM-11:30AM] entirely covers the subsequent OT range [10AM-11AM].
The result is a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h. Only one Outage is reported because, during the Operational Time, the Monitored Object or Business Process was continuously unavailable. The short interval during which it was available [9:20AM-9:40AM] is disregarded, as it lies completely outside the OT.
Reported Outage:
Outage start: 8AM
Outage end: 11AM
Outage duration: 2h
Temporary Uptimes during Operational Time ranges
      7AM        8AM        9AM        10AM      11AM     12PM
      |----------|----------|----------|---------|--------|
OT               |----------|          |---------|
UnPs     1----------------1      2----------2
                 |                    |
                 |                    \-> 1 outage of 30 minutes
                 \-> 1 outage of 50 minutes
Here the UnP 1 [7:20AM-8:50AM] ends just before the end of the first OT range [8AM-9AM], and the UnP 2 [9:30AM-10:30AM] overlaps with the beginning of the second OT range [10AM-11AM].
This causes 2 different Outages, because during both Operational Time Ranges the Monitored Object or Business Process changes status (here, it becomes available). The first Outage starts at 8AM and ends at 8:50AM, with a duration of 50min. The second Outage starts at 10AM and ends at 10:30AM, with a duration of 30min.
1st Reported Outage:
Outage start: 8AM
Outage end: 8:50AM
Outage duration: 50min
2nd Reported Outage:
Outage start: 10AM
Outage end: 10:30AM
Outage duration: 30min
Unavailability Periods starting or ending on OT range start or end
      7AM        8AM        9AM        10AM      11AM     12PM
      |----------|----------|----------|---------|--------|
OTs              |----------|          |---------|
UnPs     1------------------1          2-----------------2
                  ↓                             ↓
                    1 single outage of 2 hours
If an Unavailability Period starts or ends at exactly the same instant as an OT range starts or ends, then the start or end of the Unavailability Period will always be considered as happening outside the OT range. This case is very similar to the first case presented, with the only difference that two events coincide with the start or end of an Operational Time Range.
The UnP 1 [7:20AM-9AM] ends at exactly the same moment as the first OT range [8AM-9AM] ends, while UnP 2 [10AM-11:50AM] starts at exactly the same moment as the second OT range [10AM-11AM] starts.
This causes a single Outage that starts at 8AM and ends at 11AM, with a duration of 2h.
This happens because the end of UnP 1 [7:20AM-9AM] is considered as happening after the first OT range [8AM-9AM] and the start of UnP 2 [10AM-11:50AM] is considered as happening before the second OT range [10AM-11AM]. So the Monitored Object or Business Process was continuously unavailable for the whole Operational Time.
Reported Outage:
Outage start: 8AM
Outage end: 11AM
Outage duration: 2h
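The cases in this section can be reproduced with a compact Python sketch. It is an illustration written for this page, not the algorithm shipped with the SLM module: times are expressed as minutes since midnight, the function names are hypothetical, and both input lists are assumed to be sorted and non-overlapping. The idea is to clip each Unavailability Period to the OT ranges and then merge neighbouring pieces when no Operational Time lies between them.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0 if they are disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def operational_time_between(start, end, ot_ranges):
    """Amount of Operational Time falling inside [start, end]."""
    return sum(overlap(start, end, s, e) for s, e in ot_ranges)


def outages(ot_ranges, unavailability_periods):
    """Return a list of (start, end, duration) tuples, all in minutes."""
    # 1. Clip every Unavailability Period to every Operational Time range.
    pieces = []
    for u_start, u_end in unavailability_periods:
        for o_start, o_end in ot_ranges:
            start, end = max(u_start, o_start), min(u_end, o_end)
            if start < end:
                pieces.append((start, end))
    pieces.sort()

    # 2. Merge pieces that are not separated by any available Operational Time.
    result = []
    for start, end in pieces:
        if result and operational_time_between(result[-1][1], start, ot_ranges) == 0:
            prev_start, _, prev_duration = result[-1]
            result[-1] = (prev_start, end, prev_duration + (end - start))
        else:
            result.append((start, end, end - start))
    return result


def hm(hours, minutes=0):
    """Minutes since midnight."""
    return hours * 60 + minutes


OT = [(hm(8), hm(9)), (hm(10), hm(11))]
across_ranges = outages(OT, [(hm(7, 30), hm(10, 30))])
uptime_during_ot = outages(OT, [(hm(7, 20), hm(8, 50)), (hm(9, 30), hm(10, 30))])
on_boundaries = outages(OT, [(hm(7, 20), hm(9)), (hm(10), hm(11, 50))])
```

With these inputs, `across_ranges` contains one outage of 90 minutes from 8AM to 10:30AM, `uptime_during_ot` contains two outages of 50 and 30 minutes, and `on_boundaries` contains a single outage of 2 hours. The duration only counts Operational Time, which is why an outage lasting from 8AM to 11AM is reported as 2h.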
Outages and Calculation Periods¶
The previous cases take into account Operational Time Ranges only: the interval between two consecutive Time Ranges is disregarded, so an Unavailability Period that spans multiple OT Ranges is considered as a single Outage.
In this section, we define how Outages are considered when Calculation Periods are taken into account.
The simplest case, when one Unavailability Period (and therefore the Outage) is entirely included in a Calculation Period, results in a single Outage reported in that Calculation Period.
However, when an Unavailability Period spans two Calculation Periods, the Outage will be split into two outages, one for the first Calculation Period and one for the second, even though Calculation Periods are by default contiguous time intervals. The same applies when multiple Calculation Periods are involved, as the following diagram shows. Here, CPs are Calculation Periods and OUT are the recorded Outages.
      Jan 2020   Feb        Mar        Apr       May      June
      |----------|----------|----------|---------|--------|
CPs   |----------|----------|----------|---------|--------|
UnPs    1-------------1         2----------------------2
OUT     |---O1---|-O2-|         |--O3--|---O4----|--O5-|
In this example, although there are only two UnPs, five Outages will be reported, one for each month in which the Monitored Object or Business Process was not available. In summary (note that the actual Duration in the report will be in HH:MM:SS; here we use days for simplicity), the first Unavailability Period results in:
Reported Outage (January):
Outage start: 7th of January
Outage end: 31st of January
Outage duration: 25 days
Reported Outage (February):
Outage start: 1st of February
Outage end: 14th of February
Outage duration: 14 days
while the second Unavailability Period results in:
Reported Outage (March):
Outage start: 10th of March
Outage end: 31st of March
Outage duration: 22 days
Reported Outage (April):
Outage start: 1st of April
Outage end: 30th of April
Outage duration: 30 days
Reported Outage (May):
Outage start: 1st of May
Outage end: 19th of May
Outage duration: 19 days
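The splitting of an Unavailability Period across monthly Calculation Periods can be sketched in Python as follows. This is a simplified illustration, not the SLM implementation: it works with whole days counted inclusively (as in the summary above) rather than HH:MM:SS, it assumes monthly Calculation Periods, and `split_by_month` is a hypothetical helper.

```python
from datetime import date, timedelta


def split_by_month(first_day, last_day):
    """Split an inclusive day span into one (start, end, days) piece per month."""
    pieces = []
    cursor = first_day
    while cursor <= last_day:
        # First day of the month following the cursor.
        next_month = date(cursor.year + cursor.month // 12, cursor.month % 12 + 1, 1)
        piece_end = min(last_day, next_month - timedelta(days=1))
        pieces.append((cursor, piece_end, (piece_end - cursor).days + 1))
        cursor = next_month
    return pieces


first_unp = split_by_month(date(2020, 1, 7), date(2020, 2, 14))
for start, end, days in first_unp:
    print(start, end, days)
```

For the first Unavailability Period (7th of January to 14th of February) this yields two pieces of 25 and 14 days, matching the January and February outages listed above.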
Event Adjustment¶
The Event Adjustment feature allows Service Owners and Service Level Managers to retroactively add events. An Event Adjustment is the action of adding to a monitored host or service an event, covering a given period of time, that did not actually take place. This proves useful in situations where, for instance, the Operational team has forgotten to schedule Downtime in advance: this oversight would lead to a wrong calculation of the availability, and adding an Event Adjustment that covers the unscheduled downtime fixes the problem. It is worth noting that Event Adjustments will affect the availability calculations for hosts and services, and potentially affect whether or not the service provider is satisfying the target SLA specified in the customer contract.
How It Works¶
The Event Adjustment feature is an extension of the SLM module, and can only be performed by a NetEye user with a certain privilege level. All inserted event adjustments are stored in a dedicated database table in order to ensure that the data (i.e., the existing timeline) cannot be manipulated. Event adjustments are taken into account during Availability calculations without any additional intervention.
Each adjustment must be applied to a monitored object. The following table indicates which types of event adjustments can be applied to each type of monitored object.
Monitored Object Type | Event Type |
---|---|
host | up, down, downtime |
service | ok, warning, critical, unknown, downtime |
Multiple event adjustments on the same host or service cannot overlap, except if the event type of one of the adjustments is downtime. If a new or modified event does overlap the time bounds of an existing adjustment, NetEye will report an error and the new event adjustment will not be processed.
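The two constraints above (allowed event types per object type, and no overlap unless a downtime is involved) can be expressed as a short Python sketch. It only illustrates the rules as stated here: the function name, the tuple layout and the treatment of adjustments that merely touch at a boundary (considered non-overlapping) are assumptions, not NetEye behaviour.

```python
ALLOWED_EVENT_TYPES = {
    "host": {"up", "down", "downtime"},
    "service": {"ok", "warning", "critical", "unknown", "downtime"},
}


def can_add_adjustment(object_type, new, existing):
    """Check a new (start, end, event_type) adjustment against existing ones.

    `existing` holds the adjustments already stored for the same object.
    """
    new_start, new_end, new_type = new
    if new_type not in ALLOWED_EVENT_TYPES[object_type]:
        return False
    for start, end, event_type in existing:
        overlaps = new_start < end and start < new_end
        # Overlap is tolerated only if one of the two adjustments is a downtime.
        if overlaps and "downtime" not in (new_type, event_type):
            return False
    return True


stored = [(15, 25, "ok")]
rejected = can_add_adjustment("service", (10, 20, "critical"), stored)
accepted = can_add_adjustment("service", (10, 20, "downtime"), stored)
```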
Resource¶
In order to configure a resource contract, you need to have defined a customer as described in the corresponding section above. You also need to have set up suitable dashboard(s) in Grafana that should be linked to the contract.
A user can define Resource Contracts for their resources to generate the SLM Resource reports. To add a new Resource Contract for SLM, go to SLM > Resource Contracts and enter appropriate values for the following options:
Name: The name of this contract
Description: A more user-friendly description of the contract
Customer: In the Customer tab you can set the customer whose analytical dashboard in Grafana will be included in the resource report. Customers appear in the dropdown if they have Analytics Module access and an assigned role in common with the SLM user.
Dashboard: An analytical dashboard (static) of a customer in Grafana
To prevent a user from creating very large resource reports, a restriction sets the maximum number of panels that can be included in the report. A user cannot add a new Resource Contract if the number of panels in the selected dashboard exceeds the limit.
This limit can be increased manually by updating the dashboard_panel_size value in the following SLM Module config file:
/neteye/shared/icingaweb2/conf/modules/slm/config.ini
Please be warned that increasing this limit will lead to a proportional decrease in performance.
Reporting¶
The Reporting module is the central component for creating reports over a specified time period and scheduling their generation.
The Reporting module integrates functionality from other modules, each of which lets you create reports of a certain type.
SLM: the SLM integration gives the user the possibility to generate reports based on Availability Contracts and on Resource Contracts, named SLM Reports and SLM Resource Reports respectively.
IDO: Icinga IDO Reports provides host and service availability reports based on the monitoring database (IDO). Users can create both Host SLA and Service SLA reports.
Access to the Reports list¶
The list of reports, visible under
, is filtered based on the user’s roles and permissions. The filtering logic applied by the SLM and the IDO modules is slightly different: let’s see in detail how the list is rendered for both categories.
SLM reports¶
Users with only Reporting General Module Access permissions (defined in
), will see a filtered list of SLM reports. The filter is based on the Customer selected in the report definition. A logged-in user will see only the reports of their own Customers (i.e., customers with the same or inherited role as the logged-in user).
NetEye admin and reporting admin users can see and create/modify all SLM reports.
IDO reports¶
Reports are filtered by tenants. This means that the logged-in user can see or create (based on the user’s permission defined in
) reports linked to their tenant.
Reports that are not linked to any tenant are visible to all users, regardless of the tenant they belong to.
Once a report is created, only the NetEye admin can modify the tenant it is associated with.