User Guide

Glossary

Availability

Availability is a key performance indicator for Service Level Agreement (SLA) contracts. It is a metric that measures how much a given monitored object is working as expected and is calculated with a procedure that involves the total time during which the object was actually available and the time that it was contractually expected to be available. The latter is also called Operational Time; the procedure for computing availability is called Availability Calculation.

Multiple factors influence both the actual time and expected time, which vary from contract to contract. See Creating SLA contract for more details.

The states that affect Availability, and as such count as non-available, are HARD DOWN for Host Objects, and HARD UNKNOWN and HARD CRITICAL for Service Objects.
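
As a rough illustration of the Availability Calculation, the sketch below assumes that availability is simply the share of the Operational Time that was not spent in a non-available state. The function name and the plain ratio are assumptions made for this example; the actual procedure depends on the SLA contract settings.

```python
def availability_percent(operational_seconds: float, unavailable_seconds: float) -> float:
    """Return availability as a percentage of the Operational Time.

    Assumption: availability = (expected time - non-available time) / expected time.
    """
    if operational_seconds <= 0:
        raise ValueError("operational time must be positive")
    # Clamp so that non-available time never exceeds the expected time.
    unavailable = min(max(unavailable_seconds, 0.0), operational_seconds)
    available = operational_seconds - unavailable
    return 100.0 * available / operational_seconds


# A 30-day month with 24x7 Operational Time and 3 hours spent in HARD DOWN.
month = 30 * 24 * 3600
print(round(availability_percent(month, 3 * 3600), 3))  # 99.583
```

Time spent in a scheduled Downtime would not be passed in as non-available time, since such periods count as available.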

Availability Report

An Availability Report contains a list of the hosts and services that are subject to an SLA contract, together with the availability percentage of each one (computed by the Availability Calculation).

Calculation Period

A Calculation Period is used in the Service Level Management module and is the unit of time over which the data will be aggregated into service level reports like an Availability Report. For example, if the time span of the report is one year, you might want the Calculation Period to be a Month or a Week, depending on the required granularity.
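
The one-year example can be sketched as follows: a report spanning a calendar year is split into twelve month-long Calculation Periods. The helper name and the half-open (start, end) representation are assumptions made for illustration only.

```python
from datetime import date


def monthly_periods(year: int) -> list[tuple[date, date]]:
    """Split a calendar year into month-long periods as (start, end) pairs.

    The end date is exclusive, so consecutive periods share a boundary.
    """
    periods = []
    for month in range(1, 13):
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        periods.append((start, end))
    return periods


for start, end in monthly_periods(2024)[:2]:
    print(start, "->", end)
```

Choosing a Week instead of a Month would give finer granularity at the cost of more rows in the report.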

Event Adjustment

An Event Adjustment is a retroactive modification of the event history of a monitored object. The event timeline and the actual events that took place are not altered in any way; rather, Event Adjustments sit on a separate layer on top of the timeline. They are applied manually when check results were temporarily wrong (e.g., because of a faulty check command) or when an undesired outage happened due to an incorrectly scheduled downtime.

Event Adjustments therefore do not directly alter the original event history: the original timeline, together with all its events, can always be reconstructed.

Note that an Event Adjustment does not influence a Downtime: a monitored object in downtime during a given period is always considered as available, regardless of any Event Adjustment defined on the same period.
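
The precedence described above can be sketched as a small decision function. This is a simplified model, not the product's implementation: the function and parameter names are hypothetical, and an adjustment is reduced to an optional boolean that overrides the recorded availability.

```python
from typing import Optional


def effective_availability(
    original_available: bool,
    adjusted_available: Optional[bool],
    in_downtime: bool,
) -> bool:
    """Combine the recorded state, an optional adjustment, and downtime.

    The original value is only read, never modified, mirroring the idea
    that adjustments live on a separate layer above the timeline.
    """
    if in_downtime:
        # Downtime always counts as available, whatever the adjustment says.
        return True
    if adjusted_available is not None:
        # An adjustment overrides the recorded check result.
        return adjusted_available
    return original_available


print(effective_availability(False, True, False))  # True
print(effective_availability(True, False, True))   # True
```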

Operational Time

The Operational Time corresponds to the expected time when determining the Availability of a Monitored Object and is defined as the sum of all the Ranges during which a Monitored Object must be properly working according to an SLA. While the Monitored Object is usually a single host or service, it can be a more complex object like a Business Process; in this case, the Operational Time refers to all its components.

Outage

An Outage is a period of time during which a Monitored Object or Business Process is not available, usually due to an unforeseen event. An Outage starts when an object enters a state of non-availability and ends when it becomes operational again.

Each Outage has a Duration, which is the total amount of time during which the Monitored Object or Business Process was not available.

Resource Contract

A Resource Contract is stipulated between an SLM customer and their service provider. The service provider uses NetEye to monitor the consumption of resources and to report it in either PDF or HTML format to the SLM customer. Resources can relate to different monitoring objects, such as CPU, RAM, Storage, or Network.

Service Level Agreement

A Service Level Agreement is a contractual commitment between a service provider and a client defining particular quantitative aspects of a service. It may specify the details of various metrics, thresholds, etc. such as:

* Quality
* Availability
* Responsibilities

In NetEye, particularly in the Service Level Management module, a single Service Level Agreement can be modeled as an SLA contract.

Service Level Management

Service Level Management is the practice (including methods and tools) of ensuring that monitored objects meet their target service levels. It defines the basic units and metrics necessary for creating, measuring, accepting and documenting Service Level Agreements.

Service Level Manager

A Service Level Manager, as defined by ITIL, is a role that engages in the day-to-day management of a Service Level Agreement, with tasks that include documenting requirements, negotiating service levels and targets, conducting reviews, and ensuring that review results are acted upon and provided to customers.

Service Owner

A Service Owner, as defined by ITIL, is the role that is accountable for the delivery of a specific IT service. The Service Owner is responsible for the service management of that service within the organization, typically controls its funding, and acts as the representative and spokesperson of the service throughout the organization.

Target Availability

Target Availability is the agreed-upon minimum time of guaranteed availability as specified contractually in a Service Level Agreement. It is typically expressed as a percentage, such as 99.5%.
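
To make a percentage such as 99.5% more tangible, the sketch below converts a Target Availability into the maximum non-available time it tolerates over a given Operational Time. The function name is hypothetical and the arithmetic assumes a plain percentage of the Operational Time.

```python
def allowed_unavailability_seconds(target_percent: float, operational_seconds: float) -> float:
    """Return how many seconds of non-availability a target still tolerates."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target must be between 0 and 100")
    return operational_seconds * (100.0 - target_percent) / 100.0


# 99.5% over a 30-day month of 24x7 Operational Time.
month = 30 * 24 * 3600
budget = allowed_unavailability_seconds(99.5, month)
print(budget / 3600, "hours")  # 3.6 hours
```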

Time Frame

Within the Icinga 2 Reporting module, a Time Frame specifies the starting and ending times for a given report and is a (positive) integer multiple of the Calculation Period.

In the SLM API, Time Frames are represented as time_range objects consisting of from and to keys, with the Unix Time as value.
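
A minimal sketch of building such an object is shown below. Only the from and to keys come from the description above; the helper name, the use of whole seconds, and the absence of any surrounding wrapper are assumptions that should be checked against the SLM API reference.

```python
from datetime import datetime, timezone


def make_time_range(start: datetime, end: datetime) -> dict:
    """Build a time_range-like dict with Unix timestamps (assumed to be seconds)."""
    if end <= start:
        raise ValueError("end must be later than start")
    return {"from": int(start.timestamp()), "to": int(end.timestamp())}


january = make_time_range(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 2, 1, tzinfo=timezone.utc),
)
print(january)  # {'from': 1704067200, 'to': 1706745600}
```

Using timezone-aware datetimes avoids surprises caused by the local timezone of the machine that builds the request.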

Unavailability Period

An Unavailability Period is an interval of time during which a Monitored Object or a Business Process is in a non-available state due to an unforeseen event, i.e., it should have been available but was not, and no Downtime was scheduled.

An Unavailability Period that occurs during an Operational Time Range is called an Outage.
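
The relation between the two terms can be sketched as an interval intersection. The example assumes that an Unavailability Period only partially covered by an Operational Time Range is clipped to the overlapping part; this clipping rule and all names are assumptions made for illustration.

```python
Interval = tuple[int, int]  # (start, end) timestamps, end exclusive


def outages(unavailability: list[Interval], operational: list[Interval]) -> list[Interval]:
    """Return the parts of the unavailability periods that fall inside
    operational time ranges."""
    result = []
    for u_start, u_end in unavailability:
        for o_start, o_end in operational:
            start, end = max(u_start, o_start), min(u_end, o_end)
            if start < end:  # keep only non-empty overlaps
                result.append((start, end))
    return sorted(result)


# Seconds since midnight: Operational Time 08:00-18:00,
# unavailable 07:00-09:00 and 19:00-20:00.
operational = [(8 * 3600, 18 * 3600)]
unavailable = [(7 * 3600, 9 * 3600), (19 * 3600, 20 * 3600)]
print(outages(unavailable, operational))  # [(28800, 32400)]
```

Only the hour between 08:00 and 09:00 counts as an Outage here; the evening period lies outside the Operational Time.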

Average Availability

The Average Availability measures the availability of a set of hosts/services as the average of the availability values of all hosts/services in that set.

Business Process

A Business Process is a high-level logical service that is composed of multiple monitored objects (and potentially other smaller business processes) interrelated by logical operations. The state of this logical service is calculated by substituting the status of each individual monitored object into the business process’ logical expression.

By treating a logical service as if it were a monitored object, you can calculate its availability, create more complex check commands, and set up Grafana dashboards based on them.
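
The substitution of individual states into a logical expression can be sketched as follows. This is a deliberately simplified model: states are reduced to available/not available, only AND and OR are supported, and the nested-tuple notation is hypothetical rather than the syntax used by the Business Process module.

```python
def evaluate(node, available: dict[str, bool]) -> bool:
    """Evaluate a business process expression.

    A node is either the name of a monitored object or a tuple
    (operator, child, child, ...), which allows nested processes.
    """
    if isinstance(node, str):
        return available[node]
    operator, *children = node
    results = [evaluate(child, available) for child in children]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    raise ValueError(f"unsupported operator: {operator}")


# The shop works if the database is up and at least one web server is up.
web_shop = ("AND", "database", ("OR", "web-1", "web-2"))
states = {"database": True, "web-1": False, "web-2": True}
print(evaluate(web_shop, states))  # True
```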

Downtime

In the context of monitoring within NetEye, Downtime is a scheduled period of time when a monitored object is intentionally either not available or will not perform its expected function, but which should nonetheless be considered as available.

Downtime is typically planned and announced in advance and is meant for periods of maintenance such as software or hardware upgrades.

For further reference, please consult Icinga’s Downtime documentation.

Event

In the context of monitoring within NetEye, an Event refers to one of multiple possible event types, as declared by Icinga 2. The most common type is the state change event caused by a host or service check result that differs from a previous check result. An Event has a single timestamp; it is not a duration.

The types of events currently defined on NetEye are:

* State Change: A host or service has changed from one state to another, e.g., from OK to CRITICAL.
* Downtime: The host or service is scheduled to be down.
* Flapping: A host or service is continually alternating between two states, e.g., UP and DOWN.
* Comment: A NetEye user flagged a point in time with a written note.
* Notification: NetEye sends an alert, e.g., an email to a system administrator.

Host State

The Host State is the reported state of a monitored host object in Icinga 2 at any point in time.

As defined by Icinga, hosts can be in any one of the following states:

Name    Description
UP      The host is available
DOWN    The host is not available

Monitored Object

A Monitored Object is a host or service configured with a check command that can be checked either regularly and automatically (active check) or whenever requested (passive check).

Monitoring Filter

A Monitoring Filter, or Filtering Expression, is a logical expression used to select a subset of monitored objects.

For example, the following filter expression will select all hosts whose name begins with the string “server”:

host_name=server*

Service State

The Service State is the reported state of a monitored service object in Icinga 2 at any point in time.

As defined by Icinga, services can be in any one of the following states:

Name        Description
OK          The service is working properly
WARNING     The service is experiencing some problems but is still considered to be in working condition
CRITICAL    The service is in a critical state
UNKNOWN     The check could not determine the service’s state

State Change

A State Change is one type of monitoring event where a host changes from one Host State to another (e.g., from UP to DOWN) or a service changes from one Service State to another (e.g., from WARNING to OK).
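
The idea can be sketched by scanning a sequence of check results and recording an entry whenever the state differs from the previous one. The data layout and function name are assumptions, and the sketch ignores the distinction between soft and hard states.

```python
def state_changes(check_results):
    """Return (timestamp, old_state, new_state) for every change.

    check_results is a list of (timestamp, state) tuples ordered by time.
    """
    changes = []
    previous = None
    for timestamp, state in check_results:
        if previous is not None and state != previous:
            changes.append((timestamp, previous, state))
        previous = state
    return changes


results = [(0, "OK"), (60, "OK"), (120, "WARNING"), (180, "CRITICAL"), (240, "CRITICAL"), (300, "OK")]
for change in state_changes(results):
    print(change)
```

Each recorded change carries a single timestamp, in line with the definition of an Event as a point in time rather than a duration.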

Time Period

Within Icinga Director, a Time Period is a set of Ranges that together specify exactly when a monitored object should be available. Its length is related to the Calculation Period and is used to specify the Operational Time in an SLA Type for use in a Report from SLM data.

In the SLM API, Time Periods are represented as time_period objects, while in the configuration files they are stored as TimePeriods. The alternative spelling of Timeperiod is also employed in the GUI.

Icinga 2 provides documentation for the Timeperiod object as an overview within the Object Types and in detail under the Advanced Topics.

Range (Time Range)

Within Icinga Director, a Range is the definition of a single unit of contiguous time; multiple Ranges can be used as the building blocks of one Time Period. For example, Time Period 24x5 can be defined as the union of [TimeRange Monday 00:00-24:00, TimeRange Tuesday 00:00-24:00, TimeRange Wednesday 00:00-24:00, TimeRange Thursday 00:00-24:00, TimeRange Friday 00:00-24:00].

In the SLM API, Ranges are represented as a ranges object within a time_period.
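
The 24x5 example can be modelled in a few lines to check whether a given moment falls inside the Time Period. The in-memory layout used here (weekday name mapped to a list of minute ranges) is a hypothetical representation chosen for the sketch, not the format of the ranges object in the SLM API.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# 24x5: one full-day Range (minutes 0 to 1440) for each working day.
TIME_PERIOD_24X5 = {day: [(0, 24 * 60)] for day in WEEKDAYS[:5]}


def in_time_period(moment: datetime, period: dict) -> bool:
    """Return True if the moment lies within any Range of the Time Period."""
    day = WEEKDAYS[moment.weekday()]  # weekday() returns 0 for Monday
    minute = moment.hour * 60 + moment.minute
    return any(start <= minute < end for start, end in period.get(day, []))


print(in_time_period(datetime(2024, 1, 3, 12, 0), TIME_PERIOD_24X5))  # Wednesday: True
print(in_time_period(datetime(2024, 1, 6, 12, 0), TIME_PERIOD_24X5))  # Saturday: False
```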

NetEye Components

A NetEye Component is a software module that extends the functionalities of NetEye Core. There are three categories of NetEye Components: NetEye Feature Modules, Preview Software, and Beta Software. You can refer to Section Additional NetEye Components for detailed information.

NetEye Feature Modules

NetEye Feature Modules are NetEye components that perform very specific functions and that can be installed on top of NetEye Core, thanks to its modular architecture. Unlike Preview Software, NetEye modules are officially supported; each module has its own, distinct contract, and can be quickly installed on demand. To learn how, see Section How to install one additional module.

Preview Software

A Preview Software is a new NetEye module that has been developed and integrated into NetEye, but it is not yet in its final form. A Preview Software can be used AS IS, but not all functionalities are guaranteed to be stable and you should expect significant changes in the future. Feedback on Preview Software is always appreciated! Each module can be easily installed on demand with just a few commands: to learn how, check How to install one additional module.

Beta Software

A Beta Software is a NetEye component that is usually a single package that can be installed at will. A Beta Software is part of a module, but in some cases it can also be standalone software that provides a specific functionality which is not yet part of NetEye. Multiple versions of the same Beta Software can coexist in the repository and, as with Preview Software, Beta Software is provided AS IS, with no guarantee of stability and maturity.

NetEye Single Purpose Node

A NetEye Single Purpose Node is a specialized Node in a NetEye Cluster. There are currently two types of Single Purpose Nodes: Elastic-only nodes, which are also marked as (E), and Voting-only nodes, which are also marked as (V). (E) nodes host the DB component of the Elastic stack, while (V) nodes are used by a NetEye Cluster as quorum devices.

Fencing

In a cluster, fencing is the process of recovering services and resources running on a disconnected node by shutting it down and redistributing them on the other nodes. Fencing prevents data loss and maintains data integrity across the cluster.

Multi-tenancy

A type of system architecture that allows a single NetEye instance to monitor several business units in isolated environments.

NetEye Single Node

A NetEye setup type that is aimed at small environments that require limited resources.

NetEye Cluster

A setup type in which a system runs on a combination of various types of nodes: operating nodes, Elastic-only nodes, and Voting-only nodes. The NetEye clustering service is based on a stack of software (Corosync, Pacemaker, and DRBD), although some NetEye services rely on their own clustering technologies.

Failover cluster

A principle of organizing nodes in a cluster that avoids downtime or service disruption whenever one node in the cluster goes offline, by moving its services to another node.

Intracluster security

It refers to the secure communication between the nodes in a cluster, granted by certificates signed by a Certificate Authority.

NetEye Core

It is the set of most commonly used functionalities offered by NetEye, including monitoring, visualization (with dashboards and maps), configuration, reporting, and event handling.

NetEye services

They are a selection of software, provided in packages, used to perform functionalities either within NetEye Core or NetEye Component(s).

Master

The Master is a NetEye instance that holds configuration files, receives data directly from the clients or via Satellites and processes the data to carry out actions based on the customer business needs.

Satellite

An intermediary node between monitored objects and the Master. A Satellite receives the configuration from its parent node (Master), collects data from hosts/services, and forwards the data to the Master to be processed. A Satellite can also execute checks on its own and pass the results to the Master afterwards.

Telegraf

An agent that collects, processes, and writes metrics to InfluxDB. The Telegraf agent sends data to a Satellite or to the Master. A Telegraf instance is installed by default as part of NetEye Core. A special telegraf package should be installed on external hosts.

Client

A Client is a node that receives configurations from the Master directly or through a Satellite, implements them, runs checks, and reports back the results. Some clients, also known as agents, must be configured manually in order to send data to a Satellite or to the Master, as in the case of a Telegraf agent.

Shutdown Command

A Shutdown Command defines all the actions that have to be executed by the Shutdown Manager in order to power down a host. Each Shutdown Command can contain variables that will be replaced on the Shutdown Host.

Shutdown Definition

A Shutdown Definition is the specification describing groups of hosts that should be shut down when a specified condition on a host or a service is met, and the order in which those groups should be shut down.

Shutdown Group

A Shutdown Group contains a list of hosts which should all be shut down at the same time. Shutdown Groups within the same Shutdown Definition can be given a relative ordering to determine which groups are processed before others.

Shutdown Host

A Shutdown Host is a single host on which a Shutdown Command is executed. A Shutdown Host must have a Shutdown Command defined and can be part of one or more Shutdown Groups; if it belongs to several groups, the host is shut down when the first group it is part of is processed. In the subsequent groups, the host is simply ignored.
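
The rule that a host is shut down with the first group it belongs to and ignored afterwards can be sketched as follows. The data layout (an order number paired with a list of host names) and the function name are hypothetical and only illustrate the ordering logic, not the Shutdown Manager itself.

```python
def shutdown_plan(groups):
    """Return the hosts to shut down per group, in processing order.

    groups is a list of (order, [host names]) pairs; a host appearing in
    several groups is kept only in the first group that gets processed.
    """
    seen = set()
    plan = []
    for _, hosts in sorted(groups, key=lambda group: group[0]):
        batch = []
        for host in hosts:
            if host not in seen:
                seen.add(host)
                batch.append(host)
        plan.append(batch)
    return plan


groups = [(2, ["db-1", "app-1"]), (1, ["app-1", "app-2"])]
print(shutdown_plan(groups))  # [['app-1', 'app-2'], ['db-1']]
```

Here app-1 is shut down with the first group and skipped when the second group is processed.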