Concepts¶
Architecture¶
General Elasticsearch Cluster Information¶
In order to avoid excessive, useless network traffic generated when the cluster reallocates shards across cluster nodes after you restart an Elasticsearch instance, NetEye employs systemd post-start and pre-stop scripts to automatically enable and disable shard allocation properly on the current node whenever the Elasticsearch service is started or stopped by systemctl.
Details on how shard allocation works can be found here.
Note
By starting a stopped Elasticsearch instance, shard allocation will be enabled globally for the entire cluster. So if you have more than one Elasticsearch instance down, shards will be reallocated in order to prevent data loss.
- Therefore best practice is to:
Never keep an Elasticsearch instance stopped on purpose. Stop it only for maintenance reasons (e.g. for restarting the server) and start it up again as soon as possible.
Restart or stop/start one Elasticsearch node at a time. If something bad happens and multiple Elasticsearch nodes go down, then start them all up again together.
Elastic Only Nodes¶
From Neteye 4.9 is possible to install Elastic-only nodes in order to improve elasticsearch performance without a full Neteye installation.
To create an elastic only node you have to create an entry of type
ElasticOnlyNodes in the file /etc/neteye-cluster
as in the following
example. Syntax is the same used for standard Node
{ "ElasticOnlyNodes": [
{
"addr" : "192.168.47.3",
"hostname" : "neteye03.neteyelocal",
"hostname_ext" : "rdneteye03.si.wp.lan"
}
]
}
Voting Only Nodes¶
From Neteye 4.16 is possible to install Voting-only nodes in order to add a node dedicated only to provide quorum. If SIEM module is installed this node provides a voting-only functionalities also to Elasticsearch cluster.
This functionality is achieved configuring the node as voting-only master-eligible node specifying the
variable ES_NODE_ROLES="master, voting_only"
in the sysconfig file
/neteye/local/elasticsearch/conf/sysconfig/elasticsearch-voting-only
.
Voting-only node is defined in /etc/neteye-cluster
as in the following example
{ "VotingOnlyNode": {
"addr" : "192.168.47.3",
"hostname" : "neteye03.neteyelocal",
"hostname_ext" : "rdneteye03.si.wp.lan",
"id" : 3
}
}
Please note that VotingOnlyNode is a json obect and not an array because you can have a single Voting-only node in a NetEye cluster.
Elasticsearch Clusters¶
Design and Configuration¶
With NetEye 4 we recommend that you use at least 3 nodes to form an Elasticsearch cluster. If nevertheless you decide you need a 2-node cluster, our next recommendation is that you talk with a Würth Phoenix NetEye Solution Architect who can fully explain the risks in your specific environment and help you develop strategies that can mitigate those risks.
Elasticsearch coordination
subsystem
is in charge to choose which nodes can form a quorum (note that all
NetEye cluster nodes are master eligible by default). If Log Manager is
installed, the neteye_secure_install
script will properly set
seed_hosts and initial_master_nodes according to Elasticsearch’s
recommendations and no manual intervention is required.
neteye_secure_install
will set two options to configure cluster
discovery:
discovery.seed_hosts: ["host1", "host2", "host3"]
cluster.initial_master_nodes: ["node1"]
Please note that the value for initial_master_nodes will be set only on the first installed node of the cluster (it is optional on other nodes and if set it must be the same for all nodes in the cluster). Option seed_hosts will be set on all cluster nodes, included Elastic Only nodes, and will have the same value on all nodes.
Elasticsearch reverse proxy¶
Starting with NetEye 4.13, NGINX has been added to NetEye. NGINX acts as a reverse proxy, by exposing a single endpoint and acting as a load-balancer, to distribute incoming requests across all nodes and, in this case, to all Elasticsearch instances. This solution improves the overall performance and reliability of the cluster.
The elasticsearch endpoint is reachable at URI https://elasticsearch.neteyelocal:9200/. Please note that this is the same port used before so no additional change is required; old certificates used for elastic are still valid with the new configuration.
All services connected elastic stack services like Kibana, Logstash and Filebeat have been updated in order to reflect this improvement and to take advantages of the new load balancing feature.
Elasticsearch Only Nodes¶
Large NetEye installations, especially those running the SIEM feature module, will often have multi-node clusters where some nodes are only running Elasticsearch. An Elasticsearch Only node can be setup with a procedure similar to NetEye nodes.
An Elasticsearch Only node does not belong to the RedHat cluster so that it runs a local master-eligible Elasticsearch service, connected to all other Elasticsearch nodes.
El Proxy¶
El Proxy (also called Elastic Blockchain Proxy) allows a secure live signature of log streams from Logstash to Elasticsearch.
It provides protection against data tampering by transforming an input stream of plain logs into a secured blockchain where each log is cryptographically signed.
Architecture¶
From an high level point of view, three are the main components of the architecture:
The first component is Logstash, which collects logs from various sources and sends them to the El Proxy using the
json_batch
format of Elastic’s http-output plugin.Note
Due to the fact that the El Proxy does not provide persistence, Logstash should always be configured to take care of the persistence of the involved logs pipelines.
The second component is El Proxy itself, which receives batches of logs from Logstash, signs every log with a cryptographic key used only once, and, finally, forwards the signed logs to the Elasticsearch Bulk API;
The third component is Elasticsearch, which acquires the signed logs from El Proxy and persists them on the dedicated index.
How the El Proxy works¶
REST Endpoints¶
The El Proxy receives data from REST endpoints:
log endpoint:
description: Receives and processes a single log message in JSON format
path : /api/log
method: POST
request type: JSON
request example:
{ "entry": "A log message" }
response: The HTTP Status code 200 is used to signal a successful processing. Other HTTP status codes indicate a failure.
log_batch endpoint:
description: Receives and processes an array of log messages in JSON format
path : /api/log_batch
method: POST
request type: JSON
request example:
[ { "entry": "A log message", "other": "Additional values...", "EBP_METADATA": { "agent": { "type": "auditbeat", "version": "7.10.1" }, "customer": "neteye", "retention": "six_months", "blockchain_tag": "0", "event": { "module": "elproxysigned" } } }, { "entry1": "Another log message", "entry2": "Another log message", "EBP_METADATA": { "agent": { "type": "auditbeat", "version": "7.10.1" }, "customer": "neteye", "retention": "six_months", "blockchain_tag": "0", "event": { "module": "elproxysigned" } } }, { "entry": "Again, another message", "EBP_METADATA": { "agent": { "type": "auditbeat", "version": "7.10.1" }, "customer": "neteye", "retention": "six_months", "blockchain_tag": "0", "event": { "module": "elproxysigned" } } } ]
response: The HTTP Status code 200 is used to signal a successful processing. Other HTTP status codes indicate a failure.
DeadLetterQueue status endpoint:
description: Returns the status of the Dead Letter Queue. The status contains information on whether the DLQ is empty or not.
path : /api/v1/status/dlq
method: GET
response: The status of the DLQ in JSON format.
response example:
{ "empty": true }
KeysBackup status endpoint:
description: Returns the status of the keys backup. The status contains information on whether any key backup is present in the
{data_backup_dir}
folder.path : /api/v1/status/keys_backup
method: GET
response: The status of the Keys Backup in JSON format.
response example:
{ "empty": true }
Log Signature Flow - Generation of Signature Keys¶
The El Proxy achieves secure logging by authentically encrypting each log record with an individual cryptographic key used only once and protects the integrity of the whole log archive by a cryptographic authentication code.
The encryption keys are generated on-the-fly when needed and saved on
the filesystem in the {data_dir}
folder. Each key is bound to a
specific customer, module, retention policy and blockchain tag; they
are saved in the filesystem with the following naming convention:
{data_dir}/{customer}/{module}-{customer}-{retention_policy}-{blockchain_tag}_key.json
When a new key is generated, a copy of the key file is created in the {data_backup_dir}
folder
with the same naming convention.
As soon as a new key file is created in the {data_backup_dir}
folder, the Icinga2 service
logmanager-blockchain-keys-neteyelocal will enter in CRITICAL state, indicating
that a new key has been generated on the system. The new key must be moved in a safe place such
as a password manager or a restricted access storage.
The content of the key file is in the form:
{
"key": "initial_key",
"iteration": 0
}
Where:
key
is the encryption key to be used to sign the next incoming logiteration
: is the iteration number for which the signature key is valid.
Every time a log is signed, a new pair key/iteration is generated starting from the last one. The new pair will have the following values:
key
equals to the SHA256 hash of the previous keyiteration
equals to the previous iteration incremented by one
For example, if the key at iteration 10 is:
{
"key": "abcdefghilmno",
"iteration": 10
}
the next key will be:
{
"key": "d1bf0c925ec44e073f18df0d70857be56578f43f6c150f119e931e85a3ae5cb4",
"iteration": 11
}
This mechanism creates a blockchain of keys that cannot be altered without breaking the validity of the entire chain.
In addition, every time a set of logs is successfully sent to Elasticsearch, the specific key file is updated with the new value. Consequently, every reference to the previous keys is removed from the system making it impossible to recover and reuse old keys.
However, in case of unmanageable Elasticsearch errors, the El Proxy will reply with an error message to Logstash and will reuse the keys of the failed call for the next incoming logs.
Note
To be valid, the iteration
values of signed logs in
Elasticsearch should be incremental with no missing or duplicated
values.
When the first log is received after its startup, the El Proxy calls
Elasticsearch to query for the last indexed log iteration
value to
determine the correct iteration
number for the next log. If the
last log iteration
value returned from Elasticsearch is greater
than the value stored in the key file, the El Proxy will fail to
process the log.
Log Signature Flow - How are signature keys used¶
For each incoming log, the El Proxy retrieves the first available encryption key, as described in the previous section, and then uses it to calculate the HMAC-SHA256 hash of the log.
The calculation of the HMAC hash takes into account:
the log itself as received from Logstash
the iteration number
the timestamp
the hash of the previous log
At this point, the signed log is a simple JSON object composed by the following fields:
All fields of the original log : all fields from the original log message
- ES_BLOCKCHAIN: an object containing all the El Proxy’s calculated values. They are:
fields: fields of the original log used by the signature process
hash: the hmac hash calculate as described before
previous_hash: the hmac hash of the previous log message
iteration: the iteration number of the signature key
timestamp_ms: the signature epoch timestamp in milliseconds
For example, given this key:
{
"key": "d1bf0c925ec44e073f18df0d70857be56578f43f6c150f119e931e85a3ae5cb4",
"iteration": 11
}
when this log is received:
{
"value": "A log message",
"origin": "linux-apache2",
"EBP_METADATA": {
"agent": {
"type": "auditbeat",
"version": "7.10.1"
},
"customer": "neteye",
"retention": "six_months",
"blockchain_tag": "0",
"event": {
"module": "elproxysigned"
}
}
}
then this signed log will be generated:
{
"value": "A log message",
"origin": "linux-apache2",
"EBP_METADATA": {
"agent": {
"type": "auditbeat",
"version": "7.10.1"
},
"customer": "neteye",
"retention": "six_months",
"blockchain_tag": "0",
"event": {
"module": "elproxysigned"
}
},
"ES_BLOCKCHAIN": {
"fields": {
"value": "A log message",
"origin": "linux-apache2"
},
"hash": "HASH_OF_THE_CURRENT_LOG",
"previous_hash": "HASH_OF_THE_PREVIOUS_LOG",
"iteration": 11,
"timestamp_ms": 123456789
}
}
Log Signature Flow - How the index name is determined¶
The name of the Elasticsearch index for the signed logs is determined by the content of
the EBP_METADATA
field of the incoming Log.
The index name has the following structure:
{EBP_METADATA.agent.type}-{EBP_METADATA.agent.version}-{EBP_METADATA.event.module}-{EBP_METADATA.customer}-{EBP_METADATA.retention}-{EBP_METADATA.blockchain_tag}-YYYY.MM.DD
The following rules and constraints are valid:
- All of these fields are mandatory:
EBP_METADATA.agent.type
EBP_METADATA.customer
EBP_METADATA.retention
EBP_METADATA.blockchain_tag
The
YYYY.MM.DD
part of the index name is based on the epoch timestamp of the signatureIf the
{EBP_METADATA.event.module}
is not present, El Proxy will use by defaultelproxysigned
For example, given this log is received on 23 March, 2021:
{
"value": "A log message",
"origin": "linux-apache2",
"EBP_METADATA": {
"agent": {
"type": "auditbeat",
"version": "7.10.1"
},
"customer": "neteye",
"retention": "six_months",
"blockchain_tag": "0",
"event": {
"module": "elproxysigned"
}
}
}
Then the inferred index name is: auditbeat-7.10.1-elproxysigned-neteye-six_months-0-2021.03.23
As a consequence of the default values and of the default Logstash configuration,
most of the indexes created by El Proxy will have elproxysigned
in the name.
Consequently, special care should be applied when manipulating those indexes and documents;
in particular, the user must not delete or rename *-elproxysigned-*
indices manually nor alter
the content of ES_BLOCKCHAIN
or EBP_METADATA
fields as any change could lead to a broken
blockchain.
Sequential logs processing¶
An important aspect to bear in mind of is that the log requests for the same blockchain are always processed sequentially by El Proxy. This means that, when a batch of logs is received from Logstash, it is queued in an in-memory queue and it will be processed only when all the previously received requests are completed.
This behavior is required to assure that the blockchain is kept coherent
with no holes in the iteration
sequence.
Nevertheless, as no parallel processing is possible for a single blockchain, this puts some hard limits on the maximum throughput reachable.
Error Handling¶
El Proxy implements two different recovery strategies in case of errors. The first one is a simple retry strategy used in case of unrecoverable communication errors with Elasticsearch; the second one, instead, is used in case Elasticsearch has issues with processing some of the sent logs and aims at addressing a widely known issue named backpressure.
Figure Fig. 147 shows how El Proxy integrates the two aforementioned strategies, which will be thoroughly explained in the next paragraphs.
Backpressure explained
In the context of data pipelines, backpressure represents the situation where a resistance prevents a fast processing of the input data. In the long term this resistance causes the overload of the system, because inside a certain component of the system data arrive at a higher rate with respect to the output rate.
In the context of the El Proxy, the resistance is represented by Elasticsearch refusing to index some specific logs because, for example, some unexpected fields inside a log lead to a failure in the Elasticsearch pipelines. This would cause El Proxy and Logstash to keep trying to index these logs without ever succeeding.
The way El Proxy overcomes the backpressure is by:
trying to modify the failing logs to avoid the indexing error
writing these logs into a Dead Letter Queue, as explained in the Strategy two section
By performing these actions El Proxy can quickly process also the logs which cannot be indexed in Elasticsearch and so it guarantees that these logs will not cause the backpressure.
Strategy one: Bulk retry¶
El Proxy implements an optional retry strategy to handle communication errors with Elasticsearch; when enabled (see the Configuration section), whenever a generic error is returned by Elasticsearch, El Proxy will retry for a fixed amount of times to resubmit the same signed_logs to Elasticsearch.
This strategy permits to deal, for example, with temporary networking issues without forcing Logstash to resubmit the logs for a new from-scratch processing.
Nevertheless, while this can be useful in dealing with a large set of use cases, it should also be used very carefully. In fact, due to the completely sequential nature of the blockchain processing, a too high number of retries could lead to an ever growing queue of logs waiting to be processed while El Proxy is busy with processing over and over again the same failed logs.
In conclusion, whether it is better to let El Proxy fail fast or retry more times is a decision that needs a careful, case-by-case analysis.
Strategy two: Single log reprocessing and Dead Letter Queue¶
In some cases, Elasticsearch can correctly process the log bulk request but it can fail to index some of the contained logs. In this situation, El Proxy will reprocess only those failed logs, removing all the fields not required by the signature process and sending them back again to Elasticsearch.
This strategy is based on the assumption that the indexing of the specific log fails due to incompatible operations requested by a pipeline. For example, a pipeline could attempt to lowercase a non-string field causing the failure. Consequently, resubmitting the log without the problematic fields could lead to a successful indexing.
If even this attempt fails, then El Proxy will dump the failed logs to a Dead Letter Queue on the filesystem and it will send an ok response to Logstash. Usually, when this happens, the blockchain will be in an incoherent state caused by holes in the iteration numeric progression. There is no automatic way to recover from this state. The administrator is in charge of investigating the issue and manually acknowledge the problem.
As mentioned before, the Dead Letter Queue is saved on the filesystem inside the
{dlq_dir}
folder as set in the configuration file.
The Dead Letter Queue consists of a set of text files in Newline delimited JSON format
with each row in a file representing a failed log.
These files are grouped by customer and index name following the naming convention:
{dlq_dir}/{customer}/{index_name}.ndjson
where {index_name}
is the name of the Elasticsearch index where the failed logs
were supposed to be written.
Each entry in the Dead Letter Queue file contains the original log whose indexation failed and, if present, the Elasticsearch error that caused the failure. For example:
{
"document": {
"value": "A log message",
"origin": "linux-apache2",
"EBP_METADATA": {
"agent": {
"type": "auditbeat",
"version": "7.10.1"
},
"customer": "neteye",
"retention": "six_months",
"blockchain_tag": "0",
"event": {
"module": "elproxysigned"
}
},
"ES_BLOCKCHAIN": {
"fields": {
"value": "A log message",
"origin": "linux-apache2"
},
"hash": "HASH_OF_THE_CURRENT_LOG",
"previous_hash": "HASH_OF_THE_PREVIOUS_LOG",
"iteration": 11,
"timestamp_ms": 123456789
}
},
"el_error": {
"reason": "field [string_field] of type [java.lang.Integer] cannot be cast to [java.lang.String]",
"type": "illegal_argument_exception"
}
}
Example Scenarios¶
As a running example, let’s imagine sending the following authentication event to El Proxy:
{
"host": {
"name": "myhost.wp.lan",
"ip": "172.17.0.2"
},
"event": {
"category": "authentication"
}
}
To help you understand how the El Proxy architecture works together with Logstash and Elasticsearch, please have a look at the following scenarios.
In each scenario, the event is sent from Logstash to Elasticsearch through El Proxy.
Scenario 1¶
In our first scenario, no particular error happens during the signing process, so El Proxy signs the event, adds a new block to an existing blockchain, or creates a new chain from scratch if needed, and indexes the resulting document in a dedicated Elasticsearch index.
This is the most basic scenario, please refer to How the El Proxy works for additional details.
Scenario 2¶
As in the previous example, El Proxy signs the event, adds it to the blockchain, and indexes it in Elasticsearch.
Logstash, however, goes down before getting notified by El Proxy about the success of the whole signing/adding/indexing operation. Logstash, then, can not acknowledge the correct delivery of the event, even though El Proxy has already successfully indexed the event.
Logstash, in the meanwhile, is restarted successfully and it sends the same event to El Proxy again. El Proxy goes through the signing/adding/indexing operation for a second time, creating a duplicated event in Elasticsearch, but keeping a coherent blockchain.
Scenario 3¶
In this scenario, El Proxy is down while Logstash is sending events to it and therefore the event cannot be signed, added to the blockchain, and indexed.
In this case, Logstash tries to send the event to El Proxy until succeeding. If also Logstash is restarted before being able to successfully send the event to El Proxy, no event loss is experienced since events are disk persisted. As soon as Logstash is up and running again, it will send the pending event to El Proxy. Differently from scenario 2, this will not cause any event duplication in Elasticsearch.
Scenario 4¶
In this scenario, Logstash, instead of sending the example event to El Proxy, sends an event with a field that does not match the Elasticsearch mapping definition of the index in which the resulting document will be stored.
In the running example, the host
field is mapped as an object (as you can see in the code snipped
reported in the introduction). Logstash, however, has received an event in which the host
field
appears as a string:
{
"host": "myhost.wp.lan",
"event": {
"category": "authentication"
}
}
El Proxy signs the event, adds it to the blockchain, and tries to index it in Elasticsearch.
Elasticsearch, however, refuses to index the document, returning an error to El Proxy.
El Proxy then removes all event fields that are not specified in the configuration file
elastic_blockchain_proxy_fields.toml
for being signed and tries to reindex the event again.
At this point we can have different outcomes:
the host field is not included in the signature:
the field is removed from the event, fixing the mapping definition issue, and the resulting document is then successfully indexed
the host field must be included in the signature:
the mapping definition issue still exists, then the event is again rejected by Elasticsearch
the event is then put in the Dead Letter Queue (DLQ) waiting for manual intervention
in case of failure writing the event in the DQL, El Proxy returns an error to Logstash, which tries to send the event again
Please refer to El Proxy Configuration for additional details.
Scenario 5¶
In this scenario, El Proxy signs the event and adds it to the blockchain. When trying to index the event in Elasticsearch, however, El Proxy gets some communication errors from Elasticsearch. For example, Elasticsearch is temporarily down, or the disk has less than 15% of free space, causing Elasticsearch to refuse to index.
Then, El Proxy retries to index the event with exponential back-off and:
if succeeding before hitting the maximum amount of retries, then the event is indexed
if the maximum amount of retries is hit without indexing, then an error is returned to Logstash, which tries to send the event again
The number of retries can be defined in the El Proxy configuration. Please refer to El Proxy Configuration for additional details.
Agents¶
The Elastic Beat feature¶
A Beat is a small, self-contained agent installed on devices within an infrastructure (mostly servers and workstations) that acts as a client to send data to a centralised server where they are processed in a suitable way.
Beats are part of the Elastic Stack; they gather data and can send them to either Logstash or Elasticsearch and visualised on a Kibana dashboard.
There are different types of Beat agents available, each tailored for a different type of data. NetEye currently uses the Filebeat NetFlow Module for internal use. Additional information about the Beat feature can be found in the official documentation.
NetEye can receive data from Beats installed on monitored hosts (i.e., on the clients). The remainder of this section shows first how NetEye is configured to receive data from Beats, i.e., as a receiving point for data sent by Beats, then explains how to install and configure Beats on clients, using SSL certificates to protect the communication.
Overview of NetEye’s Beat infrastructure setup¶
Beats are part of the SIEM module, which is an additional module, that can be installed following the directions in the Feature Modules Installation section if you have the subscription.
Warning
Beats are intended as a replacement for Safed, even if they can coexist. However, since both Beat and Safed might process the same data, they would double the time and resources required, therefore it is suggested to activate only one of them.
The NetEye implementation allows Logstash to listen to incoming data on a secured TCP port (5044). Logstash then sends data into two flows:
to a file on disk, in the /neteye/shared/rsyslog/data folder, with the following name:
%{[agent][hostname]}/%{+YYYY}/%{+MM}/%{+dd}/[LS]%{[host][hostname]}.log
. The format of the file is the same used forsafed
files. This file is encrypted and its integrity validated, like it happens for Safed, and written to disk to preserve its inalterability.to Elastic, to be displayed into preconfigured Kibana dashboards.
Communication is SSL protected, and certificates need to be installed on clients together with the agents, see next section for more information.
Note
When the module is installed there is no data flow until agents are installed on the clients to be monitored. Indeed, deployment on NetEye consists only of the set up of the listening infrastructure.
The Beat feature is currently a CLI-only feature: no GUI is available and the configuration should be done by editing configuration files.