Observability

Purpose

This document will detail gCORE’s observability stack. The topic of observability comprises:

  • Logging
  • Monitoring
  • Tracing

Logging and monitoring are immediate needs and are actively being worked on. When we get to the point where we're covering the basics of these two, we can talk about tracing. For now, I'll just acknowledge its inclusion in the umbrella term: observability.

NOTE: At the present time, this document depicts a partial, work-in-progress, non-production prototype. None of this technology has been peer reviewed or approved.

Logging

NOTE: Amazon recently rebranded the latest version of its ElasticSearch service as OpenSearch. The quickstart link below has been updated to reflect the new cluster endpoint, but the rest of this document still references ElasticSearch Service. We will give it a more thorough scrubbing later; for now, just keep this in mind.

QuickStart

Syslogs

Topology

The image below depicts the first-pass solution for routing system logs from EC2 instances to our central logging solution. Notice that Kinesis Firehose is serving as the routing engine here, sending logs to:

  • ElasticSearch for analytics, visualization and more to come
  • S3 for long-term storage in accordance with our retention policy (definition pending)

The idea here is that we may need to retain logs for longer than we need/want (cost/benefit TBD) to keep them at our fingertips in ElasticSearch.

<!-- image: image (from original wiki uploads) -->

Components

Component | Description
RSyslog | Default Linux syslog server. Defaults unchanged, but I added a stanza to forward logs to udp://localhost:5140.
Fluentd | OSS unified logging solution with lots of great plugins, including one for Kinesis Firehose. Fluentd receives syslog events from RSyslog, transforms them into JSON and ships them to Kinesis.
Amazon Kinesis Data Firehose | Receives streaming log data from our EC2 instances (via Fluentd) and loads it into AWS ElasticSearch and S3.
Amazon ElasticSearch Service w/ Kibana | ElasticSearch is an OSS, distributed, RESTful, JSON-based search engine. The AWS service version comes packaged with the Kibana user interface for joyful navigation of our log data.
Amazon S3 | AWS' faithful object store, used here to store a copy of all log data as a cost-effective long-term storage solution. The plan is to keep "hot" data in ElasticSearch for near-term use (analytics, troubleshooting, security information and event management (SIEM), etc.) while backing everything up to S3 (and eventually Glacier) to balance log management, compliance and cost management objectives.
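The retention policy itself is still TBD, but as a rough sketch of how the S3-to-Glacier transition mentioned above could be wired up (the bucket name, prefix and day counts are placeholders, not agreed-upon values):

# Hypothetical sketch: transition archived logs to Glacier and expire them later.
# Bucket name, prefix and day counts are placeholders pending our retention policy.
aws s3api put-bucket-lifecycle-configuration \
  --bucket gcore-log-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "log-retention",
      "Filter": {"Prefix": "syslogs/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 730}
    }]
  }'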

Context

Here I want to provide a bit of background to help readers understand my technology choices. Seasoned ELK Stack users in particular may be wondering why I'm using Fluentd and Kinesis rather than simply shipping logs directly to ElasticSearch via LogStash.

First, it's important to note that my primary goal was to have two log event repositories: one for near-term analytics purposes and another for long-term backup purposes. I initially planned on maintaining the status quo, whereby logs would continue to go to CloudWatch Logs, where they currently reside, for long-term storage. I would then implement a process for loading CloudWatch logs into ElasticSearch to improve our log management experience. And while this is a viable solution, I didn't think it was the best one, for one simple reason: S3 is a simpler and cheaper long-term backup solution than CloudWatch Logs.

So with Cloudwatch Logs now replaced with S3, I needed a solid solution for routing logs to two destinations:

  • ElasticSearch for near-term analytics.
  • S3 for cost-effective, long-term backup.

Amazon's recommended solution for preparing and loading real-time data streams into data stores and analytics services is Kinesis, so given my inherent bias toward leveraging Amazon services for business-critical tasks, I decided to go with it.
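For reference, standing up the delivery stream looks roughly like this (all ARNs, the bucket name and the rotation period are placeholders, not our actual values); Firehose's AllDocuments backup mode is what lets a single stream feed both ElasticSearch and S3:

# Hypothetical sketch: one Firehose stream delivering to ElasticSearch with a full S3 backup.
# All ARNs and the bucket name are placeholders.
aws firehose create-delivery-stream \
  --delivery-stream-name gcore-es-prod-kinesis-firehose \
  --elasticsearch-destination-configuration '{
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "DomainARN": "arn:aws:es:us-west-2:123456789012:domain/gcore-logging",
    "IndexName": "syslogs",
    "IndexRotationPeriod": "OneDay",
    "S3BackupMode": "AllDocuments",
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
      "BucketARN": "arn:aws:s3:::gcore-log-archive"
    }
  }'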

Now, it's true that I could (and still can) forego Kinesis in favor of LogStash and leverage its output plugins to ship to both ElasticSearch and S3, but I made the decision to try Kinesis for this purpose, so that's where I'm at.

At that point, it was just a matter of picking the agent for shipping logs from our instances to Kinesis. Amazon provides its own Kinesis Agent, but OS packages for distros other than Amazon Linux are hard to come by, so I opened myself up to simpler alternatives. That's when I discovered that Fluentd has a Kinesis plugin. Fluentd has been growing in popularity over the past few years, has tremendous community support and momentum, is operationally simpler (and likely more performant (TBD)) than AWS' agent, and has a robust plugin library that we can leverage to satisfy other requirements down the road. Simply put: Fluentd is an absolute joy to work with, so it's an easy pick over Amazon's agent.

Okay, hopefully that answers the "Why did you go with that?" questions. If any remain, or anyone wants to challenge any of my assumptions or conclusions, please do so. It's still just a prototype; performance testing may lead me to change some of my decisions here, and I'm perfectly fine with that.

RSyslog Configuration

RSyslog is part of our base Linux configuration. Updating it to route logs to Fluentd is just a matter of adding one line to the configuration and restarting the service:

# Send log messages to Fluentd
*.* @127.0.0.1:5140

For POC purposes, I've added this line to the end of /etc/rsyslog.conf. If we decide to go into production with this, I will instead configure Ansible to push out a dedicated file (e.g. /etc/rsyslog.d/fluentd.conf), which RSyslog will pick up thanks to the following include stanza in the default configuration file (a sketch of that change follows the stanza).

# /etc/rsyslog.conf
# Include all config files in /etc/rsyslog.d/
#
$IncludeConfig /etc/rsyslog.d/*.conf
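As a sketch, the production-style change amounts to something like this (Ansible would template the file rather than echo it):

# Hypothetical sketch of the dedicated drop-in approach: create the file picked up
# by the $IncludeConfig directive above, then restart the service.
echo '*.* @127.0.0.1:5140' | sudo tee /etc/rsyslog.d/fluentd.conf
sudo systemctl restart rsyslog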

Fluentd Configuration

New Fluentd users will immediately look to install the community version of Fluentd, and that's fine for those who are willing to trade operational simplicity for bleeding-edge features. For those who, like me, want something stable and nicely packaged, use td-agent. Visit their FAQ page for an explanation of the differences, along with links to the various OS packages they provide.

To install on my Ubuntu 18.04 (Bionic) server, I simply followed these instructions. Again, this is just for POC purposes so I’m erring on the side of expediency. If we adopt this in production, I will likely leverage this Ansible role to manage installation as well as configuration.

curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-bionic-td-agent4.sh | sh

With td-agent installed, I next needed to install the Kinesis plugin like so:

sudo td-agent-gem install fluent-plugin-kinesis

That’s it for the installation bits. Now I just need to configure the agent to receive my system logs (from RSyslog) and ship them to Kinesis. Here is my POC configuration.

$ cat /etc/td-agent/td-agent.conf
<source>
  @type syslog
  port 5140
  bind 127.0.0.1
  tag syslog
  facility_key facility # add facility to record
  severity_key severity # add severity to record
</source>
 
<match syslog.**>
  @type copy
  <store>
    @type stdout
    <inject>
      time_key timestamp
    </inject>
  </store>
  <store>
    @type kinesis_firehose
    region us-west-2
    delivery_stream_name gcore-es-prod-kinesis-firehose
    <inject>
      time_key timestamp
    </inject>
  </store>
</match>
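To pick up this configuration and sanity-check the pipeline end to end, something like the following works (the logger message is just a throwaway test string):

# Restart td-agent, emit a test syslog message, and confirm it shows up locally.
# Events appear in td-agent.log because of the stdout store in the copy output above.
sudo systemctl restart td-agent
logger "td-agent kinesis pipeline test"
sudo tail -n 20 /var/log/td-agent/td-agent.log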

Some important notes:

  • Please note that while this configuration is fully functional, it is not yet tuned for performance.
  • The match stanza is configured to also send events to stdout, where they’re available via /var/log/td-agent/td-agent.log. This is ONLY for testing purposes.
  • gcore-es-prod-kinesis-firehose is the current name of the delivery stream. I already have plans to change this when I destroy and rebuild the infra for phase 2.

Usage

Just hit the Kibana UI, navigate to the Discover tab on the left, and choose syslogs-* from the index drop-down. That will show you some logs along with the search interface.

Note that our Kibana instance does not yet require authentication. I plan to integrate it with Cognito for Roche SSO; this is listed in the Todo section below.

Data Sources

Data Source | Description | Location | Solution
Syslog | Basic system logs collected by the Linux syslog daemon | Linux servers (syslogd) | Fluentd syslog input.
OSSEC | OSS host intrusion detection solution | Linux servers (/var/ossec/logs/alerts/alerts.*) | If possible, redirect to RSyslog; otherwise use the Fluentd tail input. Also look into replacing the current OSSEC with Wazuh, as it offers more features and simplified operations. The only kicker is that Wazuh doesn't offer a local installation by default (only manager or agent), so this needs a bit of research before proceeding.
ClamAV | OSS antivirus solution | Linux servers (/var/log/clamav/*) | Update freshclam.conf and clamd.conf to route to RSyslog (the default is to log to file).
AWS "Security Logs" | CloudTrail and/or GuardDuty and/or SecurityHub and/or Config | AWS | Need to determine which of these overlapping solutions provides the most bang for the buck and direct its data to ElasticSearch.
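For the ClamAV entry, the change is likely just a couple of directives in each config file (untested; directive names are from the stock ClamAV configs, and exact file paths vary by distro):

# Hypothetical excerpt for clamd.conf and freshclam.conf: log to syslog instead of a file.
LogSyslog yes
LogFacility LOG_LOCAL6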

VPC Flow Logs

Topology

The image below depicts the first-pass solution for routing VPC flow logs from all VPCs to our central logging solution. Here I chose a common pattern whereby VPCs dump their logs to CloudWatch Logs. From there, a CloudWatch-triggered Lambda function formats the logs as JSON and ships them to ElasticSearch for centralization.
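For reference, the per-VPC setup amounts to something like this (the VPC ID, log group name and role ARN are placeholders):

# Hypothetical sketch: publish a VPC's flow logs to CloudWatch Logs.
# The Lambda function subscribes to this log group and forwards events to ElasticSearch.
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-group-name /vpc/flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/vpc-flow-logs-role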

<!-- image: image (from original wiki uploads) -->

Components

Component | Description
CloudWatch Logs | Receives flow logs from each VPC.
Lambda | CloudWatch-triggered function that formats flow log events as JSON and ships them to ElasticSearch.
Amazon ElasticSearch Service w/ Kibana | ElasticSearch is an OSS, distributed, RESTful, JSON-based search engine. The AWS service version comes packaged with the Kibana user interface for joyful navigation of our log data.

Context

This first-pass solution was hastily thrown together to prepare for our important Internet traffic re-routing exercise (#38). We needed the data in ElasticSearch post-haste, so I went with this common workflow.

My plan is to refactor this to use a Kinesis Firehose driven solution similar to the one I'm using for syslogs, so that we stick to the same S3 archive + ElasticSearch analytics pattern that I think will serve us well.

Version 2 of this solution is coming as soon as critical DCR issues are resolved.

Example

VPC flow logs are very noisy and not especially intuitive, so this is my attempt to make sense of them using an example.

Network Connection

For our test, I made a single GET request to an HTTP endpoint like so:

curl -L http://95.217.228.176/json

Here’s the network depiction of the connection:

<!-- image: image (from original wiki uploads) -->

Flow Log Data

That single GET request generated the following VPC flow logs:

<!-- image: image (from original wiki uploads) -->

As you can see, VPC flow logs are… verbose, to say the least! Let’s break the log entries down and then we can think about ways to tighten up our searches so we get exactly what we want back from Kibana.
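For reference, here is the default (version 2) flow log record format, followed by an illustrative record; the values below are made up rather than taken from the screenshot above:

# Default flow log record format (version 2):
# version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0a1b2c3d4e5f67890 10.0.1.25 95.217.228.176 54321 80 6 8 520 1620000000 1620000060 ACCEPT OK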

Some points (hastily thrown together for now):

  • Flow logs are unidirectional.
  • Absent context, it's impossible to discern the originator of the traffic.
  • We're essentially getting two views: one from the instance's interface and the other from the NAT device's interface.
  • NAT devices don't have an instance-id, so if you want to see NAT traffic only from the perspective of the NAT Gateway you can filter where NOT instance_id:"i-*" (because Kibana doesn't let you do instance_id:"-" for some reason).
  • Conversely, if you only want the perspective of the EC2 instance, instance_id:"i-*" will exclude the NAT Gateway.
  • No explanation yet for the timestamps. I'll need to look into this a bit. I think we'll probably just have to eat the imprecision, though.

Todo

  1. AI Application log ingestion
  2. CloudTrail log ingestion
  3. Sample Kibana dashboards
  4. Kibana authentication & authorization
  5. Additional Kinesis R&D
  6. Performance testing
  7. Fluentd OS tuning
  8. CentOS support
  9. Ansible automation
  10. Data retention plan (S3 and ElasticSearch)
  11. CloudWatch metrics integration
  12. ES domain rename, rightsize and rebuild
  13. Queueing and buffering
  14. Miscellaneous hardening

Monitoring

Summary

Monitoring is very much in the development phase, so I'll just cover some high points here.

  • Infrastructure and application monitoring are handled by Prometheus.
  • Grafana is our dashboard solution and is available here.
  • Grafana leverages LDAP for SSO, as the OSS version does not support OIDC or SAML.
    • As of June 2022, SSO is broken because we don't have an owner responsible for pushing out Bind password updates. Until that is fixed, you must use the admin credentials, which are located in LastPass.
    • AWS now provides a managed Grafana service, so it's likely we'll migrate to it when time permits.
  • Monitoring infrastructure is in POC phase and is under heavy development.
    • At present, we have no resources supporting our monitoring stack, but help is on the way.
  • Prometheus provisioning code is available at aws-infra-monitoring.
  • The plan is to merge this code with terraform-aws-prod.

EC2 Instance Monitoring

  • Grafana dashboard is Node Exporter for Prometheus Dashboard.
  • Our base AMI includes Prometheus Node Exporter to enable monitoring.
  • Prometheus is configured to automatically monitor systems with the following characteristics (see the sketch below):
    • tag::Name is set.
    • tag::Environment is set to prod.
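A rough sketch of what that discovery rule looks like in prometheus.yml (the job name, region and relabeling details are assumptions based on the node_exporter default port, not our actual config):

# Hypothetical prometheus.yml excerpt: discover EC2 instances and keep only those
# with a Name tag and Environment=prod, scraping node_exporter on port 9100.
scrape_configs:
  - job_name: ec2-nodes
    ec2_sd_configs:
      - region: us-west-2
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        regex: prod
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        regex: .+
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance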

Tracing

TBD