Observability

Purpose

This document will detail gCORE’s observability stack. The topic of observability comprises:

  • Logging
  • Monitoring
  • Tracing

Logging and monitoring are immediate needs and are actively being worked on. When we get to the point where we're covering the basics of these two, we can talk about tracing. For now, I'll just acknowledge its inclusion in the umbrella term: observability.

NOTE: At the present time, this document depicts a partial, work-in-progress, non-production prototype. None of this technology has been peer reviewed or approved.

Logging

NOTE: Amazon recently rebranded the latest version of its ElasticSearch service as OpenSearch. The quickstart link below has been updated to reflect the new cluster endpoint, but the rest of this document still references ElasticSearch Service. We will give it a more thorough scrubbing later; for now, just keep this in mind.

QuickStart

Syslogs

Topology

The image below depicts the first-pass solution for routing system logs from EC2 instances to our central logging solution. Notice that Kinesis Firehose is serving as the routing engine here, sending logs to:

  • ElasticSearch for analytics, visualization and more to come
  • S3 for long-term storage in accordance with our retention policy (definition pending)

The idea here is that we may need to retain logs for longer than we need/want (cost/benefit TBD) to keep them at our fingertips in ElasticSearch.

<!-- image: image (from original wiki uploads) -->

Components

Component | Description
RSyslog | Default Linux syslog server. Defaults unchanged, but I added a stanza to forward logs to udp://localhost:5140.
Fluentd | OSS unified logging solution with lots of great plugins, including one for Kinesis Firehose. Fluentd receives syslog events from RSyslog, transforms them into JSON and ships them to Kinesis.
Amazon Kinesis Data Firehose | Receives streaming log data from our EC2 instances (via Fluentd) and loads it into AWS ElasticSearch and S3.
Amazon ElasticSearch Service w/ Kibana | ElasticSearch is an OSS, distributed, RESTful, JSON-based search engine. The AWS service version comes packaged with the Kibana user interface for joyful navigation of our log data.
Amazon S3 | AWS' faithful object store, used here to store a copy of all log data as a cost-effective long-term storage solution. The plan is to keep "hot" data in ElasticSearch for near-term use (analytics, troubleshooting, security information and event management (SIEM), etc.) while backing everything up to S3 (and eventually Glacier) to balance log management, compliance and cost management objectives.
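The retention policy itself is still TBD, but as a rough sketch of how the S3-to-Glacier transition mentioned above could be wired up (the bucket name, prefix and day counts are placeholders, not agreed-upon values):

# Hypothetical sketch: transition archived logs to Glacier and expire them later.
# Bucket name, prefix and day counts are placeholders pending our retention policy.
aws s3api put-bucket-lifecycle-configuration \
  --bucket gcore-log-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "log-retention",
      "Filter": {"Prefix": "syslogs/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 730}
    }]
  }'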

Context

Here I want to provide a bit of background to help readers understand my technology choices. Seasoned ELK Stack users in particular may be wondering why I'm using Fluentd and Kinesis rather than simply shipping logs directly to ElasticSearch via LogStash.

First, it's important to note that my primary goal was to have two log event repositories: one for near-term analytics purposes and another for long-term backup purposes. I initially planned on maintaining the status quo, whereby logs would continue to go to CloudWatch Logs, where they currently reside, for long-term storage. I would then implement a process for loading CloudWatch logs into ElasticSearch to improve our log management experience. And while this is a viable solution, I didn't think it was the best one, for one simple reason: S3 is a simpler and cheaper long-term backup solution than CloudWatch Logs.

So with Cloudwatch Logs now replaced with S3, I needed a solid solution for routing logs to two destinations:

  • ElasticSearch for near-term analytics.
  • S3 for cost-effective, long-term backup.

Amazon's recommended solution for preparing and loading real-time data streams into data stores and analytics services is Kinesis, so given my inherent bias toward leveraging Amazon services for business-critical tasks, I decided to go with it.
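For reference, standing up the delivery stream looks roughly like this (all ARNs, the bucket name and the rotation period are placeholders, not our actual values); Firehose's AllDocuments backup mode is what lets a single stream feed both ElasticSearch and S3:

# Hypothetical sketch: one Firehose stream delivering to ElasticSearch with a full S3 backup.
# All ARNs and the bucket name are placeholders.
aws firehose create-delivery-stream \
  --delivery-stream-name gcore-es-prod-kinesis-firehose \
  --elasticsearch-destination-configuration '{
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "DomainARN": "arn:aws:es:us-west-2:123456789012:domain/gcore-logging",
    "IndexName": "syslogs",
    "IndexRotationPeriod": "OneDay",
    "S3BackupMode": "AllDocuments",
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
      "BucketARN": "arn:aws:s3:::gcore-log-archive"
    }
  }'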

Now, it's true that I could (and still can) forego Kinesis in favor of LogStash and leverage its output plugins to ship to both ElasticSearch and S3, but I made the decision to try Kinesis for this purpose, so that's where I'm at.

At that point, it was just a matter of picking the agent for shipping logs from our instances to Kinesis. Amazon provides its own Kinesis Agent, but OS packages for distros other than Amazon Linux are hard to come by, so I opened myself up to simpler alternatives. That's when I discovered that Fluentd has a Kinesis plugin. Fluentd has been growing in popularity over the past few years, has tremendous community support and momentum, is operationally simpler (and likely more performant (TBD)) than AWS' agent, and has a robust plugin library that we can leverage to satisfy other requirements down the road. Simply put: Fluentd is an absolute joy to work with, so it's an easy pick over Amazon's agent.

Okay, hopefully that answers the "Why did you go with that?" questions. If any remain, or anyone wants to challenge any of my assumptions or conclusions, please do so. It's still just a prototype; performance testing may lead me to change some of my decisions here, and I'm perfectly fine with that.

RSyslog Configuration

RSyslog is part of our base Linux configuration. Updating it to route logs to Fluentd is just a matter of adding one line to the configuration and restarting the service:

# Send log messages to Fluentd
*.* @127.0.0.1:5140

For POC purposes, I've added this line to the end of /etc/rsyslog.conf. If we decide to go into production with this, I will instead configure Ansible to push out a dedicated file (e.g. /etc/rsyslog.d/fluentd.conf), which RSyslog will pick up thanks to the following include stanza in the default configuration file (a sketch of that change follows the stanza).

# /etc/rsyslog.conf
# Include all config files in /etc/rsyslog.d/
#
$IncludeConfig /etc/rsyslog.d/*.conf
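As a sketch, the production-style change amounts to something like this (Ansible would template the file rather than echo it):

# Hypothetical sketch of the dedicated drop-in approach: create the file picked up
# by the $IncludeConfig directive above, then restart the service.
echo '*.* @127.0.0.1:5140' | sudo tee /etc/rsyslog.d/fluentd.conf
sudo systemctl restart rsyslog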

Fluentd Configuration

New Fluentd users will immediately look to install the community version of Fluentd, and that's fine for those who are willing to trade operational simplicity for bleeding-edge features. For those who, like me, want something stable and nicely packaged, use td-agent. Visit their FAQ page for an explanation of the differences, along with links to the various OS packages they provide.

To install on my Ubuntu 18.04 (Bionic) server, I simply followed these instructions. Again, this is just for POC purposes so I’m erring on the side of expediency. If we adopt this in production, I will likely leverage this Ansible role to manage installation as well as configuration.

curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-bionic-td-agent4.sh | sh

With td-agent installed, I next needed to install the Kinesis plugin like so:

sudo td-agent-gem install fluent-plugin-kinesis

That’s it for the installation bits. Now I just need to configure the agent to receive my system logs (from RSyslog) and ship them to Kinesis. Here is my POC configuration.

$ cat /etc/td-agent/td-agent.conf
<source>
  @type syslog
  port 5140
  bind 127.0.0.1
  tag syslog
  facility_key facility # add facility to record
  severity_key severity # add severity to record
</source>
 
<match syslog.**>
  @type copy
  <store>
    @type stdout
    <inject>
      time_key timestamp
    </inject>
  </store>
  <store>
    @type kinesis_firehose
    region us-west-2
    delivery_stream_name gcore-es-prod-kinesis-firehose
    <inject>
      time_key timestamp
    </inject>
  </store>
</match>
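To pick up this configuration and sanity-check the pipeline end to end, something like the following works (the logger message is just a throwaway test string):

# Restart td-agent, emit a test syslog message, and confirm it shows up locally.
# Events appear in td-agent.log because of the stdout store in the copy output above.
sudo systemctl restart td-agent
logger "td-agent kinesis pipeline test"
sudo tail -n 20 /var/log/td-agent/td-agent.log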

Some important notes:

  • Please note that while this configuration is fully functional, it is not yet tuned for performance.
  • The match stanza is configured to also send events to stdout, where they’re available via /var/log/td-agent/td-agent.log. This is ONLY for testing purposes.
  • gcore-es-prod-kinesis-firehose is the current name of the delivery stream. I already have plans to change this when I destroy and rebuild the infra for phase 2.

Usage

Just hit the Kibana UI, navigate to the Discover tab on the left, and choose syslogs-* from the index drop-down. That will show you some logs along with the search interface.

Note that our Kibana instance does not yet require authentication. I plan to integrate it with Cognito for Roche SSO; this is listed in the Todo section below.

Data Sources

Data Source | Description | Location | Solution
Syslog | Basic system logs collected by the Linux syslog daemon | Linux servers (syslogd) | Fluentd syslog input.
OSSEC | OSS host intrusion detection solution | Linux servers (/var/ossec/logs/alerts/alerts.*) | If possible, redirect to RSyslog; otherwise use the Fluentd tail input. Also look into replacing the current OSSEC with Wazuh, as it offers more features and simplified operations. The only kicker is that Wazuh doesn't offer a local installation by default (only manager or agent), so this needs a bit of research before proceeding.
ClamAV | OSS antivirus solution | Linux servers (/var/log/clamav/*) | Update freshclam.conf and clamd.conf to route to RSyslog (the default is to log to file).
AWS "Security Logs" | CloudTrail and/or GuardDuty and/or SecurityHub and/or Config | AWS | Need to determine which of these overlapping solutions provides the most bang for the buck and direct its data to ElasticSearch.
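For the ClamAV entry, the change is likely just a couple of directives in each config file (untested; directive names are from the stock ClamAV configs, and exact file paths vary by distro):

# Hypothetical excerpt for clamd.conf and freshclam.conf: log to syslog instead of a file.
LogSyslog yes
LogFacility LOG_LOCAL6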

VPC Flow Logs

Topology

The image below depicts the first-pass solution for routing VPC flow logs from all VPCs to our central logging solution. Here I chose a common pattern whereby VPCs dump their logs to CloudWatch Logs. From there, a CloudWatch-triggered Lambda function formats the logs as JSON and ships them to ElasticSearch for centralization.
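For reference, the per-VPC setup amounts to something like this (the VPC ID, log group name and role ARN are placeholders):

# Hypothetical sketch: publish a VPC's flow logs to CloudWatch Logs.
# The Lambda function subscribes to this log group and forwards events to ElasticSearch.
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-group-name /vpc/flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/vpc-flow-logs-role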

<!-- image: image (from original wiki uploads) -->

Components

Component | Description
CloudWatch Logs | Receives flow logs from each VPC.
Lambda | CloudWatch-triggered function that formats flow log events as JSON and ships them to ElasticSearch.
Amazon ElasticSearch Service w/ Kibana | ElasticSearch is an OSS, distributed, RESTful, JSON-based search engine. The AWS service version comes packaged with the Kibana user interface for joyful navigation of our log data.

Context

This first-pass solution was hastily thrown together to prepare for our important Internet traffic re-routing exercise (#38). We needed the data in ElasticSearch post-haste, so I went with this common workflow.

My plan is to refactor this to use a Kinesis Firehose driven solution similar to the one I'm using for syslogs, so that we stick to the same S3 archive + ElasticSearch analytics pattern that I think will serve us well.

Version 2 of this solution is coming as soon as critical DCR issues are resolved.

Example

VPC flow logs are very noisy and not especially intuitive, so this is my attempt to make sense of them using an example.

Network Connection

For our test, I made a single GET request to an HTTP endpoint like so:

curl -L http://95.217.228.176/json

Here’s the network depiction of the connection:

<!-- image: image (from original wiki uploads) -->

Flow Log Data

That single GET request generated the following VPC flow logs:

<!-- image: image (from original wiki uploads) -->

As you can see, VPC flow logs are… verbose, to say the least! Let’s break the log entries down and then we can think about ways to tighten up our searches so we get exactly what we want back from Kibana.
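For reference, here is the default (version 2) flow log record format, followed by an illustrative record; the values below are made up rather than taken from the screenshot above:

# Default flow log record format (version 2):
# version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0a1b2c3d4e5f67890 10.0.1.25 95.217.228.176 54321 80 6 8 520 1620000000 1620000060 ACCEPT OK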

Some points (hastily thrown together for now):

  • Flow logs are unidirectional.
  • Absent context, it's impossible to discern the originator of the traffic.
  • We're essentially getting two views: one from the instance's interface and the other from the NAT device's interface.
  • NAT devices don't have an instance-id, so if you want to see NAT traffic only from the perspective of the NAT Gateway you can filter where NOT instance_id:"i-*" (because Kibana doesn't let you do instance_id:"-" for some reason).
  • Conversely, if you only want the perspective of the EC2 instance, instance_id:"i-*" will exclude the NAT Gateway.
  • No explanation yet for the timestamps. I'll need to look into this a bit. I think we'll probably just have to eat the imprecision, though.

Todo

  1. AI Application log ingestion
  2. CloudTrail log ingestion
  3. Sample Kibana dashboards
  4. Kibana authentication & authorization
  5. Additional Kinesis R&D
  6. Performance testing
  7. Fluentd OS tuning
  8. CentOS support
  9. Ansible automation
  10. Data retention plan (S3 and ElasticSearch)
  11. CloudWatch metrics integration
  12. ES domain rename, rightsize and rebuild
  13. Queueing and buffering
  14. Miscellaneous hardening

Monitoring

Summary

Monitoring is very much in the development phase, so I'll just cover some high points here.

  • Infrastructure and application monitoring are handled by Prometheus.
  • Grafana is our dashboard solution and is available here.
  • Grafana leverages LDAP for SSO, as the OSS version does not support OIDC or SAML.
    • As of June 2022, SSO is broken because we don't have an owner responsible for pushing out Bind password updates. Until that is fixed, you must use the admin credentials, which are located in LastPass.
    • AWS now provides a managed Grafana service, so it's likely we'll migrate to it when time permits.
  • Monitoring infrastructure is in POC phase and is under heavy development.
    • At present, we have no resources supporting our monitoring stack, but help is on the way.
  • Prometheus provisioning code is available at aws-infra-monitoring.
  • The plan is to merge this code with terraform-aws-prod.

EC2 Instance Monitoring

  • Grafana dashboard is Node Exporter for Prometheus Dashboard.
  • Our base AMI includes Prometheus Node Exporter to enable monitoring.
  • Prometheus is configured to automatically monitor systems with the following characteristics (see the sketch below):
    • tag::Name is set.
    • tag::Environment is set to prod.
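A rough sketch of what that discovery rule looks like in prometheus.yml (the job name, region and relabeling details are assumptions based on the node_exporter default port, not our actual config):

# Hypothetical prometheus.yml excerpt: discover EC2 instances and keep only those
# with a Name tag and Environment=prod, scraping node_exporter on port 9100.
scrape_configs:
  - job_name: ec2-nodes
    ec2_sd_configs:
      - region: us-west-2
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        regex: prod
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        regex: .+
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance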

Tracing

TBD