Ansible System Configuration

Summary

Ansible is our tool of choice for installing and configuring Linux servers. Ansible code is managed as follows:

| Code | GitHub Location | Description |
| --- | --- | --- |
| Ansible Runtime | gred-ecdi/aws-infra-live | Ansible playbooks, inventory, and group/host variables. Ansible is run from here to configure servers. |
| Ansible Roles | gred-ecdi/ansible-role-* | GitHub repos hosting Ansible roles. Each role has its own repository to enable versioning and testing. Roles are consumed via the Ansible Runtime. |

Roles

Instructions on using Ansible Molecule to develop Ansible roles are coming soon.
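
In the meantime, a typical Molecule scenario configuration looks roughly like the sketch below. The docker driver and CentOS image are assumptions for illustration, not our confirmed setup.

```yaml
# molecule/default/molecule.yml -- hypothetical scenario config
# (driver and platform image are assumptions, not our confirmed setup)
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: quay.io/centos/centos:stream8
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible
```

Running molecule test from the role directory then creates the container, applies the role, runs the verifier, and tears everything down.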

Inventories & Variables

Summary

As of February 2022, for security and predictability reasons, we use separate inventories and variable sets for each of our environments.

| Environment | Directory | Description |
| --- | --- | --- |
| Development | environments/dev | Legacy name for our non-production AWS environment in our production AWS account. It’s being phased out in favor of test, and this directory is now just a symlink to test. |
| Test | environments/test | Non-production environment in our production AWS account. |
| Production | environments/prod | Production environment in our production AWS account. |

Understanding the logic behind the decision to keep distinct inventories for each environment requires understanding Ansible’s group intersection problem. This blog post identifies the problem well and recommends a number of solutions, the last of which we adopted here.
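
To make the problem concrete, here’s a contrived single-inventory sketch (hostnames drawn from our Talend examples; the merged layout itself is hypothetical). group_vars files are keyed by a single group, so a variable meant only for production Talend servers has no clean home when prod and non-prod hosts share the app_talend group:

```yaml
# Hypothetical merged inventory illustrating the group intersection problem:
# group_vars/app_talend.yml would apply to BOTH hosts below, and Ansible
# has no native way to scope a variable to "app_talend AND prod".
all:
  children:
    app_talend:
      hosts:
        gcore-talend-uat:
        gcore-talend-prod:
    prod:
      hosts:
        gcore-talend-prod:
```

With one inventory tree per environment, app_talend inside environments/prod contains only production hosts, so the intersection never arises.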

Working with Multiple Environments

Our Ansible environment is configured to default to non-production, which lets us test changes before rolling them out to production. By default, then, your Ansible commands will only affect non-production.
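
The non-production default is presumably wired up in the repository’s ansible.cfg (or an equivalent ANSIBLE_INVENTORY setting); a minimal sketch, with the exact contents being an assumption:

```ini
# ansible.cfg (sketch -- exact settings are an assumption)
[defaults]
# commands run without -i resolve against the non-prod inventory
inventory = environments/test
```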

For example, below I’ll run ansible-inventory to retrieve my Talend servers (Application = "talend").

Default behavior shows non-production only:

$ ansible-inventory --graph app_talend
@app_talend:
  |--gcore-talend-dev
  |--gcore-talend-dev-job1
  |--gcore-talend-uat
  |--gcore-talend-uat-job1

Add the -i path/to/inventory option to show other environments; for example, production:

$ ansible-inventory --graph app_talend -i environments/prod
@app_talend:
  |--gcore-talend-prod
  |--gcore-talend-prod-job1
  |--talend-app-prod-bastionhost

User/Group Management

Overview

At present, we deploy local (Linux) system users to each system in our fleet. Users are processed like so:

  1. Our base playbook is configured to run on all hosts.
  2. This playbook calls our base role.
  3. The base role includes a tasks/users.yml file, whose tasks generate a data structure of authorized users and pass it to an upstream role that processes user adds, removes, and changes.
  4. The upstream role that does all the heavy lifting is singleplatform-eng/ansible-users.
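
As an illustrative sketch (not the actual file), tasks/users.yml could filter the global base_active_users list down to the groups assigned to the current host and hand the result to the upstream role. The task names and role invocation below are assumptions:

```yaml
# roles/base/tasks/users.yml -- hypothetical sketch of the filtering logic
- name: base | select active users whose groups are assigned to this host
  ansible.builtin.set_fact:
    users: "{{ (users | default([])) + [item] }}"
  loop: "{{ base_active_users }}"
  when: item.groups | intersect(base_host_groups | default([])) | length > 0

- name: base | process user adds/removes/changes via the upstream role
  ansible.builtin.include_role:
    name: singleplatform-eng.users
```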

Managing Linux users in this context is a matter of populating user variables to achieve your desired outcome. Below, I’ll describe those variables.

| Variable | File | Details |
| --- | --- | --- |
| base_active_users | environments/000_globals/all/users_active.yml | Global list of active users eligible for installation on our server fleet, depending on their group assignment. |
| groups_to_create | environments/000_globals/all/groups.yml | Global list of user groups that get installed on all servers. |
| base_host_groups | environments/(prod\|test)/group_vars/app_{app_name}.yml | A list of the user groups to install on particular servers, as defined by the application class. |
| base_sudo_users | environments/(prod\|test)/group_vars/app_{app_name}.yml | A dictionary defining the names and sudo parameters for user groups that should have sudo access on particular servers, as defined by the application class. |
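
For example, a hypothetical base_sudo_users entry granting the talend-team group passwordless sudo might look like the following. The key names and value format are assumptions; check the existing group_vars files for the exact schema before copying this.

```yaml
# environments/test/group_vars/app_talend.yml (hypothetical sketch --
# the schema is an assumption, not our confirmed format)
base_sudo_users:
  talend-team: "ALL=(ALL) NOPASSWD: ALL"
```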

Let’s walk through a few examples.

Add a user

Before adding a new user, you must ensure you have the following information:

  • Roche assigned UNIX username
  • Full name, department and role
  • SSH public key(s)
  • Assigned user class(es) (defined as groups in environments/000_globals/all/groups.yml)

Armed with this information, you just need to open environments/000_globals/all/users_active.yml and follow the existing pattern. There are just a few caveats to note:

  1. Entries are in UID order so please stick to this pattern.
  2. Infra-Team (admin) users are UIDs 2000-2099 and are at the top of the file. When adding a new Infra-Team user, find the last one (entry before 2100) and add them there. This isn’t a hard rule, but it’s a pattern that we’d like to keep if we can.
  3. All other users are UIDs 2100-2999. If adding a non-admin user, just add them to the bottom of the file.
  4. Be sure NOT to duplicate UIDs or you will break things. When copying and pasting from a previous entry, double check that you’ve updated the UID.

Here is what my user entry looks like:

  - username: bushnet1
    name: Todd Michael Bushnell, External, gRed Informatics GCGAC
    groups: ['infra-team']
    uid: 2001
    ssh_key:
      - "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQD6eFho5Y+OerNrWkk/XgN/sihlkwRAyHAXzr5qqkVxR5JwV2WKflwDt9SibcjXYjKOcss/fEZJYf8BRm9r01EXzjbGnapq9Ixwm2vz6hKP8hnWvhKCB89Vf9+oUxBL/WmRwhzyrd/Lqrh3WhXUfrkHuGpMXg5toLYEtHRqmeIvz2xRoBS+gNk6/wDfkjETnIkdi2y2r2CKIFAMY6e6Dq1mYgIYSyeaUHthkOxCSrtjrhjG2DjIPr4huSV3EJR8a1ij3b8iFjU4ydc2KkU9MdN6kMoxV8N2fEWQ/c+WYsM87RcQISKkN7qLpbTTJA4LBNeKIgLaInUWzu6TBUQjSQ86Vaq5s3kqDCem2jxin1MsXwqmI674p58RJIs41NwqZ7k6JEczCfp0GH2mV5vkt+9ZdMrEsMBJQuvv981aYrnEuDTjawOtvU9+a9iBJafUG/t6Sx09QQbA7Jt+oHhH5Zu5cAkHDr9YzJ5wHvJP7klk++1ogsuKVqKIkr8FedBe0E+vfM6hrrfMmg+R3XMJEanYKHWAl//OvaM6ZT6d/ZKkqOhl3PDU9ZYi9qmhPNS5DqOOR+ebaZs/NMa33AB8zOv8MFRIBfmCoacZnwjQtskSegJocflbItbSDTezWW8OKjJL1LvE2UBGKd8GFVel0IAD8sX8gO68dUkCXR/NUrSYew== todd@strataconsulting.com"

NOTE:

  • The username field is the UNIX username provided by Roche and should be in the ticket.
  • The name field translates to the UNIX GECOS field and is in the format: full name, role, department. You can get this information by searching for the username in Gwiz.
  • The groups field is a list of assigned user groups and is essential to getting a user on the desired servers. Users do not get assigned to servers; groups do. More on this later.
  • The uid field must, must, must be unique!
  • The ssh_key field is a YAML list to allow you to add more than one key. Just add them to the list and ensure they’re quoted.

Assuming:

  1. the user has been assigned to the desired group(s).
  2. the desired group(s) are defined in environments/000_globals/all/groups.yml.
  3. the desired group(s) have been assigned to the desired system class(es) via the base_host_groups variable.

Your user will be distributed to the appropriate systems when you run Ansible.

If these assumptions aren’t all met, there may be additional work to do.

Add a user group

Groups are defined globally (for all environments) in environments/000_globals/all/groups.yml. To add a new group, just open the file and add your group to the bottom of the file following the established pattern and noting the caveats at the top of the file.

# We create the same groups on all systems, whether populated or not
# using 600-999 for ACE groups

# NOTE: pattern for application group accounts is $app-team
# This enables us to differentiate between human members of an
# application team and the application (system) user/group
# Use case: I want members of delphi-users to be granted sudo, but
# I don't want the delphi system user/group to also get sudo

groups_to_create:
  - name: infra-team
    gid: 600
  - name: dev-team
    gid: 601
  - name: talend-team
    gid: 602
  - name: delphi-team
    gid: 603
  - name: imo-team
    gid: 604
<your group goes here>

Add a user group (and its users) to a server class

We classify our servers based on their Application tag. For example, servers tagged with Application: Talend are automatically grouped in Ansible as app_talend by our AWS dynamic inventory configuration.
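
That automatic grouping comes from a keyed_groups rule in the aws_ec2 inventory plugin configuration. A sketch of the relevant stanza follows; the file path and exact options are assumptions, not necessarily our config:

```yaml
# aws_ec2 dynamic inventory config (path/filename assumed) -- sketch of
# the keyed_groups rule that turns Application: Talend into app_talend
plugin: amazon.aws.aws_ec2
keyed_groups:
  - key: tags.Application | lower
    prefix: app
    separator: "_"
```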

# ansible inventory command returning (non-prod) talend servers
$ ansible-inventory --graph app_talend
@app_talend:
  |--gcore-talend-dev
  |--gcore-talend-dev-job1
  |--gcore-talend-uat
  |--gcore-talend-uat-job1

# ansible inventory command returning (prod) talend servers
$ ansible-inventory --graph app_talend -i environments/prod
@app_talend:
  |--gcore-talend-prod
  |--gcore-talend-prod-job1
  |--talend-app-prod-bastionhost

Now that we’ve established the application class, let’s say we want to add a user to the Talend servers. To do that, we need to:

  1. Select (or create) a group to associate with the Talend servers

An obvious choice (and one that already exists) is the talend-team group:

# environments/000_globals/all/groups.yml
  - name: talend-team
    gid: 602
  2. Associate the group with the server class

This is done by setting the base_host_groups variable in the group_vars file associated with the Talend server class in each environment:

  • environments/test/group_vars/app_talend.yml for non-production
  • environments/prod/group_vars/app_talend.yml for production

Here’s what it will look like for non-prod:

# environments/test/group_vars/app_talend.yml
---

base_host_groups: ["talend-team"]
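
And the production counterpart:

```yaml
# environments/prod/group_vars/app_talend.yml
---

base_host_groups: ["talend-team"]
```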

Repeat this pattern for your desired server class.

  3. Associate the user with the group

This takes us back to our global active users file. Below is an example user entry that achieves the desired result for our example (talend-team).

# environments/000_globals/all/users_active.yml
base_active_users:
  - username: xander
    name: Xander Mathews, Solutions Architect, ECD Information Mgmt Office GEGDF
    groups: ['dev-team', 'imo-team', 'talend-team', 'alation-team']
    ...

Note that this user is in multiple user groups which is perfectly acceptable. The key is that the groups list includes talend-team.

You now have:

  • a user assigned to a user group
  • that user group associated with a server class

The next time you run the Ansible base playbook, your changes will be deployed. Read on for instructions on deploying your changes to a specific server class.

Remove a user

Removing users is a little less intuitive than adding a user, but it’s simple. For additional context please read the upstream role instructions.

Removing a user is a matter of:

  1. removing the user’s entry from environments/000_globals/all/users_active.yml.
  2. adding the user entry in environments/000_globals/all/users_deleted.yml.
  3. committing/pushing/pulling changes & running Ansible base playbook on all servers to effect change.

I’ll demonstrate the process for Adam’s removal. Please see ace/roadmap#840 for the Git diffs.

1. Remove user entry from users_active

# git diff 000_globals/all/users_active.yml
 
-  - username: fouaa1
-    name: AbdelRahman (Adam) Fouad, External, ECD Artificial Intelligence
-    groups: ['infra-team']
-    uid: 2005
-    ssh_key:
-      - "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCoYFl11Mvmn+0rO+1fMKNVhJHQF1K+b09JHp3z2aCyoEGrAXTeJpV6J6k0gBVIXpf8efAxNX8z8odtRPEHbKeiEfwqCOTZoC2IQI7PKrQozc8yfuJ+hpXRcaxqYURnNAXfVOdHa4/pD/R4R3p9FQp/1EaRry643+8IR+DofxyAQazLGz3MmFfBtJnmDG0CGtGp4XbuMh03pzI2tLI90CYLRr/XJUyP66Ul88Fjm8n6woiazX4SZGuj/JuauXRYgW6ZJveasYaZoaKFt5kAfZKoUk9fvl/rBjCn8oGvO4xN/4ROhOkOn++QClsxRSqHvXjSZTb9W+bV9MpFPs/vH3lb akhaled@akhaled-XPS-13-9350"
+# do not use UID 2005 as this was Adam's and is now represented in users_deleted.yml

Note that I added a comment regarding reuse of his UID. This was a unique circumstance in that Adam was coincidentally the last Infra-Team user added and so without this entry, we’d accidentally reuse his UID.

2. Add user entry to users_deleted

# git diff environments/000_globals/all/users_deleted.yml
 
+  - username: fouaa1  # Adam Fouad (ace/roadmap#840)
+    uid: 2005
+    remove: true
+    force: false

3. Run Ansible Base playbook to effect change

I’m running my changes from my test server which is acting as an Ansible Control Machine (ACM) for now.

# login to tmbtest-svcs1.gred.ai as myself
$ ssh -A bushnet1@tmbtest-svcs1.gred.ai
 
# change into Ansible home directory (set as an environment variable in my bashrc)
$ cd $ANSIBLE_HOME
 
# run poetry shell to activate my python environment
$ poetry shell
 
# git pull changes
$ git pull
 
# run ansible on non-prod environment
$ ansible-playbook playbooks/base.yml -t users,sudo
 
# run ansible on prod environment
$ ansible-playbook playbooks/base.yml -t users,sudo -i environments/prod

Note: As is often the case when you run Ansible across our fleet of servers, you may encounter failure reports (inaccessible servers, python interpreter errors, etc.). Please ensure you resolve these issues as part of this task or this tech debt will pile up and create more issues down the road. Thank you.

Deploy your user updates to your desired server class

Our Ansible base playbook is responsible for running all base system tasks, including those related to user management. Sometimes you want to deploy your change without running all Ansible playbook tasks on all of our servers.

The example below runs only our user management tasks on just our Talend servers. Change the instance class for your use case.

# run this command from the ansible directory
# -t users restricts the run to only the user management tasks
# -l limits the run to our talend servers
# -i inventory option only required for production servers (defaults to non-prod)
ansible-playbook playbooks/base.yml -t users -l app_talend [-i environments/prod]

Misc Howto

This section needs some cleanup, but there is some decent info here, so I’m keeping it. I’ll clean it up ASAP.

Deploy Prometheus Node Exporter

Prometheus Node Exporter (client software) is part of our base role. To deploy to an Ansible host without running the entire base role you can do this:

ansible-playbook playbooks/base.yml -t monitoring -l "{your_host_pattern}"
 
# for example, this deploys to tmbtest-* systems
ansible-playbook playbooks/base.yml -t monitoring -l "tmbtest-*"

List all Inventory Hosts

# summary graph structure (less verbose)
ansible-inventory --graph
 
# detailed yaml structure (more verbose)
ansible-inventory --list --yaml

Install Dependent Roles/Collections

Run this command to install all dependencies listed in requirements.yml. You’ll need to do this:

  • the first time you use this repository
  • after updating requirements.yml

ansible-galaxy install -r requirements.yml --force

Ping an Ansible Host

# run ping module against a host to check connectivity
ansible $ansible_host_name -m ping [-u ssh_username]
 
# for example
ansible algorithmia-bastion -m ping -u centos

Run a Shell Command on an Ansible Host

# use -a option followed by the command
# add -u if using non-default user
ansible $ansible_host_name -a "free -h" [-u ssh_username]
 
# for example
ansible algorithmia-bastion -a "free -h" -u centos

Run Ansible Playbook on Limited Subset of Hosts

# runs site playbook only on these hostname patterns
ansible-playbook playbooks/site.yml -l "tmbtest-*,ai-*,delphi-*"

Run Tag Limited Subset of Ansible Tasks

This requires tagging of tasks in your role or playbook. Note this code block in our base role:

- name: "base | install and configure prometheus node_exporter"
  import_tasks: node_exporter.yml
  tags:
    - monitoring

To install the Prometheus client on all my hosts without running the rest of base, I run the following:

ansible-playbook playbooks/base.yml -t monitoring

What’s Next

This code and process is under heavy development as of January 2022. Here are some things on our roadmap. If you don’t see something you want, please add it to the list.

  1. Employ an automated CI/CD driven workflow that automates deployments when changes are merged.
  2. Ansible Control Machine from which to run Ansible commands without cumbersome workstation setup OR
  3. Dockerized setup from which to run Ansible commands without cumbersome workstation setup.
  4. Replace SSH with AWS Session Manager.
  5. Replace local Linux user management with IAM/SSO integration.