Summary
Ansible is our tool of choice for installing and configuring Linux servers. Ansible code is managed as follows:
| Code | GitHub Location | Description |
|---|---|---|
| Ansible Runtime | gred-ecdi/aws-infra-live | - Ansible playbooks, inventory and group/host variables - Ansible is run from here to configure servers |
| Ansible Roles | gred-ecdi/ansible-role-* | - GitHub repos hosting Ansible roles - Each role has its own repository to enable versioning and testing - Roles are consumed via Ansible Runtime |
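Versioned roles are typically consumed by pinning them in the runtime repository's requirements.yml (the same file referenced in the howto section below). An illustrative sketch only; the role names, URLs, and version tags here are hypothetical, not the actual file contents:

```yaml
# requirements.yml in gred-ecdi/aws-infra-live (illustrative sketch)
roles:
  # internal role repo, pinned to a tagged release (name and tag hypothetical)
  - name: base
    src: git+https://github.com/gred-ecdi/ansible-role-base.git
    version: v1.0.0
  # upstream role used for user management
  - src: singleplatform-eng.users
```

Pinning each role to a tag is what makes the one-repo-per-role layout pay off: the runtime repo controls exactly which role version runs in each environment.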
Roles
Instructions on how to use Ansible Molecule to develop Ansible roles are coming soon.
Inventories & Variables
Summary
As of February 2022, for security and predictability reasons, we use separate inventories and variable sets for each of our environments.
| Environment | Directory | Description |
|---|---|---|
| Development | environments/dev | Legacy name for our non-production AWS environment in our production AWS account. It’s being phased out in favor of test and this directory is now just a symlink to test. |
| Test | environments/test | Non-production environment in our production AWS account. |
| Production | environments/prod | Production environment in our production AWS account. |
Understanding the logic behind the decision to keep distinct inventories for each environment requires familiarity with Ansible’s group intersection problem. This blog post identifies the problem perfectly and recommends a number of solutions, the last of which we adopted here.
Working with Multiple Environments
Our Ansible environment is configured to default to non-production. This enables us to test changes before rolling them out to production.
So by default, your Ansible commands are only going to reflect non-production.
For example, below I’ll run ansible-inventory to retrieve my Talend servers (Application = "talend").
Default behavior shows non-production only:
$ ansible-inventory --graph app_talend
@app_talend:
|--gcore-talend-dev
|--gcore-talend-dev-job1
|--gcore-talend-uat
|--gcore-talend-uat-job1

Add the `-i path/to/inventory` option to show other environments; for example, production:
$ ansible-inventory --graph app_talend -i environments/prod
@app_talend:
|--gcore-talend-prod
|--gcore-talend-prod-job1
|--talend-app-prod-bastionhost

User/Group Management
Overview
At present, we deploy local (Linux) system users to each system in our fleet. Users are processed like so:
- Our base playbook is configured to run on all hosts.
- This playbook calls our base role.
- The base role includes a tasks/users.yml file, which generates a data structure of authorized users and passes it to an upstream role to process user adds, removes, and changes.
- The upstream role that does all the heavy lifting is singleplatform-eng/ansible-users.
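The flow in tasks/users.yml can be sketched roughly as follows. This is an illustrative sketch only, not the actual file: the task names and exact filtering logic are assumptions, and the upstream role name is shown as it would appear when installed from Galaxy.

```yaml
# tasks/users.yml (illustrative sketch, not the real file)
---
# Keep only active users whose group membership overlaps this
# host's assigned groups (base_host_groups).
- name: base | build list of users authorized for this host
  ansible.builtin.set_fact:
    users: "{{ users | default([]) + [item] }}"
  loop: "{{ base_active_users }}"
  when: item.groups | intersect(base_host_groups) | length > 0

# Hand the resulting data structure to the upstream role, which
# performs the actual add/remove/change operations.
- name: base | apply user changes via upstream role
  ansible.builtin.include_role:
    name: singleplatform-eng.users
```

The key idea is that the base role only assembles data; all the user-management heavy lifting stays in the upstream role.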
Managing Linux users in this context is a matter of populating user variables to achieve your desired outcome. Below, I’ll describe those variables.
| Variable | File | Details |
|---|---|---|
| base_active_users | environments/000_globals/all/users_active.yml | Global list of active users eligible for installation on our server fleet, depending on their group assignment. |
| groups_to_create | environments/000_globals/all/groups.yml | Global list of user groups that get installed on all servers. |
| base_host_groups | environments/(prod\|test)/group_vars/app_{app_name}.yml | This variable is a list containing the names of user groups to install on particular servers, as defined by the application class. |
| base_sudo_users | environments/(prod\|test)/group_vars/app_{app_name}.yml | This variable is a dictionary that defines names and sudo parameters for user groups that should have sudo access on particular servers, as defined by the application class. |
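For reference, the two group_vars variables might sit together like this. The exact shape of base_sudo_users (its inner key names and sudo options) is an assumption here, so check the base role's defaults before copying:

```yaml
# environments/test/group_vars/app_talend.yml (illustrative sketch)
---
# install these user groups (and their members) on app_talend servers
base_host_groups: ["talend-team"]
# grant sudo to a group; the inner field names below are hypothetical
base_sudo_users:
  talend-team:
    sudo_options: "ALL=(ALL) NOPASSWD: ALL"
```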
Let’s walk through a few examples.
Add a user
Before adding a new user, you must ensure you have the following information:
- Roche assigned UNIX username
- Full name, department and role
- SSH public key(s)
- Assigned user class(es) (defined as groups in `environments/000_globals/all/groups.yml`)
Armed with this information, you just need to open environments/000_globals/all/users_active.yml and follow the existing pattern. There are just a few caveats to note:
- Entries are in UID order so please stick to this pattern.
- Infra-Team (admin) users are UIDs 2000-2099 and are at the top of the file. When adding a new Infra-Team user, find the last one (entry before 2100) and add them there. This isn’t a hard rule, but it’s a pattern that we’d like to keep if we can.
- All other users are UIDs 2100-2999. If adding a non-admin user, just add them to the bottom of the file.
- Be sure NOT to duplicate UIDs or you will break things. When copying and pasting from a previous entry, double check that you’ve updated the UID.
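A quick way to catch a duplicated UID before committing is a small shell check. This is a hypothetical helper, not part of the repo; it runs here against a tiny sample file for illustration:

```shell
# Build a small sample in the users_active.yml style (hypothetical data).
cat > /tmp/users_sample.yml <<'EOF'
- username: alice
  uid: 2001
- username: bob
  uid: 2002
- username: carol
  uid: 2001
EOF

# Extract every "uid: N" line; any line printed by `uniq -d` is a duplicate.
grep -oE 'uid: [0-9]+' /tmp/users_sample.yml | sort | uniq -d
# prints: uid: 2001
```

Point the grep at environments/000_globals/all/users_active.yml to check the real file; no output means no duplicates.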
Here is what my user entry looks like:
- username: bushnet1
name: Todd Michael Bushnell, External, gRed Informatics GCGAC
groups: ['infra-team']
uid: 2001
ssh_key:
- "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQD6eFho5Y+OerNrWkk/XgN/sihlkwRAyHAXzr5qqkVxR5JwV2WKflwDt9SibcjXYjKOcss/fEZJYf8BRm9r01EXzjbGnapq9Ixwm2vz6hKP8hnWvhKCB89Vf9+oUxBL/WmRwhzyrd/Lqrh3WhXUfrkHuGpMXg5toLYEtHRqmeIvz2xRoBS+gNk6/wDfkjETnIkdi2y2r2CKIFAMY6e6Dq1mYgIYSyeaUHthkOxCSrtjrhjG2DjIPr4huSV3EJR8a1ij3b8iFjU4ydc2KkU9MdN6kMoxV8N2fEWQ/c+WYsM87RcQISKkN7qLpbTTJA4LBNeKIgLaInUWzu6TBUQjSQ86Vaq5s3kqDCem2jxin1MsXwqmI674p58RJIs41NwqZ7k6JEczCfp0GH2mV5vkt+9ZdMrEsMBJQuvv981aYrnEuDTjawOtvU9+a9iBJafUG/t6Sx09QQbA7Jt+oHhH5Zu5cAkHDr9YzJ5wHvJP7klk++1ogsuKVqKIkr8FedBe0E+vfM6hrrfMmg+R3XMJEanYKHWAl//OvaM6ZT6d/ZKkqOhl3PDU9ZYi9qmhPNS5DqOOR+ebaZs/NMa33AB8zOv8MFRIBfmCoacZnwjQtskSegJocflbItbSDTezWW8OKjJL1LvE2UBGKd8GFVel0IAD8sX8gO68dUkCXR/NUrSYew== todd@strataconsulting.com"

NOTE:
- The `username` field is the UNIX username provided by Roche and should be in the ticket.
- The `name` field translates to the UNIX Gecos field and is in the format `full name, role, department`. You can get this information by searching for the `username` in Gwiz.
- The `groups` field is a list of assigned user groups and is essential to getting a user on the desired servers. Users do not get assigned to servers; groups do. More on this later.
- The `uid` field must, must, must be unique!
- The `ssh_key` field is a YAML list to allow you to add more than one key. Just add them to the list and ensure they’re quoted.
Assuming:

- the user has been assigned to the desired group(s),
- the desired group(s) are defined in `environments/000_globals/all/groups.yml`, and
- the desired group(s) have been assigned to the desired system class(es) via the `base_host_groups` variable,

your user will be distributed to the appropriate systems when you run Ansible. If any of these assumptions isn’t met, there may be additional work to do.
Add a user group
Groups are defined globally (for all environments) in environments/000_globals/all/groups.yml. To add a new group, just open the file and add your group to the bottom of the file following the established pattern and noting the caveats at the top of the file.
# We create same groups on all systems, whether populated or not
# using 600-999 for ACE groups
# NOTE: pattern for application group accounts is $app-team
# This enables us to differentiate between human members of an
# application team and the application (system) user/group
# Use case: I want members of delphi-users to be granted sudo, but
# I don't want the delphi system user/group to also get sudo
groups_to_create:
- name: infra-team
gid: 600
- name: dev-team
gid: 601
- name: talend-team
gid: 602
- name: delphi-team
gid: 603
- name: imo-team
gid: 604
<your group goes here>

Add a user group (and its users) to a server class
We classify our servers based on their Application tag. For example, servers tagged with Application: Talend are automatically grouped in Ansible as app_talend by our AWS dynamic inventory configuration.
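That automatic grouping typically comes from a keyed_groups rule in the aws_ec2 dynamic inventory plugin configuration. A minimal sketch; the filename, region, and surrounding settings are assumptions, not our actual inventory file:

```yaml
# environments/test/aws_ec2.yml (illustrative sketch)
plugin: amazon.aws.aws_ec2
regions:
  - us-west-2   # hypothetical region
keyed_groups:
  # tag Application: Talend -> Ansible group app_talend
  - key: tags.Application | lower
    prefix: app
```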
# ansible inventory command returning (non-prod) talend servers
$ ansible-inventory --graph app_talend
@app_talend:
|--gcore-talend-dev
|--gcore-talend-dev-job1
|--gcore-talend-uat
|--gcore-talend-uat-job1
# ansible inventory command returning (prod) talend servers
$ ansible-inventory --graph app_talend -i environments/prod
@app_talend:
|--gcore-talend-prod
|--gcore-talend-prod-job1
|--talend-app-prod-bastionhost

So now that we’ve established the application class, let’s say we want to add a user to the Talend servers. To do that we need to:
- Select (or create) a group to associate with the Talend servers
An obvious choice (and one that already exists) is to use this existing group
# environments/000_globals/all/groups.yml
- name: talend-team
gid: 602

- Associate the group with the server class
This is done by setting the base_host_groups variable in the inventory file associated with the talend server class.
- `environments/test/group_vars/app_talend.yml` for non-production
- `environments/prod/group_vars/app_talend.yml` for production
Here’s what it will look like for non-prod:
# environments/test/group_vars/app_talend.yml
---
base_host_groups: ["talend-team"]

Repeat this pattern for your desired server class.
- Associate the user with the group
This takes us back to our global active users file. Below I’ll provide an example user that will achieve the desired result for our example (talend-team)
# environments/000_globals/all/users_active.yml
base_active_users:
- username: xander
name: Xander Mathews, Solutions Architect, ECD Information Mgmt Office GEGDF
groups: ['dev-team', 'imo-team', 'talend-team', 'alation-team']
...

Note that this user is in multiple user groups, which is perfectly acceptable. The key is that the `groups` list includes `talend-team`.
You now have:
- a user assigned to a user group
- that user group associated with a server class
The next time you run the Ansible base playbook, your changes will be deployed. Read on for instructions on how to deploy your changes to just your server class.
Remove a user
Removing users is a little less intuitive than adding a user, but it’s simple. For additional context please read the upstream role instructions.
Removing a user is a matter of:
- removing the user’s entry from `environments/000_globals/all/users_active.yml`.
- adding a user entry to `environments/000_globals/all/users_deleted.yml`.
- committing/pushing/pulling changes and running the Ansible base playbook on all servers to effect the change.
I’ll demonstrate the process for Adam’s removal. Please see ace/roadmap#840 for the Git diffs.
1. Remove user entry from users_active
# git diff 000_globals/all/users_active.yml
- - username: fouaa1
- name: AbdelRahman (Adam) Fouad, External, ECD Artificial Intelligence
- groups: ['infra-team']
- uid: 2005
- ssh_key:
- - "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCoYFl11Mvmn+0rO+1fMKNVhJHQF1K+b09JHp3z2aCyoEGrAXTeJpV6J6k0gBVIXpf8efAxNX8z8odtRPEHbKeiEfwqCOTZoC2IQI7PKrQozc8yfuJ+hpXRcaxqYURnNAXfVOdHa4/pD/R4R3p9FQp/1EaRry643+8IR+DofxyAQazLGz3MmFfBtJnmDG0CGtGp4XbuMh03pzI2tLI90CYLRr/XJUyP66Ul88Fjm8n6woiazX4SZGuj/JuauXRYgW6ZJveasYaZoaKFt5kAfZKoUk9fvl/rBjCn8oGvO4xN/4ROhOkOn++QClsxRSqHvXjSZTb9W+bV9MpFPs/vH3lb akhaled@akhaled-XPS-13-9350"
+# do not use UID 2005 as this was Adam's and is now represented in users_deleted.yml

Note that I added a comment regarding reuse of his UID. This was a unique circumstance in that Adam was coincidentally the last Infra-Team user added, and so without this entry we’d accidentally reuse his UID.
2. Add user entry to users_deleted
# git diff environments/000_globals/all/users_deleted.yml
+ - username: fouaa1 # Adam Fouad (ace/roadmap#840)
+ uid: 2005
+ remove: true
+ force: false

3. Run Ansible base playbook to effect change
I’m running my changes from my test server which is acting as an Ansible Control Machine (ACM) for now.
# login to tmbtest-svcs1.gred.ai as myself
$ ssh -A bushnet1@tmbtest-svcs1.gred.ai
# change into Ansible home directory (set as an environment variable in my bashrc)
$ cd $ANSIBLE_HOME
# run poetry shell to activate my python environment
$ poetry shell
# git pull changes
$ git pull
# run ansible on non-prod environment
$ ansible-playbook playbooks/base.yml -t users,sudo
# run ansible on prod environment
$ ansible-playbook playbooks/base.yml -t users,sudo -i environments/prod

Note: As is often the case when you run Ansible across our fleet of servers, you may encounter failure reports (inaccessible servers, Python interpreter errors, etc.). Please resolve these issues as part of this task, or this tech debt will pile up and create more issues down the road. Thank you.
Deploy your user updates to your desired server class
Our Ansible base playbook is responsible for running all base system tasks, including those related to user management. Sometimes you want to deploy your change without running all Ansible playbook tasks on all of our servers.
The example below runs only our user management tasks on just our Talend servers. Change the instance class for your use case.
# run this command from the ansible directory
# -t users restricts the run to only the user management tasks
# -l limits the run to our talend servers
# -i inventory option only required for production servers (default to non-prod)
ansible-playbook playbooks/base.yml -t users -l app_talend [-i environments/prod]

Misc Howto
This section needs some cleanup, but there is some decent info here, so we’re keeping it. I’ll clean it up ASAP.
Deploy Prometheus Node Exporter
Prometheus Node Exporter (client software) is part of our base role. To deploy to an Ansible host without running the entire base role you can do this:
ansible-playbook playbooks/base.yml -t monitoring -l "{your_host_pattern}"
# for example, this deploys to tmbtest-* systems
ansible-playbook playbooks/base.yml -t monitoring -l "tmbtest-*"

List all Inventory Hosts
# summary graph structure (less verbose)
ansible-inventory --graph
# detailed yaml structure (more verbose)
ansible-inventory --list --yaml

Install Dependent Roles/Collections
Run this command to install all dependencies listed in requirements.yml:
- the first time you use this repository
- after updating `requirements.yml`
ansible-galaxy install -r requirements.yml --force

Ping an Ansible Host
# run ping module against a host to check connectivity
ansible $ansible_host_name -m ping [-u ssh_username]
# for example
ansible algorithmia-bastion -m ping -u centos

Run a Shell Command on an Ansible Host
# use -a option followed by the command
# add -u if using non-default user
ansible $ansible_host_name -a "free -h" [-u ssh_username]
# for example
ansible algorithmia-bastion -a "free -h" -u centos

Run Ansible Playbook on Limited Subset of Hosts
# runs site playbook only on these hostname patterns
ansible-playbook playbooks/site.yml -l "tmbtest-*,ai-*,delphi-*"

Run a Tag-Limited Subset of Ansible Tasks
This requires tagging of tasks in your role or playbook. Note this code block in our base role:
- name: "base | install and configure prometheus node_exporter"
import_tasks: node_exporter.yml
tags:
- monitoring

To install the Prometheus client on all my hosts without running the rest of base, I run the following:
ansible-playbook playbooks/base.yml -t monitoring

What’s Next
This code and process are under heavy development as of January 2022. Here are some things on our roadmap. If you don’t see something you want, please add it to the list.
- Employ an automated CI/CD driven workflow that automates deployments when changes are merged.
- Ansible Control Machine from which to run Ansible commands without cumbersome workstation setup OR
- Dockerized setup from which to run Ansible commands without cumbersome workstation setup.
- Replace SSH with AWS Session Manager.
- Replace local Linux user management with IAM/SSO integration.