ACE Platform AWS S3 Bucket Design

This document is out of date; please see the AWS S3 Buckets document.

Summary

Over the past few years we’ve managed AWS S3 object storage in a largely reactive fashion, which has led to inflexibility, operational complexity, and end-user confusion. This new design model is our attempt to remedy that.

Bucket Classes

We’ve thus far come up with three S3 bucket classes that capture existing customer requirements. In this section we’ll define and exemplify each class.

Please note that it’s likely that requests will arise that don’t perfectly fit into one of our existing classes. When edge cases arise, we’ll do one of the following:

  1. create a new class if the request proves unique and common enough to warrant it.
  2. classify the bucket according to best match and document the exception in this document’s FAQ section.

Application Bucket

Definition

S3 bucket created for the exclusive use of an application.

Access Model

  • Access will be granted to the application in accordance with its requirements.
  • User access will be handled by the application itself.
  • Direct user access via IAM will require an approved exception.
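As a hypothetical sketch (not the actual provisioning code, which lives in the application’s Terraform workspace), granting an application’s IAM role access to its bucket might look like:

```shell
# Hypothetical: attach an S3 access policy to the application's IAM role.
# Role name, policy name, and policy file are illustrative placeholders.
aws iam put-role-policy \
  --role-name my-app-role \
  --policy-name my-app-s3-access \
  --policy-document file://app-bucket-policy.json
```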

Provisioning Model

Bucket provisioning code and IAM permissions are managed in the same Terraform workspace as the rest of the application components. This simplifies provisioning and auditing by keeping all of the application’s requirements colocated in a single Terraform directory.

Example

AI Spell Platform

  • s3://ai-spell-prod-eee4461d.
  • All Spell application provisioning code, including the S3 bucket and its permissions, is contained here.
  • User access to bucket contents managed by the Spell application. Object actions (e.g. copy) are done via the spell CLI and leverage the application’s attached permissions.

Team Bucket

Definition

S3 bucket created for the exclusive use of a team. This bucket, unlike the others, is created with a standard structure that somewhat mimics a traditional Linux /home directory.

.
├── public          # for shared objects that are `read/write` accessible to all team-members
└── users           # parent (/home) directory containing per-user subdirectories
    ├── bob         # bob's home directory
    ├── alice       # alice's home directory
    └── ...

Access Model

  • AWS users will have access to a team bucket if they’re in the AWS IAM user group associated with that team.
  • Team IAM groups and team buckets will follow the same naming standard, as depicted in the Example section below.
  • All team-members will have read/write access to the /public directory.
  • All team-members will have a personal directory under /users and will have read/write access to that directory.
  • All team-members will have read access to each other’s personal directories.
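To make the access model concrete, here is a minimal sketch of the per-user IAM policy those rules imply, using the gred-mlp-team bucket from the Example below and a hypothetical user alice; the real policies are generated by Terraform in the team’s workspace:

```shell
# Sketch of a per-user IAM policy implementing the team-bucket access model.
# The user "alice" is illustrative; real policies come from Terraform.
cat > /tmp/team-bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadWholeBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::gred-mlp-team", "arn:aws:s3:::gred-mlp-team/*"]
    },
    {
      "Sid": "WritePublicAndOwnHome",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::gred-mlp-team/public/*",
        "arn:aws:s3:::gred-mlp-team/users/alice/*"
      ]
    }
  ]
}
EOF
python3 -m json.tool /tmp/team-bucket-policy.json > /dev/null && echo "policy is valid JSON"
```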

Provisioning Model

  • IAM users/groups are managed here.
  • Team buckets are managed in dedicated Terraform workspaces, as depicted in the Example section below.

NOTE

Users added to a group AFTER the S3 bucket has been created require two Terraform applies to activate:

  1. Apply the iam-user-mgmt workspace to create the user and add them to the IAM group.
  2. Apply the s3-nnnn-team workspace to detect the new user and set up their IAM permissions and personal directory.

As of 23-Sep-2021 we have not set up run triggers to automate the dependent runs. We plan to do this in the near future.
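The two dependent applies from the note above can be sketched as follows (the local directory layout is illustrative; in practice these are Terraform workspace runs):

```shell
# 1) Create the user and add them to the IAM group.
cd iam-user-mgmt && terraform apply

# 2) Pick up the new user: create their IAM permissions and personal directory.
cd ../s3-nnnn-team && terraform apply
```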

Example

MLP Team

Layout:

2021-09-22 18:04            0  s3://gred-mlp-team/public
2021-09-22 18:04            0  s3://gred-mlp-team/users/kayas7
2021-09-22 18:04            0  s3://gred-mlp-team/users/khormala
2021-09-22 18:04            0  s3://gred-mlp-team/users/qiul13
2021-09-22 18:04            0  s3://gred-mlp-team/users/wanc13
2021-09-22 18:04            0  s3://gred-mlp-team/users/zewden

  • IAM user/group provisioning code
  • Bucket provisioning code

Dataset Bucket

Definition

S3 bucket created to host a specific dataset. This class of bucket was created to address some gaps in our previous model:

  • Datasets are particularly sensitive and their access must be managed carefully.
  • Dataset access requirements span teams so we can’t host such data in a team bucket.
  • Our previous (single bucket for everything) model required extremely complex IAM rules to control access to these datasets.
  • Datasets should be (largely) immutable and thereby require a special S3 feature, S3 Object Locking, to restrict change.

Access Model

  • Default dataset access is WriteOnceReadMany (WORM).
  • Object locking is configured for Governance Mode with a data retention of 1000 days. Governance mode enables administrator override of object locking, thus enabling auditable object deletion under special circumstances.
  • Datasets will be assigned a dataset owner. Only dataset owners will have permission to change datasets and all changes will be logged.
  • Users/groups will be granted access to datasets upon completion of requisite training and upon approval from dataset owners. Access must be requested and approved via Gitlab issue.
  • Applications will be granted access to datasets in accordance with business requirements and as approved by dataset owners. Access must be requested and approved via Gitlab issue.
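For illustration, the Governance-mode retention described above corresponds roughly to the following AWS CLI call; the bucket name is a placeholder, and in practice this is configured via Terraform (object locking must be enabled when the bucket is created):

```shell
# Sketch: Governance-mode object locking with a 1000-day default retention.
# Bucket name is a placeholder; real dataset buckets are provisioned via Terraform.
aws s3api put-object-lock-configuration \
  --bucket gred-cids-dataset-example \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 1000}}
  }'
```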

Provisioning Model

  • Dataset buckets are managed in dedicated Terraform workspaces, as depicted in the Example section below.
  • The naming convention is gred-cids-dataset-{dataset name}.
  • Access provisioning for users is done via a bucket policy colocated with the bucket provisioning code, thereby enabling us to answer the question, “which users have access to this dataset?”
  • Access provisioning for applications is presently managed with the application to keep all application provisioning logic colocated, but this may change to the bucket policy so that assessment of who or what has access to data can be answered in one location.
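A quick sketch of the naming convention above (“example” stands in for a real dataset name):

```shell
# Derive a dataset bucket name from the naming convention.
# "example" is a placeholder dataset name, not a real dataset.
dataset_name="example"
bucket="gred-cids-dataset-${dataset_name}"
echo "${bucket}"
```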

Example

COPDGene Dataset

FAQ

What about projects that span teams?

As noted in the document, dataset buckets are designed to span teams, BUT they’re also expected to be (mostly) immutable so won’t perfectly solve this problem. We’re recommending addition of a 4th bucket class for this: Project Buckets. Thanks to Arindam for this recommendation. When I formally add Project Buckets to this document, I’ll remove or amend this question/answer.

What controls are in place for S3Fuse-mounted buckets on EC2 instances?

We’ve recently improved the group access management workflow on our EC2 machines so we can configure S3Fuse mounts accordingly via /etc/fstab. Where access spans groups we can consider group aggregation or bucket multi-mounts.
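For illustration, an S3Fuse mount configured via /etc/fstab might look like the entry below; the bucket name, mount point, and credential path are placeholders, not our actual configuration:

```
# Hypothetical /etc/fstab entry for an s3fs mount (placeholders throughout).
s3fs#gred-mlp-team /mnt/s3/gred-mlp-team fuse _netdev,allow_other,passwd_file=/etc/passwd-s3fs 0 0
```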

What is our data retention story for team buckets?

Thanks to Mohsen for this question. We’re pondering and will answer soon.

What’s the workflow for manipulating datasets if I can’t write to the dataset bucket?

  • The dataset bucket is your immutable source of truth. You read it, but don’t change it.
  • If you’re a user, you read a dataset from the dataset bucket and write your processed data to your team bucket, either /public or /users/{your_personal_directory}.
  • If you’re an application, you read a dataset from the dataset bucket and write your processed data to your application bucket.
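For example (bucket names and paths are illustrative placeholders):

```shell
# Read from the (immutable) dataset bucket...
aws s3 cp s3://gred-cids-dataset-example/raw/input.csv /tmp/input.csv
# ...process locally, then write results to your team bucket.
aws s3 cp /tmp/output.csv s3://gred-mlp-team/users/alice/output.csv
```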

How do I grant a non-team-member access to objects in my team bucket?

If a team-member wants to grant access to a non-team-member, the team-member must request this via Gitlab issue and provide the following:

  • Name and UNIX ID of user to whom access should be granted.
  • Business reason for this access.
  • Duration of access.
  • Object path(s) to which access is required.

We have an example of this.
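While the Gitlab template doesn’t exist yet, a grant like this might translate to a time-bound IAM statement along these lines; the path and expiry date are illustrative placeholders, not our actual implementation:

```shell
# Sketch: a time-bound read grant for a non-team-member (placeholders throughout).
cat > /tmp/guest-access.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "TemporaryGuestRead",
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::gred-mlp-team/public/shared-report/*"],
    "Condition": {"DateLessThan": {"aws:CurrentTime": "2021-12-31T00:00:00Z"}}
  }]
}
EOF
python3 -m json.tool /tmp/guest-access.json > /dev/null && echo "valid JSON"
```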

Infra Team TODO: Create a Gitlab template for non-member access to team-bucket resources

How do I mount an S3 bucket?

In this section, we’ll describe how to mount an Amazon S3 file system step by step. Mounting an Amazon S3 bucket using S3FS is fairly straightforward.

Step 1: Installation

For macOS:

brew install --cask osxfuse
brew install s3fs

For Ubuntu:

sudo apt-get install s3fs

Step 2: Configuration

  1. Once S3FS is installed, set up the credentials as shown below:
echo ACCESS_KEY:SECRET_KEY > ~/.passwd-s3fs
cat ~/.passwd-s3fs

You will also need to set the right access permissions on the passwd-s3fs file for S3FS to run successfully. To do that, run the command below:

chmod 600 ~/.passwd-s3fs
  2. Now we’re ready to mount the Amazon S3 bucket. Create a folder to which the Amazon S3 bucket will be mounted:
mkdir -p ~/s3/sandbox
s3fs gred-ai-sandbox-prod ~/s3/sandbox/

For extra debugging in case something goes wrong:

s3fs gred-ai-sandbox-prod ~/s3/sandbox -o dbglevel=info -f -o curldbg -o passwd_file=${HOME}/.passwd-s3fs