Large Data Transfer & DataSync

Summary

This document provides context on using the AWS DataSync service to transfer large volumes of data within our infrastructure or to bring new datasets from other providers into our account.

work in progress

Motivation

Before this module existed, data transfers were usually performed manually by someone on the infra team or the requesting team. Transferring large volumes of data, however, calls for automation to avoid long waits and manual retries. The AWS DataSync service is designed to transfer such volumes between cloud providers.

Please refer to the AWS DataSync documentation for more information. To simplify the process of creating new DataSync tasks, we developed a Terraform module that avoids repetitive work. The module is currently hosted at terraform-ace-prod/modules/aws-datasync (TODO: host it in its own repository for versioning).

Modules

At the moment we have three modules for creating data transfer tasks. For full information about the variables, refer to their descriptions in each module.

Agent

DataSync uses an agent to read and write data from and to external providers; the module runs the agent on an EC2 instance inside the VPC you specify.

To create a new agent inside our account, use the following syntax:

module "datasync_service" {
  source = "../../modules/aws-datasync/"
 
  ec2_instance_state = "stopped"
  name               = local.name
  vpc_id             = data.aws_vpc.this.id
  ec2_key_pair       = local.ssh_key_name          
  additional_cidrs   = ["10.0.0.0/8"]
}
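
The module exposes the activated agent's ARN as an output (module.datasync_service.datasync_agent_arn); transfer modules that read from external providers consume it, as shown in gcp_bucket_to_s3_transfer below.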

s3_to_s3_transfer

These transfers do not require an agent. However, make sure the transfer does not exceed the quota of 25 million files per task execution.

module "datasynctask" {
  source = "../../modules/aws-datasync/s3_to_s3_transfer_task"
 
  name = "clinical"
 
  source_bucket_name  = "mybucket"
  source_subdirectory = "/subdir/" # these two variables define the source
 
  destination_s3_bucket_name  = "destination_bucket"
  destination_s3_subdirectory = "/subdir/"  # these two variables define the source
}
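
If a source prefix holds more files than the quota allows, one way to stay under it is to split the work across several tasks, one per subdirectory. A minimal sketch using the same module, assuming hypothetical year-based prefixes:

module "datasynctask_2019" {
  source = "../../modules/aws-datasync/s3_to_s3_transfer_task"

  name = "clinical-2019"

  # One task per prefix keeps each execution under the per-execution file quota.
  source_bucket_name  = "mybucket"
  source_subdirectory = "/2019/"

  destination_s3_bucket_name  = "destination_bucket"
  destination_s3_subdirectory = "/2019/"
}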

After applying this module, start the transfer execution via the AWS console.
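
Alternatively, the execution can be started from the AWS CLI (assuming the task ARN is at hand, e.g. from the module's outputs): aws datasync start-task-execution --task-arn <task-arn>.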

gcp_bucket_to_s3_transfer

This module transfers data from a Google Cloud Storage bucket into S3. Because it reads from an external provider, it requires an agent (see Agent above).

module "datasync_gcp_to_s3_task" {
  source = "../../modules/aws-datasync/google_bucket_to_s3_task/"
 
  agent_arn = module.datasync_service.datasync_agent_arn # needs the agent arn to use
  name      = "my-transfer"
 
  destination_s3_bucket_name  = "mybucket"
  destination_s3_subdirectory = "/subdir/"
 
  source_bucket_name  = local.mimic_gcp_bucket_name
  source_subdirectory = "/${local.mimic_gcp_bucket_name}/"
  secret_key          = "secret_key"
  access_key          = "access_key"
}
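
The access_key and secret_key here are presumably a GCS HMAC key pair (created under Settings → Interoperability in the Google Cloud console), since DataSync reads from Google Cloud Storage through its S3-compatible API. As with the S3-to-S3 task, start the execution via the AWS console (or CLI) after applying.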