Summary
This document provides context on using the AWS DataSync service to transfer large volumes of data within our infrastructure, or to bring new datasets from other providers into our account.
work in progress
Motivation
Before this module existed, data transfers were usually performed manually by someone on the infra or requester team. Transferring large volumes of data, however, calls for automation to avoid long waits and manual retries. AWS DataSync is designed to transfer large volumes of data between cloud providers.
Please refer to the AWS DataSync documentation for more information. To simplify creating new DataSync tasks, we developed a module that avoids repetitive work. The module is currently hosted at terraform-ace-prod/modules/aws-datasync (TODO: host it in its own repository for versioning).
Modules
At the moment we have three modules for creating data transfer tasks. For full details about the variables, refer to their descriptions in each module.
Agent
DataSync uses an agent to read data from, and write data to, external providers.
To create a new agent inside our account, use the following syntax:
```hcl
module "datasync_service" {
  source             = "../../modules/aws-datasync/"
  ec2_instance_state = "stopped"
  name               = local.name
  vpc_id             = data.aws_vpc.this.id
  ec2_key_pair       = local.ssh_key_name
  additional_cidrs   = ["10.0.0.0/8"]
}
```

s3_to_s3_transfer
These transfers do not require an agent. However, we must make sure we do not exceed the quota of 25 million files per transfer execution.
```hcl
module "datasynctask" {
  source = "../../modules/aws-datasync/s3_to_s3_transfer_task"
  name   = "clinical"

  source_bucket_name  = "mybucket"
  source_subdirectory = "/subdir/" # these two variables define the source

  destination_s3_bucket_name  = "destination_bucket"
  destination_s3_subdirectory = "/subdir/" # these two variables define the destination
}
```

After applying this module, start the transfer execution via the AWS console.
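If you prefer not to use the console, the execution can also be started with the AWS CLI. A minimal sketch, assuming the AWS CLI is configured with access to the account; the task ARN below is a placeholder for the ARN created by the module:

```shell
# Start an execution of a DataSync task (replace the ARN with your task's ARN)
aws datasync start-task-execution \
  --task-arn arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0
```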
gcp_bucket_to_s3_transfer
```hcl
module "datasync_gcp_to_s3_task" {
  source = "../../modules/aws-datasync/google_bucket_to_s3_task/"

  # GCP transfers require an agent; pass the ARN of the agent created above
  agent_arn = module.datasync_service.datasync_agent_arn

  name                        = "my-transfer"
  destination_s3_bucket_name  = "mybucket"
  destination_s3_subdirectory = "/subdir/"
  source_bucket_name          = local.mimic_gcp_bucket_name
  source_subdirectory         = "/${local.mimic_gcp_bucket_name}/"

  # Credentials for the GCP bucket; the literal strings below are placeholders.
  # Inject real values from a secret store rather than hardcoding them.
  secret_key = "secret_key"
  access_key = "access_key"
}
```
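As with the S3-to-S3 task, the execution is started from the console or the CLI, and its progress can be checked with the AWS CLI. A sketch with placeholder ARNs (substitute the ARNs of your own task and execution):

```shell
# List recent executions for a task, then inspect the status of one of them
aws datasync list-task-executions \
  --task-arn arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0
aws datasync describe-task-execution \
  --task-execution-arn arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0/execution/exec-0123456789abcdef0
```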