Disaster Recovery Overview

Purpose

The purpose of this document is to detail our disaster recovery (DR) policies and standards (our desired state), along with what is in place at this time (current state).

Objectives

Objective | Description | Defaults (Unofficial)
Recovery Time Objective (RTO) | Amount of time to restore the service after an outage | 24 hours
Recovery Point Objective (RPO) | Amount of acceptable data loss, measured in time | 24 hours

As of November 2020, the defaults are unofficial. Going forward, we'll set global defaults designed to cover most circumstances and provide support for deviations on an application-by-application (or application-class) basis. Such deviations will be requested by application owners.
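As a concrete illustration of how the two objectives interact: with daily backups, the worst-case data loss equals the backup interval, which just satisfies the default 24-hour RPO. A small sketch (the helper names and values are illustrative, not part of any tooling):

```python
# Illustrative helpers for RPO/RTO semantics. The 24-hour values below are
# the unofficial defaults from this document.

def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst-case data loss is the time since the last backup, which is
    bounded by the backup interval."""
    return backup_interval_hours <= rpo_hours

def meets_rto(estimated_restore_hours: float, rto_hours: float) -> bool:
    """Restoration must complete within the recovery time objective."""
    return estimated_restore_hours <= rto_hours

# Daily snapshots against the default 24-hour RPO:
assert meets_rpo(24, 24)      # daily backups just satisfy a 24h RPO
assert not meets_rpo(48, 24)  # every-other-day backups do not
```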

Backup

The first step in any DR plan is data backup. It's important to identify sources of application state and provide a backup solution for each. We can't back up that which we don't know exists.

Elastic Block Storage (EBS)

Amazon EBS is the block storage used by EC2 instances. All instances have a root volume for their operating system. Many store all data files on this single partition, while others may be configured with separate data volumes. Whatever the configuration, it's important that any state stored here gets backed up. If an application only uses EBS for its OS and all important data is stored elsewhere (databases, object stores), EBS backups are probably not required: a failed system can easily be rebuilt with Terraform and reattached to its remote state sources with no data loss. However, many applications still store important data locally on disk, so it's important that we have backup options in place.

Effective November 22, 2020, we’ve implemented Amazon Data Lifecycle Manager and configured a default policy that can easily be leveraged by application owners to achieve baseline EBS backups.

Configuration

Policy Home (AWS Console) | https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Lifecycle:sort=PolicyId
Policy ID | policy-03c6dcc7e7b83d702
Frequency | Every 24 hours
Backup Time | 13:00 UTC (05:00 PST)
Retention Policy | A maximum of 7 snapshots will be retained of a target volume; the oldest snapshot retained will be ≤ 7 days old
Targets | Any EBS volume tagged with Backup = Default
Terraform Source Code | https://code.roche.com/ecdi/gcore-platform/-/tree/stable/aws-infrastructure/prod/us_west_2/backups
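The policy is managed in Terraform, but the equivalent boto3 parameters are a useful reference for what the configuration above actually means. A sketch (the schedule name and role ARN are placeholders, and the API call itself is left commented out):

```python
# Sketch of the default DLM policy as boto3 parameters, mirroring the
# configuration table above: 24-hour interval at 13:00 UTC, retain 7
# snapshots, targeting volumes tagged Backup=Default. Illustrative only;
# the authoritative definition lives in the Terraform source linked above.

def default_dlm_policy_details() -> dict:
    """Build the PolicyDetails structure for dlm.create_lifecycle_policy."""
    return {
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "Default"}],
        "Schedules": [{
            "Name": "DailyDefault",  # hypothetical schedule name
            "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS",
                           "Times": ["13:00"]},   # 13:00 UTC
            "RetainRule": {"Count": 7},           # keep 7 snapshots
        }],
    }

# To create it (requires a DLM execution role; the ARN is a placeholder):
# import boto3
# boto3.client("dlm").create_lifecycle_policy(
#     ExecutionRoleArn="arn:aws:iam::<ACCOUNT>:role/AWSDataLifecycleManagerDefaultRole",
#     Description="Default daily EBS snapshots",
#     State="ENABLED",
#     PolicyDetails=default_dlm_policy_details(),
# )
```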

Usage

If you have data volumes that you want backed up, you just need to identify them to DLM by tagging them accordingly:

Key | Value
Backup | default

NOTE: You must tag the instance volumes. Tagging the instance is not sufficient.
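One way to make sure the volumes themselves carry the tag is to enumerate every volume attached to an instance and tag each one. A boto3 sketch, assuming default credentials (the instance ID would be your own; `tag_instance_volumes` is a hypothetical helper, not existing tooling):

```python
# Tag every EBS volume attached to an instance so the default DLM policy
# picks them up. DLM matches on volume tags, so tagging the instance
# itself is not sufficient.

def backup_tag_request(volume_ids):
    """Build the kwargs for ec2.create_tags covering the given volumes."""
    return {
        "Resources": list(volume_ids),
        "Tags": [{"Key": "Backup", "Value": "default"}],
    }

def tag_instance_volumes(instance_id):
    import boto3  # deferred so the pure helper above works without boto3
    ec2 = boto3.client("ec2")
    # Find every volume attached to the instance...
    vols = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )
    volume_ids = [v["VolumeId"] for v in vols["Volumes"]]
    # ...and tag all of them for the default DLM policy.
    ec2.create_tags(**backup_tag_request(volume_ids))
    return volume_ids
```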

Visit this link for a current list of backup targets.

The screenshot below shows the list at the time of writing. <!-- image: image (from original wiki uploads) -->

Visit this link for a current list of snapshots created by this policy.

The screenshot below shows the list at the time of writing. <!-- image: image (from original wiki uploads) -->

Recovery

Coming soon

Algorithmia

NOTE: This section (except for “Supplementary Backup Coverage”) migrated from the AWS Backups, Migration, and Recovery document provided to us by Algorithmia.

General

There are 3 essential pieces of persisted data within Algorithmia Enterprise:

  • MySQL Database: The database houses much of the application state, including users. The database is configured to create a new snapshot daily and keep the most recent 7 snapshots. Additionally, snapshots can be triggered manually as needed, including during maintenance windows for scheduled upgrades or hotfixes. NOTE: See Issue #272 for a config state discrepancy.
  • Object Store: Hosted data and algorithm bundles are stored in the algorithmia-$UID-$STAGE-$REGION-data (link) S3 bucket. As S3 has very high availability SLAs, no additional backups are configured.
  • Git Repositories: The actual algorithm code is stored within git repositories on an EBS volume attached to the Legit instance (100GB: /dev/sdf -> nvme1n1 -> /algorithmia/legit). This volume is configured to create daily snapshots as backups. NOTE: See Issue #273 for a config state discrepancy.

All of the above persisted data can be migrated to a new Algorithmia Enterprise cluster using the standard installer.

NOTE: There are other stateful services involved in platform operation (e.g. logs, metrics, Redis, search indices), however these services can tolerate data loss with minimal impact to any recovery scenario.

Recovering from backups is accomplished using the installer for the version that is or was running, and following the migration steps below.

NOTE: The installer generates a deployment backup tarball that must be kept in a secure location. The encryption secrets it contains may not be recoverable if this archive is lost.

Migration Requirements

To migrate to a new cluster (or recover from backups) you will need:

  1. EBS snapshot of the git volume attached to the legit instance. If migrating to a new region and/or account, copy this snapshot to the correct region and/or share it to the new account.
  2. RDS snapshot of the database - if migrating to a new region and/or account, copy this snapshot to the correct region and/or share it to the new account. Note that automatic RDS snapshots require copying the snapshot before you can share it with a different account.
  3. S3 bucket containing Algorithmia data (algorithmia-$UID-$STAGE-$REGION-data).
  4. Config file for the web-server (configs/frontend/application-$STAGE.conf in your deployment tarball).
  5. IAM permissions to perform the migration. These are the standard deployment permissions, plus permission to list and read all the objects in the S3 bucket containing Algorithmia data. Refer to S3 Account Transfer for full details, but you will generally want to attach the following policy to the S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<DESTINATION AWS ACCOUNT NUMBER>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<SOURCE DATA DIR>/*",
        "arn:aws:s3:::<SOURCE DATA DIR>"
      ]
    }
  ]
}

NOTE: the tuple (stage, region, aws account) must be unique for any existing installations. If you want to launch the new stage in the same region and AWS account, you will either need to use a different stage name or destroy all the AWS resources from the old stage (except the snapshots and S3 buckets mentioned above).
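The placeholder policy above can be rendered programmatically before attaching it to the source bucket. A sketch (the account number and bucket name are example values, not real ones):

```python
import json

def delegate_s3_access_policy(dest_account: str, bucket: str) -> str:
    """Render the cross-account read policy from this document,
    substituting the destination AWS account and source data bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DelegateS3Access",
            "Effect": "Allow",
            "Principal": {"AWS": dest_account},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                f"arn:aws:s3:::{bucket}/*",  # objects
                f"arn:aws:s3:::{bucket}",    # bucket (for ListBucket)
            ],
        }],
    }
    return json.dumps(policy, indent=2)

# Example with placeholder values; attach the rendered document via the
# S3 console or s3.put_bucket_policy:
doc = delegate_s3_access_policy("111122223333", "algorithmia-example-data")
```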

Migration Procedures

The entire migration process is handled by the standard installation process, using different responses to some of the installation prompts. Using the same algorithmiahq/codex-install version as the backed-up cluster, simply follow the standard installation instructions until asked if you wish to Migrate existing git/database/snapshot?. Select y and you will be prompted for several additional details:

  1. Existing EBS Snapshot ID for Git: This is the snapshot ID of the volume that was attached to the legit instance (containing git repositories).
  2. Existing RDS Snapshot ARN: This is the snapshot ARN of the RDS instance that stored application state.
  3. Database Encryption Key ID: This is the algorithmia.simple_crypto.key_1_id value in configs/frontend/application-$STAGE.conf. This is needed so that a new deployment can still decrypt certain data in the database. This value is almost always 1.
  4. Database Encryption Key: This is the algorithmia.simple_crypto.key_1_key value in configs/application-$STAGE.conf. This is needed so that a new deployment can decrypt certain data in the database.
  5. Database Pre-Key Salt: This is the algorithmia.creds_pre_salt value in configs/application-$STAGE.conf. This is needed so that a new deployment can decrypt certain data in the database.
  6. Database Post-Key Salt: This is the algorithmia.creds_post_salt value in configs/application-$STAGE.conf. This is needed so that a new deployment can decrypt certain data in the database.
  7. Old Root Storage Bucket: This is the algorithmia-$UID-$STAGE-$REGION-data S3 bucket containing hosted data/models and algorithm bundles. The installation will copy this bucket into a newly created bucket.

Continue to answer the installation prompts, and be sure to use the same database root username that was used in the snapshot, otherwise Terraform will destroy and recreate the database in a loop.

Once the installation prompts finish, the install should proceed as a clean installation would. You do NOT need to click Perform Cluster Setup after the installer finishes.

NOTE: Database passwords will be different in the new cluster, and the install will run an S3 sync from the old bucket to the new bucket that could take a while depending on the number of algorithms and the amount of data that existed in the old cluster.

Once the installation script completes, you should have a fully migrated (or recovered) Algorithmia Enterprise cluster.

Supplementary Backup Coverage (gCORE DevOps)

As an additional precautionary measure, gCORE DevOps has included the algorithmia-bastion server's EBS volume in its default DLM snapshot policy given the business criticality of the Algorithmia deployment state files that exist on it. See Disaster Recovery for details.

Labelbox

Database

Labelbox database snapshots are configured directly in the Labelbox Administration Console.

Location | s3://ai-labelbox-prod-ykmjo8gp/db_snapshots
Backup Frequency | Hourly
Backups Retained | 3 (default (TODO))
Backups Encrypted | Yes

Block Storage

Backed up daily in accordance with our default DLM policy (above).

Spell

Much of the content in this section is taken from the Spell-authored Spell ONPREM Disaster Recovery document.

Failure modes

The ONPREM deployment of Spell is likely to experience hardware failures and other mishaps that could lead the deployment to a corrupted state. Here are some possible events:

  • The underlying instance could experience hardware failures
  • The attached volume could become corrupted
  • User error might lead to resource deletion

Resource Backup

Here is an overview of the various resources and what ensures their backup in case of disaster:

  • Postgres DB (Amazon RDS)
    • Data: The database, containing all information about any resource on Spell, including but not limited to users, runs, labels, notes, clusters, machine types, and workspaces.
    • Story: It is recommended that the RDS database be backed up automatically at regular intervals. In case of corruption, the RDS instance can be recovered from a snapshot; activity on Spell after that snapshot could be lost.
    • Status: Unimplemented
  • Metrics (Amazon DynamoDB)
    • Data: Run metrics, including user-defined metrics, hardware metrics, and framework metrics.
    • Story: It is recommended that backups be configured manually.
    • Status: Unimplemented
  • Run Logs (Amazon S3)
    • Data: Application run logs.
    • Story: N/A. S3 data protection is sufficient for this class of data.
    • Status: N/A
  • Gitlab State (Amazon EBS & RDS)
    • Data: All user code is stored in a Gitlab deployment hosted on the Minikube instance, with repository contents stored on the attached EBS volume and some database state also kept in Amazon RDS. Spell relies on some degree of synchronicity between the contents of Gitlab and the Postgres database; any deviation could cause data corruption.
    • Story: Daily EBS snapshots plus reboot-proofing of the Minikube instance.
    • Status: Partially implemented
  • Redis State (Amazon EBS)
    • Data: Redis state is used to keep temporary track of proxy routes for Jupyter workspaces.
    • Story: N/A. The temporary nature of this data means that data loss is not catastrophic.
    • Status: N/A

EBS Volume Loss/Corruption Recovery

In case of data corruption on the root volume attached to the Spell-Minikube instance, the contents of the most recent snapshot should be used:

  • Create an image (AMI) from the most recent snapshot.
  • Update the terraform spec to force-mount this volume to the EC2 instance:
    • Set the ami parameter inside aws_instance to this AMI.
    • Update the field private_ip to the hardcoded value of the existing instance. Skipping this step will cause failures within kubernetes.
  • Run terraform apply and observe that a new instance is created. Once that instance starts, all containers should come up as they were at the time the snapshot was taken.
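The first step (creating an AMI from the latest snapshot) can also be done with boto3 rather than the console. A sketch, where the snapshot ID, image name, and root device name are placeholders and should match your instance's actual configuration:

```python
# Build an AMI whose root volume is restored from an EBS snapshot, as the
# first step of the Spell-Minikube recovery procedure above. The helper
# returns the ec2.register_image kwargs so it can be inspected before use.

def register_image_request(snapshot_id, name, root_device="/dev/xvda"):
    """Kwargs for ec2.register_image: an AMI rooted on the snapshot."""
    return {
        "Name": name,
        "RootDeviceName": root_device,
        "VirtualizationType": "hvm",
        "BlockDeviceMappings": [{
            "DeviceName": root_device,
            "Ebs": {"SnapshotId": snapshot_id, "DeleteOnTermination": True},
        }],
    }

def create_recovery_ami(snapshot_id, name):
    import boto3  # deferred so the pure helper above works without boto3
    ec2 = boto3.client("ec2")
    resp = ec2.register_image(**register_image_request(snapshot_id, name))
    return resp["ImageId"]  # set this as `ami` in the aws_instance spec
```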

NOTE: Runs performed after the most recent snapshot but before the EBS volume corruption will end up in a corrupt state. Please contact Spell for how to proceed in this case.