ACE Platform DR & Resilience Concept

This page covers ACE’s approach to the Disaster Recovery Plan (DRP) and Resilience Concept (RC) and identifies the critical systems covered by the DRP. Links to the recovery checklist for each critical system can be found below under Critical Systems.

Definitions

A Disaster Recovery Plan (DRP) is a predefined procedure describing how an IT Service or Component has to respond to (some of) the disaster scenarios identified in the Business Continuity Plan of the responsible Organization. The DRP contains detailed instructions to quickly resume the operation of the IT Service fully or at a minimum level that allows the consumers to use its key functionality.

A Resilience Concept (RC) describes the security mechanisms implemented in the IT Service or Component to ensure a continuous delivery (or a minimal disruption) of the service to other IT applications or components relying on it in the case of a continuity event.

Unlike a Disaster Recovery Plan, a Resilience Concept relies on built-in redundancy and automatic controls. Therefore, no manual action is required to restore the service, although manual intervention may be necessary after the event that triggered the safeguards, to bring the system back to a state where it can withstand another event.

A Test Plan is an indispensable part of disaster preparation, and every DRP or RC must be accompanied by a Test Plan describing the testing scope and procedure. If a Disaster Recovery Plan or Resilience Concept is not tested, there’s a real chance it will fail to execute as expected when it is needed.

Recovery Point Capability (RPC) is the point in time to which data was restored and/or systems were recovered (at the designated recovery/alternate location) after an outage or during a disaster recovery exercise.

Recovery Time Capability (RTC) is the demonstrated amount of time in which systems, applications and/or functions have been recovered, during an exercise or actual event, at the designated recovery/alternate location (physical or virtual).
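As an illustration of the two definitions above, the following minimal sketch derives RPC and RTC from the timestamps recorded during an exercise or actual event. The function name, parameter names, and timestamps are hypothetical and are not taken from any ACE tooling. RPC is expressed here as the gap between the restored point in time and the start of the outage, which is an assumption about how the value is reported.

```python
from datetime import datetime

def recovery_capabilities(outage_start, restored_data_point, service_restored):
    """Return (rpc, rtc) as timedeltas for one exercise or actual event.

    All three arguments are datetimes; the names are illustrative only.
    """
    # RPC: how far back the restored data lies relative to the outage.
    rpc = outage_start - restored_data_point
    # RTC: how long it took to recover the system at the recovery location.
    rtc = service_restored - outage_start
    return rpc, rtc

# Hypothetical timestamps from a recovery exercise.
outage_start = datetime(2024, 3, 4, 9, 0)
restored_data_point = datetime(2024, 3, 3, 23, 0)
service_restored = datetime(2024, 3, 6, 15, 0)

rpc, rtc = recovery_capabilities(outage_start, restored_data_point, service_restored)
print("RPC:", rpc)  # RPC: 10:00:00
print("RTC:", rtc)  # RTC: 2 days, 6:00:00
```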

Roles and Responsibilities

Resource | Role | Contact Info
ACE Infra Team | Monitoring for issues that could cause a disaster; executing disaster recovery plan steps. | gred-ace-infra-d@gene.com; ACE Infra Slack Support Channel
ACE Infra Escalation Point | Point of contact for customer escalation. | Saima Sherazi
Application users | Testing the applications to ensure they function. | ACE Infra Slack Support Channel
Vendors | Support recovery steps, as needed. | Support info shared on the recovery checklist page for each application

Critical Systems

Critical systems are broken down into three categories. The link below for each critical system will provide details of the recovery checklist.

  1. Some applications are deployed in a SaaS manner, where ACE Infra’s only responsibility is to provide storage or another component. The SaaS Applications subsection below covers this category.
  2. In other cases, the ACE Infra team is responsible for deploying an application and its required components to a VPC in our AWS account. The VPC-Deployed Applications subsection below covers this category.
  3. The final category is the infrastructure services that the ACE Infra team provides. The Infrastructure subsection below covers this category.

SaaS Applications

The ACE Infra team currently provides the applications below with object-level storage in S3 only. Since these are SaaS products, the vendor is required to ensure the availability and resiliency of the application.

SaaS Application | Owner
Labelbox | AI Leads
WandB* | AI Leads
ClearML | AI Leads
V7 | Data Engineering Lead

*This application is owned by Research Biology. We are not responsible for the application configurations and are only providing S3 storage.

VPC-Deployed Applications

VPC-Deployed Application | Owner
Alation | IMO Lead
Talend Worker Instances | Data Engineering Lead
Delphi | MLP Lead
Winterlight | Data Engineering Lead
EyeNotate | Data Engineering Lead
RALE | AI Leads

Infrastructure

Infrastructure Resource | Owner
AWS gCORE Redshift | IMO Lead
AWS RDS for Data Engineering ETL | Data Engineering Lead

Recovery Capabilities

Capability | Duration
Recovery Time Capability | 1 Week
Recovery Point Capability | 24 Hours
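The values above can be compared against the results of a recovery exercise. The following is a minimal sketch, assuming measured durations are recorded as Python `timedelta` values; the `TARGETS` mapping, the `within_capability` helper, and the measured figures are hypothetical and not part of any existing ACE tooling.

```python
from datetime import timedelta

# Capabilities stated on this page.
TARGETS = {
    "RTC": timedelta(weeks=1),
    "RPC": timedelta(hours=24),
}

def within_capability(measured):
    """Map each capability name to True if the measured value meets the target."""
    return {name: measured[name] <= limit for name, limit in TARGETS.items()}

# Hypothetical measurements from a tabletop or live exercise.
result = within_capability({
    "RTC": timedelta(days=2, hours=6),
    "RPC": timedelta(hours=10),
})
print(result)  # {'RTC': True, 'RPC': True}
```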

Disaster Scenarios

This Disaster Recovery Plan mitigates the scenarios marked Yes below.

Disaster Scenario | Protects (Yes/No)
Pandemic | Yes
Loss of Availability Zone in AWS | Yes
Loss of Region in AWS | Partial
SaaS Vendor Outage | No
  • Regional loss is not currently a focus for this team, although some of our work does account for it. Where applicable, regional loss protection will be called out.

Note that S3 buckets are not backed up by default, as AWS designs S3 for 99.999999999% (11 9s) data durability. The only case where backups are needed is multi-region support. Since we default to a single region, S3 data is not backed up.
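To put the 11 9s figure in perspective, the sketch below treats it as an average annual per-object loss rate, which is how AWS describes the design target. The function names and the object count are illustrative assumptions, and the result is a statistical expectation rather than a guarantee.

```python
from fractions import Fraction

# 99.999999999% durability leaves an average annual loss rate of 1 in 10^11 per object.
ANNUAL_LOSS_RATE = Fraction(1, 10**11)

def expected_annual_object_loss(object_count):
    """Expected number of objects lost per year for a bucket of this size."""
    return object_count * ANNUAL_LOSS_RATE

def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_object_loss(object_count)

# Hypothetical bucket holding ten million objects.
print(expected_annual_object_loss(10_000_000))  # 1/10000
print(years_per_lost_object(10_000_000))        # 10000
```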

Triggering Condition and Execution Approval

The application owner, in conjunction with the Infrastructure lead, declares that a disaster scenario has been met.

Disaster Recovery Testing

A tabletop exercise is conducted on an annual basis and documented in a GitHub issue.

Resources

For references, see the Roche IT Continuity Page.