This page covers ACE’s approach to the Disaster Recovery Plan (DRP) and Resilience Concept (RC) and identifies the critical systems covered by the DRP. Links to the recovery checklist for each critical system are listed below under Critical Systems.

Definitions

A Disaster Recovery Plan (DRP) is a predefined procedure describing how an IT Service or Component has to respond to (some of) the disaster scenarios identified in the Business Continuity Plan of the responsible Organization. The DRP contains detailed instructions to quickly resume the operation of the IT Service fully or at a minimum level that allows the consumers to use its key functionality.

A Resilience Concept (RC) describes the security mechanisms implemented in the IT Service or Component to ensure a continuous delivery (or a minimal disruption) of the service to other IT applications or components relying on it in the case of a continuity event.

Unlike a Disaster Recovery Plan, a Resilience Concept relies on built-in redundancy and automatic controls. No manual action is therefore required to restore the service, although manual intervention may be necessary after the event that triggered the safeguards, to return the system to a state where it can withstand another event.

A Test Plan is an indispensable part of disaster preparedness, and every DRP or RC must be accompanied by a Test Plan describing the testing scope and procedure. If Disaster Recovery Plans or Resilience Concepts are not tested, there is a real chance they will fail to execute as expected when they are needed.

Recovery Point Capability (RPC) is the point in time to which data was restored and/or systems were recovered (at the designated recovery/alternate location) after an outage or during a disaster recovery exercise.

Recovery Time Capability (RTC) is the demonstrated amount of time in which systems, applications and/or functions have been recovered, during an exercise or actual event, at the designated recovery/alternate location (physical or virtual).
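
As a worked illustration of the two capability definitions above, the following Python sketch derives both values from timestamps recorded during an exercise. The class, its field names, and the sample timestamps are hypothetical and chosen only for illustration; they do not come from an actual ACE exercise. The sketch assumes RPC is reported as the age of the restored data at the moment of the outage, and RTC as the elapsed time from the outage to the recovered service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryExercise:
    """Timestamps recorded during an exercise or actual event (hypothetical fields)."""
    outage_start: datetime          # when the service became unavailable
    restored_data_as_of: datetime   # point in time the restored data reflects
    service_recovered: datetime     # when the service was usable again

    def recovery_point_capability(self) -> timedelta:
        # Age of the restored data at the moment of the outage.
        return self.outage_start - self.restored_data_as_of

    def recovery_time_capability(self) -> timedelta:
        # Elapsed time from the outage to the recovered service.
        return self.service_recovered - self.outage_start


# Made-up sample values, not taken from a real exercise.
exercise = RecoveryExercise(
    outage_start=datetime(2024, 3, 4, 9, 0),
    restored_data_as_of=datetime(2024, 3, 3, 23, 0),
    service_recovered=datetime(2024, 3, 6, 15, 0),
)

print("RPC:", exercise.recovery_point_capability())
print("RTC:", exercise.recovery_time_capability())
```

Recording the three timestamps during each exercise is enough to report both capabilities consistently from one exercise to the next.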

Roles and Responsibilities

Resource	Role	Contact Info
ACE Infra Team	Monitoring for issues that could cause a disaster. Executing disaster recovery plan steps.	gred-ace-infra-d@gene.com ACE Infra Slack Support Channel
ACE Infra Escalation Point	Point of contact for customer escalation.	Saima Sherazi
Application users	Testing the applications to ensure they function	ACE Infra Slack Support Channel
Vendors	Support recovery steps, as needed.	Support info shared on the recovery checklist page for each application

Critical Systems

Critical systems are broken down into three categories. The link below for each critical system will provide details of the recovery checklist.

Some applications are deployed in a SaaS manner where ACE Infra’s responsibility is only to provide storage or another component. The SaaS Applications subsection below covers this category.
In other cases, the ACE Infra team is responsible for deploying an application and its required components to a VPC in our AWS account. The VPC-Deployed Applications subsection below covers this category.
The final category is infrastructure services that the ACE Infra team provides. The Infrastructure subsection below covers this category.

SaaS Applications

The ACE Infra team currently provides the applications below only with object-level storage in S3. Since these are SaaS products, the vendor is required to ensure availability and resiliency of the application.

SaaS Application	Owner
Labelbox	AI Leads
WandB*	AI Leads
ClearML	AI Leads
V7	Data Engineering Lead

*This application is owned by Research Biology. We are not responsible for the application configurations and are only providing S3 storage.

VPC-Deployed Applications

VPC-Deployed Application	Owner
Alation	IMO Lead
Talend Worker Instances	Data Engineering Lead
Delphi	MLP Lead
Winterlight	Data Engineering Lead
EyeNotate	Data Engineering Lead
RALE	AI Leads

Infrastructure

Infrastructure Resource	Owner
AWS gCORE Redshift	IMO Lead
AWS RDS for Data Engineering ETL	Data Engineering Lead

Recovery Capabilities

Capability	Duration
Recovery Time Capability	1 Week
Recovery Point Capability	24 Hours

Disaster Scenarios

This Disaster Recovery Plan protects against the scenarios marked Yes below.

Disaster Scenario	Protected (Yes/No/Partial)
Pandemic	Yes
Loss of Availability Zone in AWS	Yes
Loss of Region in AWS	Partial
SaaS Vendor Outage	No

Regional loss is not a focus for us at this time, though some of the work we have done accounts for it. Where applicable, regional loss protection will be called out.

Note that S3 buckets do not need backups by default, as AWS provides 99.999999999% (11 9s) data durability. The only case where backups are needed is multi-region support. Since we default to a single region, S3 data is not backed up.
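
To put the 11 9s figure in perspective, the short Python sketch below converts the durability percentage into an expected number of objects lost per year. It assumes the figure is an annual, per-object durability and uses a hypothetical bucket of 10,000,000 objects; the function name and bucket size are illustrative only. Decimal is used so that the subtraction is exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

# Annual per-object durability quoted above (11 nines), assumed to be per year.
DURABILITY = Decimal("0.99999999999")


def expected_annual_loss(object_count: int) -> Decimal:
    """Average number of objects expected to be lost per year."""
    return Decimal(object_count) * (Decimal(1) - DURABILITY)


# Hypothetical bucket size, chosen only for illustration.
objects = 10_000_000
loss_per_year = expected_annual_loss(objects)
years_per_lost_object = Decimal(1) / loss_per_year

print(f"Expected objects lost per year: {loss_per_year:.4f}")
print(f"Average years per lost object: {years_per_lost_object:.0f}")
```

Under these assumptions the arithmetic works out to about one lost object every 10,000 years on average for a bucket of that size.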

Triggering Condition and Execution Approval

The application owner, in conjunction with the Infrastructure lead, declares that a disaster scenario has been met.

Disaster Recovery Testing

A tabletop exercise is conducted on an annual basis and documented in a GitHub issue.

Resources

For references, see the Roche IT Continuity Page here.

Overview DR Checklist — Template