This page describes ACE’s approach to its Disaster Recovery Plan (DRP) and Resilience Concept (RC) and identifies the critical systems covered by the DRP. Links to the recovery checklist for each critical system are listed below under Critical Systems.
Definitions
A Disaster Recovery Plan (DRP) is a predefined procedure describing how an IT Service or Component has to respond to (some of) the disaster scenarios identified in the Business Continuity Plan of the responsible Organization. The DRP contains detailed instructions to quickly resume the operation of the IT Service fully or at a minimum level that allows the consumers to use its key functionality.
A Resilience Concept (RC) describes the security mechanisms implemented in the IT Service or Component to ensure a continuous delivery (or a minimal disruption) of the service to other IT applications or components relying on it in the case of a continuity event.
Unlike a Disaster Recovery Plan, a Resilience Concept relies on built-in redundancy and automatic controls. No manual action is therefore required to restore the service, although manual intervention may be necessary after the triggering event to return the system to a state in which it can withstand another event.
A Test Plan is an indispensable part of disaster preparation: every DRP or RC must be accompanied by a Test Plan describing the testing scope and procedure. If Disaster Recovery Plans or Resilience Concepts are not tested, there is a real chance they will fail to execute as expected when they are needed.
Recovery Point Capability (RPC) is the point in time to which data was restored and/or systems were recovered (at the designated recovery/alternate location) after an outage or during a disaster recovery exercise.
Recovery Time Capability (RTC) is the demonstrated amount of time in which systems, applications and/or functions have been recovered, during an exercise or actual event, at the designated recovery/alternate location (physical or virtual).
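As a rough illustration of how the two definitions above turn into numbers after an exercise, the sketch below derives both values from three timestamps. The function name, the parameter names, the sample timestamps, and the choice to express RPC as the age of the restored data at the moment of the outage are assumptions made for this example, not part of ACE's procedure.

```python
from datetime import datetime, timedelta


def recovery_capabilities(
    restored_data_timestamp: datetime,
    outage_start: datetime,
    service_restored: datetime,
) -> tuple[timedelta, timedelta]:
    """Return (rpc, rtc) as durations for one exercise or real event.

    rpc: how old the restored data was when the outage began
         (assumed way of expressing the recovery point as a duration).
    rtc: how long it took to bring the service back.
    """
    if not (restored_data_timestamp <= outage_start <= service_restored):
        raise ValueError("timestamps must be in chronological order")
    rpc = outage_start - restored_data_timestamp
    rtc = service_restored - outage_start
    return rpc, rtc


# Hypothetical exercise: last backup at midnight, outage at 18:00,
# service back two days later.
rpc, rtc = recovery_capabilities(
    restored_data_timestamp=datetime(2024, 1, 1, 0, 0),
    outage_start=datetime(2024, 1, 1, 18, 0),
    service_restored=datetime(2024, 1, 3, 18, 0),
)
# rpc is 18 hours, rtc is 2 days
```

Recording the three timestamps during each tabletop or live exercise makes it straightforward to report demonstrated values rather than estimates.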
Roles and Responsibilities
| Resource | Role | Contact Info |
|---|---|---|
| ACE Infra Team | Monitoring for issues that could cause a disaster. Executing disaster recovery plan steps. | gred-ace-infra-d@gene.com, ACE Infra Slack Support Channel |
| ACE Infra Escalation Point | Point of contact for customer escalation. | Saima Sherazi |
| Application users | Testing the applications to ensure they function. | ACE Infra Slack Support Channel |
| Vendors | Support recovery steps, as needed. | Support info shared on the recovery checklist page for each application |
Critical Systems
Critical systems are broken down into three categories. The link below for each critical system will provide details of the recovery checklist.
- Some applications are deployed in a SaaS manner, where ACE Infra’s responsibility is only to provide storage or another component. The SaaS Applications subsection below covers this category.
- In other cases, the ACE Infra team is responsible for deploying an application and its required components to a VPC in our AWS account. The VPC-Deployed Applications subsection below covers this category.
- The final category is the infrastructure services that the ACE Infra team provides. The Infrastructure subsection below covers this category.
SaaS Applications
The ACE Infra team currently provides the applications below with object-level storage in S3 only. Since these are SaaS products, the vendor is responsible for ensuring the availability and resiliency of the application.
| SaaS Application | Owner |
|---|---|
| Labelbox | AI Leads |
| WandB* | AI Leads |
| ClearML | AI Leads |
| V7 | Data Engineering Lead |
*This application is owned by Research Biology. We are not responsible for the application configurations and are only providing S3 storage.
VPC-Deployed Applications
| VPC-Deployed Application | Owner |
|---|---|
| Alation | IMO Lead |
| Talend Worker Instances | Data Engineering Lead |
| Delphi | MLP Lead |
| Winterlight | Data Engineering Lead |
| EyeNotate | Data Engineering Lead |
| RALE | AI Leads |
Infrastructure
| Infrastructure Resource | Owner |
|---|---|
| AWS gCORE Redshift | IMO Lead |
| AWS RDS for Data Engineering ETL | Data Engineering Lead |
Recovery Capabilities
| Capability | Duration |
|---|---|
| Recovery Time Capability | 1 Week |
| Recovery Point Capability | 24 Hours |
Disaster Scenarios
This Disaster Recovery Plan mitigates the scenarios marked Yes below.
| Disaster Scenario | Protected (Yes/No/Partial) |
|---|---|
| Pandemic | Yes |
| Loss of Availability Zone in AWS | Yes |
| Loss of Region in AWS | Partial |
| SaaS Vendor Outage | No |
- Regional loss is not a focus for us at this time, though some of the work we have done accounts for a regional loss. Where applicable, regional loss protection will be called out.
Note that S3 buckets do not need backups by default, as Amazon S3 is designed for 99.999999999% (11 9s) data durability. Backups are needed only for multi-region support. Since we default to a single region, S3 data is not backed up.
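To put the 11 9s figure in perspective, the short calculation below converts the durability percentage into an expected number of objects lost per year. It treats the figure as an annual per-object design target, which is how AWS describes it; the object count of 10 million and the function names are assumptions chosen for illustration, not ACE's actual usage.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, to avoid floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_object_loss(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the design durability."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_object_loss(object_count)


# Hypothetical bucket holding 10 million objects.
loss = expected_annual_object_loss(10_000_000)  # Fraction(1, 10000)
years = years_per_lost_object(10_000_000)       # Fraction(10000, 1)
```

Under these assumptions, a bucket of 10 million objects would be expected to lose a single object about once every 10,000 years.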
Triggering Condition and Execution Approval
The application owner, in conjunction with the Infrastructure lead, declares that a disaster scenario has been met.
Disaster Recovery Testing
A tabletop exercise is conducted on an annual basis and documented in a GitHub issue.
Resources
For references, see the Roche IT Continuity Page.