Leveraging AWS Backup for account-level disaster recovery

Introduction

In today’s digital landscape, businesses rely heavily on their data and IT infrastructure. However, disasters such as natural calamities, human error, or cyber-attacks can strike at any time and jeopardize critical business operations.
To mitigate such risks and ensure business continuity, copebit provides a robust disaster recovery solution with AWS Backup.
In this blog post, we follow up on Ruben Knaus’ cross-account and cross-region backup post and show how an AWS account can be restored in a disaster scenario.

The importance of disaster recovery

Disaster recovery refers to the strategies, processes, and technologies that organizations employ to recover their critical systems and data in the event of a disaster.

The ability to swiftly restore operations and minimize downtime is paramount in maintaining customer trust, meeting regulatory requirements, and safeguarding business reputation.

High-level process flow

Viewed from a broad perspective, account-level disaster recovery consists of three main stages. Each stage involves specific tasks that are performed sequentially. At copebit, we refer to these stages as:

Preparation steps
Disaster recovery deployment/execution
Application deployment and testing

The accompanying diagram illustrates the flow of the account disaster recovery process, highlighting its various stages and tasks:

Preparation steps

Prerequisites

Cross-account backup
In case of a full-on account disaster recovery, AWS Backup recovery points need to be available in a dedicated backup account that was not affected by the outage.

In the diagram below, account B is such a centralized backup account that was used to copy and store the backup recovery points of account A.
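To make the cross-account prerequisite more concrete, the following is a minimal sketch of a backup plan rule that stores recovery points in account A and copies each one into a vault in account B. The dictionary keys follow the rule structure that AWS Backup accepts in `create_backup_plan`; the vault names, account ID, schedule, and retention period are hypothetical placeholders, not copebit's actual configuration.

```python
def build_backup_rule(rule_name, source_vault_name, destination_vault_arn,
                      schedule="cron(0 2 * * ? *)", retention_days=35):
    """Return one backup plan rule that also copies each recovery point to another vault.

    All default values are illustrative assumptions.
    """
    lifecycle = {"DeleteAfterDays": retention_days}
    return {
        "RuleName": rule_name,
        # Vault in the workload account (account A) that receives the backup.
        "TargetBackupVaultName": source_vault_name,
        "ScheduleExpression": schedule,
        "Lifecycle": lifecycle,
        # Copy action that sends every recovery point to the backup account (account B).
        "CopyActions": [
            {
                "DestinationBackupVaultArn": destination_vault_arn,
                "Lifecycle": dict(lifecycle),
            }
        ],
    }


def build_backup_plan(plan_name, rules):
    """Wrap the rules in the structure expected as the BackupPlan argument."""
    return {"BackupPlanName": plan_name, "Rules": list(rules)}


# Hypothetical names and ARN, for illustration only.
example_rule = build_backup_rule(
    "daily-with-cross-account-copy",
    "account-a-vault",
    "arn:aws:backup:eu-central-1:222222222222:backup-vault:central-backup-vault",
)
example_plan = build_backup_plan("workload-backup-plan", [example_rule])
```

With boto3, the resulting plan could be passed as `boto3.client("backup").create_backup_plan(BackupPlan=example_plan)` in account A.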

AWS Control Tower
At copebit, we leverage AWS Control Tower in combination with an account vending machine to automate AWS account provisioning.

In a disaster recovery scenario, the account vending machine can be leveraged to swiftly create a fresh AWS account (account C) from scratch, which then takes over account A’s lost workload.

Infrastructure as code
The ability to define infrastructure in code is another prerequisite for short disaster recovery times.

All AWS resources that have been provisioned by copebit are defined either in CloudFormation templates or in Terraform code.

By taking advantage of the streamlined templates that were used to deploy the workload in account A, we can quickly set up the same services in account C.

Part of the preparation is to bootstrap the newly created account C with base account preparation templates, so-called dependency templates. This ensures that the services required for the next step (e.g. customer managed keys and IAM roles) are available.
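Because dependency templates build on each other, they have to be deployed in a suitable order. The sketch below derives such an order with Python's standard `graphlib` module. The template names and their dependencies are hypothetical examples chosen for illustration; the real set depends on the workload.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency templates: each key maps to the templates
# that must already be deployed before it.
DEPENDENCY_TEMPLATES = {
    "kms-keys": set(),
    "iam-roles": set(),
    "backup-vault": {"kms-keys", "iam-roles"},
}


def deployment_order(templates):
    """Return the template names so that each one comes after its dependencies."""
    return list(TopologicalSorter(templates).static_order())


order = deployment_order(DEPENDENCY_TEMPLATES)
for position, template in enumerate(order, start=1):
    print(f"{position}. deploy {template}")
```

A circular dependency between templates would raise `graphlib.CycleError`, which surfaces a broken bootstrap definition before anything is deployed.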

Copy backup recovery points
Last but not least, the backup recovery points need to be copied from the backup account B into the newly created account C.

This step is required because all recovery points need to be available in the account where the backup recovery targets reside.
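As a rough illustration of this step, the sketch below starts one AWS Backup copy job per recovery point using the `start_copy_job` operation of the boto3 `backup` client. It assumes the client is created with credentials for backup account B, that the destination vault in account C already exists, and that its access policy allows the copy. All names and ARNs passed in are placeholders supplied by the caller.

```python
import uuid


def copy_recovery_points(backup_client, recovery_point_arns, source_vault_name,
                         destination_vault_arn, iam_role_arn):
    """Start one copy job per recovery point and return the copy job IDs.

    backup_client is expected to behave like boto3.client("backup")
    running in the account that holds the source vault (account B).
    """
    job_ids = []
    for recovery_point_arn in recovery_point_arns:
        response = backup_client.start_copy_job(
            RecoveryPointArn=recovery_point_arn,
            SourceBackupVaultName=source_vault_name,
            DestinationBackupVaultArn=destination_vault_arn,
            IamRoleArn=iam_role_arn,
            # A fresh token per job guards against accidental duplicate submissions.
            IdempotencyToken=str(uuid.uuid4()),
        )
        job_ids.append(response["CopyJobId"])
    return job_ids
```

The returned job IDs can then be polled with `describe_copy_job` until every copy has completed, at which point the restore in account C can begin.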

The duration of a copy job varies depending on the size of the recovery point and the regions involved. To provide some rough estimates, here are some examples:

Cross-account / same region for RDS recovery point with a size of 170GB – estimation: 30min
Cross-account / cross-region for S3 recovery point with a size of 1.1TB / 2,000,000 objects – estimation: 1h15min

Disaster recovery deployment / execution

Once all the recovery points are available in account C, we can initiate the restoration process and proceed with further deployment of the resources that need to be restored. The specific recovery process may vary depending on the AWS resources requiring restoration. Here are several examples of different approaches to the recovery process:

Amazon RDS: The deployment references the snapshot ARN directly, so the database is created from the snapshot
Amazon EFS: Restoration entails creating a new EFS and importing the resource into the Infrastructure as Code (IaC) setup
Amazon S3: Restoration involves creating a new S3 bucket and restoring the data into it
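For the RDS case in the list above, snapshot referencing in CloudFormation can look roughly like the following sketch, which generates a minimal template whose `AWS::RDS::DBInstance` resource uses the `DBSnapshotIdentifier` property. The logical resource name, instance class, and snapshot ARN are hypothetical, and a real template would carry many more properties (networking, parameter groups, encryption settings).

```python
import json


def rds_restore_template(snapshot_arn, instance_class="db.t3.medium"):
    """Build a minimal CloudFormation template that creates an RDS instance from a snapshot.

    The instance class default is an illustrative assumption.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "RestoredDatabase": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {
                    # Referencing the snapshot makes the deployment itself the restore.
                    "DBSnapshotIdentifier": snapshot_arn,
                    "DBInstanceClass": instance_class,
                },
            }
        },
    }


# Placeholder ARN of a recovery point that has been copied into account C.
template = rds_restore_template(
    "arn:aws:rds:eu-central-1:333333333333:snapshot:example-recovery-point"
)
template_body = json.dumps(template, indent=2)
```

The same idea applies to Terraform, where the snapshot is referenced through the corresponding argument of the database resource.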

It is crucial to choose a step-by-step approach based on the resource to be restored and its restore capabilities. For instance, RDS supports snapshot referencing, so the database can be restored as part of the deployment itself. EFS, on the other hand, has to be restored into a new file system, which is then imported into the IaC setup as a separate step.
We recommend first deploying the resources that support snapshot referencing (e.g. RDS) together with the empty target resources for services that are restored into them (e.g. S3 buckets). Afterwards, the S3 data can be restored directly into the newly created buckets. Services that generate new resources during the restoration process (e.g. EFS) and therefore require a subsequent import follow next. Any remaining resources can then be deployed if necessary. This completes the restoration of all infrastructure-related resources, and application deployment and testing can begin.
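The recommended sequence can be sketched as a small planning function that sorts resources into phases by their restore behaviour. The mapping of RDS, S3, and EFS follows the examples in the text; the resource names and the `ELB` entry are hypothetical and only stand in for "any remaining resources".

```python
# Restore behaviour per resource type, based on the examples in the text.
STRATEGY = {
    "RDS": "snapshot_reference",
    "S3": "restore_into_empty",
    "EFS": "restore_then_import",
}


def plan_restore(resources):
    """Turn (name, resource_type) pairs into (phase, action, name) steps in execution order."""
    steps = []
    for name, resource_type in resources:
        strategy = STRATEGY.get(resource_type, "plain_deploy")
        if strategy == "snapshot_reference":
            steps.append((1, "deploy from snapshot", name))
        elif strategy == "restore_into_empty":
            steps.append((1, "deploy empty target", name))
            steps.append((2, "restore data", name))
        elif strategy == "restore_then_import":
            steps.append((3, "restore and import into IaC", name))
        else:
            steps.append((4, "deploy", name))
    # sorted() is stable, so steps keep their input order within each phase.
    return sorted(steps, key=lambda step: step[0])


# Hypothetical workload inventory.
plan = plan_restore([
    ("shared-files", "EFS"),
    ("assets", "S3"),
    ("orders-db", "RDS"),
    ("web-alb", "ELB"),
])
for phase, action, name in plan:
    print(f"phase {phase}: {action}: {name}")
```

Keeping the phase logic in one place makes it easier to add further resource types once their restore capabilities are known.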

The restore duration of a recovery point varies depending on its size and the supported restore procedure. To provide some rough estimates, here are some examples:

RDS restore with a size of 170GB – estimation: 45min
S3 restore with a size of 1.1TB / 2,000,000 objects – estimation: 1h30min
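Combining these figures with the copy estimates from the preparation stage gives a rough feel for the data-related part of the recovery time. The short calculation below adds copy and restore time per resource. It assumes that copy and restore for the same resource run one after the other, and it shows both the case where different resources are processed in parallel and the case where they are processed sequentially; which of the two applies is an assumption, not something the estimates themselves state. Account creation, bootstrapping, and application deployment are not included.

```python
# Rough estimates from the text, in minutes.
COPY_MINUTES = {"RDS": 30, "S3": 75}
RESTORE_MINUTES = {"RDS": 45, "S3": 90}


def total_minutes(resource):
    """Copy plus restore time for one resource, assuming the two steps run back to back."""
    return COPY_MINUTES[resource] + RESTORE_MINUTES[resource]


def format_duration(minutes):
    """Format a number of minutes in the same style as the estimates, e.g. 1h15min."""
    hours, remainder = divmod(minutes, 60)
    return f"{hours}h{remainder:02d}min"


per_resource = {name: total_minutes(name) for name in COPY_MINUTES}
parallel_total = max(per_resource.values())    # resources handled side by side
sequential_total = sum(per_resource.values())  # resources handled one after another

for name, minutes in per_resource.items():
    print(f"{name}: {format_duration(minutes)}")
print(f"parallel: {format_duration(parallel_total)}")
print(f"sequential: {format_duration(sequential_total)}")
```

With the example figures, this yields 1h15min for RDS and 2h45min for S3, so between 2h45min and 4h00min for both, depending on how much runs in parallel.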

Application deployment and testing

How the customer’s application is deployed to the newly restored resources in account C depends on the application itself and its deployment process. Various methods can be used, such as automated CI/CD pipelines, scripts, or manual procedures. Once the application is deployed, it can undergo functional testing to ensure it operates properly before being put into production use.

Conclusion

By implementing a centralized backup account, customers can confidently initiate an account disaster recovery process to a new AWS account, ensuring complete recoverability if the need arises. It is crucial for organizations to have predefined procedures in place in anticipation of potential disasters. Additionally, this approach empowers organizations to regularly test their disaster recovery capabilities, leveraging the flexibility of AWS as their chosen public cloud provider.

copebit, as an AWS Advanced Consulting Partner, has developed a comprehensive process that combines Infrastructure as Code (IaC) practices and various AWS services. This integration aims to streamline and automate the otherwise complex task of account disaster recovery.

If you would like more detailed information about account disaster recovery, we encourage you to get in touch with us. Our experts have successfully implemented and executed numerous tests, and they are well-equipped to provide insights and guidance on the topic.

Phil Wegmueller

Phil is a Cloud Consultant / Solution Architect specialised in AWS Solutions. He holds multiple AWS Certifications and has been working in the IT industry for 19+ years.