Cross-region disaster recovery with Amazon Elastic Container Service

Yegor Tokmakov
11 min read · Mar 9, 2021

With more and more businesses going digital, requirements for business continuity are getting higher every day. Natural disasters, data center failures, power outages or even cyber attacks can impact revenue, lead to data loss, dissatisfied customers and brand damage. With today's complex software architectures it is important, and in some cases required by government regulations, to have a disaster recovery plan for such scenarios. The AWS global infrastructure and AWS services provide an ideal environment for building resilient applications that satisfy business continuity requirements. For most workloads, deploying across geographically distributed AWS Availability Zones is enough, but the most mission-critical applications might require deployments in multiple Regions. CloudEndure Disaster Recovery provides continuous replication of machines from both AWS and non-AWS environments to a target AWS account and preferred Region. Disaster recovery for your application's data layer can be implemented using DynamoDB Global Tables, RDS cross-Region read replicas, and S3 Replication.

In this post, I'm going to demonstrate a disaster recovery concept known as the Pilot Light strategy for container workloads. We are going to set up a standby Amazon Elastic Container Service (ECS) cluster in a secondary Region, us-east-1. To reduce the costs of the solution to a minimum, this cluster will be a scaled-down-to-zero copy of the production environment in the eu-west-1 Region and will automatically scale out when a regional service impairment is detected using Amazon Route 53 and AWS Lambda.

Setting up standby cluster

To demonstrate the concept, I'm going to use the 3-tier web application example from the ECS Workshop. I'm going to deploy the same set of base resources to both the us-east-1 and eu-west-1 Regions, including Virtual Private Cloud (VPC) networks, Security Groups, Elastic Load Balancers (ELB) and ECS clusters backed by an EC2 capacity provider. After this, we will deploy the 3 services from the workshop examples and enable autoscaling on all of them. If you would like to recreate the same environment in your AWS account, please follow the instructions provided in the workshop.

To make sure the failover environment is always up to date, it is important to deploy new versions of the software to both clusters at the same time in your CI/CD pipeline. Setting up and managing such complex pipelines is possible with AWS Proton.

After deployment, you'll have load balancer endpoints in two Regions similar to http://conta-publi-us6g0jba3nqy-1111222233.eu-west-1.elb.amazonaws.com/ and http://conta-publi-1l4pgcw2dz7n7-1111222233.us-east-1.elb.amazonaws.com/. If the deployment was successful, opening one of the load balancer endpoints in your browser will take you to the frontend service in the ECS cluster:

Example 3-tier application in ECS

It's worth mentioning that using AWS Fargate removes the need to manage the underlying infrastructure, letting you focus only on the services. In this post I'm going to demonstrate a solution based on EC2 instances with ECS capacity providers, but the same concept can be applied to Fargate, or even to Amazon Elastic Kubernetes Service (EKS), with changes to the Lambda function code.

Amazon Route 53 DNS failover setup

Now we are going to set up Route 53 DNS failover with load balancer endpoints from the previous step. We are going to set up Route 53 health checks, CloudWatch alarms and SNS topic for notifications.

To set up Route 53 DNS failover:

  1. Navigate to Route 53 health checks and choose Create health check.
Route 53 health checks welcome screen

2. On the configuration page, enter PrimaryCluster as the name. Choose Endpoint in the What to monitor section. Under Monitor an endpoint, choose Specify endpoint by: Domain name. Enter your primary cluster's load balancer endpoint as the Domain name. Leave the advanced configuration at its default values and choose Next.

Configure Route 53 health check

3. In the next step, choose Create alarm: Yes. Choose New SNS topic in the Send notification to section and enter primaryClusterHealthcheckTopic as the topic's name. Enter your email address as the recipient of notifications and choose Create health check. You will receive an email with a link to confirm your email address.

Configure Route 53 health check alarm

4. After successfully creating the health check, you will be taken back to the list of all Route 53 health checks. Wait for the health check to turn healthy.

5. Navigate to Route 53 Hosted zones. Now we are going to set up failover DNS records for a domain name. If your domain is not configured in Route 53 yet, refer to the Route 53 documentation on configuring Amazon Route 53 as your DNS service. Choose the hosted zone you want to set up the records for, in our case example.com. Choose Create record.

6. Enter the record parameters below, as shown in the image. Choose Create records.

  • Routing policy: Failover
  • Record name: <leave empty>
  • Alias: Yes
  • Record type: A — Routes traffic to an IPv4 address and some AWS resources
  • Route traffic to: Alias to Application and Classic Load Balancer
  • Region: Europe (Ireland) [eu-west-1]
  • Load balancer: choose load balancer of the primary cluster
  • Evaluate target health: Yes
  • Failover record type: Primary
  • Record ID: ExampleEcsDR
Create new failover record

7. Create one more record for the cluster in the secondary Region. Enter the record parameters below, as shown in the image. Choose Create records.

  • Routing policy: Failover
  • Record name: <leave empty>
  • Alias: Yes
  • Record type: A — Routes traffic to an IPv4 address and some AWS resources
  • Route traffic to: Alias to Application and Classic Load Balancer
  • Region: US East (N. Virginia) [us-east-1]
  • Load balancer: choose load balancer of the standby cluster
  • Evaluate target health: No
  • Failover record type: Secondary
  • Record ID: ExampleEcsDRSecondary
Create new failover record for secondary cluster

8. At this point the Route 53 DNS failover setup is complete. Navigate to the configured domain in your browser and you should see a page similar to the image below. If you prefer to script this setup instead of using the console, see the sketch after the screenshot.

Public web page in the primary AWS region
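
If you would rather script the Route 53 setup from the steps above instead of clicking through the console, a minimal boto3 sketch might look like the following. The hosted zone ID, domain name, load balancer DNS names and load balancer hosted zone IDs are placeholders you would replace with your own values; the CloudWatch alarm and SNS notification from step 3 are left to the console steps.

import boto3

route53 = boto3.client('route53')

# Placeholder values -- replace with your own resources
hosted_zone_id = 'ZEXAMPLE12345'
domain_name = 'example.com.'
primary_elb_dns = 'conta-publi-us6g0jba3nqy-1111222233.eu-west-1.elb.amazonaws.com'
primary_elb_zone_id = 'Z00000000000000000001'   # hosted zone ID of the eu-west-1 load balancer
standby_elb_dns = 'conta-publi-1l4pgcw2dz7n7-1111222233.us-east-1.elb.amazonaws.com'
standby_elb_zone_id = 'Z00000000000000000002'   # hosted zone ID of the us-east-1 load balancer

# Health check against the primary load balancer (steps 1-2)
health_check = route53.create_health_check(
    CallerReference='primary-cluster-health-check',
    HealthCheckConfig={
        'Type': 'HTTP',
        'FullyQualifiedDomainName': primary_elb_dns,
        'Port': 80,
        'ResourcePath': '/',
        'RequestInterval': 30,
        'FailureThreshold': 3,
    },
)
print('Created health check %s' % health_check['HealthCheck']['Id'])

# Primary and secondary failover alias records (steps 6-7)
route53.change_resource_record_sets(
    HostedZoneId=hosted_zone_id,
    ChangeBatch={
        'Changes': [
            {
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': domain_name,
                    'Type': 'A',
                    'SetIdentifier': 'ExampleEcsDR',
                    'Failover': 'PRIMARY',
                    'AliasTarget': {
                        'HostedZoneId': primary_elb_zone_id,
                        'DNSName': primary_elb_dns,
                        'EvaluateTargetHealth': True,
                    },
                },
            },
            {
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': domain_name,
                    'Type': 'A',
                    'SetIdentifier': 'ExampleEcsDRSecondary',
                    'Failover': 'SECONDARY',
                    'AliasTarget': {
                        'HostedZoneId': standby_elb_zone_id,
                        'DNSName': standby_elb_dns,
                        'EvaluateTargetHealth': False,
                    },
                },
            },
        ]
    },
)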

Now if the load balancer in the primary cluster in eu-west-1 fails the health checks, Route 53 will fail over the DNS records and point to the secondary cluster in us-east-1. As a next step we want to keep our standby cluster scaled down to zero to reduce recurring costs and only scale it up in the event of a disaster.

Scaling up the secondary cluster on failover

We intend to keep our standby cluster scaled to zero to avoid ongoing costs. As we are using EC2 capacity providers, this means we need to scale down the EC2 Auto Scaling group and the ECS services, and set up a Lambda function that scales both back up when the Route 53 health check status changes.

As it's not critical for the disaster recovery plan, and for the sake of simplicity, we'll scale down the standby cluster manually to test the solution. Nevertheless, automating this operation is possible the same way as scaling up: use a combination of a CloudWatch alarm and a Lambda function to scale down once the primary cluster is marked as healthy again, DNS has rolled back to the primary cluster, and traffic on the standby cluster has drained.

To manually scale down the standby cluster (a scripted equivalent is sketched after these steps):

  1. Navigate to the standby ECS cluster. For every service in the cluster, set Desired tasks to 0.
Scaled down ECS services

2. In the Capacity Providers tab of the ECS cluster console, select the EC2 provider and choose Update. On the settings page, set Managed Scaling to Disabled and choose Update.

ECS Capacity Provider settings

3. Navigate to the EC2 console and locate the Auto Scaling group for the standby cluster. Set Desired capacity to 0.

Scaled down EC2 Auto Scaling group

4. Wait for ECS to drain all connections and terminate running tasks and EC2 instances.
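
For reference, the same manual scale-down can be scripted. Below is a minimal boto3 sketch; the cluster name matches the example used in this post, while the capacity provider and Auto Scaling group names are placeholders you would replace with your own.

import boto3

# Placeholder names -- replace with your standby cluster's resources
cluster = 'container-demo'
capacity_provider = 'EC2CapacityProvider'
asg_name = 'container-demo-ecs-asg'

ecs = boto3.client('ecs', region_name='us-east-1')
autoscaling = boto3.client('autoscaling', region_name='us-east-1')

# 1. Scale every service in the standby cluster down to zero tasks
paginator = ecs.get_paginator('list_services')
for page in paginator.paginate(cluster=cluster):
    for service_arn in page['serviceArns']:
        ecs.update_service(cluster=cluster, service=service_arn, desiredCount=0)

# 2. Disable Managed Scaling on the EC2 capacity provider
ecs.update_capacity_provider(
    name=capacity_provider,
    autoScalingGroupProvider={'managedScaling': {'status': 'DISABLED'}},
)

# 3. Scale the Auto Scaling group down to zero instances
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=asg_name,
    DesiredCapacity=0,
    MinSize=0,
)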

Lambda scaling function

Now we are ready to set up a function that will scale the secondary cluster up. The function will be triggered by the SNS topic used by the Route 53 health check to notify us of a status change. If you don't have prior experience with AWS Lambda, please read the Create a Lambda function with the console section of the documentation first.

The function first calls describe_clusters to get the list of capacity providers in the cluster. Then it iterates over the list of providers and calls the update_capacity_provider method on each to enable managed scaling. The last step is to get a list of all defined services in the cluster with list_services and to iterate over every service, setting the desired task count to 3 with the update_service method.

For the sake of simplicity, we won't store the original service task counts and will just set every service's desired task count to 3. If needed, you can store the number of running tasks in persistent storage and add additional logic to the Lambda function, as sketched below.
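
One possible approach, not part of the function in this post, is to persist each service's desired count in SSM Parameter Store and read it back during scale-up. The parameter naming and helper functions below are hypothetical:

import boto3

ssm = boto3.client('ssm', region_name='us-east-1')

def save_desired_count(cluster_name, service_name, desired_count):
    # Store the current desired count so the scale-up function can restore it later
    ssm.put_parameter(
        Name='/ecs-dr/%s/%s/desired-count' % (cluster_name, service_name),
        Value=str(desired_count),
        Type='String',
        Overwrite=True,
    )

def load_desired_count(cluster_name, service_name, default=3):
    # Fall back to a fixed value if no count was stored for this service
    try:
        parameter = ssm.get_parameter(
            Name='/ecs-dr/%s/%s/desired-count' % (cluster_name, service_name)
        )
        return int(parameter['Parameter']['Value'])
    except ssm.exceptions.ParameterNotFound:
        return default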

To create Lambda ECS scaling function:

1. In the secondary Region us-east-1, navigate to the AWS Lambda console and choose Create function.

2. Enter LambdaECSScaleUp as the name of the function. Choose Python 3.8 as the Runtime. Choose Create function.

3. Choose Add trigger, select SNS as the trigger and search for the SNS topic of the Route 53 health check alarm. As Route 53 is operationally based out of us-east-1, if you are using a different AWS Region for your standby cluster, you'll need to specify the full ARN of the SNS topic. The topic ARN looks like arn:aws:sns:us-east-1:1111222233:primaryClusterHealthcheckTopic and can always be located in the SNS console in the us-east-1 Region.

4. Copy the code below into the Function code field.
As of this writing, the ECS update_capacity_provider API is available only in the latest version of Boto3, which is not yet included in the Lambda runtime. You might need to package the function as a ZIP archive with the latest Boto3 version. More information on packaging Lambda functions as archives is available here.

import boto3
import json
import os

ecs = boto3.client('ecs')

def lambda_handler(event, context):
    cluster_arn = os.environ['CLUSTER_ARN']
    desired_count = 3

    ecsClusters = ecs.describe_clusters(
        clusters=[cluster_arn],
        include=[]
    )

    if not ecsClusters['clusters']:
        print('ECS cluster %s not found' % cluster_arn)
        return

    for capacityProvider in ecsClusters['clusters'][0]['capacityProviders']:
        response = ecs.update_capacity_provider(
            name=capacityProvider,
            autoScalingGroupProvider={
                'managedScaling': {
                    'status': 'ENABLED'
                }
            }
        )
        print('Enabled Managed Scaling for capacity provider %s' % (response['capacityProvider']['capacityProviderArn']))

    paginator = ecs.get_paginator('list_services')
    response_iterator = paginator.paginate(
        cluster=cluster_arn,
        launchType='EC2',
        schedulingStrategy='REPLICA'
    )

    for page in response_iterator:
        for service_arn in page['serviceArns']:
            try:
                ecs.update_service(
                    cluster=cluster_arn,
                    service=service_arn,
                    desiredCount=int(desired_count)
                )
            except Exception as e:
                raise Exception('Unable to scale the cluster' + str(e))

            print('Updated service %s desired count to %i' % (service_arn, desired_count))

5. Grant the Lambda function permissions on the ECS cluster. From the Permissions tab of the function console, navigate to the function's IAM role. Choose Attach policies, then Create policy. Copy the code below into the JSON editor of the policy. Review and create the policy, then attach it to the IAM role.
For production workloads, make sure to limit the resources you grant permissions to and follow the principle of least privilege.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:ListServices",
                "ecs:UpdateService",
                "ecs:UpdateCapacityProvider",
                "ecs:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:ecs:us-east-1:1111222233:*/*"
            ]
        }
    ]
}

6. In the Environment variables section, add a new variable with the key CLUSTER_ARN and the ARN of the standby cluster as the value; in this example it's arn:aws:ecs:us-east-1:1111222233:cluster/container-demo.
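
If you prefer to wire up the cross-region SNS trigger from step 3 with a script, or want to smoke-test the scale-up logic before an actual failover, a minimal boto3 sketch might look like this. The topic ARN and function name are the example values used in this post.

import json
import boto3

topic_arn = 'arn:aws:sns:us-east-1:1111222233:primaryClusterHealthcheckTopic'
function_name = 'LambdaECSScaleUp'

lambda_client = boto3.client('lambda', region_name='us-east-1')
sns = boto3.client('sns', region_name='us-east-1')

function_arn = lambda_client.get_function(FunctionName=function_name)['Configuration']['FunctionArn']

# Allow the health check topic to invoke the function, then subscribe the function to it
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId='AllowHealthCheckTopicInvoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic_arn,
)
sns.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)

# Smoke test: the handler ignores the event body, so an empty record list is enough
response = lambda_client.invoke(FunctionName=function_name, Payload=json.dumps({'Records': []}))
print(response['StatusCode'])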

Testing the solution

Now we are ready to emulate an outage of our primary ECS cluster in eu-west-1. To do that, we are going to edit the Security Group of the load balancer and remove the rule allowing inbound traffic. This will make the load balancer unreachable from the public internet and cause the Route 53 health check to fail.
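
The same emulation can be scripted; below is a minimal boto3 sketch, assuming a hypothetical Security Group ID and an inbound rule allowing HTTP traffic from anywhere.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# Hypothetical Security Group ID and rule -- replace with the values from the
# Security Group attached to your primary load balancer
ec2.revoke_security_group_ingress(
    GroupId='sg-0123456789abcdef0',
    IpPermissions=[
        {
            'IpProtocol': 'tcp',
            'FromPort': 80,
            'ToPort': 80,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}],
        }
    ],
)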

Unhealthy status of Route 53 health check

As soon as we save the changes to the Security Group, the following sequence of events is set in motion:

  1. Our website http://example.com/ is no longer reachable and returns a 503 Service Temporarily Unavailable error.
  2. Route 53 health checks fail and mark the primary record in eu-west-1 as unhealthy. Route 53 triggers a CloudWatch alarm and switches the DNS records to the standby cluster in us-east-1.
  3. The CloudWatch alarm sends a notification to the primaryClusterHealthcheckTopic SNS topic.
  4. The notification in the SNS topic sends an email with the status and triggers execution of the LambdaECSScaleUp Lambda function.
  5. The Lambda function enables Managed Scaling for the EC2 capacity provider and sets the desired task count of all services in the cluster to 3.
  6. ECS starts new tasks to reach the desired task level. To place new tasks, the ECS capacity provider requests the EC2 Auto Scaling group to launch new instances.
  7. With the default health check timeout settings, in around 10 minutes all traffic to http://example.com/ and internal services will be handled by the secondary cluster in the us-east-1 Region.

Additional considerations

Internal services
To keep communication between internal services working, the easiest way is to use service discovery. This solution demonstrates how service discovery allows internal services to address each other with the same names in different Regions, e.g. http://ecsdemo-nodejs.service. As service discovery uses internal DNS zones attached to the VPC, you can use exactly the same service discovery configuration in both Regions: the same service names and the same namespace name.

It's also worth reviewing tasks executed outside of ECS services (daemons, cron jobs, etc.), so they run the right number of times in the right place.

False alarms

The described method is intended for scenarios where an entire AWS Region, or an entire service in a specific Region, fails. It is important to develop the right strategy to detect these scenarios and trigger the alarms at the right time. Such a strategy should also avoid false triggers: application errors or degradation of internal services shouldn't cause a failover to the secondary Region, as that might increase the overall recovery time. Releasing a faulty version of the application shouldn't trigger a failover, but rather a rollback to the previous working version of the application.

AWS Fargate API throttling

AWS Fargate throttles Amazon ECS RunTask API requests for each AWS account on a per-Region basis. AWS does this to help the performance of the service and to ensure fair usage for all Fargate customers. Throttling ensures that calls to launch tasks do not exceed the maximum allowed API request limits. This means you should implement retry policies in your Lambda function and, if needed, request an API limit increase. More information is available in the documentation.
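
One lightweight way to add retries to the Lambda function is to rely on botocore's built-in retry modes when creating the ECS client, for example:

import boto3
from botocore.config import Config

# Standard/adaptive retry modes automatically back off and retry throttled API calls
ecs = boto3.client('ecs', config=Config(retries={'mode': 'adaptive', 'max_attempts': 10}))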

Container registry availability

In case of a total degradation of a Region, you have to make sure the container registry is still available in the standby Region. Amazon ECR provides cross-Region replication of your container images. More information is available in the cross-Region replication blog post.
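
As a sketch, cross-Region replication can be configured at the registry level with a single API call; the account ID below is the example value used in this post.

import boto3

ecr = boto3.client('ecr', region_name='eu-west-1')

# Replicate images pushed in the primary Region to the standby Region
ecr.put_replication_configuration(
    replicationConfiguration={
        'rules': [
            {'destinations': [{'region': 'us-east-1', 'registryId': '1111222233'}]}
        ]
    }
)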

Failover duration

In our example with all default settings, failover takes approximately 10 minutes from the emulated failure of the ECS cluster to serving traffic from the secondary Region. If your business continuity plan requires a lower Recovery Time Objective (RTO), the time can be reduced with Route 53 fast interval health checks (an additional charge applies), adjusted timeouts of CloudWatch alarms, a more aggressive autoscaling policy, and a reduced DNS TTL value.

Conclusion

In this post I demonstrated a disaster recovery solution concept based on Amazon ECS and Amazon Route 53 health checks. This solution will help you implement your business continuity plan with minimal additional recurring costs using AWS services. When introducing disaster recovery solutions, it is important to automate as many operations as possible and to test the solution via regular game days. To learn more about building resilient infrastructure, visit the AWS Well-Architected portal.
