Without a good rollout and rollback strategy, there is greater risk of releasing breaking changes or broken software that impacts all users for an extended period of time. This can erode confidence in your releases and customers’ confidence in your products.
Canary deployments can help minimise this risk by first routing a small percentage of traffic to the new version for a configured amount of time, before routing the remaining traffic to the new version. If any errors are detected during the initial routing then all traffic is routed back to the previous version.
AWS CodeDeploy provides native support for canary deployments of Lambdas. The AWS Serverless Application Model (SAM) provides abstractions to more easily configure CodeDeploy canary deployments of Lambdas using CloudFormation.
This blog post describes how to implement Lambda canary deployments using CodeDeploy and SAM, with the added bonus of a pre-traffic automated test Lambda for smoke testing the new version. CloudWatch Alarms trigger automatic rollback on increased error detection during the initial traffic shifting phase of the deployment.
CodeDeploy requires Lambda Versions and a Lambda Alias to provide support for canary deployments. A Lambda Version is an immutable snapshot of a Lambda at a point in time, and a Lambda Alias routes traffic to a specific Lambda Version (or, during a canary, splits it between two versions by weight). A canary deployment begins once a new Lambda Version has been published, which SAM's AutoPublishAlias property takes care of on each code change; CodeDeploy then routes a configured percentage of traffic to that Lambda Version. If no errors are detected during the configurable deployment timeframe, CodeDeploy points the Lambda Alias at the new Lambda Version, thereby shifting 100% of traffic to it.
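To make the traffic split concrete, here is a minimal sketch of weighted routing under the Canary10Percent5Minutes configuration used later in this post. It is a simplified model for illustration, not how Lambda or CodeDeploy are implemented; the function names `pickVersion` and `canaryWeightAt` and the shape of the `routing` object are assumptions made up for this example.

```javascript
// Simplified model of weighted alias routing (illustrative only).
// `pickVersion`, `canaryWeightAt` and the `routing` shape are hypothetical.

// Returns the fraction of traffic sent to the new version at a given point
// in the deployment: 10% for the first 5 minutes, then 100%.
function canaryWeightAt(minutesElapsed, canaryPercent = 10, intervalMinutes = 5) {
  return minutesElapsed < intervalMinutes ? canaryPercent / 100 : 1;
}

// Chooses a version for ONE request. Each request is routed independently,
// which is why the same user can see both versions during a deployment.
function pickVersion(routing, random = Math.random) {
  return random() < routing.canaryWeight
    ? routing.canaryVersion
    : routing.stableVersion;
}

const routing = {
  stableVersion: "1",
  canaryVersion: "2",
  canaryWeight: canaryWeightAt(2), // two minutes into the deployment
};

// A fixed "random" value makes the outcome deterministic for the demo.
console.log(pickVersion(routing, () => 0.05)); // "2" (falls in the 10% slice)
console.log(pickVersion(routing, () => 0.5));  // "1" (falls in the 90% slice)
```

The random source is injectable only so the behaviour can be demonstrated deterministically; in a real deployment the split is handled entirely by the alias.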
Example Serverless Application
Here’s a CloudFormation stack with a simple example of this in action. An API Gateway RESTful API backed by two Lambdas uses CodeDeploy for canary deployments. Test Lambdas are defined to run pre- and post-traffic tests against the new versions of ExampleAFunction and ExampleBFunction. CloudWatch Alarms for the alias and the current version of ExampleAFunction and ExampleBFunction are also defined. The code for this example can be found here.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    Runtime: nodejs12.x
    MemorySize: 128
    Timeout: 30
Resources:
  ExampleApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: live
  ExampleAFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      InlineCode: |
        exports.handler = (event, context, callback) => {
          callback(
            null,
            {
              statusCode: 200,
              body: JSON.stringify({
                message: 'Hello World A'
              })
            });
        };
      AutoPublishAlias: live
      Events:
        ExampleApiEvent:
          Type: Api
          Properties:
            RestApiId: !Ref ExampleApi
            Path: /example/a
            Method: get
      DeploymentPreference:
        Type: Canary10Percent5Minutes
        Alarms:
          - !Ref ExampleAAliasErrorMetricGreaterThanZeroAlarm
          - !Ref ExampleALatestVersionErrorMetricGreaterThanZeroAlarm
        Hooks:
          PreTraffic: !Ref PreTrafficLambdaFunction
          PostTraffic: !Ref PostTrafficLambdaFunction
  ExampleAAliasErrorMetricGreaterThanZeroAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda Function Error > 0
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: Resource
          Value: !Sub ${ExampleAFunction}:live
        - Name: FunctionName
          Value: !Ref ExampleAFunction
      EvaluationPeriods: 2
      MetricName: Errors
      Namespace: AWS/Lambda
      Period: 60
      Statistic: Sum
      Threshold: 0
  ExampleALatestVersionErrorMetricGreaterThanZeroAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda Function Error > 0
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: Resource
          Value: !Sub ${ExampleAFunction}:live
        - Name: FunctionName
          Value: !Ref ExampleAFunction
        - Name: ExecutedVersion
          Value:
            Fn::GetAtt:
              - ExampleAFunction.Version
              - Version
      EvaluationPeriods: 2
      MetricName: Errors
      Namespace: AWS/Lambda
      Period: 60
      Statistic: Sum
      Threshold: 0
  ExampleBFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      InlineCode: |
        exports.handler = (event, context, callback) => {
          callback(
            null,
            {
              statusCode: 200,
              body: JSON.stringify({
                message: 'Hello World B'
              })
            });
        };
      AutoPublishAlias: live
      Events:
        ExampleAApiEvent:
          Type: Api
          Properties:
            RestApiId: !Ref ExampleApi
            Path: /example/b
            Method: get
      DeploymentPreference:
        Type: Canary10Percent5Minutes
        Alarms:
          - !Ref ExampleBAliasErrorMetricGreaterThanZeroAlarm
          - !Ref ExampleBLatestVersionErrorMetricGreaterThanZeroAlarm
        Hooks:
          PreTraffic: !Ref PreTrafficLambdaFunction
          PostTraffic: !Ref PostTrafficLambdaFunction
  ExampleBAliasErrorMetricGreaterThanZeroAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda Function Error > 0
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: Resource
          Value: !Sub ${ExampleBFunction}:live
        - Name: FunctionName
          Value: !Ref ExampleBFunction
      EvaluationPeriods: 2
      MetricName: Errors
      Namespace: AWS/Lambda
      Period: 60
      Statistic: Sum
      Threshold: 0
  ExampleBLatestVersionErrorMetricGreaterThanZeroAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda Function Error > 0
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: Resource
          Value: !Sub ${ExampleBFunction}:live
        - Name: FunctionName
          Value: !Ref ExampleBFunction
        - Name: ExecutedVersion
          Value:
            Fn::GetAtt:
              - ExampleBFunction.Version
              - Version
      EvaluationPeriods: 2
      MetricName: Errors
      Namespace: AWS/Lambda
      Period: 60
      Statistic: Sum
      Threshold: 0
  PreTrafficLambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      InlineCode: |
        "use strict";
        const AWS = require("aws-sdk");
        const codedeploy = new AWS.CodeDeploy();
        exports.handler = (event, context, callback) => {
          console.log("Entering PreTraffic hook.");
          // Read the DeploymentId and LifecycleEventHookExecutionId from the event payload
          const deploymentId = event.DeploymentId;
          const lifecycleEventHookExecutionId = event.LifecycleEventHookExecutionId;
          var validationTestResult = "Failed";
          // Perform PreTraffic validation tests here. Set the test result
          // to "Succeeded" for this tutorial.
          console.log("This is where PreTraffic validation tests happen.")
          validationTestResult = "Succeeded";
          // Complete the PreTraffic hook by sending CodeDeploy the validation status
          const params = {
            deploymentId: deploymentId,
            lifecycleEventHookExecutionId: lifecycleEventHookExecutionId,
            status: validationTestResult // status can be 'Succeeded' or 'Failed'
          };
          // Pass AWS CodeDeploy the prepared validation test results.
          codedeploy.putLifecycleEventHookExecutionStatus(params, (err, data) => {
            if (err) {
              // Validation failed.
              console.log('PreTraffic validation tests failed');
              console.log(err, err.stack);
              callback("CodeDeploy Status update failed");
            } else {
              // Validation succeeded.
              console.log("PreTraffic validation tests succeeded");
              callback(null, "PreTraffic validation tests succeeded");
            }
          });
        }
      Policies:
        - Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - codedeploy:PutLifecycleEventHookExecutionStatus
              Resource: !Sub arn:${AWS::Partition}:codedeploy:${AWS::Region}:${AWS::AccountId}:deploymentgroup:${ServerlessDeploymentApplication}/*
        - Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - lambda:InvokeFunction
              Resource:
                - !GetAtt ExampleAFunction.Arn
                - !GetAtt ExampleBFunction.Arn
      FunctionName: CodeDeployHook_preTrafficHook
      Environment:
        Variables:
          ExampleAFunctionCurrentVersion: !Ref ExampleAFunction.Version
          ExampleBFunctionCurrentVersion: !Ref ExampleBFunction.Version
  PostTrafficLambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      InlineCode: |
        "use strict";
        const AWS = require("aws-sdk");
        const codedeploy = new AWS.CodeDeploy();
        exports.handler = (event, context, callback) => {
          console.log("Entering PostTraffic hook.");
          // Read the DeploymentId and LifecycleEventHookExecutionId from the event payload
          const deploymentId = event.DeploymentId;
          const lifecycleEventHookExecutionId = event.LifecycleEventHookExecutionId;
          var validationTestResult = "Failed";
          // Perform PostTraffic validation tests here. Set the test result
          // to "Succeeded" for this tutorial.
          console.log("This is where PostTraffic validation tests happen.")
          validationTestResult = "Succeeded";
          // Complete the PostTraffic hook by sending CodeDeploy the validation status
          const params = {
            deploymentId: deploymentId,
            lifecycleEventHookExecutionId: lifecycleEventHookExecutionId,
            status: validationTestResult // status can be 'Succeeded' or 'Failed'
          };
          // Pass AWS CodeDeploy the prepared validation test results.
          codedeploy.putLifecycleEventHookExecutionStatus(params, (err, data) => {
            if (err) {
              // Validation failed.
              console.log('PostTraffic validation tests failed');
              console.log(err, err.stack);
              callback("CodeDeploy Status update failed");
            } else {
              // Validation succeeded.
              console.log("PostTraffic validation tests succeeded");
              callback(null, "PostTraffic validation tests succeeded");
            }
          });
        }
      Policies:
        - Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - codedeploy:PutLifecycleEventHookExecutionStatus
              Resource: !Sub arn:${AWS::Partition}:codedeploy:${AWS::Region}:${AWS::AccountId}:deploymentgroup:${ServerlessDeploymentApplication}/*
        - Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - lambda:InvokeFunction
              Resource:
                - !GetAtt ExampleAFunction.Arn
                - !GetAtt ExampleBFunction.Arn
      FunctionName: CodeDeployHook_postTrafficHook
      Environment:
        Variables:
          ExampleAFunctionCurrentVersion: !Ref ExampleAFunction.Version
          ExampleBFunctionCurrentVersion: !Ref ExampleBFunction.Version
SAM Reduces Boilerplate
Just a few lines in the CloudFormation template apply the CodeDeploy canary configuration, courtesy of the SAM transformation:
DeploymentPreference:
  Type: Canary10Percent5Minutes
  Alarms:
    - !Ref ExampleAAliasErrorMetricGreaterThanZeroAlarm
    - !Ref ExampleALatestVersionErrorMetricGreaterThanZeroAlarm
  Hooks:
    PreTraffic: !Ref PreTrafficLambdaFunction
    PostTraffic: !Ref PostTrafficLambdaFunction
SAM transforms the CloudFormation to create the following resources:
- CodeDeploy Application
- CodeDeploy DeploymentGroup per Lambda
- CodeDeployServiceRole
- Lambda Alias with an UpdatePolicy applying CodeDeploy Application, Deployment Group, and pre/post-traffic hooks
Pre/Post-Traffic Tests
Pre-traffic tests against the new version, validating service contracts and using known test scenarios, provide greater confidence in a deployment being successful. Additional tests can be added over time as the system evolves and more weaknesses are revealed. If these tests fail then no traffic is shifted to the new version and no customers are affected.
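As a rough illustration of a contract check, the sketch below validates a response of the shape returned by the example functions. The function name `validateExampleResponse` is hypothetical, and how the response is obtained is left out: in a real hook it would presumably come from invoking the new version (for instance with the AWS SDK's Lambda `invoke` call against the version ARN passed in through the hook's environment variables), which is an assumption rather than something the example stack does.

```javascript
// Hypothetical contract check for the example functions' responses.
// Returns the status string that the hook would report to CodeDeploy.
function validateExampleResponse(response, expectedMessage) {
  const failures = [];

  if (!response || response.statusCode !== 200) {
    failures.push(`expected statusCode 200, got ${response && response.statusCode}`);
  }

  // The example handlers return the body as a JSON string.
  let body;
  try {
    body = JSON.parse(response && response.body);
  } catch (err) {
    failures.push("body is not valid JSON");
  }

  // Only check the message when the body parsed successfully.
  if (body !== undefined && (body === null || body.message !== expectedMessage)) {
    failures.push(`expected message "${expectedMessage}"`);
  }

  return { status: failures.length === 0 ? "Succeeded" : "Failed", failures };
}

const good = { statusCode: 200, body: JSON.stringify({ message: "Hello World A" }) };
const bad = { statusCode: 500, body: "oops" };

console.log(validateExampleResponse(good, "Hello World A").status); // "Succeeded"
console.log(validateExampleResponse(bad, "Hello World A").status);  // "Failed"
```

Collecting every failure rather than stopping at the first one makes the hook's log output more useful when a deployment is rejected.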
The example also includes a post-traffic test Lambda, which runs after all traffic has been shifted and can be used for smoke testing the fully deployed version should the need arise.
Ideally these tests can be run in parallel to minimise execution time and provide rapid feedback. Tests in non-production environments can be more thorough and numerous, employing test data and mocked boundaries to exercise known scenarios that may not be possible to test in production.
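One possible way to run independent checks in parallel inside a hook is sketched below. The helper name `runChecksInParallel` and the idea of representing each test as a function that throws (or rejects) on failure are assumptions for this example, not part of the stack above.

```javascript
// Runs every check concurrently and reduces the outcomes to a single
// CodeDeploy-style status. `runChecksInParallel` is a hypothetical helper.
async function runChecksInParallel(checks) {
  // Wrapping each call in an async function turns synchronous throws into
  // rejections, so one badly behaved check cannot abort the others.
  const results = await Promise.allSettled(checks.map(async (check) => check()));

  const failures = results
    .filter((result) => result.status === "rejected")
    .map((result) => String(result.reason));

  return { status: failures.length === 0 ? "Succeeded" : "Failed", failures };
}

// Usage with two stand-in checks: one passes, one fails.
const checks = [
  async () => { /* e.g. call /example/a and validate the response */ },
  async () => { throw new Error("contract mismatch on /example/b"); },
];

runChecksInParallel(checks).then((outcome) => {
  console.log(outcome.status);   // "Failed"
  console.log(outcome.failures); // [ 'Error: contract mismatch on /example/b' ]
});
```

`Promise.allSettled` is used instead of `Promise.all` so that every check runs to completion and all failures are reported together, at the cost of not failing fast.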
CloudWatch Alarms Trigger Rollback
During initial traffic shifting, CloudWatch Alarms monitor the error count of the new version. If an alarm is triggered, the deployment fails and CodeDeploy automatically rolls back to the previous version, so only a small number of requests are negatively impacted before traffic is shifted back. If all goes well, the remainder of the traffic is shifted to the new version.
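To reason about when the example alarms would fire, here is a deliberately simplified model of their settings (Statistic Sum, Period 60, EvaluationPeriods 2, Threshold 0, GreaterThanThreshold). The function `evaluateAlarm` is hypothetical and ignores details of real CloudWatch evaluation such as missing-data handling, so treat it as a thinking aid rather than a description of the service.

```javascript
// Simplified alarm model: the alarm fires only when each of the most recent
// `evaluationPeriods` datapoints is above the threshold.
// `errorSumsPerPeriod` is an array of per-minute error counts, oldest first.
function evaluateAlarm(errorSumsPerPeriod, { threshold = 0, evaluationPeriods = 2 } = {}) {
  if (errorSumsPerPeriod.length < evaluationPeriods) {
    return "INSUFFICIENT_DATA";
  }
  const recent = errorSumsPerPeriod.slice(-evaluationPeriods);
  return recent.every((sum) => sum > threshold) ? "ALARM" : "OK";
}

console.log(evaluateAlarm([0, 3, 1])); // "ALARM": errors in two consecutive minutes
console.log(evaluateAlarm([0, 3, 0])); // "OK": a single bad minute is not enough
console.log(evaluateAlarm([1]));       // "INSUFFICIENT_DATA"
```

Under this model, two consecutive one-minute periods with errors are needed before a rollback starts, which is worth weighing against the five-minute canary window when choosing EvaluationPeriods.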
Phased Rollouts
Canary deployments encourage a phased rollout approach that introduces no breaking changes and therefore requires no downtime for your applications. Thought has to go into architecting and coding your applications so that multiple versions can co-exist. For example, how will database, contract, or configuration changes be handled and rolled back?
Business Case
There needs to be a business case behind the architectural design decisions that you make. Not all solutions require 100% uptime with phased rollout of changes, and in certain circumstances it may not even be possible. However, when services like AWS CodeDeploy make these strategies so easy to apply, it’s almost more work not to adopt them.
AWS Well-Architected Best Practice
This strategy is described as a best practice by the Serverless Application Lens of the AWS Well-Architected Framework, falling within the Operational Excellence Pillar. The ability to build applications that run with zero downtime is a valuable, marketable skill. Operational excellence is a bar that continues to rise, and with expectations on developers only increasing it makes sense to keep your skills sharp.
With Lambdas, you pay for what you use, so previous versions, now unused, will not incur costs and can be retired as necessary. Adopting this strategy helps you optimise costs.
Caveats
CodeDeploy canary deployments use request-based traffic shifting, not user-based traffic shifting. Each request is routed independently, so some or all of your users could hit the new version during the canary phase, even though only a small share of total requests is affected.
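A quick back-of-the-envelope calculation shows why this matters. Assuming each request is routed independently with a 10% chance of reaching the new version (an assumption about the routing model, with `chanceUserHitsCanary` being a hypothetical helper), the probability that a user sees the new version at least once grows quickly with the number of requests they make:

```javascript
// Probability that at least one of `requestCount` independent requests
// lands on the canary, given the canary's share of traffic.
function chanceUserHitsCanary(requestCount, canaryWeight = 0.1) {
  return 1 - Math.pow(1 - canaryWeight, requestCount);
}

console.log(chanceUserHitsCanary(1).toFixed(2));  // "0.10"
console.log(chanceUserHitsCanary(10).toFixed(2)); // "0.65"
console.log(chanceUserHitsCanary(50).toFixed(2)); // "0.99"
```

So a user making ten requests during the canary window has roughly a two-in-three chance of touching the new version, even though only 10% of requests do.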
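A CloudWatch Alarm on the alias gathers metrics for every version behind that alias, not just the new one, so it can produce false positives against the new version should the previous version begin to fail during a deployment. The second alarm in the example narrows the metric to the new version by adding the ExecutedVersion dimension.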
The automatic cleanup of old versions is encouraged so as not to run out of Lambda storage space, which typically occurs at exactly the wrong moment, and can prevent you from deploying. An EventBridge Scheduled Event can be configured to periodically run a Lambda that removes older versions.
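The selection logic for such a cleanup Lambda might look like the sketch below. The function name `selectVersionsToDelete` and the retention count are assumptions; in practice the inputs would likely come from the Lambda API (for example `listVersionsByFunction` and `listAliases` in the AWS SDK), and each selected version would be removed with `deleteFunction` using a `Qualifier`.

```javascript
// Chooses which published versions are safe to delete (hypothetical helper).
// `versions`: version identifiers as strings, possibly including "$LATEST".
// `aliasedVersions`: versions still referenced by an alias, including any
//   version that only receives weighted (canary) traffic.
// `keepLatest`: how many of the newest versions to retain for rollback.
function selectVersionsToDelete(versions, aliasedVersions, keepLatest = 5) {
  const protectedVersions = new Set(aliasedVersions);

  const newestFirst = versions
    .filter((version) => version !== "$LATEST") // $LATEST cannot be deleted on its own
    .map(Number)
    .sort((a, b) => b - a);

  return newestFirst
    .slice(keepLatest)
    .map(String)
    .filter((version) => !protectedVersions.has(version));
}

const versions = ["$LATEST", "1", "2", "3", "4", "5", "6", "7"];
console.log(selectVersionsToDelete(versions, ["2"], 3)); // [ '4', '3', '1' ]
```

Keeping the decision in a pure function like this makes the retention policy easy to unit test separately from the AWS calls that act on it.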
When using SAM to apply a CodeDeploy deployment configuration to a Lambda, pre/post traffic hook Lambda names must start with: CodeDeployHook_
Charges Apply!
Charges apply, so be sure to remove all of the resources you created once you have finished experimenting.
References
AWS Well-Architected Framework: Serverless lens
Serverless Application Model (SAM): AWS::Serverless::Function (DeploymentPreference)
AWS Serverless Application Model Developer Guide: Deploying serverless applications gradually
Burning Monk (guest post for Lumigo): AWS Lambda Canary Deployment