Blue/Green Deployment for AWS


Performing a deployment requires making changes to production, and change equals risk. There are many techniques for deploying changes - some simple, others quite complex; some require downtime, others do not. Blue/green deployment is one technique that is quite simple, requires zero downtime, and, best of all, carries very little risk.

The process is this:

  1. Bring up the auxiliary resources required to run your application

    These are things like load balancers and security groups.

  2. Deploy version 1 of your application

    All of the traffic from the load balancer goes to this version.

  3. Create version 2 of your application

    This is usually different code (a new feature, a bug fix, etc), but it could also be a different instance type or key name.

  4. Switch traffic from version 1 to version 2

    Here is your blue/green deployment. The traffic going through the load balancer no longer goes to Version 1; it goes to Version 2 instead.

  5. Delete Version 1

    Once you are happy with how Version 2 is performing, you can delete the resources (instances, etc) that Version 1 was using.
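The five steps above can be sketched as a small simulation. The class and method names here are illustrative, not AWS APIs - the point is to show that traffic only ever moves to a healthy version, and the old version stays provisioned until you choose to delete it:

```python
class AppVersion:
    """One deployed version of the application (e.g. an Auto Scaling group)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


class LoadBalancer:
    """Routes all traffic to whichever version is currently live."""

    def __init__(self):
        self.live = None

    def attach(self, version):
        # Only switch once the new version passes its health checks,
        # so there is never a gap in service.
        assert version.healthy, "never switch to an unhealthy version"
        self.live = version


# 1-2. Bring up the load balancer and deploy Version 1.
lb = LoadBalancer()
v1 = AppVersion("v1")
lb.attach(v1)

# 3. Create Version 2 alongside Version 1 (both now running).
v2 = AppVersion("v2")

# 4. Switch traffic. Version 1 is untouched and remains the
#    last known good state, ready for an instant rollback.
lb.attach(v2)

# 5. Only after Version 2 looks good do we tear down Version 1.
print(lb.live.name)  # -> v2
```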

This is the nice, normal flow of a blue/green deployment. During the normal flow, your application is always online because, for a short period of time, Version 1 and Version 2 are both running simultaneously. This is how zero downtime is achieved. Once Version 2 is fully enabled and running, Version 1 is disabled. This happens as fast as the load balancer can put the Version 2 instances into service, which depends on your health check settings.

But what happens when things don't go according to plan? What if there is a change in Version 2 that causes the navigation bar to disappear, or user login to stop functioning? This is where the true power of blue/green deployment comes in. Deploying a new version causes zero changes to Version 1. Version 1 is completely untouched during the deployment. This means it is always the last known good state.

The risk of performing a deployment is greatly reduced because you can, at any point in time, roll back to the previous version, which is always fully provisioned, in a known good state, and ready to go.

Comparison to rolling updates

Another approach to performing a deployment is a rolling deployment. This is where you deploy a new version of code by taking one or more servers out of service, performing the update on them, and then putting them back in service - repeating the process until all instances in the cluster have been updated.

This is also something AWS has support for at the ELB level.

Let's quickly go over the pros of a rolling deployment over a blue/green deployment:

  • You don't need to run double the number of instances during a deployment
  • You can do a partial deployment - deploy to only 20% of your cluster

Sounds appealing, right? The success case usually is. It's the error cases where things get interesting. What happens when the new Version 2 has issues? Here is where the cons really add up:

  • There is NO last known good state. This technique modifies your existing servers.
  • Rollback can take a long time. If you are updating 100 instances, 5 at a time, each batch takes 2 minutes, and errors start occurring once you reach instance 70/100, you are in a world of pain. Rolling back to the previous version is a 28-minute process.
  • This is a complex and error-prone process. You need an auxiliary service that keeps track of where in the cluster the deployment is up to. What if your service is automatically scaling up and down during this process? What version of the code do the new instances get?
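The rollback math from the second point is easy to check. A short sketch, using the hypothetical numbers from the text:

```python
import math


def rollback_minutes(instances_updated, batch_size, minutes_per_batch):
    """Time to roll back every already-updated instance, batch by batch.

    Rolling deployments must undo the update the same way they applied
    it: a few instances at a time, one batch after another.
    """
    batches = math.ceil(instances_updated / batch_size)
    return batches * minutes_per_batch


# 70 instances already on the bad version, rolled back 5 at a time,
# 2 minutes per batch: 14 batches, so 28 minutes of rollback.
print(rollback_minutes(70, 5, 2))  # 28
```

Compare that to a blue/green rollback, which is a single traffic switch back to the still-running Version 1, regardless of cluster size.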

While rolling deployments sound nice in theory, you are taking on an incredible amount of risk every time you deploy. If you are using Auto Scaling groups, changing the instance type or user data requires changing the Launch Configuration, which is yet another process that needs to be managed. This is potentially a different process than you use for rolling updates to existing servers.

Things to consider

It's not all rainbows and unicorns with blue/green deployments. There are a few things that need to be taken into consideration to make deployments truly risk-free.

Nothing else changes

When switching from Version 1 to Version 2, nothing else should happen. For Ruby on Rails developers, this means deploying a new version of code does not run any database migrations. All schema migrations should be performed on a separate schedule.

Consider this: if Version 2 requires a DB schema change to function, when do you perform that migration? Whether you are doing a blue/green deployment or a rolling update, at some point in time, both version 1 and version 2 need to be running simultaneously. Given that this is always the case, make sure Version 2 can run on Version 1's schema.

This also brings us back to the error case. If you have Version 2 running, with a different DB schema, and need to roll back to Version 1, what will happen? How will Version 1 run on Version 2's schema?

Having both Version 1 and Version 2 of your code running on the same DB schema is not something that comes naturally to the Rails community, which is unfortunate. You will need to be aware of this.
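As a sketch of why additive schema changes keep both versions working, here is Version 1's query running unchanged against Version 2's wider schema. The table and column names are made up for illustration (SQLite in memory, standing in for your production database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# Version 2's migration: purely additive, with a default. Rows written
# by Version 1 (which knows nothing about the new column) stay valid,
# and nothing is removed or renamed, so Version 1 keeps working.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT DEFAULT 'free'")

# Version 1's code still selects only the columns it knows about.
row = conn.execute("SELECT id, name FROM users").fetchone()
print(row)  # (1, 'alice')

# Version 2 can read its new column, already backfilled by the default.
print(conn.execute("SELECT plan FROM users").fetchone()[0])  # free
```

Because the schema change is backwards-compatible, rolling back from Version 2 to Version 1 is safe at any point.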

How does Etsy do it?

Etsy has a practice called "Schema Change Thursday". They perform all database schema changes only once a week, yet deploy to production 50 times per day. How do they do this? A few things:

  • Any schema change must be an addition with a default. No field is removed during a normal migration, and nothing is renamed.
  • Features are deployed "dark", meaning that while the code for the feature may be in production, it is disabled. They use "feature flags" to enable the feature first for internal employees, then 1% of the user base, then 10%, all the way up to 100%.

This is considered best practice for deploying new features and handling DB changes.

More instances

Since both Version 1 and Version 2 are running simultaneously during a deployment, for some period of time you will be running double the number of instances that the application normally requires. There are real costs involved with this, so be aware of it, and once you are certain Version 2 is working perfectly well, terminate Version 1. Another thing to consider: if you are getting close to the EC2 instance limit on your AWS account, you may need to increase that limit so that deployments can occur seamlessly.

No storage

Since all of the instances for version 1 are terminated when they are no longer required, do not store anything on the local disk (even EBS volumes). Any data will be lost forever after version 1 no longer exists. This is pretty standard for any auto scaling system, so should not be a surprise.
