A deployment is a release of well-tested software to end users. Simply put, the process goes like this: stop the existing process, copy the new executables to the production environment, and restart everything. The keen reader might notice a problem here: this basic method requires a system outage. That might have been fine 10–15 years ago, but in the era of high-availability systems it is not acceptable.
With a quick web search, one can find two popular ways to solve the problem of staying available (that is, avoiding downtime) while releasing new software:
- Blue/Green deployments
- Canary deployments
I have been privileged to work at companies that have used both, so I can offer some concrete evidence of how they worked in actual production environments and share lessons learned.
The goal of this article is to help you better understand the pros and cons of each strategy and decide what works best for your organization.
Blue/Green Deployments
I consider this strategy the simpler of the two. There are two identical sets of infrastructure: one running the old version and serving customer traffic (blue), and one running the new version and ready to take over (green). The last step of the deployment is to swap traffic from the blue servers to the green servers. Once the swap is complete and the new code is validated to work correctly, the old infrastructure can be torn down and the deployment is finished.
The rollback procedure is straightforward: traffic can simply be swapped back.
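To make the mechanics concrete, here is a minimal sketch in Python. The `Fleet` and `Router` abstractions are hypothetical stand-ins for the load balancer or DNS layer that real setups would use; the point is only to show that deploy and rollback are both a single traffic swap.

```python
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """A set of identical servers running one version of the application."""
    name: str
    version: str
    endpoints: list = field(default_factory=list)


@dataclass
class Router:
    """Stand-in for the load balancer / DNS layer in front of the fleets."""
    active: Fleet

    def swap_to(self, fleet: Fleet) -> None:
        # The actual cut-over: all new traffic now goes to `fleet`.
        print(f"Routing traffic: {self.active.name} ({self.active.version}) "
              f"-> {fleet.name} ({fleet.version})")
        self.active = fleet


def blue_green_deploy(router: Router, green: Fleet, healthy) -> bool:
    """Swap traffic to the green fleet; swap back if validation fails."""
    blue = router.active
    router.swap_to(green)        # step 1: cut traffic over to the new version
    if not healthy():            # step 2: validate the new version
        router.swap_to(blue)     # rollback is just the reverse swap
        return False
    return True                  # step 3: blue can now be torn down (or kept for slower rollbacks)


if __name__ == "__main__":
    blue = Fleet("blue", "v1.0", ["10.0.0.1", "10.0.0.2"])
    green = Fleet("green", "v1.1", ["10.0.1.1", "10.0.1.2"])
    router = Router(active=blue)
    ok = blue_green_deploy(router, green, healthy=lambda: True)
    print("deployment finished" if ok else "rolled back")
```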
There are some issues with this approach, the main ones being:
- Infrastructure cost
- Increased blast radius
The infrastructure cost is effectively doubled: there are two identical instances of the application infrastructure running at the same time. You can take down the blue servers once the deployment is deemed safe, but rollbacks will then be slower if latent issues show up.
If there are bugs in the code, the blast radius will be 100% of users until a rollback is done.
Real World Example
At one of my previous jobs, we did monolith deployments one night every two weeks. There would be a 2–3 hour outage of services before everything came back online. This was fine until the business decided to commit our customers to three nines of availability (a maximum of roughly 9 hours of downtime per year). That commitment rendered the existing setup inadequate (3 hours × 25 releases per year = 75 hours of downtime).
The infrastructure-architecture team was then tasked with coming up with a zero-downtime solution. I'm not privy to the design docs written to compare approaches, but the solution ultimately implemented was Blue/Green deployments.
A new set of servers was stood up, matching the number of servers currently serving production traffic. The new servers were provisioned and had the new software deployed. Once everything was set, the load balancer (LB) swapped its endpoints from the old servers to the new ones.
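I don't know the exact tooling that team used, but with an AWS Application Load Balancer the swap boils down to pointing the listener at a new target group. A hedged boto3 sketch, with placeholder ARNs:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs: the listener in front of production traffic and the
# target group containing the freshly provisioned "green" servers.
LISTENER_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/prod/EXAMPLE"
GREEN_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/green/EXAMPLE"

# Point the listener's default action at the new target group.
# Rolling back is the same call with the old (blue) target group ARN.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": GREEN_TARGET_GROUP_ARN}],
)
```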
This worked well for the most part: code was heavily tested at the company, so rollbacks were rare (only two out of the 30+ deployments done in my time). However, when problematic code was deployed, it would cause a 100% outage for a specific feature or, in the worst case, the whole platform.
Canary Deployments
In a canary deployment, customer traffic is slowly shifted to the newer version of the software. In practice, there are a few different ways to achieve this. For example, you may spin up a few new servers or pods running the new version and route a portion of traffic to them. That traffic is monitored for issues or inconsistencies for a period of time, and then the deployment continues.
The blast radius is reduced here, as only a fraction of customers are affected if something goes wrong with the new version. You roll back by first taking the new version out of the traffic rotation and then redeploying the old version.
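Stripped of any particular tool, a canary rollout is essentially a loop that nudges a traffic weight upward and bails out at the first sign of trouble. A minimal sketch, where the step size, bake time, and the `set_canary_weight` / `is_healthy` hooks are all assumptions rather than a specific platform's API:

```python
import time


def canary_rollout(set_canary_weight, is_healthy,
                   step=0.05, target=1.0, bake_seconds=600) -> bool:
    """Gradually shift traffic to the canary; roll back on the first failure.

    set_canary_weight(w): routes fraction `w` of traffic to the new version.
    is_healthy():         returns False if metrics or alarms look bad.
    """
    weight = 0.0
    while weight < target:
        weight = min(weight + step, target)
        set_canary_weight(weight)
        time.sleep(bake_seconds)      # let metrics accumulate at this weight
        if not is_healthy():
            set_canary_weight(0.0)    # rollback: stop sending traffic to the canary
            return False
    return True                       # the old version can now be retired
```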
The obvious issues with this deployment are:
- More complexity
- Longer releases
Software engineers in an environment using canary deployments need to make sure their changes are backwards compatible. They also need to make sure the customer experience stays consistent; think of the case where a customer's request hits a box running the new software and the subsequent request hits a box running the older software.
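As a small illustration of what "backwards compatible" means in practice (the field names here are made up): if the new version renames a stored field, both versions must be able to read records written by the other, because requests land on either version mid-rollout.

```python
def read_display_name(record: dict) -> str:
    """Tolerant reader: accept records written by either the old or new version.

    The old version stored "username"; the hypothetical new version stores
    "display_name". During a canary rollout both shapes exist at once, so the
    new code keeps reading (and writing) the old field until the old version
    is fully retired.
    """
    return record.get("display_name") or record.get("username", "")


def write_record(display_name: str) -> dict:
    # Dual-write both fields while old servers are still in the fleet.
    return {"display_name": display_name, "username": display_name}
```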
Additionally, since traffic is moved from the old software to the new software gradually, propagating a change can take much longer. Depending on your settings, it can take anywhere from hours to days for a change to fully roll out. This is generally acceptable in a CI/CD environment.
Real World Example
At my current company, the standard is to deploy to a single server or pod (called a 1-box deployment) and monitor the metrics on it. Depending on the number of servers or the size of the user base, the "1-box" stage might actually cover more than a single box.
After the deployment, there’s a period during which the new boxes are monitored (3 hours minimum per region). The setup can be simple: a bake time is set and aggregate, 1-box-specific alarms are monitored; if an alarm goes off, the deployment is rolled back. Alternatively, metrics are monitored per API, with thresholds that trigger a rollback if any of them is breached.
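A hedged sketch of that bake-and-watch step using boto3 and CloudWatch; the alarm-name prefix and the way the rollback gets triggered are assumptions about how such a pipeline might be wired up, not the actual internal tooling:

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch")


def bake_and_check(alarm_prefix="one-box-", bake_seconds=3 * 3600,
                   poll_seconds=60) -> bool:
    """Watch 1-box-specific alarms for the bake period; return False to roll back."""
    deadline = time.time() + bake_seconds
    while time.time() < deadline:
        firing = cloudwatch.describe_alarms(
            AlarmNamePrefix=alarm_prefix,   # only the 1-box-specific alarms
            StateValue="ALARM",
        )["MetricAlarms"]
        if firing:
            print("Alarm(s) firing:", [a["AlarmName"] for a in firing])
            return False                    # the caller triggers the rollback
        time.sleep(poll_seconds)
    return True                             # bake period passed cleanly
```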
After the 1-box deployments are finished in all regions, deployments occur in waves to get the software out faster. For example, there can be three waves, with each wave deploying to 33% of the remaining servers in every region. This is done so that any latent issues are caught and rolled back before they reach the whole fleet.
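The wave math itself is simple. A rough sketch (with a made-up fleet) of carving the servers remaining after the 1-box into three roughly equal waves:

```python
def plan_waves(servers: list, num_waves: int = 3) -> list:
    """Split the remaining servers (after the 1-box) into roughly equal waves."""
    size = -(-len(servers) // num_waves)   # ceiling division
    return [servers[i:i + size] for i in range(0, len(servers), size)]


remaining = [f"server-{i}" for i in range(1, 31)]   # hypothetical fleet of 30
for n, wave in enumerate(plan_waves(remaining), start=1):
    print(f"wave {n}: {len(wave)} servers")
```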
Recently, I was working on a Lambda function and had to implement something similar to a 1-box deployment. Lambda plus CodeDeploy offers a rolling deployment option that achieves the same goal.
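Under the hood, that style of Lambda rollout is weighted alias routing. A hedged boto3 sketch (the function name, alias, and version numbers are placeholders) that sends 10% of invocations to a newly published version while the alias still points at the stable one:

```python
import boto3

lam = boto3.client("lambda")

# Placeholders: a "live" alias that normally points at version 5, with 10% of
# traffic shifted to newly published version 6 during the canary window.
lam.update_alias(
    FunctionName="my-function",
    Name="live",
    FunctionVersion="5",                                     # current stable version
    RoutingConfig={"AdditionalVersionWeights": {"6": 0.1}},  # 10% to the canary
)
```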
Comparison
To summarize the trade-offs covered above:
- Blue/Green deployments are simpler to reason about and roll back almost instantly via a traffic swap, but they roughly double the infrastructure cost and expose 100% of users if bad code slips through.
- Canary deployments shrink the blast radius and bake changes against real traffic, at the cost of extra complexity (backwards compatibility, mixed-version traffic) and releases that take hours to days to fully propagate.
Combined Approach
Another way to tackle the problem is a combined approach. In this case, you have two sets of infrastructure, but traffic is slowly migrated from the old set to the new set, with monitoring in place to validate that the new code is functioning properly.
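On AWS, one way this can map onto infrastructure is weighted target groups on an ALB listener; a hedged sketch with placeholder ARNs that keeps both fleets up while shifting 10% of traffic to the green set:

```python
import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/prod/EXAMPLE"
BLUE_TG_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/blue/EXAMPLE"
GREEN_TG_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/green/EXAMPLE"

# Both fleets stay registered; the listener splits traffic 90/10.
# Ratcheting the weights toward 0/100 completes the migration, and
# flipping them back is the rollback.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": BLUE_TG_ARN, "Weight": 90},
                {"TargetGroupArn": GREEN_TG_ARN, "Weight": 10},
            ]
        },
    }],
)
```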