The anxiety of deployments is real. Let's take a stab at understanding the human emotions related to deployment and learn best practices to minimize the fear.
A recent outage involving CrowdStrike impacted 8.5 million Windows devices, leading to disruptions in various global services, including airlines and hospitals. Multiple analyses have examined the root cause of the incident.
However, as a software engineer, I think we are missing the aspect of human emotions related to deployments, specifically the fear of breaking production. That’s what we will try to dive into in this article. We will cover:
Before delving into the fear of deployments from a software engineer’s perspective, let’s first understand the role of a release engineer. Release engineering has evolved considerably in recent years, thanks to modern CI/CD tools and the standardization of Kubernetes. Despite these advancements, the primary responsibilities remain the same:
Unlike release engineers, software engineers working on a product team may care about only certain aspects of deployments:
Although there are things we care about, there are also those we don’t:
So what does fear have to do with continuous deployment?
A lot.
Studies have shown [several benefits](https://dora.dev/capabilities/continuous-delivery/#:~:text=DevOps%20Research%20and%20Assessment%20(DORA,as%20higher%20levels%20of%20availability) of Continuous Deployment (CD), and unsurprisingly, many of them are psychological in nature. Continuous deployment removes the “human-in-the-loop”; therefore, it requires strong trust in the test infrastructure.
In other words, automated tests not only ensure the reliability of production but also provide psychological safety, sometimes irrationally, reducing the fear of deployments. As a developer, I’m more comfortable making changes in a CD process than when I’m asked to verify the changes manually.
However, despite the popularity of these CD strategies, a lot of companies still trigger deployments manually (have a human-in-the-loop), indicating a cautious approach to CD implementations. This behavior suggests that teams prefer to retain supervision of the release process and intervene where necessary.
This is important to understand from a psychological safety perspective. Manual deployments imply that someone is overseeing the process and handling issues when things go wrong. While this provides a sense of security, it can also induce fear in the person deploying and is prone to human error.
Despite the drawbacks, many teams still manage deployments manually. A typical manual deployment may include a few steps:
Someone babysits the entire deployment process before a release goes out. This person is tasked with intervening when and if there are signs of trouble. Teams maintain an on-call person who manages their deployments and handles problems when they arise.
Some teams have a dedicated release engineering team, which ensures releases go smoothly. Since this means a high degree of specialization, the deployment process could be more efficient and reliable.
Some companies maintain a spreadsheet to validate any changes made. This allows companies to systematically review and approve these changes, ensuring they meet predefined quality standards.
In addition to spreadsheets, manual QA is another layer companies add. QA engineers test new releases in staging environments before they are deployed to production. However, a testing environment isn’t foolproof, so some real-life scenarios won’t be accounted for.
Many things can go wrong for any software development team relying solely on manual deployments:
This can create bottlenecks, which lead to release delays and human error in some instances. Also, a team could have problems when this specific person leaves or can’t deliver on the required tasks.
There is no strategy for following through when a production incident occurs. When an incident happens, the release team has to scramble to find the relevant stakeholders to help resolve it and make decisions.
Someone makes a typographical error in a command or script, or forgets to run the pre-deployment or post-deployment steps.
Since deployments require babysitting the process, they become a time-consuming effort, which also causes the frequency of deployments to drop significantly. For instance, if it takes an hour to monitor the entire deployment, the release team may decide to skip deployments on days with minor changes to save that time.
It’s unclear to product teams what state the releases are in and when their changes are getting into production.
Looking at these challenges, it’s easy to understand why engineers dread deployments. The risk of deployment failures, the high stakes, and the pressure to keep downtime low also contribute to this fear.
These failures can be minimized by increasing test automation. Still, since these tests are carried out in a test environment, you should not expect an automated test to catch every possible error. Failures are to be expected but at a reduced rate.
Simply set up Continuous Deployments? Easier said than done. Despite the drawbacks, manual deployments are still okay if managed well. The goals should be:
Canary and Rollback strategies can help reduce the impact of an outage and in many cases avert the crisis automatically.
A canary release exposes your new release to a small portion of production environment traffic. This gives teams insight into issues that might not have come up during testing.
On the other hand, a rollback strategy helps engineers revert a release to its previous stable version. A rollback is performed when new problems arise after a deployment to the production environment.
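To make the interplay between a canary and a rollback concrete, here is a minimal sketch in Python. It is not how any particular deployment tool works: the `CanaryRouter` class, the 5% traffic share, the 2% error threshold, and the minimum sample size are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class CanaryRouter:
    """Hypothetical helper: sends a small share of traffic to a canary
    and falls back to the stable version if the canary looks unhealthy."""

    stable_version: str
    canary_version: str
    canary_fraction: float = 0.05  # assumed share of traffic sent to the canary
    max_error_rate: float = 0.02   # assumed error rate that triggers a rollback
    min_requests: int = 50         # canary requests to observe before judging
    canary_requests: int = 0
    canary_errors: int = 0
    rolled_back: bool = False

    def pick_version(self, rng: random.Random) -> str:
        """Choose which version should serve the next request."""
        if self.rolled_back:
            return self.stable_version
        if rng.random() < self.canary_fraction:
            return self.canary_version
        return self.stable_version

    def record(self, version: str, ok: bool) -> None:
        """Record a request outcome and roll back if the canary misbehaves."""
        if version != self.canary_version or self.rolled_back:
            return
        self.canary_requests += 1
        if not ok:
            self.canary_errors += 1
        if self.canary_requests >= self.min_requests:
            error_rate = self.canary_errors / self.canary_requests
            if error_rate > self.max_error_rate:
                # Automatic rollback: all traffic returns to the stable version.
                self.rolled_back = True


if __name__ == "__main__":
    router = CanaryRouter(stable_version="v1", canary_version="v2")
    rng = random.Random(0)
    for _ in range(5000):
        version = router.pick_version(rng)
        # Pretend the canary fails roughly half of its requests.
        ok = version == "v1" or rng.random() > 0.5
        router.record(version, ok)
    print("rolled back:", router.rolled_back)
```

The point of the sketch is that only a small slice of users ever sees the bad release, and nobody has to be watching a dashboard for the rollback to happen.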
Define standard deployment methodologies that result in efficiency, consistency, reliability, and high software quality. In its State of DevOps report, DORA shows that reliability predicts better operational performance. Furthermore, having a standardized process allows repeatability in release processes, which can be automated. Automating this process helps a team keep production costs lower.
Democratizing the deployment process removes the reliance on specific individuals. If we empower any software engineer to deploy, it slowly reduces the fear: “If anyone can deploy, it should not be too hard.” Share your legos!
To reduce deployment anxiety, we need to deploy more frequently, not less. The DORA report also highlights that smaller batch deployments are less likely to cause issues and help lower the psychological barrier for developers.
Clarifying what is being deployed enhances the developer experience. Make it easy for developers to know when deployments occur and what changes are included. This transparency helps developers track when their changes go live and simplifies incident investigations.
There should be defined steps to follow for rollbacks and hotfixes, as this helps eliminate any indecision with production incidents. For instance, there should be separate build and deploy steps for teams to follow for easy rollbacks.
Similarly, standardizing how to deal with hotfixes and cherry-picks can make it simple to operate when the stakes are high.
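One way to picture why separating build from deploy makes rollbacks easy is the sketch below. It assumes each build produces an immutable artifact and that a deploy only changes which artifact an environment points at; the `Environment` class, the `build` function, and the `app-<sha>` naming scheme are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def build(commit_sha: str) -> str:
    """Build step: return an immutable artifact ID (hypothetical naming scheme)."""
    return f"app-{commit_sha[:7]}"


@dataclass
class Environment:
    """Tracks which artifacts have been deployed, oldest first."""

    name: str
    history: List[str] = field(default_factory=list)

    @property
    def current(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def deploy(self, artifact_id: str) -> None:
        """Deploy step: point the environment at an already-built artifact."""
        self.history.append(artifact_id)

    def rollback(self) -> str:
        """Step back to the previously deployed artifact without rebuilding."""
        if len(self.history) < 2:
            raise RuntimeError("no previous release to roll back to")
        self.history.pop()  # discard the bad release
        return self.history[-1]


if __name__ == "__main__":
    prod = Environment("production")
    prod.deploy(build("abc1234def"))
    prod.deploy(build("9f8e7d6c5b"))
    print("before rollback:", prod.current)
    prod.rollback()
    print("after rollback:", prod.current)
```

Because the rollback reuses an artifact that already ran in production, there is no rebuild and no decision to make under pressure about what "the last good version" was.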
Feature flags are like kill-switches that can turn off a new feature that caused an incident in production. This can enable engineers to resolve production incidents quickly.
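As a rough illustration of the kill-switch idea, the sketch below guards a new code path behind a flag. The `FeatureFlags` class, the `new_discount_engine` flag name, and the 10% discount are made up for the example; a real setup would typically read flags from a configuration service rather than from memory.

```python
from typing import Dict, Iterable


class FeatureFlags:
    """In-memory flag store (an assumption; real systems use a config service)."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)


def checkout_total(prices: Iterable[float], flags: FeatureFlags) -> float:
    """Compute an order total, using the new path only when the flag is on."""
    subtotal = sum(prices)
    if flags.is_enabled("new_discount_engine"):
        # New behavior under test: a hypothetical 10% discount.
        return round(subtotal * 0.9, 2)
    # Old, known-good behavior.
    return subtotal


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.set("new_discount_engine", True)
    print("flag on:", checkout_total([10, 20], flags))
    # Incident reported: flip the kill switch, no redeploy needed.
    flags.set("new_discount_engine", False)
    print("flag off:", checkout_total([10, 20], flags))
```

Turning the flag off restores the old behavior immediately, which decouples resolving the incident from shipping a fix.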
Software teams must treat release engineering as a priority from the outset of product development to avoid costly mistakes. And we should not let incidents like the CrowdStrike outage cripple our development practices. Addressing the fear of deployment and preventing production incidents involves several key strategies:
At Aviator, we are building developer productivity tools from first principles to empower developers to build faster and better. For a modern way to manage deployments, check out Aviator Releases.