There is a general misconception among cloud consumers that the availability of their resources in the cloud is always guaranteed. This is not true: all cloud providers, including Microsoft, offer specific SLAs for their products, and these almost never reach an availability target of 100%. For consumers who have deployed critical resources and applications to the cloud, reaching the company-defined targets for Business Continuity can be technically challenging and confusing. The purpose of this blog post is to provide practical guidance on how Business Continuity is expressed in the cloud, how it can be implemented for many Azure IaaS and PaaS services, and what real-world problems each solution attempts to solve.
Before we dive into the technical Azure-specific details, let’s explain what Business Continuity is and what it involves.
Business Continuity is the capability of the organization to continue the delivery of products or services at acceptable predefined levels following a disruptive incident.
Just as there are many different types of SLAs that you may have internally or with your customers, there are also multiple aspects of Business Continuity that may be applicable to you.
If it is important for you to keep your services always up and running, you should focus on High Availability. This is the ability of a system to be continuously operational or, in other words, to have an uptime percentage close to 100%. It is generally achieved by implementing redundant, mirrored copies of the hardware and data, so that if one component fails, another one takes over.
If fluctuating demand and bottlenecks cause your systems to struggle, then you may need to focus on Scalability. This is the ability of a system to scale cloud resources up or down as needed to meet fluctuating demand. It can be considered an aspect of Business Continuity, since peaks in demand can be the result or the cause of an incident.
Finally, to protect data that are critical to your company’s functionality and need to be always available and recoverable, you should implement Backup. This is the duplication of data to a secondary location, so that if the primary copy is harmed or becomes unavailable, data from the other location can be retrieved and the system can be rolled back to a specific point in time.
The following diagram shows an analogy between the aforementioned terms and the problems they tackle.
It is important to note that the implementation of any of the controls described in this blog post should be based on a structured business continuity assessment/plan, and that controls should be selected based on the requirements of your environment or application. Ill-considered implementation of controls could result in undue costs or in ineffective protection.
Depending on the required uptime of your application or system and the scale of disaster you need to be able to recover from, there are many ways to implement high availability in Azure. When choosing the controls that will be implemented in your environment, you should always consider that the availability of a chain of resources is determined by the weakest link in the chain. For example, in the case of an application composed of a front-end web server and a database, if the web server is spread across multiple availability zones but the database is single-instance, the whole application becomes unavailable if the availability zone of the database goes down. With the above in mind, we present below the different options provided by Azure, sorted by increasing complexity and cost.
Small-scale technical or hardware issues may affect single-instance components. To avoid this, the component should be mirrored to a secondary hardware volume. On Azure, depending on your cloud computing model, this can be implemented as follows:
When the component is a Virtual Machine (VM), this can be achieved by using availability sets. An availability set is a logical grouping of VMs that allows Azure to understand how your application is built to provide for redundancy and availability. While Azure guarantees a 99.9% uptime SLA for single-instance VMs, using availability sets increases the uptime SLA to 99.95%. To use availability sets on Azure VMs, you need to perform the following steps:
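The weakest-link remark can be made concrete with a little arithmetic. The sketch below is only an illustration under simplifying assumptions: component failures are treated as independent, and the availability figures and function names are hypothetical, not values taken from any Azure SLA. It shows that a chain of components can never be more available than its least available member, and that adding redundant copies of a component raises that component's availability.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the probabilities multiply.
    """
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` independent replicas when one replica suffices."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures for illustration only.
web_tier = 0.9999   # e.g. a front end spread over several zones
database = 0.999    # e.g. a single-instance database

application = series_availability([web_tier, database])
mirrored_database = redundant_availability(database, copies=2)

print(f"application as deployed:   {application:.4%}")
print(f"database with two copies:  {mirrored_database:.4%}")
```

In this toy model the application as a whole ends up slightly below the availability of the single-instance database, which is why hardening only the web tier does not help much.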
Note: A VM can only be placed in an availability set when it is created; existing VMs cannot be added to an availability set afterwards.
Azure PaaS components are protected against local hardware failures by design, guaranteeing higher uptime SLAs than IaaS. Specifically:
To provide the option of protecting against failures that affect the whole datacenter, such as fire, power and cooling disruptions or flood, Microsoft has introduced the concept of availability zones. Availability zones are unique physical locations within an Azure region, each made up of one or more datacenters with independent power, cooling, and networking. The creation of multiple instances of services across two or more zones provides increased high availability, as it protects both against hardware and against datacenter failures.
Based on your Cloud computing model, such protection can be achieved as follows:
Virtual machines can be deployed across multiple availability zones to provide an uptime SLA of 99.99%. This can be done with the following steps:
When it comes to PaaS services, it is generally easier to deploy them across multiple availability zones. Specifically:
Note: It is not possible to enable availability zone support after the creation of any of the above components.
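To put the SLA percentages quoted above into perspective, it can help to translate them into the amount of downtime they still permit. The following is a minimal sketch, assuming a 30-day month as the measurement period (an assumption made here for simplicity; the function name and labels are hypothetical as well).

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Maximum downtime, in minutes, that still meets the SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# SLA figures as quoted in the text for the three VM deployment options.
options = [
    ("single-instance VM", 99.9),
    ("availability set", 99.95),
    ("availability zones", 99.99),
]

for label, sla in options:
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {sla}% allows about {minutes:.1f} minutes per 30 days")
```

Over 30 days this works out to roughly 43, 22 and 4 minutes respectively, which shows how much each additional "nine" tightens the downtime budget.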
Finally, to protect against regional failures that can affect many adjacent datacenters and can be caused by large-scale natural and man-made disasters (e.g., earthquakes, tornadoes, war), Microsoft has introduced the concept of availability regions. Azure regions are geographic areas all over the world. Replicating workloads to a second region protects against local disasters within availability zones as well as against regional or large-geography disasters. The secondary region can be considered the disaster recovery site. Availability regions can be used independently of availability zones or in conjunction with them.
Based on the Cloud computing model of the application components that you want to protect, you have the following options:
Virtual machines can be deployed across multiple availability regions to provide an uptime SLA of 99.99%. This capability is offered by the Azure Site Recovery service, which orchestrates the replication, failover, and recovery of the VMs. Site Recovery can be implemented with the following steps:
PaaS services can be protected from regional disasters as follows:
Before closing the topic of High Availability, remember that a highly available system is a system that your customers and employees can rely on. It increases the credibility of your company, improves its reputation and offers peace of mind to your valued users. Although costs may go up, depending on your implementation choices, high availability helps you establish yourself as a trustworthy partner.
For systems whose load can abruptly increase or decrease, a problem arises: how can you guarantee sufficient resources during peak periods, while at the same time keeping your costs to a minimum during quiet periods? This is the essence of scaling, and in the cloud, achieving this balance is much easier than in traditional, on-premises infrastructures. There are two main ways that an application can scale: vertical scaling and horizontal scaling. Vertical scaling (scaling up) increases the capacity of a resource, for example by increasing the VM size, CPU, memory, etc. Horizontal scaling (scaling out) adds new instances of a resource, such as VMs or database replicas.
While vertical scaling can be achieved more easily, and without making any changes to the application, at some point it hits a limit beyond which the system cannot be scaled further. Horizontal scaling, on the other hand, is more flexible, cheaper, and well suited to large, distributed workloads. It also enables autoscaling, which is the process of dynamically allocating resources to ensure performance. That is why, especially in the cloud, horizontal scaling is the recommended option.
The options Azure provides you with are as follows:
Scalability of Virtual Machines in Azure can be achieved through Virtual Machine Scale Sets (VMSS). These represent groups of load-balanced VMs that provide scalability to applications by automatically increasing or decreasing the number of VM instances in response to demand or a defined schedule. VMSS can be deployed into one Availability Zone, multiple Availability Zones or even regionally.
During the creation of a VMSS resource, the cloud consumer can specify the scalability options and the minimum number of instances, the Load Balancer (or Application Gateway, in the case of HTTPS traffic) that will be used, and other networking and orchestration options.
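The idea behind demand-based horizontal autoscaling can be illustrated with a toy rule. This is a sketch of the general technique only, not of Azure's autoscale engine: the function name, the CPU thresholds, the one-instance step and the instance limits are all assumptions chosen for the example.

```python
def desired_instance_count(current, avg_cpu_percent, *, minimum=2, maximum=10,
                           scale_out_above=70.0, scale_in_below=30.0):
    """Apply one threshold-based scaling rule and return the new instance count.

    All thresholds and limits are hypothetical defaults for illustration.
    """
    if avg_cpu_percent > scale_out_above:
        target = current + 1          # scale out: add one instance
    elif avg_cpu_percent < scale_in_below:
        target = current - 1          # scale in: remove one instance
    else:
        target = current              # load is within the comfortable band
    # Never go below the configured minimum or above the maximum.
    return max(minimum, min(maximum, target))


# Simulate a load peak followed by a quiet period (made-up CPU readings).
count = 2
history = []
for cpu in [40, 75, 90, 85, 50, 20, 15]:
    count = desired_instance_count(count, cpu)
    history.append(count)

print(history)
```

The minimum keeps enough capacity for availability during quiet periods, while the maximum caps the cost of an unexpected peak.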
In most cases, managed PaaS services have horizontal scaling and autoscaling built in. The ease of scaling these services is a major advantage of using Azure PaaS services.
Specifically for App Service, we should point out that the scaling options depend on the App Service plan (tier) and can reach a maximum of 100 instances when using the Isolated tier.
Although high availability solves the problem of small or extended failures, what happens if the unavailability of data originates from a malicious threat, such as a ransomware attack? In this case, a highly available infrastructure will simply replicate the encrypted or corrupted files everywhere almost immediately, leaving no recovery options. This is where the value of remote data copies that are unaffected by real-time modifications lies. With Azure Backup, regular backups and snapshots of workloads are taken, so that in case of unauthorized modification or deletion, the service can be restored to a specific point in time.
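Restoring to a specific point in time essentially means picking the newest copy that predates the incident. The sketch below illustrates that selection logic only; it does not call Azure Backup, and the function name, the nightly schedule and the timestamps are hypothetical.

```python
from datetime import datetime


def latest_clean_restore_point(restore_points, compromised_at):
    """Return the newest restore point taken before the compromise, or None."""
    clean = [point for point in restore_points if point < compromised_at]
    return max(clean, default=None)


# Hypothetical nightly backups at 02:00 over one week.
restore_points = [datetime(2024, 1, day, 2, 0) for day in range(1, 8)]
# Hypothetical moment at which the ransomware started encrypting files.
compromised_at = datetime(2024, 1, 5, 14, 30)

chosen = latest_clean_restore_point(restore_points, compromised_at)
data_lost = compromised_at - chosen

print(f"restore to: {chosen}")
print(f"changes lost since that backup: {data_lost}")
```

Backups taken after the compromise are ignored because they may already contain encrypted data, which is exactly what distinguishes a backup from a real-time replica.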
Backups in Azure can be implemented both for IaaS and for PaaS services, and the options are presented below.
VM backups can be either locally redundant or zone redundant. Backups can be configured in two ways. The standard option generates backups once a day and maintains instant restore snapshots for 2 days. The enhanced option generates multiple backups per day, maintains instant restore snapshots for 7 days and ensures that snapshots are spread across zones for increased resiliency. The enhanced option is intended for VMs of high criticality.
For an existing Azure VM, backups can be configured under Operations > Backup.
Multiple backup options also exist for different PaaS services:
Overall, it is important to find the balance between the frequency of backups and the number of retained snapshots, so that you lose as little data as possible in case of an incident, are able to revert to a healthy past state, and at the same time keep costs at an acceptable level for your company.
To conclude, it all comes down to one question: can you survive? Can you recover from disasters ranging from power interruptions to pandemics and earthquakes? Business Continuity is the key to the answer. In the modern, distributed cloud world, all the capabilities are there. It is up to you, your dedication and your commitment to implement the ones that are essential to your business.
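This trade-off can be explored with some simple arithmetic. The sketch below assumes evenly spaced backups and a fixed retention period; the function name and the two example schedules are hypothetical and are not Azure Backup policy definitions.

```python
def backup_tradeoff(backups_per_day, retention_days):
    """Summarise worst-case data loss and stored recovery points for a schedule.

    Assumes backups are evenly spaced throughout the day.
    """
    interval_hours = 24 / backups_per_day
    return {
        # An incident just before the next backup loses one full interval.
        "worst_case_data_loss_hours": interval_hours,
        # More stored recovery points generally means higher storage cost.
        "recovery_points_kept": backups_per_day * retention_days,
        # How far back in time a healthy state can still be restored.
        "oldest_restorable_days": retention_days,
    }


# Two hypothetical schedules to compare.
daily = backup_tradeoff(backups_per_day=1, retention_days=30)
frequent = backup_tradeoff(backups_per_day=6, retention_days=7)

print("daily:   ", daily)
print("frequent:", frequent)
```

The frequent schedule loses far less data in the worst case but cannot go back as far in time, so the right choice depends on how quickly an incident is likely to be detected.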
Elpida Rouka
Elpida is an Information Security Consultant, with expertise in Azure/O365 Security, SIEM, Identity & Access management, Risk management, Information Security Management Systems (ISMS) and Business Continuity planning (ISO22301). She is always eager to create innovative high-quality solutions that precisely meet business needs.
Stijn Wellens
Stijn is a manager with experience in cloud and network security. He is Solution Lead for Cloud Security Assessments and Microsoft Cloud Security Engineering at NVISO. Besides the technical challenges during Azure and Microsoft 365 security roadmap implementations, Stijn enjoys coaching the teams by sharing his knowledge and experience.