Azure is pretty reliable, and in many situations a single Azure region gives you everything you need for business continuity. However, you’ll often be working with customers who need assurance that your solution will keep working if some or all services in a single Azure region fail. And regional-level failures do happen.
Whilst outages don’t happen that often, you should assume that they will at some point and you should be prepared.
The rest of this post discusses how to provide cross region redundancy for any Azure App Service.
When you set up an Azure App Service, that service runs in a Service Plan, which is effectively a description of the server farm your App Service is running on. Through the service plan, you can ask for bigger servers (scale up) and for more servers (scale out). Whilst it is true that having additional servers does improve the reliability of your application, service plans are really about availability and performance rather than disaster recovery. If there were a problem with the underlying Azure infrastructure supporting your service plan, there is a risk that your entire service would be unavailable.
This is where Traffic Manager Profiles come in. These profiles sit in front of your App Service and distribute clients across multiple instances. They are not that useful when you deploy your App Service to only one region, but if you have a copy in at least one other region, things get more interesting. In this case a Traffic Manager Profile can select an App Service based on one of three routing methods.
The linked article describes these three methods in more detail, but for the purposes of this post I’ll only be considering the weighted option, and I’ll configure it so that 50% of the traffic is routed to one region and 50% to another.
Setting up Traffic Manager Profiles through the Azure portal is pretty straightforward. Once you have your App Service deployed to two separate regions, you simply create the Traffic Manager Profile resource.
The important point to call out here is the name of the Traffic Manager Profile. This will become the fully qualified domain name (FQDN) for your traffic managed site. It is how your solution will be accessed, so make the name meaningful; it also needs to be unique. In my case the FQDN is
Here I have a basic Traffic Manager Profile but it isn’t very useful yet as it hasn’t got any endpoints. You can add endpoints through the obvious “Endpoints” option under the Traffic Manager Profile settings.
Here you can create an endpoint, which can be a Cloud Service, App Service, App Service Slot or the Public IP address of something else. That something else could be hosted in Azure, but it could also be anywhere else out on the Internet. When you select App Services, all of those in your Azure Subscription are displayed. Notice that when you add subsequent endpoints, the UI will continue to display all App Services, even the ones already wired up to endpoints, but you will receive a validation error if you attempt to create a duplicate.
To achieve an even distribution, the weight parameter should be the same for all endpoints. Weight can be any integer from 1 to 1000, so whilst configuring each endpoint’s weight to 1 will work, it might be better in production situations to use values that are more intuitive, such as (50, 50) or (30, 30, 40).
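To make the arithmetic concrete, here is a minimal Python sketch of how weights translate into traffic share. It assumes that each endpoint's share is simply its weight divided by the sum of all weights, and it uses a random pick to imitate the selection. The endpoint names are made up for illustration, and this is not Traffic Manager's actual implementation.

```python
import random
from collections import Counter


def expected_share(weights):
    """Return each endpoint's expected fraction of traffic: weight / total."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def simulate(weights, requests=10_000, seed=0):
    """Imitate weighted selection by picking one endpoint per request."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=requests)
    return Counter(picks)


# Hypothetical endpoint names; the weights are the (30, 30, 40) example.
weights = {"app-north": 30, "app-west": 30, "app-east": 40}
print(expected_share(weights))
print(simulate(weights))
```

Because only the ratio matters in this model, weights of (1, 1) and (50, 50) give the same split; the larger numbers are just easier to read as percentages.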
That gives you a basic Traffic Manager setup. At this point you can access your site on a *.trafficmanager.net address and the endpoint will be selected via the routing method. If one of the regions fails, your service will still be operating, albeit with reduced capacity. To compensate, you could combine this setup with elastic scaling rules in your App Service, which would increase the size of the farm in the event of a failure.
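As a rough guide to how much the scaling rules would need to add, the following sketch works out the load on the surviving regions. It assumes evenly weighted endpoints, that traffic from a failed region is spread evenly across the remaining ones, and that each instance handles a fixed number of requests per second. The function name and the figures are illustrative only.

```python
import math


def instances_needed(total_rps, rps_per_instance, healthy_regions):
    """Instances each healthy region needs to absorb an even share of load."""
    if healthy_regions < 1:
        raise ValueError("at least one healthy region is required")
    per_region_rps = total_rps / healthy_regions
    return math.ceil(per_region_rps / rps_per_instance)


# Illustrative figures: 400 requests/s overall, 100 requests/s per instance.
normal = instances_needed(400, 100, healthy_regions=2)    # both regions up
failover = instances_needed(400, 100, healthy_regions=1)  # one region lost
print(f"per region normally: {normal}, after losing a region: {failover}")
```

With two regions, losing one doubles the load on the other, so the scaling rules need enough headroom to double the farm.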
Traffic Manager will detect when there is a problem with an App Service either by querying its status (e.g., Started, Stopped) or by monitoring the site itself. By default, it probes the root of the site on port 80. If it receives a 200 HTTP status, all is good. If it gets anything else, that is considered a failure. All endpoints must expose the same path for monitoring, and the Traffic Manager Profile can only be configured to monitor one thing. So, whilst this is suitable for basic situations, it is not as sophisticated as the monitoring you might expect from a fully featured load balancer.
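The pass/fail rule described above can be sketched in a few lines of Python. This is an approximation for checking your own endpoints by hand, not Traffic Manager's probe: the function names are hypothetical, and `urllib` follows redirects, so the status returned is that of the final response.

```python
import urllib.error
import urllib.request


def is_healthy(status_code):
    """Only an HTTP 200 counts as healthy; anything else is a failure."""
    return status_code == 200


def probe(url, timeout=10):
    """Fetch the URL and return its HTTP status, or None if unreachable."""
    try:
        # Note: urlopen follows redirects, so this is the final status.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # no usable response: DNS failure, refused, timeout


def check(url):
    """Probe a single URL and report whether it passes the 200-only rule."""
    return is_healthy(probe(url))
```

Calling `check` against the root of each regional App Service gives a quick view of which ones would pass a probe of this kind.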
At this point you have provided a solution that can handle failures within one region. Traffic Manager will direct clients to an alternative. So, we’re done, right?
Not yet… One of the things I often see missing from solutions put in place for DR or business continuity is actually testing what happens when things go wrong. You should test to understand how the solution recovers from failures and to assess the potential impact on connected users. With Traffic Manager Profiles, you have a few options at your disposal.
1) You can stop the App Service in the Azure Portal
2) You can disable the endpoint from within traffic manager
3) Through configuration or otherwise you can cause the monitored endpoint to simulate a failure.
In each of these situations you need to observe what the impact is on a user connected to the failed endpoint during the switch over. Do they see any errors? Can their session continue once the endpoint is switched, or do they need to start again? Bear in mind that a failure is unlikely, but its impact could be big. You might inconvenience some users, but when looking at the bigger picture you still have a service. This is why testing switch over is important. You don’t know when a failure might occur, but you need to be prepared in case it does.
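One way to observe a switch over is to poll the traffic managed address while you trigger one of the failures listed above, then look at what came back. The sketch below is a minimal example under stated assumptions: the function names and the address in the comment are hypothetical, and it only records HTTP status codes, so it says nothing about whether a user's session survives.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status for one request, or an error label on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError) as err:
        return f"error: {err}"


def watch(url, attempts, interval=1.0, fetch=fetch_status, sleep=time.sleep):
    """Poll the URL repeatedly and record (timestamp, result) for each try."""
    results = []
    for attempt in range(attempts):
        results.append((time.time(), fetch(url)))
        if attempt < attempts - 1:
            sleep(interval)
    return results


def summarise(results):
    """Count successes and list the results that were not HTTP 200."""
    failures = [(ts, outcome) for ts, outcome in results if outcome != 200]
    return {
        "total": len(results),
        "ok": len(results) - len(failures),
        "failures": failures,
    }


# Example (hypothetical address; substitute your own *.trafficmanager.net name):
# print(summarise(watch("http://example.trafficmanager.net/", attempts=60)))
```

The timestamps on the failures give a rough measure of how long clients saw errors during the switch.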