Traffic Manager Profiles – Custom Domains

Last time I walked through a basic traffic manager setup. As with most walkthroughs, this should help get you up and running but it doesn’t cover some of the things you need to consider to make this a real-world solution. So, this time I’m going to delve into some of these. I’ll cover the Azure part of setting up a custom domain to give your site a realistic presence to your customers. To better understand this, I’ll look at what Traffic Manager Profiles are actually doing under the covers with requests coming from clients.

After setting up a traffic manager profile if you look at the custom domains for your site you’ll see something like this.

part 2.1

You’ll see that Azure has added a custom domain for azurewebsites.net so you have a means of accessing the site even if you do nothing else. It is greyed out as you cannot remove it. In the screen grab I have also added a custom domain in order to give the site a friendly name. To get this to work you need to set up the relevant DNS A or CNAME records, whether that is in Azure itself or via a third party. Azure will only allow you to add this after verifying that the domain records are correctly registered.

If you set up a traffic manager profile and add your web site as an endpoint, you’ll notice a change when you look at the custom domains again. An entry for the traffic manager profile has been added, but why? To understand that you need to look at what the traffic manager profile is really doing.

part 2.2

When a client makes a request for tmpprofile1233.trafficmanager.net, a DNS lookup is required to resolve the address. Normally this would result in the same IP address (in the case of an A record) or a domain name (in the case of a CNAME record) for every lookup. If the result is a domain name, the process is repeated until an IP address is returned. The client then uses this IP address to talk to the web application directly. Traffic is not being routed through the DNS infrastructure for each request, nor is a lookup done each time. The client holds on to the IP address for a set period of time, called the Time To Live (TTL), and only looks up the address again when this time has expired.
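To make the lookup process concrete, here is a minimal Python sketch of a client-side resolver. It is a toy and not a real DNS client: the record table, the names under example.test and the 203.0.113.x address are all invented for illustration, and real resolvers cache at several levels. It simply follows CNAME records until it reaches an A record and then holds on to the answer until the TTL expires.

```python
import time

# Toy record table: name -> (record type, value, TTL in seconds).
# Every name and address here is hypothetical.
RECORDS = {
    "www.example.test": ("CNAME", "app.example.test", 300),
    "app.example.test": ("A", "203.0.113.10", 300),
}

_cache = {}  # name -> (ip address, time at which the answer expires)


def resolve(name, now=None):
    """Follow CNAME records until an A record yields an IP address."""
    now = time.monotonic() if now is None else now

    cached = _cache.get(name)
    if cached and cached[1] > now:
        return cached[0]  # still within the TTL, so no lookup is made

    current, min_ttl = name, None
    for _ in range(10):  # guard against CNAME loops
        record_type, value, ttl = RECORDS[current]
        min_ttl = ttl if min_ttl is None else min(min_ttl, ttl)
        if record_type == "A":
            _cache[name] = (value, now + min_ttl)
            return value
        current = value  # a CNAME: repeat the lookup for the new name
    raise RuntimeError("too many CNAME hops")
```

Calling resolve twice within the TTL returns the cached address even if the record table has changed in the meantime, which is exactly why a failed endpoint can linger on a client.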

Traffic manager profiles provide a set of rules so that different domain names are returned based on the routing method and the endpoint configuration, e.g., the number of endpoints, their weight and priority. You also define a TTL which is lower than normal to ensure that address lookups occur more regularly. This ensures that clients are not disrupted for too long in the case of a failure.
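A rough way to see why the TTL matters is to add it to the time it takes for a failure to be noticed. The sketch below is my own back-of-the-envelope model, not a formula from Azure, and both parameters are assumptions you would replace with your own figures. It also ignores resolvers and clients that do not honour the TTL.

```python
def worst_case_stale_seconds(detection_s, dns_ttl_s):
    """Rough upper bound on how long a client may keep using a failed endpoint.

    detection_s: assumed time for the failure to be noticed and the DNS
                 answer to change.
    dns_ttl_s:   how long a client may cache the old answer after that.
    """
    return detection_s + dns_ttl_s
```

With a hypothetical two minutes to detect the failure, a one hour TTL gives a bound of 3,720 seconds, whereas a 30 second TTL gives 150.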

part 2.3

Based on its rules, traffic manager will provide the domain for one of your endpoints, such as uksouth-dev.hamersmith.space. The client will then resolve that to an IP address and talk to it directly. This explains why trafficmanager.net addresses show up in the custom domains list of each of your App Services. It is also why you configure a shared domain name such as dev.hamersmith.space at each site and not in the traffic manager profile itself.

part 2.4

In the screen grab above I have a local, pretty domain, xxx-dev1.hamersmith.space, that routes the clients directly to this web site. This is useful for testing purposes to bypass any traffic manager policies. You’ll also see the shared domain name xxx-dev.hamersmith.space which is needed to ensure that the site works correctly when it is picked by the traffic manager policy.

It took a while to get my head around this when I first used traffic manager, but once you walk through what it is doing it starts to make more sense.

Traffic Manager Profiles

Azure is pretty reliable and for many situations you get everything you want for all your business continuity needs without looking beyond a single Azure region. However, you’ll often be working with customers who need to be assured that your solution will work if some or all services in a single Azure region should fail. And regional level failures do happen.

Microsoft Azure hit with widening outages in Europe and India [Sept 2016]

Amazon AWS S3 outage is breaking things for a lot of websites and apps

Whilst outages don’t happen that often, you should assume that they will at some point and you should be prepared.

The rest of this post discusses how to provide cross region redundancy for any Azure App Service.

When setting up an Azure App Service, that service runs in a Service Plan which is effectively a description of the server farm your app service is running on. Through the service plan, you can ask for bigger servers (scale up) and for more servers (scale out). Whilst it is true that having additional servers does increase the reliability characteristics of your application, service plans are really about availability and performance – not disaster recovery. If there was a problem with the underlying Azure infrastructure supporting your service plan, then there is a risk that your entire service is dead.

This is where Traffic Manager Profiles come in. These profiles sit in front of your App Service and distribute clients across multiple instances. They are not that useful when you are deploying your app service to one region, but if you have a copy in at least one other region, things get more interesting. In this case traffic manager profiles can select an app service based on one of three routing methods:

  • Priority
  • Weighted
  • Performance

The linked article describes these three methods in more detail but for the purposes of this post, I’ll only be considering the weighted option and I’ll configure it so that 50% of the traffic is routed to one region and 50% to another.
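To give a feel for the two methods I’m not covering, here is a minimal Python sketch of how priority and performance selection could work. It is only my illustration of the idea, not Traffic Manager’s implementation: the Endpoint class, its field names and the latency figures are all hypothetical. Priority picks the healthy endpoint with the lowest priority number and performance picks the healthy endpoint with the lowest latency to the client.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool
    priority: int      # lower number means more preferred
    latency_ms: float  # hypothetical latency measured from the client


def pick_by_priority(endpoints):
    """Return the name of the most preferred healthy endpoint."""
    live = [e for e in endpoints if e.healthy]
    return min(live, key=lambda e: e.priority).name


def pick_by_performance(endpoints):
    """Return the name of the healthy endpoint closest to the client."""
    live = [e for e in endpoints if e.healthy]
    return min(live, key=lambda e: e.latency_ms).name
```

Both functions raise a ValueError when no endpoint is healthy, which this sketch leaves to the caller to handle.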

Setting up traffic manager profiles through the Azure portal is pretty straightforward. Once you have your App Service deployed to two separate regions, you simply create the Traffic Manager Profile resource.

Part1

The important point to call out here is the name of the Traffic Manager Profile. This will become the fully qualified domain name (FQDN) for your traffic managed site. It will be the way your solution is accessed, so make the name meaningful. It also needs to be unique. In my case the FQDN is my-tm123.trafficmanager.net.

Part2

Here I have a basic Traffic Manager Profile but it isn’t very useful yet as it hasn’t got any endpoints. You can add endpoints through the obvious “Endpoints” option under the Traffic Manager Profile settings.

Part3

Here you can create an endpoint which can be a Cloud Service, App Service, App Service Slot or the Public IP address of something else. That something else could be hosted in Azure but it could also be anywhere else out on the Internet. When selecting App Services, all of those in your Azure Subscription are displayed. Notice that when adding your subsequent endpoints, the UI will continue to display all App Services, even the ones already wired up to endpoints, but you will receive a validation error if you attempt to create a duplicate.

To achieve an even distribution, the weight parameter should be the same for all endpoints. Weight can be any integer between 1 and 1,000, so whilst configuring each endpoint’s weight to 1 will work, it might be better in production situations to use values that are more intuitive, such as (50, 50) or (30, 30, 40).
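The relationship between weights and traffic can be sketched in a few lines of Python. This is a simplified model of weighted selection, not Traffic Manager’s own algorithm, and the region names are just placeholders. Each endpoint’s share is its weight divided by the total, which is why weights of (1, 1) and (50, 50) give the same split.

```python
import random


def traffic_share(weights):
    """Map each endpoint name to the fraction of lookups it should receive."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_weighted(weights, rng):
    """Pick one endpoint name at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

For example, traffic_share({"uksouth": 30, "ukwest": 30, "northeurope": 40}) gives shares of 0.3, 0.3 and 0.4, and calling pick_weighted repeatedly with random.Random() as the rng argument will approach those proportions over many lookups.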

That gives you a basic Traffic Manager setup. At this point you can access your site on a *.trafficmanager.net address and the endpoint will be selected via the routing method. In the case of a failure of one of the regions, your service will still be operating, albeit with reduced capacity. You could combine this setting with elastic scaling rules in your App Service which would increase the size of the farm in the event of a failure to compensate for this.

Traffic Manager will detect when there is a problem with an App Service either by querying its status e.g., Started, Stopped etc. or via monitoring the site itself. By default, it probes the root of the site on port 80. If it receives a 200 HTTP status all is good. If it gets anything else, that is considered a failure. All services must make available the same endpoint for monitoring, and the Traffic Manager Profile can only be configured to monitor one thing. So, whilst this is suitable for basic situations it is not as sophisticated as the monitoring solution you might expect from a fully featured load balancer.
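The monitoring rule described above is simple enough to sketch in Python. This is my approximation of the idea and not how Traffic Manager itself is implemented: the probe function and its opener parameter are names I have made up, and urllib follows redirects, so a redirect that ends in a 200 counts as healthy here.

```python
from urllib import error, request


def probe(url, timeout=10.0, opener=request.urlopen):
    """Return True only if the URL answers with HTTP 200.

    Any other status, a timeout or a connection error counts as a failure.
    The opener parameter exists so the function can be tested offline.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (error.URLError, OSError):
        return False
```

You could point this at the root of each site, for example probe("http://" + hostname + "/"), to check that every region answers the same monitoring path before relying on it.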

At this point you have provided a solution that can handle failures within one region. Traffic Manager will direct clients to an alternative. So, we’re done, right?

Not yet… One of the things I often see missing from solutions put in place for DR or business continuity is actually testing what happens when things go wrong. You should test to understand how the solution recovers from failures and to assess potential impact on connected users. With traffic manager profiles, you have a few options at your disposal.

1) You can stop the App Service in the Azure Portal.
2) You can disable the endpoint from within traffic manager.
3) Through configuration or otherwise, you can cause the monitored endpoint to simulate a failure.

In each of these situations you need to observe what the impact is on a user connected to the failed endpoint during the switch over. Do they see any errors? Can their session continue once the endpoint is switched or do they need to start again? You should bear in mind that a failure is unlikely but the impact could be big. Therefore, you might inconvenience some users but when looking at the bigger picture you still have a service. And this is why testing switch over is important. You don’t know when a failure might occur but you need to be prepared in case it does.
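When running one of these tests it helps to record what a client actually saw. The Python sketch below summarises a list of observations gathered during a switch over. The function name and the sample format are my own assumptions: each sample is a pair of seconds since the test started and the endpoint that served the request, with None meaning the request failed.

```python
def summarise_switchover(samples):
    """Summarise what a client saw during a switch over test.

    samples: list of (seconds_since_start, endpoint_name_or_None) pairs in
             time order, where None means the request failed.
    """
    failures = [t for t, endpoint in samples if endpoint is None]

    # Record each endpoint once per change, in the order it was seen.
    endpoints_seen = []
    for _, endpoint in samples:
        if endpoint is not None and (not endpoints_seen or endpoints_seen[-1] != endpoint):
            endpoints_seen.append(endpoint)

    return {
        "failed_requests": len(failures),
        "first_failure_s": failures[0] if failures else None,
        "last_failure_s": failures[-1] if failures else None,
        "endpoints_seen": endpoints_seen,
    }
```

The gap between the first and last failure gives a rough idea of how long a user was affected, and the list of endpoints confirms that traffic really did move to the other region.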

Selecting the right tool for a job

Tooling, as it turns out, can be a very emotive subject. Everyone has their favourites but often we find ourselves in situations where we don’t get to choose. And even when we do get to use our favourite, in all likelihood there are others using it under duress, waxing lyrical about the superior feature set of an alternative product and how pleasant life would be if only we were using that instead.

I have a few principles, guidelines or rules of thumb when it comes to tooling, and the tension, conflicts and friction it can cause across a team.

Understand what you are trying to achieve

Before you even start thinking about tooling or any other type of software you need to understand what you are trying to achieve and why you think a tool might help. If you can’t answer that question, then how can you expect to select and evaluate a solution?

Let’s say you want to select a tool to assist your software delivery. You might have team members in different locations so simply representing work items as sticky notes on a wall does not cut it. Therefore, the team needs something that enables the creation of work items that are visible across all team members. You also might be finding it hard to understand what work is being done, so you want something that makes it possible to see which work items are currently being worked on, and provide an indication of who is delivering a particular piece of work.

Once you understand what problems you are having you can determine what is required from the tool.

Understand how the tooling has been designed to solve a particular problem

For the software team building a tool, a problem such as “running a Scrum team” is a wide problem space. That team will have had to build an understanding of how their software solves that problem, perhaps by looking at best practice or by analysing a number of past successful projects. They will have also made their product adaptable and configurable in order to attract as many customers as possible.

When you look at the tooling from different vendors you may see that they have solved the same problem in different ways. This might be for a number of reasons, which include understanding the problem differently, prioritising different features and trying to differentiate themselves in the market.

How you see the problem space and how the vendor sees it are likely to be different. They will see it from a wider and generic perspective, and you will see it from the narrow, specific point of view of the immediate problem you are trying to solve.

Adapt in both directions

Once you understand the problem you are trying to solve and how the chosen package aims to solve it, both sides need to compromise.

In an ideal world, the software would solve your problem immediately, out of the box or maybe with some minimal configuration tweaks. More realistically you should expect to do some configuration to tailor it to your circumstances. It is likely that products will be designed with this in mind.

However, if you are finding that you are having to warp the product out of all recognition to get it to work for you, even when you are both working in the same problem space, you need to ask yourself some searching questions.

  • How much effort, time and money is required to maintain heavily customised software?
  • If this much customisation is required, are we really in the same problem space?

I have seen a number of teams warp a software solution completely out of shape without asking themselves whether they should change their ways of working to better suit the software. It is highly likely that the software vendors understand the problem space better than you and have created a tool that deals with the majority of typical usage scenarios so why are you different, why are you not typical?

Standardise where possible

Standardise in this context means using the selected product across teams, business units and even organisations. This means you get the benefits of scale. Lessons learnt by one team can be shared with others. A standard solution also means that operating costs are likely to be reduced, as there is only one set of administrators and one set of hosting costs.

This also may mean you don’t get to choose. The software might already be available so this is the default choice.

However, many organisations miss the next part, which is that choice still exists if the product really does not work in a given situation. Unfortunately, the lure of cost savings and efficiencies drowns out the small teams who are greatly hindered when forced to use the wrong tool for the job.

Having a standardised solution is not the end – it is the beginning. The usage of the software needs to be continuously reviewed to see if it is still the right solution. The product needs to be upgraded and patched and there should always be an eye on the other solutions in the problem space. An alternative product that was previously discounted may now be a better fit.

Accept the choice

And finally, if the choice doesn’t go your way you need to accept it. You need to get behind the choice and make it work for you and your team. As technologists, there are always going to be better choices. It is often a case of the “grass is always greener”. And maybe you’re right this time but maybe you’re not. It is all too easy to fall into a trap, where you believe that a tool will solve all problems. Usually what happens is that the tooling will be switched only to find that you have all the same sort of problems. It wasn’t the tooling that was the solution after all.

In my experience the problem usually lies with people before it does with software. If your people can’t solve your problems, then you should not assume that software will.

Unhelpful helpers

In small organisations, everything is managed by a small group of people. There is no infrastructure team, nor is there a release manager. If you expect a project management office, forget it, and the support team can consist of just one person. Essentially everyone can do anything, and as software gets more complex these sorts of teams cannot be blamed for looking for short cuts.

When it comes to Azure it is not realistic to expect everyone to know all the details about all the services. Microsoft realise this because they are sprinkling things like helpers and advisors all over it. The one that has got my attention recently is the Azure SQL advisor.

I have been using Azure SQL for a while now and on the whole it just works. It looks and feels like any other SQL Server database but it just so happens that it is running in the Cloud and some of the stuff we used to worry about, such as backups and patching, is now Microsoft’s problem. Even so, running and tuning a database is still complex and is a specialised skill, so a small general purpose team is thankful for any assistance they can get.

So, when customers are complaining about performance and the Azure SQL advisor is effectively flashing a big green button labelled “make the performance problems go away”, that button is going to get pressed.

But wait…

Teams that are moving to DevOps are control freaks. They are creating sophisticated build and release pipelines and are trying to automate everything. In order to reduce downtime, things like database schema changes are closely planned and controlled using techniques like migrations. Those migrations are run automatically as part of the release process.

What is the impact of applying Azure SQL advisor recommendations immediately?

Let’s say that the advisor has applied an index on a database column used in a critical join. That change is applied immediately and all is good. Performance is improved. Customers stop complaining. Sometime later the development team are building a new feature that will be removing that column. This all works fine during development and testing but when the team try to deploy to Live, their database migrations fail. They are dropping a column that is used in an index, one they did not know about, so there is no chance that it exists in other environments.

This can be quite a problem. Over time a team with a reliable release pipeline start to expect releases to just work and now they have a problem deploying to Live. They don’t know what to do. There is no plan for backing out or fixing forward the deployment.

The solution is quite simple. The team should look at the advisor as another source of work alongside new features and live issues. The development team should view the changes and incorporate them into their database release pipeline so it can be delivered into Live in a controlled and repeatable way. In some cases, the recommendation may relate to a critical performance problem so it becomes the highest priority. The question that the team needs to answer is how quickly this change can get into Live and is the existing process fast enough.
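One way to avoid being surprised is to have the pipeline compare the indexes that exist in an environment with the ones your migrations know about. The Python sketch below shows the idea using an in-memory SQLite database as a stand-in, so the query against sqlite_master is SQLite specific. For Azure SQL you would read the index names from the sys.indexes catalog view instead. The function, table and index names are all hypothetical.

```python
import sqlite3


def unmanaged_indexes(conn, known_indexes):
    """Return the indexes in the database that no migration accounts for."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return sorted({name for (name,) in rows} - set(known_indexes))


def build_example_database():
    """Create a toy database with one managed and one unmanaged index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_ref TEXT)")
    # Created by a migration, so the team knows about it.
    conn.execute("CREATE INDEX ix_orders_id_ref ON orders (id, customer_ref)")
    # Stand-in for an index applied directly to Live outside the pipeline.
    conn.execute("CREATE INDEX ix_auto_customer_ref ON orders (customer_ref)")
    return conn
```

Running unmanaged_indexes(build_example_database(), {"ix_orders_id_ref"}) reports the index that was added outside the pipeline, which gives the team something to turn into a proper migration before it trips up a release.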

Learning from your Failures

Despite your best endeavours, problems will occur. Sometimes they will be big problems.

We work on complex systems which by their very nature are hard to predict. We are often lulled into a false sense of security that makes us think that we understand them only to be surprised when they fail spectacularly in an unexpected way.

Let me give you an example in terms of my cycling hobby.

Last week I was cycling back from work. Suddenly there was a loud bang and I immediately realised my rear wheel had punctured. But WHY?

On examining the tyre I was expecting a simple inner tube failure, something that I was prepared for, but it was quickly obvious that the side wall of the tyre had split. But WHY?

Before continuing the investigation, I had to deal with the immediate problem. Fixing a split side wall is not something I could do by the roadside so I had to think about my immediate options. There weren’t many. There was a bike shop a couple of miles away but that would be a long walk and I would end up buying whatever they had just to get me home. Would they even be open? The other alternative was to ring my other half to be rescued. And this was what I chose.

I had accepted that the problem was real and I didn’t have many options. At this point I wasn’t looking for a perfect solution; I just wanted to recover the situation as quickly as possible. Being rescued looked like the best choice: there was little risk that it would make things worse and it was the scenario where I could be home and dry quickly.

Back to the investigation. I needed to know why I ended up in that situation so I could try to avoid it in the future. WHY did I split my tyre?

In my warm and dry garage, I could see that a build-up of grit on my brake block was rubbing the side wall of the tyre when the brakes were applied. But WHY did that cause a failure?

Well, the grit was only rubbing because a week or so earlier I had fitted new brake pads and on adjusting them I had them slightly too high on the wheel rim. This was not a problem when the pads were clean but when they were covered in road grime they rubbed the tyre. However, this should have not been enough for a complete failure. So WHY did the tyre fail?

When I examined the tyre and where it failed I could see other damage around the failure point that was not apparent elsewhere on the tyre. Casting my mind back I remembered having a puncture some months back where I had trapped the inner tube between the rim and tyre after hastily swapping them onto another set of wheels. When that puncture occurred, I had to ride for over a mile on the flat tyre because I wasn’t somewhere I could stop easily to fix it. Could that puncture and the damage caused by riding on the flat tyre be the root cause of a bigger problem some months later?

The point is that a number of things that are not a problem on their own, have the potential to cause bigger problems when they occur together. The only way you know that these things can cause a problem is to experience it. What I have done above is perform a simple root cause analysis on the problem to understand why it occurred. Many organisations miss this when they have an IT incident. They are relieved that the incident has been resolved and frankly don’t want to think about it again.

Unfortunately for these organisations thinking about the problem is the best way to avoid it happening again. Tracing the chain of events helps identify preventative steps and different ways to monitor and maintain your system to ensure that it is not on its way to another failure.

For me it is ensuring that I clean grime off my brake pads and examine my tyres after any punctures. For your IT systems it might be improving code review processes, doing more thorough testing or adding proactive monitoring. Mistakes and problems will happen. Obviously, you want to resolve the situation quickly and calmly but that isn’t the end. You can learn from problems as they help you understand how your systems really work and to improve them over time so problems are less likely to occur – well at least the ones you know about.