Never ending stories – Agile Anti Patterns

The Never-Ending Story was a 1984 fantasy film about a boy who reads a magical book that tells the story of a young warrior whose task is to stop a dark storm called the Nothing from engulfing a fantasy world. Apparently it was quite good, but all I can remember about it is the large white flying dog-like creature.

The Never-Ending stories I am more familiar with are those in Agile projects that start out as something assumed to be quite straightforward, but then generate much more work than was expected. Before you know it, not only are you carrying the story forward into the next sprint but it starts recurring across multiple sprints. So, what are the characteristics of a never-ending story?

A story really represents a feature and as such becomes a bucket for all areas of functionality related to that feature

Agile is about reducing feedback loops in order to build confidence that you are always working on the most important thing at any given moment. Often never-ending stories internalise that feedback loop. Maybe the story represents a new area of a solution. You don’t want to do too much up-front design or suffer from analysis paralysis, so you have a story that enables you to try a few things out. This is reviewed with the Product Owner and the rest of the team, which generates new ideas, and new ideas mean new work.

There is nothing wrong with this description so far, however a never-ending story emerges when that new work is added to the original story. Often this is coupled with a feeling that no one really knows what good looks like. The team starts to feel like they are working towards a moving target. It will only be a matter of time before the “When will you be done” questions start. Think for a minute about how it is possible to formulate an answer. Every time the team asks for feedback, more work is generated and so the goal has changed. Yet they are asked to say when work, which they may not know about yet, will be complete.

Work is blocked by external dependencies

Sometimes someone outside the team has a stake in when a story is complete. It may be that you are integrating with a third party and they have onboarding activities. Perhaps there is an external customer that has the final say on whether the work has been delivered to the necessary standard. The important thing to realise is that you cannot control this decision making and it is highly likely that your timescales will not align. The best thing to do is to accept it and then put in controls to minimise the impact on you.

There are a few things you can do to take control.

  1. Understand the requirements of the third parties up front and ensure that they are factored in when creating the stories in the first place. This may feel like up-front design, but you are doing just enough to mitigate the risk of a delay in the future.
  2. Don’t “leave the bonnet open” on work whilst it is outside your control. Deal with feedback whether that is good or bad as new work coming into the backlog and deal with it on a priority basis.
  3. Minimise external dependencies by only having them when you absolutely can’t avoid them or where they add value to solutions you deliver.

Quality defects stopping work being completed

In this situation, the work has been done and the team’s tester is working on it, only to find a quality issue. The story goes back to the developer to be fixed. Subsequently the tester picks up the work, only to find more issues, and repeat. When this has been happening for some time, e.g. it is a recurring theme in multiple stand-ups, the team needs to try to understand why things are bouncing around between two team members. Are the developer and tester working together on the story, or is the developer “throwing the work over the wall”? Are the defects being raised related to the changes being made under the story in question, or are they coincidental? Are the developer and tester on the same page as to the quality metrics for the story?

So what?…

Never-ending stories are bad because they harm your team’s predictability. They are a black hole: they consume effort and people, and no one in your team is really sure when they will be complete. If you are working on one it can be very demoralising. Personally, I am motivated by finishing things, but never-ending stories can really feel like a ball and chain, never allowing you to finish and never enabling you to move on to other things.

Looking back at the examples above, there are a couple of ways to avoid never-ending stories.

Firstly, ensure that you are aware of the work that is being created. Whilst it is normal to create some new work when delivering stories, you need to decide the point at which to call it out, surfacing it as new work in the backlog and letting the PO prioritise it rather than absorbing it. If you are not doing this during a sprint, then a sprint boundary, when the story is being carried over, is a good time to ask yourself whether the story should be broken down.

Secondly, many of the problems with defining the scope and boundary of a story could be resolved by investing time in defining acceptance criteria at the start. You may use a Story Kick-Off for this. The acceptance criteria describe how the story should function, define the quality bar for the story and capture the expectations of third parties. And let’s not forget that we should be aiming to avoid large stories, breaking them down along their natural seams in order to keep your team predictable and high performing.

Azure, Cloud Service & Reserved IPs

Azure is pretty good at getting you up and running quickly. You can get from nothing to a solution in production very quickly. Whilst this approach definitely reduces time to market, it can introduce growing pains along the way. Let’s consider Cloud Services as a specific example of how growing pains might manifest themselves.

When you create a Cloud Service you get two IP addresses, one for each slot, Staging and Production. These are allocated from a huge range Azure manages for each region and you have no guarantee of which IP you’ll get. When you’re setting up your Cloud Service you probably didn’t worry about that. As time passes and the solution matures, you may have used those IP addresses to create firewall rules to your databases in Azure, and perhaps even given them to third parties to be whitelisted so that your application can access another service.

The specific IP addresses that were allocated at random by Azure are now critical to the success of your solution. And guess what: those IPs are not as permanent as you might think. If for whatever reason your Cloud Service is deallocated, the IP addresses will be lost. When you recreate the Cloud Service it will be allocated a new IP address. All of your firewall rules now don’t work. That might not be a major problem for your own rules, which you can hopefully change rapidly, but it might be a problem if you are working with a supplier that has a two-week turnaround SLA for “minor changes”.

This is where Reserved IPs come in. They are a means to control the lifespan of an IP address by effectively taking ownership of it in your Azure subscription. Azure will never reclaim an IP address whilst it is reserved in your subscription. The following PowerShell command will create a new reserved IP address.

New-AzureReservedIP -ReservedIPName very-important-ip -Location "UK South"

And this command associates the IP address with an existing Cloud Service.

Set-AzureReservedIPAssociation -ReservedIPName very-important-ip -ServiceName MyBrilliantService

However, we might already have a Cloud Service whose IP address has become important. Changing it would cause unacceptable problems. Luckily it is possible to create a reserved IP from an existing cloud service. The following PowerShell command creates the reserved IP using the IP address of the Staging slot of a Cloud Service and also creates the association between the reserved IP and the Cloud Service.

New-AzureReservedIP -ReservedIPName very-important-ip -Location "UK South" -ServiceName MyBrilliantService -Slot Staging

A few notes about these commands:

  1. The -Location is the region in which the reserved IP will be created.
  2. The -Slot is an optional argument on these commands. It lets you target either the Staging or Production Cloud Service deployment slot. Production is the default.
  3. Reserved IPs are a “classic” Azure feature. As such, resource groups are a meaningless concept; you’ll see all your reserved IPs deployed to a Default resource group.
  4. The first five reserved IPs are free, but you should be aware that managing more than this is not. You are charged based on the time you hold on to an IP address, which is, in effect, to dissuade you from holding on to a large number of publicly routable IPv4 addresses, which are increasingly becoming a limited resource. https://azure.microsoft.com/en-us/pricing/details/ip-addresses/
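
Given that charging model, it is worth auditing your reservations from time to time. Here is a minimal sketch using the same classic PowerShell module; the InUse property is what I would expect on the output and is worth verifying against your module version.

# List every reserved IP in the current subscription
Get-AzureReservedIP

# Flag reservations that are not associated with any service
# (you are still paying for these beyond the free allowance)
Get-AzureReservedIP | Where-Object { -not $_.InUse } | Select-Object ReservedIPName, Address, Location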

Let’s talk about deallocation. Over the lifetime of your solution your architecture will evolve. You might need to move an IP address from one resource to another, or you may want to release IP addresses that you no longer use. Once you have an IP address reserved, you own it until the reserved IP resource is deleted. The only way it can be used is by creating an association. To use it elsewhere it must be deallocated from the original resource and associated with the new resource. It is important to note that during this process the original resource will receive a new IP address from Azure’s pool.
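
As a sketch of that move, again using the classic PowerShell module (the second service name here is hypothetical):

# Break the association; the original service immediately gets a new IP from Azure's pool
Remove-AzureReservedIPAssociation -ReservedIPName very-important-ip -ServiceName MyBrilliantService -Force

# Associate the reserved IP with the new Cloud Service
Set-AzureReservedIPAssociation -ReservedIPName very-important-ip -ServiceName MyShinyNewService

# Or, if the address is genuinely no longer needed, delete the reservation entirely
Remove-AzureReservedIP -ReservedIPName very-important-ip -Force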

When playing around with reserved IPs I noticed a couple of behaviours that are worth noting.

Firstly, once a Cloud Service has a reserved IP you must specify its name when deploying the Cloud Service. Remember that the name of the IP should map to the one used for the particular slot. You do this by adding a NetworkConfiguration section to the service’s ServiceConfiguration file.

<?xml version="1.0" encoding="utf-8"?>
<ServiceConfiguration serviceName="MyBrilliantService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration" osFamily="4" osVersion="*" schemaVersion="2015-04.2.6">
  <Role name="MyRole">
    …
  </Role>
  <NetworkConfiguration>
    <AddressAssignments>
      <ReservedIPs>
        <ReservedIP name="very-important-ip"/>
      </ReservedIPs>
    </AddressAssignments>
  </NetworkConfiguration>
</ServiceConfiguration>

I found that when the reserved IP was not referenced, I received the following error when deploying. I believe that by not specifying the IP, the deployment process assumes you are changing it, which is not allowed.

Set-AzureDeployment : BadRequest: A reserved IP cannot be added, removed or changed during deployment update or upgrade.

Secondly, if you add a reserved IP to one slot of your Cloud Service, you must also add one to the other if you want to be able to swap the deployment slots. You’ll get this error if you forget.

Move-AzureDeployment : BadRequest: Cannot swap VIPs when only one deployment has a Reserved IP.

Finally, as the number of Cloud Services in a particular environment grows and the number of environments increases, the management overhead for the individual reserved IPs increases greatly. Let’s say you have 6 cloud services in 4 environments. That is:

6 cloud services * 2 deployment slots * 4 environments = 48 reserved IPs

In that case it might be better in the long run to build up a VNET with a subnet for each environment and then have a virtual network appliance presenting these networks to the Internet on a smaller range of IPs.

For further reading on Reserved IPs, take a look at the official Azure documentation.

The state of “Not Invented Here Syndrome” in 2017

Development teams often build up high levels of trust internally due to the nature of the constant collaboration between team members. Whilst that internal trust increases and increases, it can cause a lack of trust of outsiders, whether that be third parties or even other internal teams. So, when there is a genuine case for reuse there is often a strong argument against it. A common one is that the high quality standards of the team can only be assured if code is written in house.

And why not? Developers like writing code, therefore given the chance they will write “all the code”. But code has a cost in terms of maintaining a solution over time. And we will have to support the solution, because software isn’t written once then forgotten about; it continuously evolves. And let’s not forget that writing scalable, reliable and adaptable distributed systems is hard. Who really wants to be debugging a custom load balancing solution when your system is on its knees and customers are beating down your door? Why invest the next couple of months building yet another custom security solution when your competitors seem to be releasing new features every few weeks?

The IT industry is seeing trends that will hopefully consign that old insular mindset to the history books.

Cloud computing offers, amongst many other advantages, the opportunity of offloading complexity on to some other party. Why worry about heating and air conditioning in a custom data centre when all you really need to do is build a website? Economies of scale means that costs are substantially reduced but you need to remember that cloud offerings are built for the masses and if you don’t fit then you may not get the benefits you expected. Cloud solutions such as Azure and Amazon Web Services practically offer a menu of services that you pick based on your requirements for ease of use vs the flexibility and control that you need. At the extreme, serverless computing promises that you can deploy and run code in the cloud without ever worrying about how the underlying infrastructure will be scaled to meet demand.

There is a trend where many companies are reinventing themselves as tech companies. Netflix and Amazon are just a couple of examples of companies that, in order to be disruptive in their particular marketplaces, transformed themselves into technology companies. Over the last few years this has reached a tipping point and now many organisations are trying the same thing and expecting the same results. Whilst it is true that IT is fundamental to many business models and being technically savvy as an organisation has a key role to play, it is unlikely that everyone needs to code their IT from the ground up.

By looking at the first movers in that space you see technologies being developed in house to solve a particular problem and then shared back to the community. Google created AngularJS and Facebook the Cassandra NoSQL database. Today anyone can pick up these projects for their own use and perhaps more importantly they can contribute to them allowing them to evolve independently.

So, my vision of a team that is successfully avoiding NIH Syndrome in 2017 is one that

  • Has a wide understanding of the technology landscape
  • Does not exhibit siloed thinking about technology stacks, particular products or architectures
  • Has the time and space to try new things
  • And is encouraged to contribute back into the community that they take from.

Reusing open source software is not like picking apples from somebody else’s orchard. It is a two-way proposition. You use an open source project to enhance your own product, usually to save cost and time. Therefore, you should invest some of that time back, even if it is simply to fix bugs or answer questions on Stack Overflow. And herein lies the challenge. Many organisations do not yet see the value of reinvesting in the community that bootstrapped them to where they are today, and are so single minded they cannot see beyond their own immediate business pressures to deliver more features. Whether this approach is sustainable, I’m not sure. But as more and more companies transform into technology companies, the cream of the development world will come to expect certain values from their employers, and as you know, the cream rises to the top.

Permission, Ability and Skills

It could be argued that the primary purpose of a software developer is to provide solutions to problems. Problems are presented in the form of requirements or user stories and it is the software developer’s job to provide a solution. They provide a solution with the tools that are available and within the constraints that exist within the organisation in which they work. Normally this is done unconsciously and we don’t stop to think about it. We open up our IDE of choice and start coding.

It is only when we start to encounter what might be considered more difficult problems that we start to see the limitations of this approach. What happens if the problem is that of building up a continuous delivery approach for your team, or providing a level of disaster recovery for your product’s infrastructure?

Torturous Analogy Time.

Let’s imagine that the problem is that I need to travel from London to New York for a conference. I have permission from my organisation to do this but I have no other help. Let’s say I work for an aircraft manufacturer and they have a plane sitting outside ready to be used. They don’t want to waste money on buying a standard airline ticket when they have a multi-million-pound asset sitting outside. (I told you this is torturous!) So now I have the ability to get from London to New York, but that still isn’t helping me because I can’t fly the plane; I don’t have the necessary skills.

In order to solve the problem I need the permission to tackle it, the ability, in the form of technology, and the skills to use that technology to solve the problem at hand.

In software development we’ll nearly always have the permission and the tools, but that is not always enough to solve the big problems such as those mentioned above: building up a continuous delivery approach for your team or providing a level of disaster recovery for your product’s infrastructure.

Solving the big problems often requires organisational transformations, which require at least three aspects to change: people, process and tools. Simply having the tools is not normally enough. It’s like having the plane without the skill to fly it. As technologists, we want to believe that it is the tool that provides the solution to a problem, but it isn’t.

We are doing ourselves a disservice really. We provide the skills that ultimately solve the problem. The technology simply provides the ability to bring those skills to bear.

Traffic Manager Profile – Automating

If you have been following my series (part1 & part2) on Traffic Manager profiles, you should understand how to create a setup that can select between a number of App Service endpoints based on a routing method. You’ll also be able to create custom domains, so your customers don’t have to use *.trafficmanager.net and *.azurewebsites.net based addresses.

Up to now, all of these changes have been achieved manually through the Azure portal. Whilst this is acceptable for proofs of concept and small scale deployments, doing everything by hand very quickly becomes error prone and time consuming. When I need to provision a number of environments with near identical setups I prefer to automate via Azure Resource Manager (ARM) templates.

I have written about ARM templates previously so I am not going to cover old ground. Instead I will cover one of the problems I bumped up against when trying to automate the setup described in the diagram below.

[Diagram part3.1: the three resource group topology described below]

Here I have three resource groups.

  1. One for the primary region that contains a web site and the API supporting it
  2. One for the secondary region which is a copy of the first, providing basic business continuity in the case of failure
  3. One containing traffic manager profiles for both the web site and the API.

Before I go on, there is one important thing to highlight about ARM templates. You cannot have one template that affects multiple resource groups. Therefore I had to have at least three template deployments to achieve this automation. Full disclosure: I actually achieved this with two templates. One had two different parameter sets for the web/API sites in UK West and UK South, and one was for the traffic manager profiles, making three deployments.

Creating a template to deploy the web site and API turned out to be quite simple.  I could use different parameter files against the same templates to give me two of the three resource groups I needed. This created the website, API and a custom domain for each.
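
As an illustration of that shape, here is a hedged sketch of the three deployments using the AzureRM PowerShell cmdlets. The resource group, template and parameter file names are hypothetical.

# Same template, two parameter sets: one resource group per region
New-AzureRmResourceGroupDeployment -ResourceGroupName "my-app-ukwest" `
    -TemplateFile .\webAndApi.json -TemplateParameterFile .\ukwest.parameters.json
New-AzureRmResourceGroupDeployment -ResourceGroupName "my-app-uksouth" `
    -TemplateFile .\webAndApi.json -TemplateParameterFile .\uksouth.parameters.json

# A second template for the traffic manager profiles in their own resource group
New-AzureRmResourceGroupDeployment -ResourceGroupName "my-app-traffic" `
    -TemplateFile .\trafficManager.json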

The difficulty came when creating the traffic manager profiles. Whilst setting up the profile rules and the endpoints was easy enough, there was no clear way to provide a custom domain so customers don’t need to use the *.trafficmanager.net address. Why?…

The custom domain must be added to the app services, not the traffic manager profile or the endpoints. As these app services live in different resource groups I could not affect them from the traffic manager profile deployment.

So why not add this step to the template that provisions the App Services? I did try this but still hit problems. Provisioning the app service first didn’t work because the CNAME for the custom domain I was provisioning was configured at my DNS registrar to point at the traffic manager profile address, which didn’t exist yet. As part of provisioning the custom domain, Azure couldn’t verify that it was valid.

Reversing the provisioning order did not help either. It is not possible to provision endpoints to a traffic manager profile if the app services they point at do not exist yet.

I am not the sort of person to give up easily, so I wanted to see if there was a resource group topology that could be provisioned automatically. This time I tried the configuration below.

[Diagram part3.2: the traffic manager profiles moved into the same resource groups as the app services they manage]

The main change is that the traffic manager now exists in the same resource group as the app services it is managing, so this should just work…

Unfortunately this was not the case.  This time something more subtle was causing problems.

Remember that last time I pointed out that when you create the traffic manager profile in the Azure Portal, a *.trafficmanager.net custom domain is added to each underlying app service. And in order to access the traffic managed site through a friendly domain name, you also need to add a custom binding to each app service.

In my template, I was provisioning the traffic manager profile and its endpoints once the app services were created. Once the endpoints were available I’d then add the custom domain to each one. This final step failed because, when provisioning the traffic manager profile endpoints, adding the *.trafficmanager.net domain to the app services is done asynchronously. Adding my custom domain whilst this was happening caused a conflict.

This Stack Overflow question covers a very similar issue. I tried the recommendation of using the dependsOn element to change the order in which the resources were provisioned. The best I could achieve was a template that would fail on the first attempt but then work on subsequent runs. Not great, but at least it failed reliably, and I could get a working environment eventually.
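
If you have to live with that behaviour, the deployment itself can at least be wrapped in a simple retry. This is only a sketch of the workaround described above, not a fix, and the resource group and file names are hypothetical.

# Retry the deployment; in my case the second attempt succeeded once the
# asynchronous *.trafficmanager.net domain registration had settled
$maxAttempts = 3
for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    try {
        New-AzureRmResourceGroupDeployment -ResourceGroupName "my-app-ukwest" `
            -TemplateFile .\webAndApi.json -TemplateParameterFile .\ukwest.parameters.json `
            -ErrorAction Stop
        break
    }
    catch {
        if ($attempt -eq $maxAttempts) { throw }
        Start-Sleep -Seconds 60   # give the asynchronous step time to complete
    }
}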

I have not been able to get any further than this. I can live with this for now but this is something I’ll keep an eye on and update this post if I find a resolution.

Traffic Manager Profiles – Custom Domains

Last time I walked through a basic traffic manager setup. As with most walkthroughs this should help get you up and running, but it doesn’t cover some of the things you need to consider to make this a real world solution. So, this time I’m going to delve into some of those considerations. I’ll cover the Azure side of setting up a custom domain to give your site a realistic presence for your customers. To better understand this, I’ll look at what Traffic Manager Profiles are actually doing under the covers with requests coming from clients.

After setting up a traffic manager profile if you look at the custom domains for your site you’ll see something like this.

[Screen grab part 2.1: the custom domains blade for the app service]

You’ll see that Azure has added a custom domain for azurewebsites.net, so you have a means of accessing the site even if you do nothing else. It is greyed out as you cannot remove it. In the screen grab I have also added a custom domain in order to give the site a friendly name. To get this to work you need to set up the relevant DNS A or CNAME records, whether that is in Azure itself or via a third party. Azure will only allow you to add this after verifying that the domain records are correctly registered.

If you set up a traffic manager profile and add your web site as an endpoint, when you look at the custom domains again you notice a change. An entry for the traffic manager profile has been added, but why? To understand that you need to look at what the traffic manager profile is really doing.

[Diagram part 2.2: a client resolving the trafficmanager.net address via DNS]

When a client makes a request for tmpprofile1233.trafficmanager.net, a DNS lookup is required to resolve the address. Normally this would result in the same IP address (in the case of an A record) or domain name (in the case of a CNAME record) for every lookup. If the result is a domain name, the process is repeated until an IP address is returned. The client then uses this IP address to talk to the web application directly. Traffic is not being routed through the DNS infrastructure for each request, nor is a lookup done each time. The client holds on to the IP address for a set period of time, called the Time To Live (TTL), and only looks the address up again when this time has expired.
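
You can watch this resolution chain for yourself. A quick sketch using the Resolve-DnsName cmdlet that ships with Windows PowerShell, against the example profile name above:

# Follow the CNAME chain from the traffic manager address down to an IP address.
# The TTL column shows how long the client may cache each answer.
Resolve-DnsName -Name tmpprofile1233.trafficmanager.net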

Traffic manager profiles provide a set of rules so different domain names are returned based on the routing method and the endpoint configuration e.g., the number of endpoints, their weight and priority. You also define a TTL which is lower than normal to ensure that address lookups occur more regularly. This ensures that clients are not disrupted for too long in the case of a failure.

[Diagram part 2.3: traffic manager returning an endpoint’s domain name based on its rules]

Based on its rules, traffic manager will provide the domain for one of your endpoints, such as uksouth-dev.hamersmith.space. The client will then resolve that to an IP address and talk to it directly. This explains why trafficmanager.net addresses show up in each of your app services’ custom domains lists. It is also why you configure a shared domain name such as dev.hamersmith.space at each site and not in the traffic manager profile itself.

[Screen grab part 2.4: custom domains for one of the traffic managed app services]

In the screen grab above I have a local, pretty domain, xxx-dev1.hamersmith.space, that routes the clients directly to this web site. This is useful for testing purposes to bypass any traffic manager policies. You’ll also see the shared domain name xxx-dev.hamersmith.space which is needed to ensure that the site works correctly when it is picked by the traffic manager policy.

It took a while to get my head around this when I first used traffic manager, but once you walk through what it is doing it starts to make more sense.

Traffic Manager Profiles

Azure is pretty reliable, and in many situations you get everything you want for all your business continuity needs without looking beyond a single Azure region. However, you’ll often be working with customers that need to be assured that your solution will keep working if some or all services in a single Azure region should fail. And regional level failures do happen.

Microsoft Azure hit with widening outages in Europe and India [Sept 2016]

Amazon AWS S3 outage is breaking things for a lot of websites and apps

Whilst outages don’t happen that often, you should assume that they will at some point and you should be prepared.

The rest of this post discusses how to provide cross region redundancy for any Azure App Service.

When setting up an Azure App Service, that service runs in a Service Plan which is effectively a description of the server farm your app service is running on. Through the service plan, you can ask for bigger servers (scale up) and for more servers (scale out). Whilst it is true that having additional servers does increase the reliability characteristics of your application, service plans are really about availability and performance – not disaster recovery. If there was a problem with the underlying Azure infrastructure supporting your service plan, then there is a risk that your entire service is dead.
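
Both scaling dimensions are scriptable against the service plan. Here is a sketch with the AzureRM cmdlets, assuming hypothetical resource names; note this improves availability and performance within a region, not disaster recovery.

# Scale up: move the plan to bigger servers
Set-AzureRmAppServicePlan -ResourceGroupName "my-app-uksouth" -Name "my-plan" `
    -Tier Standard -WorkerSize Medium

# Scale out: add more servers to the farm
Set-AzureRmAppServicePlan -ResourceGroupName "my-app-uksouth" -Name "my-plan" `
    -NumberofWorkers 3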

This is where Traffic Manager Profiles come in. These profiles sit in front of your App Service and distribute clients across multiple instances. This is not that useful when you are deploying your app service to one region, but if you have a copy in at least one other region, things get more interesting. In this case traffic manager profiles can select an app service based on one of three routing methods:

  • Priority
  • Weighted
  • Performance

The linked article describes these three methods in more detail, but for the purposes of this post I’ll only be considering the weighted option, and I’ll configure it so that 50% of the traffic is routed to one region and 50% to the other.

Setting up traffic manager profiles through the Azure portal is pretty straightforward. Once you have your App Service deployed to two separate regions, you simply create the Traffic Manager Profile resource.

[Screen grab Part1: creating the Traffic Manager Profile resource]

The important point to call out here is the name of the Traffic Manager Profile. This will become the Fully Qualified Domain Name (FQDN) for your traffic managed site. It will be the way your solution is accessed, so make the name meaningful; it also needs to be unique. In my case the FQDN is my-tm123.trafficmanager.net.

[Screen grab Part2: the basic Traffic Manager Profile]

Here I have a basic Traffic Manager Profile but it isn’t very useful yet as it hasn’t got any endpoints. You can add endpoints through the obvious “Endpoints” option under the Traffic Manager Profile settings.

[Screen grab Part3: adding an endpoint to the profile]

Here you can create an endpoint, which can be a Cloud Service, App Service, App Service Slot or the public IP address of something else. That something else could be hosted in Azure, but it could also be anywhere else out on the Internet. When selecting App Services, all of those in your Azure subscription are displayed. Notice that when adding your subsequent endpoints, the UI will continue to display all App Services, even the ones already wired up to endpoints, but you will receive a validation error if you attempt to create a duplicate.

To achieve an even distribution the weight parameter should be the same for all endpoints. Weight can be any integer between 1 and 100, so whilst configuring each endpoint’s weight to 1 will work, it might be better in production situations to use values that are more intuitive, such as (50, 50) or (30, 30, 40).
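
The same setup can also be scripted. Here is a sketch using the AzureRM Traffic Manager cmdlets, reusing the profile name from above; the resource group and app service names are hypothetical, and the two app services are assumed to already exist.

# Create a weighted profile; the RelativeDnsName becomes <name>.trafficmanager.net
$tmProfile = New-AzureRmTrafficManagerProfile -Name "my-tm123" -ResourceGroupName "my-tm-rg" `
    -RelativeDnsName "my-tm123" -Ttl 30 -TrafficRoutingMethod Weighted `
    -MonitorProtocol HTTP -MonitorPort 80 -MonitorPath "/"

# Look up the two app services so we can reference their resource IDs
$ukSouthApp = Get-AzureRmWebApp -ResourceGroupName "my-app-uksouth" -Name "myapp-uksouth"
$ukWestApp  = Get-AzureRmWebApp -ResourceGroupName "my-app-ukwest"  -Name "myapp-ukwest"

# Add each region's app service as an endpoint with equal weight
New-AzureRmTrafficManagerEndpoint -Name "uksouth" -ProfileName $tmProfile.Name `
    -ResourceGroupName "my-tm-rg" -Type AzureEndpoints -TargetResourceId $ukSouthApp.Id `
    -EndpointStatus Enabled -Weight 50
New-AzureRmTrafficManagerEndpoint -Name "ukwest" -ProfileName $tmProfile.Name `
    -ResourceGroupName "my-tm-rg" -Type AzureEndpoints -TargetResourceId $ukWestApp.Id `
    -EndpointStatus Enabled -Weight 50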

That gives you a basic Traffic Manager setup. At this point you can access your site on a *.trafficmanager.net address and the endpoint will be selected via the routing method.  In the case of a failure of one of the regions your service will still be operating albeit with reduced capacity. You could combine this setting with elastic scaling rules in your App Service which would increase the size of the farm in the event of a failure to compensate for this.

Traffic Manager will detect when there is a problem with an App Service, either by querying its status (e.g. Started, Stopped) or by monitoring the site itself. By default, it probes the root of the site on port 80. If it receives a 200 HTTP status, all is good; anything else is considered a failure. All services must make the same endpoint available for monitoring, and the Traffic Manager Profile can only be configured to monitor one thing. So, whilst this is suitable for basic situations, it is not as sophisticated as the monitoring you might expect from a fully featured load balancer.

At this point you have provided a solution that can handle failures within one region. Traffic Manager will direct clients to an alternative. So, we’re done, right?

Not yet… One of the things I often see missing from solutions put in place for DR or business continuity is actually testing what happens when things go wrong. You should test to understand how the solution recovers from failures and to assess the potential impact on connected users. With traffic manager profiles, you have a few options at your disposal.

  1. You can stop the App Service in the Azure Portal
  2. You can disable the endpoint from within traffic manager (options 1 and 2 are sketched below)
  3. Through configuration or otherwise, you can cause the monitored endpoint to simulate a failure
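
For the first two options, a minimal sketch with the AzureRM cmdlets, reusing the hypothetical names from the earlier sketch:

# Option 1: stop the app service in one region to simulate a regional failure
Stop-AzureRmWebApp -ResourceGroupName "my-app-ukwest" -Name "myapp-ukwest"

# Option 2: disable the endpoint so traffic manager stops handing out its address
Disable-AzureRmTrafficManagerEndpoint -Name "ukwest" -Type AzureEndpoints `
    -ProfileName "my-tm123" -ResourceGroupName "my-tm-rg" -Force

# Re-enable / restart once the test is complete
Enable-AzureRmTrafficManagerEndpoint -Name "ukwest" -Type AzureEndpoints `
    -ProfileName "my-tm123" -ResourceGroupName "my-tm-rg"
Start-AzureRmWebApp -ResourceGroupName "my-app-ukwest" -Name "myapp-ukwest"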

In each of these situations you need to observe the impact on a user connected to the failed endpoint during the switchover. Do they see any errors? Can their session continue once the endpoint is switched, or do they need to start again? Bear in mind that a failure is unlikely but the impact could be big; you might inconvenience some users, but looking at the bigger picture you still have a service. And this is why testing switchover is important. You don’t know when a failure might occur, but you need to be prepared in case it does.