Distributed transactions are not a comfort blanket

The business logic in a solution often acts as a Process Manager or orchestrator. This may involve invoking a number of operations and then controlling the process flow based on the result. Depending on what you are building, you may be committing changes to both your own systems and external ones.

Consider the following code.

private void MyBusinessProcess()
{
    // Call the external system, then record the outcome in our own database.
    var result = external.DoSomethingImportant(args, out var errors);
    if (errors.Any())
    {
        MyDb.CommitFailure(result);
    }
    else
    {
        MyDb.CommitSuccess(result);
    }
}

Here the logic invokes an operation on an external class and then, depending on the result, records the outcome – success or failure – to the application’s database.

So the solution above is pushed into Live.

Soon support incidents start coming in indicating that sometimes the calls to CommitFailure or CommitSuccess fail to write their changes. Because there is no record that the call to DoSomethingImportant ever happened, the application tells the user that the operation never executed. When MyBusinessProcess is retried, this time DoSomethingImportant throws an exception because it is not idempotent and calling it twice with the same arguments is not allowed.

For the sake of this post let’s assume that there is no trivial way to stop the transient problems that cause the exceptions in CommitFailure or CommitSuccess. However, the requirement remains that MyBusinessProcess must operate consistently.

The developer who picks up this issue asks around the team about how the external class works. They find that not many people really understand this system, but it is known that when the call to DoSomethingImportant completes it commits its result to its own database, and that when it sees the same arguments again it throws the exception that is the cause of the support incident. The developer examines their development environment and, sure enough, on their local SQL Server alongside MyAppDB there is another database called ExternalDB.

Great. So they implement the following code to wrap the two database operations up in a transaction. Now either all calls commit or all calls are rolled back.

private void MyBusinessProcess()
{
    using (TransactionScope tx = new TransactionScope())
    {
        var result = external.DoSomethingImportant(args, out var errors);
        if (errors.Any())
        {
            MyDb.CommitFailure(result);
        }
        else
        {
            MyDb.CommitSuccess(result);
        }
        tx.Complete();
    }
}

This is tested locally and it seems to work. However, once it hits the first test environment – which in this case is hosted in Azure and uses a separate SQL Azure node for each database – MyBusinessProcess fails every time. For a transaction to work across two SQL Azure nodes, a distributed transaction must be used, and until recently the only way a TransactionScope could achieve this was to enlist in a transaction managed by the Microsoft Distributed Transaction Coordinator (MSDTC), which is not supported on SQL Azure.

I have encountered this problem a couple of times now. I find it interesting that the default option is often to wrap everything up in a transaction scope. Microsoft have done a great job in the language syntax of hiding the complexity of deciding whether a distributed transaction is required and then dealing with the two-phase commit. And that convenience often becomes a problem. Distributed transactions are complex and using them can have a large impact on your application. But because the complexity is hidden, many people have forgotten – or never had an incentive to learn – what is going on under the hood.
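
To make that hidden escalation concrete, here is a minimal sketch of what the developer’s code amounts to once the two databases live on separate servers. The connection strings and method name are hypothetical placeholders, and the behaviour shown is that of TransactionScope on the .NET Framework (System.Transactions with System.Data.SqlClient):

private void MyBusinessProcessEscalates(string myAppDbConnString, string externalDbConnString)
{
    using (var tx = new TransactionScope())
    {
        using (var conn1 = new SqlConnection(myAppDbConnString))
        {
            conn1.Open();   // first durable resource: a cheap, local SQL transaction
            // ... write to MyAppDB ...
        }

        using (var conn2 = new SqlConnection(externalDbConnString))
        {
            conn2.Open();   // second durable resource: the transaction is silently
                            // promoted to a distributed one, which requires MSDTC
            // ... write to ExternalDB ...
        }

        tx.Complete();      // the commit is now a two-phase commit across both databases
    }
}

Nothing in the calling code changes when the promotion happens – which is exactly the convenience, and exactly the trap.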

When I have challenged people about this, the usual defence is that it is easy to implement, so why wouldn’t you do it? However, even today it is common for the people who implement the code not to be the people who have to get it working in a bigger environment.

As in the example, it is typical for the configuration of development, test and production environments to differ, so you may only find problems like the one highlighted above late in the day. You don’t want to discover that none of your transactional logic works just as you are trying to put your solution live. The second thing I have seen is that distributed transactions can seriously constrain the performance of your system. In that situation you may only find you have a problem just as your product is becoming successful.

Distributed transactions, and transactions in general, are said to be ACID – Atomic, Consistent, Isolated and Durable. It is the C that causes all the problems here. Trying to be consistent limits concurrency, as only one party at a time can commit a change. Allowing multiple parties to commit at the same time compromises consistency. When you are trying to get a system working it makes complete sense to be consistent. But when you are trying to implement a system that will see lots of use, that equation no longer seems to make sense.
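
For completeness, the original incident can often be handled without any distributed transaction. The sketch below is one possible shape, not a definitive fix: HasAttempt and RecordAttempt are hypothetical methods added to MyDb so that the intent to call the external system is committed locally before the call is made. A retry can then detect that DoSomethingImportant may already have run and reconcile, instead of calling it again and hitting the not-idempotent exception:

private void MyBusinessProcess()
{
    if (MyDb.HasAttempt(args))
    {
        // A previous run got at least as far as calling the external system.
        // Reconcile from the recorded attempt rather than calling it again.
        return;
    }

    MyDb.RecordAttempt(args);   // committed locally before the external call

    var result = external.DoSomethingImportant(args, out var errors);

    if (errors.Any())
    {
        MyDb.CommitFailure(result);
    }
    else
    {
        MyDb.CommitSuccess(result);
    }
}

The trade-off is a window in which an attempt is recorded but no outcome is – consistency becomes something you manage explicitly rather than something a transaction scope promises you.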

 

Architects and Agile

The Agile manifesto and the principles behind it state the following:

Working software over comprehensive documentation

and

The best architectures, requirements, and designs emerge from self-organizing teams

Some teams take this to mean that, so long as they are building “working” software and they are self-organising, they are doing architecture.

I, for one, don’t want to go back to the days when architects created system designs in their ivory towers for months on end, finally throwing them over the wall to the development team once they were “correct”. But neither do I want the chaos and disorder that can be created when the architecture of complex systems is left purely to chance, or simply left to emerge. Emergent design is not a bad thing in itself; it’s just that you need to know what you are letting yourself in for and what could go wrong.

This is an excellent article that describes what can go wrong with emergent design.

In summary your system changes over time – sometimes in ways you expect and sometimes in ways you don’t.

You might not be surprised to know that I do see the need for architecture in Agile projects but it is very different to the way it was done in the waterfall days.

Architecture is a Role, not a Person

In the same way that testing is a shared responsibility in a good Agile team so is architecture. Again as with testing there may be an architecture specialist in the team but that doesn’t mean they do all the “architecting”. Instead they share their experience and mentor other members of the team to develop an appropriate architecture for the product or system.

I have seen architects trying to fit into the architecture owner role described in some scaled frameworks. They apply themselves to the role in the same way as a Product Owner, distancing themselves from the day-to-day delivery of user stories. Most teams will accept this way of working from a product owner, as product owners tend to come from the business and often don’t have the necessary skills to deliver software. The architecture owner, on the other hand, is talking about technical subjects and trying to influence how the software is built rather than what the software needs to do. Some teams will resist this unless the architecture owner can put their money where their mouth is!

In order to be effective the architecture owner needs to be credible in a team full of technical people. What I mean by this is that they need to be producing alongside their team members. By doing this they understand what it takes to develop the product. By contributing to the delivery of user stories the architecture owner shows that they can apply the concepts they talk about. This can be hard for some architects, as their development skills may make them junior developers on the team, but in my experience it is the best way to become accepted into the team.

Architecture Vision

The architecture owner defines the architecture vision. This is usually high level but provides the general direction for the product. Going back to the earlier article, the vision defines how the system is expected to grow and how it may change over time. It might highlight dependencies that are outside the control of the product but key to its success. This leads to a set of prioritised technical activities that ensure the architecture of the product is fit for purpose at a given point in time. This is often given the name “Architecture Runway”. The name implies that if you don’t have enough runway the plane (the product in this case) will crash!

The architecture runway does not assume that the complete architecture is in place from the start, or even that the architecture could be classified as “correct” at any point in time. It simply allows the architecture to emerge and change in a just-in-time and controlled manner. For example, to get the product off the ground for a small number of customers it may be perfectly acceptable to spin up a single virtual machine hosted by a cloud provider such as Amazon or Google. But as usage of the product grows and the impact of the system failing increases, the architecture may need to become more fault tolerant and recover from failure better. It is not good enough to do this when you find you have a problem – you need to be ready for the problem before it happens.

So how do you order the runway?

Ordering the runway is exactly the sort of thing architects are good at. It involves understanding and assessing the technical risks associated with the product. And those risks are derived from the architectural vision – how is the product going to be used in the future, and what does it need to be integrated with? But it is not just a projection of what-ifs and maybes. It should be based on actual use of the product. If the number of customers is increasing exponentially then you need to think about increasing the size of the infrastructure. If customers are asking for integration with a particular external system then you need to think about how that would work.

Essentially the architecture runway is driven by two things.

The first is technical risk. This tends to be found in the “important things”, or the “things that will be hard to change”. The important things depend on what the product does, and the need to change is driven by the architectural vision.

The second driving force is what you prove by delivering working software. It is the feedback you get from having people use it. No amount of architectural beard-stroking will identify how your customers use your product. No architecture is correct in the face of continuous change and constant feedback.

How long is a piece of string?

This is a post about prediction.

In the past, estimates were king. If you needed to know how long something would take, you asked for estimates.

However, estimates are personal. An estimate is an individual’s opinion, based on their experience and knowledge, of how long something might take.

As a developer I would hate being asked for estimates.

I knew I would be held to any number I provided. Often, given a lack of available information and the pressure for a number, an estimate would be pulled out of the air. I would then suffer the indignity of a team leader halving or doubling my estimate depending on who they were talking to. It was clear to me that estimates were never correct (whatever that means) and that they generally caused stress and pressure all round.

Now I am frank when I talk to people about estimates. They are a guess, pure and simple! When people ask me for estimates I am vocal that they are asking me to guess. That can make people feel uncomfortable, but I’d rather get the cards out on the table from the outset than have everyone pretending that an estimate is something empirical, tangible or even scientific.

Project Manager (PM): What is your estimate for X?
Developer (D): Based on what little I know – 2 weeks.
PM: You don’t sound sure – how confident are you?
D: About 60%.
PM: Okay, I’ll say 4 weeks to be sure.

So the developer guessed. Then said they weren’t that confident. And then the PM doubled the original number. A guess of a guess of a guess! We use terms like man-days and confidence factors to remove that uncomfortable feeling that we are just making things up. But the result of this guessing process is a number on a project plan. That plan becomes a commitment, often a financial one, and suddenly you are being asked why you can’t deliver something in the time you estimated.

Instead of feeling bad about it we should all get comfortable guessing!

Looking at chunks of work in terms of their size and complexity, instead of the time taken to do them, is a good start. You are comparing two things relative to each other within the same context – A is bigger than B but smaller than C. This is how story sizing works. A team uses past experience to size work, and that empirical evidence is used to provide a forward projection of how much work can be achieved in a fixed time. The team delivered X units of work in Y weeks, so it is likely that, if everything stays constant, X units of work will be delivered in the following Y weeks.
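
As a worked example – with entirely made-up numbers, and assuming System and System.Linq are in scope – the projection is just arithmetic over the team’s recorded history:

// Hypothetical history: units of work completed in each of the last three sprints.
var recentVelocities = new[] { 21, 18, 24 };          // per two-week sprint
double averageVelocity = recentVelocities.Average();  // 21 units per sprint

int remainingWork = 105;                              // units still in the backlog
double sprintsRemaining = remainingWork / averageVelocity;   // 5 sprints

Console.WriteLine($"If everything stays constant: ~{Math.Ceiling(sprintsRemaining)} sprints, " +
                  $"or about {Math.Ceiling(sprintsRemaining) * 2} weeks.");

The value is not in the precision of the output – it is still a guess – but in the fact that the inputs are the team’s measured history rather than an individual’s number produced under pressure.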

The group aspect is often overlooked. The work is sized as a group. The work is delivered as a group!

The sizing of the work is independent of who does the work – or it should be, if you want your team to be as effective as possible. Different people will have different views of how the work can be completed, so it is important to get a group perspective on what might be involved. Do not spend too much time on this: even when you are sizing as a group it is still a guess. The process can also be incremental – sizes can be changed when more information is available.

So, given the history of misuse and misunderstanding of estimates, I can sympathise when there is a reluctance to undertake any exercise that would provide some level of predictability.

It is naive to think that a team can work in a silo with no one wanting to know when the work will be completed. If no one cares enough to want to know when a feature is completed or a bug is fixed, then that probably means that no one cares about your product. If so, why are you doing the work?

So let’s assume that you are doing the work because someone does care, be that a customer you’ll never meet, or a more Enterprisey customer that has made business commitments, or someone who is bankrolling your team and wants to see a return on investment.

We accept that sizing is a guess, but that guess can be useful. By comparing your guess against how the team actually performed, you can become better at guessing and so better at predicting when a feature might be completed or a bug fixed. No one is being held to an estimate. If a piece of work takes much longer than its sizing indicated, that is fine. Use it as an opportunity to reflect, understand what happened and improve the team’s predictability going forwards.

All this is gained from a guess!