Despite your best endeavours, problems will occur. Sometimes they will be big problems.
We work on complex systems which, by their very nature, are hard to predict. We are often lulled into a false sense of security, believing that we understand them, only to be surprised when they fail spectacularly in an unexpected way.
Let me give you an example in terms of my cycling hobby.
Last week I was cycling back from work. Suddenly there was a loud bang and I immediately realised my rear wheel had a puncture. But WHY?
On examining the tyre I was expecting a simple inner tube failure, something that I was prepared for, but it was quickly obvious that the side wall of the tyre had split. But WHY?
Before continuing the investigation, I had to deal with the immediate problem. Fixing a split side wall is not something I could do by the roadside so I had to think about my immediate options. There weren’t many. There was a bike shop a couple of miles away but that would be a long walk and I would end up buying whatever they had just to get me home. Would they even be open? The other alternative was to ring my other half to be rescued. And this was what I chose.
I had accepted that the problem was real and that I didn’t have many options. At this point I wasn’t looking for a perfect solution; I just wanted to recover the situation as quickly as possible. Being rescued looked like the best choice: there was little risk that it would make things worse, and it was the scenario where I could be home and dry quickly.
Back to the investigation. I needed to know why I ended up in that situation so I could try to avoid it in the future. WHY did I split my tyre?
In my warm and dry garage, I could see that a build-up of grit on my brake block was rubbing the side wall of the tyre when the brakes were applied. But WHY did that cause a failure?
Well, the grit was only rubbing because a week or so earlier I had fitted new brake pads and, when adjusting them, had set them slightly too high on the wheel rim. This was not a problem when the pads were clean, but when they were covered in road grime they rubbed the tyre. However, this should not have been enough for a complete failure. So WHY did the tyre fail?
When I examined the tyre where it had failed, I could see other damage around the failure point that was not apparent elsewhere on the tyre. Casting my mind back, I remembered having a puncture some months back, where I had trapped the inner tube between the rim and tyre after hastily swapping the tyres onto another set of wheels. When that puncture occurred, I had to ride for over a mile on the flat tyre because I wasn’t somewhere I could stop easily to fix it. Could that puncture, and the damage caused by riding on the flat tyre, be the root cause of a bigger problem some months later?
The point is that a number of things that are not a problem on their own have the potential to cause bigger problems when they occur together. Often the only way you find out that these things can cause a problem is to experience it. What I have done above is perform a simple root cause analysis on the problem to understand why it occurred. Many organisations miss this step when they have an IT incident. They are relieved that the incident has been resolved and, frankly, don’t want to think about it again.
Unfortunately for these organisations, thinking about the problem is the best way to avoid it happening again. Tracing the chain of events helps identify preventative steps and different ways to monitor and maintain your system to ensure that it is not on its way to another failure.
For me, it is ensuring that I clean grime off my brake pads and examine my tyres after any punctures. For your IT systems, it might be improving code review processes, doing more thorough testing or adding proactive monitoring. Mistakes and problems will happen. Obviously, you want to resolve the situation quickly and calmly, but that isn’t the end. You can learn from problems: they help you understand how your systems really work and improve them over time, so problems are less likely to occur – well, at least the ones you know about.