Reading Data from Disconnected Systems
Imagine a scenario where you are tasked with building a Web Portal for customers to review mortgage offer paperwork. The website needs to integrate with a tried and tested back-end Mortgage Application System. Let’s assume for the sake of argument that this system provides web services to obtain offer paperwork. Whilst the web services were implemented some time ago, yours will be the first project to make proper use of them.
How hard can that be?
A few more requirements:
- The Web Portal will be hosted on a cloud provider whilst the mortgage application is hosted in a private data centre.
- The website must be available 24×7 but, due to its age, the mortgage application is only available 8am – 6pm on weekdays and undergoes maintenance outside these times.
- It should be possible to access offer documentation on demand 24×7 from the Web Portal.
Still easy? What are your options?
There are two primary patterns:
- Poll the Mortgage Application web services during its operational hours and transfer any “new” offer documentation to the Web Portal.
- Cache data locally and only call the target web services if the data is not in the cache or if it has expired.
The polling solution attempts to simulate a publish and subscribe model over HTTP. This is limited by the request/response nature of the protocol: the publisher cannot push events, so the client acts as the subscriber by polling a web service endpoint for changes.
The success of this approach very much depends on the interface exposed by the publishing system, which is the Mortgage Application System in this case. How easy is it to determine what is new? How often can the web service be called without adversely impacting the performance of the system? After all, you don’t want a phone call from the Mortgage Application System’s owner asking you to explain why you have crippled it by calling the web services 100 times per second!
Let’s walk through an example.
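As a minimal sketch of keeping the polling polite, the check below only allows a poll inside the operational window from the requirements (8am to 6pm on weekdays) and no more often than a minimum interval. The names `is_operational`, `should_poll` and the five-minute `MIN_INTERVAL` are assumptions for illustration; the real interval is something to agree with the owner of the Mortgage Application System.

```python
from datetime import datetime, time, timedelta
from typing import Optional

# Operational window taken from the requirements: 8am to 6pm on weekdays.
OPEN = time(8, 0)
CLOSE = time(18, 0)
# Hypothetical minimum gap between polls; not a value from the real system.
MIN_INTERVAL = timedelta(minutes=5)


def is_operational(now: datetime) -> bool:
    """True if `now` is a weekday (Mon-Fri) between 8am and 6pm."""
    return now.weekday() < 5 and OPEN <= now.time() < CLOSE


def should_poll(now: datetime, last_poll: Optional[datetime]) -> bool:
    """Poll only inside the window and no more often than MIN_INTERVAL."""
    if not is_operational(now):
        return False
    return last_poll is None or now - last_poll >= MIN_INTERVAL
```

A scheduler would call `should_poll` before each attempt and record the time of the last successful poll.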
The first thing to realise is that mortgage offer documents must be stored locally at the Web Portal in order to allow them to be accessible when the Mortgage Application System is unavailable, but how do they get there in the first place? Let’s assume that you can query the Mortgage Application System for offer documents by customer ID. In theory it would be possible to call the web service for each customer known about by the Web Portal. The result can be compared with the offer documents stored by the Web Portal and any differences applied to its local data store.
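The compare-and-apply step could be sketched roughly as follows. This assumes the web service returns a mapping of document ID to content for a customer; `fetch_offers` is a hypothetical wrapper around that call, and the local data store is modelled as a plain dictionary purely for illustration.

```python
from typing import Callable, Dict, List

# Assumed shape: document ID -> document content (or a content hash).
Documents = Dict[str, bytes]


def sync_customer(
    customer_id: str,
    fetch_offers: Callable[[str], Documents],  # hypothetical web service wrapper
    local_store: Dict[str, Documents],         # stand-in for the portal's data store
) -> Dict[str, List[str]]:
    """Make the portal's copy of one customer's offers match the remote system."""
    remote = fetch_offers(customer_id)
    local = local_store.setdefault(customer_id, {})

    # Work out the differences before changing anything.
    added = [d for d in remote if d not in local]
    changed = [d for d in remote if d in local and remote[d] != local[d]]
    removed = [d for d in local if d not in remote]

    # Apply the differences to the local store.
    for doc_id in added + changed:
        local[doc_id] = remote[doc_id]
    for doc_id in removed:
        del local[doc_id]

    return {"added": added, "changed": changed, "removed": removed}
```

Running this for every known customer on each poll is the brute-force approach described above; whether it is affordable depends on how many customers there are and how expensive each call is.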
But how does the portal know about new customers? Perhaps as part of a registration process the customer could enter their details and a reference number (Customer ID). However, if you stop to think about it, this is not going to provide an acceptable experience if the customer does this between 6pm and 8am. Do we ask them to come back later? Possibly. This could be dealt with by a friendly error message, “your documents are not available at this time”, but it is not really in the spirit of requirement 3), on-demand access 24×7.
Hold that thought!
In the caching solution, a cache stores the results of queries against the Mortgage Application System’s web services. The web services are only accessed directly if the data does not exist in the cache or if it has expired. A retry policy can be applied to these web service calls in order to recover gracefully from temporary connectivity issues.
This solution also forces the question about offer expiry. This will have to be fully understood to correctly implement the cache. You don’t want the customer to review offers that are no longer valid. This is still a concern in the other option but it isn’t as obvious and may be missed.
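A minimal sketch of such a cache, with expiry and a retry policy, might look like this. The class name `ReadThroughCache`, the `loader` callable, and the choice of `ConnectionError` as the signal for a transient failure are all assumptions; `ttl_seconds` is a placeholder for whatever the offer expiry rules turn out to be.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Returns cached values and calls the loader only on a miss or expiry."""

    def __init__(
        self,
        loader: Callable[[str], Any],  # hypothetical web service wrapper
        ttl_seconds: float,            # assumed to be derived from offer expiry
        attempts: int = 3,
        backoff_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        self._loader = loader
        self._ttl = ttl_seconds
        self._attempts = attempts
        self._backoff = backoff_seconds
        self._clock = clock
        self._sleep = sleep
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[0]:
            return entry[1]  # cache hit that has not expired
        value = self._load_with_retry(key)
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def _load_with_retry(self, key: str) -> Any:
        for attempt in range(1, self._attempts + 1):
            try:
                return self._loader(key)
            except ConnectionError:  # assumed to mean a transient failure
                if attempt == self._attempts:
                    raise
                # Exponential backoff: 1x, 2x, 4x ... the base delay.
                self._sleep(self._backoff * 2 ** (attempt - 1))
```

Passing `clock` and `sleep` in as parameters is only there to make the behaviour easy to test without waiting for real time to pass.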
When the customer first accesses the portal we can retrieve the offer documents from the Mortgage Application System and store them in the cache… except when the customer accesses the portal out of hours.
We have taken a different route but ended up in the same place.
Comparing the options
The caching solution moves the offer documentation from the Mortgage Application System to the Web Portal on demand. In order to react to the customer’s request, retry logic has been implemented to limit the impact of any transient connectivity problems. The amount of data transferred in one request is predictable. The rate at which data is transferred is not.
The polling solution works by copying offer documents at regular intervals. Here the rate at which data is accessed is predictable (it is dictated by the schedule), but the volume of data transferred is not.
Assuming the caching solution is set up correctly there is a high chance that the data accessed by the customer is the freshest it can be. With the polling solution the data can be stale. How stale is defined by the polling schedule.
Often stale data is not a problem. All data is stale as soon as it is accessed. However, “stale” sounds bad to many ears, so sometimes this negative connotation makes it a problem. If you have opted for the polling solution you might be tempted to reduce the polling interval to decrease the risk of the customer receiving stale data.
As the interval gets smaller you are effectively providing the retry behaviour that you have to explicitly build into the caching solution. If the communication link failed this time, it’s OK because it will probably be working again when we try again in five minutes.
Challenging Requirement 3)
Every project has a requirement 3). Many have lots of requirement 3)s. If you have a technology-savvy customer you might be able to discuss the flaw in their thinking and get the requirement changed to something more realistic. Often you need to demonstrate the problem. Identify this requirement as a technical risk and use your favourite method to test or experiment. Agile and Lean approaches allow just that. If you are not so lucky, you’ll be working with a demanding client that insists that “requirement 3 is our number one priority” or, maybe worse, a Project Manager or Tech Lead who proclaims “We can do anything”. Well yes, you can if you have a large pile of cash and a lot of patience.
Is this scenario realistic?
I have taken the disconnected nature of the two systems to extremes in the example, but we are working with disconnected systems all the time. As IT professionals we are being asked more and more to integrate bespoke and packaged solutions and work with many different partners and suppliers. Within cloud infrastructures such as Azure and AWS, individual nodes disappear surprisingly regularly and transient communication failures are common. Even when you think you have everything under control, the Fallacies of Distributed Computing remind you of the assumptions you should not make.
Hopefully this post has highlighted some of the techniques to think about when reading data from disconnected systems.