Design for updates

When designing new system architectures, you really must design for updating unless the system is totally trivial. This isn’t hard to do as long as you do it systematically and from the ground up. You can tack it on afterwards, which is more work than it needs to be, but even then it’s worth it.

I’ll describe how it’s done for a system based on a relational database. It does not matter what sits above the database; even if it’s an object-relational layer, the method is still applicable.

I have a strong feeling that the problem is trivial on graph databases, since nodes and relations allow versioning by their very nature. I haven’t designed systems on graph databases yet, though, so I’m not 100% sure.

The reason I’m going into this now is that the iotaMed system must be designed with upgrades in mind, avoiding downtime and big bang upgrades. Just this weekend, the Cambio Cosmic system in our province (Uppsala, Sweden) is undergoing yet another traumatic upgrade that involves taking the entire system offline for a couple of days. Very often the system doesn’t come back up on time, or doesn’t come back without serious problems, putting patients at risk entirely unnecessarily. The only reason these things need to happen is poor system design. A little forethought when building the system could have made a world of difference. The PMI communication system I built, which is used in Sweden, has never (as far as I know) been taken down for upgrades, even though upgrading both client and server systems is an easy and regular activity, and Cosmic could relatively easily have been built the same way. But it wasn’t, obviously. It’s not rocket science, exactly; just see for yourself in what follows.

The problem

The first step is to understand why the database structure is so tightly bound to the code layers, necessitating a simultaneous upgrade of the database structure and the code that accesses it. In the common and naive form, the data access layer is made up of application code outside the database itself that directly accesses tables for reads and writes. If there are several data access layer modules, all of them contain direct knowledge of table structures and all of them need to be updated simultaneously with the table structures. This means that if you change table structures you need to do all the following in one fell swoop, while all users are taken offline:

  1. Change table structures
  2. Run through all data and convert it with scripts
  3. Replace all application code that directly accesses table data
  4. And while you’re at it, compound the disaster by also replacing the business layer and user interface code to use the new functionality added in the data access layer and database, while removing old application code that won’t work with the new data structures anymore

This is what we call a “big bang” upgrade, and it practically always goes wrong (that’s the “bang” part). It’s extremely hard to avoid disasters, since you have to do very detailed testing on secondary systems, and you’ll never achieve full coverage of all the code, or even of all the little bits of deviant data in the production database that will inevitably screw up the upgrade. And once you’re in the middle of the upgrade, you can’t back out unless you have full downgrade scripts ready that have been fully tested as well. The upgrade scripts, hard as they are to get right, are actually simpler than the emergency downgrade scripts, since the latter must take into account that the downgrade may be started from any point in the upgrade.

The result is that these upgrades are usually done without any emergency downgrade scripts. It’s like the high wire without a safety net. The only recourse is a total restore from backups, taking enormous amounts of time and leaving you shamefaced and nervous back at the spot you started after a long traumatic weekend. Since backing down is a public defeat, you’re under emotional pressure to press ahead at almost any cost, even in the face of evidence that things will probably go badly wrong, as long as there’s a chance of muddling through. Your ego is on the line.

This is no way to upgrade critical medical systems, but this is how they do it. Shudder.

The solution

The solution is to isolate the database structures from the application code. If different data access layers in application code can coexist, each using its own view of how the database is structured, while still accessing the same actual data within transactions, you can let two versions of the application code stack coexist while working on the same data. If you can swing this, you’re free to change the database table structure without the application code in the data access layer even noticing. This means you can simply add the new view of the database that a new version of the application needs, add the new application code, and start running it without removing the old application code or clients. New-version clients, business layers, and data access layers run in parallel with old versions, so the upgrade can be slowly distributed out over the user population, allowing you to reverse the rollout at any point in time. Let it take weeks, if you wish. Or even leave some old clients there forever, if there’s a need for that.

To achieve the needed isolation, simply disallow any direct access to any tables whatsoever from application code. All access must go through stored procedures or views.
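
As a minimal sketch of what that rule looks like in practice (T-SQL-flavoured syntax, with invented table and procedure names), the application never touches the table, only the procedure:

    CREATE TABLE dbo.Patient (
        PatientId   INT IDENTITY PRIMARY KEY,
        PersonalNr  CHAR(12)      NOT NULL,
        FullName    NVARCHAR(200) NOT NULL
    );
    GO

    -- The only read path the application is allowed to use.
    CREATE PROCEDURE dbo.FindPatient
        @PersonalNr CHAR(12)
    AS
    BEGIN
        SELECT PatientId, PersonalNr, FullName
        FROM dbo.Patient
        WHERE PersonalNr = @PersonalNr;
    END;
    GO

If the Patient table later changes shape, only the body of FindPatient changes; the application’s call stays identical.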

SQL VIEWs were actually designed to achieve exactly this: different views of the table data, removing the application code’s direct dependency on the tables. The problem was clearly defined, known, and solved before SQL even hit the scene, so why are we still arguing about it now? As an aside, I never use VIEWs, only stored procedures, since I can achieve the same effect with fewer constraints, but that doesn’t detract from the argument that the problem was both recognized and solved ages ago.
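
Even though I don’t use them myself, here is roughly what the VIEW variant looks like (again, a sketch with invented names): the physical table has been restructured, but old application code still sees the shape it was originally written against:

    -- New physical structure: the name is now split into two columns.
    CREATE TABLE dbo.Caregiver2 (
        CaregiverId INT PRIMARY KEY,
        FirstName   NVARCHAR(100) NOT NULL,
        LastName    NVARCHAR(100) NOT NULL
    );
    GO

    -- Old code keeps selecting from "Caregiver" and still gets its single FullName column.
    -- (Writes through the view would need an INSTEAD OF trigger, not shown here.)
    CREATE VIEW dbo.Caregiver AS
        SELECT CaregiverId,
               FirstName + ' ' + LastName AS FullName
        FROM dbo.Caregiver2;
    GO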

Let’s assume you need to change the name of a table for some reason. (I never do that; I’m just picking a really radical example that ought to make your hair stand on end.) The migration, sketched in code below, goes like this:

  1. Edit the stored procedures that access the table so they check whether the old table name still exists and, if not, use the new table name instead.
  2. Create the new table.
  3. Create trigger code on the old table that applies any changes to the new table as well.
  4. Use the core of that trigger code in a batch to transfer all the old table contents that the trigger never gets to see over to the new table.
  5. Check a couple of times during actual use that the two tables keep matching in content.
  6. Drop the old table (or rename it if you’re chicken, and drop it later).
  7. Finally, remove the check for the old table name from the stored procedures that access the table.

Yes, this is somewhat “exciting”, but I’ve done it a number of times on critical systems and it works. And if you can do this, you can do anything.
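
Here is a rough T-SQL-flavoured sketch of steps 1 through 4, assuming a hypothetical old table Visit being renamed to Encounter (a real migration would also need a delete trigger and the verification queries from step 5):

    -- 1. Procedure that tolerates either table while the migration is in flight.
    CREATE OR ALTER PROCEDURE dbo.GetVisit @VisitId INT
    AS
    BEGIN
        IF OBJECT_ID('dbo.Visit', 'U') IS NOT NULL
            SELECT VisitId, PatientId, VisitDate FROM dbo.Visit WHERE VisitId = @VisitId;
        ELSE
            SELECT VisitId, PatientId, VisitDate FROM dbo.Encounter WHERE VisitId = @VisitId;
    END;
    GO

    -- 2. The new table under its new name.
    CREATE TABLE dbo.Encounter (
        VisitId   INT PRIMARY KEY,
        PatientId INT  NOT NULL,
        VisitDate DATE NOT NULL
    );
    GO

    -- 3. Trigger keeps the new table in step with writes still hitting the old one.
    CREATE TRIGGER dbo.Visit_MirrorToEncounter ON dbo.Visit AFTER INSERT, UPDATE
    AS
    BEGIN
        MERGE dbo.Encounter AS tgt
        USING inserted AS src ON tgt.VisitId = src.VisitId
        WHEN MATCHED THEN
            UPDATE SET PatientId = src.PatientId, VisitDate = src.VisitDate
        WHEN NOT MATCHED THEN
            INSERT (VisitId, PatientId, VisitDate)
            VALUES (src.VisitId, src.PatientId, src.VisitDate);
    END;
    GO

    -- 4. Batch-copy the rows the trigger never sees.
    INSERT INTO dbo.Encounter (VisitId, PatientId, VisitDate)
    SELECT v.VisitId, v.PatientId, v.VisitDate
    FROM dbo.Visit AS v
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Encounter AS e WHERE e.VisitId = v.VisitId);
    GO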

A much simpler scenario is adding columns to a table for a new version of your app. If the new column can safely remain invisible to the old version, just add it on the fly, using the right default values so no constraints fire. Add a new stored procedure for the new version of the application, implementing the parameters and functionality the new version needs. The old stored procedure won’t even know the column is there. If the new column must be set in some particular way depending on old values, add a trigger for that and then update the new column using the core of that trigger in a batch command. Again, there is absolutely no need to take down the database or even kick out users while doing any of this. Once the new stored procedure is there, you can roll out new applications and have them come online one by one, leaving old versions running undisturbed.
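
A minimal sketch of the add-a-column case, assuming a hypothetical PreferredLanguage column that only the new app version cares about:

    -- The default keeps existing rows and the old version's inserts valid, so no constraint fires.
    ALTER TABLE dbo.Patient
        ADD PreferredLanguage CHAR(2) NOT NULL
            CONSTRAINT DF_Patient_PreferredLanguage DEFAULT 'sv';
    GO

    -- New, versioned procedure for the new app version; the old FindPatient is left untouched.
    CREATE PROCEDURE dbo.FindPatient_1
        @PersonalNr CHAR(12)
    AS
    BEGIN
        SELECT PatientId, PersonalNr, FullName, PreferredLanguage
        FROM dbo.Patient
        WHERE PersonalNr = @PersonalNr;
    END;
    GO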

You can dream up almost any change in table structure: new tables, splitting one table into two, combining and splitting columns, creating or removing dependencies between columns and tables. All of it can be handled with a fairly simple combination of stored procedures, triggers, and the occasional user-defined function, and all of it can be done on a live production database with very low risk (provided, of course, you know what you’re doing).

To keep things easy and clean, I always use a set of stored procedures per application, where each stored procedure’s name starts with a prefix identifying the application. That lets you see at a glance which app is using which stored procedure. I never let two different apps use the same procedure (you can always have both stored procedures call a common third procedure to reduce code duplication). Versions of a procedure are named with a numeric postfix so they can coexist, and a new version is only created if it behaves differently as seen from the application code. So a procedure to find a patient, used by an iotaPad app, could be named “IP_FindPatient_2” if it was the third version (the first is without a postfix, the second version gets _1, and so on).
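
A tiny sketch of the convention (the IP_ prefix for iotaPad comes from the naming above; the LB_ prefix for a hypothetical lab-viewer app is invented for illustration), with both app-specific procedures delegating to a shared one to avoid duplication:

    CREATE PROCEDURE dbo.Common_FindPatient @PersonalNr CHAR(12)
    AS
        SELECT PatientId, PersonalNr, FullName
        FROM dbo.Patient
        WHERE PersonalNr = @PersonalNr;
    GO

    -- Third version of the iotaPad variant (no postfix, then _1, then _2).
    CREATE PROCEDURE dbo.IP_FindPatient_2 @PersonalNr CHAR(12)
    AS
        EXEC dbo.Common_FindPatient @PersonalNr;
    GO

    -- The lab viewer gets its own procedure, even though it currently does the same thing.
    CREATE PROCEDURE dbo.LB_FindPatient @PersonalNr CHAR(12)
    AS
        EXEC dbo.Common_FindPatient @PersonalNr;
    GO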

Finally, since application code only ever uses stored procedures, no application needs any read or write access to your tables at all, only execute access to the stored procedures whose prefix matches the app. This makes for a very easily verified set of GRANTs and a secure database.
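
In T-SQL-flavoured terms, with a hypothetical iotapad_app database user, the whole security setup boils down to something like this:

    -- A role per application: execute on that app's procedures, nothing else.
    CREATE ROLE iotapad_role;
    GRANT EXECUTE ON dbo.IP_FindPatient_2 TO iotapad_role;
    -- ...one GRANT EXECUTE per IP_-prefixed procedure; no table GRANTs at all.
    ALTER ROLE iotapad_role ADD MEMBER iotapad_app;

Auditing security then reduces to checking that every GRANT is an EXECUTE on a procedure whose prefix matches the role.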

Why this scheme isn’t used in products like Cambio Cosmic is a mystery to me. It can’t be laziness, since they’re working their butts off trying not to annihilate all the medical records we have, while stressing out both medical staff and their own programmers to the point of cardiac arrest every time something goes wrong. A little forethought would have saved so much time and so many problems it’s almost unreal.

But on one point you can be certain: that little forethought is a definite requirement for the iotaMed project. It wouldn’t hurt if the other guys tried it as well, though.
