A tale of two data kinds

In my neverending quest to straighten out the electronic healthcare record, I have to introduce a view on data that is essential if I’ll be able to explain what’s wrong with the current model and how it should be fixed. As always when we’re in a mess with the design of systems, we climb the abstraction ladder a level of two to find the underlying (“above lying?”) abstractions, just like particle physicists try to find ever smaller components of components.

I’m seeing all applications using two kinds of data, one that is continuously updated to reflect the current state and the other where one adds records to reflect changes to data. I’ll call these, with very unsatisfactory names, the “state reflecting record” (SRR) and the “transformation additive series” (TAS). I’ve intentionally used really unusual terms to make it easier to search for these terms later. By the way, I’m absolutely positive I can’t have invented this view of data, but so far I’ve not come up with any references to literature describing it this way. I’d be really grateful for any pointers to preexisting descriptions. So, lacking any references, I’ll have to describe it here as well as I can.

If we start with the state reflecting records (SRR), this can be exemplified by an image in an imageeditor. Whatever the editor does to the image changes that one image data blob from one state to the next. Whatever you do, it’s just that one blob. A text document, or a blog post like this one, are other examples of SRR. As long as we don’t add any other kind of data structures linking to the SRR, there is no way of telling what the SRR looked like before the most recent change, or even what that change was. The SRR only reflects current state of some object or collection of objects. A relational database is another example of an SRR, by the way (if we leave the transaction log out of consideration).

Transformation additive series (TAS) are quite another animal. A keystroke logger is TAS based, for instance. Every keystroke creates a new record. Even a backspace doesn’t remove a previous record but adds a new record containing the backstroke. A network communication session is a TAS. More and more packets are piled on. The change records in a source control system forms a TAS. In principle, a TAS consists of a growing number of records, each immutable, added in a series, where any subset of the series corresponds to a transformation of some other object. A TAS must in principle always refer to something else, and that “something else” has to be in essence an SRR.

Having these two cleancut types of data, we can now form a number of data design patterns from them. “Patterns” is taken to mean “desirable patterns”, while “antipatterns” is taken to mean “undesirable patterns” that, by implication, are seen in the wild to a disturbing degree and should be avoided if possible.

Patterns

If we start with desirable patterns, or just “patterns”, we can describe a few typical ones easily. Most, but not all, patterns employ an interacting set of SRR and TAS elements. Employing the principle of not duplicating data or relationships, we first find that we have to select either the SRR or the TAS as the primary data and the other as the secondary data. In a functioning pattern, we should require that the secondary data kind should be fully derivable from the primary.

Pattern 1 – SRR only

Typical application: a simple paint program. The image itself is the SRR. The paint program may have a single level undo function, which in a practical sense doesn’t qualify as a TAS, but if it has an unlimited level undo function, it is not any longer a simple SRR pattern. This pattern is clean, there is no possibility of conflict of data since there is just the one instance, the SRR itself.

Pattern 2 – TAS only

A sniffer log of network traffic is a typical TAS only. There is no conflict possible here either, not as long as you don’t try to make sense out of the captured packets, and if you do, it’s not a “TAS only” pattern anymore.

Pattern 3 – TAS primary

Any system that contains both an SRR and a TAS, but where the SRR can be fully reconstructed from the TAS belongs in this pattern. A typical example is version control systems. If you lose your original document, you can still recreate it from just the TAS, the series of change records. An accounting system can also be built this way, but rarely is. If each invoice, payment and other entry is kept in a TAS, you can always reconstruct any general ledger, sum, report, or tax form at any time simply from the TAS. Sadly, most real world accounting systems don’t exploit this idea, leading to the sad apps we see today.

In this pattern, you will often find a single immutable SRR, the start state. I still regard it as a “TAS primary” pattern.

Pattern 4 – TAS primary, SRR cache

A minor, but in practice very important, variation on the previous pattern is when the SRR is fully reconstructable from the TAS and an optional SRR start state, and is kept for performance reasons. Micrografx Paint, may it rest in peace, worked like this if I remember right. Or maybe it was Micrografx Designer. Or both. In this app you modified the image any way you liked, but each modification resulted in a change record in the “history” panel. From this “history” panel you could delete any step and rebuild the image from the beginning with that step omitted. So the current image could be reconstructed from the starting image plus the history. The final image is kept in Micrografx Paint as a cache so the program doesn’t have to rebuild the image constantly. That would have been much too slow. But losing the current image wasn’t a problem. Consistency was guaranteed, only the history record, the TAS, actually had authority over how the image would look. If the TAS and SRR gave different results, the SRR was corrupt. By definition.

This pattern also contains an inherent consistency verification on the TAS, namely that the SRR that is built from a complete TAS must make sense. If the SRR is an image, this is difficult to ascertain, since any image is valid, even if you can’t make sense out of it if you’re a human. Just look at some modern art and you’ll get my drift. But in many other cases, such as accounting or medicine, a TAS can contain records that result in an invalid SRR. Another example is source code control. If the source code change records are invalid, the source code you build up from the change records may not be compilable. This test is only valid if you require all source code to compile before you log it into the source code control system, of course.

Adobe Photoshop also has some of this functionality, but the history record seems not to be sufficient to rebuild the image, so this would form an antipattern.

Antipatterns

Now for the antipatterns. When reading any of these antipatterns, think of real examples in your own environment, and you’ll soon discover that these systems you know and hate are bottomless sources of irritating inconsistencies and bugs. That’s no coincidence. Most of that has as a root cause the unresolved ambiguity between what’s primary and what’s secondary in the data kinds.

Antipattern 1 – SRR primary, derived TAS

A text document in a word processor with the “track changes” function switched on. You still modify a single SRR, the document itself, but every change is causing the creation of a record in a TAS in the background. Since the TAS will never be complete enough in practice to reconstruct the document itself if it’s lost, the TAS can’t be primary. But the SRR can never reconstruct the TAS either, so the TAS isn’t fully secondary.

Audit logging in most applications fit this pattern as well. The secondary TAS is a rather improvised collection of data pointing to datetime, user, kind of records or whatever else popped into the developer’s mind when writing it, usually missing the pieces of data you turn out to really need later. Since the TAS is derived and doesn’t have any function to fill at all during normal system usage, it’s functionality is never put to the test until it is too late.

Antipattern 2 – Undecided primary

Most accounting systems can’t make up their mind what’s primary and what’s secondary. When you input an invoice into the system, it’s not reflected in the general ledger until you “post” it and once you do you can’t change it anymore. There’s no way to “unpost” or to rebuild the general ledger from recorded invoices. It’s as if the invoice series (TAS) rules until you “post” them, and then the SRR rules even though the TAS is still there. There’s no end to the confusion and assbackwardness that results from all this. (Don’t tell me the law requires this, in most countries it doesn’t. The law generally requires that you should not be able to do undetectable changes to your accounting, which is something else entirely. Don’t get me started.)

Conclusion

The only patterns that seem to work consistently are thus either SRR or TAS only, or combinations where the TAS is primary. There seems to be no instances where the SRR is primary and a connected TAS is still useful and reliable. Surprisingly, most real world systems seem to be build in exactly this way, that is with a primary SRR and a tacked on TAS with a confused division of responsibilities. Not so surprisingly, most of these systems have a lot of problems, many of which can be directly tracked to this confused design.

What’s next?

In the next post, I’ll get into how this relates to medical records. I can already divulge that medical records as they are today, are pure TAS based. Think about that in the meanwhile.