A common refrain that I hear uttered by functional programmers is that "state is not inherently bad—it's mutating (or shared) state that's really the problem." Having worked for some time in the OO world of Java and C#, where I took combinations of data structures, encapsulated them within a box, passed that box around my program, and opened it again to find something totally foreign to me, I have to say that I sympathize with this sentiment. Now, perhaps I simply didn't understand proper OO design when I was working on these projects, and that's fair; I probably didn't. I often found myself tempted to use inheritance where I probably should have used composition, the responsibilities of classes became muddled, and class names eventually came to read like run-on sentences. In every project, I was inclined to become a digital Aristotle charting a grand classification of species, drawing lines between classes where they seemed to make sense in my mind, and winding up with a hierarchy much more comparable to a plate of spaghetti than a tree.
These tendencies were surely my responsibility to manage; however, they were not discouraged by the languages I was using (if anything, they were encouraged). And this is why I find functional programming so compelling: it is not some sort of formula for writing perfect code; it is a discipline. This being the case, I find that languages that embrace the paradigm—the discipline—wholesale make it harder for me to write bad code.
There is no silver bullet for writing perfect code, though; there will always be tradeoffs when operating in a new paradigm. For truly 'functional' code, these tradeoffs have certain implications for how we interact with persistent data stores, and they may seem outlandish at first. Here, I'd like to take a moment to dig into the source of these tradeoffs and show you how my favorite implementation of these principles—Datomic—handles them.
To begin, let’s take a moment to understand why one might want to use functional programming in the first place. What qualities does this paradigm have that make it so hard to write bad code? The foremost of these is the notion that our functions should be pure, meaning that they will always return the same value for the same input. This maximizes the predictability (and testability) of our programs. As a consequence, our functions should avoid changing state outside their own scope, i.e., causing side effects. If the state exterior to a function is not changing, then it follows that the values we declare generally don't change; they are immutable—another principle of functional programming.
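A minimal sketch of the distinction in Clojure (Datomic's host language); the function names here are my own, chosen for illustration:

```clojure
;; A pure function: the same inputs always produce the same output,
;; and nothing outside the function is touched.
(defn add-interest [balance rate]
  (* balance (+ 1 rate)))

;; An impure counterpart: its result depends on, and changes,
;; state outside its own scope.
(def balance (atom 100))
(defn add-interest! [rate]
  (swap! balance * (+ 1 rate)))

(add-interest 100 0.05) ;; => 105.0, every time, for these inputs
```

The first function can be tested in isolation and called anywhere without fear; the second depends on whatever `balance` happens to hold when it runs.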
But surely avoiding unpredictability (side effects) entirely goes against the very nature of a useful program in the first place... right? To be made use of, programs have to be able to consume and react to user input, the only reliable part of which is being unreliable. They also often need to persist data based on this input: after all, I wouldn't like it very much if my bank's software kept my account balance truly immutable regardless of the money I put in.
Indeed, there are operations, such as writing to stdout, that necessarily involve side effects, but obeying the principles of functional programming means that we avoid them whenever possible.
The question then becomes whether our persistent data store is such a place where these functional principles can be obeyed. At first blush, the answer seems to be an obvious "no." After all, what is a database but a method of recording the current state of different data structures as they change during the runtime of a program? Whether your database is full of rows, or documents, or graph nodes, the purpose of them living there is to be modified and recalled, isn't it?
Indeed, it seems unavoidable to "break the rules" by allowing mutability in our data store, but perhaps we can mitigate it. When state changes it does so in one of three ways: creation, deletion, or modification. In each of these, state begins with some value (nil in the case of creation), and is changed to be something else (nil in the case of deletion). What if instead of only storing the current value of some state, we simply stored the fact that one of these changes happened? In this way, every change in state becomes a "creation," the creation of a record storing the newest value of the state.
Doing so not only effectively eliminates two of the three ways state can change, but also allows us to obey the principle of functional purity. Since old values are never deleted, asking the database for a particular value at a particular point in time always returns the same result. Our database is acting like a pure function of our query.
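The idea above can be sketched in a few lines of Clojure. This is a toy model of an append-only fact store, not Datomic's actual implementation: every change is the creation of a new record, and "the current value" becomes a query over the log.

```clojure
;; A toy append-only store: every change in state is the creation
;; of a new record; nothing is ever overwritten or deleted.
(defn assert-fact [log entity attr value time]
  (conj log {:e entity :a attr :v value :t time}))

;; "The current value" is just the most recent matching record
;; as of the requested time.
(defn value-at [log entity attr time]
  (->> log
       (filter #(and (= (:e %) entity)
                     (= (:a %) attr)
                     (<= (:t %) time)))
       (sort-by :t)
       last
       :v))

(def log
  (-> []
      (assert-fact :account :balance 100 1)
      (assert-fact :account :balance 105 2)))

(value-at log :account :balance 1) ;; => 100
(value-at log :account :balance 2) ;; => 105
```

Given the same log, entity, attribute, and time, `value-at` always returns the same value: the store behaves as a pure function of the query.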
This inclusion of time as an intrinsic characteristic of all matters of state is a fundamental precept of the Clojure database, Datomic, developed by Rich Hickey. In traditional database design, we think of our database as holding things: rows, columns, documents, graph nodes. Datomic does not store things. Datomic stores facts.
Facts—from the Latin factum, the past participle of facere, literally "a thing done"—implicitly have time associated with them. If I say, "The frog is on the log," I am stating a fact that at this moment said frog is sitting upon said log. If I say, "The frog hopped off the log," I am stating two facts, namely, that the frog was on the log at some point in time, and that at a later point in time, he was no longer on the log. If I wanted to determine the current state of the frog being on the log, I could find the most recent fact about the frog's being on the log and read it. Likewise, I could still find all the points in time at which he hopped onto and off of the log.
Such facts are stored as what are referred to as datoms: "an immutable atomic fact that represents the addition or retraction of a relation between an entity, an attribute, a value, and a transaction." These data structures are so-called "atomic" because each concerns a single attribute of a single entity at a single point in time. A datom thus consists of an entity id (E), an attribute (A), a value (V), and the transaction in which it was asserted (Tx).
Entities, so-called, are as close to actual things as Datomic gets, and essentially consist of all the current datoms that are associated with the same id. It is like defining a thing as a list of all the things you can predicate of it.
If I wanted to define an entity, the sun, as "a huge ball of flaming gas," I might represent it as a Datomic entity with id = 1 like so (the attribute names and transaction id here are illustrative):
E    A                     V                               Tx
1    :object/name          "The Sun"                       1234
1    :object/description   "a huge ball of flaming gas"    1234
The entity is simply the coming together of all the things we wish to predicate of it.
You may notice in the above example the Tx field on each datom: the transaction. Transactions are simply additions of datoms to the Datomic database. We can transact datoms one at a time or group them together. In the example above, the datoms represented by each row of the table were transacted together, since they share the same transaction id.
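Such a transaction might look like the following sketch, written against the Datomic peer API. The connection URI and the `:object/*` attribute names are assumptions, and the attributes would need to be defined in the schema beforehand:

```clojure
(require '[datomic.api :as d])

;; Assumes an in-memory database has already been created at this URI
;; and that :object/name and :object/description exist in the schema.
(def conn (d/connect "datomic:mem://example"))

;; Both facts below are asserted in a single transaction, so the
;; resulting datoms will share a transaction id.
@(d/transact conn
             [{:object/name        "The Sun"
               :object/description "a huge ball of flaming gas"}])
```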
The list of all transactions is how Datomic keeps track of the history of the database. Every datom that's ever been transacted lives here in order. When we want to get the "current state" of an entity, Datomic will give us the most recent set of datoms associated with it, but the old ones still live on in the history. As mentioned before, this is how Datomic acts as a sort of "pure function," since it will always give you the same value for a datom at a given point in time.
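These time-travel reads are available directly in the query API. A sketch, again using the peer library, where `conn` and an earlier basis point `t` are assumed to exist, and entity id 1 is the sun from the earlier example:

```clojure
;; The current value of the sun's name:
(d/q '[:find ?v . :where [1 :object/name ?v]] (d/db conn))

;; The same query against the database as of an earlier basis point t,
;; a pure function of the query and the time:
(d/q '[:find ?v . :where [1 :object/name ?v]]
     (d/as-of (d/db conn) t))

;; Every assertion and retraction ever made for that attribute:
(d/q '[:find ?v ?tx ?added
       :where [1 :object/name ?v ?tx ?added]]
     (d/history (d/db conn)))
```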
As with all decisions in software development, there are tradeoffs to this approach. From what I can tell, the greatest source of these tradeoffs lies in the fact that Datomic simply refuses to forget things. This is fantastically useful when you need to implement an audit trail (it's already done for you) or a user runs into an issue (you can step through every change to the user's state around the time of the incident). But, admittedly, you're storing all of that historical data whether you're using it or not.
Disk space is cheap these days, but that fact only goes so far. Truly forgetting data is a requirement of many programs, whether for legal reasons (scrubbing PII from the database) or to conserve disk space.
The latter of these concerns can be mitigated via the :db/noHistory flag, which can be set on an attribute in the schema. Doing so ensures that only the most recent value of that attribute is stored, causing Datomic to behave more like a traditional database for attributes that have it enabled.
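In schema terms, this is a single entry on the attribute definition; the `:session/last-seen` attribute here is a hypothetical example of the kind of frequently-rewritten value whose history you might not want:

```clojure
;; Schema sketch: an attribute whose past values Datomic may discard.
{:db/ident       :session/last-seen
 :db/valueType   :db.type/instant
 :db/cardinality :db.cardinality/one
 :db/noHistory   true}
```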
For the former concern, data that legally must be forgotten can be removed via an asynchronous feature of Datomic referred to as 'excision.' This process produces an index of the database that does not include datoms meeting a particular criterion (e.g., being associated with a particular entity). Queries against this newly indexed version of the database will not return those excised datoms, though the cost of the process is a function of the size of the entire database and can therefore be totally impractical for anything larger than a few gigabytes.
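An excision request is itself just a transaction. A sketch, where `conn` is assumed and `1234` is a hypothetical entity id whose datoms must be forgotten:

```clojure
;; Ask Datomic to (asynchronously) excise every datom about entity 1234.
@(d/transact conn [{:db/excise 1234}])
```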
There are times, however, when neither of these techniques is sufficient. If space is an issue and the data is already in the database, it's too late to apply :db/noHistory, and the database is probably too big to excise. In this case, it may be necessary to resort to more extreme measures.
'Decanting' is a technique wherein the transaction history of a Datomic database is read in order, undesirable transactions are filtered out, and the remaining datoms are re-transacted into a new database. This technique is very extreme and requires a "steady hand," so to speak, as Datomic's partitioning system will not allow user-defined entity ids. As a result, you will have to track the ids generated by the new database and retroactively rewrite references to them in order to maintain relationships between entities.
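The skeleton of the replay loop might look like this sketch over the peer library's log API. `keep-tx?` and `datoms->tx-data` are hypothetical helpers, and the crucial, painful part (tracking the ids the new database assigns and rewriting references to them) is elided:

```clojure
;; Decanting sketch: replay the old database's transaction log into a
;; fresh database, dropping unwanted transactions along the way.
(defn decant [old-conn new-conn keep-tx?]
  (doseq [{:keys [data]} (d/tx-range (d/log old-conn) nil nil)
          :when (keep-tx? data)]
    ;; datoms->tx-data must convert raw datoms back into transactable
    ;; assertions, remapping old entity ids to their new counterparts.
    @(d/transact new-conn (datoms->tx-data data))))
```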
This can be extremely burdensome if there are other databases that reference these ids, as a migration will be required to update them to the new value—potentially taking down two applications while the process is running.
Even more extreme is the 'snapshotting' technique. In some cases, undesirable datoms were transacted alongside datoms that need to be migrated to the new database. Here, you may need to migrate on a per-datom or per-entity basis, which suffers from all the same pitfalls as decanting but loses history in the process.
These are extreme cases, and the vast majority of Datomic instances will not run into problems requiring their implementation, but they are prime examples of what happens when we take the principles of functional programming to their logical conclusions. Whether these tradeoffs are manageable will, of course, depend on your context. For myself, I continue to use (and enjoy) Datomic for its compliance with the functional way of programming, as well as its simplicity and convenience when it comes to how its data is organized and its interactivity with Clojure. When it comes to an environment that forces me to write code that I am proud of and remains maintainable, the Clojure/Datomic combination is unparalleled in my experience.