The case for explicit data schemas

The data schema is important to more than just your current notion of your application. It may be worth designing the schema explicitly.

George Hadjiyiannis


Shapeless Design

The software development Zeitgeist seems to be pointing away from the necessity of having explicit data schemas. Nowadays, designing a database schema, or even just an ER diagram, seems downright old-fashioned, an activity viewed more as a burden or a restriction than as something of value. At the risk of being labeled a contrarian, I would like to make the case in favor of explicitly designing data schemas.

The move away from designing data schemas was slow and somewhat subtle. As far as I can tell, it started with the arrival of frameworks that had the well-intentioned purpose of replacing what they perceived as unnecessary manual work with automation. The most prominent of these frameworks is probably Hibernate (and the associated JPA). That being said, I personally believe that Spring, with its philosophy of developing implicitly by convention and naming, rather than by explicit design, normalized the idea that one should not have to design one's data schema. After all, once you have defined your entities as POJOs (and maybe even added some annotations about constraints and mappings), there is no need to explicitly define a schema. The framework will simply create the entire database and table structure for you. Doing it yourself is simply wasted effort. At this point I would like to clarify one thing: this is not an argument against frameworks such as Hibernate and Spring. I am actually quite a fan of both, and use them both heavily. However, I am not willing to relinquish complete control of my data schema to them.
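To make this concrete, here is a minimal sketch (the entity and property names are invented for illustration) of how a schema ends up being defined only implicitly: the developer writes an annotated POJO, and Hibernate derives the table from it on startup.

    // A hypothetical JPA entity; Hibernate will derive a CUSTOMER table from it.
    import jakarta.persistence.Entity;
    import jakarta.persistence.GeneratedValue;
    import jakarta.persistence.Id;

    @Entity
    public class Customer {

        @Id
        @GeneratedValue
        private Long id;          // becomes the primary key column

        private String name;      // becomes a nullable VARCHAR column by default
        private String email;     // no uniqueness or format guarantee unless declared

        // getters and setters omitted
    }

    // In a Spring Boot application, a single property is enough to have the
    // schema created (or altered) automatically, without any hand-written DDL:
    //   spring.jpa.hibernate.ddl-auto=update

Nothing here is wrong per se; my point is only that the resulting schema was never consciously designed, and the defaults (nullable columns, no unique keys) are whatever the framework happened to choose.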

While the persistence frameworks relegated explicit schemas to the pile of tasks you had to undertake back when tools were too dumb to do it for you, that was not the final nail in the coffin. As far as persistence frameworks were concerned, explicit data schemas were not evil, merely pointless. A second transition happened with the arrival of NoSQL, and the associated rush to “schema-less”1. With the advent of schema-less, an explicit data schema was now an undesired restriction. One could not adapt the schema easily, so it became a commitment with significant future consequences. Unfortunately, while there are use cases that genuinely require storing or processing arbitrary documents, these are few and very specific. They certainly do not warrant making everything schema-less. Nonetheless, with the advent of schema-less, explicit data schemas were no longer just a waste of effort; they were considered downright evil.
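To illustrate what schema-less means in practice, consider a hypothetical MongoDB collection (the field names are invented): two records written by different versions of the same application can have entirely different shapes, and the database will happily accept both.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class SchemalessExample {
        public static void main(String[] args) {
            MongoCollection<Document> orders = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("shop")
                    .getCollection("orders");

            // A record as written by version 1 of the application
            orders.insertOne(new Document("customer", "C-1042")
                    .append("total", 119.95));

            // A record written two years later: different field names, different
            // types, and no "total" at all; nothing in the database objects.
            orders.insertOne(new Document("customerId", 1042)
                    .append("lines", java.util.List.of(
                            new Document("sku", "A-7").append("price", "119.95"))));
        }
    }

The flexibility is real, but so is the consequence: every property of the data now has to be enforced (and later reconstructed) by application code.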

The comedian Steven Wright once said:

Experience is something you don't get until just after you need it.

The reason why I am generally reluctant to relinquish control of the data schema is simple: the pain of experience. This particular experience dates back to the days when I was a consultant, on a project with 3 consulting companies and an 8-digit price tag, for a very large state agency in Massachusetts. As part of this project we had to either ETL (Extract, Transform, and Load) the data from roughly 140 data sources, or integrate the data sources themselves. The most interesting fact about those data sources, however, is that the application code corresponding to the vast majority of them was long gone! In particular, we had 3 variations:

  1. In the simplest case, the original application was no longer needed or used, but the data was valuable, either for analytics purposes, or as historical OLTP data for other applications.
  2. Somewhat more complicated was the variation where the original application was re-written (some of them ran on mainframes), but the new application had to keep the original data schema in order to be able to use the existing data records.
  3. The most complicated one was the case of data sources that were integrated into multiple other systems, effectively becoming an indirect ESB. The application itself was no longer useful, but its ability to connect and transfer data between other useful systems was still valuable. Thankfully, these cases were fairly rare.

Let's analyze the above scenarios with regard to implicit data schemas or schema-less databases, starting with scenario 1.

In this scenario, the code for the application is probably not available, but the database and data are still alive and serving other uses. In the case of a schema-less database, we now have a number of significant issues:
  • Without either a schema or the application code, understanding what data is present and what it means is an exercise in frustration. Effectively, the meaning of the data can only be reverse-engineered by examination of the contents. Note that in a schema-less database, the explicit intent is that the schema can and will change between records. This means that you cannot reverse engineer the meaning on the basis of a “representative sample” of records. Instead you must examine all records (a sketch of such a scan follows this list).
  • The second challenge is that the data in such a scenario is often needed in a context where a schema is necessary (typically for use in an OLAP data warehouse, or for use in another application that needs access to specific fields). The usual approach to solving this challenge is to use an ETL process to transform to the target schema, but this generates a lot of complexity to deal with all the possible variations of schemas, and the occasional absence of needed fields in some records.
  • The third challenge is one of data quality. The very restrictions that the schema-less approach wishes to shed are exactly the restrictions that provide specific guarantees about the quality of the data: constraints to guarantee that specific fields are always available, that they have a specific format, or that referential integrity is maintained across all data. Avoiding unnecessary restrictions is easy if you explicitly design a schema; recovering data quality from data that was created without such guard-rails is essentially impossible. The second sketch after this list shows how cheaply such constraints can be declared when the schema is explicit.
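As a sketch of what the “examine all records” exercise in the first bullet looks like in practice, the snippet below (again assuming a MongoDB collection; a real engagement would also have to track nesting, null-ness, and conflicting types per field) scans every record and accumulates the field names and value types it has seen.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SchemaRecovery {
        public static void main(String[] args) {
            MongoCollection<Document> orders = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("shop")
                    .getCollection("orders");

            // Field name -> set of value types observed for it. Because any record
            // may deviate from any other, a sample is not enough: we must scan everything.
            Map<String, Set<String>> observedTypes = new HashMap<>();
            long scanned = 0;
            for (Document doc : orders.find()) {
                scanned++;
                for (String field : doc.keySet()) {
                    Object value = doc.get(field);
                    observedTypes
                            .computeIfAbsent(field, f -> new HashSet<>())
                            .add(value == null ? "null" : value.getClass().getSimpleName());
                }
            }
            System.out.println("Scanned " + scanned + " records");
            observedTypes.forEach((field, types) -> System.out.println(field + " -> " + types));
        }
    }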
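By contrast, the guard-rails mentioned in the last bullet cost almost nothing to declare when the schema is explicit. A minimal sketch using JPA annotations (the entities are invented; the same constraints could just as well be written directly as DDL):

    import jakarta.persistence.Column;
    import jakarta.persistence.Entity;
    import jakarta.persistence.Id;
    import jakarta.persistence.JoinColumn;
    import jakarta.persistence.ManyToOne;

    @Entity
    public class Invoice {

        @Id
        private Long id;

        // The field is always present and never exceeds the declared length.
        @Column(nullable = false, length = 20)
        private String invoiceNumber;

        // Referential integrity: every invoice must point to an existing customer row.
        @ManyToOne(optional = false)
        @JoinColumn(name = "customer_id", nullable = false)
        private Customer customer;   // the Customer entity from the earlier sketch
    }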

If the database has a schema, albeit an implicit one, the situation is much easier to handle. Essentially the only challenge is a reduced version of the data quality challenge. Referential integrity is possibly at risk, and the type of the data may be more permissive than it would have been with explicit design, but in general the schema is understandable, and has essential constraints in place.

In the second scenario, where new code is written for an existing database, the challenges can become more significant. In the schema-less design, all of the challenges of scenario 1 exist, and we now have an additional one: the new application code may be incompatible with some of the existing data records. Since the only properties we can guarantee about the data are the properties that were maintained by the application code, backwards compatibility with this data has to be explicitly designed into the new application code. This can be done either by examining all records again to derive the necessary properties, or by examining the original application code. While the latter approach sounds simpler, it can actually prove more complicated in practice: during its life-cycle the original application was probably updated multiple times, and each version probably generated a fraction of the records in the database. This means that there are multiple versions of the original application that need to be considered. In practice, it is probably easier to work backwards: determine which properties are important for the correct functioning of the new application, and then verify that the existing data records satisfy them (or discover why not; you may simply be missing a use-case).
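A sketch of that “work backwards” verification, assuming the same hypothetical MongoDB collection as before: write down the properties the new application depends on as explicit predicates, then check every existing record against them.

    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.Predicate;

    public class CompatibilityCheck {
        public static void main(String[] args) {
            MongoCollection<Document> orders = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("shop")
                    .getCollection("orders");

            // The properties the *new* application relies on, written down explicitly.
            Map<String, Predicate<Document>> requiredProperties = new LinkedHashMap<>();
            requiredProperties.put("has a customer reference",
                    d -> d.containsKey("customerId") || d.containsKey("customer"));
            requiredProperties.put("total is numeric when present",
                    d -> !d.containsKey("total") || d.get("total") instanceof Number);

            // Every violation is either legacy data that must be migrated,
            // or a use-case the new design missed.
            for (Document doc : orders.find()) {
                requiredProperties.forEach((name, property) -> {
                    if (!property.test(doc)) {
                        System.out.println("Violates '" + name + "': " + doc.toJson());
                    }
                });
            }
        }
    }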

In this scenario, implicit schemas can also have significant challenges. To begin with, the legacy database makes it impractical to use an implicit schema in the new application code. In theory it is possible by having the new application generate its implicit schema, and designing an ETL to transfer the data from the legacy database to the new schema. In practice, however, this will most likely prove to be an iterative process. The first auto-generated schema is unlikely to be compatible with the legacy data, requiring that the new application data model be re-written, and the whole process repeated. In addition, the effort of creating the ETL will probably be higher than designing and configuring the new application around the existing schema.
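In practice, “configuring the new application around the existing schema” mostly means mapping the new code onto the legacy tables as they are. A minimal sketch, assuming JPA and an invented legacy table CUST_MASTER:

    import jakarta.persistence.Column;
    import jakarta.persistence.Entity;
    import jakarta.persistence.Id;
    import jakarta.persistence.Table;

    // The new application maps onto the legacy schema instead of generating a
    // fresh one and building an ETL to move decades of existing records.
    @Entity
    @Table(name = "CUST_MASTER")
    public class LegacyCustomer {

        @Id
        @Column(name = "CUST_ID")
        private Long id;

        @Column(name = "CUST_NM", nullable = false, length = 60)
        private String name;

        @Column(name = "CRT_TS")
        private java.time.LocalDateTime createdAt;
    }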

Note that the implicit schema scenario will also have a reduced version of the data properties challenge, since not all relevant data properties will be maintained as constraints, and some may have to be recovered from application code. Consider, for example, the case of a field A which is allowed to be NULL only if the status of the record (stored in field B) has a specific value. In general, any property that would normally be encoded as a trigger in the database would probably have to be derived. In an implicit schema scenario, however, the degrees of freedom for variation between records are drastically reduced (this is, after all, the very point of having a schema), making deriving such properties significantly easier.
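To make the example concrete, here is a sketch of where such a rule typically ends up when it is not captured in the schema: in application code (the entity and the rule are invented for illustration). With an explicitly designed schema it would instead live in a trigger or CHECK constraint, where the next team could actually find it.

    import jakarta.persistence.Column;
    import jakarta.persistence.Entity;
    import jakarta.persistence.Id;
    import jakarta.persistence.PrePersist;
    import jakarta.persistence.PreUpdate;

    @Entity
    public class Shipment {

        @Id
        private Long id;

        @Column(nullable = false)
        private String status;        // e.g. "DRAFT", "DISPATCHED"

        private String trackingCode;  // "field A": may be null only while status is "DRAFT"

        // The rule exists only in application code; nothing in the schema enforces
        // it, so years later it has to be rediscovered from code or from the data.
        @PrePersist
        @PreUpdate
        void enforceConditionalNullability() {
            if (trackingCode == null && !"DRAFT".equals(status)) {
                throw new IllegalStateException("trackingCode is required once status leaves DRAFT");
            }
        }
    }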

For completeness, I don't believe the third scenario is worth considering in this context: my feeling is that it will prove to always be easier to replace the data source-as-ESB scenario with an explicit system integration implementation. I will, therefore, not cover it here.

To switch to the big picture, there are two fundamental principles at play2 that essentially make databases different from any other part of the application:

  • Data has a much longer lifetime than application code (often multiple decades).
  • People always discover additional uses for the application data beyond the application itself. You need to be aware that use cases will surface beyond those of your current application.

In this light, the question of explicit data schemas becomes yet another of those quintessential "pay now or pay later" trade-offs that define software architecture. That being said, the effort of designing an explicit data schema is often fairly small, since essentially all the discovery work - what entities exist, with what fields, and with what data types - has already been performed as part of the domain analysis. Conversely, the cost of recovering from the lack of a well-designed data schema, if you opt for "later", ranges from serious to prohibitive. And if you will allow me a personal plea: having had to perform the task myself, I would ask you to be kind to the consultants of the future who will be called upon to deal with your design later.


  1. Note that there is little tying schema-less to NoSQL, other than historical necessity. One can easily mix schema-less into an otherwise SQL environment, by mixing in unstructured fields and letting the applications define how to interpret the content. In the trivial case, you could limit the table to an ID and a text or BLOB field containing JSON, which gets serialized and de-serialized into whatever structure the application decides. This mixed mode even has the advantage of using the best features of both worlds. ↩︎
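    A minimal sketch of that trivial case (the entity and field names are invented): a JPA entity with a structured ID column plus a JSON payload that only the application knows how to interpret.

        import com.fasterxml.jackson.databind.ObjectMapper;
        import jakarta.persistence.Entity;
        import jakarta.persistence.Id;
        import jakarta.persistence.Lob;

        @Entity
        public class EventRecord {

            @Id
            private Long id;

            @Lob
            private String payloadJson;   // arbitrary structure, defined by the application

            private static final ObjectMapper MAPPER = new ObjectMapper();

            public void setPayload(Object payload) throws Exception {
                this.payloadJson = MAPPER.writeValueAsString(payload);
            }

            public <T> T getPayload(Class<T> type) throws Exception {
                return MAPPER.readValue(payloadJson, type);
            }
        }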

  2. To be clear, these two principles are not limited to the scenarios above; they are indeed fundamental. They are, in fact, so fundamental that they are applicable across all different implementation technologies, and have been at play since the very first days of computers. They have far-reaching consequences, beyond the topics discussed in this article. ↩︎
