DRAFT: 20th Anniversary of The Vietnam of Computer Science

DRAFT · TODO: date · NpgsqlRest · PostgreSQL · Architecture · DDD · Clean Architecture · Opinion

Introduction

It has been twenty years since Ted Neward published "The Vietnam of Computer Science". Twenty years.

Did we win? Did we even leave? Are we stuck in a quagmire?

Since then, the industry has marched relentlessly through a never-ending parade of patterns, architectures, and methodologies: ORM tools of every flavor, Repository and Unit of Work patterns, Domain-Driven Design, CQRS, Event Sourcing (currently all the rage), Hexagonal Architecture, Clean Architecture, Onion Architecture, Service-Oriented Architecture (now obsolete and so last year), Microservices, and on and on.

The quagmire Neward described is, of course, the Object-Relational Impedance Mismatch: a shotgun marriage, so to speak, between the Object and Relational worlds. One lives in your application's memory and works over data structures; the other, well, lives in a relational database.

And none of the patterns above made the problem go away. We need both worlds, and somehow most of the effort still goes into dealing with persistence in one way or another. It's like unsuccessful couples therapy for that shotgun marriage. Maybe we need divorce papers?

Let's dig in deep.

What Is The Object–Relational Impedance Mismatch?

According to Wikipedia (link: https://en.wikipedia.org/wiki/Object–relational_impedance_mismatch):

Object–relational impedance mismatch is a set of difficulties going between data in relational data stores and data in domain-driven object models.

Ok, got it. A set of difficulties. Difficulties that refuse to go away, it seems, but fine.

It's worth noting that Object–Relational isn't the only mismatch on the menu. OO isn't the only way we work over application memory — functional programming is on the rise too, so there's a Functional–Relational mismatch as well. To be fair, FP gets along with relational better than OO does: SQL is already declarative, set-based, and value-oriented, right up FP's alley. But it still works over application memory, so it still has to bridge to the relational database like everyone else.

Anyway, back to Wikipedia:

Relational Database Management Systems (RDBMS) is the standard method for storing data in a dedicated database, while object-oriented (OO) programming is the default method for business-centric design in programming languages.

An RDBMS is the standard method for storing data in a dedicated database, but is storing data really all it does? Files store data too, don't they? Are we missing something here?

As for the second claim, that object-oriented (OO) programming is the default method for business-centric design: that might be true, but SQL is the default for anything related to business data, such as the way organizations store, manage, protect, and analyze it.

We can already see the tension here, but let's move on. Wikipedia continues:

The problem lies in neither relational databases nor OO programming, but in the conceptual difficulty mapping between the two logic models. Both logical models are differently implementable using database servers, programming languages, design patterns, or other technologies.

Now, why would they say that we have two logical models?

Take a Customer. In your business, a customer is one thing — one concept. DDD even has a name for this: within a bounded context there is one ubiquitous language, so there is one shared notion of what a Customer is. That single concept becomes one logical model — what a Customer is, which attributes it has, how it relates to orders — with no implementation details attached. That's why it's called a logical model.

Then you build it. And here is where it splits: that one logical model gets implemented physically more than once. Once as a table — customer, with a varchar(255) name column. Once as a class — Customer, with a string Name property. Same concept, same logical model, two physical incarnations.
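To make the two incarnations concrete, here is a minimal sketch in Python, with SQLite standing in for the RDBMS so it runs without a server. Both choices are my assumptions for illustration, as are the id column and the load_customer helper; only the customer table, its varchar(255) name column, and the Customer class come from the text above.

```python
import sqlite3
from dataclasses import dataclass

# Physical model #1: the table. SQLite is a stand-in here (assumption); it
# accepts varchar(255) as a type name but does not enforce the length.
DDL = "create table customer (id integer primary key, name varchar(255) not null)"


# Physical model #2: the class, carrying the same attributes.
@dataclass
class Customer:
    id: int
    name: str


def load_customer(conn: sqlite3.Connection, customer_id: int) -> Customer:
    """Hypothetical helper: map one row (model #1) onto one object (model #2)."""
    row = conn.execute(
        "select id, name from customer where id = ?", (customer_id,)
    ).fetchone()
    return Customer(id=row[0], name=row[1])


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute("insert into customer (id, name) values (1, 'Ada')")
customer = load_customer(conn, 1)
print(customer)
```

Every attribute appears twice, once per physical model, and load_customer exists only to copy one into the other.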

So in modern software design we are not talking about two different logical models, as Wikipedia claims. We have one logical model and two — sometimes more — physical models. Modern overengineering knows no bounds.

The reason they say we have two logical models is probably that OO and relational are seen as totally different paradigms, so the logical models must be genuinely different. OO does have behavior, for example, but we are talking strictly about data here, not behavior. Behavior is a different problem axis; we'll come to it when we get to algorithms.

In any case, impedance mismatch is not about two different logical models, but about the mismatch between those two physical models. Your system has to map them and make them work together because there is still only one logical model, and the system needs to reflect that. That's the real problem and the source of tension. Wikipedia goes on:

Issues range from application to enterprise scale, whenever stored relational data is used in domain-driven object models, and vice versa. Object-oriented data stores can trade this problem for other implementation difficulties.

The first sentence of that passage confirms what I just said. Whenever stored relational data (one physical model) is used in domain-driven object models (another physical model), we have this problem, this mismatch. Maybe the belief that we are dealing with two different logical models is the very reason these difficulties persist and remain unresolved. If it were just simple mapping, say from varchar(255) to a string, it would have been solved a long time ago. Or take something a bit more complex, say a many-to-many relationship at the logical level. In a relational model that requires a junction table, while in an object model it can be just a collection of references. That mapping is a little more complex, but still solvable and, indeed, already solved.
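As a rough illustration of how mundane the many-to-many mapping is, here is a sketch in Python with SQLite as a stand-in database. The student, course, and enrollment tables are hypothetical names I made up for the example; they do not come from any real system.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema, for illustration only.
    create table student (id integer primary key, name text not null);
    create table course  (id integer primary key, title text not null);
    -- Relational side: the many-to-many lives in a junction table.
    create table enrollment (
        student_id integer not null references student(id),
        course_id  integer not null references course(id),
        primary key (student_id, course_id)
    );
    insert into student values (1, 'Ada'), (2, 'Linus');
    insert into course  values (10, 'Databases'), (20, 'Compilers');
    insert into enrollment values (1, 10), (1, 20), (2, 10);
""")

# Object side: the same relationship as a collection of references per student.
courses_by_student = defaultdict(list)
query = """
    select s.name, c.title
    from enrollment e
    join student s on s.id = e.student_id
    join course  c on c.id = e.course_id
    order by s.name, c.title
"""
for student_name, course_title in conn.execute(query):
    courses_by_student[student_name].append(course_title)

print(dict(courses_by_student))
```

One join and one loop: the junction table becomes a list per student, which is about as hard as this particular mapping gets.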

My sincere belief is that we are dealing with a series of misconceptions and misunderstandings about the nature of abstractions themselves. More specifically, about the abstractions the RDBMS already provides. The RDBMS already provides them, and the application layer goes on and re-implements them anyway. No wonder we have difficulties. Let's look at these misconceptions one by one and discuss them in detail:

State Data Abstraction Misconception
Storage Devices Abstraction Misconception
Data Structures Abstraction Misconception
Abstraction Over Algorithms
Abstraction Over Concurrency and Integrity

1) State Data Abstraction Misconception

The Claim

One of the core tenets of object-oriented programming is encapsulation. Encapsulation is supposed to protect internal data, the state. The object is the authoritative custodian of its state. Nobody else has it. Period. Without encapsulation, we don't really have OOP anymore.

On the other hand, functional programming has its own version of that — state immutability. A function takes values in and returns new values out, leaving the originals untouched — so there's no shared mutable state to corrupt in the first place. FP also enforces valid state through invariants, often by encoding them directly into the type system. Same goal, fewer bugs from uncontrolled state, arguably reached more elegantly. Fine. State status - protected, bugs - reduced. Great, beautiful, love it.

This, in fact, is very reasonable and well-thought-out. State data is data shared between different parts of the system and even different users. If every part of the system can poke at it without other parts knowing about it, then you have a lot of bugs. So, naturally, over time people came up with these concepts of encapsulation and immutability to protect that state and make sure it is only changed and fiddled with in a controlled way. Because we don't want to have bugs. Bugs are bad, okay.

No objection here. This is perfectly fine, and it makes a lot of sense. Let's say we have something complex. A game scene. A compiler's syntax tree. Whatever. Protecting the in-memory state of such systems is invaluable, to say the least.

In his book Domain-Driven Design (2003), when describing the Domain Layer in Chapter 4, "Isolating the Domain," Eric Evans writes:

State that reflects the business situation is controlled and used here, even though the technical details of storing it are delegated to the infrastructure.

So he treats the in-memory object as the custodian of state - state that is "controlled and used here," where "here" is the in-memory Domain Layer — while the RDBMS is merely "delegated to the infrastructure." More on that in the next chapter. What matters here is the claim itself: the state is in memory. No doubt about it.

The Reality

But what about applications backed by relational databases, business or otherwise?

In RDBMS-backed applications, the state lives in that RDBMS itself - not in memory. The authoritative custodian of state is the database table row, not the in-memory object.

Take any business application backed by a relational database. How do you check the current state of some entity? Do you peek at the object in memory? No, you query the database for that. Anyone who has worked 5 seconds in industry knows this, of course.

I know what DDD people will say now: object memory (or functional state memory) is the real state data, and RDBMS is just where that data is persisted (presumably when the user clicks on a "Save" icon).

And, if you still believe that objects/memory are the real state and not RDBMS, then riddle me this:

What if we have multiple instances of the application running behind a load balancer? And then maybe some background workers as well, and some other services too. Maybe a reporting replica as well, so we have multiple processes accessing the same state data. Who is the sole custodian of that state data in that case? The in-memory object of one process, or the database row? The obvious answer is, of course, NONE of the in-memory copies. The best each process can do is hold a copy, while the real state is in the RDBMS itself.

Some may say now: but databases have multiplicity too, right? We have multiple replicas, and they are all copies of the same data. And there are multi-master setups as well, with multiple writers for high-availability scenarios. Yeah, but the big difference is that the database solves it, while in-memory objects, to be honest, don't even try. They just deny reality. With read replicas, there is never any ambiguity about which node is the source of truth, since the replicas are read-only copies. Multi-master setups either use consensus protocols that ensure a single source of truth or settle for explicitly defined eventual consistency. In-memory objects have no such machinery, so they just pretend to be the single custodian of the state data, which is, of course, a big, fat lie.

If the object is the custodian of state, what if we kill and restart the application? Oh my, the object's custodianship has evaporated.

Also, what if we have two concurrent writers? One loads an invoice as 'pending', but another marks it 'paid' while the first one still holds it. Now the first one is lying about the state of that invoice. The database is right, because it is the source of truth; the first writer's copy is stale. We will talk more about concurrency and integrity later. This is just to prove my point:

RDBMS is the authoritative custodian of state, not in-memory objects. The state lives in the database, not in memory.
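The invoice scenario can be sketched in a few lines. This is a minimal illustration in Python with SQLite as a stand-in database, both of which are my assumptions. The invoice table layout and the 'cancelled' status are also made up, and for brevity both "writers" share one connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table invoice (id integer primary key, status text not null)")
conn.execute("insert into invoice values (1, 'pending')")

# Writer A loads the invoice into memory: from here on it only holds a copy.
row = conn.execute("select id, status from invoice where id = 1").fetchone()
invoice_in_memory = dict(zip(("id", "status"), row))

# Writer B marks the invoice as paid while A still holds its copy.
conn.execute("update invoice set status = 'paid' where id = 1")

# A's object still claims 'pending'; the row says otherwise.
status_in_db = conn.execute("select status from invoice where id = 1").fetchone()[0]

# If A now tries to cancel "its" pending invoice and guards the UPDATE with the
# state it believes in, the database exposes the stale copy: zero rows match.
cursor = conn.execute(
    "update invoice set status = 'cancelled' where id = 1 and status = ?",
    (invoice_in_memory["status"],),
)
rows_changed = cursor.rowcount

print(invoice_in_memory["status"], status_in_db, rows_changed)
```

The object and the row disagree, and when they do, the row wins: the guarded UPDATE changes nothing because the state A believed in no longer exists.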

The Cost

Take a look at this example of a DDD-style domain model below:

Source credit: https://www.reddit.com/r/DomainDrivenDesign/comments/1ttzr19/create_complex_and_deep_aggregate/

This is a random example from Reddit, but it is indicative. Virtually every domain model is more or less like that - strip away the method, adjust types a bit, and it is basically an Entity-Relationship (ER) diagram, that's it. There is no structural difference between the domain model and the database model. It is the same logical model implemented twice.

And that method isValid(), the only behavior in the whole model, is just a data integrity check. Every rule it enforces, the database enforces too: dateBegin <= dateEnd is a one-line CHECK, and the harder ones (price periods that must not overlap, quantity ranges that must stay contiguous) are exactly what database constraints exist for (more on the how in the integrity misconception). Same logical model implemented twice, same integrity rules implemented twice.
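For the simplest of those rules, here is what the one-line CHECK looks like in practice. This is a sketch in Python with SQLite as a stand-in database; the price_period table, its column names, and the try_insert helper are hypothetical, loosely modeled on the dateBegin and dateEnd fields mentioned above. It covers only the ordering rule, not the overlap or contiguity rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table price_period (
        id         integer primary key,
        date_begin text not null,
        date_end   text not null,
        -- ISO-8601 date strings compare correctly as text in SQLite.
        check (date_begin <= date_end)
    )
""")


def try_insert(date_begin: str, date_end: str) -> bool:
    """Return True if the database accepted the row, False if the CHECK rejected it."""
    try:
        conn.execute(
            "insert into price_period (date_begin, date_end) values (?, ?)",
            (date_begin, date_end),
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_accepted = try_insert("2024-01-01", "2024-06-30")
invalid_accepted = try_insert("2024-07-01", "2024-06-30")
print(valid_accepted, invalid_accepted)
```

The rule is declared once, next to the data, and applies to every writer that ever touches the table.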

Because modern software design orthodoxy refuses to acknowledge that the state lives in the RDBMS, it forces us to implement the same model at least twice - once in the database (relational model implementation), once in memory (object model implementation, mapped usually with an O/R tool or a library).

I say at least, because there are examples where we have even more. Some "architects" will consider that O/R mapped model a persistence model because it doesn't have any behavior, and then they will add a separate domain model on top of that, which is the one with the behavior. So we have three implementations of the same logical model - database model, persistence model, and domain model.

And since we now have at least two physical models, we also have two type systems and two sets of constraints protecting the state. We must keep them in sync at all times, and we must maintain a correct mapping between them. To be fair, this is mostly (not completely) automated in modern systems, but automated or not, it is still there: still a cost, still a source of bugs, and still a source of complexity. And even automation can't fully bridge the semantic gap between the two type systems. For example, in PostgreSQL a column can carry a NOT NULL constraint while the corresponding C# string property can still hold null.

All because we believe there is some important state data we need to protect with our objects. At best, those objects hold transient, disposable copies of chunks of data, and either way there is a lot of plumbing to manage: connections, transactions, commands, queries, etc. You know, the actual infrastructure.

Personally, I see a lot of irony in this. Modern orthodoxy calls RDBMS infrastructure, just to end up with code bases implementing, well, actual infrastructure.

2) Storage Devices Abstraction Misconception

The Claim

Modern software engineering assumes that an RDBMS is a storage device: a device used to store data. Therefore, good engineering practice is to abstract that storage device away.

We can see that in the quote from Eric Evans above, where he says that:

... the technical details of storing it are delegated to the infrastructure.

Evans doesn't dwell on this point much - he just delegates storage to the infrastructure and moves on. On the other hand, Robert C. Martin is much more vocal about it. In Clean Architecture, Martin dedicates an entire chapter to "The Database Is a Detail" argument. He is very blunt about it, and he repeats it several times, for example:

It's just a mechanism we use to move the data back and forth between the surface of the disk and the RAM.

Or this:

The database is really nothing more than a big bucket of bits where we store our data on a long-term basis.

This is not some fringe blog post. These are two of the most-cited and influential authors in the field - arriving at one and the same conclusion - just at different volumes.

Evans states it just once and quietly, like he doesn't want to talk about it too much, and then delegates it away (perhaps he'd rather not have you examine it too closely, I don't know). Martin states it over and over, bluntly, and goes as far as to insist we should not even acknowledge that the disk exists.

Different approaches - identical claims: the database is a storage device. Storage is mechanical, it sits beneath the business — no different from a file system — and the architect's job is to wrap it up tight and forget it is there.

The Reality

The reality is that an RDBMS uses a storage device, or several, which means it already does this abstraction for you. It already abstracts storage.

If we go back to the beginning, when Edgar Codd introduced the relational model in 1970, the pitch was data independence: the whole point was to insulate the logical shape of your data from how it physically sits on a device. In fact, that is the core concept of the relational model. Fifteen years later, it was codified in Codd's twelve rules as Rule 8, physical data independence:

Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods.

The ANSI/SPARC Architecture from that era formalized it and walled off internal storage as well. So, the relational model was designed, from day one, to hide the disk.

Back to modern times and modern RDBMS implementations.

Consider this: we can write a SELECT, a JOIN, a WHERE, and even an INSERT, UPDATE, or DELETE, and in most cases it will execute with the same results on RDBMS engines from different vendors. That is ANSI/ISO standardized SQL, and it goes to show the real separation between the model and the physical implementation. Virtually every RDBMS implements the core of the same ANSI/ISO SQL standard, with dialect differences at the edges. Those differences can sometimes be significant, but the core of the language is the same, and that is what matters here. The same SELECT statement can run on MySQL, PostgreSQL, SQL Server, Oracle, and so on, given that the same logical model is implemented in each of them.

But the abstraction doesn't stop at the SQL language. The relational model and the RDBMS implementations built on top of it are designed to be agnostic to the underlying storage device. For example:

In PostgreSQL, we can move any table to a different storage location of our choosing (a tablespace), and the model above remains unchanged. Just use ALTER TABLE ... SET TABLESPACE and the table now lives wherever that tablespace points, possibly on a different physical device. Oracle has tablespaces of its own, and SQL Server has something called filegroups. Same SELECT, same result, different storage, no changes, no fuss.
You can change the actual byte format in storage. In PostgreSQL, the table access method is pluggable (Citus columnar, TimescaleDB). SQL Server flips a table from rows to columns with a clustered columnstore index. Oracle does it with Hybrid Columnar Compression, or stores the whole table inside a B-tree as an index-organized table instead of a heap. Same logical table, same query — completely different bytes on disk.
In MySQL you can swap the storage engine out from under a table entirely: ALTER TABLE t ENGINE=... moves a table between InnoDB (B-trees on disk), MyISAM, Archive (compressed), CSV (a literal text file), or MEMORY (pure RAM). One table, one query, completely different machinery underneath.
The data doesn't even have to be in local storage at all. In PostgreSQL, make it a FOREIGN TABLE over an FDW (Foreign Data Wrappers) and the rows live happily on another machine entirely — the query doesn't even care or know. Oracle reads flat files as external tables; DuckDB queries Parquet files on disk as if they were tables. The data is somewhere else, on a remote machine - SQL is the same.
It doesn't even have to be one machine. Hand it to a distributed SQL engine (CockroachDB, TiDB, Spanner) and your data becomes a replicated, sharded key-value store smeared across a cluster. You have no idea which node, let alone which disk, holds any given row. CockroachDB speaks PostgreSQL's wire protocol, so standard PostgreSQL drivers and clients can connect to it; TiDB speaks MySQL's; and you still write ordinary SQL. Or push it to the cloud, where Snowflake keeps everything as columnar micro-partitions in object storage. Same SELECT, same result, no fuss.
And finally — no disk at all. Who says we need disks? Declare a MySQL table ENGINE=MEMORY, run PostgreSQL on a tmpfs with the whole data directory in RAM, switch on SQL Server's In-Memory OLTP, open SQLite as :memory:. No persistent storage anywhere. Same query, same result — and the thing is still, unmistakably, a database.

Martin writes: "To mitigate the time delay imposed by disks, you need indexes...". Then why are in-memory databases full of indexes? Why are they (in-memory databases) a thing at all? Because an index was never about the medium - it is about not scanning every row when you want one, be it on disk or in RAM.
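You can watch this happen in a database that never touches a disk. The sketch below uses Python and an in-memory SQLite database, both my assumptions for illustration, and the customer table, the index name, and the plan helper are made up. The exact wording of the query plan varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # no disk involved at any point
conn.execute("create table customer (id integer primary key, email text, name text)")
conn.executemany(
    "insert into customer (email, name) values (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the description


lookup = "select name from customer where email = 'user500@example.com'"

plan_without_index = plan(lookup)  # expected to describe a full table scan
conn.execute("create index customer_email_idx on customer (email)")
plan_with_index = plan(lookup)     # expected to describe a search via the index

print(plan_without_index)
print(plan_with_index)
```

Everything here lives in RAM, and the planner still switches from scanning every row to searching the index the moment one exists.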

As we can see: different devices, different formats, other machines, whole clusters, or no disk at all, and still the same logical model, the same SQL, the same results. The RDBMS already abstracts storage for you. It is designed to do exactly that. Thank you very much, but you are wrong.
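The smallest possible demonstration of that independence is a sketch like the following: the same DDL and the same query, run once against a file on disk and once against RAM. Python, SQLite, the customer table, and the run helper are all assumptions I picked so the example is self-contained; the engines named above do this on a much grander scale.

```python
import os
import sqlite3
import tempfile

SETUP = """
    create table customer (id integer primary key, name text not null);
    insert into customer (name) values ('Ada'), ('Linus'), ('Grace');
"""
QUERY = "select name from customer where name like '%a%' order by name"


def run(database: str) -> list:
    """Create the same logical model in the given database and run the same query."""
    conn = sqlite3.connect(database)
    try:
        conn.executescript(SETUP)
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as folder:
    on_disk = run(os.path.join(folder, "customers.db"))  # rows stored in a file
in_ram = run(":memory:")                                 # rows stored in RAM only

print(on_disk, in_ram)
```

The only thing that changes between the two runs is the connection string. The model and the query never find out.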

The Cost

The cost is a never-ending persistence ceremony. We are asked to carefully construct our data models in memory (to maintain the illusion of encapsulation), protect them from illogical and unwanted changes, and then save them to the database (persist them) when the time is right.

It is hard to overstate how big this persistence ceremony really is: the specialized patterns, the libraries, the countless tutorials and videos, the entire philosophies and strategies. But take a look at this trivial example:

This is a simple SQL statement to update a transfer approval - two people must sign off, they must be different people, and the status moves from pending to partly to fully approved. In a traditional codebase this would be a considerable chunk of code spanning multiple files and doing several database calls, just to maintain the illusion.

sql

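-- $1 = the approver signing off, $2 = the transfer being approved
update transfer_approvals
set
  -- every expression below sees the row as it was before this update
  -- first sign-off: pending -> partly_approved
  -- second sign-off by a different person: partly_approved -> fully_approved
  status = case
    when status = 'pending' then 'partly_approved'
    when status = 'partly_approved'
         and approver1 is distinct from $1 then 'fully_approved'
    else status
  end,
  -- record who signed first, then who signed second
  approver1 = case when status = 'pending' then $1 else approver1 end,
  approver2 = case
    when status = 'partly_approved'
         and approver1 is distinct from $1 then $1
    else approver2
  end
-- only transfers that still await a sign-off can match
where
  transfer_id = $2
  and status in ('pending', 'partly_approved')
returning status;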

We don't know where those transfer approvals are. Are they stored on a disk? Which disk? Are they in memory? What format are they in, and what machine are they on? We don't know, and we don't care. That is not the point. This is a DECLARATION. We have just declared the business rules for how to update transfer approvals.

It has nothing to do with storage as far as the application is concerned. And since our RDBMS engine hides and abstracts all the storage details and does all the persistence work for us, what are we left with then? That's right: the business logic and business rules. Just a declaration of how our transfer approvals should be updated in this case. Nothing else. No ceremony, no plumbing, no persistence layer, no storage, no nothing.

Here is the entire pile of code that disappears with this move:

the entity / aggregate class (the in-memory model)
the repository (to fetch it and put it back)
the unit-of-work / change tracker (to know which fields are dirty)
the O/R mapping config (to translate object <-> row)
the load → mutate-in-memory → save dance
the surrounding transaction scope

Every one of those exists for one reason only: to carry state from memory to storage and back. Remove that silly belief, and every one of those big code blocks removes itself.
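To see the transfer approval statement behave as described, here is a minimal runnable sketch in Python with an in-memory SQLite database as a stand-in. The table layout (text columns, a default status of 'pending') and the approve helper are my assumptions. The statement is adapted for SQLite: named parameters replace $1 and $2, `is not` replaces `is distinct from`, and a follow-up SELECT stands in for `returning status`, so unlike the original it reports a status even when no row was updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table transfer_approvals (
        transfer_id integer primary key,
        status      text not null default 'pending',
        approver1   text,
        approver2   text
    )
""")
conn.execute("insert into transfer_approvals (transfer_id) values (1)")

# The statement from the text, adapted for SQLite (see the notes above).
APPROVE = """
    update transfer_approvals
    set
      status = case
        when status = 'pending' then 'partly_approved'
        when status = 'partly_approved'
             and approver1 is not :approver then 'fully_approved'
        else status
      end,
      approver1 = case when status = 'pending' then :approver else approver1 end,
      approver2 = case
        when status = 'partly_approved'
             and approver1 is not :approver then :approver
        else approver2
      end
    where
      transfer_id = :transfer_id
      and status in ('pending', 'partly_approved')
"""


def approve(transfer_id: int, approver: str) -> str:
    """Apply one sign-off and return the status stored afterwards."""
    conn.execute(APPROVE, {"approver": approver, "transfer_id": transfer_id})
    return conn.execute(
        "select status from transfer_approvals where transfer_id = ?",
        (transfer_id,),
    ).fetchone()[0]


# alice signs, alice tries to sign again, then bob signs.
history = [approve(1, "alice"), approve(1, "alice"), approve(1, "bob")]
print(history)
```

The second sign-off by the same person changes nothing, and only a different approver moves the transfer to fully approved. All of that comes from the statement itself; the Python around it is just the connection and the call.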

3) Data Structures Abstraction Misconception

The Claim

If we accept the two previous claims, that the state lives in memory and that the database is a storage device, then the data itself must be an in-memory data structure as well. And if we are going to perform operations on it, protect the state, mutate it, and so on, then we need to have it in memory, and only suitable in-memory data structures will do, period.

In Domain-Driven Design (2003), Part II "Building Blocks of a Model-Driven Design," Chapter 6 "The Lifecycle of a Domain Object," in the section on Repositories (p. 108), Eric Evans writes:

For each type of object that needs global access, create an object that can provide the illusion of an in-memory collection of all objects of that type.

[Screenshot: the Repository definition. Source: Eric Evans, *Domain-Driven Design* (2003), p. 108.]

The word Evans uses: illusion. Not a real collection, oh no, just the illusion of an in-memory structure. This should tell you everything: you only build an elaborate pattern to simulate an in-memory collection if the real data collection was never in memory to begin with.

The Repository exists precisely because the objects live in the database, and its entire job is to make them look like they are sitting in memory. Fake it until you make it, except you will never make it.

The repository pattern is an admission in structural form: relational data, dressed up to pass as an in-memory data structure (so we can do OOP on it).

This is not a coincidence nor an isolated quote. For example, in the book Patterns of Enterprise Application Architecture (2002), Martin Fowler writes (source: https://martinfowler.com/eaaCatalog/repository.html):

A Repository mediates between the domain and data mapping layers, acting like an in-memory domain object collection.

Acting like an in-memory domain object collection? Why do we want to force relational data into an in-memory collection? Just for good measure, let's check out Microsoft's recommendation in their software architecture e-book, in the part that teaches us how to "Design the infrastructure persistence layer." Microsoft echoes Fowler almost word for word, describing the repository as a set of domain objects in memory:

A repository performs the tasks of an intermediary between the domain model layers and data mapping, acting in a similar way to a set of domain objects in memory.

And then they continue:

Basically, a repository allows you to populate data in memory that comes from the database in the form of the domain entities. Once the entities are in memory, they can be changed and then persisted back to the database through transactions.

Again, this just goes to prove the point above: the real state is in the database, and we are simply mandated to load temporary chunks of it into memory in order to maintain the illusion of encapsulation and the illusion of an in-memory data structure.

Perhaps we might end up with an illusion of the entire software solution?

In any case, more than a decade after Evans and Fowler had laid out this machinery that simulates in-memory collections, the mismatch still had not been solved. We know this because in 2014, the field's leading DDD experts from around the world gathered at the DDD eXchange conference in NYC to figure out how to do DDD better and, finally, solve this Object-Relational Impedance Mismatch. Because you do not hold a summit to solve a problem you have already solved.

One of the speakers, renowned DDD expert Vaughn Vernon, later published "The Ideal Domain-Driven Design Aggregate Store?", where he proposed his solution to the O/R Impedance Mismatch problem:

During the park bench discussion I promoted the idea of serializing Aggregates as JSON and storing them in that object notation in a document store. A JSON-based store would enable you to query the object’s fields. Central to the discussion, there would be no need to use an ORM. This would help to keep the Domain Model pure and save days or weeks of time generally spent fiddling with mapping details. Even more, your objects could be designed in just the way your Ubiquitous Language is developed, and without any object-relational impedance mismatch whatsoever. Anyone who has used ORM with DDD knows that the limitations of mapping options regularly impede your modeling efforts.

[Screenshot: Vaughn Vernon proposing JSON-serialized Aggregates in a document store. Source: Vaughn Vernon, "The Ideal Domain-Driven Design Aggregate Store?"]

The framing here is that O/R mapping tools and libraries are the main source of the impedance mismatch, and that if we could just get rid of them, the problem would be solved. No ORM, no mapping at all, and no impedance mismatch. That's the idea. In essence, the proposed data design is this:

Serialize each aggregate Object to JSON, store it as a blob
Every table is just (id, data json) — a key and a blob, nothing else
"Reference Other Aggregates By Identity Only" — no foreign keys, no joins, nothing, each blob standalone

That's it. That is the "ideal aggregate store" solution to the impedance mismatch problem.

Now, to be fair, Vaughn Vernon is not saying that RDBMS = file system. He does propose using Postgres for ACID and JSON querying. He does want to keep the relational engine. Just not the relational model.

And that is the whole trick. We have solved the Object-Relational impedance mismatch by removing the relational part entirely — and keeping the Object part, obviously. No relations, no foreign keys, no joins, no set operations, no nothing. Just a flat collection of objects, serialized, frozen to disk, and fetched by id. The illusion of an in-memory collection, made real at last.

So, yeah, the solution to the O/R impedance mismatch is to remove the R (relational part) entirely, and just have a key-value store with JSON blobs.

That was proposed more than a decade ago. Does anyone use that today? Is that how the industry builds? I don't think so. More than twenty years after Evans and Fowler, the mismatch sits exactly where it started. We never solved it.

The Reality

Let's get one thing out of the way immediately, because I don't want to win this argument by cheating.

A table is a set. That is not a metaphor — it is the mathematical definition. Codd's 1970 paper defines a relation as a subset of the Cartesian product of domains: a set of tuples.

[Screenshot: a relation defined as a subset of the Cartesian product of domains. Source: Codd, E.F., "A Relational Model of Data for Large Shared Data Banks," CACM 13(6), June 1970, pp. 377–387, §1.3.]

And SQL is an algebra over those sets — selection, projection, JOIN, UNION, INTERSECT, EXCEPT — a closed algebra, so every operation over sets returns another set you can keep operating on.

All true. And here is the problem with building the argument on that: an in-memory collection can do all of it too. LINQ in C# ships Join, GroupJoin, Union, Intersect, Except, Distinct, GroupBy. Those are not arbitrary method names — those are Codd's operators, reimplemented over IEnumerable. And every other ecosystem rebuilt some version of the same algebra over its collections; LINQ just did it most completely.
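LINQ is C#, but the point is not language-specific. Here is a sketch of the same idea in Python, with SQLite as a stand-in database and hypothetical customer and orders tables: a join and a set difference computed once in SQL and once over plain in-memory collections.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables, for illustration only.
    create table customer (id integer primary key, name text not null);
    create table orders (id integer primary key, customer_id integer not null);
    insert into customer values (1, 'Ada'), (2, 'Linus'), (3, 'Grace');
    insert into orders values (100, 1), (101, 1), (102, 3);
""")

# The algebra in SQL: customers with orders (join), customers without (except).
sql_with_orders = [r[0] for r in conn.execute("""
    select distinct c.name
    from customer c
    join orders o on o.customer_id = c.id
    order by c.name
""")]
sql_without_orders = [r[0] for r in conn.execute("""
    select name from customer
    except
    select c.name from customer c join orders o on o.customer_id = c.id
    order by name
""")]

# The same algebra over plain in-memory collections.
customers = conn.execute("select id, name from customer").fetchall()
orders = conn.execute("select id, customer_id from orders").fetchall()
mem_with_orders = sorted(
    {name for cid, name in customers for _, order_cid in orders if order_cid == cid}
)
mem_without_orders = sorted({name for _, name in customers} - set(mem_with_orders))

print(sql_with_orders == mem_with_orders, sql_without_orders == mem_without_orders)
```

Both sides produce the same answers, which is exactly the concession being made here: the operations port to memory without any trouble.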

So if the impedance mismatch were about operations — about what you can do with the data — it would have been solved around 2007, when the operators finished porting. Case closed, everybody go home.

The mismatch is still here. Which means it was never about the operations.

Here is what the table has that no collection in your process can have. Not what it does — what it is:

It is the record. As established in the first misconception: the row is not a representation of the state, it is the state. Your collection is a copy of it, taken at load time.
It is shared. Every process, every writer, every background job operates on the same table, under the engine's arbitration. Your collection is private to one process. The other writers don't know it exists, and they are not waiting for it.
It is durable and live. The table existed before your process started, will exist after it dies, and keeps changing under other writers the whole time. Your collection is a photograph. The table is the thing being photographed — and it kept moving after the shutter clicked.
It is guarded. CHECK, UNIQUE, NOT NULL, foreign keys — enforced transactionally, against every writer, no exceptions. Your in-memory validation binds exactly one thing: your copy. The next writer does not inherit your discipline.
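The photograph metaphor can be sketched in a few lines. This is an illustration under assumptions: SQLite from Python's standard library stands in for the database engine, and the `accounts` table and both connections are hypothetical. One connection loads a private copy, another writer changes the table, and the copy never finds out.

```python
import os
import sqlite3
import tempfile

# A file-backed SQLite database so two independent connections can share it.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db_path)
writer.execute(
    "create table accounts (id integer primary key, balance integer not null)"
)
writer.execute("insert into accounts values (1, 100)")
writer.commit()

# "Our" process loads the row into a private in-memory collection.
reader = sqlite3.connect(db_path)
snapshot = [
    {"id": row[0], "balance": row[1]}
    for row in reader.execute("select id, balance from accounts").fetchall()
]

# Another writer changes the table. Nobody tells the snapshot.
writer.execute("update accounts set balance = balance - 30 where id = 1")
writer.commit()

stale_balance = snapshot[0]["balance"]  # what the copy still believes
current_balance = reader.execute(
    "select balance from accounts where id = 1"
).fetchone()[0]  # what the record actually says

reader.close()
writer.close()
```

The copy still reports the balance it loaded, while the table has moved on.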

And now the key observation, the one this whole chapter hangs on: not one of these four is an operation.

LINQ could port Join because join is a function — values in, values out. Pure computation travels; you can implement it anywhere. Being the shared, durable, guarded record is not a function. There is no method you can add to List<T> that makes it be the authoritative state. You can port an operator. You cannot port a status.

Notice where the line falls between what made the trip into memory and what didn't. Everything that is algebra — join, filter, group, project — ported over just fine. Everything that is state — the transaction, the arbitration, the constraints, the durability — never left the database. It can't. The line between them is exactly the line we drew in the first misconception: computation versus state.

And if you want this confirmed by the industry itself: the most serious application of LINQ's relational operators is LINQ-to-Entities, which takes your C# expression tree and compiles it back into SQL, to send to the database. We rebuilt the algebra in memory, and its main job turned out to be translating itself back — because the algebra made the trip, and the data never did.

Which puts the domain object model in a completely new light. It is not an alternative data structure for your data. Look at the machinery again, piece by piece:

| The object model builds | To simulate |
| --- | --- |
| Repository | the table |
| Identity map | the primary key |
| Navigation properties | foreign keys and joins |
| Unit of work | the transaction |
| Change tracker | what `UPDATE ... SET` already knew |
| In-memory validation | `CHECK`, `UNIQUE`, and foreign key constraints |

That is not a different model. That is the same model — the database — re-implemented in RAM, minus the four properties that made it meaningful. The domain model is a simulation of the database, running inside your process. Evans told us himself, remember: the illusion of an in-memory collection. And the real thing sits two feet away the entire time, doing all of it correctly, under ACID, for every process at once. We call the copy "the domain" and the authority "a detail."

To be precise about what I am not claiming, before somebody builds a strawman out of it:

In-memory data structures are not the problem. Pure computation over values you were genuinely given is exactly what memory is for — take inputs, compute, return outputs. The game scene and the compiler syntax tree from the first misconception live in memory legitimately, because they are the state of those systems. LINQ over data you rightfully hold is wonderful.

Caching is not the problem either — honest caching. A cache knows it is a copy. It has a TTL, an invalidation story, and it never claims to be the truth.
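For contrast, here is roughly what an honest cache looks like. This is a minimal sketch, not a production cache: the `TtlCache` class, its `loader` callback, and the injectable `clock` are hypothetical names chosen for illustration. Every entry carries an expiry, and anyone who knows the source changed can invalidate it.

```python
import time


class TtlCache:
    """A cache that knows it is a copy: entries expire and can be invalidated."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # how to ask the authority for a fresh value
        self._ttl = ttl_seconds    # how long a copy may be trusted
        self._clock = clock        # injectable so the demo below is deterministic
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still within its TTL
        value = self._loader(key)  # expired or missing: go back to the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        self._entries.pop(key, None)


# Hypothetical authority and a fake clock, for demonstration only.
source = {"price": 10}
fake_now = [0.0]
cache = TtlCache(lambda key: source[key], ttl_seconds=5, clock=lambda: fake_now[0])

first = cache.get("price")          # loaded from the source
source["price"] = 12                # the authority changes
within_ttl = cache.get("price")     # still the old copy, and the cache admits it may be
fake_now[0] = 6.0                   # time passes beyond the TTL
after_expiry = cache.get("price")   # reloaded from the source
```

The design choice that matters is that staleness is bounded and explicit, which is the opposite of a copy that claims to be the truth.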

The sin is narrow and specific: a copy that claims to be the authority and has no answer for the moment it goes stale. That is not a cache, and it is not a data-structure choice. That is a simulation impersonating the thing it copied.

Hold on to that word — impersonating — because it explains something two decades of framework engineering could not fix. Every famous ORM pathology you have ever fought is not a separate bug with a separate fix. It is one failure with many faces: a simulation forced to behave like the authority it impersonates. Let's count the faces.

The Cost

What follows are not four independent grievances. It is one impossibility wearing four costumes.

Copy, not record → staleness and write amplification

The object must be hydrated before it can be mutated — that is the whole contract. So to change one column, the ORM does a SELECT followed by an UPDATE: load the row (usually the whole aggregate — scroll back to that Reddit picture and count the objects), let the object "decide" in memory, write it back. Remember the transfer approval from the previous chapter: one UPDATE statement, zero preliminary round trips, the rule declared right where the state lives. The simulated version is load → check in memory → mutate in memory → save. And between load and save, the row is free to change under you.
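The difference between the two shapes can be sketched as follows. Assumptions are stated loudly: SQLite stands in for the engine, the `stock` table is hypothetical, and the two "requests" are interleaved by hand on a single connection so the outcome is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table stock (item text primary key, qty integer not null)")
conn.execute("insert into stock values ('widget', 10)")


def load(item):
    return conn.execute(
        "select qty from stock where item = ?", (item,)
    ).fetchone()[0]


def save(item, qty):
    conn.execute("update stock set qty = ? where item = ?", (qty, item))


# Simulated style: each request loads, decides in memory, then saves.
copy_a = load("widget")       # request A sees 10
copy_b = load("widget")       # request B sees 10
save("widget", copy_a - 3)    # A writes 7
save("widget", copy_b - 4)    # B writes 6 and silently discards A's change
lost_update_result = load("widget")

# Set-based style: the change and its guard travel in one statement.
conn.execute("update stock set qty = 10 where item = 'widget'")  # reset the demo
for amount in (3, 4):
    conn.execute(
        "update stock set qty = qty - ? where item = ? and qty >= ?",
        (amount, "widget", amount),
    )
atomic_result = load("widget")
```

In the first variant seven units were requested and only four were deducted. In the second there is no window between load and save, because there is no load.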

Hence lost updates. Hence the "optimistic concurrency token": a rowversion column added to the schema (or, on PostgreSQL, the xmin system column pressed into service), there not because the business needs it, but to detect that your copy lied to you between load and save. The token is a confession, often written straight into the DDL: the copy is not the record, and we know it.
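Here is a minimal sketch of how such a token works, assuming a hypothetical `invoices` table with an integer `version` column and SQLite as a stand-in engine. Real ORMs generate something similar, but the details differ per framework.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table invoices ("
    " id integer primary key,"
    " status text not null,"
    " version integer not null default 1)"  # hypothetical concurrency token
)
conn.execute("insert into invoices (id, status) values (1, 'draft')")


def load(invoice_id):
    status, version = conn.execute(
        "select status, version from invoices where id = ?", (invoice_id,)
    ).fetchone()
    return {"id": invoice_id, "status": status, "version": version}


def save(copy, new_status):
    """Write back only if the row still carries the version we loaded."""
    cursor = conn.execute(
        "update invoices set status = ?, version = version + 1"
        " where id = ? and version = ?",
        (new_status, copy["id"], copy["version"]),
    )
    return cursor.rowcount == 1  # zero rows touched means the copy was stale


copy_a = load(1)
copy_b = load(1)
a_saved = save(copy_a, "approved")   # succeeds and bumps the version
b_saved = save(copy_b, "cancelled")  # matches no row: copy_b is out of date
final_status = load(1)["status"]
```

Note where the decision is made: in the `where` clause of the write, not in the object.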

Private, not shared → arbitration lives in the database anyway

Two requests load the same invoice. Each holds a private snapshot; each decides from it; both decisions are "valid" in memory, and one of them is wrong in reality. The decision your domain model makes is provisional — the only binding decision happens in the database, under a lock, inside a transaction, where arbitration has lived all along. The version-token "rescue" concedes exactly this: the object proposes, but the WHERE clause of the write disposes. Now add a background job, a second service, a nightly import — and the in-memory "state" stops being state at all. It is one process's guess about what the state was, some number of milliseconds ago. (Much more on this in the concurrency and integrity misconception.)

Graph walk, not set operation → the access-pattern tax

We conceded that collections have the operators. But the object model's shape pushes you away from them: objects navigate. customer.Orders, order.Lines — walking references one object at a time, each step a query you didn't see, fired from behind a property getter. One JOIN becomes a hundred SELECTs, and the call site looks innocent. This is N+1, the most documented performance pathology of the last two decades. And the fixes bill you separately: eager-load with Include and over-fetch half the database, or project into DTOs — at which point you are writing relational queries again, in C#, so that a library can compile them back into the SQL you were abstracting away. The abstraction gets abandoned at exactly the moment it gets tested.
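The arithmetic is easy to reproduce. The sketch below assumes hypothetical `customers` and `orders` tables in SQLite and a small `query` helper that records every statement it sends, then computes the same totals twice: once by navigating, once with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (id integer primary key, name text not null);
    create table orders (
        id integer primary key,
        customer_id integer not null references customers(id),
        amount integer not null
    );
""")
conn.executemany(
    "insert into customers values (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 101)],
)
conn.executemany(
    "insert into orders (customer_id, amount) values (?, ?)",
    [(i, 10) for i in range(1, 101)],
)

issued = []  # every statement sent through the helper below


def query(sql, params=()):
    issued.append(sql)
    return conn.execute(sql, params).fetchall()


# Navigation style: one query for the parents, then one more per parent.
totals_by_walking = {}
for customer_id, name in query("select id, name from customers"):
    rows = query(
        "select coalesce(sum(amount), 0) from orders where customer_id = ?",
        (customer_id,),
    )
    totals_by_walking[name] = rows[0][0]
walking_queries = len(issued)

# Set style: one statement, the engine does the matching.
issued.clear()
totals_by_join = dict(query(
    "select c.name, coalesce(sum(o.amount), 0)"
    " from customers c left join orders o on o.customer_id = c.id"
    " group by c.id, c.name"
))
join_queries = len(issued)
```

Same answer, one hundred and one statements against one. In an ORM the loop body is hidden behind a property getter, which is why the call site looks innocent.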

Simulation, not engine → the capability ceiling

Window functions. Recursive CTEs: foreign keys form a graph, and the engine will walk arbitrary-depth hierarchies for you, declaratively. GROUPING SETS, lateral joins, partial and expression indexes, set-based bulk updates. And underneath all of it: a cost-based planner with live statistics about your actual data, choosing between a hash join, a merge join, and a nested loop, and between an index scan and a sequential scan, per query, per data distribution. The simulation has none of this and cannot grow it. This one isn't even a defect. It is simply what a private copy in RAM lacks next to a database engine.
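As one small illustration of the hierarchy walk, here is a recursive CTE over a hypothetical self-referencing `categories` table. SQLite is used only so the sketch runs anywhere; the same `with recursive` shape works in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table categories ("
    " id integer primary key,"
    " parent_id integer references categories(id),"
    " name text not null)"
)
conn.executemany(
    "insert into categories values (?, ?, ?)",
    [
        (1, None, "root"),
        (2, 1, "hardware"),
        (3, 2, "keyboards"),
        (4, 3, "mechanical"),
        (5, 1, "software"),
    ],
)

# Declare the traversal: start at one node, then keep following parent_id.
descendants = conn.execute(
    """
    with recursive subtree(id, name, depth) as (
        select id, name, 0 from categories where id = ?
        union all
        select c.id, c.name, s.depth + 1
        from categories c
        join subtree s on c.parent_id = s.id
    )
    select name, depth from subtree order by depth, name
    """,
    (2,),
).fetchall()
```

There is no loop and no depth limit in the application code. The traversal is described once and the engine walks it.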

The standard reply — "but EF has raw SQL escape hatches" — is my argument wearing a different hat. If the object model were the system of record, there would be nothing to escape to. The moment you drop to SQL for the hard 20%, you have admitted where the real system was all along. The escape hatch is not a counterexample. It is the proof.

Now step back and look at the four together. None of them is an implementation defect. Hibernate is twenty-five years old; Entity Framework is eighteen. Some of the best engineers in the industry have been sanding these edges for two decades, and every one of these problems is still here — because they are not bugs in the simulation. They are the simulation working correctly: behaving exactly like what it is — a private, transient copy — instead of what it plays: the shared, durable record. That gap does not close with effort, because it is not made of code. It is made of what the two things are.

And that is Neward's quagmire, stated mechanically. An escalating investment that cannot win — not because the enemy is strong, but because a copy cannot out-invest its way into being the original.

4) Abstraction Over Algorithms

The Claim

Back in the Wikipedia section, I promised that behavior is a different problem axis and that we would come to it when we get to algorithms. Here we are.

The first three misconceptions were about data. This one is about behavior. The claim goes like this: fine, the data may sit in the database — but the algorithms, the logic, the behavior of the system, those belong in application code. SQL fetches; code computes. The database is where data sleeps, and the application is where it wakes up.

Martin Fowler wrote an article about this exact question back in February 2003, called "Domain Logic and SQL". It opens with an honest description of the mainstream attitude:

Many application developers, particularly strong OO developers like myself, tend to treat relational databases as a storage mechanism that is best hidden away.

There it is again — the storage device from the second misconception — but this time the subject is logic. To his credit, Fowler takes the SQL option far more seriously than most of his readers ever did. And still, the verdict:

Personally I don't think performance should be the first question. My philosophy is that most of the time you should focus on writing maintainable code.

With a warning label attached:

I would suggest that if you go the route of putting a lot of logic in SQL, don't expect to be portable — use all of your vendors extensions and cheerfully bind yourself to their technology.

And a concession that defines SQL's proper place in this worldview:

If you use an in-memory approach and have hot-spots that can be solved by more powerful queries, then do that.

So there is the claim, in its most reasonable and balanced form, from its most reasonable and balanced proponent: logic in memory is the default — maintainable, portable, testable. Logic in SQL is the exception — a performance hot-fix, to be applied reluctantly, hot-spot by hot-spot. Twenty-three years later, this is still the mainstream position, and most codebases you will open this week are built on it.

The Reality

SQL is a programming language. A declarative one, but a programming language.

Look back at the transfer approval statement from the second misconception. Two approvers, they must be different people, the status walks from pending to partly to fully approved. That is not "fetching data." That is behavior — an algorithm, expressed as a declaration, executed next to the data, atomically.

Now ask what the algorithms of a business system actually are. Strip away the ceremony and it is overwhelmingly this: filter, join, group, aggregate, rank, deduplicate, walk a hierarchy, compute something over ordered data. Which is, item for item, exactly what SQL was designed to express. A running balance, for example:

```sql
select
  customer_id,
  transaction_date,
  amount,
  sum(amount) over (
    partition by customer_id
    order by transaction_date, transaction_id
  ) as running_balance
from transactions;
```

The in-memory version of this algorithm: fetch every transaction over the wire, group by customer in a dictionary, sort each group, loop and accumulate — plus the memory footprint, plus deciding what happens on the day the table stops fitting in RAM. The declarative version is the window function above. Ranking, top-N-per-group, gaps in sequences, running totals, year-over-year — window functions. Org charts, bills of materials, category trees — recursive CTEs: declare the traversal, and the engine walks the graph. With recursive CTEs, SQL is Turing-complete — not that you should compute Fibonacci in it, but "SQL can't express my logic" stopped being true decades ago.
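To put the two versions side by side, here is a sketch that computes the running balance both ways and checks that they agree. The sample rows are invented, and SQLite (3.25 or newer, for window functions) stands in for the engine.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table transactions ("
    " transaction_id integer primary key,"
    " customer_id integer not null,"
    " transaction_date text not null,"
    " amount integer not null)"
)
conn.executemany(
    "insert into transactions values (?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01", 100),
        (2, 2, "2024-01-01", 50),
        (3, 1, "2024-01-02", -30),
        (4, 1, "2024-01-02", 20),
        (5, 2, "2024-01-03", -10),
    ],
)

# In-memory version: fetch everything, group, sort, loop, accumulate.
rows = conn.execute(
    "select customer_id, transaction_date, transaction_id, amount from transactions"
).fetchall()
groups = defaultdict(list)
for customer_id, date, tx_id, amount in rows:
    groups[customer_id].append((date, tx_id, amount))

in_memory = []
for customer_id in sorted(groups):
    balance = 0
    for date, tx_id, amount in sorted(groups[customer_id]):
        balance += amount
        in_memory.append((customer_id, date, amount, balance))

# Declarative version: the window function, plus an ORDER BY for comparison.
declared = conn.execute(
    """
    select customer_id, transaction_date, amount,
           sum(amount) over (
             partition by customer_id
             order by transaction_date, transaction_id
           ) as running_balance
    from transactions
    order by customer_id, transaction_date, transaction_id
    """
).fetchall()
```

Both produce the same rows. Only one of them had to pull the whole table into the process to do it.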

But here is the part that actually settles the argument, and it is not expressiveness. When you write the loop, you are writing one algorithm, frozen at commit time. When you write the declaration, the engine writes the algorithm — at runtime, with a cost-based planner and live statistics about your actual data. Hash join, merge join, nested loop; index scan or sequential scan; parallel workers or not — chosen per query, per data distribution, and re-chosen as the data grows. Your hand-written loop was a perfectly good plan at ten thousand rows. At ten million it is a catastrophe, and it will not adapt, because it is code — someone has to notice it, profile it, and rewrite it. The declaration just quietly gets a new plan.
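A small, hedged demonstration of the principle: the query text below never changes, yet the plan does. SQLite's planner is far simpler than PostgreSQL's, and here the trigger is a new index rather than fresh statistics, so treat this as an illustration of the mechanism only. The `orders` table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders ("
    " id integer primary key,"
    " customer_id integer not null,"
    " amount integer not null)"
)
conn.executemany(
    "insert into orders (customer_id, amount) values (?, ?)",
    [(i % 100, i) for i in range(1000)],
)

QUERY = "select amount from orders where customer_id = ?"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("explain query plan " + sql, (1,)).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY)  # no index yet: the engine scans the table
conn.execute("create index orders_customer_id on orders (customer_id)")
plan_after = plan(QUERY)   # same query text: the engine now uses the index
```

The application code that issues `QUERY` did not change between the two plans. A hand-written loop would have needed a rewrite to get the same benefit.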

Nobody would hand-write three join algorithms plus a statistics-driven optimizer to choose between them in the service layer. That machinery already exists. It sits directly under the data — and the claim instructs us not to use it.

Which leaves the claim's justifications, so let's take them in order. Maintainability: is the nine-line window function really less maintainable than the same algorithm spread across a repository, a service method, and a mapping profile? "Maintainable" is not a synonym for "written in my favorite language." Portability: the second misconception already dealt with that. The core of SQL is an ANSI/ISO standard that runs on every engine, while your domain layer is portable to exactly nothing; nobody in recorded history has swapped C# + EF for Java + Hibernate because the code was so nicely decoupled. Testability: SQL is testable. pgTAP exists, and the oldest trick in the book still works: open a transaction, run the test, roll back. Besides, when you mock the database out of a test of data logic, look at what is left standing: you are testing the simulation from the previous misconception, not the system.

The Cost

Every other corner of the industry has a name for this cost: moving data to compute instead of moving compute to data. The entire big-data field was built on the lesson that you ship the algorithm to where the data lives, because the other direction does not scale. Business software orthodoxy teaches the other direction as a best practice.

So we pay, in four installments:

The wire tax. Rows are read, serialized, shipped across the network, deserialized, and mapped into objects — so that a loop can run in the application, redoing work the engine would have done in place, with indexes.
The round-trip tax. Iterative logic in the application is chatty by nature: a query per step, per entity, per iteration. The N+1 problem from the previous chapter is this tax's most famous invoice.
The reimplementation tax. Every in-memory group, sort, join, and aggregate is a worse copy of what the engine already had: no indexes, no statistics, no planner, no parallelism, and memory bounded by your heap.
The frozen plan tax. The hand-written algorithm does not adapt to data growth. It just gets slower, quietly, until the nightly job that took a minute takes six hours, and someone gets paged to rediscover this chapter.

And the punchline is already inside the claim itself. SQL is admitted as the exception, for hot-spots — and then, over the life of the system, every part that matters turns out to be a hot-spot. One by one, the pieces that count get rewritten in SQL anyway, by tired people, during incident reviews. It is the same concession we saw with the raw-SQL escape hatch in the previous misconception: the exception clause ends up doing all the load-bearing work. At some point, the honest question is why the exception is not the architecture.

5) Abstraction Over Concurrency and Integrity

The Claim

DDD's answer to data integrity is the aggregate. Eric Evans, Domain-Driven Design, Chapter 6:

An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes.

The aggregate root guards the boundary and enforces the invariants — the business rules that must never be broken. That Reddit picture from the first misconception is exactly this: Article at the root, guarding its packages, price periods, and quantity ranges, with isValid() standing watch.

Vaughn Vernon — the same Vaughn Vernon from the aggregate store — codified the discipline in Implementing Domain-Driven Design (2013) and the "Effective Aggregate Design" essays it grew from. He is admirably precise about it:

An invariant is a business rule that must always be consistent.

A properly designed Aggregate is one that can be modified in any way required by the business with its invariants completely consistent within a single transaction.

Thus, Aggregate is synonymous with transactional consistency boundary.

Along with the rule that turns it into a discipline:

A properly designed Bounded Context modifies only one Aggregate instance per transaction in all cases.

So the claim: invariants — the rules that must always hold — are enforced by the domain model, in memory, one aggregate at a time. The community even has a slogan for it: the always-valid domain model.

But read Vernon's third sentence again, slowly, because something remarkable is happening in it. "Aggregate is synonymous with transactional consistency boundary." The pattern defines itself as a transaction. Hold that thought.

The Reality

First things first: the goal is completely right. Invariants must hold — that was never in dispute; it is the same reasonable instinct we already conceded in the first misconception. The question was never whether to enforce invariants. The question is where enforcement actually binds.

An invariant enforced in memory binds exactly one process — property four of the data structures misconception: the next writer does not inherit your discipline. And it binds at exactly one moment — validation time. Between the check and the write, the world keeps moving.

The canonical example, the one every team eventually learns in production: usernames must be unique. The domain model checks — no such username, valid — and inserts. Two concurrent registrations both check, both pass, both insert. The "always-valid" model just produced invalid data, twice, without a single line of it misbehaving. Check-then-act on a private snapshot is a race by construction — TOCTOU, time-of-check to time-of-use, a bug class old enough to have its own acronym. And notice what every team actually does about it: they add a UNIQUE constraint. The engine catches what the model cannot.
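The race is easy to stage deterministically. In this sketch the `users` table and the `register_all` helper are hypothetical, SQLite stands in for the engine, and "concurrency" is simulated by letting every request run its check before any request inserts.

```python
import sqlite3


def register_all(conn, names):
    """Simulate concurrent requests: all of them check first, then all insert."""
    passed_check = [
        name
        for name in names
        if conn.execute(
            "select count(*) from users where username = ?", (name,)
        ).fetchone()[0] == 0
    ]
    rejected_by_engine = []
    for name in passed_check:
        try:
            conn.execute("insert into users (username) values (?)", (name,))
        except sqlite3.IntegrityError:
            rejected_by_engine.append(name)
    row_count = conn.execute("select count(*) from users").fetchone()[0]
    return row_count, rejected_by_engine


# Model-only enforcement: no constraint in the schema.
unguarded = sqlite3.connect(":memory:")
unguarded.execute(
    "create table users (id integer primary key, username text not null)"
)
unguarded_result = register_all(unguarded, ["neo", "neo"])

# Engine enforcement: the same flow, plus UNIQUE.
guarded = sqlite3.connect(":memory:")
guarded.execute(
    "create table users (id integer primary key, username text not null unique)"
)
guarded_result = register_all(guarded, ["neo", "neo"])
```

Both requests passed the in-memory check in both runs. Only the run with the constraint ended with one row.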

It is worth being precise about why the engine can do what the model cannot: it is the only party that sees every writer. Which makes it the only place where invariant machinery means anything: NOT NULL, CHECK, UNIQUE, foreign keys, EXCLUDE — wrapped in transactions, with isolation levels up to SERIALIZABLE, where concurrent transactions are guaranteed to behave as if they had run one at a time. Enforced against every writer: your application, the second instance behind the load balancer, the background job, the DBA at 2 a.m. No exceptions, and no discipline required.

Time to pay the debt from the first misconception. The hardest invariant in that Reddit aggregate: price periods must not overlap. isValid() can inspect its own copy — while another process commits an overlapping period it has never heard of. Here is the entire invariant, declared:

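```sql
-- btree_gist lets a GiST index handle plain equality on package_id
create extension if not exists btree_gist;

alter table price_periods
add constraint price_periods_no_overlap
exclude using gist (
  package_id with =,
  -- '[]' makes both ends inclusive; && is the range overlap operator
  daterange(date_begin, date_end, '[]') with &&
);
```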

Two periods for the same package with overlapping date ranges can no longer exist. Not "will be caught, provided the request comes in through the domain layer." Cannot exist. Under any concurrency, from any writer, forever. One declaration. That is what enforcing an invariant actually means.

And now unhold that thought from the claim. Aggregate is synonymous with transactional consistency boundary. The transaction is a database concept. The pattern's own definition concedes that invariant enforcement is transaction work — it just redraws the transaction as an object graph, in one process's private memory, where it can see no other writer and therefore enforce nothing. The aggregate is a hand-drawn picture of a transaction. The database has the real ones — and the real ones can span whatever rows and tables the invariant actually needs, not just one object cluster.

The Cost

Everything is enforced twice — or worse, once, in the wrong place. The same rules live in C# validation and in database constraints, drifting apart release by release — the behavioral edition of the duplicated models from the first misconception, same bill, new line item. And the team that takes "always-valid" at its word and skips the constraints has it worse: their invariants are now enforced nowhere. They hold only in the absence of concurrency — which is to say, they are not enforced. They are observed, until further notice.

The races ship. Check-then-act bugs pass every unit test, because in the test the domain model really is alone — the race needs a second writer, and the test suite proudly mocks that out. The mock removes the exact enemy the invariant exists to fight. So the bug appears only under production load, intermittently, and is discovered as corrupted data weeks later: the duplicate payment, the double-booked slot, the negative stock. Every experienced engineer has one of these stories, and in every single one of them, the domain model passed all of its tests.

The aggregate-boundary industry. Vernon's rule — one aggregate instance per transaction — voluntarily outlaws the multi-row, multi-table atomicity the engine offers natively. So what happens when a real invariant spans two aggregates? A whole discipline unfolds: redesign the boundaries, or accept eventual consistency between aggregates, coordinated through domain events, process managers, sagas, compensating actions. An entire detect-and-compensate machinery, invented to route around BEGIN ... COMMIT. The database would have held both rows in one transaction — the real kind — and gone to lunch.
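For comparison, here is what the native alternative looks like when an invariant spans two tables. This is a sketch under assumptions: hypothetical `stock` and `order_lines` tables, SQLite as the engine, and Python's `with conn:` block standing in for BEGIN ... COMMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table stock (
        item text primary key,
        qty integer not null check (qty >= 0)
    );
    create table order_lines (
        id integer primary key,
        item text not null references stock(item),
        qty integer not null check (qty > 0)
    );
    insert into stock values ('widget', 5);
""")


def place_order(item, qty):
    """Insert the order line and decrement stock as one atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "insert into order_lines (item, qty) values (?, ?)", (item, qty)
            )
            conn.execute(
                "update stock set qty = qty - ? where item = ?", (qty, item)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK fired, and the order line was rolled back too


first_order = place_order("widget", 3)   # fits: stock goes from 5 to 2
second_order = place_order("widget", 3)  # would drive stock negative
remaining = conn.execute(
    "select qty from stock where item = 'widget'"
).fetchone()[0]
line_count = conn.execute("select count(*) from order_lines").fetchone()[0]
```

Two tables, one transaction, no saga. The rejected order leaves nothing behind to compensate for.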

And with that, the five misconceptions close into a single picture. The state lives in the database (1), which already abstracts its own storage (2). Its tables cannot be replaced by in-memory structures — only impersonated by them (3). Its language already expresses the algorithms (4), and its transactions and constraints are the only invariant enforcement that actually binds (5). Every axis the modern application layer re-implements is an axis the engine already owns. The impedance mismatch was never a mapping problem between two equal worlds. It is the ongoing cost of running a simulation of one world inside the other — and calling the original "a detail."
