Domain-Driven Design while Growing Data Teams

April 15, 2026 · 5 min read

Backstory

Necessary vs Unnecessary Complexity

Necessary complexity → reflects the actual difficulty of the problem.

Unnecessary complexity → reflects decisions that felt easier at the time.

Unnecessary complexity does not emerge from the difficulty of the problem domain. It emerges because it is difficult to design scalable solutions. Techniques that work at the early phase of development often become unmanageable and too complex to comprehend if the techniques and tools do not evolve as the problem domain expands.

Real World Example

Multi-tenancy is hard. Keeping data consistent across service boundaries is hard. Serving analytical and transactional workloads with incompatible access patterns is hard. This is necessary complexity, and none of it goes away because you chose a simpler architecture. It moves somewhere else, usually into operational pain and accumulated technical debt.

Unnecessary complexity looks different. One database because you already had one. Custom views instead of defined APIs because views were faster to write. Client-specific schemas because nobody sat down to design a tenant model. None of this reflects the difficulty of the problem. It reflects the difficulty of having the right conversation before writing the first line of code.

The practical work of software architecture is learning to tell these apart. They feel identical from the inside when you are living in a system that has accumulated both.

Two Systems, One Infrastructure: The Original Sin

When an analytics firm begins building application software, the path of least resistance is to build inside the existing data infrastructure. The databases are already there. The data is already there. Why introduce new systems?

Because analytical infrastructure and application infrastructure serve fundamentally different purposes, and a system optimized for one is wrong for the other.

Analytical systems are built for reads across large data sets, flexible querying, heterogeneous schemas, and batch processing. The user defines the question. The value is in the flexibility of the interface and the breadth of the data.

Application systems are built for transactional reads and writes, predictable access patterns, normalized schemas, and low-latency point lookups. The product team defines the question in advance and embeds it in the design. The value is in the workflow and the state it produces.

Putting application tables into analytical infrastructure is not a simplification. It is a collision of two incompatible sets of requirements in one place, and both degrade as a result.

The BI Tool Test

Before any architectural decision, classify what you are building.

A BI tool produces insight. When the user closes the tab, nothing in the world has changed. The database is the product.

An application produces records. A submitted cost change. An approved space plan. A funded promotion deal. When the user completes the workflow, a durable artifact exists that other people and other systems will act on. The application state is the product.

The test: does the tool create state that matters to anyone else?

A secondary check: who defines the question? In a BI tool, the user defines the question. In an application, the product team defines the question in advance and the user provides parameters.
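
The two checks can be written down as a small classifier. This is a minimal sketch for illustration only; `ToolProfile`, `classify`, and `checks_agree` are hypothetical names introduced here, and treating a disagreement between the two checks as a flag for discussion is an assumption, not part of the test itself.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical description of something a team is about to build."""
    name: str
    # Primary test: does finishing the workflow leave a durable artifact
    # that other people or systems will act on?
    creates_shared_state: bool
    # Secondary check: who decides what question is being asked?
    question_defined_by: Literal["user", "product_team"]


def classify(tool: ToolProfile) -> str:
    """Apply the primary test: state that matters to someone else."""
    return "application" if tool.creates_shared_state else "bi_tool"


def checks_agree(tool: ToolProfile) -> bool:
    """True when the secondary check points the same way as the primary test."""
    expected = "product_team" if tool.creates_shared_state else "user"
    return tool.question_defined_by == expected


price_change_form = ToolProfile("supplier price change", True, "product_team")
sales_explorer = ToolProfile("ad-hoc sales explorer", False, "user")

print(classify(price_change_form), checks_agree(price_change_form))  # application True
print(classify(sales_explorer), checks_agree(sales_explorer))        # bi_tool True
```

When `checks_agree` returns False, the sketch does not pick a side; that case is probably worth a conversation before choosing infrastructure.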

Analytics frequently appears inside applications. Charts, forecasts, optimization scores, sortable tables. This does not make an application a BI tool. Analytics inside an application is informational scaffolding around a transaction. It exists to help a user make a better decision at a specific, predefined moment in a defined workflow. Removing it would leave the user less informed, but the workflow would still exist. Removing the workflow would leave nothing.

Showing a supplier their forecasted volume impact before they submit a price change is not a BI feature. It is a well-designed submission form.

The Three Layers

Any system in this class of problem contains three distinct layers of data. Treating them as one is the most common source of both unnecessary complexity and legitimate architectural complaints.

Raw and operational data. What clients provide, what third parties supply, what you ingest. Heterogeneous, often client-specific, authoritative at the source. This layer exists for analytical processing, model training, and pipeline development. Application software does not read from this layer directly.

Computed data products. The outputs of analytical and data science work. Elasticity coefficients, demand forecasts, optimization scores, benchmark indexes. These are produced by models running against the raw layer on a schedule or in response to a trigger, and the results are materialized into stable, well-defined outputs. This layer is the interface between analytical capability and application capability.

Application state. What users create and act on. Submissions, approvals, configurations, workflow status, document records. Owned entirely by the application. Written by users, read by downstream actors and systems.

The usual objection at this point is that many analytical outputs are dynamic and depend on user input at runtime. That is true, and it does not change the layer model. The distinction is not between precomputed and on-demand. It is between computation that belongs in the data services layer and computation that belongs in the application layer.

A user selecting 40 items and requesting a combined transferability and trip impact analysis is providing parameters to a model. The model runs in the data services layer, against data that already exists there, and returns a result through an API. The application collects the inputs, calls the service, and presents the result. Parameterized model execution and ad-hoc exploration of raw data are architecturally different things. If a feature requires arbitrary exploration of raw transaction data at query time, it is a BI feature and should be built as one, connected directly to the analytical layer.

The question to ask about any analytical feature in an application: is this a parameterized model call, or is it open-ended data exploration?
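
A parameterized model call might look like the sketch below. Everything named here is an assumption made for illustration: `ImpactRequest`, `ImpactResult`, `fake_impact_service`, and `handle_impact_analysis` are hypothetical, and the scoring arithmetic is a placeholder standing in for a real model behind a real API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ImpactRequest:
    """Parameters the user supplies. The question itself is fixed in advance."""
    item_ids: tuple[str, ...]


@dataclass(frozen=True)
class ImpactResult:
    transferability: float
    trip_impact: float


def fake_impact_service(request: ImpactRequest) -> ImpactResult:
    """Stand-in for an API in the data services layer (placeholder math)."""
    n = len(request.item_ids)
    return ImpactResult(transferability=1 / (1 + n), trip_impact=-0.01 * n)


def handle_impact_analysis(
    selected_items: Sequence[str],
    service: Callable[[ImpactRequest], ImpactResult],
) -> dict:
    """Application layer: collect inputs, call the service, present the result."""
    if not selected_items:
        raise ValueError("select at least one item")
    result = service(ImpactRequest(item_ids=tuple(selected_items)))
    return {
        "items": len(selected_items),
        "transferability": result.transferability,
        "trip_impact": result.trip_impact,
    }


selection = [f"item-{i}" for i in range(40)]
summary = handle_impact_analysis(selection, fake_impact_service)
print(summary["items"])  # 40
```

The application never sees raw transaction data and never composes a query. If a feature could only be expressed by passing arbitrary SQL or filters through this boundary, that is a sign it is open-ended exploration rather than a model call.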

A Taxonomy of Shared Data

The hardest reasoning problem in this architecture is shared data. Items, stores, suppliers, organizations, and their relationships are needed everywhere. Multiple applications need them. The analytical layer needs them. Getting this wrong produces either a tightly coupled monolith or an operationally painful over-distributed system. Neither is acceptable.

The key insight: "shared data" is not a single category. It contains several types that require different handling.

Reference data. The nouns of the domain. Items, stores, suppliers, organizations. Slow-changing, authoritative, needed everywhere, never created by a downstream application. One system owns this. Everyone else reads it.

Relationship data. Which suppliers can submit changes for which items. Which stores belong to which merchant. Slightly more dynamic than pure reference data but still owned by a central authority and consumed downstream.

Domain extensions. Net-new concepts added by a specific application that reference shared entities by their canonical identifiers. A space planning tool's fixtures and planograms. A promotion tool's deal structures. These belong entirely to their domain. No other application needs to understand them.

Application state. Created by and owned by a specific application. Never shared sideways to other applications except through explicit API contracts.
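
One way to make the four categories concrete is to model one example of each. The sketch below is illustrative only; the class and field names (`Item`, `SupplierItemAuthorization`, `PlanogramSlot`, `CostChangeSubmission`, and so on) are assumptions, not a schema taken from any real system.

```python
from dataclasses import dataclass
from enum import Enum


class SharedDataKind(Enum):
    REFERENCE = "reference"
    RELATIONSHIP = "relationship"
    DOMAIN_EXTENSION = "domain_extension"
    APPLICATION_STATE = "application_state"


@dataclass(frozen=True)
class Item:
    """Reference data: owned by the central authority, read-only elsewhere."""
    item_id: str
    name: str


@dataclass(frozen=True)
class SupplierItemAuthorization:
    """Relationship data: which supplier may submit changes for which item."""
    supplier_id: str
    item_id: str


@dataclass(frozen=True)
class PlanogramSlot:
    """Domain extension: owned by a space planning tool. It points at the
    shared item only through its canonical identifier."""
    planogram_id: str
    item_id: str
    facings: int


@dataclass
class CostChangeSubmission:
    """Application state: created and owned by one application."""
    submission_id: str
    supplier_id: str
    item_id: str
    new_cost_cents: int
    status: str = "submitted"


KIND_OF = {
    Item: SharedDataKind.REFERENCE,
    SupplierItemAuthorization: SharedDataKind.RELATIONSHIP,
    PlanogramSlot: SharedDataKind.DOMAIN_EXTENSION,
    CostChangeSubmission: SharedDataKind.APPLICATION_STATE,
}


def may_submit(supplier_id: str, item_id: str,
               authorizations: set[SupplierItemAuthorization]) -> bool:
    """The application consumes relationship data; it does not define it."""
    return SupplierItemAuthorization(supplier_id, item_id) in authorizations


auths = {SupplierItemAuthorization("sup-1", "item-9")}
print(may_submit("sup-1", "item-9", auths))  # True
print(may_submit("sup-2", "item-9", auths))  # False
```

Note that `PlanogramSlot` and `CostChangeSubmission` carry `item_id` but not the item's name or attributes. Copying those in would quietly create a second authority for reference data.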

The Rubric: Where Does This Data Live?

For any piece of data, answer these questions in order.

Who is the authority? There must be exactly one system that is right when everyone else is wrong. This is not a technical question; it is an ownership question. If you cannot answer it, answer it before making any other decision. Zero duplication of authority is the goal. Duplication of data across consumers is fine and often necessary.

Do you need to join against it at query time? Joins across network boundaries are not joins. They are sequential lookups that do not compose, do not optimize, and fail under volume. If the answer is yes, you need a local replica. If you only need occasional point lookups, an API call is sufficient.

How stale can you tolerate it being? An item name that is four hours old causes a cosmetic inconsistency that resolves on the next sync. An old contract value in a financial document causes a compliance problem. Calibrate replication frequency to the consequence of staleness.

Do multiple applications need to react when it changes? If adding a new item requires updates across three applications, polling from each is fragile. Publish a change event and let each downstream application maintain its own replica by consuming that stream.
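
The publish-and-replicate idea in the last question can be sketched with an in-memory stand-in for the event stream. This is a simplified illustration, not a production design: `ItemChanged`, `ChangeStream`, and `ItemReplica` are hypothetical names, and the per-item `version` counter is an assumed mechanism for letting the source win when events arrive twice or out of order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ItemChanged:
    """Hypothetical change event published by the authoritative service."""
    item_id: str
    name: str
    version: int  # assumed to increase with every change at the source


class ChangeStream:
    """In-memory stand-in for a message broker or CDC feed."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[ItemChanged], None]] = []

    def subscribe(self, handler: Callable[[ItemChanged], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: ItemChanged) -> None:
        for handler in self._subscribers:
            handler(event)


class ItemReplica:
    """Local copy kept by one consuming application; only events write to it."""

    def __init__(self) -> None:
        self._rows: dict[str, ItemChanged] = {}

    def apply(self, event: ItemChanged) -> None:
        current = self._rows.get(event.item_id)
        # Skip duplicates and stale deliveries: the highest source version wins.
        if current is None or event.version > current.version:
            self._rows[event.item_id] = event

    def name_of(self, item_id: str) -> str:
        return self._rows[item_id].name


stream = ChangeStream()
pricing_replica, planning_replica = ItemReplica(), ItemReplica()
stream.subscribe(pricing_replica.apply)
stream.subscribe(planning_replica.apply)

stream.publish(ItemChanged("item-9", "Oat Milk 1L", version=1))
stream.publish(ItemChanged("item-9", "Oat Milk 1 L", version=2))
stream.publish(ItemChanged("item-9", "Oat Milk 1L", version=1))  # late duplicate

print(pricing_replica.name_of("item-9"))   # Oat Milk 1 L
print(planning_replica.name_of("item-9"))  # Oat Milk 1 L
```

Adding a third consuming application means one more `subscribe` call. The publisher does not change, which is the property polling lacks.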

The Three Patterns

Replicate with a canonical source of truth when data is reference or relationship data, changes slowly, needs to be joined against at query time, and staleness is tolerable. Each consuming application maintains its own read-optimized copy, updated via ETL, CDC, or event stream. The replica is read-only from the perspective of the consuming application. The source of truth wins when replicas diverge.

API access with no local copy when data changes frequently enough that a replica would be unreliable, when point lookups are sufficient, when query volume is low enough that network calls are practical, or when the sensitivity of the data makes replication a security concern. User authentication and permissions are the standard example.

Colocation in the same database when two things that appear separate are the same bounded context: owned by the same team, deployed together, sharing the same operational requirements. The test is whether you could hand one piece to a different team and have both still function independently. If not, they are one thing and should live together.

Colocation is a legitimate architectural choice, not a fallback. The condition for choosing it is that the boundary between two concerns is genuinely weak. It is not a substitute for defining APIs and sync mechanisms when the boundary is real.
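
The rubric and the three patterns can be combined into a rough decision function. This is a sketch under stated assumptions: `DataProfile` and `choose_pattern` are hypothetical, and the order in which the conditions are checked is one reasonable reading of the text, not a rule it states. Real cases with conflicting answers, such as data that must be joined but cannot be stale, need a conversation rather than a function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Pattern(Enum):
    REPLICATE = "replicate with a canonical source of truth"
    API_ONLY = "API access with no local copy"
    COLOCATE = "colocation in the same database"


@dataclass(frozen=True)
class DataProfile:
    """Answers to the rubric questions for one piece of data."""
    authority: Optional[str]     # the one system that is right
    same_bounded_context: bool   # same team, deployed together
    joined_at_query_time: bool
    staleness_tolerable: bool
    sensitive: bool


def choose_pattern(p: DataProfile) -> Pattern:
    if p.authority is None:
        # Ownership comes first; nothing else can be decided without it.
        raise ValueError("decide who the authority is before anything else")
    if p.same_bounded_context:
        return Pattern.COLOCATE
    if p.sensitive or not p.staleness_tolerable:
        return Pattern.API_ONLY
    if p.joined_at_query_time:
        return Pattern.REPLICATE
    # Occasional point lookups: an API call is sufficient.
    return Pattern.API_ONLY


item_master = DataProfile("core data service", False, True, True, False)
permissions = DataProfile("identity service", False, False, False, True)

print(choose_pattern(item_master).name)  # REPLICATE
print(choose_pattern(permissions).name)  # API_ONLY
```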

What Accumulates When You Ignore This

Every table added to a shared schema without an ownership model expands the cognitive surface area for every developer on every team. Every view added to reshape data for a consuming application adds to a dependency graph that exists nowhere in your documentation. Every client-specific schema added for a new tenant keeps the onboarding cost constant while system complexity grows.

The specific failure modes are predictable.

Schema changes become cross-team coordination events. A column rename requires knowing every consumer of that table. That knowledge lives in people's heads. People leave.

View dependencies become invisible infrastructure. When views are used as join targets in other views, which are in turn referenced by application queries, the resulting query plans resist optimization, and the dependency chains cannot be enumerated without querying system catalogs and hoping everything is registered there.
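
To make the enumeration problem concrete, the sketch below walks a dependency map to find everything that transitively reads from one table. The map itself is hypothetical, as are the object names in it. In practice, assembling a complete map is the hard part, because application queries are not recorded in the database's catalogs.

```python
from collections import deque


def consumers_of(target: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every object that reads from `target`, directly or indirectly.

    `depends_on` maps an object (view or application query) to the
    objects it reads from.
    """
    # Invert the map: for each source, who reads it?
    readers: dict[str, set[str]] = {}
    for obj, sources in depends_on.items():
        for source in sources:
            readers.setdefault(source, set()).add(obj)

    # Breadth-first walk outward from the target.
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for reader in readers.get(current, ()):
            if reader not in seen:
                seen.add(reader)
                queue.append(reader)
    return seen


# Hypothetical dependency map for illustration.
depends_on = {
    "v_item_clean": ["items"],
    "v_item_store": ["v_item_clean", "stores"],
    "app.pricing_query": ["v_item_store"],
    "v_store_summary": ["stores"],
}

print(sorted(consumers_of("items", depends_on)))
# ['app.pricing_query', 'v_item_clean', 'v_item_store']
```

A column rename on `items` touches three consumers here, only one of which references the table by name.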

Analytical queries and transactional queries compete for the same resources. A runaway BI dashboard competes with your application for IOPS and connections. A table lock from a migration affects every application and every client simultaneously.

The cost of each new client stays constant or grows. Multi-tenant application architecture is the only model where the marginal cost of an additional tenant decreases over time. A schema-per-client model never achieves this.

These are not speculative risks. They are the sequence of events that follows from the current approach, given enough time and enough growth.

The North Star

Build one explicitly owned service that is the canonical authority for shared reference and relationship data. It exposes data via API for point lookups and publishes change events for downstream replication. Each application team applies the rubric above to decide independently whether they replicate locally or call the API for each entity type they need.

Applications own their domain extensions and their application state. They never reach past the core data service into the analytical layer directly. The analytical layer feeds the data services layer, which produces computed data products that applications consume.

This is not a more complex architecture. It is complexity given a location and rules for how it grows.

The question to ask at every design review: for this piece of data, which system is right when everyone else is wrong? When the answer is clear, the architecture follows from the rubric. When the answer is unclear, that is the conversation to have before writing any code.