Why Data Engineers Need Version Control for Data, Not Just Code
- Vexdata

- May 5
- 8 min read

Every data engineer uses Git. The code that defines pipeline logic, transformation rules, dbt models, and Airflow DAGs is committed, versioned, reviewed, and deployed with the same discipline any software team applies to application code. Breaking changes go through pull requests. Rollbacks are a single command. Every change has an author, a timestamp, and a message explaining what changed and why.
The data that flows through those pipelines has none of this.
When the upstream team renames a column from user_id to customer_id, there is no pull request. No review. No notification to downstream consumers. The change is made in the source system and the pipeline discovers it at runtime — when it breaks. When a vendor changes the format of an incoming file, there is no commit message. When a business rule is updated and the transformation logic changes, the historical data that was processed under the old rule remains in the warehouse unchanged, making any time-series comparison between pre- and post-change data meaningless without knowing when the rule changed.
This asymmetry — mature version control for code, no version control for data — is one of the most significant structural weaknesses in how data engineering teams operate. It creates a category of failure that Git can't protect you from, that monitoring dashboards won't catch, and that only becomes visible after the damage is done.
"An engineering team renames a database column from user_id to customer_id for better clarity. Twelve hours later, five executive dashboards show blank values, two ML pipelines have crashed, and your data engineering team is deep in an incident that could have been entirely prevented." — Tenets of Data, March 2026
What Version Control for Data Means in Practice
Version control for code means you can see every change, who made it, when, and why, and you can roll back to any prior state. Applied to data, the same guarantees span six distinct dimensions:
Dimension | What It Means for Code | What It Means for Data |
Schema versioning | Interface / API contract is versioned | Column names, types, and constraints are tracked across time |
State snapshots | Code is tagged at release points | Dataset values are snapshotted at defined intervals or events |
Change attribution | Every commit has an author and message | Every schema or value change has a timestamp and source |
Rollback | git revert to prior working state | Time-travel queries or snapshot restore to prior data state |
Breaking change detection | Failing tests in CI block merge | Schema drift alerts block pipeline load before downstream impact |
Consumer notification | PR review notifies dependents | Data contract violation alerts notify downstream consumers |
The Five Ways Missing Data Version Control Breaks Production
1. Schema Changes That No One Told You About
The most common failure. An upstream team alters a source schema — adds a column, renames a field, changes a data type from VARCHAR(100) to VARCHAR(50), drops a column that "nobody was using". Your pipeline has no mechanism to detect this before runtime. The next scheduled load fails — or worse, silently truncates values — and you spend three hours tracing the failure back to a schema change that happened four days ago in a system you don't own.
With schema version control in place: the schema change is detected at the point of ingestion, compared against the registered schema version, and flagged before the load proceeds. The pipeline owner is notified. The downstream consumers are notified. The load waits for explicit acknowledgment that the change is expected and accounted for.
2. Historical Data That Can't Be Trusted After a Business Rule Change
A business rule changes: what was previously classified as "active customer" is now defined differently, narrowing the criteria. The transformation logic is updated in the pipeline. Going forward, the warehouse reflects the new definition. But the historical data — loaded under the old rule — is still sitting in the warehouse with the old classification applied.
You now have a warehouse where data before date X means one thing and data after date X means something different. But there is no marker in the data indicating when the rule changed. Any analyst running a time-series query spanning the change boundary is comparing apples to oranges — and has no way to know it.
Data version control addresses this by recording transformation rule changes as versioned events — so that any historical analysis can be qualified against which version of the business rule was active for a given time period.
3. No Reproducibility for ML Training Data
A machine learning model is trained on a snapshot of the customer dataset from a specific date. Six months later, the model is underperforming and the team wants to retrain from the same starting point to isolate whether the degradation is due to model drift or data drift. Without data versioning, reproducing the exact training dataset is impossible — the data has been modified by subsequent loads, backfills, and business rule changes. The team has to make do with current data and hope it's close enough.
This is why data versioning is not just a data engineering concern: it is a prerequisite for reproducible ML. You cannot reproduce a model if you cannot reproduce its training data. Open table formats such as Delta Lake, Apache Iceberg, and Apache Hudi address this on large-scale data platforms by keeping a queryable history of table snapshots.
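One lightweight way to check that a retraining run really starts from the same data is to store a content fingerprint next to the model. The sketch below is an illustration only: `dataset_fingerprint` and the sample rows are hypothetical, and it assumes the dataset fits in memory as a list of JSON-serialisable dictionaries.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that is stable across row order and key order."""
    # Serialise each row canonically (sorted keys), then sort the rows so the
    # same set of records gives the same digest regardless of load order.
    canonical_rows = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical_rows:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


training_rows = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "churned"},
]
recorded = dataset_fingerprint(training_rows)

# Same records in a different order: the fingerprint should be unchanged.
reordered = list(reversed(training_rows))
# A later backfill changed one value: the fingerprint should differ.
backfilled = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "active"},
]
```

Comparing `dataset_fingerprint(current_rows)` with the recorded value before retraining tells the team whether they are looking at the original training data or at something that has since drifted. It does not restore the old data; that still needs a snapshot.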
4. Upstream Changes From Third-Party Sources
A vendor changes the format of a daily file feed. They add two new columns, change the encoding of a date field from DD/MM/YYYY to ISO 8601, and add a BOM character to the file header. Your ingestion pipeline has no schema registry for external file sources. The load fails. Or worse — the date field is interpreted incorrectly and loads without error, silently misattributing every date in the feed.
Schema version control for incoming files means: every file format is registered, every incoming file is compared against the registered format before loading, and any deviation triggers an alert and a quarantine rather than a silent load of malformed data.
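A pre-load check for the vendor feed described above might look something like the following. This is a sketch under assumptions: the feed is a UTF-8 CSV, and `REGISTERED_FORMAT`, `check_feed`, and the column names are hypothetical stand-ins for whatever your file-format registry holds.

```python
import csv
import io
from datetime import datetime

# Hypothetical registered format for a vendor feed.
REGISTERED_FORMAT = {
    "columns": ["order_id", "order_date", "amount"],
    "date_column": "order_date",
    "date_format": "%d/%m/%Y",
}


def check_feed(raw_bytes, registered=REGISTERED_FORMAT):
    """Return a list of deviations from the registered format (empty means OK)."""
    deviations = []
    if raw_bytes.startswith(b"\xef\xbb\xbf"):
        deviations.append("unexpected UTF-8 BOM in header")
    # utf-8-sig strips a BOM if present, so the remaining checks still run.
    text = raw_bytes.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    if header != registered["columns"]:
        deviations.append(f"columns changed: {header}")
    date_column = registered["date_column"]
    for line_no, row in enumerate(reader, start=2):
        value = row.get(date_column) or ""
        try:
            datetime.strptime(value, registered["date_format"])
        except ValueError:
            deviations.append(f"line {line_no}: bad {date_column} {value!r}")
            break  # one example is enough to quarantine the file
    return deviations


old_feed = b"order_id,order_date,amount\n1,05/03/2024,10.50\n"
new_feed = (
    b"\xef\xbb\xbforder_id,order_date,amount,currency,region\n"
    b"1,2024-03-05,10.50,EUR,EMEA\n"
)
```

Any non-empty result from `check_feed` would route the file to quarantine instead of the loader. Strict parsing catches a switch from DD/MM/YYYY to ISO 8601; it would not catch a switch to MM/DD/YYYY for days 1 to 12, which is one reason the registered format should be explicit rather than inferred.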
5. Nobody Knows What Changed or When
A data quality investigation begins. The question is: "when did this field start containing nulls?" Without data version control, the answer requires manual archaeology — querying logs, interviewing the team that owns the source system, checking Confluence pages that may or may not have been updated. With data version control in place, the answer is a query against the version history: show me every load of this table where null_rate for column X exceeded 2%, ordered by date.
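To make that query concrete, here is a minimal sketch. It assumes the pipeline appends one profile record per load to a history store; the `load_history` structure and the function name are hypothetical, and a real implementation would more likely be a SQL query over a metadata table.

```python
from datetime import date

# Hypothetical per-load profile records, as a pipeline might append after each load.
load_history = [
    {"table": "orders", "load_date": date(2024, 3, 3), "null_rate": {"email": 0.051}},
    {"table": "orders", "load_date": date(2024, 3, 1), "null_rate": {"email": 0.004}},
    {"table": "orders", "load_date": date(2024, 3, 2), "null_rate": {"email": 0.027}},
    {"table": "customers", "load_date": date(2024, 3, 2), "null_rate": {"email": 0.300}},
]


def loads_exceeding_null_rate(history, table, column, threshold=0.02):
    """Return loads of `table` where `column`'s null rate exceeded `threshold`, oldest first."""
    matches = [
        record
        for record in history
        if record["table"] == table
        and record["null_rate"].get(column, 0.0) > threshold
    ]
    return sorted(matches, key=lambda record: record["load_date"])


flagged = loads_exceeding_null_rate(load_history, "orders", "email")
first_bad_load = flagged[0]["load_date"] if flagged else None
```

The first entry in `flagged` answers "when did this start?" directly, which is the part that otherwise takes hours of log reading.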
What Data Version Control Looks Like Layer by Layer
Data version control is not a single tool or a single practice. It operates at four distinct layers of the data stack, each with different techniques and tooling.
Layer 1: Schema Version Control
The foundation. Every source schema — database tables, API response structures, file formats, and message schemas — is registered in a schema registry or a version-controlled schema store.
Every change to a registered schema produces a new version. Pipelines validate incoming data against the current registered version before processing.
In practice this looks like:
- dbt model definitions committed to Git, so schema changes go through PR review
- Confluent Schema Registry for Kafka topics, where incompatible schema changes are rejected at publish time
- JSON Schema or Avro schemas stored in version-controlled schema repositories for file feeds
- Schema drift detection at ingestion, comparing the incoming schema against the last registered version
# Example: schema drift detection at ingestion
registered_schema = schema_registry.get("orders_v2.3")
incoming_schema = infer_schema(incoming_file)
drift = compare_schemas(registered_schema, incoming_schema)
if drift.breaking_changes:
    quarantine(incoming_file)
    alert(owners=["data-engineering"], breaking_changes=drift.breaking_changes)
    raise PipelineHaltException("Breaking schema change detected")
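The example above leaves `compare_schemas` undefined. One possible shape for it is sketched below, assuming each schema is a dictionary mapping a column name to a `(type_name, length)` pair. The `SchemaDrift` class and the classification rules are assumptions for illustration, not the behaviour of any particular schema registry.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    breaking_changes: list = field(default_factory=list)
    non_breaking_changes: list = field(default_factory=list)


def compare_schemas(registered, incoming):
    """Compare two schemas given as {column: (type_name, length)} dictionaries."""
    drift = SchemaDrift()
    for column, (old_type, old_len) in registered.items():
        if column not in incoming:
            # A rename shows up here as a removal plus an addition.
            drift.breaking_changes.append(f"removed column: {column}")
            continue
        new_type, new_len = incoming[column]
        if new_type != old_type:
            drift.breaking_changes.append(f"type changed: {column} {old_type} -> {new_type}")
        elif new_len < old_len:
            drift.breaking_changes.append(f"type narrowed: {column} {old_len} -> {new_len}")
        elif new_len > old_len:
            drift.non_breaking_changes.append(f"type widened: {column} {old_len} -> {new_len}")
    for column in incoming:
        if column not in registered:
            drift.non_breaking_changes.append(f"added column: {column}")
    return drift


registered_schema = {"user_id": ("INT", 0), "name": ("VARCHAR", 100)}
incoming_schema = {"customer_id": ("INT", 0), "name": ("VARCHAR", 50)}
drift = compare_schemas(registered_schema, incoming_schema)
```

With these inputs the `user_id` to `customer_id` rename is reported as a removed column plus an added one, and the VARCHAR(100) to VARCHAR(50) change is reported as a narrowing. Both land in `breaking_changes`, so the load would be quarantined.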
Layer 2: Data Snapshot Versioning
For analytical datasets that need to be reproducible — ML training data, regulatory snapshots, point-in-time reporting — snapshot versioning creates immutable copies of dataset state at defined points.
Modern open table formats make this native:
Delta Lake: time travel queries (VERSION AS OF <n> or TIMESTAMP AS OF <t>) return data as it existed at any version still within the configured retention period.
Apache Iceberg: every write creates a new snapshot. Rollback to a prior snapshot is an atomic metadata operation, so it does not rewrite data files. Snapshot metadata includes the operation type, row count change, and timestamp.
Apache Hudi: maintains a timeline of all table operations. Incremental queries consume only records changed since a given instant. Point-in-time reads return table state as of a historical commit.
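The semantics these formats share can be shown with a toy in-memory model. The `SnapshotTable` class below is hypothetical and is not the API of Delta Lake, Iceberg, or Hudi; it only illustrates the idea that writes append versions and never mutate earlier ones.

```python
import copy


class SnapshotTable:
    """Toy append-only snapshot log: every write adds a version, none is mutated."""

    def __init__(self):
        self._snapshots = []  # list of (operation, rows) tuples; index = version

    def write(self, rows, operation="overwrite"):
        # Deep-copy so later changes to the caller's list cannot alter history.
        self._snapshots.append((operation, copy.deepcopy(rows)))
        return len(self._snapshots) - 1  # version number of the new snapshot

    def read(self, version=None):
        """Return rows as of `version`, or the latest snapshot if omitted."""
        if not self._snapshots:
            return []
        index = len(self._snapshots) - 1 if version is None else version
        return copy.deepcopy(self._snapshots[index][1])

    def rollback(self, version):
        # In this toy model a rollback is recorded as a new snapshot,
        # so the history stays complete.
        return self.write(self.read(version), operation=f"rollback to {version}")


table = SnapshotTable()
v0 = table.write([{"id": 1, "status": "active"}])
v1 = table.write([{"id": 1, "status": "churned"}])
v2 = table.rollback(v0)
```

After the rollback, `table.read()` returns the version 0 rows while `table.read(v1)` still returns the overwritten state. That is the property an ML team needs in order to re-read a training set exactly as it was.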
Layer 3: Transformation Rule Versioning
When a business rule changes, that change should be recorded with:
- The old rule definition
- The new rule definition
- The effective date of the change
- The datasets affected
- Whether historical data was or was not backfilled under the new rule
This metadata allows any downstream consumer to qualify their analysis against the rule version that was active for their query window — and it allows the data team to answer the question "when did this metric start measuring differently" from the version history rather than from institutional memory.
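As a rough sketch, the rule change log can be as simple as a list of versioned records plus two lookups. Everything here is hypothetical, including the `RuleVersion` fields, the example definitions of "active customer", and the dates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RuleVersion:
    version: int
    definition: str
    effective_from: date
    datasets: tuple
    history_backfilled: bool


# Hypothetical change log for the "active customer" rule.
ACTIVE_CUSTOMER_RULE = [
    RuleVersion(1, "purchase in last 365 days", date(2022, 1, 1), ("dim_customer",), False),
    RuleVersion(2, "purchase in last 90 days", date(2024, 7, 1), ("dim_customer",), False),
]


def rule_active_on(versions, day):
    """Return the rule version in force on `day`, or None if none had started."""
    started = [v for v in versions if v.effective_from <= day]
    return max(started, key=lambda v: v.effective_from) if started else None


def versions_in_window(versions, start, end):
    """Return every rule version in force at some point between start and end."""
    in_window = [v for v in versions if start < v.effective_from <= end]
    first = rule_active_on(versions, start)
    return ([first] if first else []) + sorted(in_window, key=lambda v: v.effective_from)


spanning = versions_in_window(ACTIVE_CUSTOMER_RULE, date(2024, 1, 1), date(2024, 12, 31))
```

If `versions_in_window` returns more than one version and the later one has `history_backfilled` set to `False`, the query window mixes two definitions of the metric and the analyst should be told so.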
Layer 4: Data Contract Versioning
A data contract is a formal, versioned agreement between a data producer and data consumer defining the expected schema, quality standards, update frequency, and semantic meaning of a dataset. Data contracts bring software-style API versioning discipline to data: breaking changes require a new major version, backward-compatible additions are minor versions, and consumers can pin to a specific contract version while a migration window is open.
The key properties of a versioned data contract:
- Machine-readable: not a Confluence page, but a structured file that tooling can validate against
- Versioned using semantic versioning (major.minor.patch), so consumers know whether a change is breaking
- Enforced at ingestion: data that violates the contract is rejected, not silently loaded
- Consumer-registered: every downstream consumer of a dataset is registered so they can be notified of version changes
- Change-managed: breaking version changes require a deprecation period and migration path, not immediate cutover
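The versioning and consumer-registration properties above combine into a simple check: given a newly published contract version, which pinned consumers need to hear about it? The sketch below assumes plain major.minor.patch strings and a hypothetical in-memory registry. The compatibility rule it applies (same major version, not older than the pin) is one common reading of semantic versioning, not a standard for data contracts.

```python
def parse_version(text):
    """Split a 'major.minor.patch' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(pinned, published):
    """True if a consumer pinned to `pinned` can read data published under `published`."""
    pinned_v, published_v = parse_version(pinned), parse_version(published)
    # Same major version and not older than the pin: additions only, nothing removed.
    return published_v[0] == pinned_v[0] and published_v >= pinned_v


def consumers_to_notify(registry, published):
    """Return registered consumers whose pinned version is broken by `published`."""
    return sorted(
        name for name, pinned in registry.items() if not is_compatible(pinned, published)
    )


# Hypothetical consumer registry for one dataset contract.
consumer_pins = {
    "exec_dashboard": "2.1.0",
    "churn_model": "2.3.0",
    "finance_export": "1.4.2",
}
```

Publishing 2.3.0 would flag only `finance_export`, which is still pinned to the 1.x line. Publishing 3.0.0 would flag every registered consumer, which is the signal to open a migration window.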
Practical Starting Point: What to Implement First
If your team currently has Git for code but nothing for data versioning, the highest-impact starting point is schema drift detection at ingestion. This single practice catches the most common and most damaging class of data version failure — an upstream schema change that your pipeline doesn't know about until it breaks.
The implementation sequence:
1. Register your current source schemas. For every source system your pipelines consume, capture the current schema definition (column names, types, constraints) and store it in a version-controlled repository or a schema registry. This is your baseline.
2. Add schema comparison at the ingestion step. Before your pipeline loads any data, compare the incoming schema against the registered baseline. Flag any deviation: added columns (informational), removed columns (high risk), type changes (critical), renamed columns (critical).
3. Define halt vs continue behaviour. Breaking changes (dropped columns, type narrowing) halt the pipeline. Non-breaking changes (new columns, type widening) alert and continue. Define this per table, because not every schema change warrants a halt.
4. Add open table format for analytical datasets. If you're on Databricks, Snowflake, or an S3-based data lake, enable Delta Lake or Iceberg table format for your key analytical tables. Time travel is then available immediately, without additional infrastructure.
5. Introduce data contracts for high-stakes datasets. Start with the 3-5 datasets that, if they change unexpectedly, would cause the most damage. Define a contract for each, register downstream consumers, and enforce the contract at ingestion. Expand from there.
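The halt vs continue decision in the sequence above can be kept as a small policy table with per-table overrides. This is a sketch only: the change-type names, the `Action` enum, and the example override are assumptions, and the defaults mirror the breaking and non-breaking split described in the steps.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ALERT = "alert_and_continue"
    HALT = "halt"


# Default response per change type, following the severities listed above.
DEFAULT_POLICY = {
    "column_added": Action.ALERT,
    "type_widened": Action.ALERT,
    "column_removed": Action.HALT,
    "type_narrowed": Action.HALT,
    "type_changed": Action.HALT,
    "column_renamed": Action.HALT,
}

# Hypothetical per-table override: a scratch table tolerates dropped columns.
TABLE_OVERRIDES = {
    "staging.scratch_events": {"column_removed": Action.ALERT},
}

_SEVERITY = [Action.CONTINUE, Action.ALERT, Action.HALT]


def decide(table, change_types):
    """Return the strictest action required by any detected change on `table`."""
    policy = {**DEFAULT_POLICY, **TABLE_OVERRIDES.get(table, {})}
    # Unknown change types halt by default: safer than loading blindly.
    actions = [policy.get(change, Action.HALT) for change in change_types]
    return max(actions, key=_SEVERITY.index, default=Action.CONTINUE)
```

Keeping the policy as data instead of branching logic means a table owner can relax or tighten behaviour through a reviewed change to one dictionary, which puts the policy itself under version control.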
💡 The principle: version control for data doesn't mean building a git-for-data system from scratch. It means applying the same discipline — track what changed, when, by whom, and why — to schemas, business rules, and dataset state using the tools that now exist to do it.
The Bottom Line
The data engineering discipline has matured enormously over the last decade. Pipelines are more reliable. Orchestration is more sophisticated. Transformation frameworks like dbt have brought software engineering practices into data modelling. But the fundamental gap — version control for the data itself, not just the code that processes it — remains largely unsolved in most organisations outside the largest data platforms.
The consequences are real and recurring: silent schema breaks, irreproducible analyses, historical data that can't be trusted across a business rule change boundary, and post-incident investigations that take hours because nobody tracked what changed in the data. These are not data quality problems in the traditional sense. They are data engineering discipline problems — problems that arise because the data stack is treated as less versioning-worthy than the code stack.
Schema versioning, snapshot history, data contracts, and transformation rule tracking are the answers. The tooling exists. The patterns are established. The remaining question is whether data engineering teams choose to close the gap before it closes them.
→ Data Observability (schema drift detection): vexdata.io/data-observability
→ Data Validation Platform: vexdata.io/data-validation
→ Book a demo: vexdata.io/contact



