The Hidden Cost of MINUS Queries — And the Faster Way to Compare Data

Vexdata
Jun 4
10 min read

If you have ever validated a data migration, reconciled a data warehouse load, or compared a production table against a staging environment, you have almost certainly used a MINUS query. It is the most intuitive tool for the job: subtract one dataset from another, see what is left, call that the discrepancy.

It is also, at scale, one of the most expensive queries you can run — in compute time, in cloud credits, and in the engineering hours spent waiting for results that should have arrived in seconds.

Most data teams know MINUS queries are slow. What they often do not realise is exactly how slow, why the performance scales so badly, and that there is a fundamentally different approach that delivers the same results at a fraction of the cost.

This post covers all three: the real cost of MINUS queries, why they degrade at scale, and the hash-based comparison approach that replaces a 45-minute query with a 47-second result.

45 minutes to compare 5 million rows with MINUS. 47 seconds with hash-based comparison. Same result. Same accuracy. The difference is the approach, not the data.

What a MINUS Query Actually Does

A MINUS query (called EXCEPT in most SQL dialects) returns rows that exist in the first result set but not in the second. In data validation, it is used like this:

-- Classic source-to-target validation with MINUS

SELECT * FROM source_table

MINUS

SELECT * FROM target_table;

The query reads both tables in full, loads all rows into memory or spill storage, sorts them, and then performs a set subtraction. Run it again in the other direction to catch rows that exist in the target but not the source, and you have what looks like a complete comparison.

This works correctly. The problem is not accuracy — it is execution cost.

The Real Cost of MINUS Queries at Scale

Most data engineers have an intuitive sense that MINUS queries are slow. What is less understood is the specific mechanics that make them so expensive — and why the cost does not scale linearly with data volume.

1. Full Table Scans on Both Sides

A MINUS query must read every row from both tables before it can perform the comparison. There is no index optimisation. There is no partition pruning. There is no early termination. If you have a 5-million-row source table and a 5-million-row target table, the database reads 10 million rows before it produces a single result.

On a cloud data warehouse like Snowflake or BigQuery, where you pay per byte scanned, this is a direct billing event. A MINUS query that runs in a development environment against a sample table becomes a very different bill in production against the full dataset.

2. Sort-Merge or Hash Join Internally

To perform set subtraction, the database engine must identify matching rows between the two sets. It does this using one of two strategies: a sort-merge join (sort both sets, walk through them in parallel) or a hash join (build a hash table from one set, probe it with the other). Either way, the operation requires materialising significant intermediate state — either sorted data written to disk, or a hash table that may exceed memory and spill.

For large tables, this intermediate materialisation is where the performance collapses. A table that fits comfortably in memory at 1 million rows may cause disk spill at 10 million rows, and the performance difference between memory-resident and disk-spilling operations is typically an order of magnitude.

3. Deduplication Semantics

MINUS (and EXCEPT) performs implicit deduplication. Unlike a JOIN-based approach, it treats the result as a set — duplicate rows are removed. This means the database must track every row it has already returned to avoid emitting it twice, adding an additional bookkeeping overhead on top of the join computation.

4. No Progressive Results

A MINUS query produces no output until the entire computation is complete. There is no way to get partial results, no way to validate a sample first, no way to fail fast. You wait the full duration regardless of whether the tables match or not. If they match, you waited the full time to find out there was nothing to report.

⚠ On Snowflake, a MINUS query comparing two 5-million-row tables can consume 4–8 credits depending on warehouse size and data distribution. Run this daily as part of a validation pipeline and you are looking at $2,000–$4,000 per month in compute costs for a single table comparison.

Why MINUS Queries Fail at Migration Scale

The problems described above are manageable when you are comparing one or two tables in a development environment. They become a serious operational problem in a data migration context, where you are validating hundreds of tables, often under time pressure, against a tight go-live window.

Consider a typical enterprise data warehouse migration:

250 tables to validate across source and target
Average table size: 3 million rows
MINUS query average duration: 8 minutes per table
Full validation run: 250 × 8 minutes = 33 hours

A 33-hour validation run means one thing in practice: teams do not run it. They sample. They spot-check. They run validation on the 20 most important tables and hope the rest are fine. The validation that was supposed to be the quality gate for a multi-million-dollar migration becomes a checkbox exercise because the tooling cannot keep pace with the timeline.

This is not a hypothetical scenario. It is the actual operational constraint that causes teams to declare migrations successful before they are — and it is why data quality issues surface in production weeks after go-live, after the project team has disbanded and the source system has been decommissioned.

Hash-Based Comparison: A Fundamentally Different Approach

The core insight behind hash-based data comparison is straightforward: instead of comparing rows directly, you compute a cryptographic fingerprint of each row and compare fingerprints. If the fingerprints match, the data matches. If they do not, you drill down — and only then do you scan individual rows.

This inverts the cost structure of comparison entirely.

How It Works in Practice

The comparison runs in three phases, each progressively more expensive — but each phase only runs if the previous phase detected a discrepancy:

Phase	What Happens	Cost if Tables Match	Cost if Tables Differ
Phase 0: Row count	Compare row counts between source and target	< 1 second	< 1 second — stops here if counts differ significantly
Phase 0: Sample hash	Hash a random 1,000-row sample from each table	~2 seconds	~2 seconds — early warning of mismatch
Phase 1: Full hash	Compute MD5/SHA hash of entire table as one fingerprint	~10 seconds for 5M rows	~10 seconds — hash mismatch triggers Phase 2
Phase 2: Row-level drill	Identify specific mismatched rows via partition hashing	Not reached	Targets only affected partitions — not full scan

The key insight: if the tables match — which they do in most production runs after a correct migration — the comparison completes in under 15 seconds for a 5-million-row table. No Phase 2. No row-level scan. Just a fingerprint confirmation.

Only when a mismatch is detected does the system escalate to the more expensive operation — and even then, it targets only the partitions or row ranges where the mismatch occurred, not the full table.

The SQL Behind It

The whole-table hash approach computes a single fingerprint across all rows and all columns:

-- Whole-table hash computation (Snowflake)

SELECT MD5(LISTAGG(row_hash, ',') WITHIN GROUP (ORDER BY primary_key))

AS table_fingerprint

FROM (

SELECT primary_key,

MD5(CONCAT(

COALESCE(CAST(col1 AS VARCHAR), '__NULL__'), '|',

COALESCE(CAST(col2 AS VARCHAR), '__NULL__'), '|',

COALESCE(CAST(col3 AS VARCHAR), '__NULL__')

)) AS row_hash

FROM your_table

) row_hashes;

This query reads each table exactly once, computes a hash per row using all column values, and aggregates to a single fingerprint. The computation is CPU-bound rather than I/O-bound — it processes rows sequentially rather than holding them in memory for comparison.

Two important implementation details: NULL handling uses a consistent sentinel value ('__NULL__') so that NULLs in one table do not produce false positives when compared against NULLs in the other. And deterministic ordering by primary key ensures the same data always produces the same fingerprint regardless of physical storage order or query parallelism.

Performance: MINUS vs Hash Comparison

Scenario	MINUS Query	Hash Comparison	Speedup
5M rows, tables match	~45 minutes	~12 seconds	225×
5M rows, mismatch found	~45 minutes	~47 seconds	57×
50M rows, tables match	~8 hours	~90 seconds	320×
250 tables, full validation	~33 hours	~50 minutes	~40×
Cross-database comparison	Not possible (single engine)	Native — each DB hashes locally	Enables new use case

The performance difference is not marginal. It is not a 20% improvement that might matter in some contexts and not others. It is a difference that changes what is operationally possible.

A validation run that takes 33 hours with MINUS queries cannot be part of a migration go-live checklist. A validation run that takes 50 minutes can. The performance improvement does not just save time — it changes whether comprehensive validation is feasible at all.

Two Properties That Make Hash Comparison Production-Ready

NULL-Safe Comparison

NULL values are one of the most common sources of false positives in data comparison. In SQL,

NULL != NULL — two NULL values in the same column in the same row are not considered equal by standard equality operators. A naive hash implementation that treats NULLs inconsistently will flag mismatches that are not real discrepancies.

The correct approach — used in Vexdata's implementation — is to substitute a consistent sentinel string ('__NULL__') before hashing. This ensures that a NULL in the source and a NULL in the target produce the same hash contribution, and that a NULL in the source against a real value in the target produces a different hash — which is the correct detection behaviour.

Cross-Database Comparison

MINUS queries only work within a single database engine. You cannot write a MINUS query that compares a Snowflake table against an Oracle table. This is a fundamental limitation for the validation use case, where source and target systems are often entirely different database platforms — which is the point of the migration.

Hash-based comparison solves this by computing the hash natively in each database. Snowflake computes its hash of the source table using Snowflake SQL. Oracle computes its hash of the target table using Oracle SQL. The two fingerprints are then compared outside either database — a trivial string equality check that requires no data movement between systems.

This makes cross-platform migration validation possible for the first time without exporting data to a neutral location for comparison — which is both operationally complex and often prohibited by data governance policies on production data.

✓ Cross-database comparisons supported natively: Snowflake ↔ Oracle · MySQL ↔ PostgreSQL · SQL Server ↔ BigQuery · Any combination across all major platforms.

What This Means for Your Migration Validation Strategy

The practical implication of hash-based comparison is that comprehensive, full-dataset validation becomes operationally feasible in contexts where MINUS-based validation was not.

Validate After Every Load Wave, Not Just at the End

With MINUS queries, validating after every wave of a migration is impractical — the cumulative time would push the migration timeline out by days. With hash comparison, each table validation takes seconds rather than minutes. A full validation pass after each wave takes minutes rather than hours. Catching a mapping error after wave one rather than at cutover is the difference between a quick fix and a re-migration.

Run Validation in the Cutover Window

Migration cutover windows are measured in hours. Running a 33-hour MINUS-based validation pass during a cutover window is impossible. Running a 50-minute hash-based validation pass is feasible — it becomes the final gate before traffic is switched to the new system, rather than a post-hoc reconciliation that happens days later.

Include All Tables, Not Just the Critical Ones

When validation is expensive, teams inevitably prioritise. The 20 most important tables get validated thoroughly. The remaining 230 get a row count check and a prayer. Hash-based comparison eliminates this triage — when validation takes seconds per table, validating all 250 tables costs the same as validating 20 used to.

💡 The best time to validate is before the migration starts, after every wave, and at cutover. Hash-based comparison makes all three of these operationally feasible. MINUS queries make none of them feasible at enterprise scale.

What the Results Look Like in Practice

Here is the actual output from Vexdata's hash comparison engine on a 5-million-row table comparison — first a match, then a mismatch:

MATCH SCENARIO — 5,231,847 rows

[Phase 0] Row counts match ✓ (0.8s)

[Phase 0] Sample hashes match ✓ (2.1s)

[Hash ] Source hash: a7f3b2c1e9d4... ✓

[Hash ] Target hash: a7f3b2c1e9d4... ✓

✅ DATA MATCH — Total time: 12.4 seconds

MISMATCH SCENARIO — 5,000,000 rows

[Phase 0] Row counts match ✓ (0.9s)

[Phase 0] Sample hash MISMATCH ✗ (1.8s)

[Sample] Found 23 differences in sample

❌ DATA MISMATCH — Total time: 47.2 seconds

📊 15 sample row differences exported to report

47 seconds to compare 5 million rows and get actionable sample differences. Not 45 minutes. Not "check back after lunch."

Database Support

The hash-based comparison approach works across all major relational database platforms and cloud data warehouses — including cross-platform comparisons that MINUS queries cannot perform at all:

Platform	Native Hash Support	Cross-Platform Comparison
Snowflake	MD5, SHA-256 native	✓ Compare to any other platform
Oracle	ORA_HASH, STANDARD_HASH	✓ Compare to any other platform
PostgreSQL	MD5, SHA functions	✓ Compare to any other platform
SQL Server	HASHBYTES (MD5, SHA)	✓ Compare to any other platform
MySQL	MD5, SHA1, SHA2	✓ Compare to any other platform
BigQuery	MD5, SHA256, FARM_FINGERPRINT	✓ Compare to any other platform

Each database computes its hash natively using its own SQL dialect. The comparison happens outside either database — a fingerprint equality check that requires no cross-system data movement and no shared compute environment.

The Bottom Line

MINUS queries are the right mental model for data comparison — subtract one set from another and examine what remains. The problem is the implementation at scale: full table scans on both sides, expensive intermediate materialisation, no early termination, and no cross-platform support.

Hash-based comparison delivers the same logical result through a fundamentally more efficient mechanism. By computing a fingerprint of each table and comparing fingerprints first, the common case — tables that match — resolves in seconds rather than minutes. Mismatches are detected in the sample phase within two seconds, then localised to specific rows without scanning the full table.

The operational consequence is significant. Validation that was previously too expensive to run comprehensively — after every wave, at cutover, across all tables — becomes feasible. The quality gate that migration projects need actually works within the timelines migration projects operate under.

If your current migration validation strategy relies on MINUS queries and you are validating at scale, the question worth asking is not whether the approach is correct — it is whether the performance is allowing you to validate as thoroughly as the project actually requires.

→ See Vexdata Data Validation: vexdata.io/data-validation

→ Data Migration Testing: vexdata.io/data-migration

→ Book a 20-min demo: vexdata.io/contact

The Hidden Cost of MINUS Queries — And the Faster Way to Compare Data

Recent Posts

Comments

Data Ingestion Validation

Data Transformation Validation

Data Lake Testing

Date Warehouse Migration

Data Migration Testing

BI Tool Testing

Cloud Migration Testing

CRM Tool Migration Testing

Flat File Testing

Document Validation

Data Validation

Data Migration Testing

Data Quality and Cleansing

Data Observability

Designed by DataDrivify