ClickHouse CDC with Field-Level Updates

clickhouse, cdc

One field changed—why rebuild the whole object? With field-level CDC in ClickHouse, you update only what’s new and keep the full history.

The Pain of Full-Object Syncs

In many data pipelines, the go-to method for keeping downstream systems up to date is “just sync the full object” whenever something changes.

If you’re lucky, that object lives entirely in one source, so you can simply re-fetch it. But in most real systems, a “full object” is actually stitched together from multiple upstream systems:

  • CRM for contact info
  • RP for billing
  • Marketing DB for campaigns
  • HR or finance systems for headcount and revenue data

That means when you want to update just one field, you end up either:

  1. Overfetching — hitting every source, reassembling the full object, and pushing it downstream, or
  2. Suffering during backfills — running expensive multi-source joins just to fix a single attribute.

A Quick, Painful Reality Check

You find a typo in the _employee_count_ in your HR data for 1,200 companies. With a full-object sync, you now must:

  • Pull from _all_ upstream systems to rebuild the company object for each record
  • Rewrite every field — even though 99% are unchanged
  • Risk overwriting fresh updates from other sources

Result: wasted compute, slower jobs, higher storage churn, and higher risk of data corruption.

A Better Way: Field-Level Change Data Capture

Instead of treating “the object” as the atomic unit, treat **each field change** as the atomic event. A change to `employee_count` becomes its own row:

object_id: 42
object_type: "company"
field: "employee_count"
value: "350"
timestamp: 2025-08-09T09:15:32Z

This means:

  • A backfill for employee_count only touches that field for affected rows.
  • You don’t need to touch other fields or reassemble the object.
  • Late or out-of-order updates are handled naturally.

Designing a Solution in ClickHouse

We’ll use three layers in ClickHouse:

  • Raw change log (append-only, one row per field change)
  • Latest value per field (deduplicated by key + version)
  • Wide object view (JSON blob + optional materialized columns for hot fields)

1. Raw Change Log

Append every change into cdc_events:

CREATE TABLE object_properties_log
(
  object_id   UInt64,
  object_type LowCardinality(String),
  field       LowCardinality(String),
  value       String,
  ts          DateTime64(3, 'UTC'),
  op          Enum8('insert' = 1, 'update' = 2, 'delete' = 3),
  source      LowCardinality(String),
  version     UInt64 DEFAULT toUnixTimestamp64Milli(ts),
  ingested_at DateTime64(3, 'UTC') DEFAULT now64(3)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (object_type, object_id, field, ts);

This is your audit log. It’s append-only and never mutated.

2. Latest Value Per Field

We use a ReplacingMergeTree keyed by (object_type, object_id, field). The highest version wins.

CREATE TABLE object_properties_latest
(
  object_type LowCardinality(String),
  object_id   UInt64,
  field       LowCardinality(String),
  value       String,
  ts          DateTime64(3, 'UTC'),
  version     UInt64
)
ENGINE = ReplacingMergeTree(version)
PARTITION BY object_type
ORDER BY (object_type, object_id, field);

Keep it up to date via a materialized view:

CREATE MATERIALIZED VIEW mv_latest_into_props
TO object_properties_latest
AS
SELECT
  object_type,
  object_id,
  field,
  value,
  ts,
  toUnixTimestamp64Milli(ts) AS version
FROM object_properties_log
WHERE op != 'delete';

Deletes can be handled by either removing the row or writing a tombstone value.

3. Wide Table as JSON

Rather than pre-pivoting into hundreds of columns, store a JSON object per entity, plus materialized columns for the most queried fields.

CREATE TABLE companies_wide_json
(
  id         UInt64,
  data       String,
  name       String MATERIALIZED JSONExtractString(data, 'name'),
  domain     String MATERIALIZED JSONExtractString(data, 'domain'),
  version    UInt64,
  updated_at DateTime64(3, 'UTC')
)
ENGINE = ReplacingMergeTree(version)
ORDER BY id;
CREATE MATERIALIZED VIEW mv_companies_wide_json
TO companies_wide_json
AS
SELECT
  object_id AS id,
  toJSONString(
    mapFromArrays(groupArray(field), groupArray(value))
  )                             AS data,
  max(toUnixTimestamp64Milli(ts)) AS version,
  max(ts)                         AS updated_at
FROM object_properties_latest
WHERE object_type = 'company'
GROUP BY object_id;

Query hot fields directly from materialized columns:

SELECT id, name, domain
FROM companies_wide_json
WHERE id = 42;

And pull less-used fields with JSONExtract* functions.

Why This Pattern Works

  • No overfetching: Update only the fields that changed.
  • Fast backfills: Touch only the target attribute.
  • Idempotent: Latest table dedupes by (key, version).
  • Auditable: Full history stays in object_properties_log.
  • Composable: You can build multiple wide tables (e.g. companies, contacts) from the same latest table.

ClickHouse’s combination of fast ingestion, flexible materialized views, and efficient deduplication makes it an ideal home for this pattern. You end up with the best of both worlds: a complete history for auditing and debugging, and a low-latency “latest state” for serving queries.