Cassandra as a Data Warehouse in 2019: What Worked, What Didn't

In 2019, two of us at Marax were responsible for ingesting 10+ million click events a day from one of our clients — a fitness product marketplace — storing them, training a product discount recommendation model on top, and shipping the predictions back. The store we picked was self-hosted Apache Cassandra running on Azure VMs.

Calling Cassandra a "data warehouse" was already a category error in 2019, and I knew it at the time. Cassandra is an OLTP store built for fast writes and key-addressed reads, not the scan-heavy aggregations a warehouse does. We picked it anyway, for reasons that were mostly right, and paid for it in ways I'll get into. This post is the retrospective I wish I'd read before making the call.

A note on accuracy: this system is seven years gone and I no longer have the runbooks. The decisions, the architecture shape, the pain, and the two incidents below are remembered; exact node counts, configs, and parts of the backfill stack are reconstructed from what was standard practice in 2019, and I flag them inline where that's the case. The client is anonymized throughout. Every code snippet was re-validated this week against a real Cassandra 3.11 in Docker — schemas, the webhook handler, and the Spark jobs all execute (details at the end).

The problem

The client wanted personalized product discounts: for each user, which products to discount and by how much, within a configured min/max range. That model — how it was framed, trained, and evaluated — deserves its own post, and I'll write it separately. This post is about the data platform underneath it, which had to answer three requirements:

Real-time events going forward. The client agreed to push us click events (product views, add-to-carts, orders) as they happened. We gave them a webhook endpoint.
Historical events for day-one training. Their existing event history sat in their production database. We needed a one-time backfill so the model didn't start cold.
Per-user feature reads. Training and serving both needed "give me everything user X did recently" to be fast.

That third requirement is what pulled us toward Cassandra: per-user event history is exactly the partition-keyed access pattern Cassandra is built for, and requirement one meant sustained heavy writes, which is the other thing it's built for.

The architecture

graph TB subgraph client["Client (fitness marketplace)"] APP[Mobile + web apps] DB[(Production DB)] CONS[Recommendation consumer] end subgraph marax["Marax (Azure)"] WH[Webhook ingestion API Node.js] LAKE[(Azure Blob Storage 'the data lake')] CASS[(Cassandra 3.11 3 Azure VMs, RF=3)] SPARK[Spark on Azure VMs backfill + features] MODEL[Discount model training + scoring] SB[/Azure Service Bus queue/] end APP -- "events, real time" --> WH WH -- "writes, LOCAL_QUORUM" --> CASS DB -- "one-time export" --> LAKE LAKE --> SPARK SPARK -- "backfill writes" --> CASS CASS -- "feature extraction" --> SPARK SPARK --> MODEL MODEL -- "per-user discounts" --> SB SB --> CONS

Two ingestion paths into one store, one extraction path out, and a queue to close the loop. No Kafka anywhere — more on that below, because it was a deliberate choice that people pushed back on then and would push back on now.

Build vs buy: self-hosted Cassandra vs Cosmos DB's Cassandra API

We were deploying on Azure because the client was an Azure shop and the data needed to stay close. Azure's managed answer in 2019 was the Cassandra API for Cosmos DB, which had gone GA at Ignite in September 2018 after a year in preview. (Azure Managed Instance for Apache Cassandra — actual managed Cassandra rather than a compatibility layer — didn't exist until 2021.)

Cosmos was tempting: no nodes to babysit, SLAs, wire-protocol compatibility with the drivers we already wanted to use. We rejected it for two reasons.

The pricing model fought our workload. Cosmos billed provisioned throughput in Request Units per second — reserved in 100 RU/s increments, billed hourly whether consumed or not, with a 1KB write costing roughly 5–6 RUs against a 1KB read's single RU. Our workload was a firehose of small writes, sustained all day, with daily peaks. To not throttle at peak you provision for peak; at 10M+ writes/day that's a large RU reservation idling at 3 a.m. Three VMs with premium SSDs were a fixed, boring, predictable monthly number. I no longer remember the exact quotes, so I won't invent them — but the shape of the comparison was lopsided enough that the decision meeting was short.

The compatibility was wire-level, not behavior-level. The Cassandra API spoke CQL but mapped it onto Cosmos's own engine. The 2019 support matrix listed only a handful of CQL commands — no secondary indexes, no materialized views, batches "only unlogged," and gc_grace_seconds forced to zero — and consistency mapped onto Cosmos's five fixed levels rather than Cassandra's per-query tunable consistency. Add RU-based rate limiting surfacing as errors Cassandra drivers don't expect, and we'd have been debugging a Cassandra-shaped abstraction over a different database — with a compatibility matrix between us and every Stack Overflow answer.

So: self-hosted. Two people, no DBA, what could go wrong.

The deployment

This section is reconstructed — the shape is right, the exact numbers are from standard practice rather than memory.

Three Azure DS-series VMs with premium SSDs in one region, spread across an availability set — the DataStax deployment guidance for Azure of that era recommended exactly this pairing — running Cassandra 3.11, the current stable line in 2019 (3.11.4 shipped that February; 4.0 was still two years away). Replication factor 3 with NetworkTopologyStrategy, writes at LOCAL_QUORUM, reads at LOCAL_ONE for serving and whatever Spark negotiated for batch. Provisioning was shell scripts over SSH, not Terraform — it was 2019 and we were two people.

The keyspace and the main table:

CREATE KEYSPACE marax_events  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}; CREATE TABLE marax_events.events_by_user (  user_id      text,  bucket_date  date,  event_time   timestamp,  event_id     uuid,  event_name   text,  product_id   text,  category     text,  price        decimal,  session_id   text,  properties   map<text, text>,  PRIMARY KEY ((user_id, bucket_date), event_time, event_id)) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC)  AND compaction = {    'class': 'TimeWindowCompactionStrategy',    'compaction_window_unit': 'DAYS',    'compaction_window_size': 1  }  AND default_time_to_live = 15552000;  -- 180 days

The design decisions that mattered, in order of how much pain they saved or caused:

Day-bucketed partitions. The partition key is (user_id, bucket_date), not user_id alone. Cassandra's guidance has long been to keep partitions under roughly 100MB; an unbounded per-user partition accumulating clickstream forever blows through that for any heavy user. One partition per user per day keeps even the most obsessive shopper's partition tiny, and reads for "last N days of user X" become N cheap partition lookups you can issue in parallel.

event_id in the clustering key. Two events in the same millisecond don't overwrite each other, and webhook retries with the same event_id are idempotent upserts instead of duplicates. This made the ingestion path safely retryable for free — last-write-wins on identical rows is a feature here, not a bug. The corollary that's easy to miss: the event_id has to come from the client, generated once at the source. If the server mints an ID per attempt, every retry of a timed-out batch quietly inserts the same events under fresh keys, and your "idempotent" pipeline inflates counts precisely when things are flaky — so the endpoint rejects events without one.

TWCS — eventually. TimeWindowCompactionStrategy (in Cassandra since 3.0.8, and by 2019 the standard recommendation for time-series TTL data) buckets SSTables by time window, which is exactly right for append-mostly, TTL'd time-series data: expired data ages out by dropping whole SSTables instead of compacting tombstones grain by grain. I'm honestly not sure anymore whether we started with TWCS or moved to it during the tombstone incident below; I've written it into the schema above because it's what you should do, and what we ended up on.

There was a second table, events_by_type_day, partitioned by (event_name, bucket_date, shard) for campaign-level analytics — the shard smallint is there because a popular event type does millions of rows a day, which is far too hot for one partition; we spread it over 16 shards and re-merged in Spark.

The real-time path: webhook straight into Cassandra

The contentious choice: there was no queue between the webhook and the database. The client's backend POSTed event batches to our endpoint, and the handler wrote them to Cassandra synchronously, returning 202 only after the write was durable at LOCAL_QUORUM.

sequenceDiagram participant M as Marketplace backend participant W as Webhook API (Node.js) participant C as Cassandra (RF=3) M->>W: POST /v1/events (batch of 1..n) W->>W: validate (stable event_id required) W->>C: INSERT xN, prepared, LOCAL_QUORUM C-->>W: ack (2 of 3 replicas) W-->>M: 202 { accepted: n } Note over M,W: on 5xx or timeout, client retries same batch Note over W,C: same event_id → idempotent upsert, no dupes

The handler, essentially verbatim (validated this week against a real 3.11 ring — see the end):

const express = require('express');const cassandra = require('cassandra-driver'); const client = new cassandra.Client({  contactPoints: ['10.0.1.4', '10.0.1.5', '10.0.1.6'],  localDataCenter: 'dc1',  keyspace: 'marax_events',}); const INSERT_EVENT = `  INSERT INTO events_by_user    (user_id, bucket_date, event_time, event_id, event_name,     product_id, category, price, session_id, properties)  VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`; const app = express();app.use(express.json({ limit: '1mb' })); app.post('/v1/events', async (req, res) => {  const events = Array.isArray(req.body) ? req.body : [req.body];   for (const e of events) {    // event_id is required and must be stable across retries: a server-minted    // ID would create a new row on every retry, silently breaking idempotency.    if (!e.user_id || !e.event_name || !e.event_time || !e.event_id) {      return res        .status(400)        .json({ error: 'user_id, event_name, event_time, event_id are required' });    }  }   try {    // One statement per event, not a logged batch: these rows land in    // different partitions, and multi-partition batches put coordinator    // pressure exactly where we don't want it at this volume.    await Promise.all(      events.map((e) => {        const eventTime = new Date(e.event_time);        return client.execute(INSERT_EVENT, [          e.user_id,          cassandra.types.LocalDate.fromDate(eventTime),          eventTime,          cassandra.types.Uuid.fromString(e.event_id),          e.event_name,          e.product_id || null,          e.category || null,          e.price != null ? e.price : null,          e.session_id || null,          e.properties || null,        ], {          prepare: true,          consistency: cassandra.types.consistencies.localQuorum,        });      })    );    res.status(202).json({ accepted: events.length });  } catch (err) {    res.status(503).json({ error: 'write failed, retry with same event_id' });  }});

Why no Kafka? Because Cassandra's write path is a buffer: a write hits the commitlog and an in-memory memtable and acks (the commitlog fsyncs every 10 seconds by default) — it's closer to appending to a replicated log than to a B-tree update. At 10M+ events/day (~120/sec average, low-thousands at peak) a 3-node ring doesn't notice. Adding Kafka would have meant operating a second distributed system — brokers and ZooKeeper — to protect a database whose entire design point is absorbing exactly this write load, run by the same two people. We did adopt Kafka at Marax later, when genuine streaming analytics consumers existed; as a write buffer here it would have been ceremony.

The honest counterargument: a queue buys you availability when the database is down, and during the node-replacement incident below we made the client's retries do the job a queue would have done. Retry-with-idempotency was our durability story. It held, but it's a thinner cushion than a log.

The one-time backfill

Stack reconstructed — this is the part I remember least precisely.

The client's historical events lived in their production database. We weren't going to query a client's prod DB repeatedly, so the deal was: their team exported the event history to files, the dump landed in Azure Blob Storage on our side — that container was, grandly, "the data lake" — and a Spark job transformed their schema into ours and wrote to Cassandra via the spark-cassandra-connector (the 2.4 line, paired with Spark 2.4). Spark ran self-managed on Azure VMs too — consistent with the self-host-everything theme, and this predates the easy Databricks default.

import uuidfrom pyspark.sql import SparkSessionfrom pyspark.sql import functions as Ffrom pyspark.sql.types import StringType spark = (    SparkSession.builder.appName("historical-backfill")    .config("spark.cassandra.connection.host", "10.0.1.4,10.0.1.5,10.0.1.6")    # protect the ring: throttle writes so compaction keeps up    .config("spark.cassandra.output.throughput_mb_per_sec", "5")    .getOrCreate()) make_uuid = F.udf(lambda: str(uuid.uuid4()), StringType()) raw = spark.read.parquet("wasbs://exports@maraxlake.blob.core.windows.net/clickstream/") events = raw.select(    F.col("member_id").alias("user_id"),    F.to_date("activity_ts").alias("bucket_date"),    F.col("activity_ts").alias("event_time"),    make_uuid().alias("event_id"),    F.col("activity_type").alias("event_name"),    F.col("item_id").alias("product_id"),    F.col("item_category").alias("category"),    F.col("item_price").alias("price"),    F.col("session_ref").alias("session_id"),) (    events.write.format("org.apache.spark.sql.cassandra")    .mode("append")    .options(table="events_by_user", keyspace="marax_events")    .save())

The one line in there I'd tattoo on a junior engineer: the write throttle (spark.cassandra.output.throughput_mb_per_sec in the 2.4-era connector; it's camelCase in 3.0+). A Spark cluster will happily write to Cassandra faster than Cassandra can compact, and the failure is delayed — the ring takes the writes, pending compactions pile up, disks fill with un-merged SSTables, and read latency degrades an hour later when you think the job went fine. Bulk-loading a ring that's also serving live traffic is a "slower is faster" situation.

Feature extraction, briefly

Training reads went through the same connector in reverse: load events_by_user, aggregate per user, hand the feature frame to the model.

events = (    spark.read.format("org.apache.spark.sql.cassandra")    .options(table="events_by_user", keyspace="marax_events")    .load()) features = events.groupBy("user_id").agg(    F.max("event_time").alias("last_seen_at"),    F.count("*").alias("events_90d"),    F.countDistinct("session_id").alias("sessions_90d"),    F.sum(F.when(F.col("event_name") == "add_to_cart", 1).otherwise(0)).alias("carts_90d"),    F.sum(F.when(F.col("event_name") == "order_completed", 1).otherwise(0)).alias("orders_90d"),    F.avg("price").alias("avg_item_price"),)

Recency, frequency, category affinity, price sensitivity — classical signals feeding a gradient-boosted model that produced, per user, a discount within the campaign's configured min/max/step range. How that model was framed and why discount recommendation is a much weirder ML problem than it looks (you're predicting a treatment effect, not a label) is the follow-up post.

Closing the loop: Azure Service Bus

Predictions went back over an Azure Service Bus queue rather than a webhook into the client's systems. They were an Azure shop, so Service Bus was zero new infrastructure for them; queue semantics with at-least-once, peek-lock delivery meant they consumed at their own pace; dead-lettering gave failed deliveries somewhere to go that wasn't a retry storm against their API during their deploys.

import jsonfrom azure.servicebus import ServiceBusClient, ServiceBusMessage with ServiceBusClient.from_connection_string(CONN_STR) as client:    with client.get_queue_sender(queue_name="discount-recommendations") as sender:        sender.send_messages([            ServiceBusMessage(                json.dumps(rec),                content_type="application/json",                # one message per user per model run; consumer dedupes on this                message_id=f"{rec['model_version']}:{rec['user_id']}",            )            for rec in recommendations        ])

(That's today's azure-servicebus v7 API; the 2019-era Python SDK was v0.50, released that January, with a different surface — but the queue, the JSON contract, and the message_id dedupe key are what we actually shipped.)

The whole loop — event in, features, model, discount out — ran in production and the client consumed the recommendations in their campaigns. I don't have a clean uplift number that survived seven years and a company transition, so I won't claim one.

What worked

Write throughput was a non-issue, permanently. In the entire life of the system, ingestion was never the bottleneck or the incident. The thing Cassandra promises is the thing it delivered.

Per-user reads were fast and stayed fast. The partition-per-user-per-day model meant feature reads for serving were single-digit-millisecond partition lookups, unaffected by total data volume.

Idempotent ingestion paid for itself. Retries-by-design meant client-side delivery bugs and our own incident below resolved without a dedupe cleanup project.

Fixed cost. The bill was the same every month. For a startup invoicing a client for this system, that mattered more than any benchmark.

What didn't work

The ops burden was real and it landed on two people. Everything the warnings say about running Cassandra without a dedicated team happened in miniature: scheduled nodetool repair -pr that had to be staggered and babysat because anti-entropy repair is expensive; JVM heap tuning by folklore; watching nodetool compactionstats during the backfill like a heart monitor. None of it was hard, exactly. All of it was time, and the time came out of model work.

The incident that crystallized it: a node died — Azure VM gone, not coming back — and what should have been a documented afternoon (-Dcassandra.replace_address_first_boot, let it bootstrap, done) turned into most of a week of evenings. Streaming a full replica's data over the network to the replacement node took much longer than the docs led us to expect, and while it streamed we were a 3-node ring running on 2, which at RF=3/LOCAL_QUORUM means zero remaining failure budget. Nothing was down — the client never noticed, retries and quorum did their job — but I checked that streaming progress the way you check a sick child's temperature. With a managed service, that week is the provider's problem. That's what the Cosmos premium buys, and we'd chosen not to pay it.

Tombstones, the classic. Our 180-day TTL meant every event eventually became a tombstone. Months in, read latencies on long-lived users' queries started spiking, and the logs filled with tombstone warnings — Cassandra warns at 1,000 tombstones scanned per query and aborts at 100,000 by default (the abort surfaces as a TombstoneOverwhelmingException), and we were climbing that curve. The mechanics: expired TTL data can't just vanish — it has to survive as a tombstone for gc_grace_seconds (default ten days) so repairs can propagate the deletion, and until compaction actually evicts it, every read of that partition wades through it. Wide time-range scans over old partitions are exactly the worst case. The fix was the standard one — time-windowed compaction so whole expired SSTables drop at once, and query patterns that never scan past the recency horizon the model actually used. But we lost real days to diagnosing it, and the lesson stuck: in Cassandra, deletes are writes, and TTLs are scheduled deletes. You design for data death on day one or it ambushes you in month four.

The category error collected its tax. Every new question — "which categories drive repeat purchases?", "what does the funnel look like for users who got discounts?" — was a Spark full-table scan, because Cassandra answers the queries you modeled for and nothing else. A columnar warehouse answers ad-hoc questions in seconds; we'd built our analytics on a database that structurally can't. It's telling that the next system I built for this exact workload (at CustomerGlu, the successor company) kept the same shape — webhook in, events stored, models on top — but landed events in ClickHouse, where the ad-hoc questions cost nothing. That's the retrospective verdict in one sentence: Cassandra was the right ingestion engine and the wrong analytics store, and we'd accidentally hired it for both jobs.

What I'd do today

The shape survives: real-time webhook in, one-time backfill, offline feature jobs, queue back to the client. Seven years later I'd keep every arrow in that diagram and change most of the boxes — events land in ClickHouse or a lakehouse table instead of a ring I operate; the webhook becomes a managed ingestion endpoint; Spark becomes whatever the warehouse runs natively; the discount model gets better tooling but the same features.

And the build-vs-buy call inverts. In 2019, managed Cassandra on Azure was a compatibility layer with a pricing model hostile to firehoses, and self-hosting was genuinely the defensible choice — I'd defend it again for that year and that team. Today the managed options are real, and two engineers choosing to operate their own distributed database in 2026 would need a much better excuse than we had. The ops burden didn't make us better engineers in any way that compounding mattered; it made us tired. The week I spent watching a replacement node stream data taught me plenty about Cassandra internals and nothing about discount recommendation, and discount recommendation was the business.

Validation note: the CQL schema, webhook handler, and both Spark jobs in this post were executed this week against Cassandra 3.11 in Docker — schema applied via cqlsh, the Node handler writing and reading back real rows end-to-end (including re-POSTing the same batch and confirming the row count doesn't change), and the PySpark backfill + feature jobs running against the same ring via the spark-cassandra-connector. The Service Bus snippet was validated in dry-run (client and message construction, no live send) against azure-servicebus v7. The validation scripts live in this site's repo under research/code-validation/.