Cassandra Is a 'Row Store': Truths, Lies, and What the Source Code Says

The Apache Cassandra repository describes the project in one sentence in its README:

Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.

Every database has rows. PostgreSQL has rows. SQLite has rows. So what is that sentence actually claiming? After writing up how we ran Cassandra as an event warehouse in 2019 — and confessing the ways that hurt — I wanted to answer that question properly. So I cloned apache/cassandra (trunk, version 7.0-dev as I write this) and read it. This post is what the source code says, where the marketing-era terminology lies, and a few experiments on my laptop that make the storage model visible.

Why Cassandra exists at all

Cassandra was built at Facebook by Avinash Lakshman and Prashant Malik for Inbox Search — the feature that let you search messages you'd sent and received. Lakshman had previously co-authored Amazon's Dynamo paper, and it shows. The repo's own architecture overview states the lineage in one line:

This initial design implemented a combination of Amazon's Dynamo distributed storage and replication techniques and Google's Bigtable data and storage engine model.

That sentence explains almost everything confusing about Cassandra. It is two databases glued together by design: Dynamo answers "which machines hold this data and how do they agree" (consistent hashing, multi-master replication, gossip, tunable consistency), and Bigtable answers "how does each machine store it" (memtables, immutable SSTables, log-structured merge). When something about Cassandra seems weird, it's usually because you're looking at the seam between the two papers.

The timeline, verified against primary sources: open-sourced on Google Code in July 2008 and announced by Facebook engineering that August; entered the Apache Incubator on January 2, 2009 per the ASF record (Wikipedia says March); graduated to a top-level project on February 17, 2010. The canonical paper is Lakshman & Malik, LADIS 2009. Current GA is the 5.0 line, released September 5, 2024; the Docker image I ran the experiments below against reported 5.0.8.

The core mechanism, traced through the source

The write path never reads

A CQL INSERT arrives at whichever node your driver picked — every node can coordinate; there is no primary. The coordinator hashes the partition key with Murmur3Partitioner into a token, looks up which replicas own that token range (the replication strategy lives on the keyspace — more on that below), and fans the mutation out via StorageProxy.mutate(). Each replica does exactly two cheap things: append the mutation to the CommitLog (sequential disk write, for durability) and insert it into a Memtable (sorted in-memory structure). When a memtable fills, it's flushed to disk as an immutable SSTable. Nothing is ever updated in place.
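The key-to-replica lookup is easier to hold in your head as code. This is a minimal sketch, not Cassandra's implementation: ToyRing and token_for are hypothetical names, MD5 stands in for Murmur3 so the example needs only the standard library, and each node gets a single token where a real cluster assigns many. The replica walk is the simple "next nodes clockwise" flavor.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Hash a partition key to a signed 64-bit token.

    Assumption: MD5 stands in for Murmur3 so the sketch stays stdlib-only.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class ToyRing:
    """Hypothetical ring with one token per node (real clusters use many)."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so ownership lookups can binary-search.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, partition_key, rf):
        """Walk clockwise from the key's token and take the next rf nodes."""
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        count = min(rf, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = ToyRing({"node-a": -2**62, "node-b": 0, "node-c": 2**62})
owners = ring.replicas("user:42", rf=2)  # two adjacent nodes on the ring
```

Any node can run this lookup, which is why any node can coordinate: ownership is a pure function of the key and the ring, with no directory to consult.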

The crucial property: a write never reads existing data first. An UPDATE in Cassandra is the same operation as an INSERT — a new timestamped version of some cells, appended. Conflict resolution is deferred to read time and to compaction, where the highest write timestamp wins (last-write-wins, per cell). This is why Cassandra's write throughput is famous: every write is an append plus an in-memory insert, no matter how big the dataset gets.
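Here is a toy model of that write path, under loud assumptions: ToyReplica is a hypothetical class, the memtable is a plain dict rather than a sorted structure, the flush trigger is a cell count rather than memory pressure, and ties on timestamp simply keep the later arrival. What it does preserve is the property that matters: a write touches the log and memory, and never opens a flushed SSTable.

```python
class ToyReplica:
    """Hypothetical single replica: commit log, memtable, immutable SSTables."""

    def __init__(self, memtable_limit=1000):
        self.commit_log = []   # append-only, for durability
        self.memtable = {}     # (partition, clustering, column) -> (timestamp, value)
        self.sstables = []     # flushed snapshots, oldest first, never modified
        self.memtable_limit = memtable_limit

    def write(self, partition, clustering, column, value, timestamp):
        """INSERT and UPDATE are the same call: record a new timestamped cell."""
        self.commit_log.append((partition, clustering, column, value, timestamp))
        key = (partition, clustering, column)
        held = self.memtable.get(key)
        # Only the in-memory copy is consulted; flushed SSTables are never read here.
        if held is None or timestamp >= held[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable SSTable and start fresh."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}
```

Writing the same cell again after a flush leaves the old version sitting in its SSTable and puts the new one in the fresh memtable. Both now exist, and deciding between them is somebody else's problem.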

Consistency is enforced by counting, nothing fancier: WriteResponseHandler blocks until the number of replica acks satisfies your consistency level (ONE, QUORUM, ALL...). Cluster membership and liveness ride on gossip — the Gossiper javadoc describes a node picking a random peer every second and exchanging three messages (Syn → Ack → Ack2), which is how a thousand-node cluster knows who's alive with no central registry.
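"Counting, nothing fancier" fits in a few lines. This sketch is my own simplification: AckCounter is a hypothetical stand-in for the response handler, it ignores timeouts and datacenter-aware levels, and it returns a boolean where the real handler blocks a thread. The quorum arithmetic (a majority of the replication factor) is the part to take away.

```python
def block_for(consistency_level, replication_factor):
    """How many replica acks a coordinator waits for at a given level."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return required[consistency_level]


class AckCounter:
    """Hypothetical stand-in for a write response handler."""

    def __init__(self, consistency_level, replication_factor):
        self.needed = block_for(consistency_level, replication_factor)
        self.acks = 0

    def on_ack(self):
        """Record one replica ack; return True once the level is satisfied."""
        self.acks += 1
        return self.acks >= self.needed


handler = AckCounter("QUORUM", replication_factor=3)
first = handler.on_ack()    # False: one ack of the two needed
second = handler.on_ack()   # True: the write can be reported successful
```

With a replication factor of 3, QUORUM on both writes and reads means any read set of two replicas overlaps any write set of two, which is the whole trick behind "tunable" consistency.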

The read path pays the bill

Reads do the work writes deferred. A single-partition read checks each SSTable's bloom filter, seeks via the partition index, and then merges the partition's fragments from the memtable plus every SSTable that contains them — newest timestamp wins, cell by cell. The more SSTables your partition is smeared across, and the more deleted data the scan has to step over, the more a read costs.
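The merge step is where "newest timestamp wins, cell by cell" becomes concrete. A minimal sketch, assuming each fragment is a dict keyed by (clustering, column); merge_fragments is a hypothetical name, and bloom filters, indexes, and tombstones are deliberately left out.

```python
def merge_fragments(fragments):
    """Merge one partition's fragments; the newest timestamp wins per cell.

    Each fragment maps (clustering, column) -> (timestamp, value) and stands in
    for the memtable or one SSTable's slice of the partition.
    """
    merged = {}
    for fragment in fragments:
        for cell_key, (timestamp, value) in fragment.items():
            if cell_key not in merged or timestamp > merged[cell_key][0]:
                merged[cell_key] = (timestamp, value)
    # Return cells in clustering order, the way a partition is stored.
    return dict(sorted(merged.items()))


old_sstable = {(1, "name"): (10, "Ada"), (1, "city"): (10, "London")}
new_sstable = {(1, "city"): (20, "Paris")}
memtable = {(1, "name"): (15, "Ada L.")}
row = merge_fragments([old_sstable, new_sstable, memtable])
# row == {(1, "city"): (20, "Paris"), (1, "name"): (15, "Ada L.")}
```

Note that the final row is assembled from two different sources: the city from one SSTable, the name from the memtable. No single fragment holds the current row, which is why read cost tracks how many fragments there are.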

Deletes are the sharpest edge. Because SSTables are immutable, a delete can't remove anything — it writes a tombstone, a marker that out-timestamps the data under it. The read path counts every tombstone it scans: in ReadCommand.java, a transformation wraps the row iterator and bumps a counter per tombstone, warns past tombstone_warn_threshold, and at the failure threshold throws TombstoneOverwhelmingException — the database aborts your query to protect itself. I reproduce this on purpose in the experiments below; in 2019 we hit it by accident in production.
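The counting logic is simple enough to imitate. This is a sketch of the idea, not the code in ReadCommand.java: scan_live, TOMBSTONE, and TombstoneOverwhelmingError are hypothetical names for a toy version, and the default thresholds (1,000 to warn, 100,000 to fail) are the stock values as I understand them.

```python
import logging

log = logging.getLogger("toy.read")

TOMBSTONE = object()  # sentinel marking a deleted cell


class TombstoneOverwhelmingError(Exception):
    """Toy analogue of the exception that aborts a tombstone-heavy read."""


def scan_live(cells, warn_threshold=1000, failure_threshold=100_000):
    """Return (live cells, tombstone count), aborting past the failure threshold."""
    live, tombstones = [], 0
    for cell_key, (_timestamp, value) in cells:
        if value is TOMBSTONE:
            tombstones += 1
            if tombstones > failure_threshold:
                raise TombstoneOverwhelmingError(
                    f"scanned more than {failure_threshold} tombstones"
                )
        else:
            live.append((cell_key, value))
    if tombstones > warn_threshold:
        log.warning("Read %d live cells and %d tombstone cells", len(live), tombstones)
    return live, tombstones
```

The uncomfortable part is visible in the loop: every tombstone costs an iteration and returns nothing, so a query's price is set by what was deleted, not by what it gets back.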

Compaction is the rent: background processes continually merge SSTables, discard shadowed versions, and purge expired tombstones. The strategy is pluggable — size-tiered (default, write-friendly), leveled (read-friendly), time-window (time-series with TTLs), and since 5.0 the unified strategy that subsumes the older ones.
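What any of those strategies does to the SSTables it picks can be sketched in one function. Assumptions, stated plainly: compact is a hypothetical name, every SSTable holding the partition takes part in the merge, and timestamps are plain seconds compared against a grace period modeled on the gc_grace_seconds table option. Real compaction tracks deletion time separately and has to check SSTables outside the merge before purging anything.

```python
TOMBSTONE = object()  # sentinel marking a deleted cell


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables into one, keeping only what a read would still need."""
    newest = {}
    for sstable in sstables:
        for cell_key, (timestamp, value) in sstable.items():
            # Shadowed (older) versions are dropped here.
            if cell_key not in newest or timestamp > newest[cell_key][0]:
                newest[cell_key] = (timestamp, value)
    compacted = {}
    for cell_key, (timestamp, value) in sorted(newest.items()):
        # A tombstone is purged only once it has outlived the grace period.
        expired = value is TOMBSTONE and now - timestamp > gc_grace_seconds
        if not expired:
            compacted[cell_key] = (timestamp, value)
    return compacted
```

The grace period exists because a tombstone purged too early can no longer shadow a stale copy on a replica that missed the delete, and the deleted data comes back.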

"Row store": the truths and the lies

The truth: it really is rows

The storage engine's own definition of a row lives in Row.java:

A row mainly contains the following informations: 1) Its Clustering, which holds the values for the clustering columns identifying the row. 2) Its row level informations: the primary key liveness infos and the row deletion. 3) Data for the columns it contains, or in other words, it's a (sorted) collection of ColumnData.

And the atom under the row, in Cell.java:

A cell is our atomic unit for a single value of a single column. A cell always holds at least a timestamp that gives us how the cell reconcile. We then have 3 main types of cells: 1) live regular cells ... 2) expiring cells ... 3) tombstone cells.

So yes: rows, made of cells, grouped into tables. All the columns of a row are stored together, which is precisely what "row store" means as a storage-layout term — as opposed to column-oriented stores (ClickHouse, BigQuery, Parquet files) that store each column contiguously so analytical scans can read only the columns they need. When the README says "row store," it's positioning Cassandra in that dichotomy, and it's telling the truth. If you've seen Cassandra filed under "wide-column store" — a label the project's own documentation never uses — that taxonomy is more confusing than helpful. Note that every cell carries its own timestamp: that's the machinery that lets writes skip reading.

The lie by omission: the load-bearing word is "partitioned"

Here's what "row store" doesn't tell you. In PostgreSQL, a row lives in some 8KB heap page, and indexes tell you which page; the primary key is a constraint, and the planner can find rows by any column you index, because any row is equally reachable. In Cassandra, the primary key is the physical layout:

The partition key (the first part of the primary key) is hashed to decide which nodes own the row. A partition lives entirely on its replicas, stored contiguously.
The clustering columns (the rest) are the sort order inside the partition. A partition is, in effect, a sorted map of rows — and TableMetadata keeps partitionKeyColumns and clusteringColumns as separate lists because they are different physical concepts, not one "primary key."

The consequence: queries that follow the layout (give me a partition, or a clustering-ordered slice of one) are fast at any scale, and queries that don't are somewhere between expensive and refused. WHERE clauses are restricted to what the layout can serve; the database makes you type ALLOW FILTERING to confess you're asking for a full scan. That's the honest reading of "partitioned row store": rows are stored together, partitions decide where, and your queries are constrained to that geometry.
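The "sorted map" picture translates directly into code. A minimal sketch, with ToyPartition as a hypothetical class: a slice by clustering key is a binary search plus a short walk, while a predicate on any other column has no choice but to visit every row, which is the work ALLOW FILTERING makes you admit to.

```python
import bisect


class ToyPartition:
    """Hypothetical partition: rows kept sorted by clustering key."""

    def __init__(self, rows):
        # rows are (clustering_key, payload) pairs
        self._rows = sorted(rows, key=lambda row: row[0])
        self._keys = [clustering for clustering, _ in self._rows]

    def slice(self, start_after, limit):
        """Follows the layout: seek to the key, then read `limit` rows."""
        i = bisect.bisect_right(self._keys, start_after)
        return self._rows[i:i + limit]

    def filter(self, predicate):
        """Fights the layout: every row is examined, like ALLOW FILTERING."""
        return [row for row in self._rows if predicate(row[1])]


partition = ToyPartition((i, {"v": i % 7}) for i in range(1000))
page = partition.slice(start_after=499, limit=3)   # clustering keys 500, 501, 502
zeros = partition.filter(lambda payload: payload["v"] == 0)
```

The slice does about ten comparisons whether the partition holds a thousand rows or a million; the filter does a thousand or a million.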

The lie of familiarity: CQL is a costume

CQL looks so much like SQL that the README itself says it's "a close relative." It isn't a relative; it's a costume. No joins. No cross-partition transactions (until Accord ships — more below). Aggregations exist but execute by scanning. The schema you write in CQL is not a description of your data — it's a query plan you commit to at design time. You design tables from the queries backward, denormalizing one table per access pattern.

The source code carries the archaeology of this. Cassandra's original interface (Thrift, pre-2013) exposed the storage engine almost raw: "column families" you'd insert arbitrary column names into. CQL3 layered the rows-and-tables veneer on top. TableMetadata still contains flags named DENSE, COMPOUND and SUPER whose comments talk about "thrift" column families and explain the flags exist only to detect un-migrated pre-4.0 tables at startup. The relational-looking surface was added later, and the engine remembers.

Keyspace: a database that also decides your blast radius

A keyspace is the top-level namespace, and yes — it contains tables (plus user-defined types and functions). KeyspaceMetadata.java calls itself "an immutable representation of keyspace metadata (name, params, tables, types, and functions)." So far it sounds exactly like a PostgreSQL database or schema.

The difference is in params: replication is configured per keyspace, not per table or per cluster. When you write

CREATE KEYSPACE events
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_mumbai': 3,
    'dc_singapore': 2
  };

you're declaring that every table in this keyspace keeps three copies in one datacenter and two in another. The strategy classes are ordinary pluggable code — SimpleStrategy walks the token ring, NetworkTopologyStrategy spreads replicas across racks and datacenters. A keyspace is a namespace that also decides how many copies of your data exist and where on Earth they live.
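It's worth doing the arithmetic on that declaration, because the replica counts decide what each consistency level costs. A small worked sketch using the numbers from the CREATE KEYSPACE above; quorum is a hypothetical helper, and "local" here means a majority of one datacenter's replicas, which is how I understand LOCAL_QUORUM to be computed.

```python
def quorum(replicas):
    """Majority of a replica count."""
    return replicas // 2 + 1


replication = {"dc_mumbai": 3, "dc_singapore": 2}  # from the CREATE KEYSPACE above

total = sum(replication.values())                        # 5 copies overall
needed_quorum = quorum(total)                            # 3 acks, from any datacenter
needed_local = {dc: quorum(rf) for dc, rf in replication.items()}
# needed_local == {"dc_mumbai": 2, "dc_singapore": 2}
tolerated_local = {dc: rf - needed_local[dc] for dc, rf in replication.items()}
# tolerated_local == {"dc_mumbai": 1, "dc_singapore": 0}
```

The last line is the one to stare at: with two replicas in Singapore, a local quorum there needs both of them, so a single node down in that datacenter fails local-quorum requests. A replication factor of 2 buys a second copy but no majority slack.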

What shape of data actually fits

So is storing user events — what we did in 2019 — Cassandra's best use case? Close, but the precise answer is more useful: Cassandra fits data whose access pattern you can state as a partition key before you write a single row.

The good-fit checklist, every item a direct consequence of the mechanism above:

Write-heavy, append-mostly. Events, messages, sensor readings, feeds. Each write costs one log append plus one in-memory insert regardless of dataset size, and write throughput scales roughly linearly with node count.
Reads addressed by key, returning a partition or a slice of one. "Everything user X did this week," "the latest N messages in channel Y." The layout serves these in one seek per SSTable.
Volume that exceeds one machine, with availability requirements that rule out a single primary. Multi-master means any node down ≠ writes down.
Tolerance for eventual consistency on most operations, with QUORUM where it matters.

The anti-fit list is just as mechanical: ad-hoc analytics (no layout to follow — that's a columnar warehouse's job), strongly-consistent multi-row invariants like ledgers (no cross-partition transactions today), small datasets (you're paying distributed-systems tax for nothing), and anything whose query patterns you can't predict (every new question needs a new table or a full scan). Time-series events are the canonical fit not because of what the data is but because event workloads naturally satisfy all four bullets — they arrive as appends, they're queried by entity and time window, and they're huge.

Verdict on 2019-us: ingesting click events into Cassandra was the textbook use case. Using the same cluster as the warehouse for model-training scans was the part the storage model never promised — which is exactly where it hurt.

Who actually runs it

I expected the repo to contain an adopters list. It doesn't; the case-studies page lives in the separate website repo. But the source tree leaks who cares: there are first-class locality "snitches" for specific clouds (GoogleCloudSnitch, AlibabaCloudSnitch), the docs link Apple's Swift driver, and the big recent features map to corporate engineering investment: Accord transactions (CEP-15) are led by Apple engineers, storage-attached indexes (CEP-7) and unified compaction (CEP-26) came from DataStax, and the Sidecar effort was Netflix-led.

From the public record, with sourcing confidence flagged:

Apple — the official case-studies page cites 75,000+ nodes storing 10+ PB. A larger figure ("300,000 nodes") was reported from an ApacheCon 2022 talk but survives only in tweets and a Hacker News thread, so treat it as hearsay. Apple maintains a public Cassandra page and employs several of the project's most active committers.
Netflix — petabytes across hundreds of clusters; one storage layer alone spans ~2,400 EC2 instances across 12 clusters, and a May 2026 post on their data-movement tooling confirms it's still core infrastructure.
Uber — tens of millions of queries per second over tens of petabytes, clusters ranging from 6 to 450 nodes.
Bloomberg, eBay, Best Buy, Walmart, Target, Monzo, BlackRock — all on the official case-studies page; eBay's entry cites 200+ TB and 400M+ writes a day.

And the honest counterexample: Discord left. In 2023 they migrated trillions of messages from Cassandra to ScyllaDB, going from 177 Cassandra nodes to 72 ScyllaDB nodes, with message-read p99 dropping from 40–125ms to 15ms. Their stated pain — JVM garbage-collection pauses and "hot partition" cascades — is a real critique of this architecture under their workload (Discord still appears on Cassandra's case-studies page, which tells you something about how often adopter lists are pruned). Instagram took a different path back in 2018, swapping the storage engine for RocksDB ("Rocksandra") to cut tail latency 10x; that fork was archived in 2023 and the pluggable-engine work never landed upstream.

The landscape: equivalents and lookalikes

Cassandra is not one of a kind — it sits in a family tree with two roots (Dynamo and Bigtable), and most of its "equivalents" are siblings from one root or the other.

System	Lineage	Consistency model	Operating model	Pick it when
Cassandra	Dynamo + Bigtable	Tunable per query	Open source (Apache), self-run or managed	You want the full model, multi-DC, and no vendor lock
ScyllaDB	Cassandra reimplemented in C++	Same as Cassandra	Source-available since Dec 2024 (last OSS release: 6.2)	Cassandra's model, but GC pauses are killing you
DynamoDB	Dynamo (same paper, same co-author)	Strong or eventual per read	AWS managed only, pay per request/capacity	You're all-in on AWS and want zero ops
HBase	Bigtable on HDFS	Strongly consistent per row, single master	Open source, needs Hadoop stack	Honestly: rarely today — even Pinterest, one of the largest users, deprecated it
Bigtable	The original	Strong within a cluster	GCP managed only	GCP-native, Bigtable-shaped workloads
Cosmos DB Cassandra API / Amazon Keyspaces	CQL wire-compatible, different engines inside	Provider-specific	Azure / AWS managed	You want CQL compatibility without running Cassandra — read the documented feature gaps first

On ScyllaDB performance: their benchmarks claim 2–5x Cassandra's throughput on identical hardware, and Discord's migration numbers lend real-world weight — but note those benchmarks are vendor-run against Cassandra 4.0, before 5.0's trie memtables and unified compaction; I found no independent 5.0-era comparison.

Two systems that look adjacent but are different categories: ClickHouse is a columnar OLAP engine — it's what you use for the ad-hoc analytics Cassandra refuses, not a Cassandra alternative. MongoDB is a document store with a different data model and (since 2018) its own source-available license story.

Cassandra vs. PostgreSQL, architecture to architecture

This comparison is the one I find most clarifying, because PostgreSQL is the mental model most engineers carry in. It's not a benchmark — they're not interchangeable tools — it's two different sets of promises.

graph TB
  subgraph PG["PostgreSQL: one primary, queries first"]
    PQ["SQL: joins, planner, index on anything"] --> PW["WAL append"]
    PW --> PB["8KB heap pages, updated in place via MVCC versions"]
    PB --> PH[("Heap tables + B-tree indexes")]
    PH --> PV["VACUUM reclaims dead row versions"]
  end
  PV ~~~ CQ
  subgraph CS["Cassandra: every node equal, layout first"]
    CQ["CQL: keyed by partition, no joins, schema = query plan"] --> CW["CommitLog append"]
    CW --> CM["Memtable, never updated in place"]
    CM --> CT[("Immutable SSTables (LSM tree)")]
    CT --> CC["Compaction merges versions, purges tombstones"]
  end

Topology. PostgreSQL has one writable primary; replicas are followers, and failover is an event. Cassandra is multi-master — every node accepts writes for the data it owns, and a node dying is Tuesday.
Storage engine. PostgreSQL updates 8KB heap pages in place, keeping old versions for MVCC until VACUUM reclaims them. Cassandra never updates in place; versions accumulate in immutable SSTables until compaction merges them. Vacuum and compaction are the same debt — deferred cleanup of dead versions — paid on different schedules.
Query model. PostgreSQL's planner exists so you don't have to know your queries in advance: add an index, join anything. Cassandra deleted the planner's job: the schema is the plan, fixed at design time.
Consistency. PostgreSQL gives ACID transactions on the primary. Cassandra gives per-query tunable consistency and (today) single-partition atomicity only. The Accord work targets general-purpose transactions in Cassandra 6.0 — in alpha as of April 2026, off by default, opt-in per table — which would retire the loudest item on the anti-fit list, if it ships well.
Scaling. PostgreSQL scales up, then shards painfully. Cassandra scales out as its core design premise — but you pay for that premise on day one, at ten rows, whether you needed it or not.

The punchline: PostgreSQL optimizes for the queries you haven't thought of yet; Cassandra demands you've already thought of all of them.

Some tests at scale (laptop edition)

Claims above, receipts below. Everything here ran against a single-node cassandra:5.0 Docker container (5.0.8) on an M-series MacBook — so the absolute numbers are meaningless for capacity planning, but the relative numbers demonstrate mechanisms. All scripts live in this site's repo under research/code-validation/, and the outputs below are pasted, not typed.

Test 1: writes vs. reads — the asymmetry that wasn't

cassandra-stress ships inside the Cassandra image. One million writes, flush, then one million reads of the same keys:

cassandra-stress write n=1000000:
  Op rate: 77,555 op/s   p50 0.6ms  p99 3.3ms  p99.9 6.4ms

cassandra-stress read n=1000000:
  Op rate: 77,015 op/s   p50 0.6ms  p99 2.7ms  p99.9 6.5ms

I expected the folklore asymmetry — append-only writes fast, merge-everything reads slow — and didn't get it: throughput within 1%, p99 actually better on reads. The folklore isn't wrong, but it's conditional. These are tiny single-row partitions, freshly compacted into few SSTables, fully cached. Cassandra reads aren't slow; reads that fight the storage model are. The next two tests pick that fight deliberately.

Test 2: the partition is the unit of geometry

Same table, three partitions: 100 rows, 10,000 rows, 1,000,000 rows (~100-byte payloads). Two operations on each: scan the whole partition, and read a LIMIT 100 slice starting from the middle by clustering key.

partition       rows    full scan    slice p50    slice max
p_100           100       0.01s       2.74ms       3.63ms
p_10k        10,000       0.04s       2.64ms       3.56ms
p_1m      1,000,000       1.80s       2.15ms       3.27ms

The scan cost grows with partition width: you asked for the whole sorted map, you pay for the whole sorted map. Going from 10,000 to 1,000,000 rows, 100x the data, took 45x the time (0.04s to 1.80s); below that, fixed per-query overhead dominates. But the slice cost is essentially flat from 100 rows to a million: the clustering columns are a real index inside the partition, and a keyed slice is a seek, not a scan. This is the "partitioned row store" claim made empirical, and it's why our 2019 schema bucketed partitions by (user_id, day) instead of letting them grow unboundedly: not because wide partitions break slices, but because everything else (compaction, streaming, repair, heap) handles a partition as a unit.
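Bucketing by (user_id, day) moves a little work into the application, and it's worth seeing how little. This is a sketch of the idea rather than our 2019 code: bucket_key and buckets_for_range are hypothetical helpers, and cutting days at UTC midnight is my assumption for the example.

```python
from datetime import datetime, timedelta, timezone


def bucket_key(user_id, event_time):
    """Partition key for one event: the user plus the event's UTC day."""
    day = event_time.astimezone(timezone.utc).date()
    return (user_id, day.isoformat())


def buckets_for_range(user_id, start, end):
    """Every partition key a read must visit to cover [start, end]."""
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    days = (last - first).days
    return [(user_id, (first + timedelta(days=n)).isoformat()) for n in range(days + 1)]


when = datetime(2019, 3, 1, 23, 30, tzinfo=timezone.utc)
key = bucket_key("user-42", when)                        # ("user-42", "2019-03-01")
week = buckets_for_range("user-42", when, when + timedelta(days=6))
# week holds 7 keys, "2019-03-01" through "2019-03-07"
```

The trade is explicit: "everything user X did this week" becomes seven partition reads instead of one, and in exchange no partition grows past a single user's single day.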

Test 3: tombstones, reproduced on purpose

Two partitions with the same 10,000 live rows. The clean partition was written once. The deleted partition got 60,000 rows written and 50,000 of them deleted, then flushed — so the live data is identical, only the history differs:

partition   live rows     read p50     read max
clean          10,000       20.0ms       21.9ms
deleted        10,000       35.4ms      112.1ms

server log:
WARN ReadCommand.java:644 - Read 0 live rows and 50000 tombstone cells
  for query SELECT * FROM blog_experiments.tombstone_test
  WHERE pk = 'deleted' AND ck > 9999 LIMIT 5000
  (see tombstone_warn_threshold)

Same query, same result set, yet about 1.8x slower at the median (35.4ms vs. 20.0ms) and about 5x at the worst case (112.1ms vs. 21.9ms), because ReadCommand had to step over and count 50,000 tombstones to find the 10,000 live rows. The warning the server logged is the exact code path quoted earlier in this post, down to the line number. Notice what it says: the final page of the query read zero live rows and 50,000 tombstone cells, pure overhead. Past the failure threshold (100,000 by default) it stops warning and starts throwing TombstoneOverwhelmingException. In 2019 this mechanism found us before we'd read the source; now you can watch it coming.

The README sentence, revisited

Apache Cassandra is a highly-scalable partitioned row store.

Every word is load-bearing. Row store: all columns of a row stored together, cells with timestamps as the atoms — true, and honestly distinct from columnar engines. Partitioned: the primary key is physical placement, your queries are constrained to its geometry, and that constraint is the price of the next word. Highly-scalable: linear write scaling and multi-master availability, which Apple and Netflix cash in at tens of thousands of nodes — and which costs you a planner, joins, and (until Accord proves itself in 6.0) transactions.

The lies aren't in the sentence; they're in the familiarity it invites. CQL looks like SQL and isn't. Tables look relational and are query plans. A keyspace looks like a database and is also a replication contract. Read the source — it's unusually well-commented for infrastructure of this age — and the database stops being surprising.

Validation note: every CQL statement and experiment script in this post ran against a real Cassandra 5.0.8 in Docker before publication; the stress and experiment outputs are pasted verbatim from those runs. Source quotes were copied from a local clone of apache/cassandra at trunk (7.0-dev, June 2026) with file paths linked inline. Scripts live in this site's repo under research/code-validation/.