Last year, I was interviewing a senior backend engineer. Good candidate — solid distributed systems background, strong on Kubernetes, knew their way around Go. I asked my favourite database question:
“Your PostgreSQL instance on RDS is hitting a write throughput ceiling at peak traffic. Walk me through how you’d approach scaling it horizontally.”
They gave the textbook answer. Read replicas for read-heavy queries. PgBouncer for connection pooling. Maybe Aurora if the budget allows. Citus for sharding. Bigger instance class as a stopgap.
All reasonable. All missing the point.
Because before we talk about scaling infrastructure, there’s a question nobody asks: is your primary key strategy creating an artificial write bottleneck that no amount of hardware will fix?
The Problem Nobody Talks About #
Every tutorial, every ORM, and every “getting started with PostgreSQL” guide tells you to do this:
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    total NUMERIC(10,2) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);
Looks innocent. BIGSERIAL gives you a nice, clean, auto-incrementing integer. It’s small (8 bytes), fast to compare, natural to paginate with. Every INSERT gets the next number in the sequence. Simple.
But inserts overwhelmingly target the rightmost leaf page of the B-tree index — and under concurrency, that concentration becomes a problem.
A B-tree index in PostgreSQL stores keys in sorted order across leaf pages. When your keys are monotonically increasing — 1, 2, 3, 4, … — every new key is larger than every existing key. Every new key targets the rightmost leaf page of the index. Page splits, fillfactor settings, and vacuum can temporarily distribute writes across nearby pages, but the statistical pattern is relentless: the rightmost leaf is where the action is.
PostgreSQL is actually smart about this. The B-tree code in nbtinsert.c has a “fastpath” optimisation: it caches the block number of the rightmost leaf page so it doesn’t have to walk the tree from the root for every insert. Efficient, right?
Under concurrency, that optimisation is the problem.
When 50, 100, 200 concurrent connections are all inserting rows, they’re all trying to modify the same leaf page. PostgreSQL uses a content lock on each buffer page — and to insert a tuple, you need an exclusive lock. So those 200 backends line up, one by one, to write to that single page. The critical section of index insertion becomes serialised — heap inserts, WAL logging, and visibility checks still run concurrently, but the index write itself is the chokepoint. Throughput degrades and scales poorly under this pattern.
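The shape of that problem can be felt outside PostgreSQL entirely. Here is a toy Go model (an analogy, not database code): each simulated "page" has its own lock, and every insert must hold the lock of the page it lands on. One page mimics sequential keys, where every backend targets the rightmost leaf; many pages mimic keys that spread across the index.

```go
package main

import (
	"fmt"
	"sync"
)

// run performs workers*opsPerWorker "inserts", each guarded by the lock of
// its target page, and returns the total count. pages == 1 funnels every
// worker through a single lock, the way sequential keys funnel every
// backend onto the rightmost leaf.
func run(workers, opsPerWorker, pages int) int {
	locks := make([]sync.Mutex, pages)
	counts := make([]int, pages)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < opsPerWorker; i++ {
				p := (w*opsPerWorker + i) % pages // always page 0 when pages == 1
				locks[p].Lock()
				counts[p]++ // the serialised critical section
				locks[p].Unlock()
			}
		}(w)
	}
	wg.Wait()
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	// Same total work either way; only the lock distribution differs.
	fmt.Println(run(8, 100000, 1))  // all 8 workers queue on one lock
	fmt.Println(run(8, 100000, 64)) // contention spread over 64 locks
}
```

Time the two calls on a multi-core machine and the single-lock case falls behind: the work is identical, but the queueing is not.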
You can see it happening on your database right now.
Diagnosing the Bottleneck #
Before you change anything, measure.
Step 1: Check for buffer content lock contention #
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY 3 DESC;
If LWLock / BufferContent sits at the top during peak insert traffic, you’ve got buffer page contention. That’s the symptom.
Step 2: Find the hot page #
This requires the pg_buffercache extension (CREATE EXTENSION IF NOT EXISTS pg_buffercache;):
SELECT c.relname, b.relblocknumber, b.pinning_backends, b.isdirty
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.pinning_backends > 1
ORDER BY b.pinning_backends DESC
LIMIT 10;
If a single block from your primary key index has a disproportionately high pinning_backends count, you’ve found your bottleneck. That’s the rightmost leaf page. Every INSERT is fighting for it.
Step 3: Compare index I/O patterns #
SELECT indexrelname, idx_blks_read, idx_blks_hit,
round(idx_blks_hit::numeric / nullif(idx_blks_hit + idx_blks_read, 0), 3) as hit_ratio
FROM pg_statio_user_indexes
ORDER BY idx_blks_read DESC
LIMIT 10;
For sequential-key indexes under heavy inserts, you’ll see a deceptive pattern: the buffer pool hit ratio looks decent (the hot page is cached), but lock contention is what’s killing you — and it won’t show up in I/O stats.
If you don’t see these symptoms, stop reading. Your BIGSERIAL keys are fine. Go ship features. This post is for teams doing thousands of concurrent inserts per second and wondering why throwing more CPU at the problem doesn’t help.
A key nuance: This contention appears when the bottleneck shifts from I/O to in-memory lock contention. You need high concurrent inserts, a PK index that fits in memory (so I/O isn’t the dominant wait), and enough inserts per millisecond to create meaningful lock queuing on the rightmost leaf. If your workload is I/O-bound, the rightmost-leaf pattern is the least of your problems.
What’s Actually Happening Inside the B-tree #
PostgreSQL’s B-tree implementation lives in src/backend/access/nbtree/nbtinsert.c. When you insert a new row, here’s the simplified flow for the index:
1. Find the target leaf page. For monotonically increasing keys, the fastpath kicks in — it remembers the rightmost leaf’s block number and goes straight there.
2. Acquire an exclusive content lock on that buffer page. This is the serialisation point. Only one backend can hold this lock at a time.
3. Insert the index tuple into the page.
4. If the page is full, split it. Allocate a new page, move roughly half the tuples over, update the parent’s downlink, write WAL records for all of it, hold exclusive locks on both pages and the parent during the operation. (Yes, all of that for one row.)
5. Release locks and return.
With monotonic keys, step 4 happens predictably: the rightmost leaf fills up, splits, and the new rightmost leaf becomes the next hot page. Rinse and repeat. Buffer mapping lock partitioning (expanded from 16 to 128 partitions in PostgreSQL 9.5) helps when different backends hit different pages — but when they’re all hitting the same page, no amount of lock partitioning matters.
This contention comes down to concurrency on a single page — not I/O, not CPU, not memory. That has consequences for every scaling strategy you might reach for:
- Scaling up gives you more CPU, but the lock still serialises writes to one page. You raise the ceiling; the bottleneck shape doesn’t change.
- Read replicas are irrelevant — they don’t touch write throughput.
- Sharding by a sequential ID as the shard key is the cruelest joke of all — each shard gets its own B-tree with its own rightmost leaf, and with range-based sharding all new writes still land on the last shard. (If you’re sharding by customer_id, tenant_id, or a hash — different story. The PK’s rightmost-leaf contention still exists per-shard, but at least your writes are distributed across shards.)
A Note on Sequence Contention #
You might be thinking: “Wait, isn’t the sequence itself a bottleneck too?” Historically, yes — nextval() contention on shared sequences was a real issue. Modern PostgreSQL largely eliminates this through per-backend sequence caching (the CACHE clause on CREATE SEQUENCE, default 1 but commonly raised). Today, the index is the dominant bottleneck, not the sequence. But if your wait-event sampling shows backends queuing on sequence operations themselves, check your sequence cache settings before blaming the B-tree.
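A minimal sketch of that tuning, assuming the BIGSERIAL column from earlier (which creates a sequence named orders_id_seq):

```sql
-- Let each backend grab a block of 64 values at a time instead of
-- touching the shared sequence on every row. Trade-off: cached values
-- are discarded (gaps) when a connection exits.
ALTER SEQUENCE orders_id_seq CACHE 64;

-- Verify the setting:
SELECT cache_size FROM pg_sequences WHERE sequencename = 'orders_id_seq';
```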
The Case FOR Sequential IDs #
Now here’s the part the internet skips.
Sequential IDs are actively superior for most workloads:
Cache locality is excellent. The rightmost leaf pattern means your B-tree’s hot working set is tiny and predictable — just the current rightmost leaf and its parent path. A handful of pages. Your buffer pool barely breaks a sweat. Compare that to random-key inserts, where the entire index potentially needs to be cached for decent write performance.
Append-only patterns love them. Time-series, event logging, any append-heavy workload where reads target recent data — sequential IDs give you perfect physical ordering. Recent rows clustered on disk. Recent index entries clustered in the B-tree. Your sequential scans and range queries benefit enormously.
JOINs are faster. The primary reason isn’t CPU comparison cost (though 8-byte integer comparisons are cheaper than 16-byte UUID comparisons) — it’s cache density. Smaller keys mean more entries per index page, fewer pages to traverse, better buffer pool hit rates. When you have foreign keys pointing to these IDs, the size difference compounds across every index, every join, every lookup.
Polling-based CDC is trivial. WHERE id > @last_processed_id is the simplest, most efficient polling-based change-data-capture pattern. If you’re using log-based CDC (Debezium, logical replication), the PK type matters less — but for the polling pattern, sequential IDs are unbeatable.
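The polling pattern is simple enough to sketch in a few lines of Go. This is a stand-in version where a slice sorted by ID plays the role of the table; in production the loop body would be `SELECT ... WHERE id > $1 ORDER BY id LIMIT $2`:

```go
package main

import "fmt"

// Order stands in for a row of the orders table.
type Order struct {
	ID    int64
	Total float64
}

// pollSince fetches rows whose ID is greater than the last checkpoint,
// in ID order, up to a batch size, and returns the new checkpoint.
// Sequential IDs make this trivially correct: resuming from the highest
// ID seen never skips or re-reads a committed row in this model.
func pollSince(table []Order, lastID int64, limit int) (batch []Order, newLastID int64) {
	newLastID = lastID
	for _, o := range table {
		if o.ID > lastID {
			batch = append(batch, o)
			newLastID = o.ID
			if len(batch) == limit {
				break
			}
		}
	}
	return batch, newLastID
}

func main() {
	table := []Order{{1, 9.99}, {2, 5.00}, {3, 12.50}, {4, 7.25}}
	batch, last := pollSince(table, 1, 2) // resume after ID 1, batch of 2
	fmt.Println(len(batch), last)         // 2 3
}
```

(One real-world caveat the in-memory model hides: under concurrent transactions, IDs can become visible out of order, so production pollers usually add a small lag window or track transaction visibility.)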
The vast majority of PostgreSQL deployments will never hit rightmost-leaf contention. If you’re doing fewer than a few hundred concurrent inserts per second — and that’s a lot more than most people think — BIGSERIAL is the right choice. Simpler, smaller, faster for everything except the specific pathology described in this post.
When You DO Have the Problem: Your Options #
You’ve run the diagnostic queries. You’re seeing BufferContent lock waits concentrated on your primary key index. You’ve confirmed it’s the rightmost leaf. Now what?
Option 1: FILLFACTOR Tuning (Quick Mitigation) #
Before reaching for bigger changes, try the simplest knob:
ALTER INDEX orders_pkey SET (fillfactor = 70);
REINDEX INDEX orders_pkey;
Lowering fillfactor (default 90 for B-tree indexes) leaves free space on each leaf page, which delays page splits and reduces the frequency of the most expensive contention events. This doesn’t solve the fundamental problem — inserts still concentrate on the rightmost leaf — but it can meaningfully reduce split-related lock contention and buy you headroom. Think of it as ibuprofen, not surgery.
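To check what an index is currently set to (reloptions is NULL when everything is at defaults):

```sql
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'orders_pkey';
```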
Option 2: Time-Range Partitioning (Least Disruptive Structural Fix) #
If you can’t change your primary key strategy — and in a running production system, you often can’t — partitioning by time range is the pragmatic answer:
CREATE TABLE orders (
    id BIGSERIAL,
    customer_id BIGINT NOT NULL,
    total NUMERIC(10,2) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (id, created_at)  -- a PK on a partitioned table must include the partition key
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2026_02 PARTITION OF orders
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE orders_2026_03 PARTITION OF orders
FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
Each partition gets its own index. The rightmost-leaf problem still exists within each partition, but you’re distributing the hot pages across multiple smaller indexes. If you’re inserting data that spans multiple partitions (e.g., backfills), the contention distributes naturally.
The trade-off is operational complexity — partition management, query planning overhead, and the joy of explaining to your team why SELECT * FROM orders WHERE id = 12345 now needs a partition key in the WHERE clause for efficient lookup.
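Concretely, here is the difference partition pruning makes (hypothetical date bounds):

```sql
-- Without the partition key: scans every partition's index.
SELECT * FROM orders WHERE id = 12345;

-- With it: the planner prunes to orders_2026_02 only.
SELECT * FROM orders
WHERE id = 12345
  AND created_at >= '2026-02-01' AND created_at < '2026-03-01';
```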
Option 3: UUID v7 for New Tables (Best Write Distribution) #
If you’re starting fresh — new service, new table, greenfield — many teams today are converging on UUID v7 (RFC 9562), and for good reason.
UUID v7 embeds a millisecond-precision timestamp in the high bits, giving you time-ordering (so range scans and sorting still work), and fills the remaining 74 bits with random or sub-millisecond data (12-bit rand_a + 62-bit rand_b per RFC 9562). That randomness is what distributes your inserts.
PostgreSQL does not have a built-in UUID v7 generator as of PG 17. You’ll need the pg_uuidv7 extension or application-side generation:
-- Requires: CREATE EXTENSION pg_uuidv7;
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
    customer_id BIGINT NOT NULL,
    total NUMERIC(10,2) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);
Or generate them in Go:
import "github.com/google/uuid"

// NewOrderID returns a time-ordered UUID v7 (requires google/uuid v1.6.0+).
// NewV7 only fails if the system's entropy source does, so panicking is a
// deliberate choice here; propagate the error instead if you prefer.
func NewOrderID() uuid.UUID {
	id, err := uuid.NewV7()
	if err != nil {
		panic(err)
	}
	return id
}
How the distribution works: Within the same millisecond, UUID v7 values share the same timestamp prefix but differ in the 74 random/sub-ms bits that follow. Since these bits come after the timestamp in sort order, values generated in the same millisecond spread across that millisecond’s key space. For a large index where each millisecond’s range spans many leaf pages, concurrent inserts land on different pages. Contention is dramatically reduced and spreads across a range of pages rather than hammering a single one.
The honest caveat: This distribution benefit scales with index size. For a small, new table where the entire index fits on a handful of pages, UUID v7 inserts hit the same pages regardless — there simply aren’t enough pages to spread across. The benefit kicks in as the table grows. At the scale where rightmost-leaf contention matters in the first place (big tables, high insert rates), the distribution is effective. But note that inserts still trend rightward over time — the hotspot becomes a region rather than a single page.
The trade-offs are real:
- 16 bytes vs 8 bytes. Your primary key index is roughly twice the size. Every foreign key index that references this table doubles too. This compounds.
- Slower comparisons. UUID comparison is more expensive than integer comparison. JOINs feel it.
- No TOAST concern — 16 bytes doesn’t trigger TOAST. But the size impact across indexes and FKs is non-trivial.
Option 4: Don’t Use UUID v4 as a Primary Key #
I’m including this as an option so I can tell you not to do it.
UUID v4 is fully random — no time component. It distributes writes beautifully across the B-tree. But it destroys locality and range-scan performance, and dramatically increases cache pressure — point lookups still work, but any access pattern that benefits from ordering or spatial locality suffers badly.
If you need unpredictable IDs for external exposure, generate a UUID v4 as a secondary column and keep a sane primary key. Don’t make your B-tree suffer for your API design.
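A minimal sketch of that split, using the built-in gen_random_uuid() (PostgreSQL 13+):

```sql
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,                                  -- internal key: small, ordered
    public_id UUID NOT NULL DEFAULT gen_random_uuid() UNIQUE,  -- exposed in APIs: unpredictable
    customer_id BIGINT NOT NULL,
    total NUMERIC(10,2) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);
```

The UNIQUE index on public_id is itself a random-key index, but it’s a secondary one — it isn’t the key your foreign keys reference or your JOINs traverse.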
The Comparison #
| Key Type | Size | Write Distribution | Range Scans | Cache Locality | Shard-Friendly |
|---|---|---|---|---|---|
| BIGSERIAL | 8B | ❌ Rightmost only | ✅ Natural | ✅ Tiny hot working set | ❌ Hot shard |
| UUID v4 | 16B | ✅ Fully random | ❌ No ordering | ❌ High cache pressure | ✅ Even distribution |
| UUID v7 | 16B | ⚠️ Distributed (scales with index size) | ✅ Time-ordered | ⚠️ Good for recent data | ✅ Even distribution |
| ULID | 16B | ⚠️ Distributed (scales with index size) | ✅ Time-ordered | ⚠️ Good for recent data | ✅ Even distribution |
| Snowflake | 8B | ⚠️ Reduced burst contention across workers, monotonic per-worker | ✅ Time-ordered | ✅ Sequential per-worker | ⚠️ Depends on worker count |
Snowflake IDs deserve a note: if you’re running multiple application instances (which, on Kubernetes, you almost certainly are), each instance generates IDs in a different range based on its worker/machine ID bits. This reduces burst contention since concurrent inserts from different workers target different regions of the tree. But Snowflake IDs are still globally increasing — writes still trend toward the right edge of the index. Snowflake reduces burst contention, not the structural rightward pattern. Whether the per-worker distribution is sufficient depends on your worker count and insert rate per worker.
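A sketch of the classic 64-bit split (41-bit millisecond timestamp relative to a custom epoch, 10-bit worker ID, 12-bit per-millisecond sequence — exact widths vary by implementation):

```go
package main

import "fmt"

// snowflakeID packs a millisecond timestamp, a worker ID, and a
// per-millisecond sequence into one int64. Sketch only: a real generator
// also tracks the sequence counter and handles clock skew.
func snowflakeID(msSinceEpoch int64, worker, seq uint16) int64 {
	return (msSinceEpoch << 22) | (int64(worker&0x3FF) << 12) | int64(seq&0xFFF)
}

func main() {
	// Two workers in the same millisecond land 2^12 IDs apart, targeting
	// different index regions — but any later millisecond sorts above both,
	// which is why the overall pattern still trends rightward.
	a := snowflakeID(1000, 1, 0)
	b := snowflakeID(1000, 2, 0)
	c := snowflakeID(1001, 1, 0)
	fmt.Println(b-a, c > a && c > b) // 4096 true
}
```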
The Interview Answer #
Back to that interview question. Here’s what I was looking for:
“Before we talk about scaling the infrastructure, I’d want to verify we’re not fighting a data structure bottleneck. I’d check pg_stat_activity for LWLock:BufferContent concentration during peak writes, then look at pg_buffercache to see if a single index page is disproportionately hot. If we’re seeing rightmost-leaf contention on a sequential primary key under high concurrent inserts, no amount of horizontal scaling fixes that — we need to address the key distribution at the application layer.”
Not because read replicas and connection pooling don’t matter — they do. But because you need to diagnose before you prescribe. We’re trained as engineers to think about infrastructure scaling because it’s visible, it’s what vendors sell, and it’s what you can buy with a credit card. Data structure pathologies require deeper analysis. They don’t show up in your cloud provider’s dashboard.
The engineer who can distinguish between “we need more hardware” and “we need a different data model” is the one I want on my team.
The Pragmatic Takeaway #
Here’s my decision tree:
1. You’re doing < 500 concurrent inserts/second (most of you): Use BIGSERIAL. It’s simpler, smaller, faster for reads and JOINs. Don’t optimise for a problem you don’t have.
2. You’re seeing LWLock:BufferContent contention on your PK index: Run the diagnostic queries above. Confirm it’s the rightmost leaf. Then:
   - Quick win: Lower fillfactor on the index to buy headroom.
   - Existing system, can’t change PK: Partition by time range. Distributes the problem without changing your key strategy.
   - New tables or services: UUID v7. Time-ordered enough for range scans, random enough to spread writes.
3. Someone suggests UUID v4 as the primary key: Ask them what their range-scan strategy is. Watch them blink.
Choose your primary key based on your write pattern, not your ORM’s default. BIGSERIAL is ActiveRecord’s default, Django’s default, every “Getting Started” guide’s default. For 95% of applications, it’s the right default. But if you’re in the 5% pushing real concurrent write throughput, understand why it can become the bottleneck before you reach for the AWS pricing calculator.
Measure before you migrate. Diagnose before you prescribe. Stop blaming PostgreSQL for a contention pattern your application created.