<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kafka on EXPLAIN ANALYZE</title><link>https://explainanalyze.com/tags/kafka/</link><description>Recent content in Kafka on EXPLAIN ANALYZE</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Sat, 20 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://explainanalyze.com/tags/kafka/index.xml" rel="self" type="application/rss+xml"/><item><title>Scale the Pattern, Not the Instance</title><link>https://explainanalyze.com/p/scale-the-pattern-not-the-instance/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://explainanalyze.com/p/scale-the-pattern-not-the-instance/</guid><description>&lt;div class="tldr-box"&gt;
 &lt;strong&gt;TL;DR&lt;/strong&gt;
 &lt;div&gt;Most production fixes patch the occurrence in front of you (reboot this box, clean these rows, bump this instance), which clears today and leaves the class of problem intact to recur. The fix that holds lives one level up, where one change covers every future occurrence. The skill is telling a recurring class from a true one-off before you&amp;rsquo;ve been burned three times.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;A development cluster ran on shared-tenancy EC2, the cheap tier with no dedicated host. Prod deploys had to clear dev first, so when a box wedged (SSH hung, the health check flatlined, CPU steal time climbed on whatever still answered) the deploy queue stalled behind it. The team read it as capacity and moved up an instance size. It wedged again, so they went up another, by then at triple the bill and still wedging. Capacity was never it. A noisy neighbor on the oversubscribed host was starving the other tenants, and a bigger seat at the same crowded table changes nothing. The unblock path made it worse: EU developers couldn&amp;rsquo;t reboot the box themselves (correctly; they shouldn&amp;rsquo;t hold that access), so a US infra engineer got paged in the middle of the night to stop/start a single dev box. What finally held wasn&amp;rsquo;t a bigger box. It was a pipeline any developer could run to relaunch a wedged instance on a fresh physical host.&lt;/p&gt;
&lt;p&gt;One box wedged; the pipeline covered every box that ever would. That distance, between fixing the occurrence in front of you and retiring the class it belongs to, is the whole article. The dev cluster is the easy version, because the fix was operational and nobody&amp;rsquo;s data path ran through it. The expensive version is when the strain sits in your data layer and the instance control is right there in the console. The reflex is the same either way: fix the thing that&amp;rsquo;s strained, which clears the symptom by end of afternoon and is exactly why it keeps beating the slower question of what class of problem just announced itself.&lt;/p&gt;
&lt;h2 id="just-scale-up-just-scale-out"&gt;Just scale up, just scale out
&lt;/h2&gt;&lt;p&gt;A Kafka consumer group on the orders topic starts lagging, a few hundred thousand messages behind by mid-morning and never catching up overnight. The handler does the obvious thing per message: deserialize, look up the customer, write a row, commit the offset. The team bumps the workers a size and goes from six partitions to twelve. Lag clears in an hour. Three weeks later it&amp;rsquo;s back; twelve isn&amp;rsquo;t enough, so they go to twenty-four and up another size. Relief, a quiet stretch, recurrence one size up. The cycle has a rhythm once you&amp;rsquo;ve seen it.&lt;/p&gt;
&lt;p&gt;By the second recurrence the fix is obvious to anyone who&amp;rsquo;s met it before: stop processing one message at a time, batch the poll and the downstream write, commit once for the batch. The obviousness is the point, because the team didn&amp;rsquo;t reach for it. They reached for partitions and worker size, twice, because those are the controls the dashboard puts a slider on, and the handler has no slider. The occurrence got fixed each time. The class, a handler that pays a fixed round trip per message however many workers run in parallel, stayed exactly as expensive as it started.&lt;/p&gt;
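&lt;p&gt;A minimal sketch of that difference, in Python and with no Kafka client involved. Everything named here (&lt;code&gt;FakeStore&lt;/code&gt;, &lt;code&gt;chunks&lt;/code&gt;, &lt;code&gt;handle_per_message&lt;/code&gt;, &lt;code&gt;handle_batched&lt;/code&gt;, the batch size of 500) is a hypothetical stand-in chosen for illustration, and the store only counts round trips. It assumes the downstream write can take many rows in one call.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class FakeStore:
    """Hypothetical downstream store that only counts round trips."""
    round_trips: int = 0
    rows: List[dict] = field(default_factory=list)

    def write_many(self, batch: List[dict]) -> None:
        # One call models one network round trip, however many rows it carries.
        self.round_trips += 1
        self.rows.extend(batch)


def chunks(messages: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group messages into lists of at most `size`, like one poll's worth."""
    batch: List[dict] = []
    for message in messages:
        batch.append(message)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def handle_per_message(messages: Iterable[dict], store: FakeStore) -> int:
    """Write, then commit, once per message. Returns the number of commits."""
    commits = 0
    for message in messages:
        store.write_many([message])
        commits += 1  # stands in for committing the offset
    return commits


def handle_batched(messages: Iterable[dict], store: FakeStore, batch_size: int = 500) -> int:
    """Write, then commit, once per batch. Returns the number of commits."""
    commits = 0
    for batch in chunks(messages, batch_size):
        store.write_many(batch)
        commits += 1  # committed only after the whole batch was written
    return commits


orders = [{"order_id": i, "customer_id": i % 100} for i in range(10_000)]

slow_store, fast_store = FakeStore(), FakeStore()
slow_commits = handle_per_message(orders, slow_store)
fast_commits = handle_batched(orders, fast_store, batch_size=500)

print(slow_store.round_trips, slow_commits)  # 10000 10000
print(fast_store.round_trips, fast_commits)  # 20 20
```

&lt;p&gt;Both handlers write the same rows. Only the number of round trips changes, and that number is the one more partitions cannot reduce.&lt;/p&gt;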
&lt;p&gt;This is the same argument &lt;a class="link" href="https://explainanalyze.com/p/its-almost-always-the-queries-part-i-why-metal-doesnt-help/" &gt;Part I of the query series&lt;/a&gt; made under the heading &amp;ldquo;Why Metal Doesn&amp;rsquo;t Help,&amp;rdquo; where adding a third read replica made replication lag worse rather than better. The hardware treated an occurrence, this system is slow today, and never touched the class, the write pattern that produced the lag. That post stayed inside Postgres. The shape isn&amp;rsquo;t a Postgres thing, and the rest of this is where it shows up.&lt;/p&gt;
&lt;h2 id="the-same-mistake-four-domains"&gt;The same mistake, four domains
&lt;/h2&gt;&lt;p&gt;The reboot pipeline and the Kafka lag are the same move pitched at different altitudes. Each problem has a fix that clears the occurrence in front of you and a fix that retires the whole category it belongs to. The first one you reach for by reflex, because the occurrence is what&amp;rsquo;s paging you. The second covers every occurrence that hasn&amp;rsquo;t happened yet, including the ones produced by a code path nobody on the current team remembers writing.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Domain&lt;/th&gt;
 &lt;th&gt;Fixing the instance&lt;/th&gt;
 &lt;th&gt;Fixing the class&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Dev fleet&lt;/td&gt;
 &lt;td&gt;stop/start the box that wedged&lt;/td&gt;
 &lt;td&gt;a pipeline any dev runs against any box&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Kafka lag&lt;/td&gt;
 &lt;td&gt;add partitions and bigger workers to drain today&amp;rsquo;s backlog&lt;/td&gt;
 &lt;td&gt;a handler that batches every message it will see&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Orphaned data&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;UPDATE&lt;/code&gt; the rows that broke the report&lt;/td&gt;
 &lt;td&gt;a foreign key that makes the orphan impossible to write&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Slow reads&lt;/td&gt;
 &lt;td&gt;tune the query a reviewer caught&lt;/td&gt;
 &lt;td&gt;a queries-per-request threshold that surfaces every N+1, not just the one a reviewer noticed&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The first two rows you&amp;rsquo;ve already seen in this post. The other two are where the idea earns its rent on a database team.&lt;/p&gt;
&lt;p&gt;Orphaned rows are the cleanest example, because the instance fix is so easy to reach for. A child row points at a parent that&amp;rsquo;s gone, a report breaks, and you write the &lt;code&gt;UPDATE&lt;/code&gt; that reconciles it. It works. You&amp;rsquo;ll write a near-identical one next quarter when a different code path produces the same shape, because nothing stopped the shape from being written in the first place. The class fix is the foreign key (or the &lt;code&gt;CHECK&lt;/code&gt;, or the &lt;code&gt;NOT NULL&lt;/code&gt;) that makes the orphan unrepresentable. It covers every write that will ever touch the table: the migration script, the one-off backfill, the new service that integrated last week and was never code-reviewed against this invariant. The cost is real and worth stating, since it&amp;rsquo;s the reason people leave the constraint off. It validates on every write, and it will reject a messy bulk load you&amp;rsquo;d otherwise have let slide. That friction is the feature.&lt;/p&gt;
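&lt;p&gt;To make the constraint concrete, here is a small sketch using Python&amp;rsquo;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for a production database. The &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables and the &lt;code&gt;try_write&lt;/code&gt; helper are hypothetical, and the &lt;code&gt;PRAGMA&lt;/code&gt; line is an SQLite detail: SQLite enforces foreign keys only when asked, where Postgres enforces a declared one by default.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enforcement is opt-in
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id))"
)
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")


def try_write(sql: str) -> str:
    """Run one statement and report whether the schema accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# A child row pointing at a parent that does not exist.
orphan_insert = try_write("INSERT INTO orders (id, customer_id) VALUES (11, 999)")
# Deleting a parent that still has children would create the same orphan.
parent_delete = try_write("DELETE FROM customers WHERE id = 1")

print(orphan_insert, parent_delete)  # rejected rejected
```

&lt;p&gt;Neither write knows the rule exists. The schema rejects both anyway, which is the sense in which the constraint covers code paths nobody reviewed.&lt;/p&gt;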
&lt;p&gt;The slow-read row is the same trade one layer up. You can fix the N+1 a reviewer happened to catch, or you can turn the catching into infrastructure: log queries per request, then fail the test run or alert when a single request crosses a set threshold, so the next N+1 announces itself in CI instead of waiting for a customer to sit through a slow page. The pull request fixes one endpoint. The threshold catches the shape wherever it surfaces next, including in code no reviewer reads closely.&lt;/p&gt;
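&lt;p&gt;One way such a threshold could look in a test suite, sketched in Python. &lt;code&gt;QueryCounter&lt;/code&gt;, &lt;code&gt;query_budget&lt;/code&gt;, &lt;code&gt;QueryBudgetExceeded&lt;/code&gt; and the limit of 10 are all hypothetical. A real version would hook the database driver or ORM instead of a counter that merely increments.&lt;/p&gt;

```python
from contextlib import contextmanager
from typing import Iterable


class QueryBudgetExceeded(AssertionError):
    """Raised when one request issues more queries than its budget allows."""


class QueryCounter:
    """Hypothetical hook; a real one would wrap the database driver."""

    def __init__(self) -> None:
        self.count = 0

    def execute(self, sql: str, params: tuple = ()) -> None:
        self.count += 1  # a real hook would also run the query


@contextmanager
def query_budget(counter: QueryCounter, limit: int, label: str):
    """Fail loudly if the wrapped block issues more than `limit` queries."""
    start = counter.count
    yield
    used = counter.count - start
    if used > limit:
        raise QueryBudgetExceeded(f"{label}: {used} queries, budget {limit}")


def list_orders_n_plus_one(db: QueryCounter, customer_ids: Iterable[int]) -> None:
    db.execute("SELECT id, customer_id FROM orders")
    for customer_id in customer_ids:  # one extra query per row: the N+1 shape
        db.execute("SELECT name FROM customers WHERE id = ?", (customer_id,))


def list_orders_set_based(db: QueryCounter, customer_ids: Iterable[int]) -> None:
    db.execute("SELECT id, customer_id FROM orders")
    db.execute("SELECT id, name FROM customers WHERE id IN (?)", (tuple(customer_ids),))


def check(handler, label: str) -> str:
    """Run one handler under a budget of 10 queries and report the outcome."""
    db = QueryCounter()
    try:
        with query_budget(db, limit=10, label=label):
            handler(db, range(50))
        return "within budget"
    except QueryBudgetExceeded as exc:
        return str(exc)


n_plus_one_result = check(list_orders_n_plus_one, "GET /orders")
set_based_result = check(list_orders_set_based, "GET /orders")

print(n_plus_one_result)  # GET /orders: 51 queries, budget 10
print(set_based_result)   # within budget
```

&lt;p&gt;The check knows nothing about orders or customers. Any endpoint wrapped in the same budget fails the same way once it grows a per-row query.&lt;/p&gt;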
&lt;p&gt;Not every invariant fits in a constraint or a metric. Some live only in a reviewer&amp;rsquo;s head: this table is append-only, that column is denormalized on purpose and both copies have to move together, this service owns the writes and everyone else reads. The class fix there is weaker but still real. Write the invariant down where the next change will run into it (a schema-conventions doc, an ADR, a comment on the migration that introduced the rule) and make checking it a standing line item in review, not a thing one careful person happens to catch. Enforcement that leans on people is the most fragile kind, which is the whole reason to push everything you can into the constraint or the CI check first. What&amp;rsquo;s left over is what review and documentation are for.&lt;/p&gt;
&lt;p&gt;One more thing before the table hardens into a rule. Some instance fixes are cheap and reversible; some carry a cost the class fix doesn&amp;rsquo;t, and Kafka partitions are the second kind.&lt;/p&gt;
&lt;div class="warning-box"&gt;
 &lt;strong&gt;Partitions are hard to take back&lt;/strong&gt;
 &lt;div&gt;Adding partitions to a keyed topic reshuffles the key-to-partition mapping. Confluent&amp;rsquo;s docs spell this out: Kafka maps a keyed message to a partition by &lt;code&gt;hash(key) % partition_count&lt;/code&gt;, and changing the count means messages with the same key can start landing on a different partition, breaking the per-key ordering you may be relying on (&lt;a class="link" href="https://docs.confluent.io/kafka/operations-tools/partition-determination.html" target="_blank" rel="noopener"
 &gt;Confluent, &amp;ldquo;Choose and Change the Partition Count&amp;rdquo;&lt;/a&gt;). And there is no supported operation to reduce the count afterward. If you scale out partitions to paper over a per-message handler, you&amp;rsquo;ve taken on an ordering hazard and a topology you can&amp;rsquo;t cleanly walk back, to avoid a change you&amp;rsquo;ll likely make anyway.&lt;/div&gt;
&lt;/div&gt;
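
&lt;p&gt;The reshuffle is easy to see with a toy version of the &lt;code&gt;hash(key) % partition_count&lt;/code&gt; rule. This sketch uses CRC32 from Python&amp;rsquo;s standard library purely as a deterministic stand-in hash. Kafka&amp;rsquo;s default partitioner hashes the key bytes with murmur2, so the specific assignments below are not the ones a real cluster would produce. The &lt;code&gt;customer-N&lt;/code&gt; keys and the &lt;code&gt;partition_for&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
import zlib


def partition_for(key: str, partition_count: int) -> int:
    """Stand-in for hash(key) % partition_count, with CRC32 as the hash."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


keys = [f"customer-{i}" for i in range(1000)]

# Keys whose partition changes when the topic grows from 6 to 12 partitions.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]

print(f"{len(moved)} of {len(keys)} keys land on a different partition")
```

&lt;p&gt;Every moved key now has older messages on one partition and newer messages on another, and nothing orders the two relative to each other for a consumer.&lt;/p&gt;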

&lt;h2 id="finding-the-level-where-the-fix-generalizes"&gt;Finding the level where the fix generalizes
&lt;/h2&gt;&lt;p&gt;Fix the class, then. But which problems are a class? An instance of a class and a genuine one-off look identical the first time, and you have to bet on which you&amp;rsquo;re holding before you&amp;rsquo;ve watched it recur. The reboot pipeline was a wasted week if exactly one box ever wedges. The foreign key is bureaucratic drag if the orphan came from a bug already fixed and the shape can&amp;rsquo;t come back. Generalize too early and you&amp;rsquo;ve built a framework for a population of one, which is its own well-documented way to lose a sprint.&lt;/p&gt;
&lt;p&gt;The heuristic that sorts them is unglamorous. Count. How many times has this exact shape cost someone an afternoon, and is the count rising? One wedge, ever, is a pet. The third overnight page to reboot a different box is a class wearing the disguise of three unrelated incidents, and the disguise is the whole problem: each one gets closed individually by whoever caught it, so the pattern never accumulates into a number anyone acts on.&lt;/p&gt;
&lt;p&gt;Profiling is how you make that count honest instead of anecdotal. &lt;a class="link" href="https://explainanalyze.com/p/its-almost-always-the-queries-part-iii-when-the-cpu-is-pegged/" &gt;The CPU-pegged case&lt;/a&gt; sorts &lt;code&gt;pg_stat_statements&lt;/code&gt; by &lt;code&gt;total_exec_time&lt;/code&gt;, which floats the high-frequency cheap query above the rare expensive one. That&amp;rsquo;s the same act in a different tool: looking past the single event that hurt once to find the shape that repeats. Once you can see the shape, what the class fix is depends on the shape. Per-unit overhead wants batching. A hot lookup that runs identically a million times a day wants a cache. A predicate the planner can&amp;rsquo;t seek on wants a SARGable rewrite or the right index, the full treatment of which is in &lt;a class="link" href="https://explainanalyze.com/p/non-sargable-predicates-how-a-function-in-where-kills-your-index/" &gt;non-SARGable predicates&lt;/a&gt;. An &lt;a class="link" href="https://explainanalyze.com/p/orms-are-a-coupling-not-an-abstraction/" &gt;ORM issuing a query per row&lt;/a&gt; wants one set-based query. The level you fix at is what&amp;rsquo;s shared, not the tool you fix with.&lt;/p&gt;
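&lt;p&gt;The arithmetic behind that sort order fits in a few lines. The two rows below are made-up numbers for illustration, not measurements, and &lt;code&gt;StatementStats&lt;/code&gt; is a hypothetical container. It mirrors the &lt;code&gt;calls&lt;/code&gt;, &lt;code&gt;mean_exec_time&lt;/code&gt; and &lt;code&gt;total_exec_time&lt;/code&gt; columns of &lt;code&gt;pg_stat_statements&lt;/code&gt; and assumes times are in milliseconds.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatementStats:
    """Illustrative stand-in for one row of pg_stat_statements."""
    label: str
    calls: int
    mean_exec_time_ms: float

    @property
    def total_exec_time_ms(self) -> float:
        return self.calls * self.mean_exec_time_ms


# Made-up figures: one rare, slow report and one frequent, cheap lookup.
stats = [
    StatementStats("monthly revenue report", calls=12, mean_exec_time_ms=9000.0),
    StatementStats("customer lookup by id", calls=2_000_000, mean_exec_time_ms=0.5),
]

by_mean = sorted(stats, key=lambda s: s.mean_exec_time_ms, reverse=True)
by_total = sorted(stats, key=lambda s: s.total_exec_time_ms, reverse=True)

print(by_mean[0].label)   # monthly revenue report
print(by_total[0].label)  # customer lookup by id
```

&lt;p&gt;Sorted by mean time, the report looks like the problem. Sorted by total time, the half-millisecond lookup accounts for roughly nine times as much execution time, and it is the shape that repeats.&lt;/p&gt;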
&lt;p&gt;Managed services force the question by removing the instance fix entirely. On your own metal you can bump the box one more size, which is exactly what lets a recurring problem hide for years. Hit a vendor&amp;rsquo;s instance ceiling and there&amp;rsquo;s no next size, no knob on their storage engine, no replication setting to reach into. The only layer left under your control is your own load, so finding the class stops being the disciplined choice and becomes the only one. The &lt;a class="link" href="https://explainanalyze.com/p/where-your-cloud-bill-actually-leaks-an-audit-nobody-runs/" &gt;cloud-bill leaks audit&lt;/a&gt; runs on the same fact: the spend is yours, and nobody&amp;rsquo;s coming to optimize it for you.&lt;/p&gt;
&lt;div class="note-box"&gt;
 &lt;strong&gt;This is not &amp;#39;always generalize&amp;#39;&lt;/strong&gt;
 &lt;div&gt;Lifting a fix to the class level has costs. A foreign key validates on every write and blocks messy backfills. A batched handler trades per-item latency for throughput and hands you partial-failure handling that a per-item loop answered for free. A query-count alert is one more thing that pages. The argument is narrow: when a shape recurs and the count is climbing, fix where it generalizes. When the population is genuinely one, don&amp;rsquo;t build the framework.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id="when-fixing-the-one-instance-is-right"&gt;When fixing the one instance is right
&lt;/h2&gt;&lt;p&gt;The argument inverts the moment the population really is one. A box that wedged once during a bad maintenance window is a pet. Reboot it, move on, and don&amp;rsquo;t build a pipeline for a failure that won&amp;rsquo;t repeat. A startup with eighteen months of runway should fix the instance every time the class fix costs an afternoon of subject-matter-expert time it can&amp;rsquo;t spare, and write the class fix down as known debt rather than pretend it&amp;rsquo;s done. And when you need relief now and the general fix is a week out, fix the instance to buy the week, then put the class fix on the calendar instead of the backlog.&lt;/p&gt;
&lt;p&gt;The trap is that last case going quiet. The cleanup query that bought a quarter is still the team&amp;rsquo;s monthly ritual two years on, the constraint never landed, and the cost has compounded the whole time in afternoons nobody bothered to total. Buying time is fine. Forgetting you borrowed it is the failure.&lt;/p&gt;
&lt;h2 id="the-tell"&gt;The tell
&lt;/h2&gt;&lt;p&gt;The signal that a team is patching instances instead of fixing classes is a particular déjà vu in the incident log: the same shape of problem, handled fresh each time, by whoever happened to catch it. Orphaned rows reconciled by three different people in three quarters with three slightly different queries. A reboot runbook followed forty times and automated zero. Ask what the recurring shape is and who owns the fix for the category. If the answer is a stack of individually-closed tickets and no name, the class is unowned, which almost always means it&amp;rsquo;s still arriving.&lt;/p&gt;</description></item></channel></rss>