Keep Your Skills Sharp the Hard Way

TL;DR

Staying sharp in the LLM era isn’t nostalgia and it isn’t prompt-craft. The “agentic loop / context engineering” genre is largely a marketing story told by the parties who profit from your subscription and your dependence. The move that holds up is to keep sweating hard problems by hand, because skill you didn’t earn through friction doesn’t stick, and without it you can’t push the model to its ceiling and can’t catch it when it ships damage instead of productivity.

Take it to the extreme to see the shape of it. Two people build the same production database with the same model and the same tools. One is a database SME with twenty years on the job. The other is the world’s best LLM prompter who genuinely could not tell you what SELECT does. On day one the prompter’s toy demo might even look fine.

The difference shows up before either of them writes a line, in the questions only one of them knows to ask. The SME reads the workload first. Read-heavy or write-heavy, the capacity it has to carry and the growth curve it has to absorb, and the SLO that decides what “fast enough” even means before any tuning is worth doing. She pins down how primary keys get generated, because on a write-heavy table a random UUID v4 scattered across the clustered index pays a page-split and write-amplification cost that a monotonic key never does, and that choice is brutal to walk back once the table is large. She knows the shard key Elasticsearch rides on decides whether tonight’s extract is a seek or a full scan, so the CDC path and the load it lands on the source get designed up front instead of discovered when the extract starts timing out. And she carries the domain itself: which application-side issues are already known, which failure modes she’s personally cleaned off the floor, the landmines you can only steer the model around if you already know they’re there. The prompter can aim the model at none of this, because they don’t know it exists, and they can’t catch the model quietly dropping it. The prompt was never the skill. The skill is what lets the prompt produce a thing that survives contact with production.

Both databases boot. Both serve the feature set. On day one nobody in the room can point at the difference. Hand the two of them to whoever will carry the pager, though, with no idea which model wrote either one or how good the prompter was, and they take the SME’s without opening the schema. They’ve met the other kind at 3am.

The uncomfortable part is personal. You become the prompter by giving away, a little at a time, the work that would have made you the SME the on-call just picked.

The block-an-afternoon fix doesn’t reach the erosion

The obvious move is to put learning back on the calendar. Block Friday afternoons for a course, grind some algorithm problems, carve out time to “stay current.” Reasonable on its face, and it mostly fails for two reasons.

The first is that after-hours willpower loses to the sprint every time. The judgment above didn’t erode on a quiet Friday; it eroded inside the delivery flow, one pasted prompt at a time, under deadline, where the agent was the fast path and the fast path was rewarded. A learning block on the calendar is the first thing the sprint eats. The erosion happens in production flow, so a fix that lives outside production flow is aimed at the wrong place.

The second is that decontextualized practice rebuilds the wrong thing. Grinding LeetCode is useful for the reps it gives you (more on that below), but it doesn’t rebuild system-specific judgment, because the judgment that catches a bad migration is judgment about your schema, your access patterns, the specific table that takes a metadata lock the moment you add a column with a default on the version of MySQL you happen to run. None of that is in a course. It’s in the work. Decontextualized practice keeps the general muscle warm; it does not deposit the local knowledge that turns the agent from a liability into a multiplier on your codebase specifically.

Note

This is an argument for selective retention, not Luddite refusal. This doesn’t mean write everything by hand. The point is narrower: there’s a specific category of work, the work that builds the judgment you need to supervise the agent, that you give up at your own cost when you outsource it by reflex. Most of the rest, hand it off and move on.

Why the skill you didn’t sweat for doesn’t stick

The mechanism here is old and well-measured, and it predates LLMs by decades. Reading the answer and generating the answer leave very different traces in memory.

Roediger and Karpicke ran the canonical version in Test-Enhanced Learning (2006): students who studied a passage and then took a recall test retained substantially more a week later than students who studied the same passage repeatedly without the test. The catch that matters here is the time course. Measured five minutes after, the re-readers actually did better; the testing group only pulled ahead at two days and a week. Effortful retrieval feels worse in the moment and pays off later. That is precisely the trade an engineer declines every time they paste the problem into the agent and read back a smooth answer that felt productive and deposited nothing.

Robert Bjork named the general principle desirable difficulties: the conditions that make learning feel easiest in the moment are often the ones that produce the weakest long-term retention. His qualifier is the load-bearing part here. A difficulty is only desirable if the learner has enough foundation to eventually overcome it; below that floor it’s just an obstacle. The engineer who hands every hard problem straight to the agent is sitting below the floor, which is why watching it work teaches them nothing; there’s no foundation for the struggle to deposit against.

So the friction isn’t a cost you tolerate to look hardcore. The friction is the thing doing the work. Skip it and the skill does not form, the same way watching a climbing video builds no grip strength.

Set that next to what the tooling actually delivers, because the gap between the felt productivity and the measured productivity is the same shape as the gap between re-reading and retrieval. METR ran a controlled study in July 2025 on sixteen experienced developers working in mature repositories they knew well. The developers expected AI to speed them up by about 20%, and afterward believed it had. Measured against the control, they were 19% slower. Google’s 2025 DORA report (published September 2025) found AI adoption near 90% with over 80% of developers reporting a productivity gain, while flagging that AI adoption continues to correlate negatively with software delivery stability and that 30% of developers report little or no trust in the code the tools generate.

Weigh those numbers against the volume of content telling you the answer is a better prompt, a tighter agentic loop, more “context engineering.” A lot of that genre is a marketing campaign run by parties whose revenue is your subscription and whose moat is your dependence. The pitch quietly reframes a skill problem as a prompt problem, because a prompt problem is one they sell the solution to. The honest version, which the vendors mostly don’t lead with: the tools are real and useful, and on the right slice of work the 10x is genuinely real. What’s oversold is the dependence narrative. Prompt-craft is the multiplier, and a multiplier needs something to multiply.

Warning

The inverse failure mode is just as expensive. Hand-coding everything out of sentiment, refusing the agent on work where you’d learn nothing by doing it yourself, is its own waste. Re-deriving a CRUD endpoint by hand for the fortieth time builds no judgment you don’t already have; it just burns the hours you could spend on the query plan you can’t yet read. Sentiment is not a delegation strategy in either direction.

Where to spend the friction

The discipline is selective, and the through-line is simple: if you don’t sweat it, it won’t stick, so spend the sweat where the judgment is worth owning.

Keep doing reps on hard problems, and solve them yourself before you look. The LeetCode-style grind isn’t about the tree-inversion ever appearing in your job; it’s that forming the approach cold keeps the general muscle warm, and warm is what lets you recognize the shape of a problem before the agent has framed it for you. When a gnarly issue lands in your channel, the kind you’d normally forward to the model in one motion, take the twenty minutes to actually work it out. That twenty minutes is the deposit. The forwarded version withdraws from an account you stopped paying into.

The highest-return habit costs almost nothing: form your own hypothesis before you read the agent’s answer. Predict the query plan before you run EXPLAIN. Predict the failure cause before you read the stack trace. Predict the API shape before the model proposes one. Then read its answer and compare. The gap between your guess and the better answer is the entire lesson, delivered for free, and it’s exactly the retrieval-then-feedback loop Roediger and Karpicke measured. Skip the guess and you’ve turned a test back into a re-read.

There’s a second reason the habit matters, beyond memory. Questioning the agent’s calls and correcting them has to be constant, because once the work gets past boilerplate the model starts making decisions that are plausible and wrong. Wave them through and the small ones compound into the kind of damage you can’t cleanly restore. Coast on auto-accept and you’re fine until it stalls on something genuinely hard. Then the engineer who never kept an independent read has nothing left to break the stall with, and the work stops where it got difficult.

Reserve your value-center domain for hand-solving even when the agent is faster. If your career bet is on the database, do the query tuning and the lock analysis and the plan-regression hunts by hand, because that’s where the local judgment compounds. Delegating work outside your value center is fine and good; the discipline is about protecting the core, not maximizing total hand-coding.

Warning

Hold a hard line on one rule: don’t merge what you couldn’t have written and can’t explain. If you can’t reconstruct why the agent’s migration takes the lock it takes, you’re rubber-stamping it, and rubber-stamping is how the table-locking migration, the silently-wrong join, and the auth bypass get into production wearing your name on the commit.

Last, measure the drift directly. Every so often, turn the agent off and cold-solve something in your core domain. Not as penance. As a gauge. If a problem you’d have handled fluently two years ago now feels foreign, that’s the readout, and it’s better to get it on a quiet afternoon than in the postmortem.

When this doesn’t apply

The argument is about protecting your core, so it goes quiet everywhere your core isn’t at stake.

Throwaway internal tools are the clearest exemption. The binlog-purge script that’s sat in the toil backlog for two quarters, the diagnostic endpoint nobody will ever staff, the one-off migration helper: let the agent rip and ship it. The 10x is real on exactly that slice, the verification cost is low because the blast radius is low, and there’s no durable judgment you’re failing to build by not hand-writing a script that dies in a week.

Domains outside your value center get the same pass. Delegating your CSS is not a moral failing if you’re a backend engineer who will never own the frontend. The point was never to hand-grind everything; it was to protect the specific judgment your role depends on. Spend the friction budget there and the agent everywhere else. Be honest about where the line really sits, though. Everywhere you delegate fully, you become the no-clue prompter for that domain. The surrender costs nothing on the frontend you’ll never own. It bites later on whatever you filed under “not mine” that turns out to matter, when you can’t tell a competent answer from a confident one.

Live incidents are their own case. At 2am with customers down, use every tool that shortens the outage, the agent included. The learning doesn’t happen during the fire. It happens in the postmortem, when you reconstruct the failure by hand, form the hypothesis you didn’t have time to form live, and deposit the pattern for next time. The incident is for stopping the bleeding; the postmortem is where the rep lives.

The bigger picture

The market is already pricing this. Work the model does at a competent mid-level is getting commoditized. The premium is concentrating on the judgment that recognizes when the model is wrong, the judgment you can only build by doing the work the model offers to take off your hands. The paradox is that the same offer that makes you fast this quarter is what hollows out the skill that makes you valuable next year.

The tools are worth using. This asks for one thing: keep one specific account funded. Pick the domain you’re betting your career on, and inside that domain, keep paying the friction tax the agent keeps offering to waive. Form the hypothesis before you read the answer, refuse the merge you can’t explain, and every few weeks turn the thing off and check whether the problem you used to own still feels like yours.