The Coding Lesson for the Rest of Us
Why coding agents got brilliant first, why your domain can't copy their shortcut, and what you have to build instead.
One question I get a lot is basically: am I going to be replaced by AI? I used to answer that 1) nope, won’t happen and 2) we have two moats: judgment and curiosity.
I was wrong. We’ll all be replaced.
Kidding. But I was wrong about something I want to talk about: the “moat”. Or more precisely, the human vs. machine false dichotomy.
This dichotomy is a defensive posture of non-engagement that makes us shrink over time. We need to treat AI systems as “intelligence partners” or, to put it simply, as silicon colleagues (sometimes a bit weird, like an AI version of Dwight Schrute), because we can teach them at least part of our judgment. That same judgment then operates across ten times the surface area and compounds over time.
A perfect example is this recent Toronto Star investigation into Ontario's strong-mayor powers. Kate Allen, David Rider and Nathan Pilla asked whether these new powers, introduced to accelerate housing, were actually being used to build homes. The finding: only 2% were directly aimed at creating housing.
But the method deserves just as much attention and shows what happens when you apply editorial judgment at scale. The team first made their editorial judgment explicit:
What counted as a substantive use of power and a housing decision
Which legislative power was being used
How each decision should be categorized
A custom AI tool then applied those criteria across thousands of documents. Three journalists independently classified 121 decisions, producing 484 human answers to compare with the model. They also reviewed the housing results and edge cases before publishing.
I had the pleasure of giving the team feedback and advice, and of seeing first-hand how rigorous they were. This is the lesson more newsrooms should pay attention to: AI can help scale investigations when editorial judgment is codified, tested, and kept accountable to humans.
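To make “tested” concrete, here is a minimal sketch of a validation step of this kind. It is not the Star's tool: the label names, the sample data and the function names are all hypothetical, and it assumes every decision was labelled by several people. The idea is simply to compare the model's label with the human majority and to surface the cases where humans themselves could not agree.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority of humans, or None if there is none."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > len(labels) / 2 else None

def agreement_report(human_labels, model_labels):
    """Compare the model's label with the human majority, decision by decision."""
    agreed, disagreements, unresolved = 0, [], []
    for decision_id, labels in human_labels.items():
        consensus = majority_label(labels)
        if consensus is None:
            unresolved.append(decision_id)      # humans disagree: an edge case to review
        elif model_labels[decision_id] == consensus:
            agreed += 1
        else:
            disagreements.append(decision_id)   # model contradicts the human majority
    scored = len(human_labels) - len(unresolved)
    rate = agreed / scored if scored else 0.0
    return {"agreement_rate": rate, "disagreements": disagreements, "unresolved": unresolved}

# Hypothetical sample data, invented for illustration only.
HUMAN = {
    "d1": ["housing", "housing", "housing"],
    "d2": ["budget", "budget", "staffing"],
    "d3": ["housing", "budget", "staffing"],
    "d4": ["staffing", "staffing", "staffing"],
}
MODEL = {"d1": "housing", "d2": "housing", "d3": "budget", "d4": "staffing"}

report = agreement_report(HUMAN, MODEL)
print(report)
```

The useful output is less the agreement rate than the two lists: the disagreements tell you where the written criteria need sharpening, and the unresolved cases are the ones a human should review before anything is published.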
And let me circle back to curiosity: this whole investigation started the way every good story does. A reporter asked a question only someone deep in their beat would think to ask.
The reporter hadn’t been replaced. They’d been amplified. By an agent built around their judgment, not against it.
We’re still using agents like task-rabbits
Most of us approach AI the way we approach any tool: I have the answer, the tool executes. Draft this headline. Summarize this transcript. Suggest a follow-up question. Useful, and completely safe, but limited.
The first product we launched at Mizal is a “newsroom in a box” where journalists and creators put agents to work. We’ve now logged around ten thousand of those tasks, and roughly 80% are monitoring tasks. Smart, high-value, and also the cautious end of the range.
The interesting thing is that another field already moved past this, and we can read the lesson straight off the page. This field is coding. For two years, software engineers have been the test population for working with agents, and there are two findings worth stealing.
1. Prove AI wrong
The first is about the models. The jump from GPT-5.2 to 5.5 or Claude 4.6 to 4.8 isn’t a bigger brain. It’s largely the same base model. What changes is everything that happens after: more rounds of reinforcement learning on problems where the answer can be checked. Code that compiles and passes its tests. Math that’s provably right.
The field calls it reinforcement learning from verifiable rewards, and the logic is simple: when a domain hands you a clean signal for “correct,” you can train a model relentlessly against it, and it gets very good very fast.
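Here is a toy illustration of what a “clean signal” means, assuming a hypothetical candidate function named solve and a couple of invented test cases. It only shows the reward side: a real training setup would sandbox the code and feed this score into a reinforcement learning loop, which is not shown here.

```python
def verifiable_reward(candidate_source, test_cases, func_name="solve"):
    """Return 1.0 if the candidate code passes every test case, else 0.0."""
    namespace = {}
    try:
        # Sketch only: never exec untrusted code outside a sandbox.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Code that does not parse, or crashes, earns nothing.
        return 0.0
    return 1.0

# Hypothetical test cases and candidates, invented for illustration.
TESTS = [((2, 3), 5), ((-1, 1), 0)]
GOOD = "def solve(a, b):\n    return a + b\n"
BAD = "def solve(a, b):\n    return a * b\n"
BROKEN = "def solve(a, b) return a + b"

for name, source in [("good", GOOD), ("bad", BAD), ("broken", BROKEN)]:
    print(name, verifiable_reward(source, TESTS))
```

Nobody has to form an opinion about the candidate: the tests decide. That is the property most knowledge work lacks.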
Unfortunately for the rest of us, most knowledge work doesn’t come with a compiler. There is no automatic reward that tells a model “that was sound editorial judgment” or “that source was strong enough”. You can’t simply wait for the next model to be brilliant at your domain the way it became brilliant at code. The training signal that produced those gains doesn’t exist for your work by default.
And even if it did, the lab approach took millions of checked examples and mountains of compute. You don’t have that (and anyway, a newsroom with millions of logged editorial errors is not exactly a reassuring sales pitch). But you don’t need it either.
2. The Lucky Luke of AI
That is what makes the second finding the one that actually transfers: it works at the scale you’ve got. Once you have a capable model, the difference between an agent that works and one that produces garbage lives almost entirely in what surrounds it. This is what engineers now call the harness: the steering, the context you feed in, the guardrails, the evaluation loops, the small encoded rules about what good looks like.
Addy Osmani put it cleanly: agent failures, more often than not, are configuration problems, not model limitations.
“A decent model with a great harness beats a great model with a bad harness.” - Addy Osmani
Put the two findings together and you get the thing that transfers to journalism or other domains. The model will not arrive pre-trained on your judgment, because your domain never generated the signal to train it. So the harness is the only place that judgment can come from, and it has to come from you.
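A rough sketch of what a harness can look like in miniature, under heavy assumptions: stub_model stands in for a real model call, and the two rules (a length limit and a banned-word list) are hypothetical examples of “small encoded rules about what good looks like”. The loop feeds rule violations back to the model and hands the draft to a human if it never passes.

```python
from typing import Callable, List, Optional

# A rule returns a complaint, or None when the draft passes.
Rule = Callable[[str], Optional[str]]

def max_length(limit: int) -> Rule:
    def rule(draft: str) -> Optional[str]:
        return f"longer than {limit} characters" if len(draft) > limit else None
    return rule

def banned_words(words: List[str]) -> Rule:
    def rule(draft: str) -> Optional[str]:
        hits = [w for w in words if w in draft.lower()]
        return f"uses banned words: {', '.join(hits)}" if hits else None
    return rule

def run_with_harness(model, task, context, rules, max_attempts=3):
    """Call the model, check the draft against every rule, retry with feedback."""
    feedback: List[str] = []
    draft = ""
    for attempt in range(1, max_attempts + 1):
        draft = model(task, context, feedback)
        feedback = [msg for msg in (rule(draft) for rule in rules) if msg]
        if not feedback:
            return {"status": "accepted", "draft": draft, "attempts": attempt}
    return {"status": "escalate_to_human", "draft": draft,
            "attempts": max_attempts, "problems": feedback}

def stub_model(task, context, feedback):
    """Hypothetical stand-in for a model call: it only improves when given feedback."""
    if feedback:
        return "Council approves transit budget"
    return "SHOCKING: council approves a transit budget that changes everything forever"

RULES = [max_length(60), banned_words(["shocking"])]
result = run_with_harness(stub_model, "Write a headline",
                          "Council voted 7-2 to approve the transit budget.", RULES)
print(result)
```

Notice that nothing in the rules is about the model. They are about the desk's standards, which is why they have to come from the desk.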
Earlier this week, at the Nordic AI Media Summit in Copenhagen, I was making roughly this argument from a stage. The sharper version came from two of my co-speakers. Kasper Lindskow, who runs AI at JP/Politikens Hus, framed the NAMS breakout around a simple premise: agentic coding is the signal for what's coming to the rest of news.
Simon McNish, at Thomson Reuters, made the case as directly as I've heard it. The industry is sketching toward teams that manage work rather than supervise agents. But you don't leap there, he added. You earn it, by first building the muscle of working alongside agents: knowing what they're good at, what they fail at, where they need scaffolding. The harness doesn't drop in from above. You build it because you've been working with agents long enough to know what they need.
Your judgment doesn’t have to stay in your head
So what is the harness, for a journalist, an analyst, a researcher or anyone whose value is judgment rather than code?
It’s your judgment, written down well enough that a machine can apply it.
When I used to call editorial judgment our moat, I had that exactly backwards. Judgment isn’t a moat because it’s locked in our heads. It’s a moat because we can teach it.
This is also what makes the shift from tasks to workflows possible. A task is one-shot. A workflow chains judgment across steps: ingest, weigh against the archive, score for newsworthiness, flag for the desk, learn from what got published. Each step is trivial on its own. The chain is where the work disappears, and it only holds together if your judgment is encoded at every step.
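As a sketch, the chain can be as plain as a list of small functions, each carrying one piece of encoded judgment. Everything specific here is hypothetical: the archive contents, the keyword weights and the flag threshold are invented, and the last step (learning from what got published) is left out to keep it short.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    text: str
    notes: list = field(default_factory=list)
    score: int = 0
    flagged: bool = False

# Hypothetical editorial knowledge, written down so a machine can apply it.
ARCHIVE = {"council approves transit budget"}               # stories already covered
BEAT_KEYWORDS = {"housing": 3, "budget": 2, "council": 1}   # what this desk cares about
FLAG_THRESHOLD = 3

def ingest(raw):
    return Item(text=raw.strip().lower())

def weigh_against_archive(item):
    if item.text in ARCHIVE:
        item.notes.append("already covered")
        item.score -= 5
    return item

def score_newsworthiness(item):
    item.score += sum(w for kw, w in BEAT_KEYWORDS.items() if kw in item.text)
    return item

def flag_for_desk(item):
    item.flagged = item.score >= FLAG_THRESHOLD
    return item

STEPS = [weigh_against_archive, score_newsworthiness, flag_for_desk]

def run_workflow(raw):
    """Pass one incoming item through every step in order."""
    item = ingest(raw)
    for step in STEPS:
        item = step(item)
    return item

print(run_workflow("Council approves transit budget"))
print(run_workflow("  Mayor uses new powers to fast-track housing  "))
```

Each function is trivial. The editorial position lives in the three constants at the top, and changing one of them changes what reaches the desk.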
The principle I keep coming back to:
codify the decision rights,
delegate the execution rights.
The agent moves things. You still own the trajectory.
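One possible reading of that principle, as a sketch: a small table that says which actions the agent may carry out on its own and which need a named human. The action names are hypothetical, and defaulting unknown actions to human approval is my assumption, not a rule from any particular system.

```python
# Hypothetical decision-rights table: who owns each kind of action.
DECISION_RIGHTS = {
    "fetch_documents": "agent",
    "draft_summary": "agent",
    "publish": "human",
    "contact_source": "human",
}

def execute(action, approved_by=None):
    """Run an action only if its owner allows it; unknown actions default to human."""
    owner = DECISION_RIGHTS.get(action, "human")
    if owner == "human" and approved_by is None:
        return f"blocked: '{action}' needs human approval"
    return f"done: {action}"

print(execute("draft_summary"))
print(execute("publish"))
print(execute("publish", approved_by="night editor"))
```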
The hardest thing to teach is doubt
There’s one place this gets genuinely hard, and it happens to be the place journalism cares about most.
Agents are trained to produce. They are optimized to say something, and they are bad, like structurally bad, at saying nothing, or at flagging when their own confidence should drop. For most domains that’s an annoyance. For anyone in the business of trust, it’s the whole game. And you can see the industry trying to tackle this problem, for example with the Opus 4.8 release.
You cannot fix this by telling one model to be more careful (the good ol’ “don’t hallucinate”). You just get a confident liar with a hedge.
I’m incredibly lucky to count Philippe Beaudoin as one of our advisors at Mizal. He’s been building and thinking about AI for years. One of his most striking ideas, from my perspective, is that the unlock isn’t filtering an AI toward one correct view, it’s diverse interaction between perspectives. (This one-hour conversation with him at Cohere is definitely worth watching.)
Think about the news desk. A story gets argued: reporters, editors, a skeptic in the corner. The reader never sees the fight, only the calibrated result. The argument is the product, but it happens backstage.
The research is converging on exactly this. A recent paper with the wonderful title “Reasoning Models Generate Societies of Thought” looked inside the strongest reasoning models and found they work by simulating a society of perspectives debating internally. The diversity of viewpoints is the engine of good reasoning, not a side effect. The authors make the larger case in Science:
“The path to more powerful AI runs not through building a single colossal oracle but through composing richer social systems.” - James Evans
Intelligence, in other words, is a society, not a soloist.
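A toy analogy of the news desk, not a description of what the paper found inside reasoning models. The three perspectives here are hand-written rules standing in for what would be separate model calls with different briefs, and the claim fields (sources, on_record, documents) are hypothetical. What matters is the last branch: when the perspectives disagree, the system declines to make the call.

```python
from collections import Counter

def skeptic(claim):
    return "hold" if claim["sources"] < 2 else "run"

def reporter(claim):
    return "run" if claim["on_record"] else "hold"

def editor(claim):
    return "run" if claim["sources"] >= 1 and claim["documents"] else "hold"

PERSPECTIVES = [skeptic, reporter, editor]
NO_CALL = "no call: escalate to a human editor"

def desk_verdict(claim, min_agreement=1.0):
    """Return the shared verdict, or abstain when agreement falls below the bar."""
    votes = Counter(perspective(claim) for perspective in PERSPECTIVES)
    verdict, count = votes.most_common(1)[0]
    if count / len(PERSPECTIVES) < min_agreement:
        return NO_CALL
    return verdict

solid = {"sources": 3, "on_record": True, "documents": True}
thin = {"sources": 1, "on_record": True, "documents": True}
print(desk_verdict(solid))
print(desk_verdict(thin))
```

Doubt here is not a sentence the model adds to its answer. It is a property of the system: disagreement between perspectives is what produces the abstention.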
Governance has to move too
If you take this seriously, the way you govern AI has to change with it. Almost every newsroom wrote its AI guidelines around “human in the loop”: a person approves the artifact before it ships. That works when there’s a single artifact. It quietly breaks the moment content is generated on demand, per user, or when an agent is chaining five steps on its own. Where, exactly, is the loop?
MIT Sloan recently published “Philosophy eats AI” with the argument that your system’s epistemology, what it treats as knowledge, decides what it produces. As they put it: “creating reliably effective autonomous or semiautonomous agents depends less on technical stacks and/or algorithmic innovation than philosophical training that intentionally embeds meaning, purpose, and genuine agency into their cognitive frameworks.”
“Those who ignore this philosophical verity will create powerful but ultimately limited tools; those embracing it will cultivate AI partners capable of advancing their strategic mission.”
This will also be an architectural question. We need to build agentic systems with full traceability and auditability. Again, like coding does.
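A minimal sketch of what traceability can mean in practice, assuming a simple in-memory log: every step an agent takes is recorded with its inputs, its output and who acted, and each record carries a hash of the previous one so that later edits to the trail are detectable. The field names are hypothetical; a real system would also need durable storage and access control.

```python
import hashlib
import json

def _digest(body):
    """Stable SHA-256 over a record's content."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def append_record(log, step, inputs, output, actor):
    """Append one step to the trail, chained to the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    body = {"step": step, "inputs": inputs, "output": output,
            "actor": actor, "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})
    return log

def verify(log):
    """Return True only if no record was altered and the chain is unbroken."""
    prev_hash = "genesis"
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True

trail = []
append_record(trail, "classify", {"doc": "decision-17"}, "housing", actor="agent")
append_record(trail, "review", {"doc": "decision-17"}, "confirmed", actor="editor")
print(verify(trail))
```

With a trail like this, “where is the loop?” has an answer you can point to: which step, which input, which human.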
Not replacement. Expansion.
So the question is whether we’ll teach agents enough of what we know to make them worth having.
That brings us back to replacement, and to a legitimate fear: if I teach a model my judgment, can I be replaced? We’ve all seen Meta trying to codify their engineers’ expertise, or Mercor hiring experts to teach AI how to do their jobs.
Two thoughts on this:
First, we can codify judgment - at least part of it. But is there something we cannot? A few weeks ago, I shared part of these thoughts with the EPIC community, and got a fascinating question from Lindsey DeWitt Prat: what about intuition? You know, this muscle that tells you which question to ask before you have any evidence. Which is closely connected to curiosity. I have more questions than answers about this, and I’d love to hear your thoughts. But I do think that when judgment scales, intuition doesn’t get sidelined. It scales with it.
Secondly, it could unlock a move from expert to synergist. From “I’m the smart one, the tool runs the task” to “we figure this out together”. The agent sees patterns I miss, I see meaning it misses, and the loop produces something neither of us could alone. It’s a change in posture, not a change in tooling. The job didn’t go away. It moved up a level, and across ten times the surface area. The companies that can figure this out will have an incredible advantage over competitors with a short-sighted view of “let’s fire everyone and let AI do the work”.
Not replacement. Expansion.
P.S.: If you want to dig into the question of job replacement, this post from a16z combs through data on why the AI jobs apocalypse is “complete fantasy”.