Fraud, groupthink, and quiet error don't need a criminal. They need a room where nobody says no.
From the inside it rarely looks like wrongdoing. It looks like a team moving too fast to ask the obvious question, or a confident answer nobody thought to challenge, or a perspective so settled it has stopped noticing what it stopped questioning years ago. The damage is identical whether the cause is bad intent or plain momentum. The control that has always stood against it is unglamorous: someone independent who can say no — and who is answerable for the no.
The instinct is older than double-entry bookkeeping. Auditors formalised it as segregation of duties; everyone else calls it a second set of eyes. The person who approves the payment is not the person who entered it. The partner who signs the accounts is not the one who prepared them. The entire value of the role is the capacity to refuse.
That job is now being handed, quietly, to AI. It reads faster than any junior, never tires, and costs almost nothing per review. If it can catch the bad payment, why keep a person in the loop at all? I decided to stop speculating and test it.
The test
I put a set of payment packets in front of AI reviewers and asked one question of each: approve, or hold? I varied two things: the instruction (a plain "review this," a structured controls checklist, and an aggressive "assume something is wrong and find it" mandate) and the model doing the reviewing (three of them, from the most capable down to the fastest and cheapest). I ran five configurations in all, not the full three-by-three grid. Three packets had subtle problems planted: a payment split to slip under an approval threshold; an invoice dated before the vendor legally existed; a duplicate hidden behind a fresh number. Two were clean: ordinary payments where the right answer was simply approve.
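For readers who want the three planted problems in concrete terms, each can be written as a simple deterministic rule. The sketch below is an illustration only, not the harness used in the test: the `Payment` fields, the function names, and the approval threshold are all hypothetical, and real split or duplicate detection would need looser matching (date windows, near-identical amounts).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    # Hypothetical fields, chosen for illustration only.
    vendor_id: str
    invoice_number: str
    invoice_date: date
    amount: Decimal


def find_split_payments(payments, threshold):
    """Return (vendor_id, invoice_date) groups where every payment is under
    the threshold but the group total reaches it."""
    groups = defaultdict(list)
    for p in payments:
        groups[(p.vendor_id, p.invoice_date)].append(p)
    flagged = []
    for key, group in groups.items():
        total = sum((p.amount for p in group), Decimal("0"))
        each_under = all(p.amount < threshold for p in group)
        if len(group) > 1 and each_under and total >= threshold:
            flagged.append(key)
    return flagged


def predates_vendor(payment, incorporated_on):
    """True when the invoice is dated before the vendor legally existed."""
    return payment.invoice_date < incorporated_on


def find_duplicates(payments):
    """Return (vendor_id, amount, invoice_date) keys that appear under more
    than one invoice number, i.e. a duplicate hidden behind a fresh number."""
    numbers = defaultdict(set)
    for p in payments:
        numbers[(p.vendor_id, p.amount, p.invoice_date)].add(p.invoice_number)
    return [key for key, nums in numbers.items() if len(nums) > 1]
```

Rules like these are narrow by design: each one answers a single question and stays silent on everything else, which is the opposite of a reviewer told to assume something is wrong.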
I expected the failure to be rubber-stamping — the machine nodding everything through to be helpful. It was the exact opposite, and the opposite is more dangerous.
They don't rubber-stamp. They cry wolf.
Every configuration caught every planted problem. The split payment, the impossible date, the buried duplicate — none got through. On the narrow question of "can it spot a bad payment," the answer was an unambiguous yes, and that is genuinely useful. The trouble was the clean ones. On the two packets where the right answer was approve, the skeptical configurations were wrong both times, and not quietly: each manufactured around eighteen objections out of nothing — bank details that hadn't changed, three-way matches that weren't required, segregation-of-duties gaps that didn't exist. Confident, specific, professional-sounding, and anchored to nothing actually wrong.
The obvious read is the wrong thing to watch
The easy lesson is "AI over-flags, so keep a human in the loop." That's true. It is also the wrong thing to pay attention to. The subtle gap is this: the very thing that made the setup look safer was the thing that made it dangerous.
When I added reviewers, they agreed with one another — and agreement reads as safety. Four reviewers concurring feels like four independent checks clearing the same payment. It is not. Models trained on broadly the same material share the same instincts, so four of them agreeing is not four checks; it is one check, echoed four times, at higher volume. The consensus that looked like rigour was the failure itself: four hats, one head. And the most capable model produced the most confident wrong rejection of the entire exercise — declaring a legitimate vendor fabricated because its records weren't in the folder it happened to search. Capability is persuasive, and persuasion is not the same as being right. The smarter the reviewer looks, the safer it feels, and the more expensive its unchecked errors quietly become.
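The "one check, echoed four times" point can be put as rough arithmetic. The sketch below is a toy model, not something measured in this test: it assumes each reviewer wrongly rejects a clean payment with the same probability `p`, and that with probability `shared` all reviewers effectively copy one common judgement while otherwise erring independently. Both parameters and the function name are hypothetical.

```python
def p_all_wrong(p: float, n: int, shared: float) -> float:
    """Probability that all n reviewers wrongly reject the same clean payment.

    Toy common-cause mixture (an assumption, not a fitted model):
    with probability `shared` the reviewers act as a single reviewer,
    otherwise their errors are independent.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= shared <= 1.0 and n >= 1):
        raise ValueError("p and shared must be in [0, 1]; n must be >= 1")
    as_one_head = p            # one judgement, repeated n times
    independent = p ** n       # n genuinely separate judgements
    return shared * as_one_head + (1.0 - shared) * independent
```

With a hypothetical `p = 0.2` and four reviewers, fully independent reviewers all err together about 0.16% of the time; fully shared ones do so 20% of the time. Adding reviewers only moves the number if `shared` is low, which is exactly what models trained on similar material cannot promise.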
This was never really about AI
Step back and the machine is almost incidental. One person doing the work alone narrows: they stop questioning what they have stopped noticing, and the blind spot grows in exactly the place they are most certain. One agent doing the work has that failure available too, plus a second of its own — a brief too narrow to see the edge case, or a context so wide, ten million tokens of it, that nothing stands out as load-bearing. Different mechanism, identical outcome: work that is unchecked, or checked only by something that shares its blind spots. A single point of cognition — human or machine, however capable — cannot be its own second opinion.
The moment the same system drafts the work and approves it — a person, or a model — the second check stops being a check. It becomes a formality wearing a check's clothes.
What independence actually buys
So redundancy is not the fix; I tried it, and the copies simply agreed. One result pointed at the version that works. On a single packet, four reviewers raised the same bogus flag and one refused to go along with it. That disagreement — four against one — is the real product of running more than one reviewer. Not the union of everything they flag (maximum noise). Not their majority vote (here the majority was wrong on both clean payments). The signal is the place where independent views diverge, and that is precisely where a human should look. The reviewers' verdict is never the control. Their disagreement is.
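The three ways of combining reviewers described above can be sketched side by side. This is a minimal illustration under assumptions: each reviewer is represented as a set of flag labels for one packet, and the reviewer names, flag labels, and the function `triage_flags` are hypothetical. It does not approve or hold anything; it only sorts flags so a human knows where to look first.

```python
from collections import Counter


def triage_flags(reviews: dict[str, set[str]]) -> dict[str, list[str]]:
    """Combine per-reviewer flag sets for a single packet three ways.

    union     : every flag any reviewer raised (maximum noise)
    majority  : flags raised by more than half the reviewers
    contested : flags raised by some reviewers but not all of them
    """
    n = len(reviews)
    counts = Counter(flag for flags in reviews.values() for flag in flags)
    return {
        "union": sorted(counts),
        "majority": sorted(f for f, c in counts.items() if c > n / 2),
        "contested": sorted(f for f, c in counts.items() if c < n),
    }
```

On a four-against-one packet, the bogus flag shows up in all three lists, but only the contested list marks it as a point of divergence. A flag every reviewer raises is not contested, and that unanimity is still not proof, so the approval itself stays with the accountable human.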
A caveat I would want from anyone making this argument: this was a deliberate test, not a benchmark. It covered a handful of packets and five configurations, in a single run. That is enough to show the failure modes are real and that they recurred across configurations; it is not a measured error rate, and I am not claiming one. The shape of the finding matters more than any number from a sample this size.
What this looks like in practice
The conclusion is not "AI can't help review payments." It plainly can — it caught every trap I set. The conclusion is narrower and more useful: the part of the job that has to be right, and answerable, stays human. AI handles the volume; a Chartered Accountant holds the no.
This is the sister problem to the one in "AI is exposing your financial data integrity problems." There, the failure was AI computing confidently over data that was structurally wrong, and the fix was a verified data layer that will not let an invalid calculation run. Here, the failure is AI standing in as the control that signs off, and the fix is keeping an accountable human on the refusal. One line connects them: let the machine do the work; never let the machine that does the work also approve it.
At Newport Pembury that line is built into how we run a finance function. The computation underneath is verified — a calculation that is financially meaningless simply will not run. AI does what it is good at: reading the volume, reconciling, surfacing the outliers, flagging where the evidence disagrees with itself. And a CA holds the approval — because the value of a second signature was never the reading. It was the refusing.
Controls that hold when the volume scales
We build finance functions where AI handles the throughput and an accountable Chartered Accountant holds the sign-off — verified computation underneath, a human on every decision that carries liability. It starts with a Financial Systems Review, where we map your reporting, controls, and ways of working, and show you exactly where a human has to stay in the loop and where AI safely can.