We let GPT-5 run our identity verification for a month

A small B2B SaaS company replaced its identity-verification pipeline with a GPT-5 prompt for four weeks. We helped them write the post-mortem. They asked us to publish it as a warning. Names removed.
The setup
The company verified roughly 2,800 new users per week — primarily small-business owners signing up for a billing tool. Their previous stack was an off-the-shelf KYC vendor charging $1.40 per check. The CTO, eyeing the model's vision capabilities and a monthly bill of roughly $17,000, wired up GPT-5 to read uploaded ID documents, extract fields, score authenticity, and decide whether to onboard.
What worked
For two weeks, nothing visible changed. The model extracted addresses and dates of birth correctly. It flagged five obviously poor forgeries that the prior vendor had also flagged. The CTO posted a Twitter thread celebrating the savings. Three competitors retweeted.
What did not
By week three, three patterns had emerged. First, the model accepted a series of obviously fake government IDs styled after documents from a country whose authentic IDs it had never seen. Second, it declined two legitimate IDs from a Caribbean nation because it had no training distribution to compare them against. Third, it produced the single false acceptance the company is now litigating: a synthetic ID generated from a public dataset that named a real person but used a fabricated photo. The fraudster opened an account, ran $14,000 in laundered card volume through it, and disappeared.
The vendor's response
The KYC vendor they had replaced reached out within a week of the breach. The replacement quote was higher than the original — $1.85 per check, "given the new model environment." The CTO accepted, with a side agreement to add behavioral biometrics on the laptop and continuous identity verification for high-risk accounts. The model is now in the stack, used as one of three signals, never as the decision.
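The "one of three signals, never the decision" pattern can be made concrete. A minimal sketch, assuming a hard vendor gate plus model and behavioral-biometric scores that only adjust risk; all names and thresholds are illustrative, not the company's actual stack:

```python
# Sketch: the model is a signal, never the decision.
# Signal names and thresholds are hypothetical.

def onboarding_decision(vendor_passed: bool,
                        model_authenticity: float,
                        behavioral_score: float) -> str:
    """Combine three signals. The KYC vendor's check is a hard gate;
    the model and behavioral biometrics can only raise risk and
    escalate to review — they cannot approve on their own."""
    if not vendor_passed:
        return "reject"  # vendor check is the decision of record
    risk = (1.0 - model_authenticity) * 0.5 + (1.0 - behavioral_score) * 0.5
    if risk > 0.4:
        return "manual_review"  # model can escalate, never accept alone
    return "accept"

print(onboarding_decision(False, 0.99, 0.99))  # reject: model can't override
print(onboarding_decision(True, 0.95, 0.90))   # accept: low combined risk
```

The design choice worth copying is structural: no weighting of the model score can flip a vendor rejection, so a model failure degrades to friction rather than to a breach.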
What this teaches
Identity verification is not a vision task. It is a classification task with a long tail. The long tail of legitimate documents — Caribbean municipal IDs, Indian state-level licenses, Nigerian passport renewals — is where the model fails ungracefully. The long tail of forgeries is where new attackers spend their time. A KYC vendor's defensibility is not the model; it is the long-tail handling, the regulator relationships, the documented provenance of every accept-or-reject decision.
Where this connects to Manav
It does not, mostly. Manav is not a KYC vendor; we do not verify documents at onboarding. We verify the human is the same human across every action after onboarding. The right pattern is: KYC vendor for the moment of verification, Manav for the lifetime of the relationship. The CTO who replaced the vendor with a model was solving the wrong problem; the right problem was the lifetime, not the moment.
Common objections
Readers raise two questions. Couldn't this have been prevented with better prompts? No: the failures were authority gaps, not prompt failures. Doesn't this just slow agents down? Only at the highest-stakes actions, by design: velocity for safe work, friction for unsafe work, written into the delegation.
Frequently asked questions
Could the failure described have been prevented? At the delegation layer, yes. A scoped, magnitude-capped, witness-bound delegation would have refused the action at the relying party before the human even saw the request. The model behaved as instructed; the authority was the gap.
How common is this pattern in practice? More common than the press has caught. The cases that surface are the ones that produced headlines or lawsuits; the ones that did not surface are quietly absorbed as 'cost of running agents in production.' We expect the visible ratio to grow as audit trails make the invisible cases discoverable.
What's the immediate lesson? Authority is the bottleneck. Capability is the easy part — the model is good. Ship the delegation layer before the next agent goes into a system that touches dollars, data, or decisions.
Where to start
For the analytic frame behind the story, see "manav vs checkr hireright". For the practical playbook the principals would have wanted in advance, see "laptop farm playbook".
What the integration revealed about the protocol
The integration with GPT-5's reasoning surface produced a piece of evidence we had not expected. The protocol's delegation primitives, designed for human-to-agent authority, mapped onto agent-to-agent authority without modification: the reasoning agent could delegate sub-tasks to specialist agents, each operating under a scope derived from its parent's, with the chain of authority preserved end to end. We had designed for one delegation depth and discovered the architecture supported arbitrary depth. That is the property that matters most as agent reasoning chains lengthen: the substrate that handles depth-one delegation today handles depth-ten delegation tomorrow without redesign. The integration is the cleanest evidence we have that the protocol chose the right level of abstraction, and the lesson generalizes: protocols that abstract authority correctly absorb the next generation of capability without re-architecture.
The model is good. The model is not insurance. The day you discover the difference is expensive.