AI safety without identity is theater

Alignment work matters. Evals matter. Red teams matter. None of it changes outcomes if, when an agent does something consequential, no one can answer which human authorized the action.

What "safety" tends to mean today

The dominant AI-safety vocabulary is internal to the model: alignment training, refusal classifiers, jailbreak resistance, capability evaluations, interpretability research. These are real and important; we are not arguing against them.

The argument is that safety in deployment also requires an external scaffold: cryptographic, legible, human-anchored attribution for every agent action. Without that scaffold, even a perfectly aligned model produces outcomes that are unaccountable. With it, even an imperfectly aligned model is contained — because the chain of authority, scope, and supervision is recoverable.
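To make the scaffold concrete, here is a minimal sketch of what a signed attribution record for one agent action could look like. The field names, the scope syntax, and the use of Ed25519 are illustrative assumptions for this post, not any particular product's schema.

```typescript
// Minimal sketch of an attribution record for one consequential agent action.
// All names, fields, and values are illustrative, not a real schema.
import { generateKeyPairSync, sign, verify } from "crypto";

interface AttributionRecord {
  actionId: string;        // unique id for the consequential action
  agentId: string;         // which agent executed it
  principal: string;       // the human who authorized it
  delegationScope: string; // what that human permitted
  supervisor: string;      // who is accountable for oversight
  issuedAt: string;        // ISO timestamp of the authorization
}

// The principal signs the record; the signature travels with the action,
// so a post-mortem can recover authority, scope, and supervision.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const record: AttributionRecord = {
  actionId: "act_7f3a",
  agentId: "agent:billing-assistant",
  principal: "human:jane@example.com",
  delegationScope: "invoices:issue<=1000USD",
  supervisor: "human:cfo@example.com",
  issuedAt: new Date().toISOString(),
};

const payload = Buffer.from(JSON.stringify(record));
const signature = sign(null, payload, privateKey);

// Any auditor holding the principal's public key can verify attribution.
console.log("attributed:", verify(null, payload, publicKey, signature));
```

The details do not matter; what matters is that the record is cryptographic, legible, and anchored to a named human, so the chain of authority is recoverable after the fact.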

Three failures of identity-free safety

Forensic. An agent does something consequential. The post-mortem cannot determine which human, with what scope, authorized the action. The team patches the model and the prompt; the architectural cause is invisible.

Insurance and liability. Carriers will not underwrite undocumented risk. Policies exclude "autonomous agent operation outside documented human supervision." Without identity, the agent is uninsurable, which means the operator cannot legally deploy at scale.

Regulatory. Article 14 of the EU AI Act demands human oversight that is "effective" and provable. Demonstrating that effectiveness requires the audit-log artifact an identity layer produces. The model's safety properties do not appear in the audit log; the human's authorization does.

Why this is contrarian

It is uncomfortable for AI-safety teams because it implies their work is necessary but not sufficient. The structural answer to deployment safety includes a layer above the model — the identity, delegation, and attestation layer — that is not their domain.

It is uncomfortable for identity teams because it implies their work has stakes equivalent to the model itself. They are accustomed to being plumbing.

The reframe

Safety is not a property of a model. Safety is a property of a deployed system in which a model is one component, the agent framework is another, the identity layer is another, and the humans authorizing actions are the principals. Treating any single layer as sufficient is a category error.

The model team controls model risk. The identity team controls authority and attribution risk. Both must ship for "safe AI" to mean something operational.
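A minimal sketch of how those layers might compose at deployment time: the identity layer checks for a live human delegation and writes the attribution before the agent framework executes anything. The types, function names, and scope strings are hypothetical.

```typescript
// Sketch of a deployment-time gate: the identity layer sits between the
// model's proposed action and its execution. Names are illustrative.
interface ProposedAction {
  agentId: string;
  action: string;
  target: string;
}

interface Delegation {
  principal: string;        // the human granting authority
  agentId: string;
  allowedActions: string[];
  expiresAt: number;        // epoch milliseconds
}

function authorize(
  proposal: ProposedAction,
  delegations: Delegation[],
): Delegation | null {
  const now = Date.now();
  return (
    delegations.find(
      (d) =>
        d.agentId === proposal.agentId &&
        d.allowedActions.includes(proposal.action) &&
        d.expiresAt > now,
    ) ?? null
  );
}

function execute(proposal: ProposedAction, delegations: Delegation[]): void {
  const grant = authorize(proposal, delegations);
  if (!grant) {
    // No human authority on record: the action is refused regardless of
    // how well the model behaved.
    throw new Error(`unauthorized: ${proposal.agentId} -> ${proposal.action}`);
  }
  // The attribution is written before the side effect, so the audit trail
  // always names the principal.
  console.log(`audit: ${grant.principal} authorized ${proposal.action}`);
  // ... perform the action ...
}
```

The only point of the sketch is that the check and the audit write live outside the model, in the layer this post argues is missing.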

Common objections

These are the strongest counter-arguments we have heard. First, that the incumbent will catch up: possibly, inside their own boundary, but the cross-platform shape is architecturally hard for them. Second, that the category is too narrow: we believe it broadens as agent autonomy compounds; we may be wrong, and the data over the next year will tell.

Frequently asked questions

What are the strongest counter-arguments? The two we hear most: (1) the incumbent will eventually ship this, and (2) the category is too narrow to support a category-defining company. We address both head-on; we believe the incumbent's architecture cannot ship this without a rebuild, and we believe the category broadens as agent autonomy compounds.

Are we ignoring legitimate criticism? We try not to. The honest criticisms — slow adoption, immature SDKs in some languages, unclear regulator response — are documented openly. We answer with progress, not with marketing.

What would make us change our mind? Three signals. A major incumbent shipping a comparable cross-platform delegation primitive. A regulator explicitly preempting the category with a different spec. A customer cohort showing they prefer the platform-bound alternative even when the audit trail is broken. None of those have appeared.

Where to start

For the steel-manned counter-position, read the AI Act Article 14 playbook. For the alternative we agree could win, see the piece on the trust layer as a $100B company. We do not need to be right for the category to be real.

Why alignment people stopped pushing back

Two years ago, the alignment community treated identity as adjacent to their work — interesting, useful, but not central. The pushback was that alignment was a model-internal problem, not an interface problem. The position has quietly shifted. The alignment researchers we now talk to most often have come around to the view that no model-internal alignment guarantee survives the deployment surface, because the surface is where the agent decides what to do, who to act as, and on whose authority. A perfectly aligned model deployed without identity infrastructure produces outputs no investigator can trace back to a human decision. The shift is not a betrayal of the research agenda; it is a recognition that alignment lives at the boundary, and the boundary is identity. The argument is no longer whether identity matters for safety. The argument is whose identity protocol earns the alignment community's endorsement, and on what evidence.

What this implies for funders

Funders allocating capital between alignment and identity research should treat the two as complementary investments at this stage rather than competing ones. Alignment dollars produce the model behavior; identity dollars produce the deployment evidence the model behavior gets evaluated against. Either alone is incomplete. Funding the pair produces compounding returns where each strengthens the other. The funders who internalize this earliest are positioning their portfolios for the safety conversations the next several years will surface.

An aligned model whose actions cannot be attributed is still a liability. An imperfectly aligned model whose every action is attributed is governable.