I gave my AI agent my credit card

For twenty days, I gave a Claude-driven shopping agent a virtual card with a $2,000 ceiling and a single instruction: keep my home, my pantry, and my office stocked. It spent $1,847.62. Three calls were genuinely smart. Two were bad. One was strange.
The setup
The card was a Stripe Issuing virtual card with category locks (groceries, household, office supplies). The agent ran on Claude, with MCP tools for Walmart, Amazon, Whole Foods, and Staples, and it logged every purchase to a spreadsheet I reviewed each morning. No per-purchase human approval; just one weekly hard cap of $500.
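For the curious, the card setup is only a few lines against the Stripe API. A rough sketch, not my exact config: the cardholder ID is a placeholder, and the category strings are illustrative, so check Stripe's merchant-category enum before copying anything.

```python
import stripe

stripe.api_key = "sk_test_..."  # test-mode key; never ship a live key in code

# Assumes an existing Issuing cardholder; the ID below is a placeholder.
card = stripe.issuing.Card.create(
    cardholder="ich_...",
    currency="usd",
    type="virtual",
    spending_controls={
        # Category locks. These enum strings are illustrative; see
        # Stripe's merchant-category list for the exact values.
        "allowed_categories": [
            "grocery_stores_supermarkets",
            "hardware_stores",
            "stationery_stores_office_and_school_supply_stores",
        ],
        "spending_limits": [
            # Amounts are in cents: the $500 weekly hard cap and
            # the $2,000 overall ceiling from the experiment.
            {"amount": 50_000, "interval": "weekly"},
            {"amount": 200_000, "interval": "all_time"},
        ],
    },
)
print(card.id, card.last4)
```

Note that the limits live on the card, not in the prompt: the card network enforces them no matter what the agent decides.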
The good calls
Toilet paper, anticipated. The agent noticed our consumption rate (4.2 rolls per week, family of four), saw a 20-pack on sale at Walmart in week two, and bought one. We did not run out for the entire experiment.

Office printer ink, swapped. The agent spotted that the $32 third-party ink we'd been buying had drawn a quality-complaint thread the previous month and switched to OEM cartridges at $14 more. The next batch of prints came out clean.

The dog food brand, switched. The agent caught that our usual brand had taken a $30 price hike since our last order and moved us to an equivalent with the same nutritional profile. I would not have noticed the hike for another two months.
The bad calls
Avocados, in bulk. The agent saw a 12-pack on a per-unit deal and bought it. Avocados ripen in four days. We ate three and threw out nine.

The wrong batteries. Our smoke detectors take 9V; the agent ordered AA. Same brand, same household-supplies category, wrong shape. We did not catch it until the smoke detector chirped at 3am.
The strange one
On day 11, the agent ordered a single 6-pack of canned tuna from Walmart at 2:14am. We do not eat tuna. The reasoning trace explained it: the agent had read a recipe my partner clipped to her Pinterest board and inferred a future meal plan that never materialized. In its enthusiasm, it prepared. It is funny. It is also the exact moment I noticed the agent had no real boundary between "reasonable inference" and "wishful inference."
What was missing
Three things. Scope discipline: "stock the home" was too broad, while "stock the home from a maintained shopping list, with discretionary additions capped at $25 each" would have caught the tuna. Magnitude limits per category: a $40 cap on perishables would have caught the avocados. An escalation rule: "anything outside the shopping list above $20 must be confirmed" would have surfaced the bulk-avocado decision before it landed in the cart. All three are standard delegation primitives. None of them existed in this setup, because the agent's authority was a prompt, not a token.
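For concreteness, here is what those three primitives look like as data rather than prompt text. A minimal sketch in plain Python; the `Delegation` class and its field names are hypothetical, and the prices in the demo are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Delegation:
    """The three missing primitives, expressed as data instead of prose."""
    shopping_list: set[str]             # scope discipline
    category_caps: dict[str, float]     # magnitude per category
    discretionary_cap: float = 25.0     # ceiling on any off-list addition
    escalation_threshold: float = 20.0  # off-list items above this need a human

    def check(self, item: str, category: str, price: float) -> str:
        # Magnitude per category: a hard cap, list membership or not.
        if price > self.category_caps.get(category, 0.0):
            return "refuse: over category cap"
        # Scope discipline: items on the maintained list are pre-approved.
        if item in self.shopping_list:
            return "allow"
        # Off-list additions are bounded outright...
        if price > self.discretionary_cap:
            return "refuse: over discretionary cap"
        # ...and escalated to the human above the threshold.
        if price > self.escalation_threshold:
            return "escalate: off-list item needs confirmation"
        return "allow"


grant = Delegation(
    shopping_list={"toilet paper 20-pack", "OEM printer ink", "dog food"},
    category_caps={"perishables": 40.0, "household": 100.0, "office": 200.0},
)

# Hypothetical prices, for illustration only.
print(grant.check("avocado 12-pack", "perishables", 42.00))     # refuse: over category cap
print(grant.check("canned tuna 6-pack", "perishables", 21.50))  # escalate: off-list confirmation
print(grant.check("toilet paper 20-pack", "household", 19.97))  # allow
```

One design note: unknown categories default to a cap of zero, so the delegation denies by default instead of allowing by default.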
What it taught me
The agent was not bad at shopping. It was good. The question I had not asked was: what is it allowed to decide, and what must it ask? That is the defining question of agent commerce today. Manav is the answer that, at the time of writing, looks closest to right.
Common objections
Two objections come up most often. Couldn't this have been prevented with better prompts? No: the failures were authority gaps, not prompt failures. Doesn't this just slow agents down? Only at the highest-stakes actions, by design. Velocity for safe work, friction for unsafe work, written into the delegation.
Frequently asked questions
Could the failure described have been prevented? At the delegation layer, yes. A scoped, magnitude-capped, witness-bound delegation would have refused the action at the relying party before the human even saw the request. The model behaved as instructed; the authority was the gap.
How common is this pattern in practice? More common than the press has caught. The cases that surface are the ones that produced headlines or lawsuits; the ones that did not surface are quietly absorbed as 'cost of running agents in production.' We expect the visible ratio to grow as audit trails make the invisible cases discoverable.
What's the immediate lesson? Authority is the bottleneck. Capability is the easy part — the model is good. Ship the delegation layer before the next agent goes into a system that touches dollars, data, or decisions.
Where to start
For the analytic frame behind the story, see "ai agent ran my consulting". For the practical playbook the principals would have wanted in advance, see "100 things agents did".
What I would do differently
Reading the experiment back, three changes would have produced a sharper signal.

First, the delegation should have been scoped: not "stock the home" but "stock from this maintained list, with discretionary additions capped at $25." The tuna line item would have been refused at the substrate.

Second, the magnitude caps should have been per-category: perishables at $40, durables at $200, household at $100. The avocado bulk-buy would have hit the cap.

Third, an escalation rule for any off-list item above $20: the agent surfaces the request, the human approves in seconds, the experiment proceeds.

The three changes together produce velocity without surrendering judgment. The agent behaves better with structure than without; the human spends less attention with structure than without. The experiment's lesson generalizes: delegation primitives are not friction. They are the substrate that lets velocity coexist with discretion.
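And what does "a token, not a prompt" mean mechanically? A toy sketch under loud assumptions: the token shape and field names are mine, and a shared HMAC key stands in for the asymmetric signatures a production system would use. The point is only that the grant becomes signed data a relying party can verify and enforce, independent of anything the agent was told.

```python
import hashlib
import hmac
import json

# Toy shared key; a real system would use per-principal asymmetric keys.
SECRET = b"shared-key-between-issuer-and-relying-party"


def mint_delegation(grant: dict) -> dict:
    """Issuer side: the human signs the scoped grant once, up front."""
    payload = json.dumps(grant, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"grant": grant, "sig": sig}


def authorize(token: dict, category: str, price: float) -> bool:
    """Relying-party side: verify the signature, then enforce the caps.

    The merchant or payment rail can refuse an out-of-scope purchase
    without ever seeing the agent's prompt.
    """
    payload = json.dumps(token["grant"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["sig"]):
        return False  # tampered or forged grant
    cap = token["grant"]["category_caps"].get(category, 0.0)
    return price <= cap


token = mint_delegation({
    "principal": "me",
    "agent": "shopping-agent",
    "category_caps": {"perishables": 40.0, "household": 100.0, "office": 200.0},
})

print(authorize(token, "perishables", 42.00))  # False: over the perishables cap
print(authorize(token, "household", 19.97))    # True
```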
You don't give an AI agent your credit card. You give it a delegation. The difference is the difference between a useful experiment and an expensive one.