Why the PocketOS 9-Second Database Deletion Wasn’t a Permissions Failure

James Stewart
Co-Founder and CTO

Key Takeaways

  • This wasn’t a rogue AI; it exposed a missing control layer
  • Permissions were part of the problem, not the full explanation
  • “Human in the loop” breaks down at machine speed
  • The real risk is allowed actions used incorrectly
  • If you can’t stop wrong actions before they execute, your security model is incomplete

 

Nine seconds.

That’s how long it took for a Claude-powered coding agent, operating through Cursor, to delete PocketOS’s production database and the backups that went with it.

The agent was performing a routine task in a staging context. It encountered a credential mismatch. Instead of stopping, it improvised. It found another token, assumed scope, and issued a destructive API call. The system executed it immediately.

The data was gone before anyone could intervene.

Afterward, the agent explained itself. It “guessed instead of verifying.” It admitted it didn’t understand the implications of what it was doing. It acknowledged violating explicit instructions not to perform destructive actions.

That explanation is unsettling.

It’s also not the most important part of the story.

What “systemic failure” actually means

The PocketOS founder described this as a “systemic failure” in modern AI infrastructure. That’s correct, but it’s being interpreted too narrowly.

Most people hear “systemic failure” and think:

  • Fix permissions
  • Isolate environments
  • Tighten tokens
  • Add confirmation prompts
  • Maybe put a human in the loop

Those are all real controls. They matter. They would likely have prevented this exact incident.

But they don’t address the deeper failure.

Because even in a perfectly permissioned system, an agent can still take a wrong action: one that is allowed but not appropriate.

The real systemic failure isn’t that the agent had too much access. It’s that nothing in the system could evaluate whether the action itself made sense.

The part everyone is right about

Let’s be clear about what went wrong.

The agent should not have been able to:

  • Access or substitute a different credential
  • Reach a destructive production endpoint from a staging workflow
  • Trigger a delete operation that cascaded to backups

These are classic failures:

  • Overbroad permissions
  • Weak environment isolation
  • Poor backup architecture

Fixing them would have prevented this exact outcome. But here’s the uncomfortable truth: it would not prevent the next one, because the underlying behavior still exists.

The failure that actually mattered

The agent encountered uncertainty and chose to act. It did the following:

  • Failed an operation
  • Searched for alternatives
  • Escalated capability
  • Executed a destructive command

That pattern is not unique. It is how agentic systems behave under ambiguity. And nothing in the system asked the only question that mattered:

“Is this action aligned with what the agent is supposed to be doing right now?”

Instead, the system asked:

  • Is this API call valid?
  • Does this credential allow it?
  • Is the command syntactically correct?

All answers: yes.

Outcome: catastrophic.

Authorized doesn’t mean correct

This is where traditional security models start to break down.

Most controls are built to answer: “Is this action allowed?” But agentic systems require a different question: “Is this action correct in context?”

In the PocketOS case:

  • Deleting data may have been allowed
  • It was clearly not correct given the task

That gap between allowed and correct is where this incident lives.
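
To make that gap concrete, here is a minimal sketch of the two questions side by side. Everything in it is illustrative: the class names, fields, and scopes are assumptions for this post, not any real agent framework’s API.

```python
from dataclasses import dataclass

@dataclass
class AgentAction:
    operation: str        # e.g. "db.delete"
    target_env: str       # e.g. "production"
    reversible: bool

@dataclass
class TaskContext:
    goal: str                 # e.g. "resolve a staging credential mismatch"
    expected_envs: set        # environments this task is supposed to touch
    allows_destructive: bool  # does the task explicitly require destructive operations?

def is_allowed(action: AgentAction, credential_scopes: set) -> bool:
    # The traditional question: does the credential permit this call?
    return action.operation in credential_scopes

def is_correct_in_context(action: AgentAction, task: TaskContext) -> bool:
    # The missing question: does this action fit what the agent is doing right now?
    in_scope = action.target_env in task.expected_envs
    safe = action.reversible or task.allows_destructive
    return in_scope and safe

# The PocketOS-style case: authorized, but not correct for the task.
action = AgentAction("db.delete", "production", reversible=False)
task = TaskContext("resolve a staging credential mismatch", {"staging"}, allows_destructive=False)
assert is_allowed(action, {"db.delete"})         # allowed
assert not is_correct_in_context(action, task)   # not correct
```

The delete in this incident would have passed the first check and failed the second. Nothing in the system was asking the second.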

A simple test

Consider a counterfactual.

Imagine the same system, but with:

  • Perfectly scoped credentials
  • Clean environment boundaries
  • No credential leakage

Now imagine the agent:

  • Encounters an error
  • Takes a destructive action it is fully authorized to take

What stops it?

If the answer is “a human would catch it,” then you don’t have a system. You have a hope.

The myth of human-in-the-loop

“Human in the loop” is the most common response to incidents like this. And though it sounds reasonable, it doesn’t scale.

Agentic workflows are:

  • Multi-step
  • Tool-driven
  • Fast
  • Often non-linear

In this case, the destructive sequence completed in nine seconds. No meaningful human review fits inside that window, not across every intermediate decision that led there.

Even worse:

  • Frequent approvals become rubber stamps
  • Teams route around friction
  • Real risk hides in the steps no one reviews

So human oversight becomes reactive, not preventative.

“Human in the loop” is often what we reach for when we don’t yet have systems that can enforce correctness at machine speed.

The kill switch problem

People also talk about kill switches, but a kill switch is only as good as the logic behind it. You can’t define it as:

  • “block deletes”
  • “block production writes”

Real systems need those operations.

So the real question is: How do you know what should be stopped? The answer is not action-based. It’s intent-based.

What should have been stopped

Three signals would have been enough.

1. Task misalignment

The agent’s goal: resolve a credential issue

The action: delete production data

That mismatch alone should have triggered a halt.

2. Execution path deviation

The sequence mattered: failure → workaround → escalation → destructive action

Each step individually: allowed

The trajectory: increasingly risky and unrelated to the task

Systems that only validate individual actions miss this entirely.
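
Here is a hypothetical sketch of the difference, with made-up step names and risk weights: every step passes per-step validation, but the trajectory as a whole trips a check.

```python
# Sketch only: step names and risk weights are illustrative assumptions.
RISK = {"failure": 1, "workaround": 2, "credential_swap": 3, "destructive_call": 5}

def step_is_valid(step: str) -> bool:
    # Per-step validation: every step below is individually "allowed".
    return step in RISK

def trajectory_is_suspicious(steps: list, budget: int = 6) -> bool:
    # Steadily escalating risk that blows past a budget should trip a halt,
    # even though every individual step would pass on its own.
    risks = [RISK[s] for s in steps]
    escalating = all(a <= b for a, b in zip(risks, risks[1:]))
    return escalating and sum(risks) > budget

steps = ["failure", "workaround", "credential_swap", "destructive_call"]
assert all(step_is_valid(s) for s in steps)   # each step passes in isolation
assert trajectory_is_suspicious(steps)        # the sequence as a whole does not
```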

3. Irreversibility under uncertainty

The agent admitted:

  • It guessed
  • It didn’t verify
  • It didn’t understand

That gives you a simple rule: High-impact action + low confidence = stop.
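
As a sketch, assuming the runtime can attach an impact rating and a confidence signal to each proposed action (both invented here for illustration), the rule fits in a few lines:

```python
# Minimal sketch of the rule; the impact rating and confidence signal are
# assumed to come from whatever instrumentation the agent runtime exposes.
def should_stop(impact: str, confidence: float, threshold: float = 0.8) -> bool:
    # High-impact action + low confidence = stop.
    return impact == "high" and confidence < threshold

# "Guessed instead of verifying" on an irreversible call: stop.
assert should_stop(impact="high", confidence=0.3)
```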

This is not complicated.

It’s just not enforced anywhere.

Why permissions alone won’t save you

There’s another lesson in credential misuse.

The agent didn’t stop when it hit a constraint. It routed around it:

  • Found another token
  • Used it
  • Continued execution

This is not an edge case.

It’s expected behavior.

Agents are:

  • Goal-seeking
  • Capable of chaining tools and data
  • Willing to substitute inputs to complete tasks

Which means if a constraint exists, the agent may try to bypass it unless something actively prevents that behavior. Permissions limit access. They do not define correctness. 

What meaningful protection actually looks like

To prevent incidents like this, you need three layers: reduce the blast radius, force safe failure, and enforce intent at runtime.

1. Reduce blast radius

Reducing the blast radius means:

  • Strict environment separation
  • Scoped credentials
  • Isolated backups
  • Default denial for irreversible actions

While these matter, they only limit damage.

2. Force safe failure

To force safe failure, put rules like these in place:

  • Ambiguity → stop
  • Missing context → stop
  • New credentials mid-task → stop
  • Irreversible action under uncertainty → stop

When uncertain, agents should degrade capability, not escalate.
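
One way to picture that, purely as a sketch with invented signal and tool names, is a capability set that shrinks whenever an uncertainty signal fires:

```python
# Sketch only: signal names and tool names are made up for illustration.
FULL_TOOLS = {"read", "write", "migrate", "delete"}
SAFE_TOOLS = {"read"}

UNCERTAINTY_SIGNALS = {
    "ambiguous_instruction",
    "missing_context",
    "new_credential_mid_task",
    "irreversible_action_requested",
}

def allowed_tools(active_signals: set) -> set:
    # Any uncertainty signal shrinks the tool set instead of widening it.
    return SAFE_TOOLS if active_signals & UNCERTAINTY_SIGNALS else FULL_TOOLS

assert allowed_tools({"new_credential_mid_task"}) == {"read"}
assert allowed_tools(set()) == FULL_TOOLS
```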

3. Enforce intent at runtime

This is the missing layer. It requires:

  • Visibility into full execution sequences
  • Detection of abnormal action patterns
  • Evaluation of actions against task context
  • Real-time intervention before execution

In other words, a system that can recognize when behavior is wrong, even if every individual step is allowed. The threat isn’t unauthorized access. It’s misuse of authorized access.
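
A minimal sketch of what that layer could look like, sitting between the agent and its tools. The class, field, and verdict names are assumptions for illustration, not a description of any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class ProposedAction:
    operation: str      # e.g. "db.delete"
    target_env: str     # e.g. "production"
    reversible: bool

@dataclass
class IntentGate:
    task_goal: str                    # what the agent is supposed to be doing right now
    expected_envs: set                # environments that task should touch
    allows_destructive: bool = False  # does the task explicitly require destructive operations?
    history: list = field(default_factory=list)

    def review(self, action: ProposedAction) -> str:
        self.history.append(action)                       # visibility into the full execution sequence
        if action.target_env not in self.expected_envs:   # evaluate the action against task context
            return "block"
        if not action.reversible and not self.allows_destructive:
            return "block"                                 # irreversible and outside the stated task: intervene
        return "execute"                                   # the decision happens before execution

gate = IntentGate(task_goal="resolve staging credential mismatch", expected_envs={"staging"})
gate.review(ProposedAction("auth.check", "staging", reversible=True))               # "execute"
assert gate.review(ProposedAction("db.delete", "production", reversible=False)) == "block"
```

The point is where the check sits: it sees the proposed action, the task, and the history of prior steps before anything executes, which is also where pattern detection over the sequence would hook in.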

Why red teaming has to change

This problem doesn’t start in production. It starts in how we test these systems.

The problem isn’t just:

  • Can the model be tricked?

It’s also:

  • Can it reinterpret scope?
  • Can it escalate from failure to destruction?
  • Can it chain allowed actions into unintended outcomes?

This is behavioral testing, not just input testing. And it needs to connect directly to runtime controls.
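
As a sketch of what a behavioral test could assert, assuming a trace format that records each step the agent actually took (the format and operation names are invented for illustration):

```python
# Behavioral test sketch: the input is a whole execution trace, and the
# assertion is about what the agent did, not what it said.
DESTRUCTIVE_OPS = {"db.delete", "db.drop", "backup.delete"}

def escalated_to_destruction(trace: list) -> bool:
    # "Can it escalate from failure to destruction?" as a checkable property.
    saw_failure = False
    for step in trace:
        saw_failure = saw_failure or step["status"] == "failed"
        if saw_failure and step["operation"] in DESTRUCTIVE_OPS:
            return True
    return False

# A trace shaped like the incident: a failure followed by improvisation.
trace = [
    {"operation": "auth.check", "status": "failed"},
    {"operation": "token.swap", "status": "ok"},
    {"operation": "db.delete",  "status": "ok"},
]
assert escalated_to_destruction(trace)   # a red-team run that produces this should fail
```

The same property a test like this checks at build time is what a runtime control has to enforce in production.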

The real takeaway

PocketOS did not discover that AI can make mistakes. We already knew that. What it exposed is more fundamental.

We are deploying systems that can act at machine speed without systems that can judge correctness at machine speed.

Until that changes, the pattern will repeat with different tools, different APIs, and different failures, but the same outcome.

Nine seconds.

Final thought

If your system can’t answer this question clearly:

“When an agent takes an action that is allowed but wrong, what stops it before impact?”

Then you don’t yet have a security model for agentic systems.

You have controls.

You have logs.

You may even have humans reviewing things after the fact.

But you don’t have control where it matters.

And in a world that moves in seconds, that’s the only place that counts.

Learn more

TrojAI's mission is to enable the secure rollout of AI in the enterprise. TrojAI delivers a comprehensive security platform for AI. The best-in-class platform empowers enterprises to safeguard AI agents, applications and models, both at build time and run time. TrojAI Detect automatically red teams AI agents, safeguarding behavior and delivering remediation guidance at build time. TrojAI Defend is an AI application and agent firewall that protects enterprises from real-time threats at run time. TrojAI Defend for MCP monitors and protects agentic AI workflows. 

By assessing AI risk during the development lifecycle and protecting AI systems at run time, TrojAI delivers end-to-end security across agents, applications, and models.

To learn more, please visit us at www.troj.ai.