When I was in theater, the first exercise they made us do wasn’t memorizing lines or hitting marks. It was the trust fall. Stand on a riser, cross your arms, close your eyes, and fall backwards into the arms of people you barely knew.
A few people hated it. They’d open their eyes mid-fall, uncross their arms to catch themselves, tense up the whole way down. Me? I was more like, let me just let go and see what happens. Not bravery, just curiosity. What does it feel like to actually commit? Turns out it feels like nothing. You fall, hands catch you, and your brain quietly updates its model of the world: okay, that worked.
That exercise has been stuck in my head lately, because it’s exactly where we are with AI coding agents right now. Some people are tensing up. Others are letting go. Most are somewhere in between, mid-fall, eyes half-open, peeking over their shoulders.
The trust spectrum
Most conversations about AI trust are binary: you either trust it or you don’t. That framing is wrong. Trust is a spectrum, and with AI, it looks something like this:
| Level | What it looks like | Example |
|---|---|---|
| 0 - Skeptic | Won’t use it at all | “I don’t trust AI with my code” |
| 1 - Verifier | Uses it, checks everything | Copy-paste from ChatGPT, then manual review |
| 2 - Collaborator | Works alongside it | AI writes tests, you write implementation |
| 3 - Delegator | Hands off defined tasks | Agent handles PR reviews autonomously |
| 4 - Orchestrator | Manages fleets of agents | Multi-agent workflows running in production |
Most developers I talk to are somewhere between Level 1 and Level 2. They’ll use Copilot for autocomplete but won’t let an agent push to production. That’s a perfectly rational position, for now.
Why trust falls work
The trust fall exercise works because of three things:
- Bounded risk - You’re falling 3 feet, not 30
- Incremental exposure - First fall is short, each one gets bigger
- Observable catch - You can feel the hands before you commit fully
Sound familiar? That’s exactly how good AI integration should work.
Bounded risk = sandboxed execution
When you let an AI agent run code, you don’t give it root access to production. You sandbox it:
```python
# Bad trust fall - no guardrails
agent.execute(user_code, permissions="all")

# Good trust fall - bounded risk
agent.execute(user_code, sandbox={
    "network": False,
    "filesystem": "read-only",
    "timeout_seconds": 30
})
```
The sandbox is the spotter. It limits how far you can fall.
Real tools are already doing this well. Claude Code ships OS-level sandboxing (Seatbelt on macOS, bubblewrap on Linux) that enforces filesystem and network isolation at the kernel level. Even if the model is compromised through prompt injection, the OS sandbox holds. Anthropic’s docs are explicit: “design workflows assuming the agent might be compromised and ensure external controls hold even if the agent’s logic is bypassed.”
The principle is universal though. Any agent framework worth using should enforce boundaries at a level the agent itself can’t override.
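The boundary-outside-the-agent idea can be sketched in a few lines of Python. This is a minimal illustration, not any real framework’s API: `SandboxPolicy` and `SandboxedExecutor` are hypothetical names. The point is that the policy is frozen and held by the executor, so the agent can request actions but can never loosen its own limits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the policy cannot be mutated after creation
class SandboxPolicy:
    allow_network: bool = False
    writable_paths: frozenset = frozenset()
    timeout_seconds: int = 30

class SandboxedExecutor:
    """Enforces the policy outside the agent's control flow."""

    def __init__(self, policy: SandboxPolicy):
        self._policy = policy  # set once, out of the agent's reach

    def check(self, action: dict) -> str:
        # The agent proposes an action; this layer decides
        if action.get("needs_network") and not self._policy.allow_network:
            return "denied: network"
        path = action.get("write_path")
        if path is not None and path not in self._policy.writable_paths:
            return "denied: write outside sandbox"
        return "allowed"

executor = SandboxedExecutor(SandboxPolicy(writable_paths=frozenset({"/tmp/agent"})))
```

A real sandbox enforces this at the OS level, as Claude Code does; the sketch only shows where the enforcement has to live: outside the thing being sandboxed.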
Incremental exposure = progressive autonomy
Start small. Let the AI suggest. Then let it draft. Then let it execute with approval. Then let it execute autonomously for low-risk tasks.
Week 1: AI suggests code completions → you accept/reject
Week 2: AI writes unit tests → you review
Week 4: AI opens PRs with tests → you approve
Week 8: AI merges passing PRs on non-critical repos → you monitor
Each week is a slightly taller trust fall. You build confidence through repetition, not through a single leap of faith.
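The ramp above can be made mechanical. Here is a hypothetical sketch (the level names and promotion rule are mine, not any tool’s feature): autonomy increases only after a streak of verified successes, and a single failure drops you back a level.

```python
# Illustrative autonomy ladder: promote on a success streak, demote on failure
LEVELS = ["suggest", "draft", "execute-with-approval", "execute-autonomously"]

class TrustLadder:
    def __init__(self, streak_to_promote: int = 10):
        self.level = 0
        self.streak = 0
        self.streak_to_promote = streak_to_promote

    def record(self, success: bool) -> str:
        if success:
            self.streak += 1
            if self.streak >= self.streak_to_promote and self.level < len(LEVELS) - 1:
                self.level += 1   # earned a taller fall
                self.streak = 0
        else:
            self.level = max(0, self.level - 1)  # one drop resets the height
            self.streak = 0
        return LEVELS[self.level]
```

The asymmetry is deliberate: trust is earned slowly and lost quickly, which mirrors how people actually calibrate.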
Claude Code has a concrete version of this: permission modes that dial autonomy up gradually. plan mode is read-only. default asks before every edit and command. acceptEdits auto-approves file changes but still asks for bash. auto mode runs an AI classifier that pre-screens tool calls, only escalating risky operations to you. If the classifier blocks three actions in a row or twenty in a session, it pauses and reverts to manual.
That’s not unique to Claude Code. The pattern shows up everywhere: GitHub Copilot’s suggestion-then-accept flow, Cursor’s diff-review mode, Devin’s plan-then-execute model. The best tools make the trust ramp explicit rather than forcing you to choose between “full access” and “read-only.”
Observable catch = traceability
You can’t trust what you can’t observe. Every AI action should leave a trace:
```javascript
// Every agent decision should be auditable
agent.on("action", (event) => {
  log({
    tool: event.tool_name,
    input: event.params,
    output: event.result,
    reasoning: event.chain_of_thought,
    timestamp: Date.now()
  });
});
```
If you can replay the reasoning, you can trust the outcome. Or at least know why to distrust it.
Tools like Claude Code take this further with declarative permission rules checked into version control. Your settings.json defines what’s allowed, denied, and what requires approval, and the whole team shares the same trust boundaries:
```json
{
  "permissions": {
    "allow": ["Read", "Edit", "Bash(npm test)"],
    "deny": ["Bash(rm -rf *)", "Bash(sudo *)"],
    "ask": ["Bash(git push *)", "Bash(docker *)"]
  }
}
```
The hierarchy (Deny > Ask > Allow) means destructive actions are blocked even if someone accidentally adds a broad allow rule. That’s defense in depth applied to trust.
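The precedence check itself is simple to sketch. Assuming fnmatch-style wildcard matching (Claude Code’s actual rule syntax may differ), deny rules are evaluated first, then ask, then allow, with the most cautious answer as the default:

```python
from fnmatch import fnmatchcase

def decide(command: str, permissions: dict) -> str:
    # Deny > Ask > Allow: check in that order so a broad allow can't win
    for rule in permissions.get("deny", []):
        if fnmatchcase(command, rule):
            return "deny"
    for rule in permissions.get("ask", []):
        if fnmatchcase(command, rule):
            return "ask"
    for rule in permissions.get("allow", []):
        if fnmatchcase(command, rule):
            return "allow"
    return "ask"  # unmatched commands escalate to the human

perms = {
    "allow": ["Read", "Edit", "Bash(npm *)"],
    "deny": ["Bash(rm -rf *)", "Bash(sudo *)"],
    "ask": ["Bash(git push *)", "Bash(docker *)"],
}
```

Because deny is checked first, `decide("Bash(sudo reboot)", perms)` comes back `"deny"` even if someone later adds `"Bash(*)"` to the allow list.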
The three failure modes of AI trust
Trust doesn’t fail in one way. It fails in three:
1. Over-trust (the blind fall)
You close your eyes, fall backwards, and nobody’s there. In AI terms: you ship agent-generated code to production without review, and it introduces a subtle bug that costs you a weekend. Or worse, you run an agent with --dangerously-skip-permissions on a repo with production credentials. The flag name includes “dangerously” for a reason.
2. Under-trust (refusing to fall)
You never close your eyes. You review every line, every suggestion, every completion. You spend more time reviewing AI output than you would have spent writing it yourself. The tool becomes overhead. This is the approval fatigue problem that newer features like auto-approve classifiers are specifically designed to solve.
3. Mis-calibrated trust (falling the wrong direction)
You trust the AI for things it’s bad at (complex architectural decisions) and distrust it for things it’s great at (boilerplate generation, test writing, code formatting). You’re falling sideways.
The fix? Calibrate continuously. Keep a mental scorecard. Where does the AI nail it? Where does it fumble? Adjust your trust boundaries accordingly.
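The mental scorecard doesn’t have to stay mental. A minimal sketch (class and category names are illustrative): track accept rates per task type, and let the numbers tell you where your trust boundaries should sit.

```python
from collections import defaultdict

class Scorecard:
    """Tracks how often you accept AI output, per task type."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"accepted": 0, "total": 0})

    def record(self, task_type: str, accepted: bool) -> None:
        s = self.stats[task_type]
        s["total"] += 1
        if accepted:
            s["accepted"] += 1

    def accept_rate(self, task_type: str) -> float:
        s = self.stats[task_type]
        # Never-seen task types score 0.0: no evidence, no trust
        return s["accepted"] / s["total"] if s["total"] else 0.0
```

A high accept rate on boilerplate and a low one on architecture is exactly the signal to delegate the former and keep reviewing the latter.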
Agents, skills, and the trust architecture
The trust question gets real when agents start doing things on their own. Not “suggest a code completion,” that’s Level 1 trust. I mean agents that spin up, pick the right tool, run a multi-step workflow, and deliver a result while you’re making coffee.
This is where the trust fall metaphor stretches. In theater, you fall once. With autonomous agents, you’re falling continuously, every task is a micro-trust-fall. And the surface area is enormous:
- Skills give agents domain expertise. A security audit skill, a database migration skill, a deployment skill. Each one is a new capability you’re trusting the agent to wield correctly.
- Tool access (MCP, function calling, whatever the protocol) defines what an agent can do. Read files? Execute commands? Hit external APIs? Each permission is a trust boundary.
- Orchestration decides what happens when agents coordinate. One agent reviews, another deploys, a third monitors. The trust isn’t just in each agent, it’s in the handoffs between them.
```
Skill: "deploy-to-staging"
├── read source code      ✓ (low risk)
├── run test suite        ✓ (bounded risk)
├── build container       ✓ (sandboxed)
├── push to staging       ⚠️ (needs approval)
└── push to production    ✗ (hard no)
```
The interesting pattern emerging is trust-per-skill, not trust-per-agent. I might trust an agent completely with test generation but not at all with infrastructure changes. Same agent, different trust levels, depending on what it’s doing.
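Trust-per-skill reduces to a lookup that deliberately ignores the agent’s identity. A hypothetical sketch (skill names mirror the examples above; none of this is a real framework’s API):

```python
# Gate levels attach to skills, not agents
SKILL_GATES = {
    "generate-tests": "auto",        # full autonomy
    "deploy-to-staging": "ask",      # human approval required
    "deploy-to-production": "deny",  # hard no
}

def gate_for(agent_id: str, skill: str) -> str:
    # agent_id is ignored on purpose: the same agent gets different
    # trust depending on what it's doing. Unknown skills get the
    # most cautious gate that still allows escalation.
    return SKILL_GATES.get(skill, "ask")
```

The key design choice is the default: a skill nobody has classified yet should ask, not run.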
Autonomous agents on critical tasks
Here’s where it gets uncomfortable. The industry is sprinting toward agents that handle critical operations: incident response, database migrations, security patching, production deployments. Tasks where a mistake doesn’t cost you an afternoon, it costs you customers.
The trust fall for critical tasks needs different spotters:
```python
# Non-critical: agent runs freely
agent.run("generate-changelog", approval=False)

# Critical: defense in depth
agent.run("apply-db-migration",
    approval=True,
    dry_run_first=True,
    rollback_plan=True,
    human_checkpoint_after=["schema-change", "data-migration"],
    alert_on_anomaly=True
)
```
The pattern isn’t “never let agents touch critical systems.” It’s “make the trust fall shorter for higher stakes.” A 2-foot fall onto a padded mat. Dry runs before real runs. Checkpoints before irreversible steps. Automatic rollback if something smells wrong.
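The checkpoint-before-irreversible-steps idea can be sketched concretely. This is illustrative Python, not any agent framework’s API: the executor pauses before every irreversible step, and a refused checkpoint stops the run before anything irrecoverable happens.

```python
class Step:
    def __init__(self, name: str, irreversible: bool = False):
        self.name = name
        self.irreversible = irreversible

def run_with_checkpoints(steps, approve):
    """Run steps in order, asking the approve() callback before each
    irreversible one. Returns the names of the steps actually run."""
    executed = []
    for step in steps:
        if step.irreversible and not approve(step.name):
            break  # checkpoint refused: stop with nothing irreversible done
        executed.append(step.name)
    return executed
```

In practice `approve` would be a human in the loop (or an anomaly detector); the structure guarantees the fall gets shorter exactly where the floor gets harder.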
MCP got one piece of this right: tools are explicit, permissions are scoped, the model doesn’t get access to everything, it gets access to what you connect. But Palo Alto Networks’ research on MCP security highlights the risks in the plumbing itself: server impersonation, overprivileged credentials in local config files, insufficient isolation between MCP servers. The trust boundary isn’t just the agent. It’s the entire tool chain.
The catch is earning the fall
Here’s what most people miss: in a trust fall, the catcher has to earn the faller’s trust. That means consistency. Reliability. Predictable behavior.
AI systems earn trust the same way:
- Consistency - Same input should produce roughly the same quality output
- Transparency - Show your reasoning, don’t just show your answer
- Graceful failure - When you’re wrong, be obviously wrong, not subtly wrong
- Bounded confidence - Say “I’m not sure” instead of hallucinating an answer
We’re building systems that need to earn trust at scale. That’s a harder problem than making them smarter.
Where I am on the spectrum
Depends on context. Professionally, I’m around Level 2.5, solidly collaborative, edging into delegation for well-scoped tasks. AI writes tests, drafts docs, reviews PRs, handles formatting. But anything that touches production or security still gets human eyes.
Personally? I’m at Level 4. My side projects are playgrounds for multi-agent workflows. Agents that chain skills together, orchestrate deployments, run experiments end-to-end. I let them fail fast so I can learn what the guardrails need to look like before the professional side catches up.
That split is intentional. Personal experiments build the intuition. Professional caution protects the things that matter. The gap between the two closes a little every month.
The paradox of trust
Here’s the weird part: the more you use AI agents, the better you get at knowing when not to trust them. Using them daily builds intuition for their failure modes. You start to feel when a response is off. You develop a sixth sense for hallucination.
People who never use AI have the least calibrated trust. They either fear it completely or expect magic. The practitioners, the ones doing trust falls every day, are the ones with the most nuanced view.
Trust isn’t given. It’s practiced. Every time you let an AI agent handle a task and verify the result, you’re doing a trust fall. Some days you land perfectly. Some days you hit the floor. But you get up, adjust the height, and fall again.
The question isn’t whether to trust AI. It’s how to build a practice of trust, with sandboxes, with traceability, with incremental exposure.
Close your eyes. Cross your arms. Fall.
Then check the logs.
About Hemanth HM
Hemanth HM is a Sr. Machine Learning Manager at PayPal, Google Developer Expert, TC39 delegate, FOSS advocate, and community leader with a passion for programming, AI, and open-source contributions.