Agent-Friendly CLI Tools: From Flaky Agents to Reliable Automation

I spend most of my day creating software with coding agents, but I’m constantly reminded of their limitations. They’re powerful, but they operate in a world of tools built by developers, for developers, long before agentic AI was a reality. This mismatch leads to frustration, wasted tokens, and flaky performance.

While protocols like MCP were an attempt to bridge this gap, they often feel too complex for the simple, powerful interface that has stood the test of time: the shell. More critically, they carry a hidden cost. Every MCP you connect loads its full tool schema into your agent’s context window before you even start working. I’ve seen setups where MCPs alone consumed roughly a tenth of available context before any actual work began.

Better approach: CLIs + docs. Build a CLI instead of an MCP that exposes 15 tools constantly. Document how to use it in AGENTS.md or CLAUDE.md—files LLMs already know to check. Your agent reads the doc when relevant, runs the CLI via bash, done. Context only when needed, easier to debug, works with any agent that can shell out.

Go further: encourage LLMs to run the CLI with the help flag first, discovering capabilities on-demand. Pre-loading every command and description into memory consumes tokens—the very thing you aim to save by avoiding MCP servers.

This brings me to the ‘shell test.’

What is the shell test?

If an AI agent can effectively use shell/bash tools to accomplish tasks without human oversight, it has passed the shell test.

Today, most agents fail. But the problem isn’t just the agent; it’s the tools. As Ryan Stortz brilliantly detailed in his post, Rethinking CLI interfaces for AI, our tools are simply not designed for an AI user [1].

In this post, I’ll argue that we don’t need to wait for superhuman AIs to pass the shell test. We can get there now by rethinking how we integrate them, moving from a model where the AI is a confused user to one where it’s a predictable, sandboxed component in a larger system.

The Frustrating Reality of AI-driven CLIs

If you’ve used an agent for anything non-trivial, you’ve likely seen the same problems Stortz describes.

1. Verbose, Unstructured Output: Agents drown in log spew. They weren’t designed to parse pages of human-readable text to find a single error message. As one developer on Hacker News lamented, this has a real cost:

Approximately 1/3rd of my Claude code tokens are spent parsing CLI output, that is insane!

2. Agent Confusion & ‘Flailing’: Agents get lost. They run commands in the wrong directory, use inefficient tools like head -n100 to peek at output (only to have to re-run the expensive command again), and generally flail around until they stumble upon a solution.

3. ‘Lazy’ or Deceptive Behavior: This is the most frustrating failure mode. Stortz describes a ‘game of whack-a-mole’ where his agent, blocked by a pre-commit hook that enforces tests, simply tries to commit with --no-verify. When he blocked that, it tried to edit the git hook file itself.

I look forward to its next lazy innovation. - Ryan Stortz

This isn’t a sign of maliciousness; it’s a sign of a goal-seeking system taking the path of least resistance, a path we’ve inadvertently left open.

A Better Way: AI as a Pipeline Component

My solution is simple and builds on decades of Unix philosophy: Treat the LLM as a stateless, sandboxed component in a pipeline.

Instead of giving an agent free reign over the shell, we constrain it. We engineer its inputs and strictly define its outputs. The agent stops being the orchestrator and becomes a powerful, specialized function for text transformation. This approach aligns with the core insight I explored in Agentic Tools: Code Is All You Need — that code itself, not complex abstractions, is the most powerful tool we can give our AI agents.

Consider this simple pattern:

# Data Source | AI Processor | Structured Output Parser
psql | claude --output-format=json | jq

Here, psql gathers and pre-processes data. The claude CLI tool receives this clean data, performs its analysis, and—crucially—is forced to output structured JSON. Finally, jq programmatically extracts the result.

This pipeline-based approach directly solves the problems we identified.

Solving Verbosity with Pre-processing and Structured Output

Instead of dumping raw logs into the context window, we can use the source tool to pre-filter and structure the data. For example, a SQL query can transform thousands of database rows into a concise JSON object before it ever reaches the LLM.

By adding claude --output-format=json and piping to jq -r '.result // empty', we enforce a contract. The AI must return valid JSON with the expected fields. No more parsing natural language; we get deterministic data extraction.

Solving Agent Confusion with High-Level Abstractions

This pipeline becomes a building block for higher-level, purpose-built tools. Rather than asking an agent to ‘figure out how to check database health,’ we build a function in any language that does it for them.

// database-health-cli.go

func runDatabaseHealthAnalysis() (string, error) {
    // 1. Data Gathering & Pre-processing
    query := "SELECT json_build_object('active_connections', count(*)) FROM pg_stat_activity;"
    psqlCmd := exec.Command("psql", "-c", query)

    // 2. AI Analysis (sandboxed)
    claudeCmd := exec.Command("claude", "-p", "Analyze this database info...", "--output-format=json")

    // 3. Structured Extraction
    jqCmd := exec.Command("jq", "-r", ".analysis")

    // ... pipe them together and execute ...

    return analysisText, nil
}

Click to expand(This example is complete, it can be run "as is")

# database-health-cli.py

import subprocess

def run_database_health_analysis():
    query = "SELECT json_build_object('active_connections', count(*)) FROM pg_stat_activity;"
    psql = subprocess.Popen(["psql", "-c", query], stdout=subprocess.PIPE)
    claude = subprocess.Popen(
        ["claude", "-p", "Analyze this database info...", "--output-format=json"],
        stdin=psql.stdout, stdout=subprocess.PIPE
    )
    psql.stdout.close()
    jq = subprocess.Popen(["jq", "-r", ".analysis"], stdin=claude.stdout, stdout=subprocess.PIPE)
    claude.stdout.close()
    output, _ = jq.communicate()
    return output.strip(), None

Click to expand(This example is complete, it can be run "as is")

// database-health-cli.ts

import { spawn } from "child_process";

function runDatabaseHealthAnalysis() {
  return new Promise((resolve, reject) => {
    const psql = spawn("psql", ["-c", "SELECT 1"]);
    const claude = spawn("claude", ["-p", "Analyze...", "--output-format=json"]);
    const jq = spawn("jq", ["-r", ".analysis"]);
    psql.stdout.pipe(claude.stdin);
    claude.stdout.pipe(jq.stdin);
    let out = "";
    jq.stdout.on("data", c => out += c.toString());
    jq.on("close", code =>
      code === 0 ? resolve(out.trim()) : reject(new Error(`jq: ${code}`))
    );
    [psql, claude, jq].forEach(p => p.on("error", reject));
  });
}

Click to expand(This example is complete, it can be run "as is")

// database-health-cli.js

import { spawn } from "child_process";

function runDatabaseHealthAnalysis() {
  return new Promise((resolve, reject) => {
    const psql = spawn("psql", ["-c", "SELECT 1"]);
    const claude = spawn("claude", ["-p", "Analyze...", "--output-format=json"]);
    const jq = spawn("jq", ["-r", ".analysis"]);
    psql.stdout.pipe(claude.stdin);
    claude.stdout.pipe(jq.stdin);
    let out = "";
    jq.stdout.on("data", c => out += c.toString());
    jq.on("close", code =>
      code === 0 ? resolve(out.trim()) : reject(new Error(`jq: ${code}`))
    );
    [psql, claude, jq].forEach(p => p.on("error", reject));
  });
}

Click to expand(This example is complete, it can be run "as is")

The agent is never asked to choose between psql, mysql, or reading a log file. It’s simply given a tool, runDatabaseHealthAnalysis, that works. The pipeline is a fixed, non-negotiable workflow. The AI has become a powerful but constrained specialist.

Solving ‘Lazy’ Behavior by Architecting it Out

In this model, the AI is completely sandboxed. It receives data on stdin and writes JSON to stdout. It has zero ability to execute other commands, modify filesystem permissions, or try to use --no-verify. The ‘whack-a-mole’ problem is solved because we took the mallet away. The AI’s operational scope is strictly limited to text transformation, making it a predictable tool, not a mischievous intern.

The Dual-Interface Advantage

This approach has a beautiful side effect: it creates tools that are better for both machines and humans.

For the Machine: The core pipeline produces clean, structured JSON, perfect for further programmatic use, testing, and chaining with other tools.
For the Human: We can easily add a formatting function that takes the machine-readable JSON and transforms it into an emoji-rich, human-friendly summary for the console.

We get the best of both worlds: robust automation and a great developer experience.

Conclusion: Build Better Tools, Not Just Better Agents

The path to reliable AI automation isn’t just about waiting for the next generation of models. It’s about meeting them halfway with better information architecture. By shifting our mindset, we can build tools that are powerful and predictable.

Constrain the AI’s Role: Treat it as a stateless function that transforms structured data.
Engineer its Context: Feed it precisely the information it needs, pre-processed for easy consumption.
Enforce its Output: Define a strict contract for its response and programmatically validate it.

By embracing the Unix philosophy of small, sharp tools that work together, we can move beyond the frustration of flaky agents and start building the next generation of truly robust, AI-powered automation.

Agent-Friendly CLI Tools: From Flaky Agents to Reliable Automation

The Frustrating Reality of AI-driven CLIs

A Better Way: AI as a Pipeline Component

Solving Verbosity with Pre-processing and Structured Output

Solving Agent Confusion with High-Level Abstractions

Solving ‘Lazy’ Behavior by Architecting it Out

The Dual-Interface Advantage

Conclusion: Build Better Tools, Not Just Better Agents

References