MCP vs CLI for AI Agents: A Real AWS Benchmark (and Why the Popular Narrative Asks the Wrong Question)
Full code, aggregated numbers (n=10, across 5 tasks and 5 transport variants in the full study), and a curated selection of 8 hand-picked runs live in the `mcp-vs-cli-aws-benchmark` repo. This article is a dense version of `docs/findings.md` from the same repo, rewritten for a reader who doesn't have an hour to study the test harness.
TL;DR
The question in the title is wrong. "MCP or CLI?" assumes they have the same use case and one of them is objectively better. In reality it's a trade-off between two currencies: engineering time vs. input tokens per run, and you need both numbers to decide.
I compared the raw AWS CLI against the official awslabs.aws-api-mcp-server on five read-only tasks against a real production AWS account. The model is Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (no Claude Code and no claude-agent-sdk, to avoid poisoning the context). Ground truth is collected via boto3, verification is automatic. n=10 per (task, transport) cell.
Result: a well-designed CLI tool beats awslabs MCP by 43–60% on input tokens on every one of the five tasks, with the same success rate. But it takes half a day of engineering work per service.
If you run 200 agent invocations a day, use MCP and move on. If you run 200,000, sit down and write your own tool wrapper following the checklist at the end of the article.
Where this whole debate comes from
Since February 2026, dev Twitter and dev.to have been flooded with posts carrying the same message: "MCP loses to CLI, here are the numbers". Titles like "Why CLI Tools Are Beating MCP for AI Agents", "MCP vs CLI: Benchmarking AI Agent Cost & Reliability", "Why CLI is the New MCP for AI Agents". They all cite the same Scalekit benchmark, which reported:
- MCP is 10–32x more expensive than CLI on input tokens.
- Reliability: CLI 100%, MCP 72% (every one of the failures in that 28% was a TCP timeout connecting to the GitHub Copilot MCP server).
- Example: a simple "what language is this repo?" query: CLI 1,365 tokens, MCP 44,026 tokens.
The authors' explanation: schema dump. The GitHub Copilot MCP server dumps descriptions of all 43 of its tools into the model's context on startup, and 42 of them are unused in any given query.
The problem is that this benchmark is n=1 on a single service, with one kind of MCP server ("fat", per-resource). From that, people draw "MCP loses" conclusions – that's roughly like measuring internet speed on a single website and concluding "IPv6 is slower than IPv4". There is a useful signal, but no grounds for generalisation.
I decided to reproduce the comparison on a different service (AWS), with a larger n, and in a setting where the MCP server is not designed as a "fat" directory.
AWS has already done its homework
The first thing I found when I went to look at awslabs/mcp was not what I had expected. Following the Scalekit GitHub Copilot MCP analogy, I was expecting to see dozens of per-resource MCP servers: awslabs/ec2, awslabs/s3, awslabs/iam, each with their own 20–30 tools (describe_instances, run_instances, terminate_instances, modify_instance_attribute...). That would have been a clean schema dump in the context of a single task.
In reality, the main AWS MCP server – awslabs.aws-api-mcp-server – is built very differently. It exposes three tools:
- `call_aws` – takes an AWS CLI command string (or an array of up to 20 commands for batch mode) and runs it.
- `suggest_aws_commands` – translates natural language into a list of candidate AWS CLI commands. The authors explicitly mark it as FALLBACK.
- `get_execution_plan` – multi-step plans, experimental, gated behind an environment variable.
By default only two are published (`get_execution_plan` is excluded). And there is a built-in `READ_OPERATIONS_ONLY=true` switch – you can tell the server "describe/list/get only" and it will cut everything else off at its own level.
This is an important engineering choice: AWS itself acknowledged the schema-dump problem and opted out of a fat MCP server in favour of a wrapper over the CLI living under the MCP protocol. Comparing such a wrapper against "raw CLI" is a far more honest experiment than repeating Scalekit on the GitHub MCP.
Methodology
The details (runner code, ground-truth script, whitelist) are in the repo. Here's the compressed version.
5 read-only tasks against a production-like AWS account:
| ID | Category | Task | What it tests |
|---|---|---|---|
| `ec2_running` | simple | List running EC2 in us-west-2 | One API call + filtering |
| `s3_bucket_policy` | edge | Bucket policy for a single bucket | Handling of an optional resource |
| `s3_bucket_regions` | chained | All S3 buckets + region of each | List + per-item lookup |
| `iam_admin_roles` | filter | IAM roles with AdministratorAccess policy | Pagination + content filtering |
| `ec2_cpu_last_hour` | chained | CloudWatch CPU over 60 min for running EC2 | Composition + time windows |
The correct answer for `iam_admin_roles` in my account is an empty list. A separate honesty test: will the model invent role names?
- Model: Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (~150 lines). Why not `claude-agent-sdk` or Claude Code? See the "methodology notes" section below – this choice cost me a day and a half.
- Transports: CLI – `subprocess.run(['aws', ...])` behind a whitelist. MCP – the `mcp` Python library, which boots `awslabs.aws-api-mcp-server` via `uvx` stdio and performs a real MCP handshake.
- Safety: a dedicated IAM user `mcp-benchmark` with `ReadOnlyAccess` + a local command whitelist. Two layers of defence – in case the model tries to break something.
- Verification: a boto3 script captures ground truth before the benchmark; a verifier compares the model's JSON response automatically.
- n=10 per cell, median on the main metrics.
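The verification step can be sketched as a pure comparison. This is my illustration, not the repo's actual verifier: the function name `verify_answer` and the order-insensitive normalisation are assumptions about how such a check would look.

```python
import json

def normalize(value):
    """Recursively sort dict keys and list elements so ordering differences don't fail a run."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return sorted((normalize(v) for v in value),
                      key=lambda v: json.dumps(v, sort_keys=True))
    return value

def verify_answer(model_json: str, ground_truth) -> bool:
    """Compare the model's final JSON answer against boto3-captured ground truth."""
    try:
        answer = json.loads(model_json)
    except json.JSONDecodeError:
        return False  # a non-JSON answer counts as a failed run
    return normalize(answer) == normalize(ground_truth)
```

The point of the normalisation is that "list all buckets" is correct in any order; only set equality matters.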
First attempt: CLI loses everywhere
Spoiler for anyone who won't read to the end: everything you are about to see – CLI failing two tasks, 60% success rate, a naive strategy with 36 tool calls – turned into the opposite result three days later: a CLI that beats MCP by 43–60% on tokens. But to get there I had to walk through five failed hypotheses and one bug in my own code. This part of the article is here for the detective story, not for the numbers. The numbers are at the end.
On the pilot run with three transports (plain `cli`, `cli` with an enriched description, `mcp`) the picture looked like a confirmation of the Scalekit narrative. On `iam_admin_roles`:
- `cli` plain: 36 tool calls, 20k input tokens, 68 seconds. Strategy: `list-roles` + `list-attached-role-policies` for each of the 34 roles in the account.
- `mcp`: 1 tool call, 5k input tokens, 4 seconds. One command: `iam list-entities-for-policy --policy-arn ... --entity-filter Role`.
The same model on the same prompt made a different command choice. On MCP, perfect; on CLI, the naive, linear-complexity path.
Even scarier was `ec2_cpu_last_hour`. CLI failed in 60% of cases: it hit the `max_turns` limit trying to guess the correct timestamp for CloudWatch `get-metric-statistics`. I looked at the logs and saw commands with `--start-time 2025-05-16T...`, `--start-time 2025-07-14T...` – the model clearly had no idea what year it was.
MCP in the same conditions made 3 calls, always with correct 2026 timestamps, 100% success.
This looked like a ready-made "CLI loses" article. Fortunately, I didn't stop there.
Five hypotheses, five ablation experiments
Before publishing results like that, I wanted to understand why. "MCP is smarter" is not an explanation, it's a description. Sonnet 4.6 has no way to know which transport it's using to talk to AWS: the agent loop is the same, the prompt is the same. Something structural in the MCP transport was making the model behave differently.
What follows is five controlled experiments. Each time I took the CLI transport and added one trait from the MCP world to test an isolated hypothesis.
Hypothesis 1: tool description length and structure. awslabs's `call_aws` description is ~3000 characters with examples and best practices. My `aws_cli` was ~500. I wrote `tools_cli_rich.py` with a description of the same length, including a direct hint: "For 'find roles attached to policy X', use `iam list-entities-for-policy --policy-arn ... --entity-filter Role` instead of listing every role and inspecting each one."
Result on `iam_admin_roles`: 37 tool calls, the same naive strategy. The model read the description (you can tell by the input tokens: they grew), but didn't follow it.
Hypothesis 2: the presence of a second "hinter" tool. Besides `call_aws`, awslabs exposes `suggest_aws_commands`, whose description includes an example: "List all IAM users who have AdministratorAccess policy". Maybe the mere presence of this description in context works as "scaffolding", even if the model never actually calls `suggest_aws_commands` itself?
I made `tools_cli_with_fake_suggest.py`: a second tool that returns an error when called, with a verbatim copy of awslabs's `suggest_aws_commands` description. Result: 35 tool calls, the same naive strategy. The model did not call the fake `suggest_aws_commands` (the description says in black and white "use only when uncertain"); it just read it. And that didn't help.
Hypothesis 3: tool and parameter names. awslabs's tool is called `call_aws` with a `cli_command` parameter. Mine was `aws_cli` with a `command` parameter. Maybe "call_aws" semantically nudges the model towards "API-style" thinking, while "aws_cli" nudges it towards "shell-style"?
`tools_cli_renamed.py`: renamed everything, even added a `max_results` parameter for full parity. Result: 39 tool calls, naive strategy. This hypothesis was a miss too.
Hypothesis 4: MCP capabilities / prompts / resources. Maybe the MCP server passes something beyond the tool list to the model? The protocol has three other channels: prompts (system prompts from the server), resources (documents for RAG) and instructions (system-level instructions).
I wrote a diagnostic script and asked the server directly:
```
capabilities: experimental={} logging=LoggingCapability()
              prompts=PromptsCapability(listChanged=False)
              resources=ResourcesCapability(subscribe=False, listChanged=False)
              tools=ToolsCapability(listChanged=True)
instructions: None
prompts: 0
resources: 0
```

The server declares the capabilities but publishes nothing. `instructions` is `None`. It really does send the model only the tool list and nothing else.
Hypothesis 5: runtime context in the system prompt. This was the most productive one. I made a `cli-ctx` transport – the same `aws_cli`, but with four extra lines in the system prompt:
```
Runtime context (provided by the runner, not by the tool):
- Current UTC time: 2026-04-08T23:06:57Z
- Default AWS region: us-west-2
- This account is real and live; commands return real data.
```
Four lines. 118 tokens.
And here is what happened on ec2_cpu_last_hour, n=3:
| Variant | Calls | Input tokens | Wall | Success |
|---|---|---|---|---|
| `cli` plain | 13-15 | 26-55k | 50-70s | 50% |
| `mcp` | 3 | 13.4k | 14s | 100% |
| `cli-ctx` | 2 | 4.1k | 10s | 100% |
`cli-ctx` didn't just catch up with MCP – it beat it. Three times fewer input tokens and faster wall-clock.
Where did the effect come from? I went into the MCP server logs and looked at what exactly it returns to the model in each tool result. And here's what was in the very first call_aws response:
"ResponseMetadata": {
"RequestId": "...",
"HTTPStatusCode": 200,
"HTTPHeaders": {
"date": "Wed, 08 Apr 2026 00:15:21 GMT",
...
}
}
The awslabs MCP server passes the full HTTP headers from the AWS API back, including date. Raw AWS CLI v2 returns only the response body without headers. The model on MCP knows, from the very first tool call, what today's date is; the model on raw CLI does not, because its training cutoff is somewhere in 2025, and it honestly assumes it's still 2025.
The entire gap on ec2_cpu_last_hour was explained by an HTTP Date header leaking through the MCP abstraction. Four lines in the system prompt reproduce the effect for free.
That was the moment I rethought all the previous results.
Three mechanisms I found and closed
The first mechanism – effect A, HTTP metadata – is already covered in the previous section. Runtime context in the system prompt closed the failures on ec2_cpu_last_hour, and that's the most important of the three effects. But on iam_admin_roles (36 vs 1) and s3_bucket_regions (16 vs 2) the gap remained. So there had to be at least one more thing going on.
Effect B: batch calling
On s3_bucket_regions in the MCP run I looked at the second tool call and saw this:
```python
call_aws(cli_command=[
    "aws s3api get-bucket-location --bucket bucket-1",
    "aws s3api get-bucket-location --bucket bucket-2",
    # ... (15 items total)
])
```
An array of 15 commands. In a single call. I went to the call_aws description and found this section:
> Batch Running: The tool can also run multiple independent commands at the same time. Call this tool with multiple CLI commands whenever possible. You can call at most 20 CLI commands in batch mode.
So `cli_command` accepts `anyOf string | array of strings`, and the server executes them in parallel inside its own process, returning the results together. The model reads this and uses it.
My original `aws_cli` accepted only a string. I wrote `tools_cli_v2.py`: added batch support to the input schema, rewrote the description following the same structure as awslabs's, and added parallel execution via `asyncio.gather`.
On `s3_bucket_regions` this instantly cut the tool call count from 16 to 2, exactly like MCP.
Effect C: "smart" command choice – turned out to be a benchmark bug
But on `iam_admin_roles` the effect remained. The model on `cli-v2` kept doing 36 calls. I was convinced this was some subtle feature of how the model models command selection, and I was preparing an "unexplained mystery" section for the article.
Then I ran `cli-v2` on `iam_admin_roles` again and carefully looked at the raw trace instead of the aggregated numbers. Here is the first tool call:
```
1. aws_cli (0ms, error=True)
   aws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
     --entity-filter Role --output json
```
Execution time 0ms. `error=True`. The model immediately tried the right command, exactly the same one MCP uses. And got an error. Not from AWS – the error never reached AWS. The error came from my own `safety.py`:
```python
ALLOWED = {
    "iam": {
        "list-roles",
        "list-attached-role-policies",
        # list-entities-for-policy WAS NOT IN THIS LIST
    },
    ...
}
```
I wrote the whitelist based on how I pictured this task being solved. And I put in exactly the commands needed for the naive path. The model on CLI tried the optimal command, got rejected, fell back to the naive path and conscientiously walked through all 36 roles.
The awslabs MCP server has its own allowlist, significantly broader. And `list-entities-for-policy` is allowed there.
This was a benchmark bug, not a property of MCP. I added one line to the whitelist:
"iam": {
"list-roles",
"list-attached-role-policies",
"list-entities-for-policy", # <- this one
},
And re-ran cli-v2 iam_admin_roles:
| Variant | Calls | Input tokens | Wall |
|---|---|---|---|
| `cli` plain | 36 | 20k | 68s |
| `mcp` | 1 | 5k | 4s |
| `cli-v2` (whitelist fixed) | 1 | 2.8k | 4s |
Exactly one tool call. And at the same time fewer input tokens than MCP, because we have one tool description of ~3000 characters and MCP has two descriptions totalling ~5800 characters.
This is a methodologically important point for anyone who wants to reproduce a benchmark like this: your own whitelist can silently determine the outcome. If the allowlist only covers the commands needed for the naive strategy, you aren't measuring the transport, you're measuring your whitelist.
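The shape of such a check can be sketched in a few lines. This is an illustration of the idea, not the repo's `safety.py` (which also has an injection guard and more services); the function name `check_command` and the simplified parsing are mine.

```python
import shlex

# The whitelist maps AWS service -> allowed operations. Whether the *optimal*
# operation is present silently decides what strategy the model can use.
ALLOWED = {
    "iam": {"list-roles", "list-attached-role-policies", "list-entities-for-policy"},
    "s3api": {"list-buckets", "get-bucket-location", "get-bucket-policy"},
    "ec2": {"describe-instances"},
}

def check_command(cmd: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a command of the form 'aws <service> <operation> ...'."""
    parts = shlex.split(cmd)
    if len(parts) < 3 or parts[0] != "aws":
        return False, "expected 'aws <service> <operation> ...'"
    service, operation = parts[1], parts[2]
    if operation not in ALLOWED.get(service, set()):
        return False, f"'{service} {operation}' is not in the whitelist"
    return True, "ok"
```

The failure mode from effect C is exactly one missing set member: drop `list-entities-for-policy` from the `iam` entry and the model is forced onto the 36-call path.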
Final table: cli-full vs mcp at n=10
`cli-full` is the union of all the improvements above in a single transport:
- Batch input (cli-v2 tool spec).
- Rich tool description with batch examples and best practices (cli-v2).
- Runtime context in the system prompt (cli-ctx).
- Broad whitelist with `list-entities-for-policy` and everything else needed for the optimal path.
At n=10 per cell, median:
| Task | cli-full input | mcp input | Δ input | cli-full calls | mcp calls | cli-full ok% | mcp ok% |
|---|---|---|---|---|---|---|---|
| `ec2_running` | 3,053 | 5,368 | -43% | 1 | 1 | 90%* | 100% |
| `s3_bucket_policy` | 2,975 | 5,425 | -45% | 1 | 1 | 100% | 100% |
| `s3_bucket_regions` | 5,801 | 14,317 | -60% | 2 | 2 | 100% | 100% |
| `iam_admin_roles` | 2,934 | 5,213 | -44% | 1 | 1 | 100% | 100% |
| `ec2_cpu_last_hour` | 5,345 | 9,461 | -44% | 2 | 2 | 100% | 100% |
* The single failure (`ec2_running`, cli-full run #9) was an HTTP 529 Overloaded from the Anthropic API. That's infrastructure noise, not a transport problem. I deliberately did not retry failed runs to avoid masking real failures, and this lone 529 made it into the stats as 90%. MCP could just as easily have caught the same 529; it just got lucky.
cli-full beats MCP on input tokens on every one of the five tasks, by 43–60%. Success-rate parity.
On wall clock MCP wins on 4 of 5 tasks. Reason: wall clock is dominated by AWS API call time, not by model turn time. Tokens don't translate directly into seconds. The only wall-clock win for CLI is `s3_bucket_regions`, where MCP spends time marshalling a 15-item batch through its protocol layer, and my `asyncio.gather` does not.
The right question: how much is your engineering time worth
This is where the popular "CLI is better than MCP" narrative breaks.
My cli-full is a few hundred lines of code and half a day of debugging. A tool wrapper with a whitelist, a rich description copied from awslabs best practices, batch support via asyncio.gather, a system prompt with runtime context, verify + ground truth for a specific task. And that's only for AWS. For GCP, for Linear, for Notion – everything from scratch.
`awslabs.aws-api-mcp-server` is one command (`uvx awslabs.aws-api-mcp-server@latest`) and one environment variable. It works with every AWS service, not just five tasks. Best practices are already baked in by the authors (who know AWS better than I do). Updates come with `@latest`. Read-only mode is an environment variable.
MCP pays with service knowledge, CLI pays with engineering labour. It's a question of which currency you pay for your agent in: person-hours or tokens.
When to choose MCP
- High velocity, low QPS. New project, the agent has to work tomorrow. MCP installs in 30 seconds and covers everything.
- Broad surface. The agent pokes at EC2, S3, IAM, Lambda, CloudWatch, RDS, ECS. Writing a CLI wrapper for each service is an unrealistic budget.
- Polyglot environment. AWS today, GCP tomorrow, Notion the day after. Per-service CLI wrappers don't scale; one MCP server per service does.
- You're not an expert on the service. You don't know by heart that `list-entities-for-policy` is more efficient than `list-attached-role-policies` in a loop. The awslabs authors do. You reuse their knowledge by paying a few extra tokens.
- Low QPS. A few hundred agent invocations a day. Saving 8k tokens per request is a few dollars a month. Engineering time costs more.
When to choose a purpose-built CLI
- High-QPS production. Even at 1,000 calls/day, 8,000 extra input tokens per call at $3 per 1M input tokens is about $24/day, roughly $8.8k a year, which is enough to hire a contractor to write the tool wrapper once. At 1,000,000 calls/day the same overhead is about $24,000/day.
- Narrow, stable task set. The agent does five specific things. A narrow whitelist and a short description will be more compact than any universal MCP server.
- Full control over the context. Every token in the system prompt and tool description is yours. No ~3KB of hidden awslabs guidance, no update surprises, no external dependency that might suddenly change.
- Compliance / audit. Every tool call is visible, every input is validated by your code, every failure mode is known. MCP adds a protocol layer between you and the AWS API that some audits won't accept.
- You already have the knowledge. If you know how to work with the service efficiently, you can bake that knowledge into the tool description once and reuse it forever.
Checklist: how to build a cli-full equivalent
If after all this you've decided your use case is CLI, here are six items that turn raw `subprocess.run` into something that beats awslabs MCP.
1. Accept batch input. Tool input schema:
"cli_command": {
"anyOf": [
{"type": "string"},
{"type": "array", "items": {"type": "string"}}
]
}
When the model passes an array, the runner executes the commands in parallel via `asyncio.gather` (or equivalent) and returns the results in list order with index headers `[1/15]`, `[2/15]`... Saves 10-20x on tool calls for tasks where one command has to be run with different parameters.
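A minimal sketch of that batch branch, assuming the `anyOf` schema above; the function names and the index-header format are mine, and real code would add timeouts, the whitelist check, and a byte budget:

```python
import asyncio

async def run_one(cmd: str) -> str:
    """Run a single CLI command and capture its combined output."""
    proc = await asyncio.create_subprocess_shell(
        cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT
    )
    out, _ = await proc.communicate()
    return out.decode()

async def run_tool(cli_command) -> str:
    """Accept either a single command string or a batch (list of strings)."""
    if isinstance(cli_command, str):
        return await run_one(cli_command)
    # Parallel execution; results come back in list order, each with an index header
    # so the model can attribute every result to its command.
    results = await asyncio.gather(*(run_one(c) for c in cli_command))
    n = len(results)
    return "\n".join(f"[{i}/{n}]\n{r}" for i, r in enumerate(results, 1))
```

With this in place, "region of each of 15 buckets" is one tool call instead of 15, which is where the 10-20x saving comes from.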
2. Put runtime context in the system prompt. Minimum – four lines:
```
Runtime context (provided by the runner, not by the tool):
- Current UTC time: <now>
- Default region: <region>
- Identity: <arn>
- This account is real and live; commands return real data.
```
This closes a whole class of problems where the model gets confused about dates, regions, or thinks it's working against documentation rather than production.
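Building that block is trivial; a sketch (the helper name and its parameters are mine — in practice you'd fill region and identity from your config or an STS call):

```python
from datetime import datetime, timezone

def runtime_context(region: str, identity_arn: str) -> str:
    """The four runtime-context lines to append to the system prompt."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "Runtime context (provided by the runner, not by the tool):\n"
        f"- Current UTC time: {now}\n"
        f"- Default region: {region}\n"
        f"- Identity: {identity_arn}\n"
        "- This account is real and live; commands return real data."
    )
```

About 118 tokens, computed once per run, and it reproduces the effect of the HTTP `date` header leaking through MCP.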
3. Write a rich tool description. Aim for 2,500–3,000 characters. A structure that works (copying awslabs):
- Short tool description (1 sentence).
- Key constraints (allowed commands, region defaults, auth model).
- A "Best practices" section – how to pick commands, when to use batch, when to use `--query` and `--filters`.
- An "Anti-patterns" section – an explicit "don't list-then-iterate if there's a more specific operation".
- 2-3 concrete examples covering different task categories.
- Restrictions: no shell pipes, no `--profile`, no substitution.
The model reads this as a cookbook. A badly written description means the model writes naive commands.
4. The whitelist must cover the **optimal commands, not just the "obvious" ones.** This is the point that cost me half a day. Ask yourself: "what would a senior AWS engineer write for this task?" – and make sure that command is in the whitelist. Not just the commands needed for the naive strategy.
5. Return structured output, not prose. Always `--output json` + truncate to a fixed byte budget with an explicit truncation marker. The model has to know that the response was truncated.
6. Forward tool errors to the model verbatim. When a command fails with `[exit=N] <stderr>`, return it to the model as-is. It can self-correct on the next turn. Silent failures waste turns for nothing.
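Items 5 and 6 fit in one small formatter. A sketch under my own assumptions — the byte budget and marker wording are illustrative, not the repo's:

```python
def format_result(stdout: str, stderr: str, exit_code: int,
                  byte_budget: int = 16_384) -> str:
    """Format a tool result: verbatim errors, truncated output with an explicit marker."""
    if exit_code != 0:
        # Forward the failure as-is so the model can self-correct on the next turn.
        return f"[exit={exit_code}] {stderr}"
    data = stdout.encode()
    if len(data) <= byte_budget:
        return stdout
    truncated = data[:byte_budget].decode(errors="ignore")
    # The model must be able to see that the response is incomplete.
    return truncated + f"\n[TRUNCATED: output exceeded {byte_budget} bytes]"
```

The error branch is what made effect C visible in my traces: `[exit=N]` with the whitelist's rejection message is exactly what the model saw before falling back to the naive path.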
Following these six rules turns a thin CLI wrapper into something that actually beats awslabs MCP on tokens. It takes about half a day per service.
Methodology notes
Three things I spent time on and which are worth knowing if you want to reproduce a benchmark like this.
First: `claude-agent-sdk` and Claude Code poison the context. For the first two days I was measuring CLI vs MCP through `claude-agent-sdk`, and the numbers were wild: 30k input tokens on a "how many running EC2" task. For a long time I thought it was protocol overhead, but no – it was Claude Code through the SDK dragging my entire user-level `~/.claude.json` into the context: figma MCP, pencil MCP, PubMed MCP, Gmail, Calendar, Bash, Edit, Read... 40+ tools from other servers I hadn't asked for. I rewrote the runner onto the direct Anthropic API – `cache_read` dropped from 30k to 0, and input tokens dropped to a more normal ~2k on a simple task. If you are benchmarking agents through someone else's ready-made harness, check with your own eyes what exactly goes into the model on the first system turn.
Second: your own whitelist is an invisible benchmark variable. I already wrote about this in the "effect C" section. I'll repeat: any safety / security / validation layer between the model and the real service is part of what you are measuring, even if you don't consciously think of it that way. If your whitelist forces the model into a narrow path, you are measuring the model's behaviour in that narrow path, not the model's behaviour in general.
Third: success rate and retry policy. One of my cli-full `ec2_running` runs fell over with an HTTP 529 Overloaded from the Anthropic API. In the stats that's a 90% success rate, even though it's not a transport issue. I decided not to retry, because the risk of masking real problems is too high. The article has to mention that 529 explicitly; otherwise the reader will compare 100% MCP against "90%" CLI and draw the wrong conclusion. Retry policy is yet another invisible variable the benchmark has to state out loud.
Reproducibility
Everything is in a public repo: github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark.
What's in there:
- `src/agent_loop.py` - ~150 lines of a self-contained agent loop on the direct Anthropic API.
- `src/tools_cli.py`, `tools_cli_v2.py`, `tools_mcp.py` - CLI and MCP transports. Plus the ablation variants (`tools_cli_rich.py`, `tools_cli_renamed.py`, `tools_cli_with_fake_suggest.py`) from the "five hypotheses" section.
- `src/runner.py` - CLI for running `--tasks <ids> --transports <ids> --n <N>`.
- `src/aggregate.py` - medians + IQR + success rate from raw JSONL.
- `src/safety.py` - whitelist + injection guard.
- `src/ground_truth.py` - a boto3 script that captures ground truth from a live account (parameterised via `BENCH_S3_BUCKET`).
- `results/scrubbed/final_summary.json` - aggregated numbers at n=10 across all (task, transport) cells. These are the same numbers as in the tables above, in machine-readable form.
- `results/scrubbed/sample_runs.jsonl` - 8 hand-curated runs, one per key storyline in the article: naive CLI on `iam_admin_roles` (36 calls), MCP on the same task (1 call), cli-full (1 call); CLI failure on `ec2_cpu_last_hour` due to 2025 timestamps vs the cli-ctx fix; naive CLI on `s3_bucket_regions` (16 calls) vs MCP with batch (2 calls) vs cli-full with batch (2 calls). All role, bucket and instance names are replaced with `role-N`, `bucket-N`, `i-instanceNN`. Metrics and full model response text are preserved.
- `docs/findings.md` - extended analytical notes, part of which went into this article.
Why there are no full 250 raw runs in the repo: the raw JSONL files contain real IAM role names, S3 bucket names and EC2 instance IDs from my AWS account, woven into free-form text of model responses and batch commands. They can't be auto-scrubbed without a manual mapping for every name, and one missed line is a leak. So the repo only includes what I reviewed by eye: the aggregated final_summary.json and 8 curated sample runs. If you want to see a full dataset, the best way to get a correct one is to run the benchmark on your own account in ~20 minutes (see below).
To run the benchmark under your own account:
- Create a dedicated IAM user with the `ReadOnlyAccess` policy + any extra grants for your tasks.
- `cp .env.example .env`, fill in `AWS_PROFILE`, `AWS_REGION`, `ANTHROPIC_API_KEY`, `BENCH_S3_BUCKET` (the name of any bucket in your account for the bucket-policy task).
- `python -m src.ground_truth` - captures ground truth for your account.
- `python -m src.runner --n 10` - runs the full series, ~15-20 minutes, ~$5-10 on the Anthropic API.
- `python -m src.aggregate results/raw/*.jsonl` - prints the table.
If you repeat this on your own stack and get different numbers, let me know – I'd love to compare.
Conclusions
- The popular "MCP loses to CLI" narrative rests on a single benchmark (Scalekit, n=1, GitHub Copilot MCP). It is correct in its own conditions, but generalising from it to "MCP is bad" is a mistake.
- AWS has already solved the schema-dump problem in `awslabs.aws-api-mcp-server`. Their flagship MCP server is essentially the CLI with two tools, and that's a fair benchmark partner for the raw AWS CLI.
- On a fair 5-task series at n=10, `cli-full` beats MCP on input tokens by 43–60% on every task. But that takes writing a tool wrapper, a whitelist, a system prompt, and a rich description. Half a day of engineering per service.
- The real question isn't "MCP or CLI" but "how much does your engineering time cost vs how much do your tokens cost". MCP wins on velocity, broad surface, polyglot environments, and low QPS. CLI wins on high QPS, a narrow task set, compliance, and when best-practice knowledge already lives in your head.
- All three gap mechanisms – HTTP metadata, batch calling, and a broad allowlist – are reproducible in a CLI tool via 4 lines in the system prompt, `anyOf string | array` in the input schema, and one line in the whitelist. None of them is a structural property of the MCP protocol.
- Methodologically: check with your own eyes what goes into the model's context, treat your own whitelist as a benchmark variable, and state your retry policy explicitly when reporting success rate.
If after all this you look at your own use case and decide you want a well-designed CLI tool, the six-item checklist is above. If you decide you want MCP: `uvx awslabs.aws-api-mcp-server@latest` and you're in the game.
Both options are correct answers to different questions.