TL;DR

Zero LLM, zero network, zero API keys - a CI-gate harness for stdio MCPs.

I built 8 MCPs in a month. The first 2 broke on first real use. Then I wrote a test harness. The next 6 shipped green on first run, cutting build time from 90-120 minutes per server to 50-55 minutes - roughly a 50% reduction measured across eight servers. The harness ships today as mcp-stdio-test on PyPI and as mcp-custom-template on GitHub. This is the v1 field report. A v2 post with three months of data is scheduled for ~8 weeks from now.

The gap

MCPs are easy to write. The official Python SDK lets you stand up a server in twenty lines:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-mcp")

@mcp.tool()
def add(a: int, b: int) -> int:
    return a + b

if __name__ == "__main__":
    mcp.run()

Easy to write - hard to know you wrote it correctly. The first regression that bit me was silent: a decorator refactor dropped one tool off the registry. tools/list returned five items instead of six. No error, no warning, no stack trace. The server came up green; Claude just quietly stopped being able to call one tool until a real invocation finally failed with “unknown tool” deep inside a chain of agent steps.

The second regression was a schema mismatch after renaming an argument. The tool still appeared in tools/list. It just threw on every call. Again no error surfaced at startup.

Both failures have the same root cause: writing an MCP and validating an MCP are not the same act. The SDK checks the first. Nothing checks the second. So I wrote the nothing.
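At its core, that validation is just a newline-delimited JSON-RPC exchange over a child process's stdio, followed by an assertion. A minimal sketch of the idea - the inlined child below is a fake stand-in for a real MCP server, and the tool names are made up:

```python
# A minimal sketch of the check: newline-delimited JSON-RPC over a child
# process's stdio, then an assertion on tools/list. The inlined child is
# a fake stand-in for a real MCP server (illustration only).
import json
import subprocess
import sys

FAKE_SERVER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    if req.get("method") == "tools/list":
        print(json.dumps({
            "jsonrpc": "2.0", "id": req["id"],
            "result": {"tools": [{"name": "search"}, {"name": "add"}]},
        }), flush=True)
"""

proc = subprocess.Popen(
    [sys.executable, "-c", FAKE_SERVER],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write(json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"}) + "\n")
proc.stdin.flush()

resp = json.loads(proc.stdout.readline())
tools = [t["name"] for t in resp["result"]["tools"]]
assert len(tools) == 2 and "search" in tools  # the --expect-* flags, inlined
print("tool inventory OK:", tools)

proc.stdin.close()
proc.wait()
```

Swap the fake server for a real one and the assertion for CLI flags, and you have the harness in miniature.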

The three-piece harness

Three moving parts, each intentionally small.

1. mcp-stdio-test - the CLI

A ~600-LOC Python package that speaks JSON-RPC to any stdio MCP server, runs the initialize handshake, optionally calls tools/list or a specific tool, and exits with a well-defined code. No LLM. No network. No API keys. Zero runtime dependencies.

# Run the handshake, then list the tools.
mcp-stdio-test path/to/server.py --list-tools

# Assert the tool inventory. Exit 4 on mismatch.
mcp-stdio-test path/to/server.py --list-tools --expect-count 5 --expect-tool search

# Self-check the environment.
mcp-stdio-test doctor

The exit codes are the whole contract: 0 OK, 1 handshake failure, 2 tools/list failure, 3 tools/call failure, 4 assertion mismatch. (argparse-style 2 for a bad CLI invocation is outside this ladder.) Green CI line = shippable. Anything else = broken, with the exit code telling you which layer broke before you read any output.
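In CI you rarely decode these by hand, but the mapping is small enough to inline in a wrapper script. A hypothetical sketch of that dispatch - the dict and function name are mine, only the code meanings come from the ladder above:

```python
# Hypothetical CI wrapper around the documented exit-code ladder
# (0/1/2/3/4). The dict and function name are mine, not part of the CLI.
EXIT_MEANINGS = {
    0: "OK",
    1: "handshake failure",
    2: "tools/list failure",
    3: "tools/call failure",
    4: "assertion mismatch",
}

def explain(code: int) -> str:
    # Anything outside the ladder (e.g. argparse's usage-error 2 on a
    # malformed invocation) falls through to the default.
    return EXIT_MEANINGS.get(code, "usage error or unknown")

print(explain(4))  # → assertion mismatch
```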

2. test_<mcp>.py - five assertions per server

Each MCP repo ships a test_<name>.py file that invokes mcp-stdio-test as a subprocess and asserts on the results. Five assertions is enough:

  • Tool count matches.
  • Each tool by name is present.
  • One call per tool returns a non-error shape.

The test file is the one thing you actually edit per change. Add a tool, bump the expected count from 4 to 5, add an --expect-tool new_tool assertion, commit. If that line is missing in the diff, reviewer says no.
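A per-repo test file along these lines is a few dozen lines at most. A sketch - the server path and tool names are hypothetical, and the flags are the ones shown earlier:

```python
# Sketch of a per-repo test_<name>.py. SERVER and EXPECTED_TOOLS are
# hypothetical placeholders; edit them per change.
import shutil
import subprocess

SERVER = "server.py"                 # hypothetical path
EXPECTED_TOOLS = ["search", "add"]   # hypothetical names

def inventory_args(tools):
    # --expect-count plus one --expect-tool flag per name.
    args = ["--expect-count", str(len(tools))]
    for t in tools:
        args += ["--expect-tool", t]
    return args

if shutil.which("mcp-stdio-test") is None:
    print("mcp-stdio-test not on PATH; pip install mcp-stdio-test")
else:
    cmd = ["mcp-stdio-test", SERVER, "--list-tools", *inventory_args(EXPECTED_TOOLS)]
    code = subprocess.run(cmd).returncode
    print("harness exit code:", code)
```

Deriving the count from the name list keeps the diff to one line per new tool, which is exactly the line the reviewer looks for.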

3. DEPLOY_CHECKLIST.md - the six-gate ladder

Gate          What it proves
Dev           Code parses, imports cleanly.
Unit          Every tool is registered with a schema.
Integration   Each tool returns a valid shape.
Staged        Registered with the client, allow-listed.
Production    Used in >=1 real session without regression.
Routing       Whatever layer picks tools per request knows the new surface.

The checklist is a doc, not a tool. It lives in the template repo. You copy it, you follow it. Skipping gates catches up with you in the most embarrassing way possible.
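The earliest gate is cheap enough to automate anyway. A sketch of the Dev gate's "code parses" half - the throwaway temp file stands in for a real server.py:

```python
# Dev-gate sketch: "code parses". A throwaway temp file stands in for a
# real server.py; in a repo you would point py_compile at the real file.
import pathlib
import py_compile
import tempfile

src = pathlib.Path(tempfile.mkdtemp()) / "server.py"
src.write_text("def add(a: int, b: int) -> int:\n    return a + b\n")

# doraise=True turns a syntax error into an exception instead of a
# message on stderr, which is what you want in a pre-commit hook.
compiled = py_compile.compile(str(src), doraise=True)
print("Dev gate: parses cleanly ->", compiled is not None)
```

The "imports cleanly" half is the same idea with importlib against the real module; the later gates need a live client and cannot be reduced to a one-liner.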

Walkthrough: new MCP, zero to green

Fork, edit, run. Here is the full walkthrough from an empty directory to a passing CI line.

# 1. Fork and clone the template.
gh repo create my-org/weather-mcp --template dtchen07/mcp-custom-template --clone
cd weather-mcp

# 2. Install the harness and the MCP SDK.
python -m pip install mcp-stdio-test mcp

# 3. Confirm the toolchain is healthy.
mcp-stdio-test doctor
# [PASS] python: 3.12.1
# [PASS] mcp_sdk: 1.27.0
# [PASS] fixture: stdio handshake complete; 0 tool(s) listed
# Overall: PASS

# 4. Replace the two example tools in server.py with real ones.
#    Edit test_my_mcp.py: bump --expect-count, list the new tool names.

# 5. Run the harness.
python test_my_mcp.py
# All 5 assertions passed.

# 6. Push. CI runs the same five assertions on Linux, macOS, Windows.
git push

Wall-clock time from git clone to first green test, measured on a fresh MacBook with no cached dependencies: under 10 minutes, almost all of which is writing the two real tool bodies. The harness itself (init + list + call + assert on a 2-tool server) takes about 2 seconds of machine time; everything else is you editing Python. The v2 field report will have the 30-day moving average across more servers and more contributors.

Nothing on that list is novel. The novelty is the absence of other steps. No LLM judge. No scored quality report. No pytest plugin glue. No per-server fixture scaffold. Just: the server runs, the tools are where you said they are, the calls return what you said they would.

The feedback loop

The harness closes the inner loop: did the MCP I wrote match the MCP I said I wrote? A second tool closes the outer loop: is the MCP I wrote actually being used?

I run a separate MCP called claude-usage that introspects Claude’s own tool-call log. It has three tools I actually use:

  • usage_top_tools - ranks tools by invocation count over a window.
  • usage_dead_mcps - lists tools that have never been called.
  • usage_top_responses - surfaces the top Claude responses that triggered each tool, so I can see why Claude chose to call it.

usage_dead_mcps is the one I care about most. In week 2 it flagged three tools across two servers as never-called. Two of the three turned out to be genuinely redundant with existing tools - I deleted them. The third had a typo in the tool name that Claude was silently routing around. The harness never would have caught that; the usage log did.
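The core of a dead-tool check is nothing more than a set difference between the registry and the call log. A toy sketch with made-up names - this is not the actual claude-usage implementation:

```python
# Toy sketch of the dead-tool check: a set difference between what the
# server registers and what the call log shows. Tool names are made up;
# this is not the actual claude-usage implementation.
from collections import Counter

registered = {"search", "add", "usage_top_tools"}   # from tools/list
call_log = ["search", "search", "add"]              # from the client's log

counts = Counter(call_log)                 # invocation counts per tool
dead = sorted(registered - counts.keys())  # registered but never invoked
print("never called:", dead)  # → ['usage_top_tools']
```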

Harness closes the inner loop. Usage log closes the outer. Between them, the functional score I track across ten operational dimensions went from 65 to 91 over six weeks. The harness is not the whole story - credential rotation, config-sync, the checklist ritual all contributed - but it is the single change whose delta I can measure cleanest.

Where this sits vs. neighbors

The space is more crowded than I expected when I started. Five adjacent tools, each doing a distinct job. Honest comparison:

  • mcp-stdio-test (this) - per-repo declarative CI gates. Transport: stdio. LLM: no. Exit codes: yes (0/1/2/3/4, tiered). Tool-count / name assertions: yes (first-class, user-authored).
  • mcp-lint (LuxshanLux) - 23 deterministic lint rules, scored. Transport: stdio + SSE. LLM: no. Exit codes: yes (0/1/2). Assertions: no.
  • mcp-tester (saqadri, “MCP-Eval”) - pytest-style agent evaluation + cost. Transport: stdio. LLM: yes. Exit codes: JSON + pytest. Assertions: no.
  • mcp-doctor (destilabs) - diagnostic + agent-friendliness scorer. Transport: stdio + HTTP. LLM: partial (only generate-dataset). Exit codes: scored report. Assertions: no.
  • mcp-probe (conikeec) - Rust TUI debugger + compliance suite. Transport: stdio / SSE / HTTP. LLM: no. Exit codes: --fail-fast pass/fail. Assertions: built-in suite, not user-authored.
  • MCP Inspector (official) - interactive debugger + --cli. Transport: stdio / SSE / HTTP. LLM: no. Exit codes: JSON (no assertions). Assertions: no.

Two distinctions hold up on direct comparison. First: none of the five ship per-repo, user-authored assertions. mcp-lint, mcp-doctor, and mcp-probe all ship fixed rule suites; you can’t commit a line that says “my server has exactly five tools named X, Y, Z” and have CI fail the moment that drifts. mcp-stdio-test’s --expect-count / --expect-tool flags compose into exactly that line. Second: only two neighbors publish numbered CI exit codes at all - mcp-lint (0/1/2) and mcp-probe (pass/fail) - and neither distinguishes a tool-count mismatch from a handshake failure from a usage error. Those three fail in wildly different ways and should say so at the exit code.

If you want a scored quality report, use mcp-doctor. If you want agent-driven end-to-end evaluation, use mcp-tester. If you want to click through your server, use MCP Inspector. If mcp-lint or Inspector could usefully absorb this pattern upstream, I will happily open a PR. Fewer better tools beat more overlapping tools.

What v1 gets wrong

Six weeks is not enough data to see slow-drift bugs. Eight servers is not enough to see cross-server interaction patterns. One developer on one desktop is not enough to surface the collaboration pathologies that matter at team scale. The windowing on the build-time numbers is biased toward the servers I wrote last - they benefit from the harness and from my familiarity with the pattern.

Explicit non-goals I am holding the line on for v1: no HTTP / SSE transport, no LLM-in-the-loop evaluation, no multi-MCP orchestration, no pytest plugin glue. Each of those is a larger project than the one I am willing to support at the stated SLA (7-day best-effort response, no SLA on fixes).

What v2 will bring

v0.2.0 is targeted for ~8 weeks from today. The data I want in it:

  • Three months of build-time numbers, unweighted, across at least ten servers.
  • At least one external contributor’s MCP built with the template.
  • Telemetry on protocol-drift incidents (there have been zero in six weeks; I expect this to change).
  • An honest “what stopped working” section if any of this v1 framing turned out to be wrong.

If the data is not there by day 56, I will ship a “what I have learned” delta post instead of a press release. The commitment is to ship something honest, not to ship the thing I said I would.

What’s next (unrelated)

The next project, which is not this one, is an LLM-as-judge evaluation harness for MCP tools that have natural-language outputs. The harness in this post is the CI gate. The next one is the regression test for the gate itself: is the tool answering questions well, not just answering them?

Different problem. Different tool. Shipping separately. Subscribe to the RSS feed if you want the v2 post when it lands.


Try it

pip install mcp-stdio-test
# Template repo:
gh repo create my-mcp --template dtchen07/mcp-custom-template --clone

Verify the supply chain yourself:

pip download mcp-stdio-test
gh attestation verify mcp_stdio_test-*.whl --owner dtchen07

Suggested citation: Chen, D. (2026). Harness-first MCP deployment: a field report from 6 weeks and 8 custom servers. https://dtchen07.github.io/2026/04/harness-first-mcp/

This post is licensed under CC BY 4.0. The accompanying code at mcp-stdio-test and mcp-custom-template is licensed under MIT.

No data is collected on this site beyond a self-hosted analytics counter (GoatCounter, no cookies, aggregate only).

Updates

  • 2026-04-17: Initial publication.