Harness-first MCP deployment: a field report from 6 weeks and 8 custom servers
TL;DR
Zero LLM, zero network, zero API keys - a CI-gate harness for stdio MCPs.
I built 8 MCPs in six weeks. The first 2 broke on first real use. Then I wrote
a test harness. The next 6 shipped green on first run, cutting build time
from 90-120 minutes per server to 50-55 minutes - a 55% reduction measured
across eight servers. The harness ships today as
mcp-stdio-test on PyPI and as
mcp-custom-template on
GitHub. This is the v1 field report. A v2 post with three months of data is
scheduled for ~8 weeks from now.
The gap
MCPs are easy to write. The official Python SDK lets you stand up a server in a dozen lines:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-mcp")

@mcp.tool()
def add(a: int, b: int) -> int:
    return a + b

if __name__ == "__main__":
    mcp.run()
```
Easy to write - hard to know you wrote correctly. The first regression
that bit me was silent: a decorator refactor dropped one tool off the
registry. tools/list returned five items instead of six. No error,
no warning, no stack trace. The server came up green; Claude just quietly
stopped being able to call one tool until a real invocation finally failed
with “unknown tool” deep inside a chain of agent steps.
The second regression was a schema mismatch after renaming an argument. The
tool still appeared in tools/list. It just threw on every call. Again no
error surfaced at startup.
Both failures have the same root cause: writing an MCP and validating an MCP are not the same act. The SDK checks the first. Nothing checks the second. So I wrote the nothing.
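The check that catches both regressions is mechanically simple. Here is a minimal sketch of it, assuming MCP's newline-delimited JSON-RPC stdio framing and the 2024-11-05 protocol revision; the helper names are mine for illustration, not the harness's API:

```python
import json

def handshake_messages(protocol_version: str = "2024-11-05") -> list[str]:
    """The three messages a validator writes to the server's stdin:
    initialize request, initialized notification, then tools/list."""
    msgs = [
        {"jsonrpc": "2.0", "id": 1, "method": "initialize",
         "params": {"protocolVersion": protocol_version,
                    "capabilities": {},
                    "clientInfo": {"name": "validator", "version": "0"}}},
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]
    return [json.dumps(m) for m in msgs]

def tool_names(tools_list_reply: str) -> list[str]:
    """Extract tool names from the tools/list reply. The dropped-tool
    regression lives here: the reply is a perfectly valid success
    response, just one name short. Only counting catches it."""
    reply = json.loads(tools_list_reply)
    return [t["name"] for t in reply["result"]["tools"]]
```

Pipe the three messages into the server process, read the reply carrying id 2, and assert on `tool_names(...)`; that one assertion turns both silent regressions into loud failures.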
The three-piece harness
Three moving parts, each intentionally small.
1. mcp-stdio-test - the CLI
A ~600-LOC Python package that speaks JSON-RPC to any stdio MCP server,
runs the initialize handshake, optionally calls tools/list or a specific
tool, and exits with a well-defined code. No LLM. No network. No API keys.
Zero runtime dependencies.
```shell
# Smoke-test the handshake and list the tool inventory.
mcp-stdio-test path/to/server.py --list-tools

# Assert the tool inventory. Exit 4 on mismatch.
mcp-stdio-test path/to/server.py --list-tools --expect-count 5 --expect-tool search

# Self-check the environment.
mcp-stdio-test doctor
```
The exit codes are the whole contract: 0 OK, 1 handshake failure, 2
tools/list failure, 3 tools/call failure, 4 assertion mismatch.
(argparse-style 2 for a bad CLI invocation is outside this ladder.)
Green CI line = shippable. Anything else = broken, with the exit code
telling you which layer broke before you read any output.
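In a CI wrapper, the ladder translates directly into a diagnostic string. A sketch of that translation; the mapping mirrors the codes above, but the functions are illustrative, not part of the package:

```python
import subprocess
import sys

# The exit-code ladder, as data.
LADDER = {
    0: "OK",
    1: "handshake failed",
    2: "tools/list failed",
    3: "tools/call failed",
    4: "assertion mismatch",
}

def explain(code: int) -> str:
    """Translate a harness exit code into the layer that broke."""
    return LADDER.get(code, f"unexpected exit code {code}")

def gate(cmd: list[str]) -> None:
    """Run the harness; fail CI with the layer name, not just a number."""
    code = subprocess.run(cmd).returncode
    if code != 0:
        sys.exit(f"mcp-stdio-test: {explain(code)}")
```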
2. test_<mcp>.py - five assertions per server
Each MCP repo ships a test_<name>.py file that invokes mcp-stdio-test
as a subprocess and asserts on the results. Three kinds of assertion -
about five assertions per server in practice - are enough:
- Tool count matches.
- Each tool by name is present.
- One call per tool returns a non-error shape.
The test file is the one thing you actually edit per change. Add a tool,
bump the expected count from 4 to 5, add an --expect-tool new_tool
assertion, commit. If that line is missing in the diff, reviewer says no.
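Concretely, the per-repo file can be little more than flag construction plus one subprocess call. A hypothetical sketch for a two-tool server; `server.py`, the tool names, and `expect_args` are illustrative, not the template's exact contents:

```python
import subprocess
import sys

# Illustrative tool surface for a hypothetical two-tool weather server.
EXPECTED_TOOLS = ["current", "forecast"]

def expect_args(tools: list[str]) -> list[str]:
    """Build the --expect-count / --expect-tool assertion flags."""
    args = ["--list-tools", "--expect-count", str(len(tools))]
    for name in tools:
        args += ["--expect-tool", name]
    return args

def main() -> None:
    # The harness exits 4 if the declared surface has drifted.
    result = subprocess.run(
        ["mcp-stdio-test", "server.py", *expect_args(EXPECTED_TOOLS)]
    )
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```

Adding a tool is then a one-line diff: append its name to `EXPECTED_TOOLS` and the count assertion updates itself.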
3. DEPLOY_CHECKLIST.md - the six-gate ladder
| Gate | What it proves |
|---|---|
| Dev | Code parses, imports cleanly. |
| Unit | Every tool is registered with a schema. |
| Integration | Each tool returns a valid shape. |
| Staged | Registered with the client, allow-listed. |
| Production | Used in >=1 real session without regression. |
| Routing | Whatever layer picks tools per request knows the new surface. |
The checklist is a doc, not a tool. It lives in the template repo. You copy it, you follow it. Skipping gates catches up with you in the most embarrassing way possible.
Walkthrough: new MCP, zero to green
Fork, edit, run. Here is the full walkthrough from an empty directory to a passing CI line.
```shell
# 1. Fork and clone the template.
gh repo create my-org/weather-mcp --template dtchen07/mcp-custom-template --clone
cd weather-mcp

# 2. Install the harness and the MCP SDK.
python -m pip install mcp-stdio-test mcp

# 3. Confirm the toolchain is healthy.
mcp-stdio-test doctor
# [PASS] python: 3.12.1
# [PASS] mcp_sdk: 1.27.0
# [PASS] fixture: stdio handshake complete; 0 tool(s) listed
# Overall: PASS

# 4. Replace the two example tools in server.py with real ones.
#    Edit test_my_mcp.py: bump --expect-count, list the new tool names.

# 5. Run the harness.
python test_my_mcp.py
# All 5 assertions passed.

# 6. Push. CI runs the same five assertions on Linux, macOS, Windows.
git push
```
Wall-clock time from git clone to first green test, measured on a fresh
MacBook with no cached dependencies: under 10 minutes, almost all of
which is writing the two real tool bodies. The harness itself (init +
list + call + assert on a 2-tool server) takes about 2 seconds of machine
time; everything else is you editing Python. The v2 field report will
have the 30-day moving average across more servers and more contributors.
Nothing on that list is novel. The novelty is the absence of other steps. No LLM judge. No scored quality report. No pytest plugin glue. No per-server fixture scaffold. Just: the server runs, the tools are where you said they are, the calls return what you said they would.
The feedback loop
The harness closes the inner loop: did the MCP I wrote match the MCP I said I wrote? A second tool closes the outer loop: is the MCP I wrote actually being used?
I run a separate MCP called claude-usage that introspects Claude’s own
tool-call log. It has three tools I actually use:
- usage_top_tools - ranks tools by invocation count over a window.
- usage_dead_mcps - lists tools that have never been called.
- usage_top_responses - surfaces the top Claude responses that triggered each tool, so I can see why Claude chose to call it.
usage_dead_mcps is the one I care about most. In week 2 it flagged three
tools across two servers as never-called. Two of the three turned out to be
genuinely redundant with existing tools - I deleted them. The third had a
typo in the tool name that Claude was silently routing around. The harness
never would have caught that; the usage log did.
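The dead-tool check itself is just a diff between the registered surface and the invocation log. A sketch of the idea; the data shapes are assumptions, not claude-usage's real log schema:

```python
from collections import Counter

def dead_tools(registered: list[str], calls: list[str]) -> list[str]:
    """Return registered tools that never appear in the call log."""
    counts = Counter(calls)
    return [name for name in registered if counts[name] == 0]

def top_tools(calls: list[str], n: int = 5) -> list[tuple[str, int]]:
    """usage_top_tools-style ranking: most-invoked first."""
    return Counter(calls).most_common(n)
```

A typo'd registration surfaces here exactly as it did for me: the misspelled name is registered but never resolves to a call, so it lands in the dead list.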
Harness closes the inner loop. Usage log closes the outer. Between them, the functional score I track across ten operational dimensions went from 65 to 91 over six weeks. The harness is not the whole story - credential rotation, config-sync, the checklist ritual all contributed - but it is the single change whose delta I can measure cleanest.
Where this sits vs. neighbors
The space is more crowded than I expected when I started. Five adjacent tools, each doing a distinct job. Honest comparison:
| Tool | Primary job | Transport | LLM | Exit-code CI | Tool-count / name assertions |
|---|---|---|---|---|---|
| mcp-stdio-test (this) | Per-repo declarative CI gates | stdio | no | yes (0/1/2/3/4 tiered) | yes (first-class, user-authored) |
| mcp-lint (LuxshanLux) | 23 deterministic lint rules, scored | stdio + SSE | no | yes (0/1/2) | no |
| mcp-tester (saqadri, “MCP-Eval”) | Pytest-style agent evaluation + cost | stdio | yes | JSON + pytest | no |
| mcp-doctor (destilabs) | Diagnostic + agent-friendliness scorer | stdio + HTTP | partial (only generate-dataset) | scored report | no |
| mcp-probe (conikeec) | Rust TUI debugger + compliance suite | stdio / SSE / HTTP | no | --fail-fast pass/fail | built-in suite, not user-authored |
| MCP Inspector (official) | Interactive debugger + --cli | stdio / SSE / HTTP | no | JSON (no assertions) | no |
Two distinctions hold up on direct comparison. First: none of the five
ship per-repo, user-authored assertions. mcp-lint, mcp-doctor, and
mcp-probe all ship fixed rule suites; you can’t commit a line that says
“my server has exactly five tools named X, Y, Z” and have CI fail the
moment that drifts. mcp-stdio-test’s --expect-count / --expect-tool
flags compose into exactly that line. Second: only two neighbors
publish numbered CI exit codes at all — mcp-lint (0/1/2) and
mcp-probe (pass/fail) — and neither distinguishes
tool-count mismatch from handshake failure from usage error. Those
three fail in wildly different ways and should tell you so at the exit
code.
If you want a scored quality report, use mcp-doctor. If you want
agent-driven end-to-end evaluation, use mcp-tester. If you want to
click through your server, use MCP Inspector. If mcp-lint or Inspector
could usefully absorb this pattern upstream, I will happily open a PR.
Fewer better tools beat more overlapping tools.
What v1 gets wrong
Six weeks is not enough data to see slow-drift bugs. Eight servers is not enough to see cross-server interaction patterns. One developer on one desktop is not enough to surface the collaboration pathologies that matter at team scale. The windowing on the build-time numbers is biased toward the servers I wrote last - they benefit from the harness and from my familiarity with the pattern.
Explicit non-goals I am holding the line on for v1: no HTTP / SSE transport, no LLM-in-the-loop evaluation, no multi-MCP orchestration, no pytest plugin glue. Each of those is a larger project than the one I am willing to support at the stated SLA (7-day best-effort response, no SLA on fixes).
What v2 will bring
v0.2.0 is targeted for ~8 weeks from today. The data I want in it:
- Three months of build-time numbers, unweighted, across at least ten servers.
- At least one external contributor’s MCP built with the template.
- Telemetry on protocol-drift incidents (there have been zero in six weeks; I expect this to change).
- An honest “what stopped working” section if any of this v1 framing turned out to be wrong.
If the data is not there by day 56, I will ship a “what I have learned” delta post instead of a press release. The commitment is to ship something honest, not to ship the thing I said I would.
What’s next (unrelated)
The next project, which is not this one, is an LLM-as-judge evaluation harness for MCP tools that have natural-language outputs. The harness in this post is the CI gate. The next one is the regression test for the gate itself: is the tool answering questions well, not just answering them?
Different problem. Different tool. Shipping separately. Subscribe to the RSS feed if you want the v2 post when it lands.
Try it
```shell
pip install mcp-stdio-test

# Template repo:
gh repo create my-mcp --template dtchen07/mcp-custom-template --clone
```

Verify the supply chain yourself:

```shell
pip download mcp-stdio-test
gh attestation verify mcp_stdio_test-*.whl --owner dtchen07
```
Suggested citation: Chen, D. (2026). Harness-first MCP deployment: a field report from 6 weeks and 8 custom servers. https://dtchen07.github.io/2026/04/harness-first-mcp/
This post is licensed under CC BY 4.0.
The accompanying code at mcp-stdio-test
and mcp-custom-template
is licensed under MIT.
No data is collected on this site beyond a self-hosted analytics counter (GoatCounter, no cookies, aggregate only).
Updates
- 2026-04-17: Initial publication.