<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dtchen07.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dtchen07.github.io/" rel="alternate" type="text/html" /><updated>2026-04-17T19:58:56-04:00</updated><id>https://dtchen07.github.io/feed.xml</id><title type="html">Dennis Chen</title><subtitle>Field reports on building with Claude, MCP servers, and the CI gates that make them safe to evolve.</subtitle><author><name>Dennis Chen</name></author><entry><title type="html">Harness-first MCP deployment: a field report from 6 weeks and 8 custom servers</title><link href="https://dtchen07.github.io/2026/04/harness-first-mcp/" rel="alternate" type="text/html" title="Harness-first MCP deployment: a field report from 6 weeks and 8 custom servers" /><published>2026-04-17T00:00:00-04:00</published><updated>2026-04-17T00:00:00-04:00</updated><id>https://dtchen07.github.io/2026/04/harness-first-mcp</id><content type="html" xml:base="https://dtchen07.github.io/2026/04/harness-first-mcp/"><![CDATA[<!-- GoatCounter analytics snippet placeholder -->

<h2 id="tldr">TL;DR</h2>

<blockquote>
  <p><em>Zero LLM, zero network, zero API keys - a CI-gate harness for stdio MCPs.</em></p>
</blockquote>

<p>I built 8 MCPs in a month. The first 2 broke on first real use. Then I wrote
a test harness. The next 6 shipped green on first run, cutting build time
from 90-120 minutes per server to 50-55 minutes - a 55% reduction measured
across eight servers. The harness ships today as
<a href="https://pypi.org/project/mcp-stdio-test/"><code class="language-plaintext highlighter-rouge">mcp-stdio-test</code></a> on PyPI and as
<a href="https://github.com/dtchen07/mcp-custom-template"><code class="language-plaintext highlighter-rouge">mcp-custom-template</code></a> on
GitHub. This is the v1 field report. A v2 post with three months of data is
scheduled for ~8 weeks from now.</p>

<h2 id="the-gap">The gap</h2>

<p>MCPs are easy to write. The official Python SDK lets you stand up a server
in twenty lines:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">mcp.server.fastmcp</span> <span class="kn">import</span> <span class="n">FastMCP</span>

<span class="n">mcp</span> <span class="o">=</span> <span class="n">FastMCP</span><span class="p">(</span><span class="s">"my-mcp"</span><span class="p">)</span>

<span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">tool</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">mcp</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>
</code></pre></div></div>

<p>Easy to write - hard to <em>know</em> you wrote correctly. The first regression
that bit me was silent: a decorator refactor dropped one tool off the
registry. <code class="language-plaintext highlighter-rouge">tools/list</code> returned five items instead of six. No error,
no warning, no stack trace. The server came up green; Claude just quietly
stopped being able to call one tool until a real invocation finally failed
with “unknown tool” deep inside a chain of agent steps.</p>

<p>The second regression was a schema mismatch after renaming an argument. The
tool still appeared in <code class="language-plaintext highlighter-rouge">tools/list</code>. It just threw on every call. Again no
error surfaced at startup.</p>
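<p>That class of drift is catchable without ever running the tool body:
diff the arguments a caller still sends against the schema the server now
advertises. A minimal sketch of the idea (a hand-rolled required/unknown
check, not a real JSON Schema validator and not part of the harness):</p>

```python
def schema_mismatches(input_schema, call_args):
    """Argument names that drifted between the advertised inputSchema
    and what a caller still sends."""
    declared = set(input_schema.get("properties", {}))
    sent = set(call_args)
    missing_required = set(input_schema.get("required", [])) - sent
    unknown = sent - declared
    return sorted(missing_required | unknown)

# After renaming `query` to `q` server-side but not caller-side:
schema = {"properties": {"q": {"type": "string"}}, "required": ["q"]}
print(schema_mismatches(schema, {"query": "weather"}))  # ['q', 'query']
```

An empty list means the call shape still matches the advertised schema; anything else is the silent startup-green failure described above.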

<p>Both failures have the same root cause: <strong>writing an MCP and validating an
MCP are not the same act.</strong> The SDK checks the first. Nothing checks the
second. So I wrote the nothing.</p>

<h2 id="the-three-piece-harness">The three-piece harness</h2>

<p>Three moving parts, each intentionally small.</p>

<h3 id="1-mcp-stdio-test---the-cli">1. <code class="language-plaintext highlighter-rouge">mcp-stdio-test</code> - the CLI</h3>

<p>A ~600-LOC Python package that speaks JSON-RPC to any stdio MCP server,
runs the <code class="language-plaintext highlighter-rouge">initialize</code> handshake, optionally calls <code class="language-plaintext highlighter-rouge">tools/list</code> or a specific
tool, and exits with a well-defined code. No LLM. No network. No API keys.
Zero runtime dependencies.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mcp-stdio-test path/to/server.py <span class="nt">--list-tools</span>

<span class="c"># Assert the tool inventory. Exit 4 on mismatch.</span>
mcp-stdio-test path/to/server.py <span class="nt">--list-tools</span> <span class="nt">--expect-count</span> 5 <span class="nt">--expect-tool</span> search

<span class="c"># Self-check the environment.</span>
mcp-stdio-test doctor
</code></pre></div></div>
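<p>Under the hood there is no magic: a stdio MCP server speaks
newline-delimited JSON-RPC 2.0 over stdin/stdout, and the CLI frames,
sends, and parses those lines. A sketch of the framing (the helper names
are mine, not the package's internals):</p>

```python
import json

def encode_request(req_id, method, params=None):
    """Frame one JSON-RPC 2.0 request as the newline-delimited line
    MCP's stdio transport expects."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg) + "\n"

def decode_result(line):
    """Parse one response line; surface JSON-RPC errors loudly."""
    msg = json.loads(line)
    if "error" in msg:
        raise RuntimeError(f"JSON-RPC error {msg['error'].get('code')}: "
                           f"{msg['error'].get('message')}")
    return msg.get("result")

# The handshake the harness runs before anything else:
handshake = encode_request(1, "initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "mcp-stdio-test", "version": "0.0.0"},
})
```

Write the handshake line to the server's stdin, read one line back, and <code class="language-plaintext highlighter-rouge">decode_result</code> either returns the capabilities or raises with the server's own error message.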

<p>The exit codes are the whole contract: <code class="language-plaintext highlighter-rouge">0</code> OK, <code class="language-plaintext highlighter-rouge">1</code> handshake, <code class="language-plaintext highlighter-rouge">2</code>
<code class="language-plaintext highlighter-rouge">tools/list</code>, <code class="language-plaintext highlighter-rouge">3</code> <code class="language-plaintext highlighter-rouge">tools/call</code>, <code class="language-plaintext highlighter-rouge">4</code> assertion mismatch. (argparse-style
<code class="language-plaintext highlighter-rouge">2</code> for a bad CLI invocation is outside this ladder.) Green CI line =
shippable. Anything else = broken, with the exit code telling you
<em>which layer</em> is broken before you read any output.</p>
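<p>The ladder is small enough to state in code. A reconstruction of the
mapping from the contract above (my sketch, not the package source):</p>

```python
# Exit-code ladder: which layer failed first?
EXIT_OK        = 0  # everything passed
EXIT_HANDSHAKE = 1  # initialize failed
EXIT_LIST      = 2  # tools/list failed
EXIT_CALL      = 3  # tools/call failed
EXIT_ASSERT    = 4  # server works, but not the server you declared

def classify(handshake_ok, list_ok, call_ok, assertions_ok):
    """Return the exit code for the lowest layer that failed."""
    if not handshake_ok:
        return EXIT_HANDSHAKE
    if not list_ok:
        return EXIT_LIST
    if not call_ok:
        return EXIT_CALL
    if not assertions_ok:
        return EXIT_ASSERT
    return EXIT_OK
```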

<h3 id="2-test_mcppy---five-assertions-per-server">2. <code class="language-plaintext highlighter-rouge">test_&lt;mcp&gt;.py</code> - five assertions per server</h3>

<p>Each MCP repo ships a <code class="language-plaintext highlighter-rouge">test_&lt;name&gt;.py</code> file that invokes <code class="language-plaintext highlighter-rouge">mcp-stdio-test</code>
as a subprocess and asserts on the results. For the template's two tools that
works out to five assertions:</p>

<ul>
  <li>Tool count matches.</li>
  <li>Each tool by name is present.</li>
  <li>One call per tool returns a non-error shape.</li>
</ul>

<p>The test file is the one thing you actually edit per change. Add a tool,
bump the expected count from 4 to 5, add an <code class="language-plaintext highlighter-rouge">--expect-tool new_tool</code>
assertion, commit. If that line is missing in the diff, reviewer says no.</p>
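<p>A <code class="language-plaintext highlighter-rouge">test_&lt;name&gt;.py</code> can be little
more than a command builder plus a returncode check. A sketch with
illustrative tool names (the flags are the documented ones; the server
path and inventory are placeholders):</p>

```python
import subprocess

# Illustrative tool inventory - edit this list (and only this list) per change.
EXPECTED_TOOLS = ["search", "fetch", "summarize", "tag", "archive"]

def build_cmd(server, tools):
    """Compose the mcp-stdio-test invocation that pins the tool surface."""
    cmd = ["mcp-stdio-test", server, "--list-tools",
           "--expect-count", str(len(tools))]
    for name in tools:
        cmd += ["--expect-tool", name]
    return cmd

def run_gate(server="server.py"):
    """Exit code 0 is green; 1-4 tells you which layer broke."""
    return subprocess.run(build_cmd(server, EXPECTED_TOOLS)).returncode
```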

<h3 id="3-deploy_checklistmd---the-six-gate-ladder">3. <code class="language-plaintext highlighter-rouge">DEPLOY_CHECKLIST.md</code> - the six-gate ladder</h3>

<table>
  <thead>
    <tr>
      <th>Gate</th>
      <th>What it proves</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Dev</strong></td>
      <td>Code parses, imports cleanly.</td>
    </tr>
    <tr>
      <td><strong>Unit</strong></td>
      <td>Every tool is registered with a schema.</td>
    </tr>
    <tr>
      <td><strong>Integration</strong></td>
      <td>Each tool returns a valid shape.</td>
    </tr>
    <tr>
      <td><strong>Staged</strong></td>
      <td>Registered with the client, allow-listed.</td>
    </tr>
    <tr>
      <td><strong>Production</strong></td>
      <td>Used in &gt;=1 real session without regression.</td>
    </tr>
    <tr>
      <td><strong>Routing</strong></td>
      <td>Whatever layer picks tools per request knows the new surface.</td>
    </tr>
  </tbody>
</table>

<p>The checklist is a doc, not a tool. It lives in the template repo. You copy
it, you follow it. Skipping gates catches up with you in the most
embarrassing way possible.</p>

<h2 id="walkthrough-new-mcp-zero-to-green">Walkthrough: new MCP, zero to green</h2>

<p>Fork, edit, run. Here is the full walkthrough from an empty directory to a
passing CI line.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Fork and clone the template.</span>
gh repo create my-org/weather-mcp <span class="nt">--template</span> dtchen07/mcp-custom-template <span class="nt">--clone</span>
<span class="nb">cd </span>weather-mcp

<span class="c"># 2. Install the harness and the MCP SDK.</span>
python <span class="nt">-m</span> pip <span class="nb">install </span>mcp-stdio-test mcp

<span class="c"># 3. Confirm the toolchain is healthy.</span>
mcp-stdio-test doctor
<span class="c"># [PASS] python: 3.12.1</span>
<span class="c"># [PASS] mcp_sdk: 1.27.0</span>
<span class="c"># [PASS] fixture: stdio handshake complete; 0 tool(s) listed</span>
<span class="c"># Overall: PASS</span>

<span class="c"># 4. Replace the two example tools in server.py with real ones.</span>
<span class="c">#    Edit test_my_mcp.py: bump --expect-count, list the new tool names.</span>

<span class="c"># 5. Run the harness.</span>
python test_my_mcp.py
<span class="c"># All 5 assertions passed.</span>

<span class="c"># 6. Push. CI runs the same five assertions on Linux, macOS, Windows.</span>
git push
</code></pre></div></div>

<p>Wall-clock time from <code class="language-plaintext highlighter-rouge">git clone</code> to first green test, measured on a fresh
MacBook with no cached dependencies: <strong>under 10 minutes</strong>, almost all of
which is writing the two real tool bodies. The harness itself (init +
list + call + assert on a 2-tool server) takes about 2 seconds of machine
time; everything else is you editing Python. The v2 field report will
have the 30-day moving average across more servers and more contributors.</p>

<p>Nothing on that list is novel. The novelty is the <em>absence</em> of other steps.
No LLM judge. No scored quality report. No pytest plugin glue. No per-server
fixture scaffold. Just: the server runs, the tools are where you said they
are, the calls return what you said they would.</p>

<h2 id="the-feedback-loop">The feedback loop</h2>

<p>The harness closes the inner loop: <em>did the MCP I wrote match the MCP I
said I wrote?</em> A second tool closes the outer loop: <em>is the MCP I wrote
actually being used?</em></p>

<p>I run a separate MCP called <code class="language-plaintext highlighter-rouge">claude-usage</code> that introspects Claude’s own
tool-call log. It has three tools I actually use:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">usage_top_tools</code> - ranks tools by invocation count over a window.</li>
  <li><code class="language-plaintext highlighter-rouge">usage_dead_mcps</code> - lists tools that have never been called.</li>
  <li><code class="language-plaintext highlighter-rouge">usage_top_responses</code> - surfaces the top Claude responses that triggered
each tool, so I can see <em>why</em> Claude chose to call it.</li>
</ul>
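<p><code class="language-plaintext highlighter-rouge">usage_dead_mcps</code>
reduces to a set difference over the call log. A sketch of the core (the
log record shape here is my assumption, not
<code class="language-plaintext highlighter-rouge">claude-usage</code>'s actual
schema):</p>

```python
from collections import Counter

def dead_tools(registered, call_log):
    """Registered tools that never appear in the invocation log."""
    calls = Counter(entry["tool"] for entry in call_log)
    return sorted(name for name in registered if calls[name] == 0)

log = [{"tool": "search"}, {"tool": "search"}, {"tool": "fetch"}]
print(dead_tools(["search", "fetch", "summarize"], log))  # ['summarize']
```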

<p><code class="language-plaintext highlighter-rouge">usage_dead_mcps</code> is the one I care about most. In week 2 it flagged three
tools across two servers as never-called. Two of the three turned out to be
genuinely redundant with existing tools - I deleted them. The third had a
typo in the tool name that Claude was silently routing around. The harness
never would have caught that; the usage log did.</p>

<p>Harness closes the inner loop. Usage log closes the outer. Between them,
the functional score I track across ten operational dimensions went from
65 to 91 over six weeks. The harness is not the whole story - credential
rotation, config-sync, the checklist ritual all contributed - but it is the
single change whose delta I can measure most cleanly.</p>

<h2 id="where-this-sits-vs-neighbors">Where this sits vs. neighbors</h2>

<p>The space is more crowded than I expected when I started. Five adjacent
tools, each doing a distinct job. Honest comparison:</p>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Primary job</th>
      <th>Transport</th>
      <th>LLM</th>
      <th>Exit-code CI</th>
      <th>Tool-count / name assertions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong><code class="language-plaintext highlighter-rouge">mcp-stdio-test</code></strong> (this)</td>
      <td><strong>Per-repo declarative CI gates</strong></td>
      <td>stdio</td>
      <td>no</td>
      <td>yes (0/1/2/3/4 tiered)</td>
      <td>yes (first-class, user-authored)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">mcp-lint</code> (LuxshanLux)</td>
      <td>23 deterministic lint rules, scored</td>
      <td>stdio + SSE</td>
      <td>no</td>
      <td>yes (0/1/2)</td>
      <td>no</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">mcp-tester</code> (saqadri, “MCP-Eval”)</td>
      <td>Pytest-style agent evaluation + cost</td>
      <td>stdio</td>
      <td>yes</td>
      <td>JSON + pytest</td>
      <td>no</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">mcp-doctor</code> (destilabs)</td>
      <td>Diagnostic + agent-friendliness scorer</td>
      <td>stdio + HTTP</td>
      <td>partial (only <code class="language-plaintext highlighter-rouge">generate-dataset</code>)</td>
      <td>scored report</td>
      <td>no</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">mcp-probe</code> (conikeec)</td>
      <td>Rust TUI debugger + compliance suite</td>
      <td>stdio / SSE / HTTP</td>
      <td>no</td>
      <td><code class="language-plaintext highlighter-rouge">--fail-fast</code> pass/fail</td>
      <td>built-in suite, not user-authored</td>
    </tr>
    <tr>
      <td>MCP Inspector (official)</td>
      <td>Interactive debugger + <code class="language-plaintext highlighter-rouge">--cli</code></td>
      <td>stdio / SSE / HTTP</td>
      <td>no</td>
      <td>JSON (no assertions)</td>
      <td>no</td>
    </tr>
  </tbody>
</table>

<p>Two distinctions hold up on direct comparison. First: <strong>none of the five
ship per-repo, user-authored assertions.</strong> <code class="language-plaintext highlighter-rouge">mcp-lint</code>, <code class="language-plaintext highlighter-rouge">mcp-doctor</code>, and
<code class="language-plaintext highlighter-rouge">mcp-probe</code> all ship fixed rule suites; you can’t commit a line that says
“my server has exactly five tools named X, Y, Z” and have CI fail the
moment that drifts. <code class="language-plaintext highlighter-rouge">mcp-stdio-test</code>’s <code class="language-plaintext highlighter-rouge">--expect-count</code> / <code class="language-plaintext highlighter-rouge">--expect-tool</code>
flags compose into exactly that line. Second: <strong>only two neighbors
publish numbered CI exit codes at all</strong> — <code class="language-plaintext highlighter-rouge">mcp-lint</code> (0/1/2) and
<code class="language-plaintext highlighter-rouge">mcp-probe</code> (pass/fail) — and neither distinguishes
<em>tool-count mismatch</em> from <em>handshake failure</em> from <em>usage error</em>. Those
three fail in wildly different ways and should tell you so at the exit
code.</p>

<p>If you want a scored quality report, use <code class="language-plaintext highlighter-rouge">mcp-doctor</code>. If you want
agent-driven end-to-end evaluation, use <code class="language-plaintext highlighter-rouge">mcp-tester</code>. If you want to
click through your server, use MCP Inspector. If <code class="language-plaintext highlighter-rouge">mcp-lint</code> or Inspector
could usefully absorb this pattern upstream, I will happily open a PR.
Fewer better tools beat more overlapping tools.</p>

<h2 id="what-v1-gets-wrong">What v1 gets wrong</h2>

<p>Six weeks is not enough data to see slow-drift bugs. Eight servers is not
enough to see cross-server interaction patterns. One developer on one
desktop is not enough to surface the collaboration pathologies that matter
at team scale. The windowing on the build-time numbers is biased toward the
servers I wrote last - they benefit from the harness <em>and</em> from my
familiarity with the pattern.</p>

<p>Explicit non-goals I am holding the line on for v1: no HTTP / SSE
transport, no LLM-in-the-loop evaluation, no multi-MCP orchestration, no
pytest plugin glue. Each of those is a larger project than the one I am
willing to support at the stated service level (7-day best-effort response,
no SLA on fixes).</p>

<h2 id="what-v2-will-bring">What v2 will bring</h2>

<p>v0.2.0 is targeted for ~8 weeks from today. The data I want in it:</p>

<ul>
  <li>Three months of build-time numbers, unweighted, across at least ten
servers.</li>
  <li>At least one external contributor’s MCP built with the template.</li>
  <li>Telemetry on protocol-drift incidents (there have been zero in six
weeks; I expect this to change).</li>
  <li>An honest “what stopped working” section if any of this v1 framing
turned out to be wrong.</li>
</ul>

<p>If the data is not there by day 56, I will ship a “what I have learned”
delta post instead of a press release. The commitment is to <em>ship
something honest</em>, not to <em>ship the thing I said I would</em>.</p>

<h2 id="whats-next-unrelated">What’s next (unrelated)</h2>

<p>The next project, which is not this one, is an LLM-as-judge evaluation
harness for MCP tools that have natural-language outputs. The harness in
this post is the CI gate. The next one is the regression test for the gate
itself: <em>is the tool answering questions well, not just answering them?</em></p>

<p>Different problem. Different tool. Shipping separately. Subscribe to the
<a href="/feed.xml">RSS feed</a> if you want the v2 post when it lands.</p>

<hr />

<h2 id="try-it">Try it</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>mcp-stdio-test
<span class="c"># Template repo:</span>
gh repo create my-mcp <span class="nt">--template</span> dtchen07/mcp-custom-template <span class="nt">--clone</span>
</code></pre></div></div>

<p>Verify the supply chain yourself:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip download mcp-stdio-test
gh attestation verify mcp_stdio_test-<span class="k">*</span>.whl <span class="nt">--owner</span> dtchen07
</code></pre></div></div>

<hr />

<p><em>Suggested citation:</em> Chen, D. (2026). <em>Harness-first MCP deployment: a
field report from 6 weeks and 8 custom servers.</em>
<a href="https://dtchen07.github.io/2026/04/harness-first-mcp/">https://dtchen07.github.io/2026/04/harness-first-mcp/</a></p>

<p><em>This post is licensed under <a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>.
The accompanying code at <a href="https://github.com/dtchen07/mcp-stdio-test"><code class="language-plaintext highlighter-rouge">mcp-stdio-test</code></a>
and <a href="https://github.com/dtchen07/mcp-custom-template"><code class="language-plaintext highlighter-rouge">mcp-custom-template</code></a>
is licensed under <a href="https://opensource.org/licenses/MIT">MIT</a>.</em></p>

<p><em>No data is collected on this site beyond a self-hosted analytics counter
(GoatCounter, no cookies, aggregate only).</em></p>

<h3 id="updates">Updates</h3>

<ul>
  <li>2026-04-17: Initial publication.</li>
</ul>]]></content><author><name>Dennis Chen</name></author><category term="mcp" /><category term="tooling" /><category term="field-report" /><category term="model-context-protocol" /><category term="stdio" /><category term="ci" /><category term="testing" /><category term="harness" /><category term="python" /><summary type="html"><![CDATA[I built 8 MCPs in a month. The first 2 broke on first real use. Then I wrote a harness. The next 6 shipped green on first run, cutting build time from 90-120 minutes per server to 50-55 minutes. Here is the harness - and what 6 weeks of running it has actually taught me.]]></summary></entry></feed>