Builder MSBuild Zombie Reuse

Definition

dotnet build invokes MSBuild, which by default keeps worker processes (nodes) alive after a build finishes so that later builds can reuse them (“MSBuild node reuse”). Despite the “zombie” label, these are live, idle processes rather than defunct ones. In a long-lived container running multiple sequential agent runs, these workers accumulate, pin CPU and RAM, and eventually saturate the host. Worse, they can carry stale state across runs and surface as “phantom” build failures that don’t reproduce on a clean process tree.

Symptoms in kalamos

  • Agent run N produces a build error that doesn’t match the on-disk state of the csproj
  • ps inside the kalamos container shows multiple MSBuild.dll processes from previous runs
  • Container RAM creeps up over hours despite no new ingest workload
  • The 2026-04-26 rollback investigation found dozens of zombie MSBuild workers contributing to the 100% CPU starvation that triggered the kalamos pause
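
The second symptom can be checked with a short script. This is a minimal sketch, assuming a Linux container with a procps-style ps and grep available; the function names are hypothetical and not part of the repo. It matches on the MSBuild.dll string mentioned above, and the [M] bracket keeps grep from matching its own command line.

```shell
#!/bin/sh
# Hypothetical helpers for spotting leftover MSBuild workers.

# Count lines mentioning MSBuild.dll in a process listing read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
count_msbuild_workers() {
    grep -c '[M]SBuild\.dll' || true
}

# Show PID, elapsed time, resident memory (KiB) and command line of each candidate.
list_msbuild_workers() {
    ps -eo pid,etime,rss,args | grep '[M]SBuild\.dll'
}
```

Possible usage inside the container: ps -eo args | count_msbuild_workers. A count that stays above zero while no build is running, or an elapsed time that predates the current agent run, points at workers left over from earlier runs.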

Fix (already applied via ADR-025 compose patch)

# Set on the kalamos service environment in docker-compose.yml:
MSBUILDDISABLENODEREUSE=1

Also pass -nodeReuse:false on every dotnet build invocation as a belt-and-suspenders measure:

dotnet build -maxcpucount:2 -nodeReuse:false

Both the environment variable and the -nodeReuse:false flag are now mandatory for any Builder agent invocation in this repo. The -maxcpucount:2 cap is a separate concern (CPU starvation per ADR-025 §C), but the two pair naturally.
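
One way to keep both settings from being forgotten is to route every build through a small wrapper. The sketch below is an assumption, not an existing script in this repo: builder_build is a hypothetical name, and the only settings it uses are the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper: applies the mandatory settings, then forwards all arguments.
builder_build() {
    # Environment switch, mirroring the compose setting, in case the wrapper
    # runs somewhere the compose environment was not applied.
    MSBUILDDISABLENODEREUSE=1
    export MSBUILDDISABLENODEREUSE

    # Flags from this note; anything extra (project path, configuration) passes through.
    dotnet build -maxcpucount:2 -nodeReuse:false "$@"
}
```

Usage would look like builder_build path/to/Project.csproj, where the path is a placeholder. Neither setting stops workers that are already running. For those, dotnet build-server shutdown asks running build servers to exit, though whether it catches every leftover node in this container is an assumption worth verifying with ps.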

Why it matters specifically here

CRMAPIGenerator has a multi-project solution (Templates, Generators, Tests, Tools). With node reuse left enabled, each dotnet build of a sub-project leaves its worker around. Across 23 agents × multiple builds per run × 30s heartbeats, the math is brutal.

Related

  • [[CRMAPIGenerator-Repository]]: the project hit by this
  • ADR-025 §“Compose patch shipped”: the kalamos infra fix