Postmortem: mega-release fallout (April 2026)

What happened when a large batch of infrastructure and product changes shipped together, how CI and deploy surfaced the gaps, and how we recovered.

Summary

In mid-April 2026, a large batch of infrastructure and product work shipped together (tag v5.32.0 and follow-on releases). Immediately afterward, deploy and CI needed sustained fixes in three areas: CDK synthesis and packaging, container image deploy for the AI processor, and Playwright end-to-end stability in GitHub Actions. Between April 19 and April 24 the branch accumulated about 106 commits. Roughly 22 of them were scoped as fix(cdk), fix(ci), fix(e2e), or fix(terraform), and 6 as fix(e2e) or test(e2e); both groupings include fix(e2e), so the two counts should not be summed. By the end of the week the pipeline and tests were green again and the shipped features were live.
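Scope counts like the ones above can be derived from commit subjects. The following is a minimal sketch in POSIX shell, not the exact query used for this note; the helper name count_scoped and the date bounds in the usage comment are assumptions.

```shell
# Hypothetical helper: count commit subjects that start with a given
# conventional-commit prefix. Reads one subject per line on stdin and
# prints the number of matching lines.
# $1 = extended regex for the prefix, e.g. 'fix\((cdk|ci)\)'
count_scoped() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps callers
  # that run under "set -e" from aborting on an empty result.
  grep -cE "^$1:" || true
}

# Usage inside the repository (date bounds are illustrative):
#   git log --since=2026-04-19 --until=2026-04-25 --pretty=format:%s \
#     | count_scoped 'fix\((cdk|ci|e2e|terraform)\)'
```

A pattern such as '(fix|test)\(e2e\)' matches some of the same commits as the broader fix pattern, which is why the two tallies overlap.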

This note is an engineering postmortem: what happened, why it hurt, and what we are doing differently. It is not about blaming tools — it is about coordination, isolation, and release discipline.

Context

Blogmarks is a bookmark manager PWA backed by AWS: DynamoDB, Lambda functions, a static export served through CloudFront, and CDK for infrastructure. The backlog included CDK migration work, hosted MCP, EventBridge wiring, knowledge-graph entities, invite flows, and other tickets that had been open for a while.

The intent was to move several streams forward in parallel. In practice, multiple AI coding agents were run against the same git working tree without git worktrees or another isolation boundary. A machine crash mid-session left the tree in a mixed state (overlapping edits, partial refactors, dependency lockfile churn). Recovery involved reconciling and shipping rather than pausing for a clean split — which compressed many concerns into one release window and made CI the first place inconsistencies surfaced.

Late-evening context and impaired judgment are part of the honest story; the engineering lesson below still stands without centering that narrative.

What went wrong

Concurrent writers, one working tree. Agents read and write the same files without a merge protocol. Orthogonal tasks are rare in a monolith; overlap produces changes that look reasonable locally until synthesis, packaging, or tests run.
Unclean recovery after a hard stop. The highest-leverage move after a crash would have been: stop, inspect git status, stash or branch per stream, and discard or merge deliberately. Continuing to commit and push from a blended tree increased the risk of shipping half-integrated state.
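The stop-and-inspect step can be sketched in shell. This is one possible approach under stated assumptions, not what was actually run during the incident: it takes the "branch" route rather than git stash, and the function names and the rescue/ branch prefix are hypothetical.

```shell
# Print "clean" or "dirty" for the repository in the current directory.
tree_state() {
  if [ -z "$(git status --porcelain)" ]; then
    echo clean
  else
    echo dirty
  fi
}

# Snapshot every non-ignored change (tracked and untracked) onto a new
# rescue branch, then return to the starting branch, which is left clean.
# Prints the rescue branch name on success.
# Assumes HEAD is on a branch (not detached) and has at least one commit.
park_on_rescue_branch() {
  start="$(git rev-parse --abbrev-ref HEAD)" || return 1
  rescue="rescue/$(date +%Y%m%d-%H%M%S)"
  git checkout -q -b "$rescue" &&
    git add -A &&
    git commit -q -m "wip: snapshot of blended tree" &&
    git checkout -q "$start" &&
    echo "$rescue"
}
```

From the clean branch, individual paths can then be restored one stream at a time with git checkout <rescue-branch> -- <path>, and whatever remains can be discarded deliberately.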

Fallout (April 19–24)

CDK / deploy. cdk synth failed with missing required construct props, incorrect asset paths, and Lambda bundling assumptions that did not match the repo layout. Fixes landed incrementally until the deploy graph was valid again.

AI processor image. The enrichment Lambda moved toward a container image while parts of the pipeline still assumed a zip artifact; build tooling needed alignment (including Docker buildx provenance flags where relevant). Symptoms showed up as image validation or deploy rejections until the pipeline matched the declared package type.
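One common cause of this kind of rejection, offered as an assumption rather than a confirmed diagnosis of this incident, is that recent Docker buildx versions attach provenance attestations by default, which produces an image index instead of the single-image manifest Lambda expects. A hedged sketch of a build wrapper follows; the function name, tag, build context, and platform are placeholders.

```shell
# Hypothetical wrapper around "docker buildx build" for a Lambda image.
# $1 = image tag, $2 = build context directory
# Set DRY_RUN=1 to print the command instead of running it.
build_ai_processor_image() {
  tag="$1"
  context="$2"
  # --provenance=false disables the default attestation so the result is
  # a plain single-platform image; the platform must match the
  # architecture declared for the function (linux/amd64 is assumed here).
  set -- docker buildx build \
    --platform linux/amd64 \
    --provenance=false \
    --tag "$tag" \
    --load \
    "$context"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running it with DRY_RUN=1 first makes it easy to confirm that CI and local builds pass the same flags.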

E2E in CI. Playwright specs drifted from the UI under parallel change: selectors and headings, timing on slow runners, locale-dependent formatting, MCP wizard flow, and service worker interactions. Stabilization required explicit waits, stable data-testid hooks where appropriate, and pinning locale expectations where the UI is localized.

Release cadence. Public release notes on the site document the patch line through 5.41.2 (see the changelog). Git tags in this clone currently reach v5.41.1; tags for later patches may exist only on the remote — the important part for readers is the sequence of shipped fixes, not a single numbering quirk in one environment.
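To check whether a clone is simply missing tags that exist on the remote, the two tag lists can be compared. A minimal sketch, assuming the remote is named origin; the function name is hypothetical.

```shell
# Print tags that exist on the remote but not in the local clone.
# $1 = remote name (defaults to "origin")
remote_only_tags() {
  remote="${1:-origin}"
  local_list="$(mktemp)"
  remote_list="$(mktemp)"
  git tag --list | sort > "$local_list"
  # --refs omits peeled entries (the "^{}" lines for annotated tags).
  git ls-remote --tags --refs "$remote" \
    | sed 's#.*refs/tags/##' | sort > "$remote_list"
  # comm -13 keeps lines that appear only in the second (remote) list.
  comm -13 "$local_list" "$remote_list"
  rm -f "$local_list" "$remote_list"
}
```

If the function prints anything, git fetch --tags <remote> brings the local clone up to date.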

Root cause

Treat the failure mode as shared mutable state without coordination: one working directory, multiple independent writers, no lock or merge ordering. That is the same class of bug as multithreaded code writing one structure without synchronization — except the "threads" are automation sessions and the shared heap is the filesystem.

Git worktrees give each line of work its own checkout and branch. Agents (or humans) in separate worktrees cannot overwrite each other's files accidentally. Isolation is worth the small overhead of extra directories and explicit merges.

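# One directory and one branch per stream of work. This form assumes each
# branch already exists; use "git worktree add -b <branch> <path>" to create one.
git worktree add ../blogmarks-cdk feat/cdk-migration
git worktree add ../blogmarks-mcp feat/mcp-server
git worktree add ../blogmarks-e2e fix/e2e-stabilization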

For Claude Code specifically, prefer agent isolation that creates or uses a dedicated worktree instead of sharing the default tree — see the official Claude Code documentation for current options and flags.

Recovery

Stop feature additions until deploys are trustworthy again.
Fix CDK synthesis and packaging first — nothing else ships if the graph does not build.
Then stabilize E2E: flaky tests are costly, but broken deploys are worse for users.
Ship patch releases freely when they fix real issues; version numbers are information, not ego.

What we changed

One agent, one worktree, one concern before parallelizing automation.
Smaller releases with CI green between them instead of banking weeks of work into a single drop.
Guardrails for late-night shipping — personal rule: no merges from an ambiguous tree after a cutoff hour; reset or split work first.
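The late-night guardrail can be made mechanical with a small check that a git pre-push hook calls. This is a sketch under assumptions: "ambiguous tree" is interpreted here as "has uncommitted changes", and the function name, cutoff hour, and resume hour are all hypothetical.

```shell
# Return 0 if a push is allowed, 1 if it should be blocked.
# $1 = current hour (0-23), $2 = count of uncommitted paths,
# $3 = cutoff hour (default 22), $4 = hour work resumes (default 6)
push_allowed() {
  hour="$1"
  dirty="$2"
  cutoff="${3:-22}"
  resume="${4:-6}"
  # Block only when the tree is dirty AND the clock is inside the
  # late-night window, which wraps past midnight.
  if [ "$dirty" -gt 0 ] &&
    { [ "$hour" -ge "$cutoff" ] || [ "$hour" -lt "$resume" ]; }; then
    return 1
  fi
  return 0
}

# Possible wiring in .git/hooks/pre-push (illustrative):
#   hour="$(date +%H)"; hour="${hour#0}"
#   dirty="$(git status --porcelain | wc -l | tr -d ' ')"
#   push_allowed "$hour" "$dirty" || {
#     echo "Dirty tree after cutoff: split or reset first." >&2
#     exit 1
#   }
```

Keeping the decision in a pure function means the rule can be tested without touching a repository or the system clock.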

Caveats

Parallel agents on one tree can appear fine when tasks touch disjoint files and the session ends cleanly. The danger is silent: the tree looks plausible until CI or production exercises the gaps. Worktree friction is deliberate — it buys verifiability.

Lessons

Automation sessions are concurrent writers; give them isolation like any other parallel work.
A crash tests whether your in-progress discipline was already sound; a messy recovery is evidence of problems that predate the crash.
A burst of patch releases is a lagging indicator of planning and integration risk, not just execution speed.
