OpenClaw's AI Sessions Were Hanging, Duplicating Memory, and Nobody Knew

Three bugs. Three different subsystems. One shared trait: they degraded the product silently, never crashed anything, and left operators blaming themselves. On March 21, all three got fixed. Here's the damage they did first.

March 21, 2026 · 6 min read

Incident Log — March 21, 2026

Three fixes, filed the same day, exposing the same failure pattern: invisible degradation.

#51816 by dutifulbob · Critical

Bound ACP sessions hung indefinitely

Symptom

Discord channels like #codex-1 appeared dead. No responses. No errors. Just silence.

Root cause

A single failed runtime.runTurn() call blocked all subsequent turns in the session queue. No timeout existed.

Fix

Manager-side watchdog with configurable timeouts. Aborts hung turns, evicts stale runtime handles, lets queued work proceed.

Evidence

Live Discord testing confirmed stuck sessions recovered. Unit tests pass under fake timers.

#42119 by samzong · High

Heartbeat sessions silently grew past 200K messages

Symptom

Agents went silent in long-running heartbeat sessions. No crash, no warning — just a gradual fade to nothing.

Root cause

Compaction guard counted any user/assistant/toolResult message as 'real' without inspecting content. HEARTBEAT_OK and NO_REPLY boilerplate fooled the counter, so compaction never triggered.

Fix

Content-aware predicate: messages must contain non-empty, non-boilerplate text. Tool results only count if linked to genuine user requests within the last 20 messages.

Evidence

76+ tests passing. Covers markdown-wrapped boilerplate, reasoning-only blocks, and split-turn prefixes.

#34222 by lml2468 · High

Memory writes duplicated 4–7x during compaction retries

Symptom

MEMORY.md filled with identical entries. Operators assumed the agent was being thorough. It was being broken.

Root cause

Deduplication used compactionCount (an event counter) instead of content state. Aborted compactions incremented the counter despite identical context, bypassing the dedup check.

Fix

SHA-256 content hash computed from message count + last 3 user/assistant messages. Hash compared before flush. Two-layer gate: token threshold first, then hash check.

Evidence

8 unit tests + 14 regression tests. Covers abort/retry cycles, session rotation, and large single-line JSONL records.

There's a category of software failure that doesn't have a good name. It's not a crash. It's not data loss. It's not a security breach. It's the slow, silent erosion of trust between software and the people running it. Every AI agent framework has these bugs. Most don't fix them; they ship new features instead. OpenClaw fixed three in one day.

All three bugs fixed on March 21 share this DNA. A Discord channel goes quiet and the operator assumes traffic dropped. A heartbeat session balloons to 200K messages and the operator assumes the model is verbose. A MEMORY.md file fills with duplicate entries and the operator assumes the agent is being careful. In every case, the operator's mental model of what the software is doing diverges from reality. And the software never corrects them.

The Hung Turn Problem Is Worse Than It Sounds

PR #51816 fixes a scenario where a single failed runtime.runTurn() call could starve an entire ACP session. Not “slow it down.” Not “cause errors.” Starve it. Every subsequent turn queued behind the stuck one, waiting for a completion event that would never come. The Discord channel bound to that session just stopped responding.

The fix introduces a watchdog timer in the ACP session manager. It respects the session's configured timeoutSeconds (no hardcoded short fuse), and when it fires, it runs a disciplined cleanup sequence: abort, cancel, close, evict. The implementation uses an eventGate pattern to decouple event consumption from the timeout race condition — a detail that suggests the author has been bitten by this class of bug before.
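
For readers who want the shape of the idea, here is a minimal sketch of a turn watchdog. It is not the code from PR #51816. Only runTurn and timeoutSeconds come from the PR description; the Runtime interface, the handles cache, runTurnWithWatchdog, and the cancel/close methods are hypothetical names chosen for illustration. The sketch races a turn against a timer and, if the timer wins, runs the abort, cancel, close, evict sequence so the next queued turn can start.

```typescript
// Hypothetical runtime handle; only runTurn() is named in the article.
interface Runtime {
  runTurn(input: string, signal: AbortSignal): Promise<string>;
  cancel(): Promise<void>;
  close(): Promise<void>;
}

// Assumed cache of live runtime handles, keyed by session id.
const handles = new Map<string, Runtime>();
const TIMED_OUT = Symbol("timed out");

async function runTurnWithWatchdog(
  sessionId: string,
  input: string,
  timeoutSeconds: number,
): Promise<string> {
  const runtime = handles.get(sessionId);
  if (!runtime) throw new Error(`no runtime handle for ${sessionId}`);

  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), timeoutSeconds * 1000);
  });

  try {
    // Whichever settles first wins: the turn itself or the watchdog.
    const result = await Promise.race([
      runtime.runTurn(input, controller.signal),
      watchdog,
    ]);
    if (result !== TIMED_OUT) return result;
  } finally {
    clearTimeout(timer);
  }

  // The watchdog fired: clean up so later turns are not starved.
  controller.abort();                             // 1. abort the hung turn
  await runtime.cancel().catch(() => undefined);  // 2. cancel, best effort
  await runtime.close().catch(() => undefined);   // 3. close, best effort
  handles.delete(sessionId);                      // 4. evict the stale handle
  throw new Error(`turn exceeded ${timeoutSeconds}s and was aborted`);
}

// Demo: the first turn never settles, the second one succeeds.
async function demo(): Promise<string[]> {
  const log: string[] = [];
  const hung: Runtime = {
    runTurn: () => new Promise<string>(() => {}), // simulates the stuck turn
    cancel: async () => { log.push("cancel"); },
    close: async () => { log.push("close"); },
  };
  const healthy: Runtime = {
    runTurn: async (input) => `ok:${input}`,
    cancel: async () => {},
    close: async () => {},
  };

  handles.set("codex-1", hung);
  try {
    await runTurnWithWatchdog("codex-1", "first", 0.02);
  } catch {
    log.push(`evicted:${!handles.has("codex-1")}`);
  }

  handles.set("codex-1", healthy); // a real manager would re-create the handle
  log.push(await runTurnWithWatchdog("codex-1", "second", 0.02));
  return log;
}
```

Calling demo() resolves to a log of cancel, close, evicted:true, ok:second. The stuck turn is cleaned up in order and the queued turn behind it completes. In this sketch the watchdog resolves with a sentinel rather than rejecting, which keeps a timeout distinguishable from a genuine runTurn error.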

Contributor dutifulbob verified the fix with live Discord testing. Sessions unstick. Queued turns proceed. Systemd restarts work cleanly. But the Codex review bot flagged dead code at line 726 — a streamError check that became unreachable after refactoring — and a potential cache leak if the cancel grace period fails but close succeeds. Neither was addressed before merge. The fix works, but it shipped with loose ends.

Heartbeat Sessions Were Growing Forever

Here's a question that PR #42119 answers: what happens when your compaction safeguard can't distinguish between a real conversation and an automated heartbeat?

Answer: the session grows without bound. The compaction guard was supposed to trigger when a conversation exceeded a 200K message ceiling. But its predicate only checked message roles (user, assistant, toolResult) without inspecting content. In heartbeat sessions, every message satisfies the role check because HEARTBEAT_OK and NO_REPLY are technically “user” and “assistant” messages. The guard mistook boilerplate for real conversation, so compaction never triggered. The agent just slowly drowned in boilerplate.

Contributor samzong replaced the role-only predicate with a content-aware one. Non-empty text that isn't a known sentinel counts as real. Non-text content (images, attachments) always counts. Tool results only count if a meaningful user message appears within the previous 20 messages. The review process caught edge cases: markdown-wrapped boilerplate, reasoning-only blocks on Claude models, and split-turn prefix handling. Seventy-six tests pass.
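
The difference between the two predicates is easier to see side by side. What follows is a sketch under assumptions, not samzong's implementation: the Message shape, the hasNonTextContent flag, and every function name are invented for illustration. Only the role names, the two sentinels, and the 20-message window come from the PR description.

```typescript
type Role = "user" | "assistant" | "toolResult";

// Assumed message shape; the real OpenClaw type is not shown in the article.
interface Message {
  role: Role;
  text?: string;
  hasNonTextContent?: boolean; // images, attachments
}

const SENTINELS = new Set(["HEARTBEAT_OK", "NO_REPLY"]);
const TOOL_RESULT_WINDOW = 20;

// Old behaviour as described: a known role is enough to count as real.
function countByRoleOnly(messages: Message[]): number {
  const roles = new Set<Role>(["user", "assistant", "toolResult"]);
  return messages.filter((m) => roles.has(m.role)).length;
}

// Strip leading/trailing markdown wrappers such as ** or backticks,
// leaving underscores inside the sentinel itself untouched.
function normalize(text: string): string {
  return text.replace(/^[\s`*_~]+|[\s`*_~]+$/g, "");
}

function hasMeaningfulText(msg: Message): boolean {
  const text = normalize(msg.text ?? "");
  return text.length > 0 && !SENTINELS.has(text);
}

// Content-aware predicate for the message at `index`.
function isRealMessage(messages: Message[], index: number): boolean {
  const msg = messages[index];
  if (msg.hasNonTextContent) return true; // non-text content always counts
  if (msg.role === "toolResult") {
    // Only counts if a meaningful user message sits in the previous 20.
    const start = Math.max(0, index - TOOL_RESULT_WINDOW);
    return messages
      .slice(start, index)
      .some((m) => m.role === "user" && hasMeaningfulText(m));
  }
  return hasMeaningfulText(msg);
}

function countRealMessages(messages: Message[]): number {
  return messages.filter((_, i) => isRealMessage(messages, i)).length;
}

const heartbeatOnly: Message[] = [
  { role: "user", text: "HEARTBEAT_OK" },
  { role: "assistant", text: "**NO_REPLY**" },
  { role: "toolResult", text: "status: idle" },
];

const mixed: Message[] = [
  { role: "user", text: "Summarize yesterday's deploy log" },
  { role: "toolResult", text: "42 lines read" },
  { role: "assistant", text: "NO_REPLY" },
];
```

On the heartbeatOnly sample, countByRoleOnly reports three real messages while countRealMessages reports zero. On the mixed sample, the content-aware count is two: the user request and the tool result linked to it.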

Memory Was Writing the Same Thing Seven Times

PR #34222 is the most elegant of the three fixes, and it exposes the most embarrassing bug.

OpenClaw's memory flush system writes conversation context to MEMORY.md after compaction. To prevent duplicate writes, it used compactionCount — an event counter that incremented every time compaction ran. If the count hadn't changed since the last flush, skip the write.

The problem: compaction can abort and retry. When it does, the counter increments even though the session context hasn't changed. The dedup check sees a new count, assumes new content, and flushes again. Four to seven identical entries per abort/retry cycle. Operators saw MEMORY.md growing and assumed the agent was diligent. The agent was just stuck in a loop.
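
The failure is easy to reproduce in a toy model. The sketch below is not OpenClaw's code: SessionState, lastFlushedCount, compact, and the in-memory memoryFile array are hypothetical stand-ins, and only compactionCount is a name taken from the PR. It assumes, as the description above does, that the counter advances on every attempt, aborted or not.

```typescript
// Toy model of counter-based dedup (all names except compactionCount are illustrative).
interface SessionState {
  compactionCount: number;  // event counter named in the article
  lastFlushedCount: number; // assumed bookkeeping field
  context: string;          // stand-in for the session's actual content
}

// Stand-in for MEMORY.md.
const memoryFile: string[] = [];

function flushIfCounterChanged(state: SessionState): void {
  // Dedup check: same count as last flush means "nothing new".
  if (state.compactionCount === state.lastFlushedCount) return;
  memoryFile.push(state.context);
  state.lastFlushedCount = state.compactionCount;
}

function compact(state: SessionState, aborted: boolean): void {
  // The bug: the counter advances even when the attempt aborts
  // and the context is left unchanged.
  state.compactionCount += 1;
  if (!aborted) state.context = `summary of (${state.context})`;
  flushIfCounterChanged(state);
}

const state: SessionState = {
  compactionCount: 0,
  lastFlushedCount: 0,
  context: "user asked about the deploy window",
};

// Four aborted attempts in a row, context identical every time.
for (let attempt = 0; attempt < 4; attempt++) {
  compact(state, true);
}
```

After the loop, memoryFile holds four entries and all four are the same string. The counter did its job as an event counter; it was simply the wrong thing to compare.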

Contributor lml2468 replaced the event counter with a SHA-256 hash of the actual content: messages.length plus the last three user/assistant messages. If the hash matches, the content hasn't changed. Skip the write. A secondary fix adds a willRetry guard that prevents aborted compactions from marking themselves as complete. Twenty-two tests (8 unit, 14 regression) cover abort/retry cycles, session rotation, and large single-line JSONL records.
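
A rough sketch of the hash gate, under stated assumptions: the PR description specifies SHA-256 over the message count plus the last three user/assistant messages, and a two-layer gate that checks the token threshold before the hash. Everything else here is hypothetical, including the ChatMessage shape, the JSON serialization fed to the hash, the function names, the token numbers, and the in-memory array standing in for MEMORY.md.

```typescript
import { createHash } from "node:crypto";

// Assumed message shape for illustration.
interface ChatMessage {
  role: "user" | "assistant" | "toolResult";
  text: string;
}

// Hash the message count plus the last three user/assistant messages.
function contextHash(messages: ChatMessage[]): string {
  const tail = messages
    .filter((m) => m.role === "user" || m.role === "assistant")
    .slice(-3);
  return createHash("sha256")
    .update(JSON.stringify({ count: messages.length, tail }))
    .digest("hex");
}

interface FlushState {
  lastFlushedHash?: string;
}

// Two-layer gate: token threshold first, then the content hash.
// Returns true only when an entry was actually written.
function flushMemory(
  messages: ChatMessage[],
  tokenCount: number,
  tokenThreshold: number,
  state: FlushState,
  memory: string[],
): boolean {
  if (tokenCount < tokenThreshold) return false;    // layer 1: not big enough yet
  const hash = contextHash(messages);
  if (hash === state.lastFlushedHash) return false; // layer 2: nothing changed
  memory.push(`flush at ${messages.length} messages`);
  state.lastFlushedHash = hash;
  return true;
}

const messages: ChatMessage[] = [
  { role: "user", text: "remember the deploy window is Friday" },
  { role: "assistant", text: "Noted." },
];
const flushState: FlushState = {};
const memory: string[] = [];

// Four retries over identical context: only the first one writes.
const retryResults = [1, 2, 3, 4].map(() =>
  flushMemory(messages, 9000, 8000, flushState, memory),
);

// New content changes the hash, so the next flush goes through.
messages.push({ role: "user", text: "also note the rollback steps" });
const afterNewMessage = flushMemory(messages, 9000, 8000, flushState, memory);

// Below the token threshold, the hash is never even computed.
const belowThreshold = flushMemory(messages, 100, 8000, flushState, memory);
```

In this run retryResults is true, false, false, false, afterNewMessage is true, and memory ends with two entries instead of five. Including the message count in the hash input means the gate also reopens when messages are added that fall outside the three-message tail, such as tool results.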

The Failure Mode Nobody Talks About

Software culture celebrates the dramatic failure. The downtime. The data breach. The P0 incident that spawns a postmortem and a blog post. But the bugs fixed this week aren't dramatic. They're erosive. They make operators lose confidence in their own judgment before they lose confidence in the software.

“The channel must just be quiet today.” “The agent is probably being thorough.” “Maybe I misconfigured the heartbeat interval.”

All three explanations are wrong. All three are reasonable. That's what makes silent failures dangerous: they produce symptoms that have plausible innocent explanations. And plausible explanations kill debugging instincts.

OpenClaw fixed all three on the same day, from three different contributors, in three different subsystems. That's either a coincidence or a sign that the project is getting serious about the kind of reliability work that doesn't generate GitHub stars. I'd bet on the latter.

Full technical details: PR #51816, PR #42119, and PR #34222 on GitHub. All three are live for DeployClaw users.

DeployClaw News · Analysis by Carlos Simpson · Reporting covers OpenClaw development independently. DeployClaw is a managed hosting service for OpenClaw instances. Upstream fixes ship automatically to managed deployments.