OpenClaw's AI Sessions Were Hanging, Duplicating Memory, and Nobody Knew

There's a category of software failure that doesn't have a good name. It's not a crash. It's not data loss. It's not a security breach. It's the slow, silent erosion of trust between software and the people running it. Every AI agent framework has these bugs. Most don't fix them; they ship new features instead. OpenClaw fixed three in one afternoon.

All three bugs fixed on March 21 share this DNA. A Discord channel goes quiet and the operator assumes traffic dropped. A heartbeat session balloons to 200K messages and the operator assumes the model is verbose. A MEMORY.md file fills with duplicate entries and the operator assumes the agent is being careful. In every case, the operator's mental model of what the software is doing diverges from reality. And the software never corrects them.

The Hung Turn Problem Is Worse Than It Sounds

PR #51816 fixes a scenario where a single failed runtime.runTurn() call could starve an entire ACP session. Not “slow it down.” Not “cause errors.” Starve it. Every subsequent turn queued behind the stuck one, waiting for a completion event that would never come. The Discord channel bound to that session just stopped responding.

The fix introduces a watchdog timer in the ACP session manager. It respects the session's configured timeoutSeconds (no hardcoded short fuse), and when it fires, it runs a disciplined cleanup sequence: abort, cancel, close, evict. The implementation uses an eventGate pattern to keep event consumption separate from the race against the timeout, a detail that suggests the author has been bitten by this class of bug before.
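The watchdog idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from PR #51816: FakeSession, run_turn_with_watchdog, the cache dictionary, and the key name are all hypothetical, and the eventGate mechanism is not modeled. The sketch shows only the two properties described above: the deadline comes from the session's own timeout setting, and a timeout triggers abort, cancel, close, and evict in that order.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an ACP session; it only records cleanup calls."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.cleanup_log = []

    async def abort(self):
        self.cleanup_log.append("abort")

    async def cancel(self):
        self.cleanup_log.append("cancel")

    async def close(self):
        self.cleanup_log.append("close")


async def run_turn_with_watchdog(session, run_turn, cache, key):
    """Run one turn, but never wait longer than the session's configured timeout."""
    try:
        await asyncio.wait_for(run_turn(), timeout=session.timeout_seconds)
        return "completed"
    except asyncio.TimeoutError:
        # Cleanup order named in the article: abort, cancel, close, evict.
        # Each step is attempted even if an earlier one raises.
        for step in (session.abort, session.cancel, session.close):
            try:
                await step()
            except Exception:
                pass
        cache.pop(key, None)  # evict so later turns are not queued behind a dead session
        return "timed_out"


async def hung_turn():
    await asyncio.Event().wait()  # the event is never set: simulates a stuck turn


async def demo():
    session = FakeSession(timeout_seconds=0.05)
    cache = {"discord:general": session}
    outcome = await run_turn_with_watchdog(session, hung_turn, cache, "discord:general")
    return outcome, session.cleanup_log, cache
```

Calling asyncio.run(demo()) exercises the timeout path: the hung turn is cancelled once the deadline passes, the three cleanup steps run in order, and the session is removed from the cache.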

Contributor dutifulbob verified the fix with live Discord testing. Sessions unstick. Queued turns proceed. Systemd restarts work cleanly. But the Codex review bot flagged two issues: dead code at line 726 (a streamError check that became unreachable after refactoring) and a potential cache leak if the cancel grace period fails but close succeeds. Neither was addressed before merge. The fix works, but it shipped with loose ends.

Heartbeat Sessions Were Growing Forever

Here's a question that PR #42119 answers: what happens when your compaction safeguard can't distinguish between a real conversation and an automated heartbeat?

Answer: the session grows without bound. The compaction guard was supposed to trigger when a conversation exceeded a 200K-message ceiling. But its predicate checked only message roles (user, assistant, toolResult) without inspecting content. In heartbeat sessions, every message satisfies the role check because HEARTBEAT_OK and NO_REPLY are technically “user” and “assistant” messages. The guard thought the session was full of real conversation. It wasn't. The agent just slowly drowned in boilerplate.

Contributor samzong replaced the role-only predicate with a content-aware one. Non-empty text that isn't a known sentinel counts as real. Non-text content (images, attachments) always counts. Tool results only count if a meaningful user message appears within the previous 20 messages. The review process caught edge cases: markdown-wrapped boilerplate, reasoning-only blocks on Claude models, and split-turn prefix handling. Seventy-six tests pass.
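The three rules above can be sketched as a per-message predicate. This is an illustration only, not the code from PR #42119: the message shape (a role plus a list of typed content blocks) and the function names are assumptions, and the edge cases caught in review (markdown-wrapped boilerplate, reasoning-only blocks, split-turn prefixes) are deliberately left out.

```python
SENTINELS = {"HEARTBEAT_OK", "NO_REPLY"}  # sentinel strings named in the article
LOOKBACK = 20  # how far back a tool result may look for a real user message


def has_real_content(msg):
    """True if a user/assistant message carries more than heartbeat boilerplate."""
    if msg["role"] not in ("user", "assistant"):
        return False
    for block in msg.get("content", []):
        if block.get("type") != "text":
            return True  # non-text content (images, attachments) always counts
        text = block.get("text", "").strip()
        if text and text not in SENTINELS:
            return True  # non-empty text that is not a known sentinel
    return False


def counts_as_real(messages, i):
    """Apply the content-aware rule to the message at index i."""
    msg = messages[i]
    if msg["role"] == "toolResult":
        # A tool result counts only if a meaningful user message
        # appears within the previous LOOKBACK messages.
        window = messages[max(0, i - LOOKBACK):i]
        return any(m["role"] == "user" and has_real_content(m) for m in window)
    return has_real_content(msg)


def real_message_count(messages):
    """Number of messages that count as real conversation."""
    return sum(counts_as_real(messages, i) for i in range(len(messages)))
```

Under a role-only check, a session made entirely of HEARTBEAT_OK and NO_REPLY exchanges looks like real conversation. Under this predicate it counts as zero real messages.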

Memory Was Writing the Same Thing Seven Times

PR #34222 is the most elegant of the three fixes, and it exposes the most embarrassing bug.

OpenClaw's memory flush system writes conversation context to MEMORY.md after compaction. To prevent duplicate writes, it used compactionCount, an event counter that incremented every time compaction ran. If the count hadn't changed since the last flush, the write was skipped.

The problem: compaction can abort and retry. When it does, the counter increments even though the session context hasn't changed. The dedup check sees a new count, assumes new content, and flushes again. Four to seven identical entries per abort/retry cycle. Operators saw MEMORY.md growing and assumed the agent was diligent. The agent was just stuck in a loop.

Contributor lml2468 replaced the event counter with a SHA-256 hash of the actual content: messages.length plus the last three user/assistant messages. If the hash matches, the content hasn't changed. Skip the write. A secondary fix adds a willRetry guard that prevents aborted compactions from marking themselves as complete. Twenty-two tests cover every permutation.
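A content hash of this kind can be sketched as follows. The sketch assumes messages are plain dictionaries with a role field and JSON-serializable content; context_hash and MemoryFlusher are hypothetical names, the exact serialization in PR #34222 is not known here, and the willRetry guard is not modeled.

```python
import hashlib
import json


def context_hash(messages):
    """SHA-256 over the message count plus the last three user/assistant messages."""
    tail = [m for m in messages if m["role"] in ("user", "assistant")][-3:]
    payload = json.dumps({"count": len(messages), "tail": tail}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MemoryFlusher:
    """Hypothetical flusher that skips a write when the context hash is unchanged."""

    def __init__(self):
        self.last_hash = None
        self.writes = []  # stands in for entries appended to MEMORY.md

    def flush(self, messages):
        digest = context_hash(messages)
        if digest == self.last_hash:
            return False  # same content as the last flush: skip the write
        self.writes.append(digest)
        self.last_hash = digest
        return True
```

The design point is that the dedup key now depends on what the session contains rather than on how many times compaction ran, so a retry over unchanged context produces the same hash and no second write.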

The Failure Mode Nobody Talks About

Software culture celebrates the dramatic failure. The downtime. The data breach. The P0 incident that spawns a postmortem and a blog post. But the bugs fixed this week aren't dramatic. They're erosive. They make operators lose confidence in their own judgment before they lose confidence in the software.

“The channel must just be quiet today.” “The agent is probably being thorough.” “Maybe I misconfigured the heartbeat interval.”

All three explanations are wrong. All three are reasonable. That's what makes silent failures dangerous: they produce symptoms that have plausible innocent explanations. And plausible explanations kill debugging instincts.

OpenClaw fixed all three on the same day, from three different contributors, in three different subsystems. That's either a coincidence or a sign that the project is getting serious about the kind of reliability work that doesn't generate GitHub stars. I'd bet on the latter.

Full technical details: PR #51816, PR #42119, and PR #34222 on GitHub. All three are live for DeployClaw users.

Incident Log — March 21, 2026

Bound ACP sessions hung indefinitely

Heartbeat sessions silently grew past 200K messages

Memory writes duplicated 4–7x during compaction retries
