From Chatbot to Cognitive Architecture: CoALA in Production
Most agents are a while-loop around a model. Matrix runs a real cognitive architecture — one CoALA decision cycle powering interactive and autonomous agents alike.
Look at almost any "AI agent" in production and you'll find the same thing under the hood: a while loop around model.generate(). Append the user message, call the model, run whatever tool it asked for, append the result, call the model again. It works for a demo. It does not scale into a system you can reason about, govern, or extend — because there's no architecture there, just control flow.
We took a different bet. Matrix runs a real cognitive architecture for the LLM — specifically CoALA, Cognitive Architectures for Language Agents (Sumers, Yao, Narasimhan & Griffiths, TMLR 2024). One PERCEIVE → DECIDE → ACT → LEARN cycle drives both real-time interactive agents (chat, phone, app) and fully autonomous background agents. Same working memory. Same typed action space. Same execution path. Same principal. The only thing that branches is one persisted property: Agent.mode.
This is the post for the engineering leader deciding whether to build on a prompt loop or on something with a spine. Here's the spine.
The framing insight: dialogue is just a grounding action
The reason most teams build two systems — a chatbot and a separate "autonomous agent" framework — is that they look like two different problems. One has a human typing at you in real time; the other runs unattended against a goal. Different loops, different code, different bugs, twice the surface area.
CoALA dissolves that split with one observation: talking to a human is just another grounding action. A model emitting a turn to a person is the same kind of move as calling an HTTP endpoint or writing a file — it's the agent affecting the world. Once you accept that, interactive and autonomous stop being two architectures. They're one decision cycle with different drivers and different grounding actions.
So in Matrix there is exactly one axis the runtime branches on, and it's Agent.mode:
- INTERACTIVE (the default): a human is in the loop in real time. The realtime/chat provider does the in-turn proposing and selecting; it's a co-located brain optimized for latency.
- AUTONOMOUS: no human. Matrix supplies the brain and runs a deliberate propose→evaluate→select cycle per step.
What makes this clean rather than clever: Agent.mode is the same axis our access control already keys on. The authorization boundary was already unified along this line — AccessControlService.decide composes the agent's own grant when AUTONOMOUS, and agent ∩ caller when INTERACTIVE. The cognitive core just unifies cognition the same way the security model was already unified. One concept, used twice, instead of two parallel hierarchies.
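The composition rule above can be sketched in a few lines. This is a minimal illustration, assuming grants can be modeled as sets of permission names; GrantComposer, effectiveGrant, and the permission strings are hypothetical, and the real AccessControlService.decide is not shown in this post.

```java
import java.util.HashSet;
import java.util.Set;

// Mirrors the two values of Agent.mode described in the post.
enum Mode { INTERACTIVE, AUTONOMOUS }

// Hypothetical helper: grants are modeled as plain sets of permission names.
final class GrantComposer {
    private GrantComposer() {}

    // AUTONOMOUS: the agent's own grant.
    // INTERACTIVE: the intersection of the agent's grant and the caller's grant.
    static Set<String> effectiveGrant(Mode mode, Set<String> agentGrant, Set<String> callerGrant) {
        Set<String> effective = new HashSet<>(agentGrant);
        if (mode == Mode.INTERACTIVE) {
            effective.retainAll(callerGrant); // agent ∩ caller
        }
        return effective;
    }
}
```

In this sketch an interactive agent can never do more than the human it is talking to, and an autonomous agent can never do more than its own grant, which is the property the shared axis is meant to guarantee.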
Four seams, one cycle
Here's the whole runtime as four seams. Only DECIDE swaps on Agent.mode; everything else is shared machinery:
| Seam | Interactive | Autonomous | Shared machinery |
|---|---|---|---|
| PERCEIVE: assemble WorkingMemory | per human/provider turn | per cycle | WorkingMemoryAssembler |
| DECIDE: choose next action(s) | provider proposes & selects in-turn | propose-N → evaluate → select | DecisionDriver |
| ACT: execute the chosen action | tool-call dispatched | tool-call dispatched | ToolDispatcher.invoke |
| LEARN: write memory | per-turn + post-session extractor | per-step + final summary | MemoryService.write |
| Principal / RBAC | agent ∩ caller | agent-only | AccessControlService.decide |
Read it top to bottom and the design intent is obvious: a phone call and a background task perceive into the same object, execute through the same dispatcher, learn into the same memory store, and run under the same principal model. The cadence and the composed grant vary by mode, but the machinery itself swaps at exactly one row: DECIDE.
The rest of this post is the three pillars that make those seams real.
Pillar 1 — Working memory as the hub
CoALA's working memory is "active and readily available information as symbolic variables for the current decision cycle" — the central hub every other component reads from and writes to. In Matrix that's a typed record, runtime.WorkingMemory, and both modes build it the same way: WorkingMemoryAssembler.assemble(...).
The assembler reads long-term memory once into a MemorySnapshot, then projects it. This matters more than it sounds. The old failure mode — and the one most prompt loops still have — is reading long-term memory ad hoc, multiple times, in slightly different ways per channel. We split MemoryContextRenderer into read(agent, userId) → MemorySnapshot and render(snapshot, agent) → String, so there is exactly one read and one render. The legacy single-call render is now literally render(read(...)), and it's byte-for-byte identical to the old inline version — a property locked by AgentPromptComposerTest.
That's the discipline: refactor toward an architecture without changing a byte of what the model sees, then add new capability behind flags. The real composer is AgentPromptComposer.compose(WorkingMemory); the older 5-argument signature is a thin delegate kept for compatibility. Two new trailing blocks — the typed action space and a "Progress so far" observations block — are gated and render to an empty string unless populated. Voice and chat prompts are unchanged by default.
One PERCEIVE path. One snapshot. One composer. Chat, telephony voice, browser-direct voice, and autonomy all flow through it.
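The read/render split can be sketched as follows. This is an illustration only: the snapshot fields, the String stand-ins for the agent and user, and the reads counter are assumptions made for the example, not the shape of the real MemoryContextRenderer.

```java
import java.util.List;

// Hypothetical snapshot shape; the real MemorySnapshot fields are not shown in the post.
record MemorySnapshot(List<String> facts) {}

final class MemoryContextRendererSketch {
    private int reads = 0;

    // The single long-term-memory read. A real implementation would query a store;
    // this stub returns fixed facts and counts how often it was called.
    MemorySnapshot read(String agent, String userId) {
        reads++;
        return new MemorySnapshot(List.of(userId + " prefers email", agent + " owns onboarding"));
    }

    // Pure projection: the same snapshot always renders to the same string.
    String render(MemorySnapshot snapshot, String agent) {
        StringBuilder sb = new StringBuilder("Memory for " + agent + ":\n");
        for (String fact : snapshot.facts()) {
            sb.append("- ").append(fact).append('\n');
        }
        return sb.toString();
    }

    // Legacy single-call entry point, expressed as render(read(...)).
    String renderLegacy(String agent, String userId) {
        return render(read(agent, userId), agent);
    }

    int reads() { return reads; }
}
```

Because render has no side effects, a test can assert that the legacy path and the split path produce identical output, which is the kind of property the post says AgentPromptComposerTest locks down.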
Pillar 2 — The typed action space
In a prompt loop, "tools" are an undifferentiated bag the model digs through. CoALA says actions have kinds, and the kind matters: internal actions (retrieval, reasoning, learning) versus external grounding. Matrix makes that taxonomy first-class. Every callable is classified into a runtime.ActionType:
- REASONING: internal, no side effect.
- RETRIEVAL: reads memory or knowledge (lookup_contact_details, search_knowledge).
- LEARNING: writes durable memory (update_contact_profile, add_contact_note, set_contact_*, update_lead, update_record, update_campaign_contact).
- GROUNDING: affects the world, including dialogue (HTTP / MCP / INTERNAL built-ins; the default).
ActionClassifier maps a tool by name for built-ins, with an optional Tool.actionType entity override so a custom HTTP tool can opt into a category. There's one composition path — AgentToolSurface.composeActions(...) returns List<AgentAction> — and composeForCaller(...) just unwraps it back to the existing List<ToolCallback> contract, so nothing downstream had to change.
Two payoffs:
- The model knows what kinds of moves it has. The grouped action-space block is surfaced into every channel's prompt (telephony, browser voice, chat, autonomous), gated by matrix.runtime.expose-action-space (default true). The model isn't guessing which tool retrieves versus which one writes; the categories are in the prompt.
- Every dispatch is observable by category. ToolDispatcher tags each dispatch log line with the action category, and on the chat path RecordingToolCallback tags each invocation with its ActionType too, so the OBSERVE representation on chat matches the one on voice and autonomous. You can read a run log and see retrieval-vs-grounding-vs-learning, not just opaque tool names.
A small but telling detail from our own validation: we initially classified get_current_time as REASONING, then corrected it. Reading the clock pulls a value in from the environment and does not write working memory through the LLM; it's a stateless environmental read, i.e. digital grounding. REASONING now holds no built-in at all; reasoning in CoALA is the LLM's own internal deliberation, carried out by the decision procedure itself, not invoked as a tool. We kept the empty REASONING set so a Tool entity can still opt in. Getting the taxonomy right is part of the work, not decoration.
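The classification rules described in this section can be sketched as a name lookup with an optional override. The tool names and categories come from the list above; ActionClassifierSketch, its classify signature, and the choice to let the override win for any tool are assumptions made for illustration, not the real ActionClassifier.

```java
import java.util.Optional;
import java.util.Set;

enum ActionType { REASONING, RETRIEVAL, LEARNING, GROUNDING }

// Hypothetical sketch of name-based classification with an entity-level override.
final class ActionClassifierSketch {
    private static final Set<String> RETRIEVAL_TOOLS =
            Set.of("lookup_contact_details", "search_knowledge");
    private static final Set<String> LEARNING_TOOLS = Set.of(
            "update_contact_profile", "add_contact_note", "update_lead",
            "update_record", "update_campaign_contact");

    private ActionClassifierSketch() {}

    static ActionType classify(String toolName, Optional<ActionType> override) {
        // Assumption: an explicit override (modeled on Tool.actionType) takes precedence.
        if (override.isPresent()) {
            return override.get();
        }
        if (RETRIEVAL_TOOLS.contains(toolName)) {
            return ActionType.RETRIEVAL;
        }
        // set_contact_* is matched by prefix.
        if (LEARNING_TOOLS.contains(toolName) || toolName.startsWith("set_contact_")) {
            return ActionType.LEARNING;
        }
        // Everything else, including get_current_time, defaults to GROUNDING.
        return ActionType.GROUNDING;
    }
}
```

Note that nothing maps to REASONING by name in this sketch; a tool only lands there if its override says so, matching the empty built-in set described above.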
Pillar 3 — The autonomous decision cycle
Here's where interactive and autonomous diverge — at DECIDE, and only there. Interactive agents lean on the realtime provider, which proposes and selects a single action in-turn (the ReAct tier). Autonomous agents get the deliberate tier.
AgentRuntime.runAutonomous(...) is the engine. Per step:
    loop (step ≤ maxSteps):
        wm  = assemble(AUTONOMOUS, goal, retrieved-memory,
                       action-space, observations)        # PERCEIVE
        d   = autonomousDriver.decide(wm)                 # DECIDE
        if d.finish: write EPISODIC result; return completed
        res = toolDispatcher.invoke(toolCtx, callbacks,
                                    d.tool, d.args)       # ACT
        observations += summarize(d, res)                 # OBSERVE
        write EPISODIC step                               # LEARN
    return budget-exhausted
The difference from a prompt loop lives in AutonomousDriver.decide. It's a full multi-candidate planner, implementing CoALA's planning sub-stages explicitly:
- Proposal: one structured LLM call proposes up to K {tool, args, rationale}-or-{finish, message} candidates (matrix.runtime.decision-candidates, default 3).
- Evaluation: a second LLM call scores each candidate for how well it advances the objective.
- Selection: argmax over the scores; an unknown or invalid tool name terminates safely.
This is the deliberate propose→evaluate→select procedure the paper says most agents lack — "most works are still confined to proposing a single action." The planner reasons over the full WorkingMemory projection: persona + objective + recalled memory + action space + progress so far. (Those structured LLM calls go through VertexTextClient.generateForStructuredOutput with thinking disabled — a hard-won setting; Gemini 2.5 Flash otherwise burns its whole budget on thinking tokens.)
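The selection stage can be sketched as an argmax with a safe fallback. In this illustration the evaluator is passed in as a plain function so the LLM scoring call stays out of the picture; the Candidate record, SelectionSketch, and the fallback message are hypothetical, not the real AutonomousDriver types.

```java
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Hypothetical candidate shape: either a tool call or a finish.
record Candidate(String tool, String args, String rationale, boolean finish) {}

final class SelectionSketch {
    private SelectionSketch() {}

    // Argmax over evaluator scores. If nothing was proposed, or the winner names a
    // tool outside the action space, return a finish candidate instead of acting.
    static Candidate select(List<Candidate> proposals,
                            ToDoubleFunction<Candidate> evaluator,
                            Set<String> knownTools) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : proposals) {
            double score = evaluator.applyAsDouble(c);
            if (score > bestScore) { // strict '>' keeps the first candidate on ties
                best = c;
                bestScore = score;
            }
        }
        boolean invalid = best == null || (!best.finish() && !knownTools.contains(best.tool()));
        if (invalid) {
            return new Candidate(null, null, "no valid action; terminating safely", true);
        }
        return best;
    }
}
```

Keeping selection a pure function of proposals and scores makes the decision step easy to unit-test without a model in the loop.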
Two production-minded guards worth calling out:
- Graceful degradation. VertexTextClient is gated behind matrix.gcp.auth.enabled. When the bean is absent (local boot, say), the driver finishes immediately rather than hanging a task forever.
- The unified-principal fix. This wasn't cosmetic; it closed a real privilege-escalation gap. The autonomous path used to run as a platform-admin context that bypasses access control entirely. It now runs under a non-admin agent principal, so decide sees mode = AUTONOMOUS and composes the agent's own grant. The same agentCtx rides onto every tool execution, so RBAC applies identically to interactive and autonomous agents. That is exactly the kind of hole an unstructured loop leaves open and an architecture closes by construction.
Kick one off and watch it think:
    POST /api/orgs/{slug}/tasks
    { "name": "...", "agentId": <id>, "assigneeKind": "AGENT",
      "payload": { "goal": "<objective>", "maxSteps": 6 } }
The TaskRun accumulates the step log in stepsJson (one row per propose→evaluate→select→act→observe cycle), writes an EPISODIC memory row per step plus the final result, and ends COMPLETED (objective met) or FAILED (step budget exhausted).
Be candid: what's validated, what's deferred
A cognitive architecture you can't audit against its blueprint is just vocabulary. So we wrote one down. COALA_VALIDATION.md is a point-by-point scorecard mapping each CoALA component (working/episodic/semantic/procedural memory, the four action types, the decision cycle, the modular Memory/Action/Agent classes) to the implementing file:line, with a conformance rating and an honest list of gaps.
The verdict it reaches: Matrix is an unusually literal implementation — the three CoALA dimensions map close to 1:1, the autonomous loop implements the deliberate cycle the paper says most agents skip, and the recall ranking (α·similarity + β·recency + γ·importance) reproduces a function the paper names from Generative Agents.
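The recall ranking formula can be sketched directly. This assumes each signal is already normalized to [0, 1]; the MemoryRow record, RecallRanker, and the weight values used in any call are hypothetical, since the post does not state the real weights or row shape.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical memory row with the three signals already normalized to [0, 1].
record MemoryRow(String text, double similarity, double recency, double importance) {}

final class RecallRanker {
    private final double alpha;
    private final double beta;
    private final double gamma;

    RecallRanker(double alpha, double beta, double gamma) {
        this.alpha = alpha;
        this.beta = beta;
        this.gamma = gamma;
    }

    // score = α·similarity + β·recency + γ·importance
    double score(MemoryRow m) {
        return alpha * m.similarity() + beta * m.recency() + gamma * m.importance();
    }

    // Highest score first; the input list is left untouched.
    List<MemoryRow> rank(List<MemoryRow> rows) {
        return rows.stream()
                .sorted(Comparator.<MemoryRow>comparingDouble(this::score).reversed())
                .toList();
    }
}
```

One consequence worth noticing: if every row carries the same importance value, the γ term adds the same offset to every score and cannot change the ordering, whatever γ is set to.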
And the deferrals, stated plainly:
- Procedural self-editing. CoALA's procedural memory is the agent's own prompt and skills. Matrix learns episodic and semantic memory; it does not silently rewrite its own persona. Phase 1 has shipped — dark, two gates off by default — letting a self-improving agent propose a persona/skill edit that a human approves before it takes effect. That's the exact safety posture the paper prescribes: the agent never mutates its own code, it only ever proposes.
- Full bitemporal "as-of" queries. We filter superseded memory rows so the latest fact wins, but we don't yet answer "what was true on date X."
- Reasoning-scored importance. The property and the ranking weight are wired; the score is currently a fixed 0.5, so the importance signal is inert until we wire reasoning-based scoring.
- Voice tool-call indirection. The voice CallSession reaches ToolDispatcher directly (the shared ACT executor that logs the typed category) rather than routing through AgentRuntime.act(). The unification is real at the contract, memory, and principal layers; the literal for-loop is implemented for autonomous, and threading it through voice would add a per-call dependency for no behavior change.
Notably, every one of those deferred items is something the paper itself flags as understudied, riskier, or frontier — so deferring them is consistent with its own guidance, not a corner we cut.
The takeaway
A prompt loop is control flow. A cognitive architecture is a contract. Matrix commits to the contract: one working-memory hub, one typed action space, one decision cycle, one principal — and Agent.mode as the single seam between a real-time conversation and an unattended task. The dividend is the thing you actually want as an engineering leader: interactive and autonomous agents you can reason about, govern, and extend together, instead of maintaining two of everything and praying they don't drift.
If you're building on a while-loop today, the migration isn't a rewrite. It's drawing the four seams — and then never having to choose between "chatbot" and "autonomous agent" again.
Keep reading
- One Decision Cycle for Interactive and Autonomous Agents (the seam-by-seam deep dive).
- Multi-Candidate Decisions: Don't Let Agents Take the First Idea (inside AutonomousDriver's propose→evaluate→select).
- Working, Episodic, Semantic, Procedural: The Four Agent Memories (the memory model the cycle reads and writes).
Build agents on an architecture, not a loop. Create a workspace and point an autonomous agent at a real objective — or read the full runtime in docs/COGNITIVE_CORE.md and the conformance scorecard in docs/COALA_VALIDATION.md.