Agent patterns in cyllama
This document maps the canonical agent patterns from the literature onto their cyllama implementations. Each entry says:
- Status — first-class (canned helper or primitive), recipe (you compose existing primitives yourself), or not supported.
- How to do it — the actual cyllama API call, with pointers to source files.
- Gap — what's missing or possible as a future refinement, if anything.
The six "worth-doing" pattern gaps identified in the original audit have all landed; the Gap summary records what shipped and which refinements remain. Patterns the framework intentionally does not support are listed explicitly under "Explicitly not on the roadmap".
For the broader question "should I use schema or contracts for X?", see contracts.md and ../dev/contract-agent.md.
1. ReAct (Reason + Act)
Status: first-class primitive. Reference: ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022).
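```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool

@tool
def search(query: str) -> str:
    """Search a knowledge source and return the findings."""
    ...  # stub: replace with a real search implementation

agent = ReActAgent(llm=LLM("model.gguf"), tools=[search])
result = agent.run("What is the capital of France?")
```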
Implementation: src/cyllama/agents/react.py. Each iteration emits THOUGHT, ACTION, and OBSERVATION events on the stream. Loop detection (_loop_detection.py) catches stuck patterns, argument coercion (tools.coerce_args) catches malformed arguments before dispatch, and per-tool timeouts (Tool.timeout) cap individual tool calls.
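To make the control flow concrete, here is a minimal, library-free sketch of a ReAct-style loop with an iteration cap and a naive repeat detector. It is not cyllama's implementation: the `Action: tool[arg]` / `Answer: ...` text protocol, the scripted fake model, and the "same call twice" loop rule are all assumptions made for illustration.

```python
import re
from typing import Callable

# Hypothetical text protocol for this sketch (not cyllama's prompt format).
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")
ANSWER_RE = re.compile(r"Answer:\s*(.*)")

def react_loop(llm: Callable[[str], str], tools: dict[str, Callable[[str], str]],
               task: str, max_iterations: int = 10) -> str:
    transcript = f"Task: {task}\n"
    seen_actions = set()
    for _ in range(max_iterations):
        step = llm(transcript)              # model emits a THOUGHT plus an ACTION or answer
        transcript += step + "\n"
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip()
        action = ACTION_RE.search(step)
        if action is None:
            transcript += "Observation: could not parse an action.\n"
            continue
        call = (action.group(1), action.group(2))
        if call in seen_actions:            # naive loop detection: identical repeated call
            return "stopped: repeated action"
        seen_actions.add(call)
        tool = tools.get(call[0])
        observation = tool(call[1]) if tool else f"unknown tool {call[0]!r}"
        transcript += f"Observation: {observation}\n"  # OBSERVATION fed back to the model
    return "stopped: max_iterations reached"

def scripted_llm() -> Callable[[str], str]:
    """Fake model that replays two canned steps."""
    replies = iter([
        "Thought: I should look it up.\nAction: search[capital of France]",
        "Thought: The observation answers it.\nAnswer: Paris",
    ])
    return lambda prompt: next(replies)

tools = {"search": lambda q: "Paris is the capital of France."}
print(react_loop(scripted_llm(), tools, "What is the capital of France?"))  # Paris
```

The iteration cap and the repeat check are the two cheapest guards against a model that never converges; cyllama's own loop detection is richer than this single rule.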
Gap: none. This is the strongest pattern in the framework.
2. Plan-and-Execute
Status: first-class helper. Reference: Plan-and-Solve Prompting (Wang et al., 2023).
A planner agent emits a structured task list; an executor runs each step sequentially.
```python
from cyllama.agents import ConstrainedAgent, ReActAgent, plan_and_execute

# planner_llm, worker_llm, read_file and edit_file are defined elsewhere.
planner = ConstrainedAgent(
    llm=planner_llm,
    tools=[],
    system_prompt="Emit a JSON list of step strings.",
)
executor = ReActAgent(llm=worker_llm, tools=[read_file, edit_file])

results = plan_and_execute(planner, executor, "Refactor module X.")
for step_result in results:
    print(step_result.answer)
```
plan_and_execute(planner, executor, task, plan_parser=None, stop_on_error=True) lives in composition.py. The default plan_parser first tries json.loads, accepting either a top-level list or a {"steps"|"plan"|"tasks": [...]} object, and otherwise falls back to splitting on newlines and stripping bullet or number prefixes. Pass a custom callable for other plan formats.
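The documented parsing behavior can be approximated in a few lines. This is a hypothetical re-creation for illustration (`parse_plan` is not a cyllama name), and the exact set of bullet and number prefixes it strips is an assumption; composition.py holds the real rules.

```python
import json
import re

# Assumed prefixes: "-", "*", a bullet character, "1.", "2)" and similar.
_PREFIX_RE = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s*")

def parse_plan(text: str) -> list[str]:
    """Sketch of a plan parser: JSON first, then one step per line."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    # Unwrap {"steps" | "plan" | "tasks": [...]} objects.
    if isinstance(data, dict):
        for key in ("steps", "plan", "tasks"):
            if isinstance(data.get(key), list):
                data = data[key]
                break
    if isinstance(data, list):
        return [str(step).strip() for step in data if str(step).strip()]
    # Fallback: non-empty lines with bullet/number prefixes removed.
    steps = []
    for line in text.splitlines():
        stripped = _PREFIX_RE.sub("", line).strip()
        if stripped:
            steps.append(stripped)
    return steps

print(parse_plan("1. Read the file\n2) Edit it\n- Run tests"))
# ['Read the file', 'Edit it', 'Run tests']
```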
Gap: none for the basic pattern. Streaming events from each step through a unified iterator is a possible future extension.
3. Reflection / Reflexion
Status: first-class helper. Reference: Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023).
A worker emits a draft; a critic agent emits acceptance or revision feedback; the loop iterates up to N times.
```python
from cyllama.agents import ReActAgent, ReflectionLoop

# llm, WORKER_PROMPT and the worker's tools are defined elsewhere.
worker = ReActAgent(llm=llm, tools=[...], system_prompt=WORKER_PROMPT)
critic = ReActAgent(
    llm=llm,
    tools=[],
    system_prompt=(
        "Review the draft. Respond with 'ACCEPT' if correct, "
        "otherwise list specific issues."
    ),
)

loop = ReflectionLoop(worker, critic, max_attempts=3)
result = loop.run("Implement quicksort with a 3-way partition.")
```
ReflectionLoop(worker, critic, max_attempts=3, acceptance_marker="ACCEPT", critique_prefix=..., revision_template=...) lives in composition.py. The critic's answer is checked for a case-insensitive substring match on acceptance_marker; on a miss, the worker is re-invoked with the task augmented by the draft and the critic's feedback (the wording is overridable via revision_template). Streams emit worker events tagged source="worker" and critic events tagged source="critic", with parent_event_id linking each pass; the loop's own final ANSWER event has no source.
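The control flow above fits in one short function. In this sketch `worker` and `critic` are plain prompt-to-text callables rather than agents, and the revision template wording is an assumption, not ReflectionLoop's default revision_template.

```python
from typing import Callable

# Assumed wording; ReflectionLoop's actual default template may differ.
REVISION_TEMPLATE = (
    "{task}\n\nPrevious draft:\n{draft}\n\nReviewer feedback:\n{feedback}\n\n"
    "Revise the draft to address the feedback."
)

def reflection_loop(worker: Callable[[str], str], critic: Callable[[str], str],
                    task: str, max_attempts: int = 3,
                    acceptance_marker: str = "ACCEPT") -> tuple[str, bool]:
    """Return (last_draft, accepted)."""
    prompt = task
    draft = ""
    for _ in range(max_attempts):
        draft = worker(prompt)
        feedback = critic(draft)
        # Case-insensitive substring match on the acceptance marker.
        if acceptance_marker.lower() in feedback.lower():
            return draft, True
        prompt = REVISION_TEMPLATE.format(task=task, draft=draft, feedback=feedback)
    return draft, False

drafts = iter(["v1", "v2"])
worker = lambda prompt: next(drafts)
critic = lambda draft: "accept" if draft == "v2" else "Missing the 3-way partition."
print(reflection_loop(worker, critic, "Implement quicksort."))  # ('v2', True)
```

One consequence of substring matching is that feedback such as "not acceptable" also contains the marker, so the critic prompt should reserve the marker word for genuine acceptance.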
Gap: none for the basic pattern. Parallel critic ensembles (multiple critics voting) and reward-model-based acceptance are possible future extensions.
4. Tree of Thoughts (ToT)
Status: not supported. Reference: Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023).
ToT requires forking the generation state at each thought node and exploring multiple branches before committing — typically with KV-cache snapshotting so divergent branches don't recompute the shared prefix. cyllama's public API does not expose KV-cache snapshot / restore primitives, and the agent loops are linear (single stream of events, no branching).
Implementing ToT would require (a) public LlamaContext.snapshot() / restore() methods (currently absent) and (b) a branching agent loop that maintains a frontier of scored candidate states. Significant new machinery; non-trivial value for cyllama's typical user.
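cyllama ships nothing for this. To show what ingredient (b) involves, here is a plain-Python sketch of a scored frontier: a beam-limited breadth-first search over partial thoughts, in the style of ToT's BFS variant. `expand` and `score` are hypothetical stand-ins for LLM calls; without snapshot/restore, every expansion in a real model would re-process the shared prompt prefix.

```python
import heapq
from typing import Callable

def tree_search(root: str,
                expand: Callable[[str], list[str]],
                score: Callable[[str], float],
                beam_width: int = 2,
                depth: int = 3) -> str:
    """Keep the best `beam_width` states at each depth; return the best final state."""
    frontier = [root]
    for _ in range(depth):
        # Branch: every frontier state proposes its candidate next thoughts.
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Prune: only the highest-scoring candidates survive to the next level.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return max(frontier, key=score)

# Toy problem: states are bit strings and the "evaluator" counts ones.
expand = lambda s: [s + "0", s + "1"]
score = lambda s: s.count("1")
print(tree_search("", expand, score, beam_width=2, depth=3))  # 111
```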
Gap: structurally hard. See the Gap summary: ToT is explicitly not on the roadmap unless a use case forces it.
5. Multi-Agent Systems
Status: first-class primitives.
Two building blocks in src/cyllama/agents/composition.py:
- agent_as_tool(agent, name, description) -- wraps any AgentProtocol instance as a Tool so a supervisor can dispatch to it through the normal ReAct flow.
- TieredAgentTeam(supervisor, workers) -- ergonomic container for a supervisor plus named workers, each potentially on a different LLM (capability tiering).
```python
from cyllama import LLM
from cyllama.agents import ReActAgent, AgentRole, TieredAgentTeam

# researcher and coder are worker agents built elsewhere, possibly on smaller LLMs.
team = TieredAgentTeam(
    supervisor=ReActAgent(llm=LLM("strong.gguf"), tools=[]),
    workers=[
        AgentRole("researcher", researcher, "Find facts."),
        AgentRole("coder", coder, "Modify code."),
    ],
)
result = team.run("Refactor X using technique Y.")
```
Sub-agent events surface in the supervisor's stream with source and parent_event_id annotations (AgentEvent fields) when a forward_events callback is supplied. See agents_overview.md for the full reference.
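Conceptually, wrapping an agent as a tool means hiding its run behind a tool name and description, and event forwarding means re-tagging each sub-agent event before handing it to a callback. The sketch below illustrates that idea with plain dataclasses. `Event`, `UppercaseAgent`, `ToolSpec`, and `wrap_agent` are hypothetical and do not mirror cyllama's AgentEvent or Tool classes; only the source and parent_event_id field names come from this page.

```python
import itertools
from dataclasses import dataclass, replace
from typing import Callable, Iterator, Optional

_ids = itertools.count(1)

@dataclass(frozen=True)
class Event:
    kind: str
    content: str
    id: int = 0
    source: Optional[str] = None           # which sub-agent produced the event
    parent_event_id: Optional[int] = None  # supervisor event that triggered it

class UppercaseAgent:
    """Toy sub-agent that streams a THOUGHT and then an ANSWER."""
    def stream(self, task: str) -> Iterator[Event]:
        yield Event("THOUGHT", f"thinking about {task}", id=next(_ids))
        yield Event("ANSWER", task.upper(), id=next(_ids))

@dataclass
class ToolSpec:
    name: str
    description: str
    func: Callable[..., str]

def wrap_agent(agent, name: str, description: str,
               forward_events: Optional[Callable[[Event], None]] = None) -> ToolSpec:
    """Expose an agent as a tool; optionally forward re-tagged sub-agent events."""
    def call(task: str, parent_event_id: Optional[int] = None) -> str:
        answer = ""
        for event in agent.stream(task):
            if forward_events is not None:
                forward_events(replace(event, source=name, parent_event_id=parent_event_id))
            if event.kind == "ANSWER":
                answer = event.content
        return answer
    return ToolSpec(name, description, call)

forwarded = []
researcher = wrap_agent(UppercaseAgent(), "researcher", "Find facts.",
                        forward_events=forwarded.append)
print(researcher.func("paris", parent_event_id=99))  # PARIS
print([(e.kind, e.source, e.parent_event_id) for e in forwarded])
# [('THOUGHT', 'researcher', 99), ('ANSWER', 'researcher', 99)]
```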
Cross-process variant: mcp_agent_tool(client, server_name, agent_name, description) wraps a remote agent exposed through MCP as a local Tool. It is symmetric to agent_as_tool, except that the dispatch crosses a process boundary:
```python
from cyllama.agents import McpClient, McpServerConfig, mcp_agent_tool, ReActAgent

client = McpClient([McpServerConfig(...)])  # server configuration elided
client.connect_all()

remote = mcp_agent_tool(
    client,
    server_name="research",
    agent_name="web_search",
    description="Search the web and return findings.",
)
supervisor = ReActAgent(llm=llm, tools=[remote])  # llm defined elsewhere
```
The wrapped tool is named "{server_name}/{agent_name}", the namespaced format the action parser supports. Network failures surface as RuntimeError from McpClient.call_tool, and the agent's generic exception handler catches them. Pass timeout=... to apply a local budget (separate from MCP transport timeouts).
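A local budget of this kind can be approximated by running the remote call on a worker thread and waiting with a timeout. This sketches the general technique, not the code behind timeout=...; `call_remote` is a hypothetical stand-in for McpClient.call_tool, and raising RuntimeError on timeout is a choice made here to match the failure type described above.

```python
import concurrent.futures as cf
import time
from typing import Any, Callable

# Shared worker pool. A call that times out is abandoned, not cancelled: its
# thread stays busy until the remote call eventually returns.
_POOL = cf.ThreadPoolExecutor(max_workers=4)

def with_local_timeout(call_remote: Callable[[str, str, dict], Any],
                       server_name: str, agent_name: str,
                       timeout: float) -> tuple[str, Callable[..., Any]]:
    tool_name = f"{server_name}/{agent_name}"  # namespaced tool name

    def call(**arguments: Any) -> Any:
        future = _POOL.submit(call_remote, server_name, agent_name, arguments)
        try:
            return future.result(timeout=timeout)
        except cf.TimeoutError:
            raise RuntimeError(f"{tool_name} exceeded local budget of {timeout}s") from None

    return tool_name, call

def fake_call_tool(server: str, agent: str, arguments: dict) -> str:
    """Stand-in for a remote call; sleeps to simulate latency."""
    time.sleep(arguments.get("delay", 0))
    return f"{server}/{agent} got {arguments['query']}"

name, web_search = with_local_timeout(fake_call_tool, "research", "web_search", timeout=0.2)
print(name)                       # research/web_search
print(web_search(query="llama"))  # research/web_search got llama
try:
    web_search(query="slow", delay=0.5)
except RuntimeError as exc:
    print(exc)                    # research/web_search exceeded local budget of 0.2s
```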
Gap: streaming sub-agent events across the wire (the current MCP shape returns a single value per call). Possible future extension if the MCP server-streaming RFC stabilizes.
6. Memory-Augmented Agents
Status: first-class. Session stores cover short-term and episodic memory; SemanticMemory adds semantic (long-term) memory.
cyllama ships three session-storage backends in src/cyllama/agents/session.py:
- MemorySessionStore -- in-process dict; contents are lost when the process exits.
- FileSessionStore -- JSON files on disk; multi-process access coordinated through OS locks.
- SqliteSessionStore -- SQLite-backed; supports concurrent readers.
Each store holds Message, ToolCallRecord, and Permission records keyed by session id. This covers short-term (within a run) and episodic (across runs, with a persistent backend) memory.
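To illustrate what "keyed by session id" means for the SQLite case, here is a minimal message store built on the standard-library sqlite3 module. The table layout and the `append` / `history` method names are hypothetical; SqliteSessionStore's real schema also covers ToolCallRecord and Permission records.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

class TinySessionStore:
    """Hypothetical sketch: one table of messages, keyed by session id."""

    def __init__(self, path: str = ":memory:") -> None:
        # Use a file path instead of ":memory:" for memory that survives restarts.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " session_id TEXT NOT NULL,"
            " role TEXT NOT NULL,"
            " content TEXT NOT NULL)"
        )

    def append(self, session_id: str, message: Message) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (session_id, message.role, message.content),
            )

    def history(self, session_id: str) -> list[Message]:
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY seq",
            (session_id,),
        )
        return [Message(role, content) for role, content in rows]

store = TinySessionStore()
store.append("s1", Message("user", "hi"))
store.append("s2", Message("user", "other session"))
store.append("s1", Message("assistant", "hello"))
print([m.content for m in store.history("s1")])  # ['hi', 'hello']
```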
Semantic (long-term) memory is provided by SemanticMemory in src/cyllama/agents/memory.py:
```python
from cyllama.rag import RAG
from cyllama.agents import SemanticMemory

rag = RAG(...)  # configuration elided
memory = SemanticMemory(rag)

# Write into a per-user namespace.
memory.remember(
    "The user prefers concise answers.",
    namespace="user:alice",
)

# Recall later.
hits = memory.retrieve("response style", namespace="user:alice")
for hit in hits:
    print(hit.text, hit.score)
```
SemanticMemory(rag, namespace_field="_memory_namespace", default_namespace="default") is a thin, namespace-aware facade over any RAG-shaped object (one that provides add_texts and search). Records in different namespaces share the same embedding store but are filtered apart at retrieval time, so one cyllama-managed RAG instance can back many logical memory buckets (per-user, per-conversation, per-topic).
retrieve() over-fetches from the underlying search so the namespace filter has enough candidates to fill the result. remember() defaults to split=False because memory fragments are typically short. forget() exists but currently raises NotImplementedError, because the underlying RAG store doesn't expose metadata-filtered deletion; the documented workaround is to use a separate RAG instance per namespace when you need per-namespace clearing.
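The strategy described above (shared store, namespace kept in metadata, over-fetch then filter) can be sketched against a toy RAG-shaped object. `ToyRAG` scores by word overlap instead of embeddings, and the `add_texts(texts, metadatas=...)` / `search(query, top_k=...)` signatures, the `Hit` shape, and the over-fetch factor of 4 are assumptions rather than cyllama's actual values.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Hit:
    text: str
    score: float
    metadata: dict = field(default_factory=dict)

class ToyRAG:
    """Stand-in for a RAG store: scores by word overlap instead of embeddings."""
    def __init__(self) -> None:
        self.items: list[tuple[str, dict]] = []

    def add_texts(self, texts: list[str], metadatas: Optional[list[dict]] = None) -> None:
        metadatas = metadatas or [{} for _ in texts]
        self.items.extend(zip(texts, metadatas))

    def search(self, query: str, top_k: int = 5) -> list[Hit]:
        words = set(query.lower().split())
        hits = [Hit(text, len(words & set(text.lower().split())) / max(len(words), 1), meta)
                for text, meta in self.items]
        return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]

class NamespacedMemory:
    """Hypothetical facade: namespace stored in metadata, filtered after search."""
    def __init__(self, rag, namespace_field: str = "_memory_namespace",
                 default_namespace: str = "default", overfetch: int = 4) -> None:
        self.rag = rag
        self.field = namespace_field
        self.default = default_namespace
        self.overfetch = overfetch  # assumed factor; the real value may differ

    def remember(self, text: str, namespace: Optional[str] = None) -> None:
        self.rag.add_texts([text], metadatas=[{self.field: namespace or self.default}])

    def retrieve(self, query: str, namespace: Optional[str] = None, top_k: int = 3) -> list[Hit]:
        ns = namespace or self.default
        # Over-fetch so enough candidates survive the namespace filter.
        candidates = self.rag.search(query, top_k=top_k * self.overfetch)
        return [h for h in candidates if h.metadata.get(self.field) == ns][:top_k]

memory = NamespacedMemory(ToyRAG())
memory.remember("prefers concise answers", namespace="user:alice")
memory.remember("prefers detailed answers", namespace="user:bob")
print([h.text for h in memory.retrieve("concise answers", namespace="user:alice")])
# ['prefers concise answers']
```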
Gap: filtered deletion, pending a RAG-side API. Long-term "profile" semantics (richer than text fragments -- structured user preferences, time-decay, summarization) are a possible future extension, but the primitive already covers the common cases.
7. Retrieval-Augmented Agents (RAG agents)
Status: first-class helper.
cyllama ships a complete RAG pipeline in src/cyllama/rag/. The rag_as_tool helper bridges it to the agent layer:
```python
from cyllama.rag import RAG
from cyllama.agents import ReActAgent, rag_as_tool

kb = RAG(...)  # load or build the knowledge base
search = rag_as_tool(
    kb,
    name="search_docs",
    description="Search the project documentation.",
    top_k=5,
)
agent = ReActAgent(llm=llm, tools=[search])  # llm defined elsewhere
```
rag_as_tool(rag, name="search_kb", description=..., top_k=5, query_param="query", method="search", formatter=None) lives in composition.py. The default formatter emits one "[score] text" line per hit, deduplicated by text so the agent doesn't see repeated content. Pass method="retrieve" to use the higher-level RAG.retrieve path (which applies the RAG pipeline config), or pass a custom formatter callable to control the observation shape.
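The default observation shape can be mimicked with a small function. The `(text, score)` hit representation, the two-decimal score format, and the "No results." fallback are assumptions; the real default formatter may format scores differently, and a custom formatter's exact call signature is defined in composition.py.

```python
from typing import Iterable

def format_hits(hits: Iterable[tuple[str, float]]) -> str:
    """One '[score] text' line per hit, skipping texts already emitted."""
    seen: set[str] = set()
    lines = []
    for text, score in hits:
        if text in seen:
            continue  # deduplicate by text so the agent sees each passage once
        seen.add(text)
        lines.append(f"[{score:.2f}] {text}")
    return "\n".join(lines) if lines else "No results."  # assumed empty-result message

print(format_hits([
    ("Install with pip.", 0.91),
    ("Install with pip.", 0.88),
    ("Build from source.", 0.72),
]))
# [0.91] Install with pip.
# [0.72] Build from source.
```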
Gap: streaming results to the agent as they arrive (currently returns a single concatenated observation). Possible future extension.
8. Autonomous / AutoGPT-style
Status: not supported, by design.
The Auto-GPT / BabyAGI pattern is recursive goal decomposition with no external bound: the agent generates its own subgoals, executes them, generates more subgoals, etc. The literature consistently reports goal drift, infinite loops, cost explosion, and unreliability.
cyllama's design leans in the opposite direction:
- ReActAgent.max_iterations caps every run (default 10).
- _loop_detection.py actively terminates runs that show signs of repetition.
- ContractAgent lets you express budget invariants (elapsed_ms < 30_000, tool_calls < 20, errors < 3) that enforce hard ceilings.
These are the opposite of what an autonomous agent needs. If you genuinely want unbounded goal-decomposition, cyllama is the wrong framework — use an explicitly autonomy-oriented one and accept the trade-offs.
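The budget invariants above amount to a check against live counters on every iteration. This plain-Python sketch shows that idea with the same three ceilings; it is not ContractAgent's API, `Budget` and `run_bounded` are hypothetical names, and counting every step as one tool call is a simplification.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Budget:
    start: float = field(default_factory=time.monotonic)
    tool_calls: int = 0
    errors: int = 0

    @property
    def elapsed_ms(self) -> float:
        return (time.monotonic() - self.start) * 1000

    def violation(self) -> Optional[str]:
        """Return the first broken invariant, or None if all hold."""
        if self.elapsed_ms >= 30_000:
            return "elapsed_ms < 30_000"
        if self.tool_calls >= 20:
            return "tool_calls < 20"
        if self.errors >= 3:
            return "errors < 3"
        return None

def run_bounded(step: Callable[[int], bool], max_iterations: int = 10) -> str:
    """Call step(i) until it reports completion, an invariant breaks, or the cap is hit."""
    budget = Budget()
    for i in range(max_iterations):
        broken = budget.violation()
        if broken is not None:
            return f"stopped: invariant {broken} violated"
        budget.tool_calls += 1  # simplification: each step counts as one tool call
        try:
            if step(i):
                return "done"
        except Exception:
            budget.errors += 1
    return "stopped: max_iterations reached"

def always_fails(i: int) -> bool:
    raise RuntimeError("tool failed")

print(run_bounded(always_fails))  # stopped: invariant errors < 3 violated
```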
Gap: not on the roadmap. Documented here so readers know the framework's stance.
9. Workflow / State-Machine Agents
Status: first-class.
cyllama.agents.workflow ships an explicit DAG / state-graph runtime with two authoring layers over the same compile-and-execute model. Nodes are typed Python callables (or agents, tools, or other workflows); edges are static or conditional; state is a typed dict threaded between nodes; and independent nodes at the same topological level run concurrently.
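To picture what "same topological level" means at run time, the sketch below groups nodes by dependency depth (Kahn's algorithm) and runs each group with asyncio.gather, dispatching sync callables with asyncio.to_thread and bounding each call with asyncio.wait_for. It is a plain-asyncio illustration, not CompiledWorkflow; the `deps` mapping, the rule that a node receives the state keys named by its parameters, and the rule that a node's return value is stored under its own name are assumptions for this example. Requires Python 3.9+.

```python
import asyncio
import inspect
from typing import Any, Callable

def topological_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes so that every node's dependencies sit in an earlier level."""
    remaining = {node: set(d) for node, d in deps.items()}
    levels = []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle detected")
        levels.append(ready)
        for n in ready:
            del remaining[n]
        for d in remaining.values():
            d.difference_update(ready)
    return levels

async def _call(fn: Callable[..., Any], state: dict, timeout: float) -> Any:
    # Pass only the state keys the node asks for by parameter name.
    kwargs = {p: state[p] for p in inspect.signature(fn).parameters}
    if inspect.iscoroutinefunction(fn):
        coro = fn(**kwargs)
    else:
        coro = asyncio.to_thread(fn, **kwargs)  # keep sync nodes off the event loop
    return await asyncio.wait_for(coro, timeout)

async def run_levels(nodes: dict[str, Callable[..., Any]], deps: dict[str, set[str]],
                     state: dict, timeout: float = 5.0) -> dict:
    state = dict(state)
    for level in topological_levels(deps):
        results = await asyncio.gather(*(_call(nodes[n], state, timeout) for n in level))
        state.update(zip(level, results))  # each node's output stored under its name
    return state

def classify(query: str) -> dict:
    return {"topic": "geography", "confidence": 0.9}

async def facts(query: str) -> str:
    return f"facts about {query}"

def answer(classify: dict, facts: str) -> str:
    return f"{classify['topic']}: {facts}"

nodes = {"classify": classify, "facts": facts, "answer": answer}
deps = {"classify": set(), "facts": set(), "answer": {"classify", "facts"}}
final = asyncio.run(run_levels(nodes, deps, {"query": "France"}))
print(topological_levels(deps))  # [['classify', 'facts'], ['answer']]
print(final["answer"])           # geography: facts about France
```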
```python
from cyllama.agents.workflow import Workflow

flow = Workflow()

# Layer C -- decorator sugar; dependencies inferred from parameter names.
# score, guess_topic, llm and queue_human_review are defined elsewhere.
@flow.node
def classify(query: str) -> dict:
    confidence = score(query)
    return {"confidence": confidence, "topic": guess_topic(query)}

@flow.node
def answer(classify: dict) -> str:
    return llm.complete(f"Topic={classify['topic']}: {classify}").text

@flow.node
def escalate(classify: dict) -> str:
    return queue_human_review(classify)

# Conditional router decides which downstream node runs.
@flow.route(after="classify")
def route(classify: dict) -> str:
    return "answer" if classify["confidence"] > 0.7 else "escalate"

flow.set_entry("classify")
result = flow.run(query="What is the capital of France?")
print(result.answer)  # state["answer"] or state["escalate"]
```
```python
# Layer B -- explicit StateGraph; the canonical runtime form.
from cyllama.agents.workflow import Workflow, END

# classify_fn, answer_fn and escalate_fn are plain callables defined elsewhere.
flow = Workflow()
flow.add_node("classify", classify_fn)
flow.add_node("answer", answer_fn)
flow.add_node("escalate", escalate_fn)
flow.add_conditional_edge(
    "classify",
    lambda s: "answer" if s["confidence"] > 0.7 else "escalate",
    {"answer": "answer", "escalate": "escalate"},
)
flow.set_entry("classify")
result = flow.run(query="...")
```
A workflow's native API is kwargs-only (flow.run(**state)). To plug a workflow into the multi-agent layer, call flow.as_agent() to get an AgentProtocol-conformant adapter; agent_as_tool(flow.as_agent(), ...), ReflectionLoop(flow.as_agent(), critic.as_agent(), ...), and TieredAgentTeam(workers=[AgentRole("research", flow.as_agent(), ...)]) all work through the same explicit-adapter pattern. Workflows nest inside one another via workflow_node(inner, name="research"); inner events are forwarded into the outer event stream with source / parent_event_id set, so streaming UIs can render the tree.
Other capabilities (all landed in the Phase 1-5 rollout):
- Parallel level execution (asyncio.gather); sync nodes are dispatched on asyncio.to_thread.
- Per-node timeout= using asyncio.wait_for.
- WorkflowInvariant, reusing ContractPolicy semantics, for workflow-level pre/postconditions.
- Reducer registry (reducer.append/extend/merge_dict/add/last) for explicit multi-writer state keys; the runtime detects unreduced collisions.
- flow.stream(...) / flow.astream(...) event streams with WORKFLOW_START / NODE_START / NODE_END / ANSWER / WORKFLOW_END / CONTRACT_VIOLATION events.
- flow.to_mermaid() / flow.to_dot() for graph rendering; flow.dry_run() for inspecting execution order without running.
- PEP 484 Workflow[StateT] generic for typed state.
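The reducer rule (keys written by several nodes in one level need an explicit merge function; anything else is a collision) can be illustrated with a small merge helper. The reducer names mirror the list above, but their semantics here (append adds one item, extend concatenates, merge_dict is a shallow dict merge, add sums, last keeps the final write), the `merge_level` helper, and the ValueError it raises are assumptions.

```python
from typing import Any, Callable

# Hypothetical reducer semantics, named after the reducers listed above.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {
    "append": lambda old, new: (old or []) + [new],
    "extend": lambda old, new: (old or []) + list(new),
    "merge_dict": lambda old, new: {**(old or {}), **new},
    "add": lambda old, new: (old or 0) + new,
    "last": lambda old, new: new,
}

def merge_level(state: dict, writes: list[dict], reducers: dict[str, str]) -> dict:
    """Fold the writes of one parallel level into state.

    Keys written by more than one node must declare a reducer; otherwise the
    collision is reported instead of silently letting one write win.
    """
    writers: dict[str, int] = {}
    for w in writes:
        for key in w:
            writers[key] = writers.get(key, 0) + 1
    collisions = [k for k, n in writers.items() if n > 1 and k not in reducers]
    if collisions:
        raise ValueError(f"unreduced multi-writer keys: {sorted(collisions)}")
    merged = dict(state)
    for w in writes:
        for key, value in w.items():
            if key in reducers:
                merged[key] = REDUCERS[reducers[key]](merged.get(key), value)
            else:
                merged[key] = value
    return merged

state = {"notes": ["seed"]}
writes = [{"notes": "from A", "count": 1}, {"notes": "from B"}]
print(merge_level(state, writes, {"notes": "append"}))
# {'notes': ['seed', 'from A', 'from B'], 'count': 1}
```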
Reference: src/cyllama/agents/workflow.py; design and per-phase notes in workflow.md; 118 tests in tests/test_agents_workflow.py.
Summary table
| Pattern | Status | Where |
|---|---|---|
| 1. ReAct | First-class | ReActAgent (react.py) |
| 2. Plan-and-Execute | First-class helper | plan_and_execute() (composition.py) |
| 3. Reflection / Reflexion | First-class helper | ReflectionLoop (composition.py) |
| 4. Tree of Thoughts | Not supported | requires KV-cache snapshot/restore + branching loop |
| 5. Multi-Agent Systems | First-class | agent_as_tool, TieredAgentTeam, mcp_agent_tool (composition.py) |
| 6. Memory-Augmented | First-class | session stores + SemanticMemory (memory.py) |
| 7. RAG Agents | First-class helper | rag_as_tool() (composition.py) |
| 8. Autonomous / AutoGPT | Intentionally not supported | counter to bounded-loop design stance |
| 9. Workflow / State-Machine | First-class | Workflow / CompiledWorkflow (workflow.py); decorator sugar (@flow.node, @flow.route) over explicit StateGraph |
Gap summary
All six tractable pattern gaps from the original audit have landed. The table below records what was added, where it lives, and the tests that pin each helper.
| Rank | Gap | Status | Where | Tests |
|---|---|---|---|---|
| 1 | ReflectionLoop (Reflection/Reflexion) | Landed | composition.py | tests/test_agents_composition.py::TestReflection* |
| 2 | rag_as_tool (RAG agents) | Landed | composition.py | tests/test_agents_composition.py::test_rag_as_tool_* |
| 3 | SemanticMemory (long-term memory) | Landed | memory.py | tests/test_agents_memory.py |
| 4 | plan_and_execute (Plan-and-Execute) | Landed | composition.py | tests/test_agents_composition.py::test_plan_and_execute_* |
| 5 | mcp_agent_tool (cross-process) | Landed | composition.py | tests/test_agents_composition.py::test_mcp_agent_tool_* |
| 6 | Workflow (DAG / State-Machine, §9) | Landed (Phases 1-5) | workflow.py | tests/test_agents_workflow.py (118 tests) |
Documented gaps that remain (each is a future extension, not a missing fundamental):
- Streaming sub-agent events across MCP -- mcp_agent_tool returns a single value per call; streaming would require the MCP server-streaming RFC to stabilize.
- Streaming RAG results to the agent -- rag_as_tool returns a single concatenated observation today.
- Filtered deletion in SemanticMemory -- forget() raises NotImplementedError pending a RAG-side metadata-filtered delete API.
- Parallel critic ensembles in ReflectionLoop -- multiple critics voting, reward-model-based acceptance.
- Unified streaming for plan_and_execute steps -- one iterator surfacing events from all steps in sequence.
None of these block the pattern; they're refinements waiting on real use cases.
Explicitly not on the roadmap
These appear in the table above but won't be addressed without a forcing use case. Listed here to make the position explicit.
- Tree of Thoughts (§4) -- requires public LlamaContext.snapshot() / restore() (currently absent) plus a branching agent loop with a scored candidate frontier. Significant new machinery; non-trivial value for cyllama's typical user.
- Autonomous / AutoGPT (§8) -- structurally opposed to cyllama's design stance (bounded loops, loop detection, max_iterations, contracts for budget invariants). Unbounded goal-decomposition is what the framework actively prevents.
Positioning consequence
Reading down the Gap summary reveals the pattern: every gap that landed adds reliability or composability (reflection loops, RAG bridging, persistent memory, plan-then-execute, cross-process delegation, explicit workflows). Every "not on the roadmap" item belongs to the unbounded-autonomy or open-ended branching-search half of the agent space.
This isn't accidental. cyllama leans into bounded, observable, schema-constrained behavior (loop detection, max_iterations, Annotated[] markers, ContractAgent invariants); it explicitly declines unbounded autonomy and open-ended tree search, and its graph orchestration (§9) is explicit and bounded. Users who want unbounded autonomy should reach for an autonomy-first framework. Users who want debuggable, schema-grounded, locally-runnable agents are in the right place.
Further reading
- agents_overview.md -- the full agent framework reference: agent types, tools, events, configuration.
- contracts.md -- nine worked recipes for contract patterns and the schema-vs-contract decision.
- ../dev/contract-agent.md -- design rationale for the ContractAgent reposition.