MCP Integration Proposal
Status: design draft. Tracks the MCP integration entry in TODO.md.
Cyllama's relationship to the Model Context Protocol (MCP) splits into two
independent directions, both of which are cyllama-layer concerns -- llama.cpp
upstream has no server-side MCP support; only the Svelte webui ships a
TypeScript client (build/llama.cpp/tools/server/webui/src/lib/utils/mcp.ts).
- Client direction -- a local LLM consumes external MCP servers' tools during a tool-calling loop. Already partially implemented in src/cyllama/agents/mcp.py (stdio + HTTP transports, wired into the agent Tool abstraction). Needs to be lifted onto the top-level LLM / chat() API so non-agent callers benefit too.
- Server direction -- cyllama exposes its inference capabilities as MCP tools and its model catalog as resources, so MCP clients (Claude Code, Claude Desktop, etc.) can drive local GGUF models. New package src/cyllama/mcp/, served via stdio and via Streamable-HTTP routes mounted on the existing EmbeddedServer.
Both sides reuse src/cyllama/agents/jsonrpc.py for framing and the
high-level API in src/cyllama/api.py for execution. No new heavy deps.
1. MCP client at the top-level LLM / chat() API
The transports in agents/mcp.py (McpClient, McpServerConfig,
McpTransportType, get_tools_for_agent()) stay as-is. New surface in
cyllama/api.py:
from cyllama.agents.mcp import (
    McpClient, McpServerConfig, McpTransportType, McpTool, McpResource,
)

class LLM:
    def add_mcp_server(
        self,
        name: str,
        *,
        # stdio transport
        command: str | None = None,
        args: list[str] | None = None,
        env: dict[str, str] | None = None,
        # http transport
        url: str | None = None,
        headers: dict[str, str] | None = None,
        # inferred from which kwargs are set if omitted
        transport: McpTransportType | None = None,
    ) -> None: ...

    def remove_mcp_server(self, name: str) -> None: ...
    def list_mcp_tools(self) -> list[McpTool]: ...
    def list_mcp_resources(self) -> list[McpResource]: ...

    def chat(
        self,
        messages: list[dict],
        *,
        tools: list[Tool] | None = None,
        use_mcp: bool = True,  # auto-include attached servers' tools
        max_tool_iterations: int = 8,
        # ... (remaining parameters elided)
    ) -> ChatResponse: ...
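The "inferred from which kwargs are set" rule for transport could be sketched as follows. This is a minimal illustration, not the planned implementation: TransportKind is a hypothetical stand-in for McpTransportType (whose members are not shown in this proposal), and infer_transport is an assumed helper name.

```python
from __future__ import annotations

from enum import Enum


class TransportKind(Enum):
    """Hypothetical stand-in for cyllama's McpTransportType."""

    STDIO = "stdio"
    HTTP = "http"


def infer_transport(
    command: str | None = None,
    url: str | None = None,
    transport: TransportKind | None = None,
) -> TransportKind:
    """Pick a transport from whichever transport-specific kwarg is set."""
    if command is not None and url is not None:
        raise ValueError("pass either command (stdio) or url (http), not both")
    if command is None and url is None:
        raise ValueError("one of command or url is required")
    inferred = TransportKind.STDIO if command is not None else TransportKind.HTTP
    # An explicit transport must agree with the kwargs that were supplied.
    if transport is not None and transport is not inferred:
        raise ValueError(f"transport={transport.value!r} conflicts with the supplied kwargs")
    return inferred
```

Rejecting ambiguous combinations up front keeps a misconfigured add_mcp_server() call from failing later, at connect time.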
Module-level convenience mirrors:
Internals
- LLM lazily owns a single McpClient.
- chat() pulls client.get_tools_for_agent(), merges with caller-supplied tools, runs the existing tool-call loop, and dispatches MCP tool calls via client.call_tool(name, args).
- Connect on first use; disconnect on LLM.close() / __exit__.
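The merge-and-dispatch flow described above might look roughly like this. It is a sketch under stated assumptions: FakeMcpClient, its tool_names() method, the dict-shaped model replies, and run_tool_loop are all hypothetical; only call_tool(name, args), use_mcp, and max_tool_iterations come from the proposal.

```python
from __future__ import annotations

import json


class FakeMcpClient:
    """Stand-in for McpClient; tool_names() is a hypothetical helper."""

    def __init__(self, handlers: dict):
        self._handlers = handlers

    def tool_names(self) -> list:
        return sorted(self._handlers)

    def call_tool(self, name: str, args: dict):
        return self._handlers[name](**args)


def run_tool_loop(model, messages, local_tools, client,
                  use_mcp=True, max_tool_iterations=8):
    """Run a tool-call loop over caller-supplied and MCP tools.

    `model(history, tool_names)` is assumed to return either
    {"content": str} or {"tool_call": {"name": str, "arguments": dict}}.
    """
    mcp_names = set(client.tool_names()) if (use_mcp and client) else set()
    names = sorted(set(local_tools) | mcp_names)
    history = list(messages)
    for _ in range(max_tool_iterations):
        reply = model(history, names)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        name, args = call["name"], call["arguments"]
        if name in local_tools:
            # Caller-supplied tools take precedence on a name clash.
            result = local_tools[name](**args)
        elif name in mcp_names:
            result = client.call_tool(name, args)
        else:
            result = {"error": f"unknown tool {name!r}"}
        history.append({"role": "tool", "name": name, "content": json.dumps(result)})
    raise RuntimeError("max_tool_iterations exceeded")
```

Letting caller-supplied tools win on a name clash is one possible policy; namespacing MCP tools by server name would be another.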
Open questions
- Sync-only (matches current agents/mcp.py) vs. add async path. Recommend sync first; revisit when an async caller appears.
- Resource handling: surface mcp:// URIs through an explicit read_resource() helper rather than auto-injecting into the system prompt. Auto-injection is hard to undo and easy to abuse.
2. Cyllama as MCP server
New package layout:
src/cyllama/mcp/
    __init__.py
    protocol.py    # MCP method dispatch over JSON-RPC (uses agents/jsonrpc.py)
    tools.py       # registry: name -> (input_schema, handler)
    resources.py   # model listing, gguf metadata
    stdio.py       # `python -m cyllama.mcp` stdio transport
    http.py        # Streamable-HTTP route handlers (mounted on EmbeddedServer)
Tool surface
One MCP tool per high-level capability; all are thin wrappers over the
existing cyllama API.
| MCP tool | Backed by | Inputs (subset) |
|---|---|---|
| complete | cyllama.complete | prompt, model, max_tokens, temperature |
| chat | cyllama.chat | messages, model, tools |
| embed | LLM.embed | input (str or list), model |
| transcribe | whisper high-level | audio_path or base64, language |
| generate_image | cyllama.sd.text_to_image | prompt, width, height, steps |
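A name -> (input_schema, handler) registry of the kind tools.py is meant to hold could be sketched as below. ToolSpec, ToolRegistry, and fake_complete are hypothetical names; fake_complete is only a placeholder for cyllama.complete, whose real signature is not assumed here. The result dictionaries follow the shape MCP uses for tools/list entries and tools/call results.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict          # JSON Schema for the tool arguments
    handler: Callable[..., Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool {spec.name!r}")
        self._tools[spec.name] = spec

    def list_tools(self, allowed=None) -> list:
        """Entries for a tools/list result, optionally limited to a subset."""
        return [
            {"name": s.name, "description": s.description, "inputSchema": s.input_schema}
            for s in self._tools.values()
            if allowed is None or s.name in allowed
        ]

    def call(self, name: str, arguments: dict) -> dict:
        """Run a handler and wrap its output as a tools/call result."""
        spec = self._tools[name]
        missing = [k for k in spec.input_schema.get("required", []) if k not in arguments]
        if missing:
            text = f"missing arguments: {missing}"
            return {"isError": True, "content": [{"type": "text", "text": text}]}
        output = spec.handler(**arguments)
        return {"isError": False, "content": [{"type": "text", "text": str(output)}]}


def fake_complete(prompt: str, max_tokens: int = 16) -> str:
    """Placeholder for cyllama.complete; it just echoes a truncated prompt."""
    return prompt[:max_tokens]


registry = ToolRegistry()
registry.register(ToolSpec(
    name="complete",
    description="Text completion (placeholder handler)",
    input_schema={
        "type": "object",
        "properties": {"prompt": {"type": "string"}, "max_tokens": {"type": "integer"}},
        "required": ["prompt"],
    },
    handler=fake_complete,
))
```

Keeping the schema next to the handler means the tools/list output and the argument check cannot drift apart.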
Resources
- models://local -- list of GGUF files discovered under configured roots.
- model://<name> -- JSON metadata (arch, params, ctx, quant) via existing model-introspection helpers.
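As an illustration of the models://local resource, the sketch below scans the configured roots for .gguf files and wraps the listing in the shape of an MCP resources/read result. The helper name read_models_local and the fields of each listing entry are assumptions; reading GGUF metadata for model://<name> is left out because it depends on the existing introspection helpers.

```python
from __future__ import annotations

import json
from pathlib import Path


def read_models_local(roots: list) -> dict:
    """Build a resources/read result for models://local (hypothetical helper)."""
    models = {}
    for root in roots:
        for path in Path(root).rglob("*.gguf"):
            # The first file found for a given stem wins; later duplicates are skipped.
            models.setdefault(path.stem, str(path))
    listing = [
        {"name": name, "uri": f"model://{name}", "path": models[name]}
        for name in sorted(models)
    ]
    return {
        "contents": [
            {
                "uri": "models://local",
                "mimeType": "application/json",
                "text": json.dumps(listing),
            }
        ]
    }
```

Using the file stem as <name> is the simplest mapping; a real catalog would need a rule for two files with the same stem under different roots.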
Server entry points
# stdio transport (Claude Desktop config: command=python, args=["-m", "cyllama.mcp"])
def serve_stdio(options: McpServerOptions) -> None: ...

# Embedded HTTP transport (mounted into the existing mongoose server)
class EmbeddedServer:
    def enable_mcp(
        self,
        *,
        path: str = "/mcp",
        options: McpServerOptions | None = None,
    ) -> None: ...
McpServerOptions carries:
- Allowed model roots (filesystem allowlist).
- Default model.
- Tool subset to expose (a deployment can expose embed only, etc.).
- Auth hook for the HTTP route.
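One possible shape for these options is sketched below. Every field and method name here (model_roots, default_model, enabled_tools, auth_hook, allows_model, exposes, authorize) is an assumption made for illustration; the proposal only lists what the options carry, not how they are spelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class ServerOptionsSketch:
    """Hypothetical stand-in for McpServerOptions."""

    model_roots: list = field(default_factory=list)   # filesystem allowlist
    default_model: Optional[str] = None
    enabled_tools: Optional[frozenset] = None         # None means expose every tool
    auth_hook: Optional[Callable[[dict], bool]] = None

    def allows_model(self, path) -> bool:
        """True only if the resolved path lies inside one of the allowed roots."""
        resolved = Path(path).resolve()
        for root in self.model_roots:
            root = Path(root).resolve()
            if resolved == root or root in resolved.parents:
                return True
        return False

    def exposes(self, tool_name: str) -> bool:
        return self.enabled_tools is None or tool_name in self.enabled_tools

    def authorize(self, headers: dict) -> bool:
        # With no hook configured this sketch allows the request; a real
        # deployment may prefer to deny by default on the HTTP route.
        return True if self.auth_hook is None else bool(self.auth_hook(headers))
```

Resolving paths before the comparison is what stops a request such as root/../other.gguf from escaping the allowlist.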
Wire-level
Implement only the subset MCP requires today:
- initialize
- tools/list, tools/call
- resources/list, resources/read
- ping
- notifications/initialized
On stdio, reuse agents/jsonrpc.py framing. On HTTP, follow the
Streamable-HTTP spec: a single /mcp endpoint that accepts POST for client
requests and GET to open an SSE stream for server-initiated messages.
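A transport-independent dispatcher for part of this method subset could look like the sketch below. It is illustrative only: handle_message, the hard-coded protocol version, and the serverInfo values are assumptions, version negotiation is omitted, and tools/call, resources/list, and resources/read are left out for brevity. Messages without an id are treated as JSON-RPC notifications and get no response.

```python
from __future__ import annotations

import json
from typing import Optional

# Assumed for the sketch; a real server would negotiate this with the client.
PROTOCOL_VERSION = "2025-03-26"


def handle_message(raw: str, tools: list) -> Optional[str]:
    """Handle one JSON-RPC message; return the response text or None."""
    msg = json.loads(raw)
    method = msg.get("method")
    if "id" not in msg:
        # Notifications such as notifications/initialized are never answered.
        return None
    msg_id = msg["id"]
    if method == "initialize":
        result = {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {"tools": {}, "resources": {}},
            "serverInfo": {"name": "cyllama-sketch", "version": "0.0.0"},
        }
    elif method == "ping":
        result = {}
    elif method == "tools/list":
        result = {"tools": tools}
    else:
        error = {"code": -32601, "message": f"Method not found: {method}"}
        return json.dumps({"jsonrpc": "2.0", "id": msg_id, "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": msg_id, "result": result})
```

Because the function maps text to text, the same dispatcher can sit behind both the stdio loop and the HTTP POST handler.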
Open questions
- Streaming tools/call partial results via SSE: defer until a real client consumes it.
- Concurrency: EmbeddedServer is single-threaded; long-running generate_image calls will block other routes. Either gate behind a worker thread or document the limitation in the route handler.
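The "gate behind a worker thread" option might be sketched as follows, using only the standard library. SlowCallGate and its submit/poll methods are hypothetical, and how the mongoose event loop would poll for completion is not addressed; the sketch only shows that the caller's thread stays free while a slow call runs elsewhere.

```python
from __future__ import annotations

import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SlowCallGate:
    """Run long calls on worker threads so the serving thread is not blocked."""

    def __init__(self, max_workers: int = 1) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}
        self._next_id = 0

    def submit(self, fn, *args) -> int:
        """Queue fn(*args) and return a job id immediately."""
        self._next_id += 1
        self._jobs[self._next_id] = self._pool.submit(fn, *args)
        return self._next_id

    def poll(self, job_id: int) -> tuple:
        """Return (done, result); the job is forgotten once it has finished."""
        future = self._jobs[job_id]
        if not future.done():
            return (False, None)
        del self._jobs[job_id]
        return (True, future.result())

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


# Usage: the slow function is a placeholder for a generate_image call.
release = threading.Event()


def slow_render(prompt: str) -> str:
    release.wait(timeout=5)
    return f"image for {prompt}"


gate = SlowCallGate()
job = gate.submit(slow_render, "cat")
still_running = gate.poll(job)      # the worker is still waiting on the event
release.set()
deadline = time.monotonic() + 5
finished = gate.poll(job)
while not finished[0] and time.monotonic() < deadline:
    time.sleep(0.01)
    finished = gate.poll(job)
gate.shutdown()
```

With max_workers=1 the slow calls are still serialized among themselves, so this keeps other routes responsive without making image generation concurrent.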
Merits and counterweights
Server-direction merits
- Single endpoint, multiple modalities. Claude Desktop / Claude Code can reach complete, embed, transcribe, generate_image through one configured server. Upstream llama.cpp ships only an OpenAI-compat HTTP API -- nothing for whisper or SD.
- Local, offline, private. GGUF inference stays on the host but is reachable from a frontier-model agent loop. Useful for cheap bulk embedding, transcription of sensitive audio, or fast small-model drafts.
- Capability-gated surface. McpServerOptions lets a deployment expose embed only (or transcribe only). Embedding-as-a-service over MCP is a clean fit.
- Resource surface fits naturally. models://local + model://<name> is exactly what MCP resources are for; the introspection helpers exist.
- Reuses existing infra. agents/jsonrpc.py framing plus mongoose routes mean small marginal code.
Counterweights
- llama.cpp's HTTP server already covers complete / chat / embed via OpenAI-compat. The MCP server's incremental value over llama-server is mostly transcribe + generate_image + the resource catalog.
- Asymmetric value. The client direction (local LLM consuming MCP tools) is more clearly useful -- it gives small local models real reach. The server direction mostly benefits frontier-model users who want a local fallback, a smaller audience.
- Concurrency mismatch. EmbeddedServer is single-threaded; generate_image blocks for tens of seconds. Either add worker threads (non-trivial) or the server is functionally single-user.
- Protocol churn. The MCP HTTP transport has already been replaced once: HTTP+SSE in the 2024-11-05 spec revision gave way to Streamable HTTP in the 2025-03-26 revision. Building now means tracking spec changes.
- Stdio is better served by an off-the-shelf wrapper. A ~200-line script using the mcp Python SDK that imports cyllama would deliver most of the stdio value without touching EmbeddedServer. The HTTP-mounted variant is the part that justifies in-tree code.
Recommendation and build order
Build the client direction now -- clear win, code mostly exists. Defer
the server direction until either (a) a concrete user wants
transcribe / generate_image over MCP, or (b) the resource catalog
unlocks a specific workflow. When the server direction is built, start with
the HTTP transport on EmbeddedServer (no off-the-shelf substitute) and
skip stdio unless a concrete client needs it.
- Now. Lift agents/mcp.py onto LLM / chat() (client surface). Lowest risk; transports and tests already exist (tests/test_mcp.py).
- Deferred, when triggered. Stand up src/cyllama/mcp/ with the Streamable-HTTP transport mounted via EmbeddedServer.enable_mcp(). Start with embed + the resource surface (the parts llama-server can't already do well over OpenAI-compat).
- Deferred further. Add transcribe and generate_image tools, and address the single-threaded EmbeddedServer concurrency limitation before exposing them.
- Only on demand. Add the stdio entrypoint. Until then, document the off-the-shelf SDK-wrapper pattern for users who need stdio today.