cyllama overview

cyllama is a zero-dependency Python library for local LLM inference that uses Cython to wrap the following high-performance inference engines:

  • llama.cpp: text-to-text, text-to-speech and multimodal
  • whisper.cpp: automatic speech recognition
  • stable-diffusion.cpp: text-to-image and text-to-video

Core Features

  • High-level API - complete(), chat(), LLM class for quick prototyping
  • Low-level API - Direct access to llama.cpp, whisper.cpp, and stable-diffusion.cpp internals
  • Streaming - Token-by-token output with callbacks (see the sketch after this list)
  • Batch processing - Process multiple prompts 3-10x faster
  • GPU acceleration - Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform) backends
  • Memory tools - Estimate GPU layers and VRAM usage
  • OpenAI-compatible servers - EmbeddedServer (C/Mongoose) and PythonServer implementations
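
For example, streaming could look like the sketch below. The LLM class and the model_path parameter come from this overview; the complete() method on LLM and the callback keyword are assumptions, not a documented signature:

from cyllama import LLM

# Load a GGUF model once and reuse it across calls.
llm = LLM(model_path="models/llama.gguf")

def on_token(token: str) -> None:
    # Called once per generated token as it is produced.
    print(token, end="", flush=True)

# NOTE: `callback` is a hypothetical name for the token callback
# described in the feature list above.
llm.complete("Write a haiku about the sea", callback=on_token)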

Agent Framework

  • ReActAgent - Reasoning + Acting with tool calling (sketched after this list)
  • ConstrainedAgent - Grammar-enforced tool calls (100% valid output)
  • ContractAgent - Pre/post conditions on tools (C++26-inspired contracts)
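
A minimal ReActAgent sketch, assuming the agent takes a model and a list of plain Python functions as tools; the constructor arguments and run() entry point are assumptions based on this overview, not a documented API:

from cyllama import LLM, ReActAgent

def get_weather(city: str) -> str:
    """Return a canned weather report for a city (demo tool)."""
    return f"It is sunny in {city}."

# Hypothetical wiring: hand the model and tools to the agent.
agent = ReActAgent(
    llm=LLM(model_path="models/llama.gguf"),
    tools=[get_weather],
)
print(agent.run("What is the weather in Lisbon?"))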

Additional Capabilities

  • Speculative decoding - 2-3x speedup with draft models
  • GGUF utilities - Read/write model metadata
  • JSON schema grammars - Structured output generation (see the sketch after this list)
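
To illustrate structured output, the sketch below constrains complete() with a JSON schema. Only complete() and model_path appear in this overview; the json_schema keyword is an assumption:

from cyllama import complete

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

response = complete(
    "Extract the person from: 'Ada, aged 36, is an engineer.'",
    model_path="models/llama.gguf",
    json_schema=schema,  # hypothetical keyword for grammar-constrained decoding
)
print(response)  # decoding is constrained to match the schema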

Integrations

  • OpenAI-compatible API - Drop-in client replacement (example after this list)
  • LangChain - Full LLM interface implementation
  • ACP/MCP support - Agent and Model Context Protocols
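
Because the servers speak the OpenAI wire protocol, the official openai client can talk to a running EmbeddedServer or PythonServer directly. The host, port, and model name below are placeholders for your local setup:

from openai import OpenAI

# Point the stock OpenAI client at the local cyllama server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama.gguf",  # placeholder model identifier
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)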

Quick Example

from cyllama import complete

response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7
)
print(response)
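
For multi-turn use, chat() presumably accepts a message list in the usual role/content shape; the exact signature is an assumption based on this overview:

from cyllama import chat

response = chat(
    [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name three uses of a Raspberry Pi."},
    ],
    model_path="models/llama.gguf",  # same parameter as complete()
)
print(response)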

Requirements

  • Python 3.10+
  • macOS, Linux, or Windows
  • GGUF model files (download from Hugging Face)

repo: https://github.com/shakfu/cyllama