# cyllama Overview
cyllama is a zero-dependency Python library for local LLM inference. It uses Cython to wrap the following high-performance inference engines:
- llama.cpp: text-to-text, text-to-speech, and multimodal
- whisper.cpp: automatic speech recognition
- stable-diffusion.cpp: text-to-image and text-to-video
## Core Features
- High-level API - `complete()`, `chat()`, and the `LLM` class for quick prototyping
- Low-level API - Direct access to llama.cpp, whisper.cpp, and stable-diffusion.cpp internals
- Streaming - Token-by-token output with callbacks
- Batch processing - Process multiple prompts 3-10x faster
- GPU acceleration - Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform) backends
- Memory tools - Estimate GPU layers and VRAM usage
- OpenAI-compatible servers - `EmbeddedServer` (C/Mongoose) and `PythonServer` implementations
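The streaming feature above follows a token-callback pattern. A minimal, framework-free sketch of that pattern, where a stub generator stands in for the model (this is an illustration, not cyllama's actual callback signature):

```python
from typing import Callable, Iterable

def stub_generate(prompt: str) -> Iterable[str]:
    # Stand-in for a real model: emit the prompt back word by word.
    for word in prompt.split():
        yield word + " "

def stream(prompt: str, on_token: Callable[[str], None]) -> str:
    """Call on_token for each token as it arrives; return the full text."""
    pieces = []
    for token in stub_generate(prompt):
        on_token(token)  # e.g. print(token, end="", flush=True)
        pieces.append(token)
    return "".join(pieces)

collected = []
text = stream("hello streaming world", collected.append)
```

The callback fires once per token, so a caller can display partial output immediately while still receiving the assembled result at the end.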
## Agent Framework
- `ReActAgent` - Reasoning + Acting with tool calling
- `ConstrainedAgent` - Grammar-enforced tool calls (100% valid output)
- `ContractAgent` - Pre/post conditions on tools (C++26-inspired contracts)
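The ReAct pattern alternates model reasoning with tool execution until an answer is produced. A toy sketch of that loop, where a scripted function and a tiny `calculator` tool stand in for the LLM and the real tool set (illustrative only; `ReActAgent`'s actual interface may differ):

```python
def calculator(expression: str) -> str:
    # Deliberately tiny tool: evaluate a simple arithmetic expression.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def scripted_model(history: list[str]) -> dict:
    # Stands in for an LLM: first it requests a tool call,
    # then it turns the observation into a final answer.
    if not any(line.startswith("Observation:") for line in history):
        return {"action": "calculator", "input": "6 * 7"}
    return {"answer": history[-1].removeprefix("Observation: ")}

def react(question: str, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = scripted_model(history)
        if "answer" in step:
            return step["answer"]
        # Run the requested tool and feed the observation back in.
        result = TOOLS[step["action"]](step["input"])
        history.append(f"Observation: {result}")
    raise RuntimeError("no answer within step budget")

answer = react("What is 6 * 7?")
```

The `max_steps` budget is the usual guard against a loop that never converges on a final answer.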
## Additional Capabilities
- Speculative decoding - 2-3x speedup with draft models
- GGUF utilities - Read/write model metadata
- JSON schema grammars - Structured output generation
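Grammar-constrained generation guarantees that the model's output parses against a given schema. The pure-Python sketch below only mirrors the check that such a grammar enforces at generation time (the schema and validator are illustrative; in llama.cpp-based stacks the schema is typically compiled into a GBNF grammar instead of validated after the fact):

```python
import json

# Illustrative schema: the shape a constrained generation is forced to emit.
SCHEMA = {"name": str, "age": int}

def conforms(text: str, schema: dict) -> bool:
    """Return True if text is a JSON object matching the schema's keys and types."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict)
            and set(obj) == set(schema)
            and all(isinstance(obj[k], t) for k, t in schema.items()))

ok = conforms('{"name": "Ada", "age": 36}', SCHEMA)
bad = conforms('{"name": "Ada"}', SCHEMA)
```

With a grammar enforcing the constraint during sampling, the `bad` case simply cannot be produced, which is what the "100% valid output" claim refers to.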
## Integrations
- OpenAI-compatible API - Drop-in client replacement
- LangChain - Full LLM interface implementation
- ACP/MCP support - Agent and Model Context Protocols
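An OpenAI-compatible server accepts requests in the standard chat-completions shape, so existing OpenAI clients can be pointed at it unchanged. A sketch of such a request body (the base URL, port, and model name are placeholder assumptions; no request is actually sent here):

```python
import json

# Assumed local server address; check your server's startup output.
BASE_URL = "http://localhost:8080/v1"

# Standard OpenAI chat-completions request body.
payload = {
    "model": "llama.gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello."},
    ],
    "temperature": 0.7,
}
body = json.dumps(payload)
endpoint = f"{BASE_URL}/chat/completions"
```

Because the wire format matches OpenAI's, swapping a hosted model for a local one is a matter of changing the client's base URL.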
## Quick Example
```python
from cyllama import complete

response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
)
print(response)
```
## Requirements
- Python 3.10+
- macOS, Linux, or Windows
- GGUF model files (download from Hugging Face)