Introduction
cyllama is a zero-dependency Python library for local AI inference. It provides high-performance Cython bindings to three powerful C++ inference engines:
- llama.cpp - Large language model inference for text generation, chat, and embeddings
- whisper.cpp - Automatic speech recognition (ASR) supporting 100+ languages
- stable-diffusion.cpp - Image and video generation from text prompts
Why cyllama?
Zero Dependencies
Unlike other Python LLM libraries that require PyTorch, TensorFlow, or other heavy frameworks, cyllama compiles directly against the C++ libraries. The only requirement is a GGUF model file.
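Since the model file is the one external input, it can help to confirm that a file really is GGUF before loading it. The sketch below is not part of cyllama: `read_gguf_version` is a hypothetical helper that reads the 4-byte `GGUF` magic and the version field that follows it, assuming the common little-endian layout.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF.

    Hypothetical helper (not a cyllama API); assumes a little-endian header.
    """
    with Path(path).open("rb") as f:
        header = f.read(8)  # 4-byte magic followed by a 4-byte version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

A check like this fails fast on a truncated download or a model in another format, before any inference engine is involved.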
High Performance
By wrapping optimized C++ code in compiled Cython extensions rather than pure-Python bindings, cyllama achieves near-native performance:
- GPU acceleration via Metal (macOS), CUDA (NVIDIA), Vulkan (cross-platform)
- Efficient memory management with KV caching
- Batch processing for 3-10x throughput improvements
- Speculative decoding for 2-3x faster generation
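Speedups like those listed above depend on the model, quantization, and hardware, so it is worth measuring on your own machine. The following is a minimal timing sketch, not a cyllama API: `measure_throughput` is a hypothetical helper that accepts any callable mapping a prompt to a result.

```python
import time


def measure_throughput(generate, prompts):
    """Run `generate` on each prompt and return (results, prompts_per_second).

    Hypothetical helper for rough comparisons; not part of cyllama.
    """
    start = time.perf_counter()
    results = [generate(p) for p in prompts]
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast callables.
    rate = len(prompts) / elapsed if elapsed > 0 else float("inf")
    return results, rate
```

To time cyllama itself, one option is to pass a small wrapper such as `lambda p: complete(p, model_path="models/llama.gguf")` as `generate` and compare the reported rate across configurations.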
Pythonic API
Despite the low-level foundations, cyllama provides a clean, Pythonic interface:
from cyllama import complete
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7
)
print(response)
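A missing or mistyped `model_path` is an easy first mistake to make. The sketch below wraps the `complete` call from the example above in a hypothetical helper, `complete_checked`, which verifies the path first and only then imports cyllama. It assumes that extra keyword arguments such as `temperature` can be forwarded unchanged.

```python
from pathlib import Path


def complete_checked(prompt, model_path, **kwargs):
    """Call cyllama.complete after confirming that the model file exists.

    Hypothetical convenience wrapper; not part of cyllama.
    """
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"model file not found: {path}")
    # Imported lazily so the path check works even before cyllama is built.
    from cyllama import complete
    return complete(prompt, model_path=str(path), **kwargs)
```

With a valid model file, `complete_checked("Hello!", "models/llama.gguf", temperature=0.7)` behaves like the direct call shown above.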
What's Covered
This documentation is organized into several parts:
Llama.cpp - Text generation with large language models, including the high-level API, streaming, batch processing, server implementations, and advanced features like speculative decoding and context caching.
Whisper.cpp - Speech-to-text transcription with support for timestamps, translation, and voice activity detection.
Stable-Diffusion.cpp - Image generation from text prompts, supporting SD 1.x/2.x, SDXL, SD3, FLUX, and video generation models.
Agents - A zero-dependency agent framework with three architectures: ReActAgent for general-purpose tasks, ConstrainedAgent for grammar-enforced tool calls, and ContractAgent for runtime verification.
Getting Started
cyllama uses uv for fast, reliable Python package management. uv is a modern Python package manager written in Rust that provides:
- Speed: 10-100x faster than pip for dependency resolution and installation
- Reliability: Deterministic builds with lockfile support
- Simplicity: Single tool for virtual environments, packages, and Python version management
Quick Start
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and build
git clone https://github.com/shakfu/cyllama.git
cd cyllama
# Sync dependencies (uv creates the virtual environment automatically)
uv sync
# Build cyllama
make
# Download a test model
make download
# Try it out in a Python REPL
uv run python
>>> from cyllama import complete
>>> response = complete('Hello!', model_path='models/Llama-3.2-1B-Instruct-Q8_0.gguf')
>>> print(response)
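The model filename produced by the download step may differ from the one shown above. As a small sketch, the hypothetical helper `find_gguf_models` lists the `.gguf` files in a directory so you can pick a path to pass as `model_path`. The `models` default matches the path used in the session above.

```python
from pathlib import Path


def find_gguf_models(directory="models"):
    """Return the .gguf files directly inside `directory`, sorted by path.

    Hypothetical helper; not part of cyllama.
    """
    root = Path(directory)
    if not root.is_dir():
        return []  # no models directory yet, for example before downloading
    return sorted(root.glob("*.gguf"))
```

From the repository root, `find_gguf_models()` should return the downloaded test model, provided it was saved under `models/`.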
Running Tests
Using uv Commands
Common uv commands for development:
uv sync # Install/sync all dependencies
uv add package # Add a new dependency
uv run python ... # Run Python in the virtual environment
uv run pytest ... # Run pytest
uv pip list # List installed packages
For detailed installation instructions, see the Installation Guide.