cyllama
This is the official documentation for cyllama, a high-performance Python library for local AI inference.
About
cyllama provides high-performance Cython bindings to three C++ inference engines, all usable from Python with zero runtime dependencies:
- llama.cpp -- LLM text generation, chat, embeddings, and text-to-speech
- whisper.cpp -- Automatic speech recognition and translation
- stable-diffusion.cpp -- Image and video generation from text prompts
This documentation covers:
- Installation and setup across different platforms and GPU backends
- Text generation with llama.cpp for chat, completion, and embeddings
- Speech recognition with whisper.cpp for transcription and translation
- Image generation with stable-diffusion.cpp for text-to-image workflows
- Agent framework for building tool-using AI agents
- RAG for retrieval-augmented generation with local models
Who This Is For
- Python developers who want to run LLMs locally without cloud dependencies
- ML engineers looking for a lightweight alternative to PyTorch-based inference
- Application developers building AI-powered features with predictable latency
- Researchers who need direct access to model internals and sampling parameters
Prerequisites
- Python 3.10 or later
- Familiarity with command-line tools
- Understanding of what language models do (not how they work internally)
No machine learning expertise is required for basic usage.
Conventions
Code examples use Python 3.10+ syntax.
Shell commands are shown with bash syntax.
Source Code
cyllama is open source and available at:
https://github.com/shakfu/cyllama
Issues, contributions, and feedback are welcome.