Cyllama API Reference¶
Version: 0.1.20 Date: March 2026
Complete API reference for cyllama, a high-performance Python library for LLM inference built on llama.cpp.
Table of Contents¶
- High-Level Generation API
- Async API
- Framework Integrations
- Memory Utilities
- Core llama.cpp API
- Advanced Features
- Server Implementations
- Multimodal Support
- Whisper Integration
- Stable Diffusion Integration
High-Level Generation API¶
The high-level API provides simple, Pythonic functions and classes for text generation.
complete()¶
One-shot text generation function.
def complete(
prompt: str,
model_path: str,
config: Optional[GenerationConfig] = None,
stream: bool = False,
**kwargs
) -> Response | Iterator[str]
Parameters:
- prompt (str): Input text prompt
- model_path (str): Path to GGUF model file
- config (GenerationConfig, optional): Generation configuration object
- stream (bool): If True, return iterator of text chunks
- **kwargs: Override config parameters (temperature, max_tokens, etc.)
Returns:
- Response: Response object with text and stats (if stream=False)
- Iterator[str]: Iterator of text chunks (if stream=True)
Example:
from cyllama import complete
response = complete(
"What is Python?",
model_path="models/llama.gguf",
temperature=0.7,
max_tokens=200
)
# Streaming
for chunk in complete("Tell me a story", model_path="models/llama.gguf", stream=True):
print(chunk, end="", flush=True)
chat()¶
Chat-style generation with message history. Automatically applies the model's built-in chat template.
def chat(
messages: List[Dict[str, str]],
model_path: str,
config: Optional[GenerationConfig] = None,
stream: bool = False,
template: Optional[str] = None,
**kwargs
) -> Response | Iterator[str]
Parameters:
- messages (List[Dict]): List of message dicts with 'role' and 'content' keys
- model_path (str): Path to GGUF model file
- config (GenerationConfig, optional): Generation configuration
- stream (bool): Enable streaming output
- template (str, optional): Chat template name to use. If None, uses model's default.
- **kwargs: Override config parameters
Returns:
- Response: Response object with text and stats (if stream=False)
- Iterator[str]: Iterator of text chunks (if stream=True)
Example:
from cyllama import chat
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="models/llama.gguf")
# With explicit template
response = chat(messages, model_path="models/llama.gguf", template="chatml")
apply_chat_template()¶
Apply a chat template to format messages into a prompt string.
def apply_chat_template(
messages: List[Dict[str, str]],
model_path: str,
template: Optional[str] = None,
add_generation_prompt: bool = True,
verbose: bool = False,
) -> str
Parameters:
- messages (List[Dict]): List of message dicts with 'role' and 'content' keys
- model_path (str): Path to GGUF model file
- template (str, optional): Template name or string. If None, uses model's default.
- add_generation_prompt (bool): Add assistant prompt prefix (default: True)
- verbose (bool): Enable detailed logging
Returns:
str: Formatted prompt string
Supported Templates:
- llama2, llama3, llama4
- chatml (Qwen, Yi, etc.)
- mistral-v1, mistral-v3, mistral-v7
- phi3, phi4
- deepseek, deepseek2, deepseek3
- gemma, falcon3, command-r, vicuna, zephyr, and more
Example:
from cyllama.api import apply_chat_template
messages = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"}
]
prompt = apply_chat_template(messages, "models/llama.gguf")
print(prompt)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>
# Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
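For intuition about what a template does, the chatml family wraps each message in `<|im_start|>role … <|im_end|>` markers. Below is a minimal pure-Python sketch of that formatting; it is illustrative only and not cyllama's implementation, which reads the template from the model's metadata.

```python
# Illustrative sketch of ChatML-style formatting; apply_chat_template()
# does the real work using the model's built-in template.
def format_chatml(messages, add_generation_prompt=True):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
]
print(format_chatml(messages))
```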
get_chat_template()¶
Get the chat template string from a model.
Parameters:
- model_path (str): Path to GGUF model file
- template_name (str, optional): Specific template name to retrieve
Returns:
str: Template string (Jinja-style), or empty string if not found
Example:
from cyllama.api import get_chat_template
template = get_chat_template("models/llama.gguf")
print(template) # Shows the Jinja-style template
Response Class¶
Structured response object returned by generation functions.
@dataclass
class Response:
text: str # Generated text content
stats: Optional[GenerationStats] # Generation statistics
finish_reason: str = "stop" # Why generation stopped
model: str = "" # Model path used
Attributes:
- text (str): The generated text content
- stats (GenerationStats, optional): Statistics including timing and token counts
- finish_reason (str): Reason for completion ("stop", "length", etc.)
- model (str): Path to the model used
String Compatibility:
Response implements the string protocol for backward compatibility:
- str(response) returns response.text
- response == "string" compares with text
- len(response) returns text length
- for char in response: iterates over text characters
- "substring" in response checks text containment
- response + " more" concatenates text
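This behavior amounts to delegating the string dunder methods to the text field. A simplified sketch of how such a dataclass can behave (not cyllama's actual source):

```python
from dataclasses import dataclass

# Simplified sketch of the string protocol Response exposes:
# each dunder method delegates to self.text.
@dataclass
class ResponseSketch:
    text: str
    finish_reason: str = "stop"

    def __str__(self):
        return self.text

    def __eq__(self, other):
        if isinstance(other, str):
            return self.text == other
        return NotImplemented

    def __len__(self):
        return len(self.text)

    def __iter__(self):
        return iter(self.text)

    def __contains__(self, item):
        return item in self.text

    def __add__(self, other):
        return self.text + other

r = ResponseSketch("hello")
print(str(r), len(r), "ell" in r)  # hello 5 True
```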
Methods:
to_dict()¶
Convert response to dictionary.
to_json()¶
Convert response to JSON string.
Example:
from cyllama import complete
response = complete("What is Python?", model_path="model.gguf")
# Use as string (backward compatible)
print(response) # Prints text
if "programming" in response:
print("Mentioned programming!")
# Access structured data
print(f"Finish reason: {response.finish_reason}")
if response.stats:
print(f"Tokens/sec: {response.stats.tokens_per_second:.1f}")
# Serialize
data = response.to_dict()
json_str = response.to_json(indent=2)
GenerationStats Class¶
Statistics from a generation run.
@dataclass
class GenerationStats:
prompt_tokens: int # Number of tokens in prompt
generated_tokens: int # Number of tokens generated
total_time: float # Total generation time (seconds)
tokens_per_second: float # Generation speed
prompt_time: float # Time for prompt processing
generation_time: float # Time for token generation
LLM Class¶
Reusable generator with model caching for improved performance.
class LLM:
def __init__(
self,
model_path: str,
config: Optional[GenerationConfig] = None,
verbose: bool = False
)
Parameters:
- model_path (str): Path to GGUF model file
- config (GenerationConfig, optional): Default generation configuration
- verbose (bool): Print detailed information during generation
Methods:
__call__()¶
Generate text from a prompt.
def __call__(
self,
prompt: str,
config: Optional[GenerationConfig] = None,
stream: bool = False,
on_token: Optional[Callable[[str], None]] = None
) -> Response | Iterator[str]
Parameters:
- prompt (str): Input text
- config (GenerationConfig, optional): Override instance config
- stream (bool): Enable streaming
- on_token (Callable, optional): Callback for each token
Returns:
- Response: Response object with text and stats (if stream=False)
- Iterator[str]: Iterator of text chunks (if stream=True)
chat()¶
Generate a response from chat messages using the model's chat template.
def chat(
self,
messages: List[Dict[str, str]],
config: Optional[GenerationConfig] = None,
stream: bool = False,
template: Optional[str] = None
) -> Response | Iterator[str]
Parameters:
- messages (List[Dict]): List of message dicts with 'role' and 'content' keys
- config (GenerationConfig, optional): Override instance config
- stream (bool): Enable streaming
- template (str, optional): Chat template name to use
get_chat_template()¶
Get the chat template string from the loaded model.
Example:
from cyllama import LLM, GenerationConfig
gen = LLM("models/llama.gguf")
# Simple generation
response = gen("What is Python?")
# With custom config
config = GenerationConfig(temperature=0.9, max_tokens=100)
response = gen("Tell me a joke", config=config)
# With statistics
response, stats = gen.generate_with_stats("Question?")
print(f"Generated {stats.generated_tokens} tokens in {stats.total_time:.2f}s")
print(f"Speed: {stats.tokens_per_second:.2f} tokens/sec")
# Chat with template
messages = [{"role": "user", "content": "Hello!"}]
response = gen.chat(messages)
# Get template
template = gen.get_chat_template()
GenerationConfig Dataclass¶
Configuration for text generation.
@dataclass
class GenerationConfig:
max_tokens: int = 512
temperature: float = 0.8
top_k: int = 40
top_p: float = 0.95
min_p: float = 0.05
repeat_penalty: float = 1.1
n_gpu_layers: int = 99
n_ctx: Optional[int] = None
n_batch: int = 512
seed: int = -1
stop_sequences: List[str] = field(default_factory=list)
add_bos: bool = True
parse_special: bool = True
Attributes:
- max_tokens: Maximum tokens to generate (default: 512)
- temperature: Sampling temperature, 0.0 = greedy (default: 0.8)
- top_k: Top-k sampling parameter (default: 40)
- top_p: Top-p (nucleus) sampling (default: 0.95)
- min_p: Minimum probability threshold (default: 0.05)
- repeat_penalty: Penalty for repeating tokens (default: 1.1)
- n_gpu_layers: GPU layers to offload (default: 99 = all)
- n_ctx: Context window size, None = auto (default: None)
- n_batch: Batch size for processing (default: 512)
- seed: Random seed, -1 = random (default: -1)
- stop_sequences: Strings that stop generation (default: [])
- add_bos: Add beginning-of-sequence token (default: True)
- parse_special: Parse special tokens in prompt (default: True)
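To see how the sampling knobs above interact, here is a small pure-Python sketch of top-k followed by min-p filtering over a toy token distribution. It illustrates the concepts only; it is not cyllama's sampler, which runs these filters natively in llama.cpp.

```python
# Toy sketch of top_k and min_p filtering on a {token: probability} dict.
def filter_candidates(probs, top_k=40, min_p=0.05):
    # top_k: keep only the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    if not ranked:
        return {}
    # min_p: drop tokens whose probability is below min_p * max probability
    p_max = ranked[0][1]
    kept = {tok: p for tok, p in ranked if p >= min_p * p_max}
    # Renormalize the surviving candidates
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.04, "e": 0.01}
filtered = filter_candidates(probs, top_k=4, min_p=0.1)
print(sorted(filtered))  # ['a', 'b', 'c']
```

With temperature=0.0 generation is greedy, so only the single most probable surviving token would ever be picked.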
Async API¶
The async API provides non-blocking generation for use in async applications (FastAPI, aiohttp, etc.).
AsyncLLM Class¶
Async wrapper around the LLM class for non-blocking text generation.
class AsyncLLM:
def __init__(
self,
model_path: str,
config: Optional[GenerationConfig] = None,
verbose: bool = False,
**kwargs
)
Parameters:
- model_path (str): Path to GGUF model file
- config (GenerationConfig, optional): Generation configuration
- verbose (bool): Print detailed information during generation
- **kwargs: Generation parameters (temperature, max_tokens, etc.)
Methods:
__call__() / generate()¶
Generate text asynchronously.
stream()¶
Stream generated text chunks asynchronously.
async def stream(
self,
prompt: str,
config: Optional[GenerationConfig] = None,
**kwargs
) -> AsyncIterator[str]
generate_with_stats()¶
Generate text and return statistics.
async def generate_with_stats(
self,
prompt: str,
config: Optional[GenerationConfig] = None
) -> Tuple[str, GenerationStats]
Example:
import asyncio
from cyllama import AsyncLLM
async def main():
# Context manager ensures cleanup
async with AsyncLLM("model.gguf", temperature=0.7) as llm:
# Simple generation
response = await llm("What is Python?")
print(response)
# Streaming
async for chunk in llm.stream("Tell me a story"):
print(chunk, end="", flush=True)
# With stats
text, stats = await llm.generate_with_stats("Question?")
print(f"Generated {stats.generated_tokens} tokens")
asyncio.run(main())
complete_async()¶
Async convenience function for one-off text completion.
async def complete_async(
prompt: str,
model_path: str,
config: Optional[GenerationConfig] = None,
verbose: bool = False,
**kwargs
) -> str
Example:
response = await complete_async("What is Python?", model_path="model.gguf")
print(response)
chat_async()¶
Async convenience function for chat-style generation.
async def chat_async(
messages: List[Dict[str, str]],
model_path: str,
config: Optional[GenerationConfig] = None,
verbose: bool = False,
**kwargs
) -> str
Example:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
response = await chat_async(messages, model_path="model.gguf")
stream_complete_async()¶
Async streaming completion for one-off use.
async def stream_complete_async(
prompt: str,
model_path: str,
config: Optional[GenerationConfig] = None,
verbose: bool = False,
**kwargs
) -> AsyncIterator[str]
Example:
async for chunk in stream_complete_async("Tell me a story", "model.gguf"):
print(chunk, end="", flush=True)
Framework Integrations¶
OpenAI-Compatible API¶
Drop-in replacement for OpenAI Python client.
OpenAICompatibleClient Class¶
from cyllama.integrations.openai_compat import OpenAICompatibleClient
class OpenAICompatibleClient:
def __init__(
self,
model_path: str,
temperature: float = 0.7,
max_tokens: int = 512,
n_gpu_layers: int = 99
)
Attributes:
chat: Chat completions interface
Example:
from cyllama.integrations.openai_compat import OpenAICompatibleClient
client = OpenAICompatibleClient(model_path="models/llama.gguf")
response = client.chat.completions.create(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
],
temperature=0.7,
max_tokens=200
)
print(response.choices[0].message.content)
# Streaming
for chunk in client.chat.completions.create(
messages=[{"role": "user", "content": "Count to 5"}],
stream=True
):
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
LangChain Integration¶
Full LangChain LLM interface implementation.
CyllamaLLM Class¶
from cyllama.integrations import CyllamaLLM
class CyllamaLLM(LLM):
model_path: str
temperature: float = 0.7
max_tokens: int = 512
top_k: int = 40
top_p: float = 0.95
repeat_penalty: float = 1.1
n_gpu_layers: int = 99
Example:
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
llm = CyllamaLLM(model_path="models/llama.gguf", temperature=0.7)
prompt = PromptTemplate(
input_variables=["topic"],
template="Explain {topic} in simple terms:"
)
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="quantum computing")
# With streaming
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = CyllamaLLM(
model_path="models/llama.gguf",
streaming=True,
callbacks=[StreamingStdOutCallbackHandler()]
)
Memory Utilities¶
Tools for estimating and optimizing GPU memory usage.
estimate_gpu_layers()¶
Estimate optimal number of GPU layers for available VRAM.
def estimate_gpu_layers(
model_path: str,
available_vram_mb: int,
n_ctx: int = 2048,
n_batch: int = 512
) -> MemoryEstimate
Parameters:
- model_path (str): Path to GGUF model file
- available_vram_mb (int): Available VRAM in megabytes
- n_ctx (int): Context window size
- n_batch (int): Batch size
Returns:
MemoryEstimate: Object with recommended settings
Example:
from cyllama import estimate_gpu_layers
estimate = estimate_gpu_layers(
model_path="models/llama.gguf",
available_vram_mb=8000, # 8GB VRAM
n_ctx=2048
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
print(f"Estimated VRAM usage: {estimate.vram / 1024 / 1024:.2f} MB")
estimate_memory_usage()¶
Estimate total memory requirements for model loading.
def estimate_memory_usage(
model_path: str,
n_ctx: int = 2048,
n_batch: int = 512,
n_gpu_layers: int = 0
) -> MemoryEstimate
MemoryEstimate Dataclass¶
Memory estimation results.
@dataclass
class MemoryEstimate:
layers: int # Total layers
graph_size: int # Computation graph size
vram: int # VRAM usage (bytes)
vram_kv: int # KV cache VRAM (bytes)
total_size: int # Total memory (bytes)
tensor_split: Optional[List[int]] # Multi-GPU split
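A large share of the VRAM estimate is the KV cache, which grows linearly with context length. A rough back-of-the-envelope sketch of that term (assumed formula: fp16 K and V tensors per layer with full multi-head attention; models using grouped-query attention cache fewer KV heads and need less):

```python
# Rough KV-cache size: 2 (K and V) * layers * ctx * embedding dim * bytes/elem.
# Assumes full multi-head attention and an fp16 cache; GQA models need less.
def kv_cache_bytes(n_layers, n_ctx, n_embd, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem

# e.g. a 7B-class model: 32 layers, 4096-dim embeddings, 2048-token context
size = kv_cache_bytes(n_layers=32, n_ctx=2048, n_embd=4096, bytes_per_elem=2)
print(f"{size / 1024 / 1024:.0f} MB")  # 1024 MB
```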
Core llama.cpp API¶
Low-level Cython wrappers for direct llama.cpp access.
Core Classes¶
LlamaModel¶
Represents a loaded GGUF model.
from cyllama.llama.llama_cpp import LlamaModel, LlamaModelParams
params = LlamaModelParams()
params.n_gpu_layers = 99
params.use_mmap = True
params.use_mlock = False
model = LlamaModel("models/llama.gguf", params)
# Properties
print(model.n_params) # Total parameters
print(model.n_layers) # Number of layers
print(model.n_embd) # Embedding dimension
print(model.n_vocab) # Vocabulary size
# Methods
vocab = model.get_vocab() # Get vocabulary
model.free() # Free resources
LlamaContext¶
Inference context for model.
from cyllama.llama.llama_cpp import LlamaContext, LlamaContextParams
ctx_params = LlamaContextParams()
ctx_params.n_ctx = 2048
ctx_params.n_batch = 512
ctx_params.n_threads = 4
ctx_params.n_threads_batch = 4
ctx = LlamaContext(model, ctx_params)
# Decode batch
from cyllama.llama.llama_cpp import llama_batch_get_one
batch = llama_batch_get_one(tokens)
ctx.decode(batch)
# KV cache management
ctx.kv_cache_clear()
ctx.kv_cache_seq_rm(seq_id, p0, p1)
ctx.kv_cache_seq_add(seq_id, p0, p1, delta)
# Performance
ctx.print_perf_data()
LlamaSampler¶
Sampling strategies for token generation.
from cyllama.llama.llama_cpp import LlamaSampler, LlamaSamplerChainParams
sampler_params = LlamaSamplerChainParams()
sampler = LlamaSampler(sampler_params)
# Add sampling methods
sampler.add_top_k(40)
sampler.add_top_p(0.95, 1)
sampler.add_temp(0.7)
sampler.add_dist(seed)
# Sample token
token_id = sampler.sample(ctx, idx)
# Reset state
sampler.reset()
LlamaVocab¶
Vocabulary and tokenization.
vocab = model.get_vocab()
# Tokenization
tokens = vocab.tokenize("Hello world", add_special=True, parse_special=True)
# Detokenization
text = vocab.detokenize(tokens)
piece = vocab.token_to_piece(token_id, special=True)
# Special tokens
print(vocab.bos) # Begin-of-sequence token
print(vocab.eos) # End-of-sequence token
print(vocab.eot) # End-of-turn token
print(vocab.n_vocab) # Vocabulary size
# Check token types
is_eog = vocab.is_eog(token_id)
is_control = vocab.is_control(token_id)
LlamaBatch¶
Efficient batch processing.
from cyllama.llama.llama_cpp import LlamaBatch
# Create batch
batch = LlamaBatch(n_tokens=512, embd=0, n_seq_max=1)
# Add token
batch.add(token_id, pos, seq_ids=[0], logits=True)
# Clear batch
batch.clear()
# Convenience function
from cyllama.llama.llama_cpp import llama_batch_get_one
batch = llama_batch_get_one(tokens, pos_offset=0)
Backend Management¶
from cyllama.llama.llama_cpp import (
ggml_backend_load_all,
ggml_backend_offload_supported,
ggml_backend_metal_set_n_cb
)
# Load all available backends (Metal, CUDA, etc.)
ggml_backend_load_all()
# Check GPU support
if ggml_backend_offload_supported():
print("GPU offload supported")
# Configure Metal (macOS)
ggml_backend_metal_set_n_cb(2) # Number of command buffers
Advanced Features¶
GGUF File Manipulation¶
Inspect and modify GGUF model files.
GGUFContext Class¶
from cyllama.llama.llama_cpp import GGUFContext
# Read existing file
ctx = GGUFContext.from_file("model.gguf")
# Get metadata
metadata = ctx.get_all_metadata()
print(metadata['general.architecture'])
print(metadata['general.name'])
value = ctx.get_val_str("general.architecture")
# Create new file
ctx = GGUFContext.empty()
ctx.set_val_str("custom.key", "value")
ctx.set_val_u32("custom.number", 42)
ctx.write_to_file("custom.gguf", write_tensors=False)
# Modify existing
ctx = GGUFContext.from_file("model.gguf")
ctx.set_val_str("custom.metadata", "updated")
ctx.write_to_file("modified.gguf")
JSON Schema to Grammar¶
Convert JSON schemas to llama.cpp grammar format for structured output. This is implemented in pure Python (vendored from llama.cpp) with no C++ dependency.
from cyllama.llama.llama_cpp import json_schema_to_grammar
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"},
"email": {"type": "string"}
},
"required": ["name", "age"]
}
grammar = json_schema_to_grammar(schema)
# Use with generation
from cyllama.llama.llama_cpp import LlamaSampler
sampler = LlamaSampler()
sampler.add_grammar(grammar)
Model Download¶
Download models from HuggingFace with Ollama-style tags.
from cyllama.llama.llama_cpp import download_model, list_cached_models
# Download from HuggingFace
download_model(
hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:q4",
cache_dir="~/.cache/cyllama/models"
)
# List cached models
models = list_cached_models()
for model in models:
print(f"{model['user']}/{model['model']}:{model['tag']}")
print(f" Path: {model['path']}")
print(f" Size: {model['size'] / 1024 / 1024:.2f} MB")
# Direct URL download
download_model(
url="https://example.com/model.gguf",
output_path="models/custom.gguf"
)
N-gram Cache¶
Pattern-based token prediction for 2-10x speedup on repetitive text.
from cyllama.llama.llama_cpp import NgramCache
# Create cache
cache = NgramCache()
# Learn patterns from token sequences
tokens = [1, 2, 3, 4, 5, 6, 7, 8]
cache.update(tokens, ngram_min=2, ngram_max=4)
# Predict likely continuations
input_tokens = [1, 2, 3]
draft_tokens = cache.draft(input_tokens, n_draft=16)
# Save/load cache
cache.save("patterns.bin")
loaded_cache = NgramCache.from_file("patterns.bin")
# Clear cache
cache.clear()
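Conceptually, the cache maps recently seen n-grams to the tokens that followed them, then drafts a continuation by repeated lookup. A toy pure-Python version of that idea using bigrams (illustrative only, not the NgramCache implementation):

```python
from collections import defaultdict

# Toy n-gram lookup: map each bigram to the tokens observed after it,
# then greedily draft a continuation from the current context.
def build_bigram_table(tokens):
    table = defaultdict(list)
    for i in range(len(tokens) - 2):
        table[(tokens[i], tokens[i + 1])].append(tokens[i + 2])
    return table

def draft(table, context, n_draft=4):
    out = list(context)
    for _ in range(n_draft):
        followers = table.get((out[-2], out[-1]))
        if not followers:
            break  # no pattern learned for this bigram
        out.append(followers[0])  # take the first observed follower
    return out[len(context):]

tokens = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
table = build_bigram_table(tokens)
drafted = draft(table, [1, 2], n_draft=3)
print(drafted)  # [3, 4, 5]
```

The speedup on repetitive text comes from verifying these cheap drafts in a single batched decode instead of generating each token separately.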
Speculative Decoding¶
Use draft model for 2-3x inference speedup.
from cyllama.llama.llama_cpp import (
LlamaModel, LlamaContext, LlamaModelParams, LlamaContextParams,
Speculative, SpeculativeParams
)
# Load target and draft models
model_target = LlamaModel("models/large.gguf", LlamaModelParams())
model_draft = LlamaModel("models/small.gguf", LlamaModelParams())
ctx_params = LlamaContextParams()
ctx_params.n_ctx = 2048
ctx_target = LlamaContext(model_target, ctx_params)
# Configure speculative parameters
params = SpeculativeParams(
n_max=16, # Maximum number of draft tokens
n_reuse=8, # Tokens to reuse
p_min=0.75 # Minimum acceptance probability
)
# Create speculative decoding instance
spec = Speculative(params, ctx_target)
# Check compatibility
if spec.is_compat():
print("Models are compatible for speculative decoding")
# Begin a speculative decoding round
spec.begin()
# Generate draft tokens
prompt_tokens = [1, 2, 3]
last_token = prompt_tokens[-1]
draft_tokens = spec.draft(prompt_tokens, last_token)
# Accept verified tokens
spec.accept()
# Print performance statistics
spec.print_stats()
Parameters:
- n_max: Maximum number of tokens to draft (default: 16)
- n_reuse: Number of tokens to reuse from previous draft (default: 8)
- p_min: Minimum acceptance probability (default: 0.75)
Methods:
| Method | Description |
|---|---|
| is_compat() | Check if target and draft models are compatible |
| begin() | Begin a speculative decoding round |
| draft(...) | Generate draft tokens from the draft model |
| accept() | Accept verified tokens after evaluation |
| print_stats() | Print speculative decoding performance statistics |
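The speedup comes from verifying a batch of cheap draft tokens in a single pass of the target model and keeping the longest agreeing prefix. A toy sketch of that acceptance rule (simple greedy agreement; the real scheme in llama.cpp compares token probabilities against p_min rather than exact tokens):

```python
# Toy acceptance rule: keep draft tokens while the target model's own
# prediction agrees, and stop at the first mismatch. Real speculative
# decoding accepts/rejects probabilistically, not by argmax equality.
def accept_draft(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted

draft = [10, 11, 12, 13]
target = [10, 11, 99, 13]  # target disagrees at position 2
accepted = accept_draft(draft, target)
print(accepted)  # [10, 11]
```

Every accepted token is a target-model forward pass saved, which is where the 2-3x speedup comes from when the draft model agrees often.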
Server Implementations¶
Three OpenAI-compatible server implementations.
Embedded Server¶
Pure Python server implementation.
from cyllama.llama.server.embedded import start_server
# Start server
start_server(
model_path="models/llama.gguf",
host="127.0.0.1",
port=8000,
n_ctx=2048,
n_gpu_layers=99
)
# Use with OpenAI client
import openai
openai.api_base = "http://127.0.0.1:8000/v1"
response = openai.ChatCompletion.create(
model="cyllama",
messages=[{"role": "user", "content": "Hello!"}]
)
Mongoose Server¶
High-performance C server using Mongoose library.
from cyllama.llama.server.mongoose_server import EmbeddedServer
server = EmbeddedServer(
model_path="models/llama.gguf",
host="127.0.0.1",
port=8080,
n_ctx=2048,
n_threads=4
)
server.start()
# Server runs in background
# Access at http://127.0.0.1:8080
server.stop()
LlamaServer¶
Python wrapper around the llama.cpp server binary.
from cyllama.llama.server import LlamaServer, LauncherServerConfig
config = LauncherServerConfig(
model_path="models/llama.gguf",
host="127.0.0.1",
port=8080
)
server = LlamaServer(config, server_binary="bin/llama-server")
server.start()
# Check status
if server.is_running():
print("Server is running")
server.stop()
Multimodal Support¶
LLAVA and other vision-language models.
from cyllama.llama.mtmd.multimodal import (
LlavaImageEmbed,
load_mmproj,
process_image
)
# Load multimodal projector
mmproj = load_mmproj("models/mmproj.gguf")
# Process image
image_embed = process_image(
ctx=ctx,
image_path="image.jpg",
mmproj=mmproj
)
# Use in generation
# Image embeddings are automatically integrated into context
Whisper Integration¶
Speech-to-text transcription using whisper.cpp. See Whisper.cpp Integration for complete documentation.
Quick Start¶
from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np
# Load model
ctx = WhisperContext("models/ggml-base.en.bin")
# Audio must be 16kHz mono float32
samples = load_audio_as_float32("audio.wav") # Your audio loading function
# Transcribe
params = WhisperFullParams()
params.language = "en"
ctx.full(samples, params)
# Get results
for i in range(ctx.full_n_segments()):
t0 = ctx.full_get_segment_t0(i) / 100.0 # centiseconds to seconds
t1 = ctx.full_get_segment_t1(i) / 100.0
text = ctx.full_get_segment_text(i)
print(f"[{t0:.2f}s - {t1:.2f}s] {text}")
Key Classes¶
| Class | Description |
|---|---|
| WhisperContext | Main context for model loading and inference |
| WhisperContextParams | Configuration for context creation |
| WhisperFullParams | Configuration for transcription |
| WhisperVadParams | Voice activity detection parameters |
WhisperContext Methods¶
| Method | Description |
|---|---|
| full(samples, params) | Run transcription on float32 audio samples |
| full_n_segments() | Get number of transcribed segments |
| full_get_segment_text(i) | Get text of segment i |
| full_get_segment_t0(i) | Get start time (centiseconds) |
| full_get_segment_t1(i) | Get end time (centiseconds) |
| full_lang_id() | Get detected language ID |
| is_multilingual() | Check if model supports multiple languages |
Audio Requirements¶
- Sample rate: 16000 Hz
- Channels: Mono
- Format: Float32 normalized to [-1.0, 1.0]
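A common preparation step is converting 16-bit PCM to this format. A minimal NumPy sketch (it assumes the audio is already at 16 kHz; resampling is out of scope here):

```python
import numpy as np

# Convert interleaved int16 PCM to mono float32 in [-1.0, 1.0].
# Assumes the input is already sampled at 16 kHz.
def pcm16_to_whisper_input(pcm, n_channels=1):
    samples = np.asarray(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    if n_channels > 1:
        # Average interleaved channels down to mono
        samples = samples.reshape(-1, n_channels).mean(axis=1)
    return samples

stereo = [16384, -16384, 32767, 0]  # two interleaved stereo frames
mono = pcm16_to_whisper_input(stereo, n_channels=2)
```

The resulting array can be passed directly to `ctx.full(samples, params)`.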
Stable Diffusion Integration¶
Image generation using stable-diffusion.cpp. Supports SD 1.x/2.x, SDXL, SD3, FLUX, video generation (Wan/CogVideoX), and ESRGAN upscaling.
Note: Build with WITH_STABLEDIFFUSION=1 to enable this module.
Quick Start¶
from cyllama.stablediffusion import text_to_image
# Simple text-to-image generation
images = text_to_image(
model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
prompt="a photo of a cute cat",
width=512,
height=512,
sample_steps=4,
cfg_scale=1.0
)
# Save the result
images[0].save("output.png")
text_to_image()¶
Convenience function for text-to-image generation.
def text_to_image(
model_path: str,
prompt: str,
negative_prompt: str = "",
width: int = 512,
height: int = 512,
seed: int = -1,
batch_count: int = 1,
sample_steps: int = 20,
cfg_scale: float = 7.0,
sample_method: Optional[SampleMethod] = None,
scheduler: Optional[Scheduler] = None,
clip_skip: int = -1,
n_threads: int = -1
) -> List[SDImage]
Parameters:
- model_path (str): Path to model file (.gguf, .safetensors, or .ckpt)
- prompt (str): Text prompt for generation
- negative_prompt (str): Negative prompt (what to avoid)
- width (int): Output image width (default: 512)
- height (int): Output image height (default: 512)
- seed (int): Random seed (-1 for random)
- batch_count (int): Number of images to generate
- sample_steps (int): Sampling steps (use 1-4 for turbo models, 20+ for others)
- cfg_scale (float): CFG scale (use 1.0 for turbo, 7.0 for others)
- sample_method (SampleMethod): Sampling method (EULER, EULER_A, DPM2, etc.)
- scheduler (Scheduler): Scheduler (DISCRETE, KARRAS, EXPONENTIAL, etc.)
- clip_skip (int): CLIP skip layers (-1 for default)
- n_threads (int): Number of threads (-1 for auto)
Returns:
List[SDImage]: List of generated images
image_to_image()¶
Image-to-image generation with an initial image.
def image_to_image(
model_path: str,
init_image: SDImage,
prompt: str,
negative_prompt: str = "",
strength: float = 0.75,
seed: int = -1,
sample_steps: int = 20,
cfg_scale: float = 7.0,
sample_method: Optional[SampleMethod] = None,
scheduler: Optional[Scheduler] = None,
clip_skip: int = -1,
n_threads: int = -1
) -> List[SDImage]
Parameters:
- init_image (SDImage): Initial image to transform
- strength (float): Transformation strength (0.0-1.0)
- Other parameters same as text_to_image()
SDContext¶
Main context class for model reuse and advanced generation.
from cyllama.stablediffusion import SDContext, SDContextParams, SampleMethod, Scheduler
# Create context
params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4
ctx = SDContext(params)
# Generate images
images = ctx.generate(
prompt="a beautiful landscape",
negative_prompt="blurry, ugly",
width=512,
height=512,
sample_steps=4,
cfg_scale=1.0,
sample_method=SampleMethod.EULER,
scheduler=Scheduler.DISCRETE
)
# Check if context is valid
print(ctx.is_valid)
Methods:
- generate(...): Generate images from text prompt
- generate_with_params(params: SDImageGenParams): Low-level generation
- generate_video(...): Generate video frames (requires video-capable model)
SDContextParams¶
Configuration for model loading.
params = SDContextParams()
params.model_path = "model.gguf" # Main model
params.vae_path = "vae.safetensors" # Optional VAE
params.clip_l_path = "clip_l.safetensors" # Optional CLIP-L (for SDXL)
params.clip_g_path = "clip_g.safetensors" # Optional CLIP-G (for SDXL)
params.t5xxl_path = "t5xxl.safetensors" # Optional T5-XXL (for SD3/FLUX)
params.lora_model_dir = "loras/" # LoRA directory
params.n_threads = 4 # Thread count
params.vae_decode_only = True # VAE decode only mode
params.diffusion_flash_attn = False # Flash attention
params.wtype = SDType.F16 # Weight type
params.rng_type = RngType.CPU # RNG type
SDImage¶
Image wrapper with numpy and PIL integration.
from cyllama.stablediffusion import SDImage
import numpy as np
# Create from numpy array
arr = np.zeros((512, 512, 3), dtype=np.uint8)
img = SDImage.from_numpy(arr)
# Properties
print(img.width, img.height, img.channels)
# Convert to numpy
arr = img.to_numpy() # Returns (H, W, C) uint8 array
# Convert to PIL (requires Pillow)
pil_img = img.to_pil()
# Save to file
img.save("output.png")
# Load from file
img = SDImage.load("input.png")
SDImageGenParams¶
Detailed generation parameters.
from cyllama.stablediffusion import SDImageGenParams, SDImage
params = SDImageGenParams()
params.prompt = "a cute cat"
params.negative_prompt = "ugly, blurry"
params.width = 512
params.height = 512
params.seed = 42
params.batch_count = 1
params.strength = 0.75 # For img2img
params.clip_skip = -1
# Set init image for img2img
init_img = SDImage.from_numpy(arr)
params.set_init_image(init_img)
# Set control image for ControlNet
params.set_control_image(control_img, strength=0.8)
# Access sample parameters
sample = params.sample_params
sample.sample_steps = 20
sample.cfg_scale = 7.0
sample.sample_method = SampleMethod.EULER
sample.scheduler = Scheduler.KARRAS
SDSampleParams¶
Sampling configuration.
from cyllama.stablediffusion import SDSampleParams, SampleMethod, Scheduler
params = SDSampleParams()
params.sample_method = SampleMethod.EULER_A
params.scheduler = Scheduler.KARRAS
params.sample_steps = 20
params.cfg_scale = 7.0
params.eta = 0.0 # Noise multiplier
Upscaler¶
ESRGAN-based image upscaling.
from cyllama.stablediffusion import Upscaler, SDImage
# Load upscaler model
upscaler = Upscaler("models/esrgan-x4.bin", n_threads=4)
# Check upscale factor
print(f"Factor: {upscaler.upscale_factor}x")
# Upscale an image
img = SDImage.load("input.png")
upscaled = upscaler.upscale(img)
# Or specify custom factor
upscaled = upscaler.upscale(img, factor=2)
upscaled.save("upscaled.png")
convert_model()¶
Convert models between formats.
from cyllama.stablediffusion import convert_model, SDType
# Convert safetensors to GGUF with quantization
convert_model(
input_path="sd-v1-5.safetensors",
output_path="sd-v1-5-q4_0.gguf",
output_type=SDType.Q4_0,
vae_path="vae-ft-mse.safetensors" # Optional
)
canny_preprocess()¶
Canny edge detection for ControlNet.
from cyllama.stablediffusion import SDImage, canny_preprocess
img = SDImage.load("photo.png")
# Apply Canny preprocessing (modifies image in place)
success = canny_preprocess(
img,
high_threshold=0.8,
low_threshold=0.1,
weak=0.5,
strong=1.0,
inverse=False
)
Callbacks¶
Set callbacks for logging, progress, and preview.
from cyllama.stablediffusion import (
set_log_callback,
set_progress_callback,
set_preview_callback
)
# Log callback
def log_cb(level, text):
level_names = {0: 'DEBUG', 1: 'INFO', 2: 'WARN', 3: 'ERROR'}
print(f'[{level_names.get(level, level)}] {text}', end='')
set_log_callback(log_cb)
# Progress callback
def progress_cb(step, steps, time_ms):
pct = (step / steps) * 100 if steps > 0 else 0
print(f'Step {step}/{steps} ({pct:.1f}%) - {time_ms:.2f}s')
set_progress_callback(progress_cb)
# Preview callback (for real-time preview during generation)
def preview_cb(step, frames, is_noisy):
for i, frame in enumerate(frames):
frame.save(f"preview_{step}_{i}.png")
set_preview_callback(preview_cb)
# Clear callbacks
set_log_callback(None)
set_progress_callback(None)
set_preview_callback(None)
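A progress callback can also derive an ETA from the elapsed time it receives. This sketch builds such a callback around the `(step, steps, time_ms)` signature shown above (the timing unit is assumed to be milliseconds, matching the parameter name):

```python
def make_progress_cb():
    """Return a progress callback that also estimates remaining time."""
    def cb(step, steps, time_ms):
        if step <= 0 or steps <= 0:
            return None
        per_step = time_ms / step        # average time per completed step
        eta = per_step * (steps - step)  # remaining steps * average time
        print(f"Step {step}/{steps}, ETA {eta:.0f}ms")
        return eta
    return cb

cb = make_progress_cb()
print(cb(5, 20, 1000.0))  # 200ms/step * 15 remaining steps = 3000.0
```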
Enums¶
SampleMethod: Sampling methods
EULER, EULER_A, HEUN, DPM2, DPMPP2S_A, DPMPP2M, DPMPP2Mv2, IPNDM, IPNDM_V, LCM, DDIM_TRAILING, TCD
Scheduler: Schedulers
DISCRETE, KARRAS, EXPONENTIAL, AYS, GITS, SGM_UNIFORM, SIMPLE, SMOOTHSTEP, LCM
SDType: Data types for quantization
F32, F16, BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K
RngType: Random number generators
STD_DEFAULT, CUDA, CPU
LogLevel: Log levels
DEBUG, INFO, WARN, ERROR
Utility Functions¶
from cyllama.stablediffusion import (
get_num_cores,
get_system_info,
type_name,
sample_method_name,
scheduler_name
)
# System info
print(f"CPU cores: {get_num_cores()}")
print(get_system_info())
# Get string names
print(type_name(SDType.Q4_0)) # "q4_0"
print(sample_method_name(SampleMethod.EULER)) # "euler"
print(scheduler_name(Scheduler.KARRAS)) # "karras"
CLI Tool¶
Command-line interface for stable diffusion operations.
# Generate image
python -m cyllama.stablediffusion generate \
--model models/sd_xl_turbo_1.0.q8_0.gguf \
--prompt "a beautiful sunset" \
--output sunset.png \
--steps 4 --cfg 1.0
# Upscale image
python -m cyllama.stablediffusion upscale \
--model models/esrgan-x4.bin \
--input image.png \
--output image_4x.png
# Convert model
python -m cyllama.stablediffusion convert \
--input sd-v1-5.safetensors \
--output sd-v1-5-q4_0.gguf \
--type q4_0
# Show system info
python -m cyllama.stablediffusion info
Supported Models¶
- SD 1.x/2.x: Standard Stable Diffusion models
- SDXL/SDXL Turbo: Stable Diffusion XL (use cfg_scale=1.0, steps=1-4 for Turbo)
- SD3/SD3.5: Stable Diffusion 3.x
- FLUX: FLUX.1 models (dev, schnell)
- Wan/CogVideoX: Video generation models (use generate_video())
- LoRA: Low-rank adaptation files
- ControlNet: Conditional generation with control images
- ESRGAN: Image upscaling models
Error Handling¶
All cyllama functions raise appropriate Python exceptions:
from cyllama import complete, LLM
try:
response = complete("Hello", model_path="nonexistent.gguf")
except FileNotFoundError:
print("Model file not found")
except RuntimeError as e:
print(f"Runtime error: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
# LLM with error handling
try:
gen = LLM("models/llama.gguf")
response = gen("What is Python?")
except Exception as e:
print(f"Generation failed: {e}")
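Transient failures (for example, temporary resource exhaustion) can be handled with a simple retry loop around the generation call. A generic sketch with a stub in place of the real call; the retried exception type, attempt count, and backoff are choices you would tune for your deployment:

```python
import time

def with_retries(fn, attempts=3, delay=0.1):
    """Call fn(), retrying on RuntimeError with a linear backoff."""
    last = None
    for i in range(attempts):
        try:
            return fn()
        except RuntimeError as e:
            last = e
            time.sleep(delay * (i + 1))  # wait longer after each failure
    raise last

# Stub standing in for a generation call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(with_retries(flaky))  # "ok" on the third attempt
```

Note that `FileNotFoundError` is deliberately not retried: a missing model file will not fix itself.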
Type Hints¶
All functions include comprehensive type hints for IDE support:
from typing import List, Dict, Optional, Iterator, Callable, Tuple
from cyllama import (
complete, # str | Iterator[str]
chat, # str | Iterator[str]
LLM, # class
GenerationConfig, # @dataclass
)
Performance Tips¶
1. Model Reuse¶
# BAD: Reloads model each time (slow)
for prompt in prompts:
response = complete(prompt, model_path="model.gguf")
# GOOD: Reuses loaded model (fast)
gen = LLM("model.gguf")
for prompt in prompts:
response = gen(prompt)
2. Batch Processing¶
from cyllama import batch_generate, GenerationConfig
# BAD: Sequential processing
responses = [complete(p, model_path="model.gguf") for p in prompts]
# GOOD: Parallel batch processing (3-10x faster)
prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(
prompts,
model_path="model.gguf",
n_seq_max=8, # Max parallel sequences
config=GenerationConfig(max_tokens=50, temperature=0.7)
)
3. GPU Offloading¶
# Estimate optimal layers
from cyllama import estimate_gpu_layers
estimate = estimate_gpu_layers("model.gguf", available_vram_mb=8000)
# Use recommended settings
config = GenerationConfig(n_gpu_layers=estimate.n_gpu_layers)
gen = LLM("model.gguf", config=config)
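The idea behind `estimate_gpu_layers` can be pictured as a per-layer VRAM budget: offload as many layers as fit, while leaving headroom for the KV cache and activations. A hypothetical back-of-the-envelope heuristic (the real function inspects the GGUF file; the layer size and reserve here are illustrative numbers only):

```python
def layers_that_fit(total_layers, vram_mb, per_layer_mb, reserve_mb=1024):
    """Offload as many layers as fit in VRAM, keeping a KV-cache reserve."""
    usable = max(0, vram_mb - reserve_mb)
    return min(total_layers, usable // per_layer_mb)

# e.g. a 32-layer model at ~180 MB/layer on an 8 GB card:
print(layers_that_fit(32, 8000, 180))  # 38 layers would fit; capped at 32
```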
4. Context Sizing¶
# Auto-size context (recommended)
config = GenerationConfig(n_ctx=None, max_tokens=200)
# Manual sizing (for control)
config = GenerationConfig(n_ctx=2048, max_tokens=200)
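When sizing `n_ctx` manually, the context must hold both the prompt tokens and the generation budget. A rough helper for picking a value (the 4-characters-per-token ratio is a common English-text heuristic, not a cyllama guarantee; tokenize the prompt for an exact count):

```python
def estimate_n_ctx(prompt, max_tokens, chars_per_token=4, margin=64):
    """Rough context size: prompt tokens + generation budget + safety margin."""
    prompt_tokens = len(prompt) // chars_per_token + 1
    return prompt_tokens + max_tokens + margin

print(estimate_n_ctx("What is Python?" * 10, max_tokens=200))
```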
5. Streaming for Long Outputs¶
# Non-streaming: waits for complete response
response = complete("Write a long essay", model_path="model.gguf", max_tokens=2000)
# Streaming: see output as it generates
for chunk in complete("Write a long essay", model_path="model.gguf",
max_tokens=2000, stream=True):
print(chunk, end="", flush=True)
Version Compatibility¶
- Python: >=3.10 (tested on 3.13)
- llama.cpp: b8429
- Platform: macOS, Linux, Windows
See Also¶
- User Guide - Comprehensive usage guide
- Cookbook - Practical recipes and patterns
- Changelog - Release history
- llama.cpp Documentation
Last Updated: March 2026 Cyllama Version: 0.1.20