Cyllama API Reference

Version: 0.1.20 Date: March 2026

Complete API reference for cyllama, a high-performance Python library for LLM inference built on llama.cpp.

Table of Contents

  1. High-Level Generation API
  2. Async API
  3. Framework Integrations
  4. Memory Utilities
  5. Core llama.cpp API
  6. Advanced Features
  7. Server Implementations
  8. Multimodal Support
  9. Whisper Integration
  10. Stable Diffusion Integration

High-Level Generation API

The high-level API provides simple, Pythonic functions and classes for text generation.

complete()

One-shot text generation function.

def complete(
    prompt: str,
    model_path: str,
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    **kwargs
) -> Response | Iterator[str]

Parameters:

  • prompt (str): Input text prompt
  • model_path (str): Path to GGUF model file
  • config (GenerationConfig, optional): Generation configuration object
  • stream (bool): If True, return iterator of text chunks
  • **kwargs: Override config parameters (temperature, max_tokens, etc.)

Returns:

  • Response: Response object with text and stats (if stream=False)
  • Iterator[str]: Iterator of text chunks (if stream=True)

Example:

from cyllama import complete

response = complete(
    "What is Python?",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)

# Streaming
for chunk in complete("Tell me a story", model_path="models/llama.gguf", stream=True):
    print(chunk, end="", flush=True)

chat()

Chat-style generation with message history. Automatically applies the model's built-in chat template.

def chat(
    messages: List[Dict[str, str]],
    model_path: str,
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    template: Optional[str] = None,
    **kwargs
) -> Response | Iterator[str]

Parameters:

  • messages (List[Dict]): List of message dicts with 'role' and 'content' keys
  • model_path (str): Path to GGUF model file
  • config (GenerationConfig, optional): Generation configuration
  • stream (bool): Enable streaming output
  • template (str, optional): Chat template name to use. If None, uses model's default.
  • **kwargs: Override config parameters

Returns:

  • Response: Response object with text and stats (if stream=False)
  • Iterator[str]: Iterator of text chunks (if stream=True)

Example:

from cyllama import chat

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]

response = chat(messages, model_path="models/llama.gguf")

# With explicit template
response = chat(messages, model_path="models/llama.gguf", template="chatml")

apply_chat_template()

Apply a chat template to format messages into a prompt string.

def apply_chat_template(
    messages: List[Dict[str, str]],
    model_path: str,
    template: Optional[str] = None,
    add_generation_prompt: bool = True,
    verbose: bool = False,
) -> str

Parameters:

  • messages (List[Dict]): List of message dicts with 'role' and 'content' keys
  • model_path (str): Path to GGUF model file
  • template (str, optional): Template name or string. If None, uses model's default.
  • add_generation_prompt (bool): Add assistant prompt prefix (default: True)
  • verbose (bool): Enable detailed logging

Returns:

  • str: Formatted prompt string

Supported Templates:

  • llama2, llama3, llama4
  • chatml (Qwen, Yi, etc.)
  • mistral-v1, mistral-v3, mistral-v7
  • phi3, phi4
  • deepseek, deepseek2, deepseek3
  • gemma, falcon3, command-r, vicuna, zephyr, and more

Example:

from cyllama.api import apply_chat_template

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]

prompt = apply_chat_template(messages, "models/llama.gguf")
print(prompt)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>
# Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
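
For comparison, the chatml template listed under Supported Templates interleaves <|im_start|>/<|im_end|> markers. Below is a hand-rolled formatter sketch of that format; it is illustrative only, and real code should call apply_chat_template() so the model's own template is honored:

```python
# Hand-rolled chatml formatter, for illustration only; prefer
# apply_chat_template(), which reads the template from the model.
from typing import Dict, List

def format_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        # Leave the prompt open for the assistant's reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```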

get_chat_template()

Get the chat template string from a model.

def get_chat_template(
    model_path: str,
    template_name: Optional[str] = None
) -> str

Parameters:

  • model_path (str): Path to GGUF model file
  • template_name (str, optional): Specific template name to retrieve

Returns:

  • str: Template string (Jinja-style), or empty string if not found

Example:

from cyllama.api import get_chat_template

template = get_chat_template("models/llama.gguf")
print(template)  # Shows the Jinja-style template

Response Class

Structured response object returned by generation functions.

@dataclass
class Response:
    text: str                           # Generated text content
    stats: Optional[GenerationStats]    # Generation statistics
    finish_reason: str = "stop"         # Why generation stopped
    model: str = ""                     # Model path used

Attributes:

  • text (str): The generated text content
  • stats (GenerationStats, optional): Statistics including timing and token counts
  • finish_reason (str): Reason for completion ("stop", "length", etc.)
  • model (str): Path to the model used

String Compatibility:

Response implements the string protocol for backward compatibility:

  • str(response) returns response.text
  • response == "string" compares with text
  • len(response) returns text length
  • for char in response: iterates over text characters
  • "substring" in response checks text containment
  • response + " more" concatenates text
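
These shims amount to delegating Python's string dunder methods to the text field. A stripped-down sketch of the pattern (not cyllama's actual class, which also carries stats, finish_reason, and model):

```python
# Minimal illustration of the string-protocol delegation described
# above; cyllama's real Response carries additional fields.
class ResponseSketch:
    def __init__(self, text: str):
        self.text = text

    def __str__(self) -> str:
        return self.text

    def __eq__(self, other):
        if isinstance(other, str):
            return self.text == other
        return NotImplemented

    def __len__(self) -> int:
        return len(self.text)

    def __iter__(self):
        return iter(self.text)

    def __contains__(self, item: str) -> bool:
        return item in self.text

    def __add__(self, other: str) -> str:
        return self.text + other
```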

Methods:

to_dict()

Convert response to dictionary.

def to_dict(self) -> Dict[str, Any]

to_json()

Convert response to JSON string.

def to_json(self, indent: Optional[int] = None) -> str

Example:

from cyllama import complete

response = complete("What is Python?", model_path="model.gguf")

# Use as string (backward compatible)
print(response)  # Prints text
if "programming" in response:
    print("Mentioned programming!")

# Access structured data
print(f"Finish reason: {response.finish_reason}")
if response.stats:
    print(f"Tokens/sec: {response.stats.tokens_per_second:.1f}")

# Serialize
data = response.to_dict()
json_str = response.to_json(indent=2)

GenerationStats Class

Statistics from a generation run.

@dataclass
class GenerationStats:
    prompt_tokens: int              # Number of tokens in prompt
    generated_tokens: int           # Number of tokens generated
    total_time: float               # Total generation time (seconds)
    tokens_per_second: float        # Generation speed
    prompt_time: float = 0.0        # Time for prompt processing
    generation_time: float = 0.0    # Time for token generation

LLM Class

Reusable generator with model caching for improved performance.

class LLM:
    def __init__(
        self,
        model_path: str,
        config: Optional[GenerationConfig] = None,
        verbose: bool = False
    )

Parameters:

  • model_path (str): Path to GGUF model file
  • config (GenerationConfig, optional): Default generation configuration
  • verbose (bool): Print detailed information during generation

Methods:

__call__()

Generate text from a prompt.

def __call__(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    on_token: Optional[Callable[[str], None]] = None
) -> Response | Iterator[str]

Parameters:

  • prompt (str): Input text
  • config (GenerationConfig, optional): Override instance config
  • stream (bool): Enable streaming
  • on_token (Callable, optional): Callback for each token

Returns:

  • Response: Response object with text and stats (if stream=False)
  • Iterator[str]: Iterator of text chunks (if stream=True)

chat()

Generate a response from chat messages using the model's chat template.

def chat(
    self,
    messages: List[Dict[str, str]],
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    template: Optional[str] = None
) -> str | Iterator[str]

Parameters:

  • messages (List[Dict]): List of message dicts with 'role' and 'content' keys
  • config (GenerationConfig, optional): Override instance config
  • stream (bool): Enable streaming
  • template (str, optional): Chat template name to use

get_chat_template()

Get the chat template string from the loaded model.

def get_chat_template(
    self,
    template_name: Optional[str] = None
) -> str

Example:

from cyllama import LLM, GenerationConfig

gen = LLM("models/llama.gguf")

# Simple generation
response = gen("What is Python?")

# With custom config
config = GenerationConfig(temperature=0.9, max_tokens=100)
response = gen("Tell me a joke", config=config)

# With statistics
response, stats = gen.generate_with_stats("Question?")
print(f"Generated {stats.generated_tokens} tokens in {stats.total_time:.2f}s")
print(f"Speed: {stats.tokens_per_second:.2f} tokens/sec")

# Chat with template
messages = [{"role": "user", "content": "Hello!"}]
response = gen.chat(messages)

# Get template
template = gen.get_chat_template()

GenerationConfig Dataclass

Configuration for text generation.

@dataclass
class GenerationConfig:
    max_tokens: int = 512
    temperature: float = 0.8
    top_k: int = 40
    top_p: float = 0.95
    min_p: float = 0.05
    repeat_penalty: float = 1.1
    n_gpu_layers: int = 99
    n_ctx: Optional[int] = None
    n_batch: int = 512
    seed: int = -1
    stop_sequences: List[str] = field(default_factory=list)
    add_bos: bool = True
    parse_special: bool = True

Attributes:

  • max_tokens: Maximum tokens to generate (default: 512)
  • temperature: Sampling temperature, 0.0 = greedy (default: 0.8)
  • top_k: Top-k sampling parameter (default: 40)
  • top_p: Top-p (nucleus) sampling (default: 0.95)
  • min_p: Minimum probability threshold (default: 0.05)
  • repeat_penalty: Penalty for repeating tokens (default: 1.1)
  • n_gpu_layers: GPU layers to offload (default: 99 = all)
  • n_ctx: Context window size, None = auto (default: None)
  • n_batch: Batch size for processing (default: 512)
  • seed: Random seed, -1 = random (default: -1)
  • stop_sequences: Strings that stop generation (default: [])
  • add_bos: Add beginning-of-sequence token (default: True)
  • parse_special: Parse special tokens in prompt (default: True)
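
Because GenerationConfig is a plain dataclass, per-call variants can be derived with dataclasses.replace() instead of mutating a shared default. The class is re-declared below only to keep the sketch self-contained:

```python
# GenerationConfig re-declared from the definition above so this
# sketch runs standalone; in real code, import it from cyllama.
from dataclasses import dataclass, field, replace
from typing import List, Optional

@dataclass
class GenerationConfig:
    max_tokens: int = 512
    temperature: float = 0.8
    top_k: int = 40
    top_p: float = 0.95
    min_p: float = 0.05
    repeat_penalty: float = 1.1
    n_gpu_layers: int = 99
    n_ctx: Optional[int] = None
    n_batch: int = 512
    seed: int = -1
    stop_sequences: List[str] = field(default_factory=list)
    add_bos: bool = True
    parse_special: bool = True

base = GenerationConfig(temperature=0.2, stop_sequences=["\n\n"])
# Deterministic variant for tests: greedy sampling with a fixed seed
greedy = replace(base, temperature=0.0, seed=42)
```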

Async API

The async API provides non-blocking generation for use in async applications (FastAPI, aiohttp, etc.).

AsyncLLM Class

Async wrapper around the LLM class for non-blocking text generation.

class AsyncLLM:
    def __init__(
        self,
        model_path: str,
        config: Optional[GenerationConfig] = None,
        verbose: bool = False,
        **kwargs
    )

Parameters:

  • model_path (str): Path to GGUF model file
  • config (GenerationConfig, optional): Generation configuration
  • verbose (bool): Print detailed information during generation
  • **kwargs: Generation parameters (temperature, max_tokens, etc.)

Methods:

__call__() / generate()

Generate text asynchronously.

async def __call__(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None,
    **kwargs
) -> str

stream()

Stream generated text chunks asynchronously.

async def stream(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None,
    **kwargs
) -> AsyncIterator[str]

generate_with_stats()

Generate text and return statistics.

async def generate_with_stats(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None
) -> Tuple[str, GenerationStats]

Example:

import asyncio
from cyllama import AsyncLLM

async def main():
    # Context manager ensures cleanup
    async with AsyncLLM("model.gguf", temperature=0.7) as llm:
        # Simple generation
        response = await llm("What is Python?")
        print(response)

        # Streaming
        async for chunk in llm.stream("Tell me a story"):
            print(chunk, end="", flush=True)

        # With stats
        text, stats = await llm.generate_with_stats("Question?")
        print(f"Generated {stats.generated_tokens} tokens")

asyncio.run(main())

complete_async()

Async convenience function for one-off text completion.

async def complete_async(
    prompt: str,
    model_path: str,
    config: Optional[GenerationConfig] = None,
    verbose: bool = False,
    **kwargs
) -> str

Example:

response = await complete_async(
    "What is Python?",
    model_path="model.gguf",
    temperature=0.7
)

chat_async()

Async convenience function for chat-style generation.

async def chat_async(
    messages: List[Dict[str, str]],
    model_path: str,
    config: Optional[GenerationConfig] = None,
    verbose: bool = False,
    **kwargs
) -> str

Example:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = await chat_async(messages, model_path="model.gguf")

stream_complete_async()

Async streaming completion for one-off use.

async def stream_complete_async(
    prompt: str,
    model_path: str,
    config: Optional[GenerationConfig] = None,
    verbose: bool = False,
    **kwargs
) -> AsyncIterator[str]

Example:

async for chunk in stream_complete_async("Tell me a story", "model.gguf"):
    print(chunk, end="", flush=True)

Framework Integrations

OpenAI-Compatible API

Drop-in replacement for the OpenAI Python client.

OpenAICompatibleClient Class

from cyllama.integrations.openai_compat import OpenAICompatibleClient

class OpenAICompatibleClient:
    def __init__(
        self,
        model_path: str,
        temperature: float = 0.7,
        max_tokens: int = 512,
        n_gpu_layers: int = 99
    )

Attributes:

  • chat: Chat completions interface

Example:

from cyllama.integrations.openai_compat import OpenAICompatibleClient

client = OpenAICompatibleClient(model_path="models/llama.gguf")

response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

LangChain Integration

Full LangChain LLM interface implementation.

CyllamaLLM Class

from cyllama.integrations import CyllamaLLM

class CyllamaLLM(LLM):
    model_path: str
    temperature: float = 0.7
    max_tokens: int = 512
    top_k: int = 40
    top_p: float = 0.95
    repeat_penalty: float = 1.1
    n_gpu_layers: int = 99

Example:

from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = CyllamaLLM(model_path="models/llama.gguf", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms:"
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="quantum computing")

# With streaming
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = CyllamaLLM(
    model_path="models/llama.gguf",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

Memory Utilities

Tools for estimating and optimizing GPU memory usage.

estimate_gpu_layers()

Estimate optimal number of GPU layers for available VRAM.

def estimate_gpu_layers(
    model_path: str,
    available_vram_mb: int,
    n_ctx: int = 2048,
    n_batch: int = 512
) -> MemoryEstimate

Parameters:

  • model_path (str): Path to GGUF model file
  • available_vram_mb (int): Available VRAM in megabytes
  • n_ctx (int): Context window size
  • n_batch (int): Batch size

Returns:

  • MemoryEstimate: Object with recommended settings

Example:

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="models/llama.gguf",
    available_vram_mb=8000,  # 8GB VRAM
    n_ctx=2048
)

print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
print(f"Estimated VRAM usage: {estimate.vram / 1024 / 1024:.2f} MB")

estimate_memory_usage()

Estimate total memory requirements for model loading.

def estimate_memory_usage(
    model_path: str,
    n_ctx: int = 2048,
    n_batch: int = 512,
    n_gpu_layers: int = 0
) -> MemoryEstimate

MemoryEstimate Dataclass

Memory estimation results.

@dataclass
class MemoryEstimate:
    layers: int                          # Total layers
    graph_size: int                      # Computation graph size
    vram: int                            # VRAM usage (bytes)
    vram_kv: int                         # KV cache VRAM (bytes)
    total_size: int                      # Total memory (bytes)
    tensor_split: Optional[List[int]]    # Multi-GPU split
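
The vram, vram_kv, and total_size fields are raw byte counts. A small helper for rendering them in human-readable units (hypothetical, not part of cyllama):

```python
# Hypothetical helper, not part of cyllama: render a byte count such
# as MemoryEstimate.vram in binary units for log output.
def fmt_bytes(n: float) -> str:
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return f"{n} B" if unit == "B" else f"{n:.2f} {unit}"
        n /= 1024
```

Usage: print(f"Estimated VRAM: {fmt_bytes(estimate.vram)}").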

Core llama.cpp API

Low-level Cython wrappers for direct llama.cpp access.

Core Classes

LlamaModel

Represents a loaded GGUF model.

from cyllama.llama.llama_cpp import LlamaModel, LlamaModelParams

params = LlamaModelParams()
params.n_gpu_layers = 99
params.use_mmap = True
params.use_mlock = False

model = LlamaModel("models/llama.gguf", params)

# Properties
print(model.n_params)      # Total parameters
print(model.n_layers)      # Number of layers
print(model.n_embd)        # Embedding dimension
print(model.n_vocab)       # Vocabulary size

# Methods
vocab = model.get_vocab()  # Get vocabulary
model.free()               # Free resources

LlamaContext

Inference context for a loaded model.

from cyllama.llama.llama_cpp import LlamaContext, LlamaContextParams

ctx_params = LlamaContextParams()
ctx_params.n_ctx = 2048
ctx_params.n_batch = 512
ctx_params.n_threads = 4
ctx_params.n_threads_batch = 4

ctx = LlamaContext(model, ctx_params)

# Decode batch
from cyllama.llama.llama_cpp import llama_batch_get_one
batch = llama_batch_get_one(tokens)
ctx.decode(batch)

# KV cache management
ctx.kv_cache_clear()
ctx.kv_cache_seq_rm(seq_id, p0, p1)
ctx.kv_cache_seq_add(seq_id, p0, p1, delta)

# Performance
ctx.print_perf_data()

LlamaSampler

Sampling strategies for token generation.

from cyllama.llama.llama_cpp import LlamaSampler, LlamaSamplerChainParams

sampler_params = LlamaSamplerChainParams()
sampler = LlamaSampler(sampler_params)

# Add sampling methods
sampler.add_top_k(40)
sampler.add_top_p(0.95, 1)
sampler.add_temp(0.7)
sampler.add_dist(seed)

# Sample token
token_id = sampler.sample(ctx, idx)

# Reset state
sampler.reset()

LlamaVocab

Vocabulary and tokenization.

vocab = model.get_vocab()

# Tokenization
tokens = vocab.tokenize("Hello world", add_special=True, parse_special=True)

# Detokenization
text = vocab.detokenize(tokens)
piece = vocab.token_to_piece(token_id, special=True)

# Special tokens
print(vocab.bos)           # Begin-of-sequence token
print(vocab.eos)           # End-of-sequence token
print(vocab.eot)           # End-of-turn token
print(vocab.n_vocab)       # Vocabulary size

# Check token types
is_eog = vocab.is_eog(token_id)
is_control = vocab.is_control(token_id)

LlamaBatch

Efficient batch processing.

from cyllama.llama.llama_cpp import LlamaBatch

# Create batch
batch = LlamaBatch(n_tokens=512, embd=0, n_seq_max=1)

# Add token
batch.add(token_id, pos, seq_ids=[0], logits=True)

# Clear batch
batch.clear()

# Convenience function
from cyllama.llama.llama_cpp import llama_batch_get_one
batch = llama_batch_get_one(tokens, pos_offset=0)

Backend Management

from cyllama.llama.llama_cpp import (
    ggml_backend_load_all,
    ggml_backend_offload_supported,
    ggml_backend_metal_set_n_cb
)

# Load all available backends (Metal, CUDA, etc.)
ggml_backend_load_all()

# Check GPU support
if ggml_backend_offload_supported():
    print("GPU offload supported")

# Configure Metal (macOS)
ggml_backend_metal_set_n_cb(2)  # Number of command buffers

Advanced Features

GGUF File Manipulation

Inspect and modify GGUF model files.

GGUFContext Class

from cyllama.llama.llama_cpp import GGUFContext

# Read existing file
ctx = GGUFContext.from_file("model.gguf")

# Get metadata
metadata = ctx.get_all_metadata()
print(metadata['general.architecture'])
print(metadata['general.name'])

value = ctx.get_val_str("general.architecture")

# Create new file
ctx = GGUFContext.empty()
ctx.set_val_str("custom.key", "value")
ctx.set_val_u32("custom.number", 42)
ctx.write_to_file("custom.gguf", write_tensors=False)

# Modify existing
ctx = GGUFContext.from_file("model.gguf")
ctx.set_val_str("custom.metadata", "updated")
ctx.write_to_file("modified.gguf")

JSON Schema to Grammar

Convert JSON schemas to llama.cpp grammar format for structured output. This is implemented in pure Python (vendored from llama.cpp) with no C++ dependency.

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    },
    "required": ["name", "age"]
}

grammar = json_schema_to_grammar(schema)

# Use with generation
from cyllama.llama.llama_cpp import LlamaSampler
sampler = LlamaSampler()
sampler.add_grammar(grammar)

Model Download

Download models from HuggingFace with Ollama-style tags.

from cyllama.llama.llama_cpp import download_model, list_cached_models

# Download from HuggingFace
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:q4",
    cache_dir="~/.cache/cyllama/models"
)

# List cached models
models = list_cached_models()
for model in models:
    print(f"{model['user']}/{model['model']}:{model['tag']}")
    print(f"  Path: {model['path']}")
    print(f"  Size: {model['size'] / 1024 / 1024:.2f} MB")

# Direct URL download
download_model(
    url="https://example.com/model.gguf",
    output_path="models/custom.gguf"
)

N-gram Cache

Pattern-based token prediction for a 2-10x speedup on repetitive text.

from cyllama.llama.llama_cpp import NgramCache

# Create cache
cache = NgramCache()

# Learn patterns from token sequences
tokens = [1, 2, 3, 4, 5, 6, 7, 8]
cache.update(tokens, ngram_min=2, ngram_max=4)

# Predict likely continuations
input_tokens = [1, 2, 3]
draft_tokens = cache.draft(input_tokens, n_draft=16)

# Save/load cache
cache.save("patterns.bin")
loaded_cache = NgramCache.from_file("patterns.bin")

# Clear cache
cache.clear()
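
Conceptually, the cache indexes recent n-grams and proposes the continuation seen most often after each one. A pure-Python sketch of the idea (not cyllama's implementation, which operates on llama.cpp token buffers):

```python
from collections import defaultdict

def build_ngram_index(tokens, n=2):
    # Map each n-gram to every token that followed it in the stream
    index = defaultdict(list)
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])].append(tokens[i + n])
    return index

def draft_next(index, context, n=2):
    # Propose the most frequent continuation of the trailing n-gram
    followers = index.get(tuple(context[-n:]))
    if not followers:
        return None
    return max(set(followers), key=followers.count)
```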

Speculative Decoding

Use a draft model for a 2-3x inference speedup.

from cyllama.llama.llama_cpp import (
    LlamaModel, LlamaContext, LlamaModelParams, LlamaContextParams,
    Speculative, SpeculativeParams
)

# Load target and draft models
model_target = LlamaModel("models/large.gguf", LlamaModelParams())
model_draft = LlamaModel("models/small.gguf", LlamaModelParams())

ctx_params = LlamaContextParams()
ctx_params.n_ctx = 2048

ctx_target = LlamaContext(model_target, ctx_params)

# Configure speculative parameters
params = SpeculativeParams(
    n_max=16,        # Maximum number of draft tokens
    n_reuse=8,       # Tokens to reuse
    p_min=0.75       # Minimum acceptance probability
)

# Create speculative decoding instance
spec = Speculative(params, ctx_target)

# Check compatibility
if spec.is_compat():
    print("Models are compatible for speculative decoding")

    # Begin a speculative decoding round
    spec.begin()

    # Generate draft tokens
    prompt_tokens = [1, 2, 3]
    last_token = prompt_tokens[-1]
    draft_tokens = spec.draft(prompt_tokens, last_token)

    # Accept verified tokens
    spec.accept()

    # Print performance statistics
    spec.print_stats()

Parameters:

  • n_max: Maximum number of tokens to draft (default: 16)
  • n_reuse: Number of tokens to reuse from previous draft (default: 8)
  • p_min: Minimum acceptance probability (default: 0.75)

Methods:

  • is_compat(): Check if target and draft models are compatible
  • begin(): Begin a speculative decoding round
  • draft(...): Generate draft tokens from the draft model
  • accept(): Accept verified tokens after evaluation
  • print_stats(): Print speculative decoding performance statistics
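
At its core, the accept step keeps the longest prefix of draft tokens on which the target model agrees; in toy form (the real implementation also manages KV-cache state across rounds):

```python
def accept_prefix(draft_tokens, target_tokens):
    # Keep draft tokens up to the first disagreement with the target
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted
```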

Server Implementations

Three OpenAI-compatible server implementations.

Embedded Server

Pure Python server implementation.

from cyllama.llama.server.embedded import start_server

# Start server
start_server(
    model_path="models/llama.gguf",
    host="127.0.0.1",
    port=8000,
    n_ctx=2048,
    n_gpu_layers=99
)

# Use with OpenAI client
import openai
openai.api_base = "http://127.0.0.1:8000/v1"

response = openai.ChatCompletion.create(
    model="cyllama",
    messages=[{"role": "user", "content": "Hello!"}]
)

Mongoose Server

High-performance C server using the Mongoose library.

from cyllama.llama.server.mongoose_server import EmbeddedServer

server = EmbeddedServer(
    model_path="models/llama.gguf",
    host="127.0.0.1",
    port=8080,
    n_ctx=2048,
    n_threads=4
)

server.start()

# Server runs in background
# Access at http://127.0.0.1:8080

server.stop()

LlamaServer

Python wrapper around the llama.cpp server binary.

from cyllama.llama.server import LlamaServer, LauncherServerConfig

config = LauncherServerConfig(
    model_path="models/llama.gguf",
    host="127.0.0.1",
    port=8080
)

server = LlamaServer(config, server_binary="bin/llama-server")
server.start()

# Check status
if server.is_running():
    print("Server is running")

server.stop()

Multimodal Support

LLaVA and other vision-language models.

from cyllama.llama.mtmd.multimodal import (
    LlavaImageEmbed,
    load_mmproj,
    process_image
)

# Load multimodal projector
mmproj = load_mmproj("models/mmproj.gguf")

# Process image
image_embed = process_image(
    ctx=ctx,
    image_path="image.jpg",
    mmproj=mmproj
)

# Use in generation
# Image embeddings are automatically integrated into context

Whisper Integration

Speech-to-text transcription using whisper.cpp. See Whisper.cpp Integration for complete documentation.

Quick Start

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model
ctx = WhisperContext("models/ggml-base.en.bin")

# Audio must be 16kHz mono float32
samples = load_audio_as_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
params.language = "en"
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    t0 = ctx.full_get_segment_t0(i) / 100.0  # centiseconds to seconds
    t1 = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{t0:.2f}s - {t1:.2f}s] {text}")

Key Classes

  • WhisperContext: Main context for model loading and inference
  • WhisperContextParams: Configuration for context creation
  • WhisperFullParams: Configuration for transcription
  • WhisperVadParams: Voice activity detection parameters

WhisperContext Methods

  • full(samples, params): Run transcription on float32 audio samples
  • full_n_segments(): Get number of transcribed segments
  • full_get_segment_text(i): Get text of segment i
  • full_get_segment_t0(i): Get start time (centiseconds)
  • full_get_segment_t1(i): Get end time (centiseconds)
  • full_lang_id(): Get detected language ID
  • is_multilingual(): Check if model supports multiple languages

Audio Requirements

  • Sample rate: 16000 Hz
  • Channels: Mono
  • Format: Float32 normalized to [-1.0, 1.0]
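
A minimal sketch of the load_audio_as_float32() helper used in the Quick Start, written with only the standard library and assuming the file is already 16 kHz mono 16-bit PCM (resampling and channel mixing are out of scope). Convert the result with np.asarray(samples, dtype=np.float32) if an ndarray is required:

```python
# Sketch of a WAV loader meeting the audio requirements above;
# assumes the input is already 16 kHz mono 16-bit PCM.
import struct
import wave

def load_audio_as_float32(path: str) -> list:
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000 or wf.getnchannels() != 1:
            raise ValueError("whisper.cpp expects 16 kHz mono audio")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw = wf.readframes(wf.getnframes())
    # Unpack little-endian int16 samples and normalize to [-1.0, 1.0]
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in ints]
```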

Stable Diffusion Integration

Image generation using stable-diffusion.cpp. Supports SD 1.x/2.x, SDXL, SD3, FLUX, video generation (Wan/CogVideoX), and ESRGAN upscaling.

Note: Build with WITH_STABLEDIFFUSION=1 to enable this module.

Quick Start

from cyllama.stablediffusion import text_to_image

# Simple text-to-image generation
images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)

# Save the result
images[0].save("output.png")

text_to_image()

Convenience function for text-to-image generation.

def text_to_image(
    model_path: str,
    prompt: str,
    negative_prompt: str = "",
    width: int = 512,
    height: int = 512,
    seed: int = -1,
    batch_count: int = 1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    sample_method: Optional[SampleMethod] = None,
    scheduler: Optional[Scheduler] = None,
    clip_skip: int = -1,
    n_threads: int = -1
) -> List[SDImage]

Parameters:

  • model_path (str): Path to model file (.gguf, .safetensors, or .ckpt)
  • prompt (str): Text prompt for generation
  • negative_prompt (str): Negative prompt (what to avoid)
  • width (int): Output image width (default: 512)
  • height (int): Output image height (default: 512)
  • seed (int): Random seed (-1 for random)
  • batch_count (int): Number of images to generate
  • sample_steps (int): Sampling steps (use 1-4 for turbo models, 20+ for others)
  • cfg_scale (float): CFG scale (use 1.0 for turbo, 7.0 for others)
  • sample_method (SampleMethod): Sampling method (EULER, EULER_A, DPM2, etc.)
  • scheduler (Scheduler): Scheduler (DISCRETE, KARRAS, EXPONENTIAL, etc.)
  • clip_skip (int): CLIP skip layers (-1 for default)
  • n_threads (int): Number of threads (-1 for auto)

Returns:

  • List[SDImage]: List of generated images

image_to_image()

Image-to-image generation with an initial image.

def image_to_image(
    model_path: str,
    init_image: SDImage,
    prompt: str,
    negative_prompt: str = "",
    strength: float = 0.75,
    seed: int = -1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    sample_method: Optional[SampleMethod] = None,
    scheduler: Optional[Scheduler] = None,
    clip_skip: int = -1,
    n_threads: int = -1
) -> List[SDImage]

Parameters:

  • init_image (SDImage): Initial image to transform
  • strength (float): Transformation strength (0.0-1.0)
  • Other parameters same as text_to_image()

SDContext

Main context class for model reuse and advanced generation.

from cyllama.stablediffusion import SDContext, SDContextParams, SampleMethod, Scheduler

# Create context
params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)

# Generate images
images = ctx.generate(
    prompt="a beautiful landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE
)

# Check if context is valid
print(ctx.is_valid)

Methods:

  • generate(...): Generate images from text prompt
  • generate_with_params(params: SDImageGenParams): Low-level generation
  • generate_video(...): Generate video frames (requires video-capable model)

SDContextParams

Configuration for model loading.

params = SDContextParams()
params.model_path = "model.gguf"         # Main model
params.vae_path = "vae.safetensors"      # Optional VAE
params.clip_l_path = "clip_l.safetensors" # Optional CLIP-L (for SDXL)
params.clip_g_path = "clip_g.safetensors" # Optional CLIP-G (for SDXL)
params.t5xxl_path = "t5xxl.safetensors"  # Optional T5-XXL (for SD3/FLUX)
params.lora_model_dir = "loras/"         # LoRA directory
params.n_threads = 4                      # Thread count
params.vae_decode_only = True            # VAE decode only mode
params.diffusion_flash_attn = False      # Flash attention
params.wtype = SDType.F16                # Weight type
params.rng_type = RngType.CPU            # RNG type

SDImage

Image wrapper with numpy and PIL integration.

from cyllama.stablediffusion import SDImage
import numpy as np

# Create from numpy array
arr = np.zeros((512, 512, 3), dtype=np.uint8)
img = SDImage.from_numpy(arr)

# Properties
print(img.width, img.height, img.channels)

# Convert to numpy
arr = img.to_numpy()  # Returns (H, W, C) uint8 array

# Convert to PIL (requires Pillow)
pil_img = img.to_pil()

# Save to file
img.save("output.png")

# Load from file
img = SDImage.load("input.png")

SDImageGenParams

Detailed generation parameters.

from cyllama.stablediffusion import SDImageGenParams, SDImage

params = SDImageGenParams()
params.prompt = "a cute cat"
params.negative_prompt = "ugly, blurry"
params.width = 512
params.height = 512
params.seed = 42
params.batch_count = 1
params.strength = 0.75           # For img2img
params.clip_skip = -1

# Set init image for img2img
init_img = SDImage.from_numpy(arr)
params.set_init_image(init_img)

# Set control image for ControlNet
params.set_control_image(control_img, strength=0.8)

# Access sample parameters
sample = params.sample_params
sample.sample_steps = 20
sample.cfg_scale = 7.0
sample.sample_method = SampleMethod.EULER
sample.scheduler = Scheduler.KARRAS
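As a rule of thumb common to Stable Diffusion img2img pipelines (a sketch of the usual behavior, not necessarily cyllama's exact internals), strength controls how many of the configured sample_steps actually run: the init image is partially re-noised and only the remaining fraction is denoised.

```python
# Rule-of-thumb arithmetic behind img2img `strength` (illustrative, not
# cyllama's exact implementation): only about steps * strength denoising
# steps run, starting from a partially noised copy of the init image.
def effective_img2img_steps(sample_steps: int, strength: float) -> int:
    """Number of denoising steps actually executed for img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return int(sample_steps * strength)

print(effective_img2img_steps(20, 0.75))  # 15 of the 20 configured steps
```

So with strength=0.75 and 20 steps, roughly 15 denoising steps run; strength=1.0 ignores the init image's structure entirely.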

SDSampleParams

Sampling configuration.

from cyllama.stablediffusion import SDSampleParams, SampleMethod, Scheduler

params = SDSampleParams()
params.sample_method = SampleMethod.EULER_A
params.scheduler = Scheduler.KARRAS
params.sample_steps = 20
params.cfg_scale = 7.0
params.eta = 0.0                 # Noise multiplier

Upscaler

ESRGAN-based image upscaling.

from cyllama.stablediffusion import Upscaler, SDImage

# Load upscaler model
upscaler = Upscaler("models/esrgan-x4.bin", n_threads=4)

# Check upscale factor
print(f"Factor: {upscaler.upscale_factor}x")

# Upscale an image
img = SDImage.load("input.png")
upscaled = upscaler.upscale(img)

# Or specify custom factor
upscaled = upscaler.upscale(img, factor=2)

upscaled.save("upscaled.png")
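The output dimensions follow directly from the model's upscale_factor; a small sketch of the arithmetic (the helper name is illustrative):

```python
def upscaled_size(width: int, height: int, factor: int) -> tuple:
    """Output dimensions after applying an ESRGAN-style upscaler."""
    return (width * factor, height * factor)

print(upscaled_size(512, 512, 4))  # (2048, 2048)
```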

convert_model()

Convert models between formats.

from cyllama.stablediffusion import convert_model, SDType

# Convert safetensors to GGUF with quantization
convert_model(
    input_path="sd-v1-5.safetensors",
    output_path="sd-v1-5-q4_0.gguf",
    output_type=SDType.Q4_0,
    vae_path="vae-ft-mse.safetensors"  # Optional
)

canny_preprocess()

Canny edge detection for ControlNet.

from cyllama.stablediffusion import SDImage, canny_preprocess

img = SDImage.load("photo.png")

# Apply Canny preprocessing (modifies image in place)
success = canny_preprocess(
    img,
    high_threshold=0.8,
    low_threshold=0.1,
    weak=0.5,
    strong=1.0,
    inverse=False
)
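The low/high/weak/strong parameters map onto Canny's double-thresholding stage. A self-contained numpy sketch of that stage, for intuition only (cyllama's canny_preprocess performs this natively on the SDImage):

```python
import numpy as np

# Double-thresholding stage of Canny edge detection (illustrative sketch):
# gradient magnitudes above `high` become strong edges, those between
# `low` and `high` become weak edges, and the rest are suppressed.
def double_threshold(mag: np.ndarray, low: float = 0.1, high: float = 0.8,
                     weak: float = 0.5, strong: float = 1.0) -> np.ndarray:
    out = np.zeros_like(mag)
    out[mag >= high] = strong
    out[(mag >= low) & (mag < high)] = weak
    return out

mag = np.array([0.05, 0.3, 0.9])
print(double_threshold(mag))  # suppressed, weak, strong
```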

Callbacks

Set callbacks for logging, progress, and preview.

from cyllama.stablediffusion import (
    set_log_callback,
    set_progress_callback,
    set_preview_callback
)

# Log callback
def log_cb(level, text):
    level_names = {0: 'DEBUG', 1: 'INFO', 2: 'WARN', 3: 'ERROR'}
    print(f'[{level_names.get(level, level)}] {text}', end='')

set_log_callback(log_cb)

# Progress callback
def progress_cb(step, steps, time_ms):
    pct = (step / steps) * 100 if steps > 0 else 0
    print(f'Step {step}/{steps} ({pct:.1f}%) - {time_ms:.2f}ms')

set_progress_callback(progress_cb)

# Preview callback (for real-time preview during generation)
def preview_cb(step, frames, is_noisy):
    for i, frame in enumerate(frames):
        frame.save(f"preview_{step}_{i}.png")

set_preview_callback(preview_cb)

# Clear callbacks
set_log_callback(None)
set_progress_callback(None)
set_preview_callback(None)
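The step, steps, and timing values delivered to the progress callback are enough to estimate remaining generation time; a standalone sketch (the helper name is illustrative, not a cyllama API):

```python
def estimate_remaining_ms(step: int, steps: int, elapsed_ms: float) -> float:
    """Extrapolate remaining time from the average per-step cost so far."""
    if step <= 0:
        return float("inf")
    per_step = elapsed_ms / step
    return per_step * (steps - step)

print(estimate_remaining_ms(5, 20, 1000.0))  # 3000.0 ms left at 200 ms/step
```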

Enums

SampleMethod: Sampling methods

  • EULER, EULER_A, HEUN, DPM2, DPMPP2S_A, DPMPP2M, DPMPP2Mv2
  • IPNDM, IPNDM_V, LCM, DDIM_TRAILING, TCD

Scheduler: Schedulers

  • DISCRETE, KARRAS, EXPONENTIAL, AYS, GITS
  • SGM_UNIFORM, SIMPLE, SMOOTHSTEP, LCM

SDType: Data types for quantization

  • F32, F16, BF16
  • Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1
  • Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K

RngType: Random number generators

  • STD_DEFAULT, CUDA, CPU

LogLevel: Log levels

  • DEBUG, INFO, WARN, ERROR

Utility Functions

from cyllama.stablediffusion import (
    get_num_cores,
    get_system_info,
    type_name,
    sample_method_name,
    scheduler_name
)

# System info
print(f"CPU cores: {get_num_cores()}")
print(get_system_info())

# Get string names
print(type_name(SDType.Q4_0))           # "q4_0"
print(sample_method_name(SampleMethod.EULER))  # "euler"
print(scheduler_name(Scheduler.KARRAS))  # "karras"

CLI Tool

Command-line interface for stable diffusion operations.

# Generate image
python -m cyllama.stablediffusion generate \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png \
    --steps 4 --cfg 1.0

# Upscale image
python -m cyllama.stablediffusion upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_4x.png

# Convert model
python -m cyllama.stablediffusion convert \
    --input sd-v1-5.safetensors \
    --output sd-v1-5-q4_0.gguf \
    --type q4_0

# Show system info
python -m cyllama.stablediffusion info

Supported Models

  • SD 1.x/2.x: Standard Stable Diffusion models
  • SDXL/SDXL Turbo: Stable Diffusion XL (use cfg_scale=1.0, steps=1-4 for Turbo)
  • SD3/SD3.5: Stable Diffusion 3.x
  • FLUX: FLUX.1 models (dev, schnell)
  • Wan/CogVideoX: Video generation models (use generate_video())
  • LoRA: Low-rank adaptation files
  • ControlNet: Conditional generation with control images
  • ESRGAN: Image upscaling models
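The per-family guidance above (e.g. cfg_scale=1.0 and 1-4 steps for SDXL Turbo) can be captured in a small lookup. Both the helper and the SD 1.x defaults here are illustrative, not an official cyllama API:

```python
# Hypothetical per-family generation defaults, encoding the guidance above.
FAMILY_DEFAULTS = {
    "sdxl-turbo": {"cfg_scale": 1.0, "sample_steps": 4},
    "sd1.x":      {"cfg_scale": 7.0, "sample_steps": 20},
}

def defaults_for(family: str) -> dict:
    """Return a copy of the suggested settings for a model family."""
    return dict(FAMILY_DEFAULTS.get(family, FAMILY_DEFAULTS["sd1.x"]))

print(defaults_for("sdxl-turbo"))  # {'cfg_scale': 1.0, 'sample_steps': 4}
```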

Error Handling

All cyllama functions raise appropriate Python exceptions:

from cyllama import complete, LLM

try:
    response = complete("Hello", model_path="nonexistent.gguf")
except FileNotFoundError:
    print("Model file not found")
except RuntimeError as e:
    print(f"Runtime error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

# LLM with error handling
try:
    gen = LLM("models/llama.gguf")
    response = gen("What is Python?")
except Exception as e:
    print(f"Generation failed: {e}")
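A common defensive pattern is to wrap generation in a retry helper for transient failures. This is a generic sketch; the helper name and retry policy are illustrative and not part of cyllama:

```python
def with_retries(fn, *args, attempts: int = 3, **kwargs):
    """Call fn, retrying on exceptions; re-raise after the last attempt."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            last_exc = exc
    raise last_exc

# Usage with a stand-in for a flaky generator:
calls = {"n": 0}
def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"ok: {prompt}"

print(with_retries(flaky, "Hello"))  # succeeds on the 3rd attempt
```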

Type Hints

All functions include comprehensive type hints for IDE support:

from typing import List, Dict, Optional, Iterator, Callable, Tuple
from cyllama import (
    complete,          # Response | Iterator[str]
    chat,              # str | Iterator[str]
    LLM,               # class
    GenerationConfig,  # @dataclass
)

Performance Tips

1. Model Reuse

# BAD: Reloads model each time (slow)
for prompt in prompts:
    response = complete(prompt, model_path="model.gguf")

# GOOD: Reuses loaded model (fast)
gen = LLM("model.gguf")
for prompt in prompts:
    response = gen(prompt)
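The same reuse principle can be made automatic with a memoized loader; a generic sketch in which the loader is a stand-in for constructing LLM:

```python
from functools import lru_cache

load_count = 0

@lru_cache(maxsize=2)  # keep at most two models resident at once
def get_generator(model_path: str):
    """Stand-in for LLM(model_path); loads once per distinct path."""
    global load_count
    load_count += 1
    return object()  # in real code: LLM(model_path)

g1 = get_generator("model.gguf")
g2 = get_generator("model.gguf")
print(g1 is g2, load_count)  # True 1 -- second call hits the cache
```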

2. Batch Processing

from cyllama import complete, batch_generate, GenerationConfig

# BAD: Sequential processing (one full generation at a time)
responses = [complete(p, model_path="model.gguf") for p in prompts]

# GOOD: Parallel batch processing (3-10x faster)
prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(
    prompts,
    model_path="model.gguf",
    n_seq_max=8,  # Max parallel sequences
    config=GenerationConfig(max_tokens=50, temperature=0.7)
)
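If you have more prompts than n_seq_max, a simple chunking loop keeps each batch_generate call within the limit (a sketch; cyllama may well handle oversized batches internally):

```python
def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

prompts = [f"Question {i}" for i in range(20)]
batches = list(chunked(prompts, 8))  # each batch fits n_seq_max=8
print([len(b) for b in batches])  # [8, 8, 4]
```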

3. GPU Offloading

# Estimate optimal layers
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers("model.gguf", available_vram_mb=8000)

# Use recommended settings
config = GenerationConfig(n_gpu_layers=estimate.n_gpu_layers)
gen = LLM("model.gguf", config=config)

4. Context Sizing

# Auto-size context (recommended)
config = GenerationConfig(n_ctx=None, max_tokens=200)

# Manual sizing (for control)
config = GenerationConfig(n_ctx=2048, max_tokens=200)
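Auto-sizing typically means reserving room for the prompt plus the requested completion. A rule-of-thumb sketch; the padding and rounding here are illustrative, not cyllama's exact policy:

```python
def suggest_n_ctx(prompt_tokens: int, max_tokens: int, pad: int = 32) -> int:
    """Round prompt + completion + padding up to the next multiple of 256."""
    needed = prompt_tokens + max_tokens + pad
    return ((needed + 255) // 256) * 256

print(suggest_n_ctx(150, 200))  # 512: 382 tokens needed, rounded up
```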

5. Streaming for Long Outputs

# Non-streaming: waits for complete response
response = complete("Write a long essay", model_path="model.gguf", max_tokens=2000)

# Streaming: see output as it generates
for chunk in complete("Write a long essay", model_path="model.gguf",
                     max_tokens=2000, stream=True):
    print(chunk, end="", flush=True)

Version Compatibility

  • Python: >=3.10 (tested on 3.13)
  • llama.cpp: b8429
  • Platform: macOS, Linux, Windows

Last Updated: March 2026 · Cyllama Version: 0.1.20