User Guide

Complete guide to using cyllama for LLM inference.

Table of Contents

  1. Getting Started
  2. High-Level API
  3. Streaming Generation
  4. Framework Integrations
  5. Advanced Features
  6. Performance Optimization
  7. Troubleshooting

Getting Started

Installation

git clone https://github.com/shakfu/cyllama.git
cd cyllama
make  # Downloads llama.cpp, builds everything
make download  # Download default test model

Quick Start

The simplest way to generate text:

from cyllama import complete

response = complete(
    "What is Python?",
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)
print(response)

High-Level API

Basic Generation

The complete() function provides the simplest interface:

from cyllama import complete, GenerationConfig

# Simple generation
response = complete(
    "Explain quantum computing",
    model_path="models/llama.gguf",
    max_tokens=200,
    temperature=0.7
)

# With configuration object
config = GenerationConfig(
    max_tokens=500,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repeat_penalty=1.1
)

response = complete(
    "Write a poem about AI",
    model_path="models/llama.gguf",
    config=config
)

Chat Interface

For multi-turn conversations with automatic chat template formatting:

from cyllama import chat

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
]

response = chat(
    messages,
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=300
)

The chat() function automatically applies the model's built-in chat template (stored in GGUF metadata). This ensures proper formatting for models like Llama 3, Mistral, ChatML-based models, and others.

Chat Templates

cyllama uses llama.cpp's built-in chat template system. Templates are read from model metadata and applied automatically.

from cyllama import LLM
from cyllama.api import apply_chat_template, get_chat_template

# Get the template string from a model
template = get_chat_template("models/llama.gguf")
print(template)  # Shows Jinja-style template

# Apply template to format messages
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]
prompt = apply_chat_template(messages, "models/llama.gguf")
print(prompt)
# Output (Llama 3 format):
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# You are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>
#
# Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

You can also use specific builtin templates:

# Apply a specific template (llama3, chatml, mistral, etc.)
prompt = apply_chat_template(messages, "models/any.gguf", template="chatml")

With the LLM class:

with LLM("models/llama.gguf") as llm:
    # Get the model's chat template
    template = llm.get_chat_template()

    # Chat directly
    response = llm.chat([
        {"role": "user", "content": "What is 2+2?"}
    ])
    print(response)

Supported templates include: llama2, llama3, llama4, chatml, mistral-v1/v3/v7, phi3, phi4, deepseek, deepseek2, deepseek3, gemma, falcon3, command-r, vicuna, zephyr, and many more. See the llama.cpp wiki for the full list.

LLM Class

For repeated generations, use the LLM class for better performance:

from cyllama import LLM, GenerationConfig

# Create generator (loads model once)
gen = LLM("models/llama.gguf")

# Generate multiple times
prompts = [
    "What is AI?",
    "What is ML?",
    "What is DL?"
]

for prompt in prompts:
    response = gen(prompt)
    print(f"Q: {prompt}")
    print(f"A: {response}\n")

Response Objects

All generation functions return Response objects that provide structured access to results:

from cyllama import complete, Response

# Response works like a string for backward compatibility
response = complete("What is Python?", model_path="models/llama.gguf")
print(response)  # Just works!

# But also provides structured data
print(f"Text: {response.text}")
print(f"Finish reason: {response.finish_reason}")

# Access generation statistics
if response.stats:
    print(f"Generated {response.stats.generated_tokens} tokens")
    print(f"Speed: {response.stats.tokens_per_second:.1f} tokens/sec")
    print(f"Time: {response.stats.total_time:.2f}s")

# Serialize for logging/storage
import json
data = response.to_dict()
json_str = response.to_json(indent=2)

The Response class implements string-like behavior, so existing code continues to work:

# String operations all work
if "programming" in response:
    print("Mentioned programming!")

full_text = response + " Additional text."
length = len(response)

Streaming Generation

Stream responses token-by-token:

from cyllama import LLM

gen = LLM("models/llama.gguf")

# Stream to console
for chunk in gen("Tell me a story", stream=True):
    print(chunk, end="", flush=True)
print()

# Collect chunks
chunks = []
for chunk in gen("Count to 10", stream=True):
    chunks.append(chunk)
full_response = "".join(chunks)

Token Callbacks

Process each token as it's generated:

from cyllama import LLM

gen = LLM("models/llama.gguf")

tokens_seen = []

def on_token(token: str):
    tokens_seen.append(token)
    print(f"Token: {repr(token)}")

response = gen(
    "Hello world",
    on_token=on_token
)

print(f"\nTotal tokens: {len(tokens_seen)}")

Framework Integrations

OpenAI-Compatible API

Drop-in replacement for OpenAI client:

from cyllama.integrations.openai_compat import OpenAICompatibleClient

client = OpenAICompatibleClient(
    model_path="models/llama.gguf",
    temperature=0.7
)

# Chat completions
response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

LangChain Integration

Use with LangChain chains and agents:

from cyllama.integrations import CyllamaLLM

# Note: Requires langchain to be installed
# pip install langchain

llm = CyllamaLLM(
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=500
)

# Use in chains
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template(
    "Tell me about {topic} in {style} style"
)

chain = LLMChain(llm=llm, prompt=prompt)

result = chain.run(topic="AI", style="simple")
print(result)

# Streaming with callbacks
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm_streaming = CyllamaLLM(
    model_path="models/llama.gguf",
    temperature=0.7,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

Advanced Features

Configuration Options

Complete GenerationConfig options:

from cyllama import GenerationConfig

config = GenerationConfig(
    # Generation limits
    max_tokens=512,           # Maximum tokens to generate

    # Sampling parameters
    temperature=0.8,          # 0.0 = greedy, higher = more random
    top_k=40,                 # Top-k sampling
    top_p=0.95,               # Nucleus sampling
    min_p=0.05,               # Minimum probability threshold
    repeat_penalty=1.1,       # Penalize repetition

    # Model parameters
    n_gpu_layers=99,          # Layers to offload to GPU (-1 = all)
    n_ctx=2048,               # Context window size
    n_batch=512,              # Batch size for processing

    # Control
    seed=42,                  # Random seed (-1 = random)
    stop_sequences=["END"],   # Stop generation at these strings

    # Tokenization
    add_bos=True,             # Add beginning-of-sequence token
    parse_special=True        # Parse special tokens
)

Speculative Decoding

2-3x speedup with compatible models:

from cyllama import (
    LlamaModel, LlamaContext, LlamaModelParams, LlamaContextParams,
    Speculative, SpeculativeParams
)

# Load target (main) model
model_target = LlamaModel("models/llama-3b.gguf", LlamaModelParams())
ctx_target = LlamaContext(model_target, LlamaContextParams())

# Load draft (smaller, faster) model
model_draft = LlamaModel("models/llama-1b.gguf", LlamaModelParams())
ctx_draft = LlamaContext(model_draft, LlamaContextParams())

# Setup speculative decoding
params = SpeculativeParams(
    n_max=16,      # Maximum tokens to draft
    p_min=0.75     # Acceptance probability
)
spec = Speculative(params, ctx_target)

# Generate draft tokens
draft_tokens = spec.draft(
    prompt_tokens=[1, 2, 3, 4],
    last_token=5
)

Memory Estimation

Estimate GPU memory requirements:

from cyllama import estimate_gpu_layers, estimate_memory_usage

# Estimate optimal GPU layers
estimate = estimate_gpu_layers(
    model_path="models/llama.gguf",
    available_vram_mb=8000,
    n_ctx=2048
)

print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
print(f"Est. GPU memory: {estimate.gpu_memory_mb:.0f} MB")
print(f"Est. CPU memory: {estimate.cpu_memory_mb:.0f} MB")

# Detailed memory analysis
memory_info = estimate_memory_usage(
    model_path="models/llama.gguf",
    n_ctx=2048,
    n_batch=512
)

print(f"Model size: {memory_info.model_size_mb:.0f} MB")
print(f"KV cache: {memory_info.kv_cache_mb:.0f} MB")
print(f"Total: {memory_info.total_mb:.0f} MB")

How LLM Generation Works

Understanding how generation works helps you optimize performance.

Autoregressive generation means generating tokens one at a time, where each new token depends on all previous tokens:

  1. Feed prompt to model, get probability distribution for next token
  2. Sample/select next token
  3. Feed that token back into the model
  4. Repeat until done
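The steps above can be sketched in plain Python. This is an illustrative sketch, not cyllama's implementation: the stub "model" and greedy sampling stand in for a real LLM and sampler, which cyllama handles internally.

```python
# Illustrative autoregressive loop with a stub "model".
# A real model returns a probability distribution over the whole
# vocabulary; this stub uses a deterministic toy rule instead.

def stub_model(tokens: list[int]) -> dict[int, float]:
    """Return a fake next-token distribution given all previous tokens."""
    # Toy rule: next token is last token + 1, capped at 9 (our fake EOS).
    nxt = min(tokens[-1] + 1, 9)
    return {nxt: 1.0}

def generate(prompt_tokens: list[int], max_tokens: int, eos: int = 9) -> list[int]:
    tokens = list(prompt_tokens)              # 1. prefill: model sees the whole prompt
    for _ in range(max_tokens):
        dist = stub_model(tokens)             # probabilities for the next token
        next_token = max(dist, key=dist.get)  # 2. greedy sampling (temperature 0)
        tokens.append(next_token)             # 3. feed the token back in
        if next_token == eos:                 # 4. repeat until done
            break
    return tokens

print(generate([1, 2, 3], max_tokens=10))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that step 3 is why decode is sequential: each new token must be appended before the next one can be predicted.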

Prefill vs Decode

  • Prefill: Process all prompt tokens in parallel (batch operation, very fast)
  • Decode: Generate output tokens one-by-one (autoregressive, slower)

  Phase     Speed        Notes
  Prefill   ~65k tok/s   Parallel batch processing
  Decode    ~40 tok/s    Sequential, autoregressive

Performance Implications

  • Time to First Token (TTFT): Dominated by prefill time. Longer prompts = longer TTFT.
  • Generation Speed: Limited by decode throughput; decode is sequential, so it remains the bottleneck even on fast hardware.
  • Optimization Strategies: KV caching, speculative decoding, and batching help mitigate the bottleneck.
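Using the illustrative throughput figures above, a back-of-envelope latency estimate shows why decode dominates. The numbers are examples from the table, not guarantees for any particular hardware or model:

```python
# Back-of-envelope latency model using the example figures above.

PREFILL_TOK_PER_S = 65_000   # parallel prompt processing (illustrative)
DECODE_TOK_PER_S = 40        # sequential token generation (illustrative)

def estimate_latency(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (time_to_first_token_s, total_time_s)."""
    ttft = prompt_tokens / PREFILL_TOK_PER_S
    total = ttft + output_tokens / DECODE_TOK_PER_S
    return ttft, total

ttft, total = estimate_latency(prompt_tokens=2_000, output_tokens=200)
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total:.1f} s")
# A 2,000-token prompt prefills in ~31 ms; generating 200 tokens
# then takes ~5 s, so decode accounts for nearly all of the latency.
```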

Performance Optimization

GPU Acceleration

from cyllama import LLM, GenerationConfig

# Offload all layers to GPU
config = GenerationConfig(n_gpu_layers=-1)  # or 99
gen = LLM("models/llama.gguf", config=config)

# Partial GPU offloading (for large models)
config = GenerationConfig(n_gpu_layers=20)  # First 20 layers only

Batch Size Tuning

# Larger batch = more throughput, more memory
config = GenerationConfig(n_batch=1024)

# Smaller batch = less memory, potentially slower
config = GenerationConfig(n_batch=128)

Context Window Management

# Auto-size context (prompt + max_tokens)
config = GenerationConfig(n_ctx=None, max_tokens=512)

# Fixed context size
config = GenerationConfig(n_ctx=4096, max_tokens=512)

Troubleshooting

Out of Memory

# Reduce GPU layers
config = GenerationConfig(n_gpu_layers=10)

# Reduce context size
config = GenerationConfig(n_ctx=1024)

# Reduce batch size
config = GenerationConfig(n_batch=128)

Slow Generation

# Maximize GPU usage
config = GenerationConfig(n_gpu_layers=-1)

# Increase batch size
config = GenerationConfig(n_batch=512)

# Use speculative decoding (if you have a draft model)

Quality Issues

# More deterministic (lower temperature)
config = GenerationConfig(temperature=0.1)

# More diverse (higher temperature)
config = GenerationConfig(temperature=1.2)

# Adjust top-p for nucleus sampling
config = GenerationConfig(top_p=0.9)

# Reduce repetition
config = GenerationConfig(repeat_penalty=1.2)

Import Errors

# Rebuild after updates
make build

# Clean rebuild
make remake

# Check installation
python -c "import cyllama; print(cyllama.__file__)"

Best Practices

  1. Reuse LLM Instances: Create once, generate many times - avoid reloading the model
  2. Monitor Memory: Use memory estimation tools before loading large models
  3. Tune Temperature: Start at 0.7, adjust based on needs (lower for factual, higher for creative)
  4. Use Stop Sequences: Prevent over-generation with appropriate stop tokens
  5. Stream Long Outputs: Better UX for users waiting for responses
  6. Profile Performance: Measure before optimizing
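On item 4: conceptually, a stop sequence truncates output at the earliest occurrence of any stop string. A minimal sketch of that behavior (not cyllama's internal implementation, which stops generation as tokens stream in rather than post-processing):

```python
# Conceptual sketch of stop-sequence truncation: cut the generated
# text at the earliest occurrence of any stop string.

def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)   # keep the earliest match across all stops
    return text[:cut]

print(truncate_at_stop("Answer: 42\nEND of output", ["END", "\n\n"]))
# prints "Answer: 42" (everything from the first "END" onward is dropped)
```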

Examples

See the tests/examples/ directory for complete working examples:

  • generate_example.py - Basic generation
  • speculative_example.py - Speculative decoding
  • integration_example.py - Framework integrations