User Guide¶
Complete guide to using cyllama for LLM inference.
Table of Contents¶
- Getting Started
- High-Level API
- Streaming Generation
- Framework Integrations
- Advanced Features
- Performance Optimization
- Troubleshooting
Getting Started¶
Installation¶
git clone https://github.com/shakfu/cyllama.git
cd cyllama
make # Downloads llama.cpp, builds everything
make download # Download default test model
Quick Start¶
The simplest way to generate text:
from cyllama import complete
response = complete(
    "What is Python?",
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)
print(response)
High-Level API¶
Basic Generation¶
The complete() function provides the simplest interface:
from cyllama import complete, GenerationConfig
# Simple generation
response = complete(
    "Explain quantum computing",
    model_path="models/llama.gguf",
    max_tokens=200,
    temperature=0.7
)
# With configuration object
config = GenerationConfig(
    max_tokens=500,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repeat_penalty=1.1
)
response = complete(
    "Write a poem about AI",
    model_path="models/llama.gguf",
    config=config
)
Chat Interface¶
For multi-turn conversations with automatic chat template formatting:
from cyllama import chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
]
response = chat(
    messages,
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=300
)
The chat() function automatically applies the model's built-in chat template (stored in GGUF metadata). This ensures proper formatting for models like Llama 3, Mistral, ChatML-based models, and others.
Chat Templates¶
cyllama uses llama.cpp's built-in chat template system. Templates are read from model metadata and applied automatically.
from cyllama import LLM
from cyllama.api import apply_chat_template, get_chat_template
# Get the template string from a model
template = get_chat_template("models/llama.gguf")
print(template) # Shows Jinja-style template
# Apply template to format messages
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]
prompt = apply_chat_template(messages, "models/llama.gguf")
print(prompt)
# Output (Llama 3 format):
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# You are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>
#
# Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
You can also use specific builtin templates:
# Apply a specific template (llama3, chatml, mistral, etc.)
prompt = apply_chat_template(messages, "models/any.gguf", template="chatml")
With the LLM class:
with LLM("models/llama.gguf") as llm:
    # Get the model's chat template
    template = llm.get_chat_template()
    # Chat directly
    response = llm.chat([
        {"role": "user", "content": "What is 2+2?"}
    ])
    print(response)
Supported templates include: llama2, llama3, llama4, chatml, mistral-v1/v3/v7, phi3, phi4, deepseek, deepseek2, deepseek3, gemma, falcon3, command-r, vicuna, zephyr, and many more. See the llama.cpp wiki for the full list.
LLM Class¶
For repeated generations, use the LLM class for better performance:
from cyllama import LLM, GenerationConfig
# Create generator (loads model once)
gen = LLM("models/llama.gguf")
# Generate multiple times
prompts = [
    "What is AI?",
    "What is ML?",
    "What is DL?"
]
for prompt in prompts:
    response = gen(prompt)
    print(f"Q: {prompt}")
    print(f"A: {response}\n")
Response Objects¶
All generation functions return Response objects that provide structured access to results:
from cyllama import complete, Response
# Response works like a string for backward compatibility
response = complete("What is Python?", model_path="models/llama.gguf")
print(response) # Just works!
# But also provides structured data
print(f"Text: {response.text}")
print(f"Finish reason: {response.finish_reason}")
# Access generation statistics
if response.stats:
    print(f"Generated {response.stats.generated_tokens} tokens")
    print(f"Speed: {response.stats.tokens_per_second:.1f} tokens/sec")
    print(f"Time: {response.stats.total_time:.2f}s")
# Serialize for logging/storage
import json
data = response.to_dict()
json_str = response.to_json(indent=2)
The Response class implements string-like behavior, so existing code continues to work:
# String operations all work
if "programming" in response:
print("Mentioned programming!")
full_text = response + " Additional text."
length = len(response)
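String-like behavior of this kind typically follows a common Python pattern: subclassing `str` so extra attributes ride along with plain string semantics. Here is a minimal sketch of that pattern (illustrative only, not cyllama's actual implementation; the `StrResponse` name is hypothetical):

```python
# Sketch: a str subclass behaves like a plain string everywhere a
# string is expected, while still carrying structured attributes.
class StrResponse(str):
    def __new__(cls, text, finish_reason=None):
        obj = super().__new__(cls, text)
        obj.finish_reason = finish_reason
        return obj

r = StrResponse("Python is a programming language.", finish_reason="stop")
assert "programming" in r              # substring test works
assert (r + "!").endswith("!")         # concatenation works
assert r.finish_reason == "stop"       # structured data still available
```

Because `str` is immutable, the extra attributes must be attached in `__new__` rather than `__init__`.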
Streaming Generation¶
Stream responses token-by-token:
from cyllama import LLM
gen = LLM("models/llama.gguf")
# Stream to console
for chunk in gen("Tell me a story", stream=True):
    print(chunk, end="", flush=True)
print()
# Collect chunks
chunks = []
for chunk in gen("Count to 10", stream=True):
    chunks.append(chunk)
full_response = "".join(chunks)
Token Callbacks¶
Process each token as it's generated:
from cyllama import LLM
gen = LLM("models/llama.gguf")
tokens_seen = []
def on_token(token: str):
    tokens_seen.append(token)
    print(f"Token: {repr(token)}")
response = gen(
    "Hello world",
    on_token=on_token
)
print(f"\nTotal tokens: {len(tokens_seen)}")
Framework Integrations¶
OpenAI-Compatible API¶
Drop-in replacement for OpenAI client:
from cyllama.integrations.openai_compat import OpenAICompatibleClient
client = OpenAICompatibleClient(
    model_path="models/llama.gguf",
    temperature=0.7
)
# Chat completions
response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ],
    max_tokens=200
)
print(response.choices[0].message.content)
# Streaming
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
LangChain Integration¶
Use with LangChain chains and agents:
from cyllama.integrations import CyllamaLLM
# Note: Requires langchain to be installed
# pip install langchain
llm = CyllamaLLM(
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=500
)
# Use in chains
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
prompt = PromptTemplate.from_template(
    "Tell me about {topic} in {style} style"
)
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="AI", style="simple")
print(result)
# Streaming with callbacks
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm_streaming = CyllamaLLM(
    model_path="models/llama.gguf",
    temperature=0.7,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)
Advanced Features¶
Configuration Options¶
Complete GenerationConfig options:
from cyllama import GenerationConfig
config = GenerationConfig(
    # Generation limits
    max_tokens=512,          # Maximum tokens to generate
    # Sampling parameters
    temperature=0.8,         # 0.0 = greedy, higher = more random
    top_k=40,                # Top-k sampling
    top_p=0.95,              # Nucleus sampling
    min_p=0.05,              # Minimum probability threshold
    repeat_penalty=1.1,      # Penalize repetition
    # Model parameters
    n_gpu_layers=99,         # Layers to offload to GPU (-1 = all)
    n_ctx=2048,              # Context window size
    n_batch=512,             # Batch size for processing
    # Control
    seed=42,                 # Random seed (-1 = random)
    stop_sequences=["END"],  # Stop generation at these strings
    # Tokenization
    add_bos=True,            # Add beginning-of-sequence token
    parse_special=True       # Parse special tokens
)
Speculative Decoding¶
Speculative decoding uses a small draft model to propose tokens that the target model then verifies, often yielding a 2-3x speedup with a compatible model pair:
from cyllama import (
    LlamaModel, LlamaContext, LlamaModelParams, LlamaContextParams,
    Speculative, SpeculativeParams
)
# Load target (main) model
model_target = LlamaModel("models/llama-3b.gguf", LlamaModelParams())
ctx_target = LlamaContext(model_target, LlamaContextParams())
# Load draft (smaller, faster) model
model_draft = LlamaModel("models/llama-1b.gguf", LlamaModelParams())
ctx_draft = LlamaContext(model_draft, LlamaContextParams())
# Setup speculative decoding
params = SpeculativeParams(
    n_max=16,   # Maximum tokens to draft
    p_min=0.75  # Acceptance probability
)
spec = Speculative(params, ctx_target)
# Generate draft tokens
draft_tokens = spec.draft(
    prompt_tokens=[1, 2, 3, 4],
    last_token=5
)
Memory Estimation¶
Estimate GPU memory requirements:
from cyllama import estimate_gpu_layers, estimate_memory_usage
# Estimate optimal GPU layers
estimate = estimate_gpu_layers(
    model_path="models/llama.gguf",
    available_vram_mb=8000,
    n_ctx=2048
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
print(f"Est. GPU memory: {estimate.gpu_memory_mb:.0f} MB")
print(f"Est. CPU memory: {estimate.cpu_memory_mb:.0f} MB")
# Detailed memory analysis
memory_info = estimate_memory_usage(
    model_path="models/llama.gguf",
    n_ctx=2048,
    n_batch=512
)
print(f"Model size: {memory_info.model_size_mb:.0f} MB")
print(f"KV cache: {memory_info.kv_cache_mb:.0f} MB")
print(f"Total: {memory_info.total_mb:.0f} MB")
How LLM Generation Works¶
Understanding how generation works helps you optimize performance.
Autoregressive generation means generating tokens one at a time, where each new token depends on all previous tokens:
- Feed prompt to model, get probability distribution for next token
- Sample/select next token
- Feed that token back into the model
- Repeat until done
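The loop above can be sketched in plain Python. A toy next-token function stands in for the model here; everything in this snippet is illustrative and is not the cyllama API:

```python
# Toy autoregressive loop: a deterministic stand-in "model" makes the
# generation mechanics visible without loading a real LLM.
def toy_next_token(tokens):
    # Pretend "model": next token is the sum of the last two, capped at 99.
    return min(tokens[-1] + tokens[-2], 99) if len(tokens) >= 2 else tokens[-1] + 1

def generate(prompt_tokens, max_tokens, eos=99):
    tokens = list(prompt_tokens)       # prefill: the prompt seeds the sequence
    for _ in range(max_tokens):
        nxt = toy_next_token(tokens)   # get the next-token prediction
        tokens.append(nxt)             # feed it back into the "model"
        if nxt == eos:                 # stop at end-of-sequence
            break
    return tokens

print(generate([1, 2], max_tokens=10))
# → [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 99]
```

A real model returns a probability distribution at each step and a sampler (temperature, top-k, top-p) picks the token, but the feed-back loop is the same.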
Prefill vs Decode¶
- Prefill: Process all prompt tokens in parallel (batch operation, very fast)
- Decode: Generate output tokens one-by-one (autoregressive, slower)
| Phase | Typical speed (example) | Notes |
|---|---|---|
| Prefill | ~65k tok/s | Parallel batch processing |
| Decode | ~40 tok/s | Sequential, autoregressive |
These figures are illustrative; actual throughput depends on hardware, model size, and quantization.
Performance Implications¶
- Time to First Token (TTFT): Dominated by prefill time. Longer prompts = longer TTFT.
- Generation Speed: Bounded by the sequential decode phase; extra parallel compute helps far less here than in prefill.
- Optimization Strategies: KV caching, speculative decoding, and batching help mitigate the bottleneck.
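A quick back-of-envelope calculation using the example rates from the table above (actual rates depend on hardware and model):

```python
# Rough latency model: TTFT ≈ prompt_tokens / prefill_rate,
# total time ≈ TTFT + output_tokens / decode_rate.
prefill_rate = 65_000   # tokens/sec (example figure)
decode_rate = 40        # tokens/sec (example figure)

prompt_tokens = 2_000
output_tokens = 200

ttft = prompt_tokens / prefill_rate
total = ttft + output_tokens / decode_rate
print(f"TTFT: {ttft*1000:.0f} ms, total: {total:.2f} s")
# → TTFT: 31 ms, total: 5.03 s
```

Even a 2k-token prompt prefills in ~31 ms, while generating 200 tokens takes ~5 s, which is why decode speed dominates overall latency.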
Performance Optimization¶
GPU Acceleration¶
from cyllama import LLM, GenerationConfig
# Offload all layers to GPU
config = GenerationConfig(n_gpu_layers=-1) # or 99
gen = LLM("models/llama.gguf", config=config)
# Partial GPU offloading (for large models)
config = GenerationConfig(n_gpu_layers=20) # First 20 layers only
Batch Size Tuning¶
# Larger batch = more throughput, more memory
config = GenerationConfig(n_batch=1024)
# Smaller batch = less memory, potentially slower
config = GenerationConfig(n_batch=128)
Context Window Management¶
# Auto-size context (prompt + max_tokens)
config = GenerationConfig(n_ctx=None, max_tokens=512)
# Fixed context size
config = GenerationConfig(n_ctx=4096, max_tokens=512)
Troubleshooting¶
Out of Memory¶
# Reduce GPU layers
config = GenerationConfig(n_gpu_layers=10)
# Reduce context size
config = GenerationConfig(n_ctx=1024)
# Reduce batch size
config = GenerationConfig(n_batch=128)
Slow Generation¶
# Maximize GPU usage
config = GenerationConfig(n_gpu_layers=-1)
# Increase batch size
config = GenerationConfig(n_batch=512)
# Use speculative decoding (if you have a draft model)
Quality Issues¶
# More deterministic (lower temperature)
config = GenerationConfig(temperature=0.1)
# More diverse (higher temperature)
config = GenerationConfig(temperature=1.2)
# Adjust top-p for nucleus sampling
config = GenerationConfig(top_p=0.9)
# Reduce repetition
config = GenerationConfig(repeat_penalty=1.2)
Import Errors¶
# Rebuild after updates
make build
# Clean rebuild
make remake
# Check installation
python -c "import cyllama; print(cyllama.__file__)"
Best Practices¶
- Reuse LLM Instances: Create once and generate many times; avoid reloading the model
- Monitor Memory: Use memory estimation tools before loading large models
- Tune Temperature: Start at 0.7, adjust based on needs (lower for factual, higher for creative)
- Use Stop Sequences: Prevent over-generation with appropriate stop tokens
- Stream Long Outputs: Better UX for users waiting for responses
- Profile Performance: Measure before optimizing
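For the last point, a minimal timing harness can be as simple as the following (an illustrative helper, not part of cyllama; any generation callable can be passed in):

```python
import time

def profile(fn, *args, runs=3, **kwargs):
    """Time a callable over several runs; return (best, average) seconds."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        times.append(time.perf_counter() - t0)
    return min(times), sum(times) / len(times)

# With a real generator this would look like:
#   best, avg = profile(gen, "What is AI?")
best, avg = profile(lambda: sum(range(100_000)))
print(f"best: {best:.4f}s, avg: {avg:.4f}s")
```

Taking the best of several runs filters out one-off costs such as cache warm-up, so it is usually a fairer basis for before/after comparisons than a single measurement.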
Examples¶
See the tests/examples/ directory for complete working examples:
- generate_example.py - Basic generation
- speculative_example.py - Speculative decoding
- integration_example.py - Framework integrations
Next Steps¶
- Read the Cookbook for common patterns
- Check API Reference for detailed documentation
- See Examples for complete code
Support¶
- GitHub Issues: https://github.com/shakfu/cyllama/issues
- Documentation: https://github.com/shakfu/cyllama