
Troubleshooting

Common issues and solutions when using cyllama.

Installation Issues

"No module named 'cyllama'"

Cause: Cyllama is not installed or not in the Python path.

Solution:

# Make sure you're in the project directory
cd cyllama

# Build and install
make

# Or manually install in editable mode
uv pip install -e .

Build Fails with CMake Errors

Cause: Missing dependencies or incompatible CMake version.

Solution:

# Check CMake version (need 3.21+)
cmake --version

# Clean and rebuild
make reset
make build

# On macOS, ensure Xcode tools are installed
xcode-select --install

"fatal error: 'llama.h' file not found"

Cause: llama.cpp headers not built or not found.

Solution:

# Rebuild dependencies
make reset
make

# Verify thirdparty structure
ls thirdparty/llama.cpp/include/

Model Loading Issues

"Failed to load model"

Cause: Model file doesn't exist, is corrupted, or incompatible format.

Solutions:

  1. Verify the file exists:
ls -la models/your-model.gguf
  2. Check file integrity:
from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("models/your-model.gguf")
metadata = ctx.get_all_metadata()
print(metadata)  # Should show model info
  3. Use the correct GGUF format: Cyllama requires GGUF format (not GGML). Convert older models:
# Use llama.cpp's conversion tool
python llama.cpp/convert.py old-model.bin --outfile new-model.gguf

"Out of memory" / VRAM Exhaustion

Cause: Model too large for available memory/VRAM.

Solutions:

  1. Reduce GPU layers:
from cyllama import LLM, GenerationConfig

config = GenerationConfig(n_gpu_layers=20)  # Reduce from 99
llm = LLM("model.gguf", config=config)
  2. Estimate optimal layers:
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers("model.gguf", available_vram_mb=8000)
print(f"Recommended: {estimate.n_gpu_layers} GPU layers")
  3. Use smaller quantization: Download a more heavily quantized model (Q4_0 < Q5_K < Q8_0 < F16, in order of size).
  4. Reduce context size:

config = GenerationConfig(n_ctx=2048)  # Smaller context = less memory
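
Context size matters because the KV cache grows linearly with n_ctx. A rough back-of-envelope estimate (simplified; grouped-query attention and quantized KV caches reduce this, and `kv_cache_bytes` is an illustrative helper, not part of cyllama):

```python
def kv_cache_bytes(n_ctx: int, n_layer: int, n_embd: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: keys + values for every layer and position (f16 by default)."""
    return 2 * n_ctx * n_layer * n_embd * bytes_per_elem

# A LLaMA-7B-like model (32 layers, 4096-dim embeddings) at n_ctx=2048:
print(kv_cache_bytes(2048, 32, 4096) / 2**20)  # 1024.0 MiB
```

Halving n_ctx halves this figure, which is why shrinking the context is often the quickest fix.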

Model Loads but Generation is Slow

Cause: Model not using GPU acceleration.

Solutions:

  1. Check GPU backend is loaded:
from cyllama.llama.llama_cpp import ggml_backend_load_all
ggml_backend_load_all()
  2. Verify GPU layers are being used:
from cyllama import LLM

llm = LLM("model.gguf", n_gpu_layers=99, verbose=True)
# Verbose output should show GPU offload info
  3. On macOS, check Metal:
# Ensure Metal is available
system_profiler SPDisplaysDataType | grep Metal

Generation Issues

Empty or Truncated Output

Cause: max_tokens too low, stop sequences triggered, or EOS token reached.

Solutions:

from cyllama import complete

# Increase max_tokens
response = complete(
    "Write a long essay",
    model_path="model.gguf",
    max_tokens=2000  # Increase this
)

# Check stop sequences aren't interfering
response = complete(
    "Write code",
    model_path="model.gguf",
    stop_sequences=[]  # Clear stop sequences
)
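
Conceptually, a stop sequence truncates output at its first occurrence, so an over-broad stop string can silently cut generation short. A simplified sketch of the behavior (not cyllama's internals):

```python
def apply_stops(text: str, stops: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

assert apply_stops("Answer: 42\nUser:", ["\nUser:"]) == "Answer: 42"
assert apply_stops("hello", []) == "hello"  # no stops, nothing trimmed
```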

Repetitive Output

Cause: Repetition penalty too low or temperature issues.

Solutions:

from cyllama import GenerationConfig, LLM

config = GenerationConfig(
    repeat_penalty=1.2,  # Increase (default 1.1)
    temperature=0.8,     # Add some randomness
    top_k=40,
    top_p=0.95
)

llm = LLM("model.gguf", config=config)
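
For intuition, a llama.cpp-style (CTRL-derived) repetition penalty dampens the logits of recently generated tokens: positive logits are divided by the penalty, negative ones multiplied. A simplified sketch, not cyllama's implementation:

```python
def penalize(logits: dict[int, float], recent: list[int], penalty: float = 1.1) -> dict[int, float]:
    """Dampen logits of recently generated tokens so repeats become less likely."""
    out = dict(logits)
    for tok in set(recent):
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = {1: 2.0, 2: -1.0, 3: 0.5}
print(penalize(logits, [1, 2], penalty=2.0))  # {1: 1.0, 2: -2.0, 3: 0.5}
```

Raising repeat_penalty widens the gap between repeated and fresh tokens, which is why 1.2 curbs loops that 1.1 does not.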

Nonsensical Output

Cause: Temperature too high, wrong model, or corrupted model file.

Solutions:

  1. Lower temperature:
response = complete("...", model_path="model.gguf", temperature=0.3)
  2. Use greedy decoding for deterministic output:
response = complete("...", model_path="model.gguf", temperature=0.0)
  3. Verify model integrity:
from cyllama.llama.llama_cpp import GGUFContext
ctx = GGUFContext.from_file("model.gguf")
print(ctx.get_val_str("general.architecture"))

Chat Format Issues

Cause: Model expects specific chat format that isn't being applied.

Solution: Use the chat() function which applies proper formatting:

from cyllama import chat

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
]

# chat() applies the model's expected format
response = chat(messages, model_path="model.gguf")
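
For a ChatML-style model, the formatting chat() performs looks roughly like the sketch below (illustrative only; the real template comes from the model's GGUF metadata, and `format_chatml` is not a cyllama function):

```python
def format_chatml(messages: list[dict]) -> str:
    """Wrap each message in ChatML delimiters and open an assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")  # model continues from here
    return "\n".join(parts)

print(format_chatml([{"role": "user", "content": "Hello"}]))
```

If you see raw role tags or template markers in the output, the model was likely prompted with the wrong template.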

GPU Issues

Metal Not Working (macOS)

Symptoms: Generation runs on CPU despite having Apple Silicon.

Solutions:

  1. Verify Metal support:
system_profiler SPDisplaysDataType | grep -i metal
  2. Reinstall Xcode tools:
xcode-select --install
  3. Check build used Metal:
make show-backends
# Should show GGML_METAL=1
  4. Rebuild with Metal:
make reset
make build-metal

CUDA Not Found (Linux)

Symptoms: Build fails or GPU not used on NVIDIA systems.

Solutions:

  1. Set CUDA paths:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  2. Rebuild with CUDA:
make reset
make build-cuda
  3. Verify CUDA installation:
nvcc --version
nvidia-smi

CUDA DLLs Not Found (Windows)

Symptoms: ImportError or DLL load failed when importing cyllama on Windows with a CUDA build.

Cause: CUDA toolkit DLLs (e.g. cublas64_13.dll) are not on the DLL search path.

Cyllama automatically discovers CUDA DLL paths when built with GGML_CUDA=1, but the discovery may fail if:

  • CUDA toolkit is installed in a non-standard location
  • Neither CUDA_PATH nor CUDA_HOME is set
  • nvcc is not on PATH

Solutions:

  1. Set the CUDA_PATH environment variable:
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
  2. Add CUDA bin to PATH:
$env:PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin;$env:PATH"
  3. Verify the build detected CUDA:
from cyllama import _backend
print(_backend.cuda)  # Should be True

If False, the package was not built with CUDA support. Rebuild with GGML_CUDA=1.

Vulkan Issues

Symptoms: Vulkan backend not loading.

Solutions:

  1. Install the Vulkan SDK:
     • Linux: sudo apt install vulkan-tools libvulkan-dev
     • macOS: Install from LunarG
  2. Verify Vulkan:
vulkaninfo | head -20
  3. Rebuild:
make build-vulkan

Agent Issues

Agent Loops Forever

Cause: Agent stuck in reasoning loop.

Solutions:

from cyllama.agents import ReActAgent

agent = ReActAgent(
    llm=llm,
    tools=tools,
    max_iterations=5,              # Hard limit
    detect_loops=True,             # Enable loop detection
    max_consecutive_same_action=2, # Stop after 2 identical actions
    max_consecutive_same_tool=3,   # Stop after 3 calls to same tool
)
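
The loop-detection settings above boil down to checks like this (a simplified sketch, not cyllama's implementation):

```python
def is_looping(actions: list[str], max_same: int = 2) -> bool:
    """True once the last `max_same` recorded actions are identical."""
    if len(actions) < max_same:
        return False
    tail = actions[-max_same:]
    return len(set(tail)) == 1  # all identical -> agent is stuck

history = ["search('llama')", "search('llama')"]
print(is_looping(history, max_same=2))  # True -> the agent would stop here
```

Tightening max_consecutive_same_action trades a little flexibility (agents sometimes legitimately retry) for a hard guarantee against infinite loops.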

Agent Can't Parse Tool Calls

Cause: Model not following expected format.

Solutions:

  1. Use ConstrainedAgent for guaranteed parsing:
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=tools)
# Grammar constraints ensure valid JSON output
  2. Use a better model: Larger models (7B+) are more reliable at tool calling.

  3. Simplify tool definitions:

@tool
def simple_tool(query: str) -> str:  # Simple, clear signature
    """Search for information."""     # Clear docstring
    return f"Results: {query}"

Tool Execution Errors

Cause: Tool function throws exception.

Solution: Add error handling in tools:

@tool
def safe_calculate(expression: str) -> str:
    """Safely evaluate a math expression."""
    try:
        # Restrict eval: no builtins, no names (still not a full sandbox --
        # avoid exposing this to untrusted input)
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: Could not evaluate '{expression}': {e}"

Async Issues

"RuntimeError: Event loop is already running"

Cause: Trying to use asyncio.run() inside an existing event loop (e.g., Jupyter).

Solution:

# In Jupyter notebooks, use:
import nest_asyncio
nest_asyncio.apply()

# Or use await directly:
response = await complete_async("...", model_path="model.gguf")

Async Tasks Not Running Concurrently

Cause: AsyncLLM uses a lock to serialize access (by design, for thread safety).

Solution: For true parallelism, use multiple AsyncLLM instances:

import asyncio
from cyllama import AsyncLLM

async def parallel_generation():
    # Create multiple instances for parallel inference
    async with AsyncLLM("model.gguf") as llm1, \
               AsyncLLM("model.gguf") as llm2:

        task1 = llm1("Prompt 1")
        task2 = llm2("Prompt 2")

        results = await asyncio.gather(task1, task2)
        return results
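
The same gather pattern with plain coroutines confirms that the work actually overlaps (stdlib only; fake_generate's sleep stands in for inference latency):

```python
import asyncio
import time

async def fake_generate(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a model call
    return f"{name} done"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_generate("llm1", 0.2),
        fake_generate("llm2", 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, elapsed)  # both finish in ~0.2 s total, not 0.4 s
```

If your tasks take the sum of their individual times instead, they are being serialized -- typically because they share one AsyncLLM instance.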

Whisper Issues

"Invalid audio format"

Cause: Audio not in correct format (16kHz, mono, float32).

Solution:

import numpy as np

def prepare_audio(samples, sample_rate):
    """Convert audio to Whisper-compatible format."""
    # Resample to 16kHz if needed
    if sample_rate != 16000:
        # Use scipy or librosa for resampling
        from scipy import signal
        samples = signal.resample(samples, int(len(samples) * 16000 / sample_rate))

    # Convert to mono if stereo
    if len(samples.shape) > 1:
        samples = samples.mean(axis=1)

    # Convert to float32 in [-1, 1]
    samples = samples.astype(np.float32)
    if samples.max() > 1.0 or samples.min() < -1.0:
        samples = samples / max(abs(samples.max()), abs(samples.min()))

    return samples
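
A quick sanity check of the target format, using one second of synthetic 440 Hz audio already at 16 kHz (numpy only, so no resampling branch is exercised):

```python
import numpy as np

sr = 16000  # Whisper's expected sample rate
t = np.linspace(0.0, 1.0, sr, endpoint=False)
samples = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

assert samples.dtype == np.float32   # float32
assert samples.ndim == 1             # mono
assert -1.0 <= samples.min() <= samples.max() <= 1.0  # normalized range
```

Real audio that fails any of these three checks is the usual cause of "Invalid audio format".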

Stable Diffusion Issues

"Module not available"

Cause: Built without Stable Diffusion support.

Solution:

# Rebuild with SD support
WITH_STABLEDIFFUSION=1 make reset
WITH_STABLEDIFFUSION=1 make build

Images are Blank or Corrupted

Cause: Wrong model type or incompatible settings.

Solutions:

  1. For SDXL Turbo models:
images = text_to_image(
    model_path="sd_xl_turbo.gguf",
    prompt="...",
    sample_steps=4,   # Turbo uses fewer steps
    cfg_scale=1.0     # Turbo uses low CFG
)
  2. For standard SD models:
images = text_to_image(
    model_path="sd_v1_5.gguf",
    prompt="...",
    sample_steps=20,  # More steps
    cfg_scale=7.0     # Higher CFG
)

Performance Tips

Slow First Generation

Cause: Model loading and context creation on first call.

Solution: Use the LLM class to keep the model loaded:

from cyllama import LLM

llm = LLM("model.gguf")  # Load once

# Subsequent calls are fast
for prompt in prompts:
    response = llm(prompt)

High Memory Usage

Solutions:

  1. Close resources when done:
llm = LLM("model.gguf")
# ... use llm ...
llm.close()  # Free memory
  2. Use context managers:
with LLM("model.gguf") as llm:
    response = llm("...")
# Automatically freed
  3. Use batch processing for multiple prompts:
from cyllama import batch_generate

responses = batch_generate(prompts, model_path="model.gguf")

Getting Help

If you're still having issues:

  1. Check the logs: Run with verbose=True for detailed output
  2. Search existing issues: GitHub Issues
  3. Open a new issue: Include your platform, Python version, error message, and minimal reproduction code