
Troubleshooting

Common issues and solutions when using cyllama.

Installation Issues

"No module named 'cyllama'"

Cause: Cyllama is not installed or not in the Python path.

Solution:

# Make sure you're in the project directory
cd cyllama

# Build and install
make

# Or manually install in editable mode
uv pip install -e .

Build Fails with CMake Errors

Cause: Missing dependencies or incompatible CMake version.

Solution:

# Check CMake version (need 3.21+)
cmake --version

# Clean and rebuild
make reset
make build

# On macOS, ensure Xcode tools are installed
xcode-select --install

"fatal error: 'llama.h' file not found"

Cause: llama.cpp headers not built or not found.

Solution:

# Rebuild dependencies
make reset
make

# Verify thirdparty structure
ls thirdparty/llama.cpp/include/

Model Loading Issues

"Failed to load model"

Cause: Model file doesn't exist, is corrupted, or incompatible format.

Solutions:

  1. Verify the file exists:
ls -la models/your-model.gguf
  2. Check file integrity:
from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("models/your-model.gguf")
metadata = ctx.get_all_metadata()
print(metadata)  # Should show model info
  3. Use the correct GGUF format: Cyllama requires GGUF format (not GGML). Convert older models:
# Use llama.cpp's conversion tool
python llama.cpp/convert.py old-model.bin --outfile new-model.gguf

"Out of memory" / VRAM Exhaustion

Cause: Model too large for available memory/VRAM.

Solutions:

  1. Reduce GPU layers:
from cyllama import LLM, GenerationConfig

config = GenerationConfig(n_gpu_layers=20)  # Reduce from 99
llm = LLM("model.gguf", config=config)
  2. Estimate optimal layers:
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers("model.gguf", available_vram_mb=8000)
print(f"Recommended: {estimate.n_gpu_layers} GPU layers")
  3. Use smaller quantization: Download a more heavily quantized model (Q4_0 < Q5_K < Q8_0 < F16, in order of size).
  4. Reduce context size:

config = GenerationConfig(n_ctx=2048)  # Smaller context = less memory
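
Context size matters because the KV cache grows linearly with n_ctx. A rough back-of-envelope estimate (simplified; grouped-query attention and quantized KV caches reduce this, and `kv_cache_bytes` is an illustrative helper, not part of cyllama):

```python
def kv_cache_bytes(n_ctx: int, n_layer: int, n_embd: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: keys + values for every layer and position (f16 by default)."""
    return 2 * n_ctx * n_layer * n_embd * bytes_per_elem

# A LLaMA-7B-like model (32 layers, 4096-dim embeddings) at n_ctx=2048:
print(kv_cache_bytes(2048, 32, 4096) / 2**20)  # 1024.0 MiB
```

Halving n_ctx halves this figure, which is why shrinking the context is often the quickest fix.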

Model Loads but Generation is Slow

Cause: Model not using GPU acceleration.

Solutions:

  1. Check GPU backend is loaded:
from cyllama.llama.llama_cpp import ggml_backend_load_all
ggml_backend_load_all()
  2. Verify GPU layers are being used:
from cyllama import LLM

llm = LLM("model.gguf", n_gpu_layers=99, verbose=True)
# Verbose output should show GPU offload info
  3. On macOS, check Metal:
# Ensure Metal is available
system_profiler SPDisplaysDataType | grep Metal

Generation Issues

Empty or Truncated Output

Cause: max_tokens too low, stop sequences triggered, or EOS token reached.

Solutions:

from cyllama import complete

# Increase max_tokens
response = complete(
    "Write a long essay",
    model_path="model.gguf",
    max_tokens=2000  # Increase this
)

# Check stop sequences aren't interfering
response = complete(
    "Write code",
    model_path="model.gguf",
    stop_sequences=[]  # Clear stop sequences
)
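
Conceptually, a stop sequence truncates output at its first occurrence, so an over-broad stop string can silently cut generation short. A simplified sketch of the behavior (not cyllama's internals):

```python
def apply_stops(text: str, stops: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

assert apply_stops("Answer: 42\nUser:", ["\nUser:"]) == "Answer: 42"
assert apply_stops("hello", []) == "hello"  # no stops, nothing trimmed
```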

Repetitive Output

Cause: Repetition penalty too low or temperature issues.

Solutions:

from cyllama import GenerationConfig, LLM

config = GenerationConfig(
    repeat_penalty=1.2,  # Increase (default 1.1)
    temperature=0.8,     # Add some randomness
    top_k=40,
    top_p=0.95
)

llm = LLM("model.gguf", config=config)
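
For intuition, a llama.cpp-style (CTRL-derived) repetition penalty dampens the logits of recently generated tokens: positive logits are divided by the penalty, negative ones multiplied. A simplified sketch, not cyllama's implementation:

```python
def penalize(logits: dict[int, float], recent: list[int], penalty: float = 1.1) -> dict[int, float]:
    """Dampen logits of recently generated tokens so repeats become less likely."""
    out = dict(logits)
    for tok in set(recent):
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = {1: 2.0, 2: -1.0, 3: 0.5}
print(penalize(logits, [1, 2], penalty=2.0))  # {1: 1.0, 2: -2.0, 3: 0.5}
```

Raising repeat_penalty widens the gap between repeated and fresh tokens, which is why 1.2 curbs loops that 1.1 does not.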

Nonsensical Output

Cause: Temperature too high, wrong model, or corrupted model file.

Solutions:

  1. Lower temperature:
response = complete("...", model_path="model.gguf", temperature=0.3)
  2. Use greedy decoding for deterministic output:
response = complete("...", model_path="model.gguf", temperature=0.0)
  3. Verify model integrity:
from cyllama.llama.llama_cpp import GGUFContext
ctx = GGUFContext.from_file("model.gguf")
print(ctx.get_val_str("general.architecture"))

Chat Format Issues

Cause: Model expects specific chat format that isn't being applied.

Solution: Use the chat() function which applies proper formatting:

from cyllama import chat

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
]

# chat() applies the model's expected format
response = chat(messages, model_path="model.gguf")
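
For a ChatML-style model, the formatting chat() performs looks roughly like the sketch below (illustrative only; the real template comes from the model's GGUF metadata, and `format_chatml` is not a cyllama function):

```python
def format_chatml(messages: list[dict]) -> str:
    """Wrap each message in ChatML delimiters and open an assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")  # model continues from here
    return "\n".join(parts)

print(format_chatml([{"role": "user", "content": "Hello"}]))
```

If you see raw role tags or template markers in the output, the model was likely prompted with the wrong template.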

GPU Issues

Metal Not Working (macOS)

Symptoms: Generation runs on CPU despite having Apple Silicon.

Solutions:

  1. Verify Metal support:
system_profiler SPDisplaysDataType | grep -i metal
  2. Reinstall Xcode tools:
xcode-select --install
  3. Check build used Metal:
make show-backends
# Should show GGML_METAL=1
  4. Rebuild with Metal:
make reset
make build-metal

CUDA Not Found (Linux)

Symptoms: Build fails or GPU not used on NVIDIA systems.

Solutions:

  1. Set CUDA paths:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  2. Rebuild with CUDA:
make reset
make build-cuda
  3. Verify CUDA installation:
nvcc --version
nvidia-smi

CUDA DLLs Not Found (Windows)

Symptoms: ImportError or DLL load failed when importing cyllama on Windows with a CUDA build.

Cause: CUDA toolkit DLLs (e.g. cublas64_13.dll) are not on the DLL search path.

Cyllama automatically discovers CUDA DLL paths when built with GGML_CUDA=1, but the discovery may fail if:

  • CUDA toolkit is installed in a non-standard location
  • Neither CUDA_PATH nor CUDA_HOME is set
  • nvcc is not on PATH

Solutions:

  1. Set the CUDA_PATH environment variable:
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
  2. Add CUDA bin to PATH:
$env:PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin;$env:PATH"
  3. Verify the build detected CUDA:
from cyllama import _backend
print(_backend.cuda)  # Should be True

If False, the package was not built with CUDA support. Rebuild with GGML_CUDA=1.

Vulkan Issues

Symptoms: Vulkan backend not loading.

Solutions:

  1. Install the Vulkan SDK:
     • Linux: sudo apt install vulkan-tools libvulkan-dev
     • macOS: Install from LunarG
  2. Verify Vulkan:
vulkaninfo | head -20
  3. Rebuild:
make build-vulkan

Agent Issues

Agent Loops Forever

Cause: Agent stuck in reasoning loop.

Solutions:

from cyllama.agents import ReActAgent

agent = ReActAgent(
    llm=llm,
    tools=tools,
    max_iterations=5,              # Hard limit
    detect_loops=True,             # Enable loop detection
    max_consecutive_same_action=2, # Stop after 2 identical actions
    max_consecutive_same_tool=3,   # Stop after 3 calls to same tool
)
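
The loop-detection settings above boil down to checks like this (a simplified sketch, not cyllama's implementation):

```python
def is_looping(actions: list[str], max_same: int = 2) -> bool:
    """True once the last `max_same` recorded actions are identical."""
    if len(actions) < max_same:
        return False
    tail = actions[-max_same:]
    return len(set(tail)) == 1  # all identical -> agent is stuck

history = ["search('llama')", "search('llama')"]
print(is_looping(history, max_same=2))  # True -> the agent would stop here
```

Tightening max_consecutive_same_action trades a little flexibility (agents sometimes legitimately retry) for a hard guarantee against infinite loops.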

Agent Can't Parse Tool Calls

Cause: Model not following expected format.

Solutions:

  1. Use ConstrainedAgent for guaranteed parsing:
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=tools)
# Grammar constraints ensure valid JSON output
  2. Use a better model: Larger models (7B+) are more reliable at tool calling.

  3. Simplify tool definitions:

@tool
def simple_tool(query: str) -> str:  # Simple, clear signature
    """Search for information."""     # Clear docstring
    return f"Results: {query}"

Tool Execution Errors

Cause: Tool function throws exception.

Solution: Add error handling in tools:

@tool
def safe_calculate(expression: str) -> str:
    """Safely evaluate a math expression."""
    try:
        # Restrict eval: no builtins, no names (still not a full sandbox --
        # avoid exposing this to untrusted input)
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: Could not evaluate '{expression}': {e}"

Async Issues

"RuntimeError: Event loop is already running"

Cause: Trying to use asyncio.run() inside an existing event loop (e.g., Jupyter).

Solution:

# In Jupyter notebooks, use:
import nest_asyncio
nest_asyncio.apply()

# Or use await directly:
response = await complete_async("...", model_path="model.gguf")

Async Tasks Not Running Concurrently

Cause: AsyncLLM uses a lock to serialize access (by design, for thread safety).

Solution: For true parallelism, use multiple AsyncLLM instances:

import asyncio
from cyllama import AsyncLLM

async def parallel_generation():
    # Create multiple instances for parallel inference
    async with AsyncLLM("model.gguf") as llm1, \
               AsyncLLM("model.gguf") as llm2:

        task1 = llm1("Prompt 1")
        task2 = llm2("Prompt 2")

        results = await asyncio.gather(task1, task2)
        return results
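
The same gather pattern with plain coroutines confirms that the work actually overlaps (stdlib only; fake_generate's sleep stands in for inference latency):

```python
import asyncio
import time

async def fake_generate(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a model call
    return f"{name} done"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_generate("llm1", 0.2),
        fake_generate("llm2", 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, elapsed)  # both finish in ~0.2 s total, not 0.4 s
```

If your tasks take the sum of their individual times instead, they are being serialized -- typically because they share one AsyncLLM instance.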

Whisper Issues

"Invalid audio format"

Cause: Audio not in correct format (16kHz, mono, float32).

Solution:

import numpy as np

def prepare_audio(samples, sample_rate):
    """Convert audio to Whisper-compatible format."""
    # Resample to 16kHz if needed
    if sample_rate != 16000:
        # Use scipy or librosa for resampling
        from scipy import signal
        samples = signal.resample(samples, int(len(samples) * 16000 / sample_rate))

    # Convert to mono if stereo
    if len(samples.shape) > 1:
        samples = samples.mean(axis=1)

    # Convert to float32 in [-1, 1]
    samples = samples.astype(np.float32)
    if samples.max() > 1.0 or samples.min() < -1.0:
        samples = samples / max(abs(samples.max()), abs(samples.min()))

    return samples
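
A quick sanity check of the target format, using one second of synthetic 440 Hz audio already at 16 kHz (numpy only, so no resampling branch is exercised):

```python
import numpy as np

sr = 16000  # Whisper's expected sample rate
t = np.linspace(0.0, 1.0, sr, endpoint=False)
samples = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

assert samples.dtype == np.float32   # float32
assert samples.ndim == 1             # mono
assert -1.0 <= samples.min() <= samples.max() <= 1.0  # normalized range
```

Real audio that fails any of these three checks is the usual cause of "Invalid audio format".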

Stable Diffusion Issues

"Module not available"

Cause: Built without Stable Diffusion support.

Solution:

# Rebuild with SD support
WITH_STABLEDIFFUSION=1 make reset
WITH_STABLEDIFFUSION=1 make build

Images are Blank or Corrupted

Cause: Wrong model type or incompatible settings.

Solutions:

  1. For SDXL Turbo models:
images = text_to_image(
    model_path="sd_xl_turbo.gguf",
    prompt="...",
    sample_steps=4,   # Turbo uses fewer steps
    cfg_scale=1.0     # Turbo uses low CFG
)
  2. For standard SD models:
images = text_to_image(
    model_path="sd_v1_5.gguf",
    prompt="...",
    sample_steps=20,  # More steps
    cfg_scale=7.0     # Higher CFG
)

Performance Tips

Slow First Generation

Cause: Model loading and context creation on first call.

Solution: Use the LLM class to keep the model loaded:

from cyllama import LLM

llm = LLM("model.gguf")  # Load once

# Subsequent calls are fast
for prompt in prompts:
    response = llm(prompt)

High Memory Usage

Solutions:

  1. Close resources when done:
llm = LLM("model.gguf")
# ... use llm ...
llm.close()  # Free memory
  2. Use context managers:
with LLM("model.gguf") as llm:
    response = llm("...")
# Automatically freed
  3. Use batch processing for multiple prompts:
from cyllama import batch_generate

responses = batch_generate(prompts, model_path="model.gguf")

Getting Help

If you're still having issues:

  1. Check the logs: Run with verbose=True for detailed output
  2. Search existing issues: GitHub Issues
  3. Open a new issue: Include your platform, Python version, error message, and minimal reproduction code