Building cyllama with Different Backends

cyllama supports multiple GPU acceleration backends through llama.cpp. This guide shows you how to build with different backends using either the Makefile or the Python build manager (scripts/manage.py).

Quick Start

Default Build (Metal on macOS, CPU-only on Linux)

Using Makefile:

make build

Using manage.py:

python3 scripts/manage.py build --llama-cpp

Build with Specific Backend

Using Makefile:

# CUDA (NVIDIA GPUs)
make build-cuda

# Vulkan (Cross-platform GPU)
make build-vulkan

# CPU-only (no GPU)
make build-cpu

# SYCL (Intel GPUs)
make build-sycl

# HIP/ROCm (AMD GPUs)
make build-hip

# All backends (Metal + CUDA + Vulkan)
make build-all

Using manage.py:

# CUDA (NVIDIA GPUs)
python3 scripts/manage.py build --llama-cpp --cuda

# Vulkan (Cross-platform GPU)
python3 scripts/manage.py build --llama-cpp --vulkan

# CPU-only (no GPU)
python3 scripts/manage.py build --llama-cpp --cpu-only

# SYCL (Intel GPUs)
python3 scripts/manage.py build --llama-cpp --sycl

# HIP/ROCm (AMD GPUs)
python3 scripts/manage.py build --llama-cpp --hip

# Multiple backends (CUDA + Vulkan)
python3 scripts/manage.py build --llama-cpp --cuda --vulkan

# Metal on macOS (default, or explicit)
python3 scripts/manage.py build --llama-cpp --metal

Environment Variable Control

You can fine-tune backend selection using environment variables (works with both Makefile and manage.py):

Using Makefile:

# Enable specific backends
export GGML_CUDA=1
export GGML_VULKAN=1
make build

# Disable Metal on macOS
export GGML_METAL=0
make build

# Build with multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build

Using manage.py:

# Environment variables work the same way
export GGML_CUDA=1
export GGML_VULKAN=1
python3 scripts/manage.py build --llama-cpp

# Or combine with command-line flags (flags take precedence over env vars for the same backend)
export GGML_METAL=1
python3 scripts/manage.py build --llama-cpp --cuda  # Enables both Metal and CUDA

Available Backend Flags

These flags apply uniformly to all components (llama.cpp, whisper.cpp, stable-diffusion.cpp):

| Variable | Default | Description |
|---|---|---|
| GGML_METAL | 1 | Apple Metal (macOS GPU) |
| GGML_CUDA | 0 | NVIDIA CUDA |
| GGML_VULKAN | 0 | Vulkan (cross-platform GPU) |
| GGML_SYCL | 0 | Intel SYCL (oneAPI) |
| GGML_HIP | 0 | AMD ROCm/HIP |
| GGML_OPENCL | 0 | OpenCL (Adreno, mobile GPUs) |
| SD_USE_VENDORED_GGML | 0 | Link stable-diffusion.cpp against its own vendored ggml instead of llama.cpp's |
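
For example, to link stable-diffusion.cpp against its own vendored ggml copy rather than the one from llama.cpp (a minimal sketch using the variable from the table above):

# Build with stable-diffusion.cpp's vendored ggml instead of llama.cpp's
export SD_USE_VENDORED_GGML=1
make build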

Backend Requirements

CUDA (NVIDIA GPUs)

Requirements:

  • NVIDIA GPU with compute capability 6.0+
  • CUDA Toolkit 11.0+ installed
  • nvcc compiler in PATH

Install CUDA:

# Ubuntu/Debian
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6

# Verify installation
nvcc --version

Build:

export GGML_CUDA=1
make build

Vulkan (Cross-platform GPU)

Requirements:

  • Vulkan-capable GPU (NVIDIA, AMD, Intel, or Apple)
  • Vulkan SDK installed
  • Vulkan headers in system include path

Install Vulkan SDK:

# Ubuntu/Debian
sudo apt-get install -y libvulkan-dev vulkan-tools

# macOS
brew install vulkan-headers vulkan-loader molten-vk

# Verify installation
vulkaninfo --summary

Build:

export GGML_VULKAN=1
make build

Metal (Apple Silicon/macOS)

Requirements:

  • macOS 11.0+ (Big Sur or later)
  • Apple Silicon (M1/M2/M3) or Intel Mac with AMD GPU
  • Xcode Command Line Tools

Build (enabled by default on macOS):

make build
# or explicitly:
export GGML_METAL=1
make build

SYCL (Intel GPUs)

Requirements:

  • Intel GPU (Iris Xe, Arc, or Flex)
  • Intel oneAPI Base Toolkit installed

Install Intel oneAPI:

# Ubuntu/Debian
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt-get update
sudo apt-get install -y intel-basekit

# Setup environment
source /opt/intel/oneapi/setvars.sh

Build:

export GGML_SYCL=1
make build
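
If the build cannot find the SYCL compilers, make sure the oneAPI environment from the setup step above is sourced in the same shell before building:

source /opt/intel/oneapi/setvars.sh
export GGML_SYCL=1
make build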

HIP/ROCm (AMD GPUs)

Requirements:

  • AMD GPU with ROCm support (gfx90a, gfx942, gfx1100, or newer)
  • ROCm 6.3+ installed

Install ROCm:

# Ubuntu 22.04+
sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/6.3.3/ubuntu/jammy/amdgpu-install_6.3.60303-1_all.deb
sudo apt-get install -y ./amdgpu-install_6.3.60303-1_all.deb
sudo amdgpu-install --usecase=rocm

# Verify installation
rocm-smi

Build:

export GGML_HIP=1
make build

Multi-Backend Builds

You can build with multiple backends simultaneously:

# CUDA + Vulkan + Metal (on macOS)
export GGML_METAL=1 GGML_CUDA=1 GGML_VULKAN=1
make build

# CUDA + Vulkan (on Linux)
export GGML_CUDA=1 GGML_VULKAN=1
make build

At runtime, llama.cpp will automatically select the best available backend, or you can specify one explicitly via the model configuration.

Checking Your Build

After building, you can verify which backends were compiled:

# Show current backend configuration
make show-backends

# Check compiled libraries
ls -lh thirdparty/llama.cpp/lib/

Expected libraries for each backend:

| Backend | Library |
|---|---|
| CPU | libggml-cpu.a (always built) |
| Metal | libggml-metal.a + libggml-blas.a |
| CUDA | libggml-cuda.a |
| Vulkan | libggml-vulkan.a |
| SYCL | libggml-sycl.a |
| HIP/ROCm | libggml-hip.a |
| OpenCL | libggml-opencl.a |
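
As a quick sanity check, you can test for the library that corresponds to your chosen backend, for example after a CUDA build (a sketch based on the paths listed above):

# Confirm the CUDA backend library was produced
test -f thirdparty/llama.cpp/lib/libggml-cuda.a && echo "CUDA backend built" || echo "libggml-cuda.a not found"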

Troubleshooting

"nvcc: command not found" (CUDA)

Make sure CUDA toolkit is installed and in your PATH:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

"vulkan/vulkan.h: No such file" (Vulkan)

Install Vulkan SDK:

# Ubuntu/Debian
sudo apt-get install -y libvulkan-dev

# macOS
brew install vulkan-headers molten-vk

"cannot find -lcuda" (CUDA linking error)

Add CUDA library path:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Metal not working on macOS

Ensure you have the latest Xcode Command Line Tools:

xcode-select --install

Performance is slow with GPU backend

Make sure you're enabling GPU offloading in your model configuration:

from cyllama import LLM

model = LLM(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers to GPU
)

Clean Rebuild

If you encounter issues after changing backends:

# Clean everything and rebuild
make reset
export GGML_CUDA=1  # or your desired backend
make build

Performance Comparison

Approximate relative performance (inference speed):

| Backend | Relative Speed | Notes |
|---|---|---|
| CPU only | 1x (baseline) | Good for small models |
| Metal (M1/M2/M3) | 5-15x | Best on Apple Silicon |
| CUDA (RTX 4090) | 10-30x | Best on NVIDIA GPUs |
| Vulkan | 5-20x | Good cross-platform option |
| SYCL (Arc A770) | 3-8x | Intel GPUs |
| HIP/ROCm | 8-25x | AMD GPUs |

Performance varies greatly based on model size, quantization, and hardware.

Recommended backends by platform:

| Platform | Recommended Backend | Alternative |
|---|---|---|
| Apple Silicon Mac | Metal | Vulkan |
| Linux + NVIDIA GPU | CUDA | Vulkan |
| Linux + AMD GPU | HIP/ROCm | Vulkan |
| Linux + Intel GPU | SYCL | Vulkan |
| Windows + NVIDIA GPU | CUDA | Vulkan |
| Windows + AMD GPU | Vulkan | HIP/ROCm |

Advanced: Custom CMake Flags

cyllama has a two-stage build process (see the combined example after this list):

  1. Dependency build (scripts/manage.py): Builds llama.cpp, whisper.cpp, stable-diffusion.cpp as static libraries
  2. Extension build (CMakeLists.txt): Builds Cython extensions with scikit-build-core, linking against the static libraries
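
Putting both stages together, a typical GPU build might look like the following sketch (this assumes a CUDA toolchain; match the CMake flag in stage 2 to whichever backend you enabled in stage 1):

# Stage 1: build the llama.cpp static libraries with CUDA
python3 scripts/manage.py build --llama-cpp --cuda

# Stage 2: build the Cython extension wheel with a matching CMake flag
CMAKE_ARGS="-DGGML_CUDA=ON" uv build --wheel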

Customizing Dependency Build

Use scripts/manage.py with environment variables or flags:

# Example: Build llama.cpp with CUDA for specific architectures
CMAKE_CUDA_ARCHITECTURES="86-real;89-real" python3 scripts/manage.py build --llama-cpp --cuda

Customizing Extension Build

For the Cython extension build, pass CMake args via scikit-build-core:

# Pass CMake args during wheel build
CMAKE_ARGS="-DGGML_METAL=ON" uv build --wheel

References