# Building cyllama with Different Backends

cyllama supports multiple GPU acceleration backends through llama.cpp. This guide shows how to build with different backends using either the Makefile or the Python build manager (`scripts/manage.py`).
## Quick Start

### Default Build (Metal on macOS, CPU-only on Linux)
Using Makefile:
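```bash
make build
```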
Using manage.py:
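```bash
python3 scripts/manage.py build --llama-cpp
```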
### Build with Specific Backend
Using Makefile:
```bash
# CUDA (NVIDIA GPUs)
make build-cuda

# Vulkan (Cross-platform GPU)
make build-vulkan

# CPU-only (no GPU)
make build-cpu

# SYCL (Intel GPUs)
make build-sycl

# HIP/ROCm (AMD GPUs)
make build-hip

# All backends (Metal + CUDA + Vulkan)
make build-all
```
Using manage.py:
```bash
# CUDA (NVIDIA GPUs)
python3 scripts/manage.py build --llama-cpp --cuda

# Vulkan (Cross-platform GPU)
python3 scripts/manage.py build --llama-cpp --vulkan

# CPU-only (no GPU)
python3 scripts/manage.py build --llama-cpp --cpu-only

# SYCL (Intel GPUs)
python3 scripts/manage.py build --llama-cpp --sycl

# HIP/ROCm (AMD GPUs)
python3 scripts/manage.py build --llama-cpp --hip

# Multiple backends (CUDA + Vulkan)
python3 scripts/manage.py build --llama-cpp --cuda --vulkan

# Metal on macOS (default, or explicit)
python3 scripts/manage.py build --llama-cpp --metal
```
## Environment Variable Control
You can fine-tune backend selection using environment variables (works with both Makefile and manage.py):
Using Makefile:
```bash
# Enable specific backends
export GGML_CUDA=1
export GGML_VULKAN=1
make build

# Disable Metal on macOS
export GGML_METAL=0
make build

# Build with multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build
```
Using manage.py:
```bash
# Environment variables work the same way
export GGML_CUDA=1
export GGML_VULKAN=1
python3 scripts/manage.py build --llama-cpp

# Or combine with command-line flags (flags override env vars)
export GGML_METAL=1
python3 scripts/manage.py build --llama-cpp --cuda  # Enables both Metal and CUDA
```
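The precedence just described (a command-line flag wins over the `GGML_*` environment variable, which wins over the platform default) can be sketched roughly as follows. `backend_enabled` is a hypothetical helper for illustration, not cyllama's actual implementation:

```python
import os

def backend_enabled(name, cli_flag=None, default=False):
    """Resolve one backend toggle: CLI flag wins, then the GGML_* env var,
    then the platform default. Illustrative sketch only."""
    if cli_flag is not None:               # e.g. --cuda passed on the command line
        return bool(cli_flag)
    env = os.environ.get("GGML_" + name.upper())
    if env is not None:                    # e.g. export GGML_CUDA=1
        return env not in ("0", "", "off", "OFF")
    return default                         # e.g. Metal defaults to on under macOS

# Example: env var enables Vulkan; a flag enables CUDA on top of it
os.environ["GGML_VULKAN"] = "1"
print(backend_enabled("vulkan"))               # True (from env)
print(backend_enabled("cuda", cli_flag=True))  # True (from flag)
```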
## Available Backend Flags
These flags apply uniformly to all components (llama.cpp, whisper.cpp, stable-diffusion.cpp):
| Variable | Default | Description |
|---|---|---|
| `GGML_METAL` | `1` | Apple Metal (macOS GPU) |
| `GGML_CUDA` | `0` | NVIDIA CUDA |
| `GGML_VULKAN` | `0` | Vulkan (cross-platform GPU) |
| `GGML_SYCL` | `0` | Intel SYCL (oneAPI) |
| `GGML_HIP` | `0` | AMD ROCm/HIP |
| `GGML_OPENCL` | `0` | OpenCL (Adreno, mobile GPUs) |
| `SD_USE_VENDORED_GGML` | `0` | Link stable-diffusion against its own vendored ggml instead of llama.cpp's |
## Backend Requirements

### CUDA (NVIDIA GPUs)
Requirements:
- NVIDIA GPU with compute capability 6.0+
- CUDA Toolkit 11.0+ installed
- `nvcc` compiler in PATH
Install CUDA:
```bash
# Ubuntu/Debian
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6

# Verify installation
nvcc --version
```
Build:
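```bash
make build-cuda
# or
python3 scripts/manage.py build --llama-cpp --cuda
```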
### Vulkan (Cross-platform GPU)
Requirements:
- Vulkan-capable GPU (NVIDIA, AMD, Intel, or Apple)
- Vulkan SDK installed
- Vulkan headers in system include path
Install Vulkan SDK:
```bash
# Ubuntu/Debian
sudo apt-get install -y libvulkan-dev vulkan-tools

# macOS
brew install vulkan-headers vulkan-loader molten-vk

# Verify installation
vulkaninfo --summary
```
Build:
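```bash
make build-vulkan
# or
python3 scripts/manage.py build --llama-cpp --vulkan
```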
### Metal (Apple Silicon/macOS)
Requirements:
- macOS 11.0+ (Big Sur or later)
- Apple Silicon (M1/M2/M3) or Intel Mac with AMD GPU
- Xcode Command Line Tools
Build (enabled by default on macOS):
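```bash
make build
# or explicitly:
python3 scripts/manage.py build --llama-cpp --metal
```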
### SYCL (Intel GPUs)
Requirements:
- Intel GPU (Iris Xe, Arc, or Flex)
- Intel oneAPI Base Toolkit installed
Install Intel oneAPI:
```bash
# Ubuntu/Debian
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt-get update
sudo apt-get install -y intel-basekit

# Setup environment
source /opt/intel/oneapi/setvars.sh
```
Build:
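```bash
make build-sycl
# or
python3 scripts/manage.py build --llama-cpp --sycl
```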
### HIP/ROCm (AMD GPUs)
Requirements:
- AMD GPU with ROCm support (gfx90a, gfx942, gfx1100, or newer)
- ROCm 6.3+ installed
Install ROCm:
```bash
# Ubuntu 22.04+
sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/6.3.3/ubuntu/jammy/amdgpu-install_6.3.60303-1_all.deb
sudo apt-get install -y ./amdgpu-install_6.3.60303-1_all.deb
sudo amdgpu-install --usecase=rocm

# Verify installation
rocm-smi
```
Build:
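```bash
make build-hip
# or
python3 scripts/manage.py build --llama-cpp --hip
```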
## Multi-Backend Builds
You can build with multiple backends simultaneously:
```bash
# CUDA + Vulkan + Metal (on macOS)
export GGML_METAL=1 GGML_CUDA=1 GGML_VULKAN=1
make build

# CUDA + Vulkan (on Linux)
export GGML_CUDA=1 GGML_VULKAN=1
make build
```
At runtime, llama.cpp will automatically select the best available backend, or you can specify one explicitly via the model configuration.
## Checking Your Build
After building, you can verify which backends were compiled:
```bash
# Show current backend configuration
make show-backends

# Check compiled libraries
ls -lh thirdparty/llama.cpp/lib/
```
Expected libraries for each backend:

| Backend | Library |
|---|---|
| CPU | `libggml-cpu.a` (always built) |
| Metal | `libggml-metal.a` + `libggml-blas.a` |
| CUDA | `libggml-cuda.a` |
| Vulkan | `libggml-vulkan.a` |
| SYCL | `libggml-sycl.a` |
| HIP/ROCm | `libggml-hip.a` |
| OpenCL | `libggml-opencl.a` |
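The table above can also double as a quick programmatic self-check. This is a sketch that assumes the `thirdparty/llama.cpp/lib/` layout shown above; adjust the path to your checkout:

```python
from pathlib import Path

# Static library names from the table above
BACKEND_LIBS = {
    "cpu": "libggml-cpu.a",
    "metal": "libggml-metal.a",
    "cuda": "libggml-cuda.a",
    "vulkan": "libggml-vulkan.a",
    "sycl": "libggml-sycl.a",
    "hip": "libggml-hip.a",
    "opencl": "libggml-opencl.a",
}

def compiled_backends(lib_dir):
    """Return the backends whose static library exists in lib_dir."""
    lib_dir = Path(lib_dir)
    return sorted(n for n, lib in BACKEND_LIBS.items() if (lib_dir / lib).exists())

# Usage: compiled_backends("thirdparty/llama.cpp/lib")
```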
## Troubleshooting

### "nvcc: command not found" (CUDA)
Make sure CUDA toolkit is installed and in your PATH:
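```bash
# Assumes the default toolkit install location
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version
```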
### "vulkan/vulkan.h: No such file" (Vulkan)
Install Vulkan SDK:
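```bash
# Ubuntu/Debian
sudo apt-get install -y libvulkan-dev vulkan-tools
```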
### "cannot find -lcuda" (CUDA linking error)
Add CUDA library path:
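```bash
# One common fix, assuming the default toolkit prefix; the toolkit's
# stubs directory provides a libcuda.so for link time
export LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib64/stubs:$LIBRARY_PATH
```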
### Metal not working on macOS
Ensure you have the latest Xcode Command Line Tools:
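```bash
xcode-select --install
```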
### Performance is slow with GPU backend
Make sure you're enabling GPU offloading in your model configuration:
```python
from cyllama import LLM

model = LLM(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers to GPU
)
```
## Clean Rebuild
If you encounter issues after changing backends:
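```bash
# Assumes the Makefile provides a standard `clean` target
make clean
make build-cuda   # or whichever backend target you need
```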
## Performance Comparison
Approximate relative performance (inference speed):
| Backend | Relative Speed | Notes |
|---|---|---|
| CPU only | 1x (baseline) | Good for small models |
| Metal (M1/M2/M3) | 5-15x | Best on Apple Silicon |
| CUDA (RTX 4090) | 10-30x | Best on NVIDIA GPUs |
| Vulkan | 5-20x | Good cross-platform option |
| SYCL (Arc A770) | 3-8x | Intel GPUs |
| HIP/ROCm | 8-25x | AMD GPUs |
Performance varies greatly based on model size, quantization, and hardware.
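When comparing backends on your own hardware, measure tokens per second rather than relying on the rough multipliers above. A minimal timing helper; the `generate` callable here is a stand-in for whatever generation call your code actually makes:

```python
import time

def tokens_per_second(generate, n_tokens):
    """Time a generation callable and return throughput in tokens/sec.

    `generate` should produce `n_tokens` tokens when called; this sketch
    only measures wall-clock time around it."""
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Run the same prompt and token count against each build to get comparable numbers.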
## Recommended Backends by Platform
| Platform | Recommended Backend | Alternative |
|---|---|---|
| Apple Silicon Mac | Metal | Vulkan |
| Linux + NVIDIA GPU | CUDA | Vulkan |
| Linux + AMD GPU | HIP/ROCm | Vulkan |
| Linux + Intel GPU | SYCL | Vulkan |
| Windows + NVIDIA GPU | CUDA | Vulkan |
| Windows + AMD GPU | Vulkan | HIP/ROCm |
## Advanced: Custom CMake Flags

Cyllama has a two-stage build process:

1. **Dependency build** (`scripts/manage.py`): builds llama.cpp, whisper.cpp, and stable-diffusion.cpp as static libraries
2. **Extension build** (`CMakeLists.txt`): builds the Cython extensions with scikit-build-core, linking against the static libraries
### Customizing Dependency Build
Use scripts/manage.py with environment variables or flags:
```bash
# Example: Build llama.cpp with CUDA for specific architectures
CMAKE_CUDA_ARCHITECTURES="86-real;89-real" python3 scripts/manage.py build --llama-cpp --cuda
```
### Customizing Extension Build
For the Cython extension build, pass CMake args via scikit-build-core:
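For example (`cmake.args` and `cmake.define` are standard scikit-build-core config settings; the exact defines your `CMakeLists.txt` consumes may differ):

```bash
pip install -v . \
  --config-settings=cmake.args="-DCMAKE_BUILD_TYPE=Release" \
  --config-settings=cmake.define.GGML_CUDA=ON
```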