Stable Diffusion Integration

Cyllama wraps stable-diffusion.cpp to provide image and video generation capabilities in Python.

Note: Build with WITH_STABLEDIFFUSION=1 to enable this module. By default, stable-diffusion.cpp links against llama.cpp's ggml. To use stable-diffusion.cpp's own vendored ggml instead, set SD_USE_VENDORED_GGML=1.

Overview

The stable diffusion module provides Python bindings to stable-diffusion.cpp, enabling:

  • Text-to-image generation
  • Image-to-image transformation
  • Inpainting with masks
  • ControlNet guided generation
  • Video generation (with compatible models like Wan, CogVideoX)
  • ESRGAN image upscaling
  • Model format conversion

Quick Start

Text-to-Image

from cyllama.sd import text_to_image

images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)

images[0].save("output.png")

With Model Reuse

When generating multiple images, reuse the context to avoid reloading the model for each prompt:

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"

with SDContext(params) as ctx:
    for prompt in ["a cat", "a dog", "a bird"]:
        images = ctx.generate(
            prompt=prompt,
            sample_steps=4,
            cfg_scale=1.0
        )
        images[0].save(f"{prompt.replace(' ', '_')}.png")
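The loop above derives filenames by replacing spaces. For arbitrary prompts, a slightly more defensive helper (hypothetical, not part of cyllama) also strips characters that are invalid in filenames:

```python
import re

def prompt_to_filename(prompt: str, ext: str = "png") -> str:
    """Turn an arbitrary prompt into a safe, readable filename."""
    # Keep letters, digits, and spaces; drop everything else.
    stem = re.sub(r"[^A-Za-z0-9 ]+", "", prompt).strip()
    # Collapse whitespace runs into underscores; fall back to a generic stem.
    stem = re.sub(r"\s+", "_", stem) or "image"
    return f"{stem[:64].lower()}.{ext}"
```

For example, `prompt_to_filename("a cute cat")` yields `a_cute_cat.png`.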

API Reference

Convenience Functions

text_to_image()

Generate images from a text prompt.

def text_to_image(
    model_path: str,
    prompt: str,
    negative_prompt: str = "",
    width: int = 512,
    height: int = 512,
    seed: int = -1,
    batch_count: int = 1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    sample_method: SampleMethod = None,
    scheduler: Scheduler = None,
    n_threads: int = -1,
    vae_path: str = None,
    taesd_path: str = None,
    clip_l_path: str = None,
    clip_g_path: str = None,
    t5xxl_path: str = None,
    control_net_path: str = None,
    lora_model_dir: str = None,
    clip_skip: int = -1,
    eta: float = 0.0,
    slg_scale: float = 0.0,
    vae_tiling: bool = False,
    offload_to_cpu: bool = False,
    keep_clip_on_cpu: bool = False,
    keep_vae_on_cpu: bool = False,
    diffusion_flash_attn: bool = False
) -> List[SDImage]
Parameter             Type          Default   Description
model_path            str           required  Path to model file
prompt                str           required  Text prompt
negative_prompt       str           ""        What to avoid
width                 int           512       Output width
height                int           512       Output height
seed                  int           -1        Random seed (-1 = random)
batch_count           int           1         Number of images
sample_steps          int           20        Sampling steps
cfg_scale             float         7.0       CFG guidance scale
sample_method         SampleMethod  None      Sampling method
scheduler             Scheduler     None      Noise scheduler
clip_skip             int           -1        CLIP layers to skip
n_threads             int           -1        Thread count (-1 = auto)
eta                   float         0.0       Eta for DDIM/TCD samplers
slg_scale             float         0.0       Skip layer guidance scale
vae_tiling            bool          False     Enable VAE tiling for large images
offload_to_cpu        bool          False     Offload weights to CPU (low VRAM)
diffusion_flash_attn  bool          False     Use flash attention

image_to_image()

Transform an existing image with text guidance.

def image_to_image(
    model_path: str,
    init_image: SDImage,
    prompt: str,
    negative_prompt: str = "",
    strength: float = 0.75,
    seed: int = -1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    ...
) -> List[SDImage]

The strength parameter (0.0-1.0) controls how much to transform the input image.
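As a rule of thumb (common img2img behavior, not a cyllama-specific guarantee), strength determines what fraction of the noise schedule is actually run, so the number of denoising steps applied is roughly strength × sample_steps:

```python
def effective_steps(strength: float, sample_steps: int) -> int:
    """Approximate the denoising steps actually run in img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0.0, 1.0]")
    # At strength 1.0 the image is fully re-noised and all steps run;
    # at 0.0 the input is returned almost unchanged.
    return max(1, round(strength * sample_steps))
```

With the defaults above, `effective_steps(0.75, 20)` gives 15 steps of denoising.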

SDContext

Main context class for model loading and generation.

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd-v1-5.gguf"
params.n_threads = 4

ctx = SDContext(params)

# Check if loaded successfully
if ctx.is_valid:
    images = ctx.generate(
        prompt="a beautiful landscape",
        negative_prompt="blurry, ugly",
        width=512,
        height=512,
        sample_steps=20,
        cfg_scale=7.0,
        flow_shift=0.0
    )

Methods:

Method                       Description
generate(...)                Generate images from a text prompt
generate_video(...)          Generate video frames (requires a video model)
get_default_sample_method()  Get the model's default sampler
get_default_scheduler()      Get the model's default scheduler
is_valid                     Property: True if the model loaded successfully

SDContextParams

Configuration for model loading.

params = SDContextParams()

# Model paths
params.model_path = "model.gguf"              # Main model
params.diffusion_model_path = "unet.gguf"     # Diffusion model (for split models)
params.vae_path = "vae.safetensors"           # VAE model
params.clip_l_path = "clip_l.safetensors"     # CLIP-L (SDXL/SD3)
params.clip_g_path = "clip_g.safetensors"     # CLIP-G (SDXL/SD3)
params.clip_vision_path = "clip_vision.safetensors"  # CLIP vision
params.t5xxl_path = "t5xxl.safetensors"       # T5-XXL (SD3/FLUX)
params.llm_path = "qwen.gguf"                 # LLM encoder (FLUX2)
params.llm_vision_path = "qwen_vision.gguf"   # LLM vision encoder
params.taesd_path = "taesd.safetensors"       # TAESD for fast preview
params.control_net_path = "controlnet.gguf"   # ControlNet model
params.photo_maker_path = "photomaker.bin"    # PhotoMaker model
params.high_noise_diffusion_model_path = "..."  # High-noise model (Wan2.2 MoE)
params.lora_model_dir = "loras/"              # LoRA directory
params.embedding_dir = "embeddings/"          # Embeddings directory
params.tensor_type_rules = "^vae\\.=f16"      # Mixed precision rules

# Numeric/enum parameters
params.n_threads = 4                          # Thread count
params.wtype = SDType.F16                     # Weight type
params.rng_type = RngType.CUDA                # RNG type
params.sampler_rng_type = RngType.CPU         # Sampler RNG type
params.prediction = Prediction.DEFAULT        # Prediction type
params.lora_apply_mode = LoraApplyMode.AUTO   # LoRA application mode
params.chroma_t5_mask_pad = 0                 # Chroma T5 mask pad

# Boolean flags
params.vae_decode_only = True                 # VAE decode only (faster)
params.enable_mmap = True                     # Enable memory-mapped loading
params.offload_params_to_cpu = False          # Offload to CPU (low VRAM)
params.keep_clip_on_cpu = False               # Keep CLIP on CPU
params.keep_vae_on_cpu = False                # Keep VAE on CPU
params.keep_control_net_on_cpu = False        # Keep ControlNet on CPU
params.diffusion_flash_attn = False           # Flash attention
params.diffusion_conv_direct = False          # Direct convolution
params.vae_conv_direct = False                # VAE direct convolution
params.tae_preview_only = False               # TAESD for preview only
params.circular_x = False                     # Circular padding X (tileable)
params.circular_y = False                     # Circular padding Y (tileable)
params.qwen_image_zero_cond_t = False         # Zero conditioning for Qwen
params.chroma_use_dit_mask = True             # DiT mask for Chroma
params.chroma_use_t5_mask = False             # T5 mask for Chroma

SDImage

Image wrapper with numpy and PIL integration, plus file I/O.

from cyllama.sd import SDImage

# Load from file (PNG, JPEG, BMP, TGA, GIF, PSD, HDR, PIC supported)
img = SDImage.load("input.png")
img = SDImage.load("input.jpg", channels=3)  # Force RGB

# Properties
print(img.width, img.height, img.channels)
print(img.shape)   # (H, W, C)
print(img.is_valid)

# Save to file (PNG, JPEG, BMP supported)
img.save("output.png")
img.save("output.jpg", quality=90)
img.save("output.bmp")

# Convert to numpy (requires numpy)
arr = img.to_numpy()  # Returns (H, W, C) uint8 array

# Create from numpy
img = SDImage.from_numpy(arr)

# Convert to PIL (requires Pillow)
pil_img = img.to_pil()
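from_numpy expects an (H, W, C) uint8 array. A small validation helper (hypothetical, not part of cyllama) can catch shape and dtype mistakes before handing data to SDImage:

```python
import numpy as np

def validate_image_array(arr: np.ndarray) -> np.ndarray:
    """Check that an array matches the (H, W, C) uint8 layout SDImage uses."""
    if arr.ndim != 3 or arr.shape[2] not in (1, 3, 4):
        raise ValueError(f"expected (H, W, C) with C in 1/3/4, got {arr.shape}")
    if arr.dtype != np.uint8:
        # Assumption: float inputs are normalized to [0, 1]; rescale to uint8.
        arr = (np.clip(arr, 0.0, 1.0) * 255).astype(np.uint8)
    # SDImage consumes raw pixel buffers, so ensure contiguous memory.
    return np.ascontiguousarray(arr)
```

Run arrays through this before `SDImage.from_numpy(...)` to fail early with a clear message.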

SDImageGenParams

Detailed generation parameters for advanced control.

from cyllama.sd import SDImageGenParams, SDImage

params = SDImageGenParams()
params.prompt = "a cute cat"
params.negative_prompt = "ugly, blurry"
params.width = 512
params.height = 512
params.seed = 42
params.batch_count = 1
params.strength = 0.75           # For img2img
params.clip_skip = -1
params.control_strength = 0.9    # ControlNet strength

# VAE tiling for large images
params.vae_tiling_enabled = True
params.vae_tile_size = (512, 512)
params.vae_tile_overlap = 0.5

# EasyCache acceleration
params.easycache_enabled = True
params.easycache_threshold = 0.1
params.easycache_range = (0.0, 1.0)

# Set init image for img2img
init_img = SDImage.load("input.png")
params.set_init_image(init_img)

# Set mask for inpainting
mask_img = SDImage.load("mask.png")
params.set_mask_image(mask_img)

# Set control image for ControlNet
params.set_control_image(control_img, strength=0.8)

# Access sample parameters
sample = params.sample_params
sample.sample_steps = 20
sample.cfg_scale = 7.0
sample.sample_method = SampleMethod.EULER
sample.scheduler = Scheduler.KARRAS
sample.eta = 0.0
sample.slg_scale = 2.5           # Skip layer guidance
sample.slg_layer_start = 0.01
sample.slg_layer_end = 0.2
sample.img_cfg_scale = 1.5       # Image CFG (inpaint)
sample.distilled_guidance = 3.5  # For FLUX

SDSampleParams

Sampling configuration.

from cyllama.sd import SDSampleParams, SampleMethod, Scheduler

params = SDSampleParams()
params.sample_method = SampleMethod.EULER_A
params.scheduler = Scheduler.KARRAS
params.sample_steps = 20
params.cfg_scale = 7.0
params.eta = 0.0                  # Noise multiplier
params.shifted_timestep = 0       # NitroFusion models
params.flow_shift = 0.0           # Flow shift (SD3.x/Wan)
params.img_cfg_scale = 1.5        # Image guidance
params.distilled_guidance = 3.5   # FLUX guidance
params.slg_scale = 0.0            # Skip layer guidance
params.slg_layer_start = 0.01
params.slg_layer_end = 0.2

Upscaler

ESRGAN-based image upscaling.

from cyllama.sd import Upscaler, SDImage

# Load upscaler model
upscaler = Upscaler(
    "models/esrgan-x4.bin",
    n_threads=4,
    offload_to_cpu=False,
    direct=False
)

# Check upscale factor
print(f"Factor: {upscaler.upscale_factor}x")

# Upscale an image
img = SDImage.load("input.png")
upscaled = upscaler.upscale(img)
upscaled.save("upscaled.png")

# Multiple upscale passes
for _ in range(2):
    img = upscaler.upscale(img)  # 16x total

Enums

SampleMethod

Sampling methods for diffusion:

Value          Description
EULER          Euler method
EULER_A        Euler ancestral
HEUN           Heun's method
DPM2           DPM-2
DPMPP2S_A      DPM++ 2S ancestral
DPMPP2M        DPM++ 2M
DPMPP2Mv2      DPM++ 2M v2
IPNDM          IPNDM
IPNDM_V        IPNDM-V
LCM            Latent Consistency Model
DDIM_TRAILING  DDIM trailing
TCD            TCD

Scheduler

Noise schedulers:

Value        Description
DISCRETE     Discrete scheduler
KARRAS       Karras scheduler
EXPONENTIAL  Exponential scheduler
AYS          AYS scheduler
GITS         GITS scheduler
SGM_UNIFORM  SGM uniform
SIMPLE       Simple scheduler
SMOOTHSTEP   Smoothstep scheduler
LCM          LCM scheduler

Prediction

Prediction types:

Value       Description
DEFAULT     Auto-detect from model
EPS         Epsilon prediction
V           V-prediction
EDM_V       EDM V-prediction
SD3_FLOW    SD3 flow matching
FLUX_FLOW   FLUX flow matching
FLUX2_FLOW  FLUX2 flow matching

SDType

Data types for quantization:

  • Float: F32, F16, BF16
  • 4-bit: Q4_0, Q4_1, Q4_K
  • 5-bit: Q5_0, Q5_1, Q5_K
  • 8-bit: Q8_0, Q8_1, Q8_K
  • K-quants: Q2_K, Q3_K, Q6_K

LoraApplyMode

LoRA application modes:

Value        Description
AUTO         Auto-detect best mode
IMMEDIATELY  Apply at load time
AT_RUNTIME   Apply during generation

PreviewMode

Preview modes during generation:

Value  Description
NONE   No preview
PROJ   Projection preview
TAE    TAESD preview
VAE    Full VAE preview

Callbacks

Set callbacks for logging, progress, and preview during generation.

from cyllama.sd import (
    set_log_callback,
    set_progress_callback,
    set_preview_callback,
    LogLevel,
    PreviewMode
)

# Log callback
def log_cb(level: LogLevel, text: str):
    level_names = {0: 'DEBUG', 1: 'INFO', 2: 'WARN', 3: 'ERROR'}
    print(f'[{level_names.get(level, level)}] {text}', end='')

set_log_callback(log_cb)

# Progress callback
def progress_cb(step: int, steps: int, time_ms: float):
    pct = (step / steps) * 100 if steps > 0 else 0
    print(f'Step {step}/{steps} ({pct:.1f}%) - {time_ms:.0f}ms')

set_progress_callback(progress_cb)

# Preview callback
def preview_cb(step: int, frames: list, is_noisy: bool):
    if frames:
        frames[0].save(f"preview_{step}.png")

set_preview_callback(
    preview_cb,
    mode=PreviewMode.TAE,
    interval=5,
    denoised=True,
    noisy=False
)

# Clear callbacks
set_log_callback(None)
set_progress_callback(None)
set_preview_callback(None)

Model Conversion

Convert models between formats with optional quantization.

from cyllama.sd import convert_model, SDType

convert_model(
    input_path="sd-v1-5.safetensors",
    output_path="sd-v1-5-q4_0.gguf",
    output_type=SDType.Q4_0,
    vae_path="vae-ft-mse.safetensors",  # Optional
    tensor_type_rules="^vae\\.=f16"      # Optional mixed precision
)
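tensor_type_rules pairs a regex with a target type (pattern=type, comma-separated), and the first matching rule wins. A sketch of how such a rule resolves a tensor name, assuming this semantics matches stable-diffusion.cpp's:

```python
import re

def resolve_tensor_type(name: str, rules: str, default: str = "q4_0") -> str:
    """Return the type for the first rule whose regex matches the tensor name."""
    for rule in rules.split(","):
        pattern, _, ttype = rule.partition("=")
        if pattern and ttype and re.search(pattern, name):
            return ttype
    # No rule matched: fall back to the overall output type.
    return default
```

Under the rule `^vae\.=f16` above, a tensor named `vae.decoder.conv_in.weight` stays at f16 while everything else gets the requested quantization.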

ControlNet Preprocessing

Apply Canny edge detection for ControlNet conditioning.

from cyllama.sd import SDImage, canny_preprocess

img = SDImage.load("photo.png")

# Apply Canny preprocessing (modifies image in place)
success = canny_preprocess(
    img,
    high_threshold=0.8,
    low_threshold=0.1,
    weak=0.5,
    strong=1.0,
    inverse=False
)

img.save("edges.png")
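The two thresholds implement standard Canny hysteresis on normalized gradient magnitudes: above high_threshold a pixel is a strong edge, below low_threshold it is discarded, and values in between are kept only as weak edges (which survive if connected to a strong edge). A minimal sketch of that classification, illustrative only:

```python
def classify_edge(magnitude: float, low: float = 0.1, high: float = 0.8) -> str:
    """Classify a normalized gradient magnitude for hysteresis thresholding."""
    if magnitude >= high:
        return "strong"
    if magnitude >= low:
        return "weak"  # kept only if connected to a strong edge
    return "none"
```

Raising low_threshold removes faint edges; lowering high_threshold keeps more strong edges for the ControlNet to follow.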

CLI Tool

Command-line interface with subcommands for all operations.

txt2img - Text to Image

python -m cyllama.sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png \
    --steps 4 --cfg-scale 1.0

# Using diffusion model directly (FLUX, etc.)
python -m cyllama.sd txt2img \
    --diffusion-model models/flux-dev.gguf \
    --vae models/ae.safetensors \
    --clip-l models/clip_l.safetensors \
    --t5xxl models/t5xxl.gguf \
    --prompt "a photo of a cat" \
    -W 1024 -H 1024

# With memory optimization
python -m cyllama.sd txt2img \
    --diffusion-model models/flux.gguf \
    --vae models/ae.safetensors \
    --llm models/qwen.gguf \
    --offload-to-cpu \
    --diffusion-fa \
    --prompt "a lovely cat" \
    -W 512 -H 1024

img2img - Image to Image

python -m cyllama.sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7 \
    --output styled.png

inpaint - Inpainting

python -m cyllama.sd inpaint \
    --model models/sd-inpaint.gguf \
    --init-img photo.png \
    --mask mask.png \
    --prompt "a red hat" \
    --output inpainted.png

controlnet - ControlNet Guided Generation

python -m cyllama.sd controlnet \
    --model models/sd-v1-5.gguf \
    --control-net models/control_canny.gguf \
    --control-image edges.png \
    --prompt "a beautiful landscape" \
    --control-strength 0.9

# With automatic Canny preprocessing
python -m cyllama.sd controlnet \
    --model models/sd-v1-5.gguf \
    --control-net models/control_canny.gguf \
    --control-image photo.png \
    --canny \
    --prompt "anime style"

video - Video Generation

# Text to video
python -m cyllama.sd video \
    --model models/wan2.1.gguf \
    --prompt "a cat walking" \
    --video-frames 16 \
    --fps 24

# Image to video
python -m cyllama.sd video \
    --model models/wan2.1.gguf \
    --init-img first_frame.png \
    --prompt "camera slowly zooms in" \
    --video-frames 24

# Frame interpolation
python -m cyllama.sd video \
    --model models/wan2.1.gguf \
    --init-img start.png \
    --end-img end.png \
    --video-frames 16

upscale - Image Upscaling

python -m cyllama.sd upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_4x.png

# Multiple passes
python -m cyllama.sd upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_16x.png \
    --repeats 2

convert - Model Conversion

python -m cyllama.sd convert \
    --input sd-v1-5.safetensors \
    --output sd-v1-5-q4_0.gguf \
    --type q4_0

# With VAE baking
python -m cyllama.sd convert \
    --input sdxl-base.safetensors \
    --output sdxl-q8_0.gguf \
    --type q8_0 \
    --vae sdxl-vae.safetensors

info - System Information

python -m cyllama.sd info

CLI Options Reference

Model Options (most subcommands):

Option             Description
--model, -m        Main model file
--diffusion-model  Diffusion model (for split architectures)
--vae              VAE model
--taesd            TAESD model (fast preview)
--clip-l           CLIP-L model (SDXL/SD3)
--clip-g           CLIP-G model (SDXL/SD3)
--clip-vision      CLIP vision model
--t5xxl            T5-XXL model (SD3/FLUX)
--llm              LLM text encoder (FLUX2)
--llm-vision       LLM vision encoder
--control-net      ControlNet model
--lora-dir         LoRA models directory
--embd-dir         Embeddings directory

Generation Options:

Option          Description
--prompt, -p    Text prompt
--negative, -n  Negative prompt
--output, -o    Output file path
--width, -W     Image width
--height, -H    Image height
--steps         Sampling steps
--cfg-scale     CFG guidance scale
--seed, -s      Random seed (-1 = random)
--batch, -b     Batch count
--clip-skip     CLIP layers to skip

Sampler Options:

Option        Description
--sampler     Sampling method
--scheduler   Noise scheduler
--eta         Eta for DDIM/TCD
--rng         RNG type (std_default, cuda, cpu)
--prediction  Prediction type override

Guidance Options:

Option              Description
--slg-scale         Skip layer guidance scale
--skip-layer-start  SLG start point
--skip-layer-end    SLG end point
--guidance          Distilled guidance (FLUX)
--img-cfg-scale     Image CFG scale

Memory Options:

Option             Description
--threads, -t      Thread count
--offload-to-cpu   Offload weights to CPU
--clip-on-cpu      Keep CLIP on CPU
--vae-on-cpu       Keep VAE on CPU
--control-net-cpu  Keep ControlNet on CPU
--diffusion-fa     Flash attention
--vae-tiling       Enable VAE tiling

Other Options:

Option         Description
--verbose, -v  Verbose output
--progress     Show progress
--preview      Preview mode (none, proj, tae, vae)

Supported Models

Model Family   Examples                    Notes
SD 1.x/2.x     sd-v1-5, sd-v2-1            Standard models
SDXL           sdxl-base, sdxl-turbo       Use cfg_scale=1.0, steps=1-4 for turbo
SD3/SD3.5      sd3-medium, sd3.5-large     May need T5-XXL encoder
FLUX           flux.1-dev, flux.1-schnell  Needs clip_l + t5xxl or llm
FLUX2          flux2-*                     Uses LLM encoder (Qwen)
Wan/CogVideoX  wan-2.1, cogvideox          Video generation
LoRA           *.safetensors               Place in lora_model_dir
ControlNet     control_*                   Use with control images
ESRGAN         esrgan-x4                   Upscaling only

Utility Functions

from cyllama.sd import (
    get_num_cores,
    get_system_info,
    type_name,
    sample_method_name,
    scheduler_name
)

print(f"CPU cores: {get_num_cores()}")
print(get_system_info())
print(type_name(SDType.Q4_0))           # "q4_0"
print(sample_method_name(SampleMethod.EULER))  # "euler"
print(scheduler_name(Scheduler.KARRAS))  # "karras"

Performance Tips

  1. Use turbo models for fast generation (1-4 steps, cfg_scale=1.0)
  2. Quantize models to Q4_0 or Q8_0 for memory efficiency
  3. Reuse SDContext when generating multiple images
  4. Set n_threads to match physical CPU cores
  5. Use --offload-to-cpu for low VRAM GPUs
  6. Enable --diffusion-fa (flash attention) for faster inference
  7. Use --vae-tiling for generating large images
  8. Use progress callback to track long generations
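For tip 4, note that os.cpu_count() reports logical cores; on machines with SMT/hyper-threading the physical core count is typically half that. A conservative helper (a heuristic sketch, not a cyllama API):

```python
import os
from typing import Optional

def pick_threads(logical: Optional[int] = None, smt: bool = True) -> int:
    """Estimate a thread count near the physical core count."""
    logical = logical or os.cpu_count() or 1
    # Assumption: with SMT enabled, physical cores ~= logical cores / 2.
    return max(1, logical // 2 if smt else logical)
```

Then set `params.n_threads = pick_threads()` (or pass the result as n_threads to the convenience functions).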

Troubleshooting

Model Loading Errors

import os
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model not found: {model_path}")

ctx = SDContext(params)
if not ctx.is_valid:
    raise RuntimeError(f"Failed to load model: {model_path}")

Out of Memory

  • Use smaller model (SD 1.5 vs SDXL)
  • Use quantized model (Q4_0 vs F16)
  • Reduce image dimensions
  • Reduce batch_count
  • Enable --offload-to-cpu
  • Enable --vae-tiling for large images

Slow Generation

  • Use turbo/LCM models with fewer steps
  • Enable flash attention (--diffusion-fa)
  • Increase n_threads
  • Use direct convolution (--diffusion-conv-direct)

FLUX/SD3 Models Not Working

  • Ensure you have the required encoders (clip_l, t5xxl)
  • For FLUX2, use --llm instead of --t5xxl
  • Check prediction type matches model

See Also