Stable Diffusion Integration

Cyllama wraps stable-diffusion.cpp to provide image and video generation capabilities in Python.

Note: Build with WITH_STABLEDIFFUSION=1 to enable this module. By default, stable-diffusion.cpp links against llama.cpp's ggml. To use stable-diffusion.cpp's own vendored ggml instead, set SD_USE_VENDORED_GGML=1.

Overview

The stable diffusion module provides Python bindings to stable-diffusion.cpp, enabling:

  • Text-to-image generation
  • Image-to-image transformation
  • Inpainting with masks
  • ControlNet guided generation
  • Video generation (with compatible models like Wan, CogVideoX)
  • ESRGAN image upscaling
  • Model format conversion

Quick Start

Text-to-Image

from cyllama.sd import text_to_image

images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)

images[0].save("output.png")

With Model Reuse

When generating multiple images, reuse the context to avoid reloading the model for each prompt:

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"

with SDContext(params) as ctx:
    for prompt in ["a cat", "a dog", "a bird"]:
        images = ctx.generate(
            prompt=prompt,
            sample_steps=4,
            cfg_scale=1.0
        )
        images[0].save(f"{prompt.replace(' ', '_')}.png")
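The loop above derives filenames by replacing spaces. For arbitrary prompts, a slightly more defensive helper (hypothetical, not part of cyllama) also strips characters that are invalid in filenames:

```python
import re

def prompt_to_filename(prompt: str, ext: str = "png") -> str:
    """Turn an arbitrary prompt into a safe, readable filename."""
    # Keep letters, digits, and spaces; drop everything else.
    stem = re.sub(r"[^A-Za-z0-9 ]+", "", prompt).strip()
    # Collapse whitespace runs into underscores; fall back to a generic stem.
    stem = re.sub(r"\s+", "_", stem) or "image"
    return f"{stem[:64].lower()}.{ext}"
```

For example, `prompt_to_filename("a cute cat")` yields `a_cute_cat.png`.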

API Reference

Convenience Functions

text_to_image()

Generate images from a text prompt.

def text_to_image(
    model_path: str,
    prompt: str,
    negative_prompt: str = "",
    width: int = 512,
    height: int = 512,
    seed: int = -1,
    batch_count: int = 1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    sample_method: SampleMethod = None,
    scheduler: Scheduler = None,
    n_threads: int = -1,
    vae_path: str = None,
    taesd_path: str = None,
    clip_l_path: str = None,
    clip_g_path: str = None,
    t5xxl_path: str = None,
    control_net_path: str = None,
    lora_model_dir: str = None,
    clip_skip: int = -1,
    eta: float = 0.0,
    slg_scale: float = 0.0,
    vae_tiling: bool = False,
    offload_to_cpu: bool = False,
    keep_clip_on_cpu: bool = False,
    keep_vae_on_cpu: bool = False,
    diffusion_flash_attn: bool = False
) -> List[SDImage]
Parameter             Type          Default   Description
model_path            str           required  Path to model file
prompt                str           required  Text prompt
negative_prompt       str           ""        What to avoid
width                 int           512       Output width
height                int           512       Output height
seed                  int           -1        Random seed (-1 = random)
batch_count           int           1         Number of images
sample_steps          int           20        Sampling steps
cfg_scale             float         7.0       CFG guidance scale
sample_method         SampleMethod  None      Sampling method
scheduler             Scheduler     None      Noise scheduler
clip_skip             int           -1        CLIP layers to skip
n_threads             int           -1        Thread count (-1 = auto)
eta                   float         0.0       Eta for DDIM/TCD samplers
slg_scale             float         0.0       Skip layer guidance scale
vae_tiling            bool          False     Enable VAE tiling for large images
offload_to_cpu        bool          False     Offload weights to CPU (low VRAM)
diffusion_flash_attn  bool          False     Use flash attention

image_to_image()

Transform an existing image with text guidance.

def image_to_image(
    model_path: str,
    init_image: SDImage,
    prompt: str,
    negative_prompt: str = "",
    strength: float = 0.75,
    seed: int = -1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    ...
) -> List[SDImage]

The strength parameter (0.0-1.0) controls how much to transform the input image.
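As a rule of thumb (common img2img behavior, not a cyllama-specific guarantee), strength determines what fraction of the noise schedule is actually run, so the number of denoising steps applied is roughly strength × sample_steps:

```python
def effective_steps(strength: float, sample_steps: int) -> int:
    """Approximate the denoising steps actually run in img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0.0, 1.0]")
    # At strength 1.0 the image is fully re-noised and all steps run;
    # at 0.0 the input is returned almost unchanged.
    return max(1, round(strength * sample_steps))
```

With the defaults above, `effective_steps(0.75, 20)` gives 15 steps of denoising.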

SDContext

Main context class for model loading and generation.

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd-v1-5.gguf"
params.n_threads = 4

ctx = SDContext(params)

# Check if loaded successfully
if ctx.is_valid:
    images = ctx.generate(
        prompt="a beautiful landscape",
        negative_prompt="blurry, ugly",
        width=512,
        height=512,
        sample_steps=20,
        cfg_scale=7.0,
        flow_shift=0.0
    )

Methods:

Method                       Description
generate(...)                Generate images from a text prompt
generate_video(...)          Generate video frames (requires a video model)
get_default_sample_method()  Get the model's default sampler
get_default_scheduler()      Get the model's default scheduler
is_valid                     Property: True if the model loaded successfully

SDContextParams

Configuration for model loading.

params = SDContextParams()

# Model paths
params.model_path = "model.gguf"              # Main model
params.diffusion_model_path = "unet.gguf"     # Diffusion model (for split models)
params.vae_path = "vae.safetensors"           # VAE model
params.clip_l_path = "clip_l.safetensors"     # CLIP-L (SDXL/SD3)
params.clip_g_path = "clip_g.safetensors"     # CLIP-G (SDXL/SD3)
params.clip_vision_path = "clip_vision.safetensors"  # CLIP vision
params.t5xxl_path = "t5xxl.safetensors"       # T5-XXL (SD3/FLUX)
params.llm_path = "qwen.gguf"                 # LLM encoder (FLUX2)
params.llm_vision_path = "qwen_vision.gguf"   # LLM vision encoder
params.taesd_path = "taesd.safetensors"       # TAESD for fast preview
params.control_net_path = "controlnet.gguf"   # ControlNet model
params.photo_maker_path = "photomaker.bin"    # PhotoMaker model
params.high_noise_diffusion_model_path = "..."  # High-noise model (Wan2.2 MoE)
params.lora_model_dir = "loras/"              # LoRA directory
params.embedding_dir = "embeddings/"          # Embeddings directory
params.tensor_type_rules = "^vae\\.=f16"      # Mixed precision rules

# Numeric/enum parameters
params.n_threads = 4                          # Thread count
params.wtype = SDType.F16                     # Weight type
params.rng_type = RngType.CUDA                # RNG type
params.sampler_rng_type = RngType.CPU         # Sampler RNG type
params.prediction = Prediction.DEFAULT        # Prediction type
params.lora_apply_mode = LoraApplyMode.AUTO   # LoRA application mode
params.chroma_t5_mask_pad = 0                 # Chroma T5 mask pad

# Boolean flags
params.vae_decode_only = True                 # VAE decode only (faster)
params.enable_mmap = True                     # Enable memory-mapped loading
params.offload_params_to_cpu = False          # Offload to CPU (low VRAM)
params.keep_clip_on_cpu = False               # Keep CLIP on CPU
params.keep_vae_on_cpu = False                # Keep VAE on CPU
params.keep_control_net_on_cpu = False        # Keep ControlNet on CPU
params.diffusion_flash_attn = False           # Flash attention
params.diffusion_conv_direct = False          # Direct convolution
params.vae_conv_direct = False                # VAE direct convolution
params.tae_preview_only = False               # TAESD for preview only
params.circular_x = False                     # Circular padding X (tileable)
params.circular_y = False                     # Circular padding Y (tileable)
params.qwen_image_zero_cond_t = False         # Zero conditioning for Qwen
params.chroma_use_dit_mask = True             # DiT mask for Chroma
params.chroma_use_t5_mask = False             # T5 mask for Chroma

SDImage

Image wrapper with numpy and PIL integration, plus file I/O.

from cyllama.sd import SDImage

# Load from file (PNG, JPEG, BMP, TGA, GIF, PSD, HDR, PIC supported)
img = SDImage.load("input.png")
img = SDImage.load("input.jpg", channels=3)  # Force RGB

# Properties
print(img.width, img.height, img.channels)
print(img.shape)   # (H, W, C)
print(img.is_valid)

# Save to file (PNG, JPEG, BMP supported)
img.save("output.png")
img.save("output.jpg", quality=90)
img.save("output.bmp")

# Convert to numpy (requires numpy)
arr = img.to_numpy()  # Returns (H, W, C) uint8 array

# Create from numpy
img = SDImage.from_numpy(arr)

# Convert to PIL (requires Pillow)
pil_img = img.to_pil()
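from_numpy expects an (H, W, C) uint8 array. A small validation helper (hypothetical, not part of cyllama) can catch shape and dtype mistakes before handing data to SDImage:

```python
import numpy as np

def validate_image_array(arr: np.ndarray) -> np.ndarray:
    """Check that an array matches the (H, W, C) uint8 layout SDImage uses."""
    if arr.ndim != 3 or arr.shape[2] not in (1, 3, 4):
        raise ValueError(f"expected (H, W, C) with C in 1/3/4, got {arr.shape}")
    if arr.dtype != np.uint8:
        # Assumption: float inputs are normalized to [0, 1]; rescale to uint8.
        arr = (np.clip(arr, 0.0, 1.0) * 255).astype(np.uint8)
    # SDImage consumes raw pixel buffers, so ensure contiguous memory.
    return np.ascontiguousarray(arr)
```

Run arrays through this before `SDImage.from_numpy(...)` to fail early with a clear message.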

SDImageGenParams

Detailed generation parameters for advanced control.

from cyllama.sd import SDImageGenParams, SDImage

params = SDImageGenParams()
params.prompt = "a cute cat"
params.negative_prompt = "ugly, blurry"
params.width = 512
params.height = 512
params.seed = 42
params.batch_count = 1
params.strength = 0.75           # For img2img
params.clip_skip = -1
params.control_strength = 0.9    # ControlNet strength

# VAE tiling for large images
params.vae_tiling_enabled = True
params.vae_tile_size = (512, 512)
params.vae_tile_overlap = 0.5

# EasyCache acceleration
params.easycache_enabled = True
params.easycache_threshold = 0.1
params.easycache_range = (0.0, 1.0)

# Set init image for img2img
init_img = SDImage.load("input.png")
params.set_init_image(init_img)

# Set mask for inpainting
mask_img = SDImage.load("mask.png")
params.set_mask_image(mask_img)

# Set control image for ControlNet
params.set_control_image(control_img, strength=0.8)

# Access sample parameters
sample = params.sample_params
sample.sample_steps = 20
sample.cfg_scale = 7.0
sample.sample_method = SampleMethod.EULER
sample.scheduler = Scheduler.KARRAS
sample.eta = 0.0
sample.slg_scale = 2.5           # Skip layer guidance
sample.slg_layer_start = 0.01
sample.slg_layer_end = 0.2
sample.img_cfg_scale = 1.5       # Image CFG (inpaint)
sample.distilled_guidance = 3.5  # For FLUX

SDSampleParams

Sampling configuration.

from cyllama.sd import SDSampleParams, SampleMethod, Scheduler

params = SDSampleParams()
params.sample_method = SampleMethod.EULER_A
params.scheduler = Scheduler.KARRAS
params.sample_steps = 20
params.cfg_scale = 7.0
params.eta = 0.0                  # Noise multiplier
params.shifted_timestep = 0       # NitroFusion models
params.flow_shift = 0.0           # Flow shift (SD3.x/Wan)
params.img_cfg_scale = 1.5        # Image guidance
params.distilled_guidance = 3.5   # FLUX guidance
params.slg_scale = 0.0            # Skip layer guidance
params.slg_layer_start = 0.01
params.slg_layer_end = 0.2

Upscaler

ESRGAN-based image upscaling.

from cyllama.sd import Upscaler, SDImage

# Load upscaler model
upscaler = Upscaler(
    "models/esrgan-x4.bin",
    n_threads=4,
    offload_to_cpu=False,
    direct=False
)

# Check upscale factor
print(f"Factor: {upscaler.upscale_factor}x")

# Upscale an image
img = SDImage.load("input.png")
upscaled = upscaler.upscale(img)
upscaled.save("upscaled.png")

# Multiple upscale passes
for _ in range(2):
    img = upscaler.upscale(img)  # 16x total

Enums

SampleMethod

Sampling methods for diffusion:

Value          Description
EULER          Euler method
EULER_A        Euler ancestral
HEUN           Heun's method
DPM2           DPM-2
DPMPP2S_A      DPM++ 2S ancestral
DPMPP2M        DPM++ 2M
DPMPP2Mv2      DPM++ 2M v2
IPNDM          IPNDM
IPNDM_V        IPNDM-V
LCM            Latent Consistency Model
DDIM_TRAILING  DDIM trailing
TCD            TCD

Scheduler

Noise schedulers:

Value        Description
DISCRETE     Discrete scheduler
KARRAS       Karras scheduler
EXPONENTIAL  Exponential scheduler
AYS          AYS scheduler
GITS         GITS scheduler
SGM_UNIFORM  SGM uniform
SIMPLE       Simple scheduler
SMOOTHSTEP   Smoothstep scheduler
LCM          LCM scheduler

Prediction

Prediction types:

Value       Description
DEFAULT     Auto-detect from model
EPS         Epsilon prediction
V           V-prediction
EDM_V       EDM V-prediction
SD3_FLOW    SD3 flow matching
FLUX_FLOW   FLUX flow matching
FLUX2_FLOW  FLUX2 flow matching

SDType

Data types for quantization:

  • Float: F32, F16, BF16
  • 4-bit: Q4_0, Q4_1, Q4_K
  • 5-bit: Q5_0, Q5_1, Q5_K
  • 8-bit: Q8_0, Q8_1, Q8_K
  • K-quants: Q2_K, Q3_K, Q6_K

LoraApplyMode

LoRA application modes:

Value        Description
AUTO         Auto-detect best mode
IMMEDIATELY  Apply at load time
AT_RUNTIME   Apply during generation

PreviewMode

Preview modes during generation:

Value  Description
NONE   No preview
PROJ   Projection preview
TAE    TAESD preview
VAE    Full VAE preview

Callbacks

Set callbacks for logging, progress, and preview during generation.

from cyllama.sd import (
    set_log_callback,
    set_progress_callback,
    set_preview_callback,
    LogLevel,
    PreviewMode
)

# Log callback
def log_cb(level: LogLevel, text: str):
    level_names = {0: 'DEBUG', 1: 'INFO', 2: 'WARN', 3: 'ERROR'}
    print(f'[{level_names.get(level, level)}] {text}', end='')

set_log_callback(log_cb)

# Progress callback
def progress_cb(step: int, steps: int, time_ms: float):
    pct = (step / steps) * 100 if steps > 0 else 0
    print(f'Step {step}/{steps} ({pct:.1f}%) - {time_ms:.0f}ms')

set_progress_callback(progress_cb)

# Preview callback
def preview_cb(step: int, frames: list, is_noisy: bool):
    if frames:
        frames[0].save(f"preview_{step}.png")

set_preview_callback(
    preview_cb,
    mode=PreviewMode.TAE,
    interval=5,
    denoised=True,
    noisy=False
)

# Clear callbacks
set_log_callback(None)
set_progress_callback(None)
set_preview_callback(None)

Model Conversion

Convert models between formats with optional quantization.

from cyllama.sd import convert_model, SDType

convert_model(
    input_path="sd-v1-5.safetensors",
    output_path="sd-v1-5-q4_0.gguf",
    output_type=SDType.Q4_0,
    vae_path="vae-ft-mse.safetensors",  # Optional
    tensor_type_rules="^vae\\.=f16"      # Optional mixed precision
)
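tensor_type_rules pairs a regex with a target type (pattern=type, comma-separated), and the first matching rule wins. A sketch of how such a rule resolves a tensor name, assuming this semantics matches stable-diffusion.cpp's:

```python
import re

def resolve_tensor_type(name: str, rules: str, default: str = "q4_0") -> str:
    """Return the type for the first rule whose regex matches the tensor name."""
    for rule in rules.split(","):
        pattern, _, ttype = rule.partition("=")
        if pattern and ttype and re.search(pattern, name):
            return ttype
    # No rule matched: fall back to the overall output type.
    return default
```

Under the rule `^vae\.=f16` above, a tensor named `vae.decoder.conv_in.weight` stays at f16 while everything else gets the requested quantization.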

ControlNet Preprocessing

Apply Canny edge detection for ControlNet conditioning.

from cyllama.sd import SDImage, canny_preprocess

img = SDImage.load("photo.png")

# Apply Canny preprocessing (modifies image in place)
success = canny_preprocess(
    img,
    high_threshold=0.8,
    low_threshold=0.1,
    weak=0.5,
    strong=1.0,
    inverse=False
)

img.save("edges.png")
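The two thresholds implement standard Canny hysteresis on normalized gradient magnitudes: above high_threshold a pixel is a strong edge, below low_threshold it is discarded, and values in between are kept only as weak edges (which survive if connected to a strong edge). A minimal sketch of that classification, illustrative only:

```python
def classify_edge(magnitude: float, low: float = 0.1, high: float = 0.8) -> str:
    """Classify a normalized gradient magnitude for hysteresis thresholding."""
    if magnitude >= high:
        return "strong"
    if magnitude >= low:
        return "weak"  # kept only if connected to a strong edge
    return "none"
```

Raising low_threshold removes faint edges; lowering high_threshold keeps more strong edges for the ControlNet to follow.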

CLI Tool

Command-line interface with subcommands for all operations.

txt2img - Text to Image

python -m cyllama.sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png \
    --steps 4 --cfg-scale 1.0

# Using diffusion model directly (FLUX, etc.)
python -m cyllama.sd txt2img \
    --diffusion-model models/flux-dev.gguf \
    --vae models/ae.safetensors \
    --clip-l models/clip_l.safetensors \
    --t5xxl models/t5xxl.gguf \
    --prompt "a photo of a cat" \
    -W 1024 -H 1024

# With memory optimization
python -m cyllama.sd txt2img \
    --diffusion-model models/flux.gguf \
    --vae models/ae.safetensors \
    --llm models/qwen.gguf \
    --offload-to-cpu \
    --diffusion-fa \
    --prompt "a lovely cat" \
    -W 512 -H 1024

img2img - Image to Image

python -m cyllama.sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7 \
    --output styled.png

inpaint - Inpainting

python -m cyllama.sd inpaint \
    --model models/sd-inpaint.gguf \
    --init-img photo.png \
    --mask mask.png \
    --prompt "a red hat" \
    --output inpainted.png

controlnet - ControlNet Guided Generation

python -m cyllama.sd controlnet \
    --model models/sd-v1-5.gguf \
    --control-net models/control_canny.gguf \
    --control-image edges.png \
    --prompt "a beautiful landscape" \
    --control-strength 0.9

# With automatic Canny preprocessing
python -m cyllama.sd controlnet \
    --model models/sd-v1-5.gguf \
    --control-net models/control_canny.gguf \
    --control-image photo.png \
    --canny \
    --prompt "anime style"

video - Video Generation

# Text to video
python -m cyllama.sd video \
    --model models/wan2.1.gguf \
    --prompt "a cat walking" \
    --video-frames 16 \
    --fps 24

# Image to video
python -m cyllama.sd video \
    --model models/wan2.1.gguf \
    --init-img first_frame.png \
    --prompt "camera slowly zooms in" \
    --video-frames 24

# Frame interpolation
python -m cyllama.sd video \
    --model models/wan2.1.gguf \
    --init-img start.png \
    --end-img end.png \
    --video-frames 16

upscale - Image Upscaling

python -m cyllama.sd upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_4x.png

# Multiple passes
python -m cyllama.sd upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_16x.png \
    --repeats 2

convert - Model Conversion

python -m cyllama.sd convert \
    --input sd-v1-5.safetensors \
    --output sd-v1-5-q4_0.gguf \
    --type q4_0

# With VAE baking
python -m cyllama.sd convert \
    --input sdxl-base.safetensors \
    --output sdxl-q8_0.gguf \
    --type q8_0 \
    --vae sdxl-vae.safetensors

info - System Information

python -m cyllama.sd info

CLI Options Reference

Model Options (most subcommands):

Option             Description
--model, -m        Main model file
--diffusion-model  Diffusion model (for split architectures)
--vae              VAE model
--taesd            TAESD model (fast preview)
--clip-l           CLIP-L model (SDXL/SD3)
--clip-g           CLIP-G model (SDXL/SD3)
--clip-vision      CLIP vision model
--t5xxl            T5-XXL model (SD3/FLUX)
--llm              LLM text encoder (FLUX2)
--llm-vision       LLM vision encoder
--control-net      ControlNet model
--lora-dir         LoRA models directory
--embd-dir         Embeddings directory

Generation Options:

Option          Description
--prompt, -p    Text prompt
--negative, -n  Negative prompt
--output, -o    Output file path
--width, -W     Image width
--height, -H    Image height
--steps         Sampling steps
--cfg-scale     CFG guidance scale
--seed, -s      Random seed (-1 = random)
--batch, -b     Batch count
--clip-skip     CLIP layers to skip

Sampler Options:

Option        Description
--sampler     Sampling method
--scheduler   Noise scheduler
--eta         Eta for DDIM/TCD
--rng         RNG type (std_default, cuda, cpu)
--prediction  Prediction type override

Guidance Options:

Option              Description
--slg-scale         Skip layer guidance scale
--skip-layer-start  SLG start point
--skip-layer-end    SLG end point
--guidance          Distilled guidance (FLUX)
--img-cfg-scale     Image CFG scale

Memory Options:

Option             Description
--threads, -t      Thread count
--offload-to-cpu   Offload weights to CPU
--clip-on-cpu      Keep CLIP on CPU
--vae-on-cpu       Keep VAE on CPU
--control-net-cpu  Keep ControlNet on CPU
--diffusion-fa     Flash attention
--vae-tiling       Enable VAE tiling

Other Options:

Option         Description
--verbose, -v  Verbose output
--progress     Show progress
--preview      Preview mode (none, proj, tae, vae)

Supported Models

Model Family   Examples                    Notes
SD 1.x/2.x     sd-v1-5, sd-v2-1            Standard models
SDXL           sdxl-base, sdxl-turbo       Use cfg_scale=1.0, steps=1-4 for turbo
SD3/SD3.5      sd3-medium, sd3.5-large     May need T5-XXL encoder
FLUX           flux.1-dev, flux.1-schnell  Needs clip_l + t5xxl or llm
FLUX2          flux2-*                     Uses LLM encoder (Qwen)
Wan/CogVideoX  wan-2.1, cogvideox          Video generation
LoRA           *.safetensors               Place in lora_model_dir
ControlNet     control_*                   Use with control images
ESRGAN         esrgan-x4                   Upscaling only

Utility Functions

from cyllama.sd import (
    get_num_cores,
    get_system_info,
    type_name,
    sample_method_name,
    scheduler_name
)

print(f"CPU cores: {get_num_cores()}")
print(get_system_info())
print(type_name(SDType.Q4_0))           # "q4_0"
print(sample_method_name(SampleMethod.EULER))  # "euler"
print(scheduler_name(Scheduler.KARRAS))  # "karras"

Performance Tips

  1. Use turbo models for fast generation (1-4 steps, cfg_scale=1.0)
  2. Quantize models to Q4_0 or Q8_0 for memory efficiency
  3. Reuse SDContext when generating multiple images
  4. Set n_threads to match physical CPU cores
  5. Use --offload-to-cpu for low VRAM GPUs
  6. Enable --diffusion-fa (flash attention) for faster inference
  7. Use --vae-tiling for generating large images
  8. Use progress callback to track long generations
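For tip 4, note that os.cpu_count() reports logical cores; on machines with SMT/hyper-threading the physical core count is typically half that. A conservative helper (a heuristic sketch, not a cyllama API):

```python
import os
from typing import Optional

def pick_threads(logical: Optional[int] = None, smt: bool = True) -> int:
    """Estimate a thread count near the physical core count."""
    logical = logical or os.cpu_count() or 1
    # Assumption: with SMT enabled, physical cores ~= logical cores / 2.
    return max(1, logical // 2 if smt else logical)
```

Then set `params.n_threads = pick_threads()` (or pass the result as n_threads to the convenience functions).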

Troubleshooting

Model Loading Errors

import os
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model not found: {model_path}")

ctx = SDContext(params)
if not ctx.is_valid:
    raise RuntimeError(f"Failed to load model: {model_path}")

Out of Memory

  • Use smaller model (SD 1.5 vs SDXL)
  • Use quantized model (Q4_0 vs F16)
  • Reduce image dimensions
  • Reduce batch_count
  • Enable --offload-to-cpu
  • Enable --vae-tiling for large images

Slow Generation

  • Use turbo/LCM models with fewer steps
  • Enable flash attention (--diffusion-fa)
  • Increase n_threads
  • Use direct convolution (--diffusion-conv-direct)

FLUX/SD3 Models Not Working

  • Ensure you have the required encoders (clip_l, t5xxl)
  • For FLUX2, use --llm instead of --t5xxl
  • Check prediction type matches model

See Also