Whisper.cpp Integration¶

Cyllama wraps whisper.cpp to provide automatic speech recognition (ASR) capabilities in Python.

Overview¶

The whisper module provides Python bindings to whisper.cpp, enabling:

Speech-to-text transcription
Multi-language support (roughly 100 languages)
Translation to English
Word-level timestamps
Voice activity detection (VAD)
GPU acceleration (Metal, CUDA)

Quick Start¶

Basic Transcription¶

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model
ctx = WhisperContext("models/ggml-base.en.bin")

# Load audio as mono float32 samples at 16 kHz.
# load_audio_as_float32 is a placeholder for your own loader; see
# "Audio Preparation" for versions built on scipy or soundfile.
samples = load_audio_as_float32("audio.wav")  # Shape: (n_samples,)

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
n_segments = ctx.full_n_segments()
for i in range(n_segments):
    text = ctx.full_get_segment_text(i)
    t0 = ctx.full_get_segment_t0(i)  # Start time in centiseconds
    t1 = ctx.full_get_segment_t1(i)  # End time in centiseconds
    print(f"[{t0/100:.2f}s - {t1/100:.2f}s] {text}")

With Language Detection¶

from cyllama.whisper import WhisperContext, WhisperFullParams, lang_str_full

ctx = WhisperContext("models/ggml-base.bin")  # Multilingual model

params = WhisperFullParams()
params.language = None  # Auto-detect language

# samples: mono float32 audio at 16 kHz, loaded as in Basic Transcription
ctx.full(samples, params)

# Get detected language
detected_id = ctx.full_lang_id()
lang_name = lang_str_full(detected_id)  # Module-level lookup (see Language Functions)
print(f"Detected language: {lang_name}")

Translation to English¶

params = WhisperFullParams()
params.translate = True  # Translate to English
params.language = "de"   # Source language (German)

ctx.full(samples, params)

API Reference¶

Constants¶

from cyllama.whisper import WHISPER

WHISPER.SAMPLE_RATE   # 16000 - Required sample rate
WHISPER.N_FFT         # FFT size
WHISPER.HOP_LENGTH    # Hop length for STFT
WHISPER.CHUNK_SIZE    # Chunk size for processing
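
These constants relate to each other through simple arithmetic. The sketch below works through it using the values defined in whisper.h (16000 Hz, hop length 160, 30-second chunks). That the `WHISPER` attributes carry exactly these values is an assumption, and the helper names are hypothetical.

```python
# Values as defined in whisper.h; assumed to match the WHISPER constants.
SAMPLE_RATE = 16000   # samples per second
HOP_LENGTH = 160      # samples between successive spectrogram frames
CHUNK_SIZE = 30       # seconds of audio per processing window


def audio_duration_seconds(n_samples: int) -> float:
    """Length of a mono buffer in seconds."""
    return n_samples / SAMPLE_RATE


def frame_duration_ms() -> float:
    """Time covered by one hop between spectrogram frames."""
    return 1000.0 * HOP_LENGTH / SAMPLE_RATE


def frames_per_chunk() -> int:
    """Number of spectrogram frames in one processing window."""
    return CHUNK_SIZE * SAMPLE_RATE // HOP_LENGTH
```

With these values one hop spans 10 ms, which is the same unit (centiseconds) used for segment timestamps.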

WhisperContext¶

The main context class for model loading and inference.

from cyllama.whisper import WhisperContext, WhisperContextParams

# Basic loading
ctx = WhisperContext("models/ggml-base.bin")

# With parameters
params = WhisperContextParams()
params.use_gpu = True
params.flash_attn = True
params.gpu_device = 0

ctx = WhisperContext("models/ggml-base.bin", params)

Methods:

Method	Description
`full(samples, params)`	Run full transcription pipeline
`full_n_segments()`	Get number of transcribed segments
`full_get_segment_text(i)`	Get text of segment i
`full_get_segment_t0(i)`	Get start time of segment i (centiseconds)
`full_get_segment_t1(i)`	Get end time of segment i (centiseconds)
`full_n_tokens(i)`	Get number of tokens in segment i
`full_get_token_text(i, j)`	Get text of token j in segment i
`full_get_token_id(i, j)`	Get ID of token j in segment i
`full_get_token_p(i, j)`	Get probability of token j in segment i
`full_get_token_data(i, j)`	Get timing and probability data (t0, t1, p) of token j in segment i
`full_lang_id()`	Get detected language ID
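
The per-token accessors can be combined into a rough confidence score per segment. The following is a minimal sketch, not part of cyllama: `segment_confidences` is a hypothetical helper that relies only on the methods listed above, and the mean includes every token in the segment, special tokens included.

```python
def segment_confidences(ctx) -> list:
    """Return (text, mean token probability) for each transcribed segment.

    ctx is any object exposing the WhisperContext methods used below,
    called after ctx.full(...) has completed.
    """
    out = []
    for i in range(ctx.full_n_segments()):
        n = ctx.full_n_tokens(i)
        probs = [ctx.full_get_token_p(i, j) for j in range(n)]
        mean_p = sum(probs) / n if n else 0.0  # empty segment -> 0.0
        out.append((ctx.full_get_segment_text(i), mean_p))
    return out
```

Segments with a low mean probability are reasonable candidates for manual review.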

Model Information:

Method	Description
`is_multilingual()`	Check if model supports multiple languages
`n_vocab()`	Get vocabulary size
`n_text_ctx()`	Get text context size
`n_audio_ctx()`	Get audio context size
`model_type_readable()`	Get model type as string ("base", "small", etc.)

Tokenization:

Method	Description
`tokenize(text)`	Convert text to token IDs
`token_to_str(id)`	Convert token ID to text
`token_count(text)`	Count tokens in text

WhisperContextParams¶

Configuration for context creation.

from cyllama.whisper import WhisperContextParams

params = WhisperContextParams()
params.use_gpu = True           # Use GPU acceleration
params.flash_attn = True        # Use flash attention
params.gpu_device = 0           # GPU device index
params.dtw_token_timestamps = False  # Enable DTW for precise timestamps

WhisperFullParams¶

Configuration for transcription.

from cyllama.whisper import WhisperFullParams, WhisperSamplingStrategy

params = WhisperFullParams()

# Sampling strategy
params.strategy = WhisperSamplingStrategy.GREEDY  # or BEAM_SEARCH

# Threading
params.n_threads = 4

# Language
params.language = "en"    # Set language (None for auto-detect)
params.translate = False  # Translate to English

# Timing
params.offset_ms = 0      # Start offset in milliseconds
params.duration_ms = 0    # Duration (0 = full audio)

# Output control
params.no_timestamps = False
params.single_segment = False
params.print_progress = False
params.print_realtime = False
params.print_timestamps = True

# Token timestamps
params.token_timestamps = False

# Sampling temperature
params.temperature = 0.0

WhisperVadParams¶

Voice activity detection parameters.

from cyllama.whisper import WhisperVadParams

vad = WhisperVadParams()
vad.threshold = 0.6              # VAD threshold (0-1)
vad.min_speech_duration_ms = 250 # Minimum speech duration
vad.min_silence_duration_ms = 100  # Minimum silence duration
vad.max_speech_duration_s = 30.0 # Maximum speech segment
vad.speech_pad_ms = 30           # Padding around speech
vad.samples_overlap = 0.0        # Sample overlap

Sampling Strategies¶

from cyllama.whisper import WhisperSamplingStrategy

WhisperSamplingStrategy.GREEDY      # Fast, deterministic
WhisperSamplingStrategy.BEAM_SEARCH # Better quality, slower

Language Functions¶

from cyllama.whisper import lang_id, lang_str, lang_str_full, lang_max_id

# Get language ID from code (avoid naming the result "id", which shadows the builtin)
en_id = lang_id("en")  # Returns 0

# Get language code from ID
code = lang_str(0)  # Returns "en"

# Get full language name
name = lang_str_full(0)  # Returns "english"

# Get maximum language ID
max_id = lang_max_id()  # Returns ~100
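
These lookups can be combined to enumerate every supported language. A minimal sketch, assuming language IDs run contiguously from 0 to `lang_max_id()` inclusive; `language_table` is a hypothetical helper that takes the three functions as arguments so it can be exercised without a model.

```python
def language_table(lang_max_id, lang_str, lang_str_full) -> dict:
    """Map each language code to its full name.

    The three arguments are the lookup functions of the same names.
    Assumes IDs are contiguous from 0 to lang_max_id() inclusive.
    """
    return {lang_str(i): lang_str_full(i) for i in range(lang_max_id() + 1)}
```

With cyllama installed, the call would be `language_table(lang_max_id, lang_str, lang_str_full)` using the functions imported above.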

Module Functions¶

from cyllama.whisper import version, print_system_info

# Get whisper.cpp version
ver = version()

# Get system info (CPU features, etc.)
info = print_system_info()

Audio Preparation¶

Whisper requires:

Sample rate: 16000 Hz (mono)
Format: Float32 normalized to [-1.0, 1.0]

Using scipy¶

from scipy.io import wavfile
import numpy as np

def load_audio(path: str) -> np.ndarray:
    rate, data = wavfile.read(path)

    # Scale integer PCM to [-1.0, 1.0] first: averaging channels would
    # change the dtype and skip the checks below
    if data.dtype == np.uint8:
        data = (data.astype(np.float32) - 128.0) / 128.0
    elif data.dtype == np.int16:
        data = data.astype(np.float32) / 32768.0
    elif data.dtype == np.int32:
        data = data.astype(np.float32) / 2147483648.0

    # Convert to mono if stereo
    if data.ndim > 1:
        data = data.mean(axis=1)

    # Resample to 16kHz if needed
    if rate != 16000:
        from scipy import signal
        num_samples = int(len(data) * 16000 / rate)
        data = signal.resample(data, num_samples)

    return data.astype(np.float32)

Using soundfile¶

import soundfile as sf
import numpy as np

def load_audio(path: str) -> np.ndarray:
    data, rate = sf.read(path, dtype='float32')

    # Convert to mono
    if len(data.shape) > 1:
        data = data.mean(axis=1)

    # Resample if needed
    if rate != 16000:
        import resampy
        data = resampy.resample(data, rate, 16000)

    return data.astype(np.float32)
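
Without third-party audio libraries, the standard-library `wave` module is enough for plain 16-bit PCM files. The sketch below is an illustration under that restriction: `read_pcm16_wav` is a hypothetical helper, it does not resample, and it returns a Python list rather than a NumPy array.

```python
import struct
import wave


def read_pcm16_wav(path: str):
    """Read a 16-bit PCM WAV file; return (sample_rate, mono floats in [-1, 1))."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())

    # WAV PCM data is little-endian, with channels interleaved per frame.
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)

    # Average each frame's channels down to mono, then scale to [-1.0, 1.0).
    mono = [sum(ints[k:k + channels]) / channels
            for k in range(0, len(ints), channels)]
    return rate, [s / 32768.0 for s in mono]
```

Before transcription, convert the result with `np.asarray(samples, dtype=np.float32)` and resample if the returned rate is not 16000.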

Common Patterns¶

Transcription with Timestamps¶

def transcribe_with_timestamps(audio_path: str, model_path: str) -> list:
    ctx = WhisperContext(model_path)
    samples = load_audio(audio_path)

    params = WhisperFullParams()
    params.print_timestamps = True

    ctx.full(samples, params)

    results = []
    for i in range(ctx.full_n_segments()):
        results.append({
            "start": ctx.full_get_segment_t0(i) / 100.0,
            "end": ctx.full_get_segment_t1(i) / 100.0,
            "text": ctx.full_get_segment_text(i).strip()
        })

    return results
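
A list of this shape converts directly to subtitle formats. As an illustration, here is a minimal SRT writer; `to_srt` is a hypothetical helper that assumes each entry has `start` and `end` in seconds and a `text` string, as produced above.

```python
def to_srt(results: list) -> str:
    """Render [{'start': s, 'end': s, 'text': str}, ...] as SRT subtitle text."""

    def stamp(seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm with a comma before milliseconds.
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for n, seg in enumerate(results, start=1):
        blocks.append(
            f"{n}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```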

Word-Level Timestamps¶

def transcribe_with_word_timestamps(audio_path: str, model_path: str) -> list:
    params = WhisperContextParams()
    params.dtw_token_timestamps = True

    ctx = WhisperContext(model_path, params)
    samples = load_audio(audio_path)

    fparams = WhisperFullParams()
    fparams.token_timestamps = True

    ctx.full(samples, fparams)

    # Note: each entry is a model token, which is often a sub-word piece
    # rather than a whole word
    words = []
    for i in range(ctx.full_n_segments()):
        for j in range(ctx.full_n_tokens(i)):
            token_data = ctx.full_get_token_data(i, j)
            text = ctx.full_get_token_text(i, j)
            if text.strip():
                words.append({
                    "word": text,
                    "start": token_data.t0 / 100.0,
                    "end": token_data.t1 / 100.0,
                    "probability": token_data.p
                })

    return words
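
Because Whisper tokens are often sub-word pieces, the entries returned here may need merging to get whole words. A heuristic sketch, assuming a language that separates words with spaces and that a token beginning with a space starts a new word; `merge_tokens_into_words` is a hypothetical helper, and it reports the lowest token probability as the word's probability.

```python
def merge_tokens_into_words(tokens: list) -> list:
    """Merge token dicts (word, start, end, probability) into whole words.

    Heuristic: a token whose text starts with a space begins a new word;
    any other token continues the previous word.
    """
    words = []
    for tok in tokens:
        text = tok["word"]
        if text.startswith(" ") or not words:
            words.append({
                "word": text.strip(),
                "start": tok["start"],
                "end": tok["end"],
                "probability": tok["probability"],
            })
        else:
            last = words[-1]
            last["word"] += text
            last["end"] = tok["end"]
            # A word is only as reliable as its least certain piece.
            last["probability"] = min(last["probability"], tok["probability"])
    return words
```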

Batch Processing¶

def transcribe_batch(audio_paths: list, model_path: str) -> dict:
    ctx = WhisperContext(model_path)
    params = WhisperFullParams()

    results = {}
    for path in audio_paths:
        samples = load_audio(path)
        ctx.full(samples, params)

        # Join all segment texts for this file
        text = "".join(
            ctx.full_get_segment_text(i) for i in range(ctx.full_n_segments())
        )

        results[path] = text.strip()

    return results

Streaming Transcription¶

For real-time or streaming audio, process in chunks:

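import numpy as np

def transcribe_stream(audio_stream, model_path: str, chunk_seconds: float = 30.0):
    ctx = WhisperContext(model_path)
    params = WhisperFullParams()
    params.single_segment = True

    chunk_samples = int(16000 * chunk_seconds)
    overlap = 1600  # 100 ms at 16 kHz
    buffer = np.array([], dtype=np.float32)

    for chunk in audio_stream:
        buffer = np.concatenate([buffer, chunk])

        if len(buffer) >= chunk_samples:
            ctx.full(buffer[:chunk_samples], params)

            for i in range(ctx.full_n_segments()):
                yield ctx.full_get_segment_text(i)

            # Keep overlap for continuity
            buffer = buffer[chunk_samples - overlap:]

    # Transcribe the audio left in the buffer once the stream ends;
    # without this step the final partial chunk would be dropped
    if len(buffer) > overlap:
        ctx.full(buffer, params)
        for i in range(ctx.full_n_segments()):
            yield ctx.full_get_segment_text(i)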
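
The windowing arithmetic behind this pattern can be checked on its own, without a model. The sketch below computes the sample ranges of overlapping windows over a buffer of known length; `chunk_bounds` is a hypothetical helper, and its defaults mirror the 30-second chunk and 100 ms overlap used above.

```python
def chunk_bounds(n_samples: int, chunk_samples: int = 480_000, overlap: int = 1_600):
    """Yield (start, end) sample indices of overlapping windows over the audio.

    Each window starts `overlap` samples before the previous one ended;
    the final window may be shorter than chunk_samples.
    """
    if not 0 <= overlap < chunk_samples:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    start = 0
    while start < n_samples:
        end = min(start + chunk_samples, n_samples)
        yield start, end
        if end == n_samples:
            break
        start = end - overlap
```

For example, a 10-sample buffer with 4-sample chunks and 1 sample of overlap is covered by the windows (0, 4), (3, 7) and (6, 10).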

Model Selection¶

Model	Disk size	Memory (approx.)	Speed	Quality
tiny	75 MB	~400 MB	Fastest	Basic
base	142 MB	~500 MB	Fast	Good
small	466 MB	~1 GB	Medium	Better
medium	1.5 GB	~2.5 GB	Slow	Great
large-v3	3 GB	~5 GB	Slowest	Best
large-v3-turbo	1.6 GB	~3 GB	Medium	Great
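
One way to use this table is to pick the largest model that fits a memory budget. A minimal sketch using the approximate memory figures above (with ~1 GB taken as 1000 MB); `pick_model` is a hypothetical helper, and ordering by memory only roughly tracks quality.

```python
# (model name, approximate runtime memory in MB), smallest to largest,
# copied from the model table.
MODELS = [
    ("tiny", 400),
    ("base", 500),
    ("small", 1000),
    ("medium", 2500),
    ("large-v3-turbo", 3000),
    ("large-v3", 5000),
]


def pick_model(available_mb: int):
    """Return the largest listed model whose memory estimate fits, or None."""
    fitting = [name for name, mem in MODELS if mem <= available_mb]
    return fitting[-1] if fitting else None
```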

English-only models (.en suffix) generally transcribe English audio more accurately than the multilingual model of the same size, most noticeably at the tiny and base sizes:

ggml-tiny.en.bin
ggml-base.en.bin
ggml-small.en.bin
ggml-medium.en.bin

Download models from Hugging Face.

Performance Tips¶

Use GPU: Enable use_gpu=True in context params
Use Flash Attention: Enable flash_attn=True for faster inference
Match model to task: Use .en models for English-only content
Batch by length: Group similar-length audio for consistent memory usage
Thread count: Set n_threads to match physical CPU cores
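
For the thread count, the standard library reports only logical CPUs, so the physical core count has to be estimated. A rough sketch, assuming two hardware threads per core (common with SMT but not universal); `suggested_n_threads` is a hypothetical helper.

```python
import os


def suggested_n_threads(logical_cpus=None, threads_per_core: int = 2) -> int:
    """Estimate the physical core count from the logical CPU count.

    threads_per_core=2 is an assumption; pass 1 on machines without SMT.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical_cpus // threads_per_core)
```

The result can then be assigned to `params.n_threads` on a `WhisperFullParams` instance.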

Troubleshooting¶

Model Loading Errors¶

# Check model path exists
import os
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model not found: {model_path}")

# Check model format
if not model_path.endswith('.bin'):
    print("Warning: Whisper models should be .bin format (ggml)")
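
A slightly stronger check than the file extension is to inspect the file header. The sketch below assumes the model uses the legacy ggml container, whose first four bytes encode the magic number 0x67676d6c as a little-endian integer; both that assumption and the helper name `check_model_file` are illustrative, not part of cyllama.

```python
import os
import struct

# "ggml" magic; assumed to be the header of whisper.cpp .bin model files.
GGML_MAGIC = 0x67676D6C


def check_model_file(path: str) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    if not os.path.isfile(path):
        return [f"not a file: {path}"]
    problems = []
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        problems.append("file is too short to be a model")
    elif struct.unpack("<I", header)[0] != GGML_MAGIC:
        problems.append("unexpected magic number; file may be corrupt or not a ggml model")
    return problems
```

An empty list means only that the header looks plausible; a truncated download can still fail when the model is loaded.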

Audio Issues¶

# Verify audio format (rate and samples as returned by your audio loader)
print(f"Sample rate: {rate}")
print(f"Dtype: {samples.dtype}")
print(f"Shape: {samples.shape}")
print(f"Range: [{samples.min():.2f}, {samples.max():.2f}]")

# Should be: 16000, float32, (N,), [-1.0, 1.0]
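
The same checks can be written as a function that reports what is wrong instead of printing raw values. A minimal sketch with a hypothetical helper, `audio_problems`, which takes plain values so it carries no NumPy dependency.

```python
def audio_problems(rate: int, dtype: str, shape: tuple, lo: float, hi: float) -> list:
    """Compare audio properties against Whisper's requirements.

    Returns a list of problem descriptions; an empty list means the
    audio is 16 kHz, float32, one-dimensional and within [-1.0, 1.0].
    """
    problems = []
    if rate != 16000:
        problems.append(f"sample rate is {rate}, expected 16000")
    if dtype != "float32":
        problems.append(f"dtype is {dtype}, expected float32")
    if len(shape) != 1:
        problems.append(f"shape is {shape}, expected one dimension (mono)")
    if lo < -1.0 or hi > 1.0:
        problems.append(f"range [{lo}, {hi}] exceeds [-1.0, 1.0]")
    return problems
```

With a NumPy array it can be called as `audio_problems(rate, str(samples.dtype), samples.shape, float(samples.min()), float(samples.max()))`.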

Memory Issues¶

# Use smaller model
ctx = WhisperContext("models/ggml-tiny.bin")

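# Disable GPU if VRAM limited; the params must be passed when creating the context
params = WhisperContextParams()
params.use_gpu = False
ctx = WhisperContext("models/ggml-tiny.bin", params)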