Whisper.cpp Integration¶
Cyllama wraps whisper.cpp to provide automatic speech recognition (ASR) capabilities in Python.
Overview¶
The whisper module provides Python bindings to whisper.cpp, enabling:
- Speech-to-text transcription
- Multi-language support (roughly 100 languages)
- Translation to English
- Word-level timestamps
- Voice activity detection (VAD)
- GPU acceleration (Metal, CUDA)
Quick Start¶
Basic Transcription¶
from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np
# Load model
ctx = WhisperContext("models/ggml-base.en.bin")
# Load audio as float32 samples at 16kHz
# (Use your preferred audio library: scipy, soundfile, librosa, etc.)
samples = load_audio_as_float32("audio.wav") # Shape: (n_samples,)
# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)
# Get results
n_segments = ctx.full_n_segments()
for i in range(n_segments):
    text = ctx.full_get_segment_text(i)
    t0 = ctx.full_get_segment_t0(i)  # Start time in centiseconds
    t1 = ctx.full_get_segment_t1(i)  # End time in centiseconds
    print(f"[{t0/100:.2f}s - {t1/100:.2f}s] {text}")
With Language Detection¶
from cyllama.whisper import WhisperContext, WhisperFullParams, lang_str_full
ctx = WhisperContext("models/ggml-base.bin") # Multilingual model
params = WhisperFullParams()
params.language = None # Auto-detect language
ctx.full(samples, params)
# Get detected language
detected_id = ctx.full_lang_id()
lang_name = lang_str_full(detected_id) # Module-level function, see Language Functions
print(f"Detected language: {lang_name}")
Translation to English¶
params = WhisperFullParams()
params.translate = True # Translate to English
params.language = "de" # Source language (German)
ctx.full(samples, params)
API Reference¶
Constants¶
from cyllama.whisper import WHISPER
WHISPER.SAMPLE_RATE # 16000 - Required sample rate
WHISPER.N_FFT # FFT size
WHISPER.HOP_LENGTH # Hop length for STFT
WHISPER.CHUNK_SIZE # Chunk size for processing
WhisperContext¶
The main context class for model loading and inference.
from cyllama.whisper import WhisperContext, WhisperContextParams
# Basic loading
ctx = WhisperContext("models/ggml-base.bin")
# With parameters
params = WhisperContextParams()
params.use_gpu = True
params.flash_attn = True
params.gpu_device = 0
ctx = WhisperContext("models/ggml-base.bin", params)
Methods:
| Method | Description |
|---|---|
| full(samples, params) | Run full transcription pipeline |
| full_n_segments() | Get number of transcribed segments |
| full_get_segment_text(i) | Get text of segment i |
| full_get_segment_t0(i) | Get start time of segment i (centiseconds) |
| full_get_segment_t1(i) | Get end time of segment i (centiseconds) |
| full_n_tokens(i) | Get number of tokens in segment i |
| full_get_token_text(i, j) | Get text of token j in segment i |
| full_get_token_id(i, j) | Get ID of token j in segment i |
| full_get_token_p(i, j) | Get probability of token j in segment i |
| full_lang_id() | Get detected language ID |
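The token-level accessors can be combined into a rough per-segment confidence score. The sketch below relies only on full_n_tokens and full_get_token_p from the table above. The helper name segment_confidence and the choice of a geometric mean are illustrative assumptions, not part of cyllama.

```python
import math


def segment_confidence(ctx, i: int) -> float:
    """Geometric mean of the token probabilities in segment i.

    `ctx` is any object exposing full_n_tokens(i) and full_get_token_p(i, j),
    for example a WhisperContext after ctx.full(...). Hypothetical helper.
    """
    n_tokens = ctx.full_n_tokens(i)
    if n_tokens == 0:
        return 0.0
    # Sum log-probabilities; clamp to avoid log(0) for zero-probability tokens.
    log_sum = sum(
        math.log(max(ctx.full_get_token_p(i, j), 1e-12)) for j in range(n_tokens)
    )
    return math.exp(log_sum / n_tokens)
```

A geometric mean penalizes a single very unlikely token more than an arithmetic mean would, which tends to suit flagging segments for review.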
Model Information:
| Method | Description |
|---|---|
| is_multilingual() | Check if model supports multiple languages |
| n_vocab() | Get vocabulary size |
| n_text_ctx() | Get text context size |
| n_audio_ctx() | Get audio context size |
| model_type_readable() | Get model type as string ("base", "small", etc.) |
Tokenization:
| Method | Description |
|---|---|
| tokenize(text) | Convert text to token IDs |
| token_to_str(id) | Convert token ID to text |
| token_count(text) | Count tokens in text |
WhisperContextParams¶
Configuration for context creation.
from cyllama.whisper import WhisperContextParams
params = WhisperContextParams()
params.use_gpu = True # Use GPU acceleration
params.flash_attn = True # Use flash attention
params.gpu_device = 0 # GPU device index
params.dtw_token_timestamps = False # Enable DTW for precise timestamps
WhisperFullParams¶
Configuration for transcription.
from cyllama.whisper import WhisperFullParams, WhisperSamplingStrategy
params = WhisperFullParams()
# Sampling strategy
params.strategy = WhisperSamplingStrategy.GREEDY # or BEAM_SEARCH
# Threading
params.n_threads = 4
# Language
params.language = "en" # Set language (None for auto-detect)
params.translate = False # Translate to English
# Timing
params.offset_ms = 0 # Start offset in milliseconds
params.duration_ms = 0 # Duration (0 = full audio)
# Output control
params.no_timestamps = False
params.single_segment = False
params.print_progress = False
params.print_realtime = False
params.print_timestamps = True
# Token timestamps
params.token_timestamps = False
params.temperature = 0.0
WhisperVadParams¶
Voice activity detection parameters.
from cyllama.whisper import WhisperVadParams
vad = WhisperVadParams()
vad.threshold = 0.6 # VAD threshold (0-1)
vad.min_speech_duration_ms = 250 # Minimum speech duration
vad.min_silence_duration_ms = 100 # Minimum silence duration
vad.max_speech_duration_s = 30.0 # Maximum speech segment
vad.speech_pad_ms = 30 # Padding around speech
vad.samples_overlap = 0.0 # Sample overlap
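To make the roles of these fields concrete, here is a deliberately simplified sketch of how per-frame speech probabilities could be turned into speech segments. It is not whisper.cpp's VAD implementation: the function speech_segments, its arguments, and the order of the steps are assumptions for illustration only, and max_speech_duration_s and samples_overlap are not modeled.

```python
def speech_segments(probs, frame_ms, threshold=0.6, min_speech_ms=250,
                    min_silence_ms=100, pad_ms=30):
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments.

    Illustrative only; a real VAD applies its own, more involved logic.
    """
    # 1. Threshold: collect runs of consecutive frames classified as speech.
    runs = []
    start = None
    for idx, p in enumerate(probs):
        if p >= threshold and start is None:
            start = idx
        elif p < threshold and start is not None:
            runs.append([start * frame_ms, idx * frame_ms])
            start = None
    if start is not None:
        runs.append([start * frame_ms, len(probs) * frame_ms])

    # 2. Merge runs separated by a silence shorter than min_silence_ms.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_silence_ms:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop segments shorter than min_speech_ms, then pad the survivors,
    #    clamping the padding to the bounds of the audio.
    total_ms = len(probs) * frame_ms
    return [
        (max(0, s - pad_ms), min(total_ms, e + pad_ms))
        for s, e in merged
        if e - s >= min_speech_ms
    ]
```

In this toy model a higher threshold yields fewer, shorter segments, while a larger minimum silence merges more neighbouring runs into one.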
Sampling Strategies¶
from cyllama.whisper import WhisperSamplingStrategy
WhisperSamplingStrategy.GREEDY # Fast, deterministic
WhisperSamplingStrategy.BEAM_SEARCH # Better quality, slower
Language Functions¶
from cyllama.whisper import lang_id, lang_str, lang_str_full, lang_max_id
# Get language ID from code
en_id = lang_id("en") # Returns 0 (named en_id to avoid shadowing the builtin id)
# Get language code from ID
code = lang_str(0) # Returns "en"
# Get full language name
name = lang_str_full(0) # Returns "english"
# Get maximum language ID
max_id = lang_max_id() # Returns the largest valid language ID (around 100)
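These functions can be combined to list every supported language. The sketch below assumes that language IDs are contiguous from 0 up to and including lang_max_id(). The helper build_language_table is hypothetical and takes the lookup functions as arguments so it can be exercised without a model.

```python
def build_language_table(max_id, code_for, name_for):
    """Map language code -> full name for every ID in 0..max_id (inclusive).

    Hypothetical helper; pass lang_max_id(), lang_str and lang_str_full.
    """
    table = {}
    for lang in range(max_id + 1):
        code = code_for(lang)
        if code:  # skip IDs that have no associated code
            table[code] = name_for(lang)
    return table
```

With cyllama installed, the call would look like build_language_table(lang_max_id(), lang_str, lang_str_full).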
Module Functions¶
from cyllama.whisper import version, print_system_info
# Get whisper.cpp version
ver = version()
# Get system info (CPU features, etc.)
info = print_system_info()
Audio Preparation¶
Whisper requires:
- Sample rate: 16000 Hz (mono)
- Format: Float32 normalized to [-1.0, 1.0]
Using scipy¶
from scipy import signal
from scipy.io import wavfile
import numpy as np
def load_audio(path: str) -> np.ndarray:
    rate, data = wavfile.read(path)
    # Normalize integer PCM to float32 in [-1.0, 1.0] first: averaging
    # channels changes the dtype and would bypass these checks
    if data.dtype == np.int16:
        data = data.astype(np.float32) / 32768.0
    elif data.dtype == np.int32:
        data = data.astype(np.float32) / 2147483648.0
    elif data.dtype == np.uint8:
        data = (data.astype(np.float32) - 128.0) / 128.0
    # Convert to mono if stereo
    if data.ndim > 1:
        data = data.mean(axis=1)
    # Resample to 16kHz if needed
    if rate != 16000:
        num_samples = int(len(data) * 16000 / rate)
        data = signal.resample(data, num_samples)
    return data.astype(np.float32)
Using soundfile¶
import soundfile as sf
import numpy as np
def load_audio(path: str) -> np.ndarray:
    data, rate = sf.read(path, dtype='float32')
    # Convert to mono
    if data.ndim > 1:
        data = data.mean(axis=1)
    # Resample if needed
    if rate != 16000:
        import resampy
        data = resampy.resample(data, rate, 16000)
    return data.astype(np.float32)
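When neither library is available and the input is already a 16 kHz, 16-bit PCM WAV file, the standard library is enough. The following is a minimal sketch under those assumptions; the function name load_pcm16_wav is hypothetical, and it rejects files that would need resampling rather than converting them.

```python
import array
import sys
import wave


def load_pcm16_wav(path: str) -> list:
    """Read a 16 kHz, 16-bit PCM WAV file as mono floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16000 Hz sample rate")
        n_channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Average the interleaved channels of each frame, then scale.
    mono = []
    for i in range(0, len(pcm), n_channels):
        frame = pcm[i:i + n_channels]
        mono.append(sum(frame) / (n_channels * 32768.0))
    return mono
```

The result is a plain list; wrapping it as np.asarray(load_pcm16_wav("audio.wav"), dtype=np.float32) gives the array form used elsewhere on this page.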
Common Patterns¶
Transcription with Timestamps¶
def transcribe_with_timestamps(audio_path: str, model_path: str) -> list:
    ctx = WhisperContext(model_path)
    samples = load_audio(audio_path)
    params = WhisperFullParams()
    params.print_timestamps = True
    ctx.full(samples, params)
    results = []
    for i in range(ctx.full_n_segments()):
        results.append({
            "start": ctx.full_get_segment_t0(i) / 100.0,
            "end": ctx.full_get_segment_t1(i) / 100.0,
            "text": ctx.full_get_segment_text(i).strip()
        })
    return results
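One common use of these segment dictionaries is subtitle export. The sketch below renders the list returned by transcribe_with_timestamps as SubRip (SRT) text; to_srt is a hypothetical helper and assumes start and end are in seconds, as produced above.

```python
def to_srt(segments: list) -> str:
    """Render [{'start': s, 'end': s, 'text': str}, ...] as SubRip text."""

    def stamp(seconds: float) -> str:
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        hours, ms = divmod(ms, 3_600_000)
        minutes, ms = divmod(ms, 60_000)
        secs, ms = divmod(ms, 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```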
Word-Level Timestamps¶
def transcribe_with_word_timestamps(audio_path: str, model_path: str) -> list:
    params = WhisperContextParams()
    params.dtw_token_timestamps = True
    ctx = WhisperContext(model_path, params)
    samples = load_audio(audio_path)
    fparams = WhisperFullParams()
    fparams.token_timestamps = True
    ctx.full(samples, fparams)
    words = []
    for i in range(ctx.full_n_segments()):
        for j in range(ctx.full_n_tokens(i)):
            token_data = ctx.full_get_token_data(i, j)
            text = ctx.full_get_token_text(i, j)
            if text.strip():
                words.append({
                    "word": text,
                    "start": token_data.t0 / 100.0,
                    "end": token_data.t1 / 100.0,
                    "probability": token_data.p
                })
    return words
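The entries returned here are tokens, which are often sub-word pieces rather than whole words. A possible post-processing step, sketched below, merges them using the convention that a token starting with a space begins a new word. The helper merge_tokens_into_words is hypothetical; it assumes special markers look like "[_BEG_]", and it will not segment languages that do not separate words with spaces.

```python
def merge_tokens_into_words(tokens: list) -> list:
    """Merge sub-word tokens into words, keeping the outer timestamps."""
    words = []
    for tok in tokens:
        text = tok["word"]
        # Skip special markers; assumed to look like "[_BEG_]" or "[_TT_50]".
        if text.startswith("[_"):
            continue
        if text.startswith(" ") or not words:
            # A leading space (or the very first token) starts a new word.
            words.append({
                "word": text.strip(),
                "start": tok["start"],
                "end": tok["end"],
                "probability": tok["probability"],
            })
        else:
            # Continuation piece or punctuation: extend the previous word.
            last = words[-1]
            last["word"] += text
            last["end"] = tok["end"]
            # Keep the weakest token as a conservative word-level score.
            last["probability"] = min(last["probability"], tok["probability"])
    return words
```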
Batch Processing¶
def transcribe_batch(audio_paths: list, model_path: str) -> dict:
    ctx = WhisperContext(model_path)
    params = WhisperFullParams()
    results = {}
    for path in audio_paths:
        samples = load_audio(path)
        ctx.full(samples, params)
        text = ""
        for i in range(ctx.full_n_segments()):
            text += ctx.full_get_segment_text(i)
        results[path] = text.strip()
    return results
Streaming Transcription¶
For real-time or streaming audio, process in chunks:
def transcribe_stream(audio_stream, model_path: str, chunk_seconds: float = 30.0):
    ctx = WhisperContext(model_path)
    params = WhisperFullParams()
    params.single_segment = True
    chunk_samples = int(16000 * chunk_seconds)
    overlap = 1600  # 100ms at 16kHz, kept for continuity between chunks
    buffer = np.array([], dtype=np.float32)
    for chunk in audio_stream:
        buffer = np.concatenate([buffer, chunk])
        if len(buffer) >= chunk_samples:
            ctx.full(buffer[:chunk_samples], params)
            for i in range(ctx.full_n_segments()):
                yield ctx.full_get_segment_text(i)
            # Keep overlap for continuity
            buffer = buffer[chunk_samples - overlap:]
    # Transcribe the audio left over once the stream ends
    if len(buffer) > overlap:
        ctx.full(buffer, params)
        for i in range(ctx.full_n_segments()):
            yield ctx.full_get_segment_text(i)
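The same chunk-and-overlap arithmetic can be applied to a recording that is already fully in memory. The sketch below only computes the sample index ranges; chunk_bounds is a hypothetical helper, and unlike the generator above it also emits a final, shorter chunk.

```python
def chunk_bounds(n_samples: int, chunk: int, overlap: int):
    """Yield (start, end) sample indices for overlapping fixed-size chunks."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be >= 0 and smaller than chunk")
    step = chunk - overlap  # how far each new chunk advances
    start = 0
    while start < n_samples:
        yield start, min(start + chunk, n_samples)
        if start + chunk >= n_samples:
            break  # this chunk already reached the end of the audio
        start += step
```

Each range can then be passed on as samples[start:end]; with 30-second chunks at 16 kHz, chunk is 480000 and a 100 ms overlap is 1600.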
Model Selection¶
| Model | Size | Memory | Speed | Quality |
|---|---|---|---|---|
| tiny | 75 MB | ~400 MB | Fastest | Basic |
| base | 142 MB | ~500 MB | Fast | Good |
| small | 466 MB | ~1 GB | Medium | Better |
| medium | 1.5 GB | ~2.5 GB | Slow | Great |
| large-v3 | 3 GB | ~5 GB | Slowest | Best |
| large-v3-turbo | 1.6 GB | ~3 GB | Medium | Great |
English-only models (.en suffix) are faster and more accurate for English:
- ggml-tiny.en.bin
- ggml-base.en.bin
- ggml-small.en.bin
- ggml-medium.en.bin
Download models from Hugging Face.
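The table can also be read as a simple selection rule: take the most capable model whose approximate memory footprint fits the budget. The sketch below encodes the memory column from the table above. The helper pick_model, the preference order, and the file names for the large models (which follow the ggml-NAME.bin pattern used on this page) are assumptions.

```python
# (name, approximate memory in MB), ordered from most to least preferred.
# Figures are the approximate values from the model table on this page.
MODELS = [
    ("large-v3", 5000),
    ("large-v3-turbo", 3000),
    ("medium", 2500),
    ("small", 1000),
    ("base", 500),
    ("tiny", 400),
]
# Sizes for which an English-only (.en) variant is listed on this page.
ENGLISH_ONLY = {"tiny", "base", "small", "medium"}


def pick_model(memory_budget_mb: int, english_only: bool = False) -> str:
    """Return the file name of the first model that fits the budget."""
    for name, memory_mb in MODELS:
        if memory_mb <= memory_budget_mb:
            suffix = ".en" if english_only and name in ENGLISH_ONLY else ""
            return f"ggml-{name}{suffix}.bin"
    raise ValueError("no model fits in the given memory budget")
```

Speed is ignored here; a latency-sensitive application might prefer a smaller model than the budget allows.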
Performance Tips¶
- Use GPU: Enable use_gpu=True in context params
- Use Flash Attention: Enable flash_attn=True for faster inference
- Match model to task: Use .en models for English-only content
- Batch by length: Group similar-length audio for consistent memory usage
- Thread count: Set n_threads to match physical CPU cores
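The "batch by length" tip can be sketched as a small grouping step, assuming the duration of each file in seconds is already known (for example from its sample count divided by 16000). The helper group_by_duration and the 30-second bucket width are illustrative assumptions.

```python
def group_by_duration(durations: dict, bucket_seconds: float = 30.0) -> list:
    """Group paths whose durations fall into the same fixed-width bucket.

    `durations` maps path -> duration in seconds.
    """
    buckets = {}
    for path, seconds in durations.items():
        buckets.setdefault(int(seconds // bucket_seconds), []).append(path)
    # Shortest bucket first; paths inside a bucket are sorted by duration.
    return [
        sorted(buckets[key], key=durations.get)
        for key in sorted(buckets)
    ]
```

Each group can then be processed in turn with a single shared context.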
Troubleshooting¶
Model Loading Errors¶
# Check model path exists
import os
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model not found: {model_path}")
# Check model format
if not model_path.endswith('.bin'):
    print("Warning: Whisper models should be .bin format (ggml)")
Audio Issues¶
# Verify audio format
print(f"Sample rate: {rate}")
print(f"Dtype: {samples.dtype}")
print(f"Shape: {samples.shape}")
print(f"Range: [{samples.min():.2f}, {samples.max():.2f}]")
# Should be: 16000, float32, (N,), [-1.0, 1.0]
Memory Issues¶
# Use smaller model
ctx = WhisperContext("models/ggml-tiny.bin")
# Disable GPU if VRAM limited
params = WhisperContextParams()
params.use_gpu = False