Summary¶

Key takeaways from each section of the cyllama documentation.

Llama.cpp Integration¶

The llama.cpp integration provides multiple layers of API access:

High-Level API: complete(), chat(), and LLM class for quick prototyping
Streaming: Token-by-token output with callbacks for real-time applications
Batch Processing: 3-10x throughput improvements for bulk workloads
Server Implementations: OpenAI-compatible REST APIs via EmbeddedServer or PythonServer
Advanced Features: Speculative decoding (2-3x speedup), context caching, GGUF manipulation

Key configuration options include temperature, top_p, top_k for sampling control, and n_gpu_layers, n_ctx, n_batch for performance tuning.

Whisper.cpp Integration¶

Speech recognition capabilities include:

Transcription: Convert audio to text with segment-level timestamps
Translation: Automatic translation to English from 100+ languages
Word Timestamps: Token-level timing via DTW alignment
Voice Activity Detection: Filter silence and detect speech segments

Audio must be 16kHz mono float32 format. Model selection trades off speed vs. accuracy (tiny to large-v3).

Stable Diffusion Integration¶

Image generation supports multiple model architectures:

Text-to-Image: Generate images from text prompts
Image-to-Image: Transform existing images with text guidance
Video Generation: Frame sequences with Wan/CogVideoX models
Upscaling: ESRGAN-based resolution enhancement

Key parameters include sample_steps, cfg_scale, sample_method, and scheduler.

Agent Framework¶

Three agent architectures for different reliability requirements:

Agent	Use Case	Key Feature
ReActAgent	General-purpose	Natural reasoning trace
ConstrainedAgent	Smaller models	Grammar-enforced JSON output
ContractAgent	Critical applications	Runtime pre/post conditions

Tools are defined with the @tool decorator. Contracts use @pre, @post, and contract_assert().

Best Practices¶

Reuse model instances - Load once, generate many times
Match GPU layers to VRAM - Use estimate_gpu_layers() for optimal settings
Stream long outputs - Better user experience for verbose responses
Use batch processing - Significant throughput gains for multiple prompts
Choose the right agent - ConstrainedAgent for reliability, ReActAgent for flexibility

Next Steps¶

Explore the API Reference for detailed function signatures
Try the Cookbook for practical patterns and recipes
Check the GitHub repository for examples and updates