inferna overview¶

inferna is a zero-dependency Python library for local LLM inference which uses nanobind to wrap the following high-performance inference engines:

llama.cpp: text-to-text, text-to-speech and multimodel
whisper.cpp: automatic speech recognition
stable-diffusion.cpp: text-to-image and text-to-video

Core Features¶

High-level API - complete(), chat(), LLM class for quick prototyping
Low-level API - Direct access to llama.cpp, whisper.cpp, and stable-diffusion.cpp internals
Streaming - Token-by-token output with callbacks
Batch processing - Process multiple prompts 3-10x faster
GPU acceleration - Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform) backends
Memory tools - Estimate GPU layers and VRAM usage
OpenAI-compatible servers - EmbeddedServer (C/Mongoose) and PythonServer implementations

Agent Framework¶

ReActAgent - Reasoning + Acting with tool calling
ConstrainedAgent - Grammar-enforced tool calls (100% valid output)
ContractAgent - Pre/post conditions on tools (C++26-inspired contracts)

Additional Capabilities¶

Speculative decoding - 2-3x speedup with draft models
GGUF utilities - Read/write model metadata
JSON schema grammars - Structured output generation

Integrations¶

OpenAI-compatible API - Drop-in client replacement
LangChain - Full LLM interface implementation
ACP/MCP support - Agent and Model Context Protocols

Architecture¶

Inferna is structured as a layered stack. At the bottom, three C/C++ inference engines handle the heavy computation. nanobind bindings (_*_native.cpp files) expose these engines to Python with minimal overhead, fronted by thin Python facade modules (llama_cpp.py, whisper_cpp.py, stable_diffusion.py) that callers import. On top of the bindings, a high-level API provides simple functions like complete() and chat(), while framework modules (agents, RAG, servers, integrations) compose these primitives into higher-level capabilities.

Architecture Diagram

Layer Breakdown¶

Layer	Components	Role
High-Level API	`api.py`, `batching.py`, `memory.py`	Simple Python interface for generation, batch processing, and memory estimation
Frameworks	`agents/`, `rag/`, `integrations/`, `llama/server/`	ReAct/Constrained/Contract agents, RAG pipeline, OpenAI/LangChain compatibility, HTTP servers
Native Bindings	`_llama_native.cpp` (+ `_mtmd`/`_tts`/`_enums` companion TUs), `_whisper_native.cpp`, `_sd_native.cpp`	Direct nanobind C++ bindings reading upstream headers directly; includes multimodal, TTS, and enum constants
Python Facades	`llama_cpp.py`, `whisper_cpp.py`, `stable_diffusion.py`, `embedded.py`	Re-export native classes + integer enum constants + pure-Python helpers (downloads, n-gram cache, speculative decoder) under stable import paths
C/C++ Engines	llama.cpp, whisper.cpp, stable-diffusion.cpp	Core inference: text generation, speech recognition, image generation
Hardware Backends	Metal, CUDA, Vulkan, CPU	GPU/CPU acceleration selected at build time

Data Flow¶

User calls a high-level function (e.g., complete("prompt", model_path="model.gguf"))
The API layer loads the model via the nanobind bindings, which allocate C++ context objects
Tokens are sampled in C++ and streamed back through nanobind to Python callbacks
Framework modules (agents, RAG) orchestrate multiple calls to the API layer, adding tool use, retrieval, or structured output on top

Key Design Decisions¶

nanobind over Cython/ctypes/pybind11: nanobind reads the upstream C/C++ headers directly (no parallel .pxd declarations to keep in sync with llama.cpp/whisper.cpp/sd.cpp bumps), gives idiomatic C++ ergonomics (std::vector, std::optional, lambdas, RAII), and produces ~6% smaller wheels than the prior Cython build. Each upstream is bound via a primary _<name>_native.cpp plus optional companion TUs for sub-areas (multimodal, TTS, enum constants), all linked into one extension module per upstream. See NANOBIND.md for the migration post-mortem.
Zero Python dependencies: The core library has no runtime dependencies beyond Python itself. Optional integrations (LangChain, OpenAI compat) import lazily.
Dual server strategy: EmbeddedServer runs an in-process Mongoose HTTP server (vendored in src/inferna/llama/server/mongoose.{c,h} and exposed through nanobind) and serves the upstream llama-server chat web UI alongside the OpenAI-compatible JSON API; PythonServer offers a pure-Python http.server alternative without the web UI for debugging or wheel-less environments. Both share the same ServerConfig and JSON endpoint surface.

Quick Example¶

from inferna import complete

response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7
)
print(response)

Requirements¶

Python 3.10+
macOS, Linux, or Windows
GGUF model files (download from HuggingFace)

repo: https://github.com/shakfu/inferna