Inferna API Reference¶

Version: 0.2.11 Date: April 2026

Complete API reference for inferna, a high-performance Python library for LLM inference built on llama.cpp.

Table of Contents¶

High-Level Generation API
Async API
Framework Integrations
Memory Utilities
Core llama.cpp API
Advanced Features
Server Implementations
Multimodal Support
Whisper Integration
Stable Diffusion Integration

High-Level Generation API¶

The high-level API provides simple, Pythonic functions and classes for text generation.

`complete()`¶

One-shot text generation function.

def complete(
    prompt: str,
    model_path: str,
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    **kwargs
) -> Response | Iterator[str]

Parameters:

prompt (str): Input text prompt
model_path (str): Path to GGUF model file
config (GenerationConfig, optional): Generation configuration object
stream (bool): If True, return iterator of text chunks
**kwargs: Override config parameters (temperature, max_tokens, etc.)

Returns:

Response: Response object with text and stats (if stream=False)
Iterator[str]: Iterator of text chunks (if stream=True)

Example:

from inferna import complete

response = complete(
    "What is Python?",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)

# Streaming
for chunk in complete("Tell me a story", model_path="models/llama.gguf", stream=True):
    print(chunk, end="", flush=True)

`chat()`¶

Chat-style generation with message history. Automatically applies the model's built-in chat template.

def chat(
    messages: List[Dict[str, str]],
    model_path: str,
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    template: Optional[str] = None,
    **kwargs
) -> str | Iterator[str]

Parameters:

messages (List[Dict]): List of message dicts with 'role' and 'content' keys
model_path (str): Path to GGUF model file
config (GenerationConfig, optional): Generation configuration
stream (bool): Enable streaming output
template (str, optional): Chat template name to use. If None, uses model's default.
**kwargs: Override config parameters

Returns:

Response: Response object with text and stats (if stream=False)
Iterator[str]: Iterator of text chunks (if stream=True)

Example:

from inferna import chat

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]

response = chat(messages, model_path="models/llama.gguf")

# With explicit template
response = chat(messages, model_path="models/llama.gguf", template="chatml")

`apply_chat_template()`¶

Apply a chat template to format messages into a prompt string.

def apply_chat_template(
    messages: List[Dict[str, str]],
    model_path: str,
    template: Optional[str] = None,
    add_generation_prompt: bool = True,
    verbose: bool = False,
) -> str

Parameters:

messages (List[Dict]): List of message dicts with 'role' and 'content' keys
model_path (str): Path to GGUF model file
template (str, optional): Template name or string. If None, uses model's default.
add_generation_prompt (bool): Add assistant prompt prefix (default: True)
verbose (bool): Enable detailed logging

Returns:

str: Formatted prompt string

Supported Templates:

llama2, llama3, llama4
chatml (Qwen, Yi, etc.)
mistral-v1, mistral-v3, mistral-v7
phi3, phi4
deepseek, deepseek2, deepseek3
gemma, falcon3, command-r, vicuna, zephyr, and more

Example:

from inferna.api import apply_chat_template

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
]

prompt = apply_chat_template(messages, "models/llama.gguf")
print(prompt)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>
# Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

`get_chat_template()`¶

Get the chat template string from a model.

def get_chat_template(
    model_path: str,
    template_name: Optional[str] = None
) -> str

Parameters:

model_path (str): Path to GGUF model file
template_name (str, optional): Specific template name to retrieve

Returns:

str: Template string (Jinja-style), or empty string if not found

Example:

from inferna.api import get_chat_template

template = get_chat_template("models/llama.gguf")
print(template)  # Shows the Jinja-style template

`Response` Class¶

Structured response object returned by generation functions.

@dataclass
class Response:
    text: str                           # Generated text content
    stats: Optional[GenerationStats]    # Generation statistics
    finish_reason: str = "stop"         # Why generation stopped
    model: str = ""                     # Model path used

Attributes:

text (str): The generated text content
stats (GenerationStats, optional): Statistics including timing and token counts
finish_reason (str): Reason for completion ("stop", "length", etc.)
model (str): Path to the model used

String Compatibility:

Response implements the string protocol for backward compatibility:

str(response) returns response.text
response == "string" compares with text
len(response) returns text length
for char in response: iterates over text characters
"substring" in response checks text containment
response + " more" concatenates text

Methods:

`to_dict()`¶

Convert response to dictionary.

def to_dict(self) -> Dict[str, Any]

`to_json()`¶

Convert response to JSON string.

def to_json(self, indent: Optional[int] = None) -> str

Example:

from inferna import complete

response = complete("What is Python?", model_path="model.gguf")

# Use as string (backward compatible)
print(response)  # Prints text
if "programming" in response:
    print("Mentioned programming!")

# Access structured data
print(f"Finish reason: {response.finish_reason}")
if response.stats:
    print(f"Tokens/sec: {response.stats.tokens_per_second:.1f}")

# Serialize
data = response.to_dict()
json_str = response.to_json(indent=2)

`GenerationStats` Class¶

Statistics from a generation run.

@dataclass
class GenerationStats:
    prompt_tokens: int       # Number of tokens in prompt
    generated_tokens: int    # Number of tokens generated
    total_time: float        # Total generation time (seconds)
    tokens_per_second: float # Generation speed
    prompt_time: float       # Time for prompt processing
    generation_time: float   # Time for token generation

`LLM` Class¶

Reusable generator with model caching for improved performance.

class LLM:
    def __init__(
        self,
        model_path: str,
        config: Optional[GenerationConfig] = None,
        verbose: bool = False
    )

Parameters:

model_path (str): Path to GGUF model file
config (GenerationConfig, optional): Default generation configuration
verbose (bool): Print detailed information during generation

Methods:

`call()`¶

Generate text from a prompt.

def __call__(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    on_token: Optional[Callable[[str], None]] = None
) -> Response | Iterator[str]

Parameters:

prompt (str): Input text
config (GenerationConfig, optional): Override instance config
stream (bool): Enable streaming
on_token (Callable, optional): Callback for each token

Returns:

Response: Response object with text and stats (if stream=False)
Iterator[str]: Iterator of text chunks (if stream=True)

`chat()`¶

Generate a response from chat messages using the model's chat template.

def chat(
    self,
    messages: List[Dict[str, str]],
    config: Optional[GenerationConfig] = None,
    stream: bool = False,
    template: Optional[str] = None
) -> str | Iterator[str]

Parameters:

messages (List[Dict]): List of message dicts with 'role' and 'content' keys
config (GenerationConfig, optional): Override instance config
stream (bool): Enable streaming
template (str, optional): Chat template name to use

`get_chat_template()`¶

Get the chat template string from the loaded model.

def get_chat_template(
    self,
    template_name: Optional[str] = None
) -> str

Example:

from inferna import LLM, GenerationConfig

gen = LLM("models/llama.gguf")

# Simple generation
response = gen("What is Python?")

# With custom config
config = GenerationConfig(temperature=0.9, max_tokens=100)
response = gen("Tell me a joke", config=config)

# With statistics
response, stats = gen.generate_with_stats("Question?")
print(f"Generated {stats.generated_tokens} tokens in {stats.total_time:.2f}s")
print(f"Speed: {stats.tokens_per_second:.2f} tokens/sec")

# Chat with template
messages = [{"role": "user", "content": "Hello!"}]
response = gen.chat(messages)

# Get template
template = gen.get_chat_template()

MCP client methods¶

Since 0.2.11 LLM can attach to Model Context Protocol servers and drive a tool-calling loop against their tools:

def add_mcp_server(
    self,
    name: str,
    *,
    command: Optional[str] = None,
    args: Optional[list[str]] = None,
    env: Optional[dict[str, str]] = None,
    cwd: Optional[str] = None,
    url: Optional[str] = None,
    headers: Optional[dict[str, str]] = None,
    transport: Optional["McpTransportType"] = None,
    request_timeout: Optional[float] = None,
    shutdown_timeout: Optional[float] = None,
) -> None
def remove_mcp_server(self, name: str) -> None
def list_mcp_tools(self) -> list["McpTool"]
def list_mcp_resources(self) -> list["McpResource"]
def call_mcp_tool(self, name: str, arguments: dict) -> Any
def read_mcp_resource(self, uri: str) -> str
def chat_with_tools(
    self,
    messages: list[dict],
    *,
    tools: Optional[list["Tool"]] = None,
    use_mcp: bool = True,
    max_iterations: int = 8,
    verbose: bool = False,
    system_prompt: Optional[str] = None,
    generation_config: Optional[GenerationConfig] = None,
) -> str

See MCP Client for stdio/HTTP quick-start, per-method semantics, and examples of mixing local Tools with MCP tools.

`GenerationConfig` Dataclass¶

Configuration for text generation.

@dataclass
class GenerationConfig:
    max_tokens: int = 512
    temperature: float = 0.8
    top_k: int = 40
    top_p: float = 0.95
    min_p: float = 0.05
    repeat_penalty: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    penalty_last_n: int = 64
    mirostat: int = 0
    mirostat_tau: float = 5.0
    mirostat_eta: float = 0.1
    typical_p: float = 1.0
    typical_min_keep: int = 1
    xtc_probability: float = 0.0
    xtc_threshold: float = 0.1
    dynatemp_range: float = 0.0
    dynatemp_exponent: float = 1.0
    logit_bias: Optional[Dict[int, float]] = None
    n_gpu_layers: int = -1
    n_ctx: Optional[int] = None
    n_batch: int = 2048
    seed: int = 0xFFFFFFFF
    stop_sequences: List[str] = field(default_factory=list)
    add_bos: bool = True
    parse_special: bool = True

Attributes:

max_tokens: Maximum tokens to generate (default: 512)
temperature: Sampling temperature, 0.0 = greedy (default: 0.8)
top_k: Top-k sampling parameter (default: 40)
top_p: Top-p (nucleus) sampling (default: 0.95)
min_p: Minimum probability threshold (default: 0.05)
repeat_penalty: Penalty for repeating tokens (default: 1.0, disabled)
frequency_penalty: OpenAI-style frequency penalty applied to the most recent penalty_last_n tokens (default: 0.0, disabled)
presence_penalty: OpenAI-style presence penalty applied to the most recent penalty_last_n tokens (default: 0.0, disabled)
penalty_last_n: Number of recent tokens considered by the penalty samplers. 0 = disabled, -1 = full context window (default: 64)
mirostat: Mirostat sampling mode. 0 = off, 1 = v1, 2 = v2. When non-zero, the top-k/top-p/min-p/dist tail of the chain is replaced with temp -> mirostat[_v2] (default: 0)
mirostat_tau: Mirostat target entropy (default: 5.0)
mirostat_eta: Mirostat learning rate (default: 0.1)
typical_p: Locally-typical sampling threshold. 1.0 = disabled (default: 1.0)
typical_min_keep: Minimum tokens kept after typical truncation (default: 1)
xtc_probability: Probability of applying XTC ("Exclude Top Choices") truncation. 0.0 = disabled (default: 0.0)
xtc_threshold: Probability cutoff above which top tokens become candidates for XTC removal (default: 0.1)
dynatemp_range: Dynamic temperature range. 0.0 = use plain temperature; > 0 swaps add_temp for add_temp_ext (default: 0.0)
dynatemp_exponent: Dynamic temperature exponent (default: 1.0)
logit_bias: Optional {token_id: bias} map applied to the raw logits before any sampler stage. None = no bias. Matches the OpenAI logit_bias shape (default: None)
n_gpu_layers: GPU layers to offload (default: -1 = all)
n_ctx: Context window size, None = auto (default: None)
n_batch: Batch size for processing (default: 2048)
seed: Random seed (default: 0xFFFFFFFF sentinel = let llama.cpp pick a random seed)
stop_sequences: Strings that stop generation (default: [])
add_bos: Add beginning-of-sequence token (default: True)
parse_special: Parse special tokens in prompt (default: True)

`GenerationStats` Dataclass¶

Statistics from a generation run.

@dataclass
class GenerationStats:
    prompt_tokens: int
    generated_tokens: int
    total_time: float
    tokens_per_second: float
    prompt_time: float = 0.0
    generation_time: float = 0.0

Async API¶

The async API provides non-blocking generation for use in async applications (FastAPI, aiohttp, etc.).

`AsyncLLM` Class¶

Async wrapper around the LLM class for non-blocking text generation.

class AsyncLLM:
    def __init__(
        self,
        model_path: str,
        config: Optional[GenerationConfig] = None,
        verbose: bool = False,
        **kwargs
    )

Parameters:

model_path (str): Path to GGUF model file
config (GenerationConfig, optional): Generation configuration
verbose (bool): Print detailed information during generation
**kwargs: Generation parameters (temperature, max_tokens, etc.)

Methods:

`call()` / `generate()`¶

Generate text asynchronously.

async def __call__(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None,
    **kwargs
) -> str

`stream()`¶

Stream generated text chunks asynchronously.

async def stream(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None,
    **kwargs
) -> AsyncIterator[str]

`generate_with_stats()`¶

Generate text and return statistics.

async def generate_with_stats(
    self,
    prompt: str,
    config: Optional[GenerationConfig] = None
) -> Tuple[str, GenerationStats]

Example:

import asyncio
from inferna import AsyncLLM

async def main():
    # Context manager ensures cleanup
    async with AsyncLLM("model.gguf", temperature=0.7) as llm:
        # Simple generation
        response = await llm("What is Python?")
        print(response)

        # Streaming
        async for chunk in llm.stream("Tell me a story"):
            print(chunk, end="", flush=True)

        # With stats
        text, stats = await llm.generate_with_stats("Question?")
        print(f"Generated {stats.generated_tokens} tokens")

asyncio.run(main())

`complete_async()`¶

Async convenience function for one-off text completion.

async def complete_async(
    prompt: str,
    model_path: str,
    config: Optional[GenerationConfig] = None,
    verbose: bool = False,
    **kwargs
) -> str

Example:

response = await complete_async(
    "What is Python?",
    model_path="model.gguf",
    temperature=0.7
)

`chat_async()`¶

Async convenience function for chat-style generation.

async def chat_async(
    messages: List[Dict[str, str]],
    model_path: str,
    config: Optional[GenerationConfig] = None,
    verbose: bool = False,
    **kwargs
) -> str

Example:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = await chat_async(messages, model_path="model.gguf")

`stream_complete_async()`¶

Async streaming completion for one-off use.

async def stream_complete_async(
    prompt: str,
    model_path: str,
    config: Optional[GenerationConfig] = None,
    verbose: bool = False,
    **kwargs
) -> AsyncIterator[str]

Example:

async for chunk in stream_complete_async("Tell me a story", "model.gguf"):
    print(chunk, end="", flush=True)

Framework Integrations¶

OpenAI-Compatible API¶

Drop-in replacement for OpenAI Python client.

`OpenAICompatibleClient` Class¶

from inferna.integrations.openai_compat import OpenAICompatibleClient

class OpenAICompatibleClient:
    def __init__(
        self,
        model_path: str,
        temperature: float = 0.7,
        max_tokens: int = 512,
        n_gpu_layers: int = -1
    )

Attributes:

chat: Chat completions interface

Example:

from inferna.integrations.openai_compat import OpenAICompatibleClient

client = OpenAICompatibleClient(model_path="models/llama.gguf")

response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

LangChain Integration¶

Full LangChain LLM interface implementation.

`InfernaLLM` Class¶

from inferna.integrations import InfernaLLM

class InfernaLLM(LLM):
    model_path: str
    temperature: float = 0.7
    max_tokens: int = 512
    top_k: int = 40
    top_p: float = 0.95
    repeat_penalty: float = 1.0
    n_gpu_layers: int = -1

Example:

from inferna.integrations import InfernaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = InfernaLLM(model_path="models/llama.gguf", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms:"
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="quantum computing")

# With streaming
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = InfernaLLM(
    model_path="models/llama.gguf",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

Memory Utilities¶

Tools for estimating and optimizing GPU memory usage.

`estimate_gpu_layers()`¶

Estimate optimal number of GPU layers for available VRAM.

def estimate_gpu_layers(
    model_path: str,
    available_vram_mb: int,
    n_ctx: int = 2048,
    n_batch: int = 512
) -> MemoryEstimate

Parameters:

model_path (str): Path to GGUF model file
available_vram_mb (int): Available VRAM in megabytes
n_ctx (int): Context window size
n_batch (int): Batch size

Returns:

MemoryEstimate: Object with recommended settings

Example:

from inferna import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="models/llama.gguf",
    available_vram_mb=8000,  # 8GB VRAM
    n_ctx=2048
)

print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
print(f"Estimated VRAM usage: {estimate.vram / 1024 / 1024:.2f} MB")

`estimate_memory_usage()`¶

Estimate total memory requirements for model loading.

def estimate_memory_usage(
    model_path: str,
    n_ctx: int = 2048,
    n_batch: int = 512,
    n_gpu_layers: int = 0
) -> MemoryEstimate

`MemoryEstimate` Dataclass¶

Memory estimation results.

@dataclass
class MemoryEstimate:
    layers: int                          # Total layers
    graph_size: int                      # Computation graph size
    vram: int                            # VRAM usage (bytes)
    vram_kv: int                         # KV cache VRAM (bytes)
    total_size: int                      # Total memory (bytes)
    tensor_split: Optional[List[int]]    # Multi-GPU split

Core llama.cpp API¶

Low-level nanobind wrappers for direct llama.cpp access.

Core Classes¶

`LlamaModel`¶

Represents a loaded GGUF model.

from inferna.llama.llama_cpp import LlamaModel, LlamaModelParams

params = LlamaModelParams()
params.n_gpu_layers = -1
params.use_mmap = True
params.use_mlock = False

model = LlamaModel("models/llama.gguf", params)

# Properties
print(model.n_params)      # Total parameters
print(model.n_layers)      # Number of layers
print(model.n_embd)        # Embedding dimension
print(model.n_vocab)       # Vocabulary size

# Methods
vocab = model.get_vocab()  # Get vocabulary
model.free()               # Free resources

`LlamaContext`¶

Inference context for model.

from inferna.llama.llama_cpp import LlamaContext, LlamaContextParams

ctx_params = LlamaContextParams()
ctx_params.n_ctx = 2048
ctx_params.n_batch = 512
ctx_params.n_threads = 4
ctx_params.n_threads_batch = 4

ctx = LlamaContext(model, ctx_params)

# Decode batch
from inferna.llama.llama_cpp import llama_batch_get_one
batch = llama_batch_get_one(tokens)
ctx.decode(batch)

# KV cache management
ctx.kv_cache_clear()
ctx.kv_cache_seq_rm(seq_id, p0, p1)
ctx.kv_cache_seq_add(seq_id, p0, p1, delta)

# Performance
ctx.print_perf_data()

`LlamaSampler`¶

Sampling strategies for token generation.

from inferna.llama.llama_cpp import LlamaSampler, LlamaSamplerChainParams

sampler_params = LlamaSamplerChainParams()
sampler = LlamaSampler(sampler_params)

# Add sampling methods
sampler.add_top_k(40)
sampler.add_top_p(0.95, 1)
sampler.add_temp(0.7)
sampler.add_dist(seed)

# Sample token
token_id = sampler.sample(ctx, idx)

# Reset state
sampler.reset()

`LlamaVocab`¶

Vocabulary and tokenization.

vocab = model.get_vocab()

# Tokenization
tokens = vocab.tokenize("Hello world", add_special=True, parse_special=True)

# Detokenization
text = vocab.detokenize(tokens)
piece = vocab.token_to_piece(token_id, special=True)

# Special tokens
print(vocab.bos)           # Begin-of-sequence token
print(vocab.eos)           # End-of-sequence token
print(vocab.eot)           # End-of-turn token
print(vocab.n_vocab)       # Vocabulary size

# Check token types
is_eog = vocab.is_eog(token_id)
is_control = vocab.is_control(token_id)

`LlamaBatch`¶

Efficient batch processing.

from inferna.llama.llama_cpp import LlamaBatch

# Create batch
batch = LlamaBatch(n_tokens=512, embd=0, n_seq_max=1)

# Add token
batch.add(token_id, pos, seq_ids=[0], logits=True)

# Clear batch
batch.clear()

# Convenience function
from inferna.llama.llama_cpp import llama_batch_get_one
batch = llama_batch_get_one(tokens, pos_offset=0)

Backend Management¶

from inferna.llama.llama_cpp import (
    ggml_backend_load_all,
    ggml_backend_offload_supported,
    ggml_backend_metal_set_n_cb
)

# Load all available backends (Metal, CUDA, etc.)
ggml_backend_load_all()

# Check GPU support
if ggml_backend_offload_supported():
    print("GPU offload supported")

# Configure Metal (macOS)
ggml_backend_metal_set_n_cb(2)  # Number of command buffers

Advanced Features¶

GGUF File Manipulation¶

Inspect and modify GGUF model files.

`GGUFContext` Class¶

from inferna.llama.llama_cpp import GGUFContext

# Read existing file
ctx = GGUFContext.from_file("model.gguf")

# Get metadata
metadata = ctx.get_all_metadata()
print(metadata['general.architecture'])
print(metadata['general.name'])

value = ctx.get_val_str("general.architecture")

# Create new file
ctx = GGUFContext.empty()
ctx.set_val_str("custom.key", "value")
ctx.set_val_u32("custom.number", 42)
ctx.write_to_file("custom.gguf", write_tensors=False)

# Modify existing
ctx = GGUFContext.from_file("model.gguf")
ctx.set_val_str("custom.metadata", "updated")
ctx.write_to_file("modified.gguf")

JSON Schema to Grammar¶

Convert JSON schemas to llama.cpp grammar format for structured output. This is implemented in pure Python (vendored from llama.cpp) with no C++ dependency.

from inferna.llama.llama_cpp import json_schema_to_grammar

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    },
    "required": ["name", "age"]
}

grammar = json_schema_to_grammar(schema)

# Use with generation
from inferna.llama.llama_cpp import LlamaSampler
sampler = LlamaSampler()
sampler.add_grammar(grammar)

Model Download¶

Download models from HuggingFace with Ollama-style tags.

from inferna.llama.llama_cpp import download_model, list_cached_models

# Download from HuggingFace
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:q4",
    cache_dir="~/.cache/inferna/models"
)

# List cached models
models = list_cached_models()
for model in models:
    print(f"{model['user']}/{model['model']}:{model['tag']}")
    print(f"  Path: {model['path']}")
    print(f"  Size: {model['size'] / 1024 / 1024:.2f} MB")

# Direct URL download
download_model(
    url="https://example.com/model.gguf",
    output_path="models/custom.gguf"
)

N-gram Cache¶

Pattern-based token prediction for 2-10x speedup on repetitive text.

from inferna.llama.llama_cpp import NgramCache

# Create cache
cache = NgramCache()

# Learn patterns from token sequences
tokens = [1, 2, 3, 4, 5, 6, 7, 8]
cache.update(tokens, ngram_min=2, ngram_max=4)

# Predict likely continuations
input_tokens = [1, 2, 3]
draft_tokens = cache.draft(input_tokens, n_draft=16)

# Save/load cache
cache.save("patterns.bin")
loaded_cache = NgramCache.from_file("patterns.bin")

# Clear cache
cache.clear()

Speculative Decoding¶

Use draft model for 2-3x inference speedup.

from inferna.llama.llama_cpp import (
    LlamaModel, LlamaContext, LlamaModelParams, LlamaContextParams,
    Speculative, SpeculativeParams
)

# Load target and draft models
model_target = LlamaModel("models/large.gguf", LlamaModelParams())
model_draft = LlamaModel("models/small.gguf", LlamaModelParams())

ctx_params = LlamaContextParams()
ctx_params.n_ctx = 2048

ctx_target = LlamaContext(model_target, ctx_params)

# Configure speculative parameters
params = SpeculativeParams(
    n_max=16,        # Maximum number of draft tokens
    n_reuse=8,       # Tokens to reuse
    p_min=0.75       # Minimum acceptance probability
)

# Create speculative decoding instance
spec = Speculative(params, ctx_target)

# Check compatibility
if spec.is_compat():
    print("Models are compatible for speculative decoding")

    # Begin a speculative decoding round
    spec.begin()

    # Generate draft tokens
    prompt_tokens = [1, 2, 3]
    last_token = prompt_tokens[-1]
    draft_tokens = spec.draft(prompt_tokens, last_token)

    # Accept verified tokens
    spec.accept()

    # Print performance statistics
    spec.print_stats()

Parameters:

n_max: Maximum number of tokens to draft (default: 16)
n_reuse: Number of tokens to reuse from previous draft (default: 8)
p_min: Minimum acceptance probability (default: 0.75)

Methods:

Method	Description
`is_compat()`	Check if target and draft models are compatible
`begin()`	Begin a speculative decoding round
`draft(...)`	Generate draft tokens from the draft model
`accept()`	Accept verified tokens after evaluation
`print_stats()`	Print speculative decoding performance statistics

Server Implementations¶

Three OpenAI-compatible server implementations.

Embedded Server (C/Mongoose) — recommended¶

Mongoose-backed HTTP server with built-in chat web UI and SSE streaming. Uses Python worker threads for token generation so streamed tokens flush to the wire as they're produced. Configured via ServerConfig.

from inferna.llama.server.python import ServerConfig
from inferna.llama.server.embedded import EmbeddedServer, start_embedded_server

# Convenience helper — builds the config and starts the server
server = start_embedded_server(
    model_path="models/llama.gguf",
    host="127.0.0.1",
    port=8080,
    n_ctx=2048,
    n_gpu_layers=-1,
    n_parallel=2,
    model_alias="my-llama",  # shown in the web UI's Model field
)
# Server is now accepting requests; point a browser at http://127.0.0.1:8080/
# for the chat UI, or use any OpenAI-compatible client against /v1/...
server.wait_for_shutdown()  # blocks until SIGINT/SIGTERM
server.stop()

Or build the config explicitly:

config = ServerConfig(
    model_path="models/llama.gguf",
    host="127.0.0.1",
    port=8080,
    n_ctx=2048,
    n_parallel=2,
    embedding=True,                      # enables /v1/embeddings
    embedding_model_path="models/bge-small-en-v1.5-q8_0.gguf",
)

with EmbeddedServer(config) as server:
    server.wait_for_shutdown()

Python Server (fallback)¶

Pure-Python server using stdlib http.server. Same /v1/... JSON API as EmbeddedServer but no web UI and no SSE worker-thread fan-out.

from inferna.llama.server.python import ServerConfig, PythonServer, start_python_server

# Convenience helper
server = start_python_server(model_path="models/llama.gguf", port=8080)
# server runs in a background thread; main thread is free to do other work

# Or as a context manager
with PythonServer(ServerConfig(model_path="models/llama.gguf")) as server:
    import time
    while True:
        time.sleep(1)

Using the server with the OpenAI client¶

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="my-llama",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

LlamaServer (subprocess wrapper)¶

Manages an external llama-server binary as a child process — useful if you want llama.cpp's reference server (e.g. for features inferna's embedded server doesn't yet expose) but want lifecycle management from Python.

from inferna.llama.server.launcher import LlamaServer, LauncherServerConfig

config = LauncherServerConfig(
    model_path="models/llama.gguf",
    host="127.0.0.1",
    port=8080,
)
server = LlamaServer(config, server_binary="bin/llama-server")
server.start()
if server.is_running():
    print("running")
server.stop()

Multimodal Support¶

LLAVA and other vision-language models.

from inferna.llama.mtmd.multimodal import (
    LlavaImageEmbed,
    load_mmproj,
    process_image
)

# Load multimodal projector
mmproj = load_mmproj("models/mmproj.gguf")

# Process image
image_embed = process_image(
    ctx=ctx,
    image_path="image.jpg",
    mmproj=mmproj
)

# Use in generation
# Image embeddings are automatically integrated into context

Whisper Integration¶

Speech-to-text transcription using whisper.cpp. See Whisper.cpp Integration for complete documentation.

Quick Start¶

from inferna.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model
ctx = WhisperContext("models/ggml-base.en.bin")

# Audio must be 16kHz mono float32
samples = load_audio_as_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
params.language = "en"
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    t0 = ctx.full_get_segment_t0(i) / 100.0  # centiseconds to seconds
    t1 = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{t0:.2f}s - {t1:.2f}s] {text}")

Key Classes¶

Class	Description
`WhisperContext`	Main context for model loading and inference
`WhisperContextParams`	Configuration for context creation
`WhisperFullParams`	Configuration for transcription
`WhisperVadParams`	Voice activity detection parameters

WhisperContext Methods¶

Method	Description
`full(samples, params)`	Run transcription on float32 audio samples
`full_n_segments()`	Get number of transcribed segments
`full_get_segment_text(i)`	Get text of segment i
`full_get_segment_t0(i)`	Get start time (centiseconds)
`full_get_segment_t1(i)`	Get end time (centiseconds)
`full_lang_id()`	Get detected language ID
`is_multilingual()`	Check if model supports multiple languages

Audio Requirements¶

Sample rate: 16000 Hz
Channels: Mono
Format: Float32 normalized to [-1.0, 1.0]

Stable Diffusion Integration¶

Image generation using stable-diffusion.cpp. Supports SD 1.x/2.x, SDXL, SD3, FLUX, video generation (Wan/CogVideoX), and ESRGAN upscaling.

Note: Build with WITH_STABLEDIFFUSION=1 to enable this module.

The module is exposed as inferna.sd (CLI: python -m inferna.sd). For broader narrative documentation, see docs/stable_diffusion.md; this section is the API reference.

Quick Start¶

from inferna.sd import text_to_image

# Simple text-to-image generation
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)

# text_to_image returns a single SDImage; text_to_images returns a List[SDImage]
image.save("output.png")

`text_to_image()`¶

Convenience function that creates a context, generates one image, and tears the context down. Returns a single SDImage. For batches use text_to_images().

def text_to_image(
    model_path: str,
    prompt: str,
    negative_prompt: str = "",
    width: int = 512,
    height: int = 512,
    seed: int = -1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    sample_method: SampleMethod = SampleMethod.COUNT,
    scheduler: Scheduler = Scheduler.COUNT,
    n_threads: int = -1,
    vae_path: Optional[str] = None,
    taesd_path: Optional[str] = None,
    clip_l_path: Optional[str] = None,
    clip_g_path: Optional[str] = None,
    t5xxl_path: Optional[str] = None,
    control_net_path: Optional[str] = None,
    clip_skip: int = -1,
    eta: float = float('inf'),
    slg_scale: float = 0.0,
    vae_tiling: bool = False,
    hires_fix: bool = False,
    hires_scale: float = 2.0,
    offload_to_cpu: bool = False,
    keep_clip_on_cpu: bool = False,
    keep_vae_on_cpu: bool = False,
    diffusion_flash_attn: bool = False
) -> SDImage

SampleMethod.COUNT and Scheduler.COUNT are auto-detect sentinels — the C library picks based on the loaded model. eta=float('inf') resolves to a method-specific default. hires_fix=True enables hires-fix two-pass generation with default latent upscale; for finer control use SDImageGenParams.set_hires_fix(...).

`text_to_images()`¶

Same as text_to_image() but returns List[SDImage] and accepts batch_count: int = 1. Each image in the batch uses an incremented seed, producing variants of the same prompt.

`image_to_image()`¶

Img2img convenience function. Note: builds a context with vae_decode_only=False so the encoder is available.

def image_to_image(
    model_path: str,
    init_image: Union[SDImage, str],
    prompt: str,
    negative_prompt: str = "",
    strength: float = 0.75,
    seed: int = -1,
    sample_steps: int = 20,
    cfg_scale: float = 7.0,
    sample_method: SampleMethod = SampleMethod.COUNT,
    scheduler: Scheduler = Scheduler.COUNT,
    n_threads: int = -1,
    vae_path: Optional[str] = None,
    clip_skip: int = -1
) -> List[SDImage]

init_image accepts either an SDImage or a filesystem path; output dimensions are taken from the init image.

`SDContext`¶

Persistent generation context — load the model once, generate many times.

from inferna.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

with SDContext(params) as ctx:
    images = ctx.generate(
        prompt="a beautiful landscape",
        negative_prompt="blurry, ugly",
        width=512, height=512,
        sample_steps=4, cfg_scale=1.0,
        sample_method=SampleMethod.EULER,  # or COUNT for auto-detect
        scheduler=Scheduler.DISCRETE,
        hires_fix=False,
    )

SDContext.generate(...) accepts the same kwargs as text_to_image() plus batch_count, init_image, mask_image, control_image, control_strength, strength, and flow_shift. Returns List[SDImage].

Properties:

is_valid (bool): Context loaded successfully.
supports_image_generation (bool): Model can run generate() (false for video-only models).
supports_video_generation (bool): Model can run generate_video().

Methods:

generate(**kwargs) -> List[SDImage]: Text/img2img/inpaint/ControlNet generation.
generate_with_params(params: SDImageGenParams) -> List[SDImage]: Low-level entry point taking a fully populated params object — needed for advanced features (LoRAs, reference images, Photo Maker, hires-fix model upscalers, full cache configuration).
generate_video(**kwargs) -> List[SDImage]: Video frame generation (requires video-capable model).
default_sample_method(sample_method=None) -> SampleMethod: Model's preferred sampler.
default_scheduler(sample_method=None) -> Scheduler: Model's preferred scheduler.

`SDContextParams`¶

Configuration for model loading.

params = SDContextParams()
params.model_path = "model.gguf"          # Main model
params.vae_path = "vae.safetensors"       # Optional VAE
params.taesd_path = "taesd.safetensors"   # Optional TAESD (fast previews)
params.clip_l_path = "clip_l.safetensors" # Optional CLIP-L (SDXL/SD3)
params.clip_g_path = "clip_g.safetensors" # Optional CLIP-G (SDXL/SD3)
params.t5xxl_path = "t5xxl.safetensors"   # Optional T5-XXL (SD3/FLUX)
params.control_net_path = "cn.safetensors" # Optional ControlNet
params.n_threads = 4
params.vae_decode_only = True             # Set False for img2img
params.diffusion_flash_attn = False
params.offload_params_to_cpu = False      # Low-VRAM mode
params.keep_clip_on_cpu = False
params.keep_vae_on_cpu = False
params.wtype = SDType.COUNT               # COUNT = auto-detect
params.rng_type = RngType.CUDA

`SDImage`¶

Image wrapper with numpy and PIL integration.

from inferna.sd import SDImage
import numpy as np

arr = np.zeros((512, 512, 3), dtype=np.uint8)
img = SDImage.from_numpy(arr)

print(img.width, img.height, img.channels)

arr = img.to_numpy()       # (H, W, C) uint8
pil_img = img.to_pil()     # requires Pillow

img.save("output.png")
img = SDImage.load("input.png")

`SDImageGenParams`¶

Full generation parameters; pass to SDContext.generate_with_params(). The text_to_image() convenience function only exposes a curated subset — drop down to this class for LoRAs, reference images, Photo Maker, full cache control, hires-fix model upscalers, etc.

from inferna.sd import SDImageGenParams, SDImage, HiresUpscaler

params = SDImageGenParams()
params.prompt = "a cute cat"
params.negative_prompt = "ugly, blurry"
params.width = 512
params.height = 512
params.seed = 42
params.batch_count = 1
params.strength = 0.75           # For img2img
params.clip_skip = -1

# VAE tiling
params.vae_tiling_enabled = True
params.vae_tile_size = (512, 512)
params.vae_tile_overlap = 0.5

# Cache acceleration (legacy easycache_* aliases also available)
params.cache_mode = 1            # 0=disabled, 1=easycache, 2=ucache, 3=dbcache, 4=taylorseer, 5=cache_dit
params.cache_threshold = 0.1
params.cache_range = (0.0, 1.0)

# Hires-fix two-pass generation
params.set_hires_fix(
    enabled=True,
    upscaler=HiresUpscaler.LATENT,   # or LANCZOS, NEAREST, MODEL, ...
    scale=2.0,
    denoising_strength=0.7,
)
# ...individual setters also work:
# params.hires_enabled = True
# params.hires_target_size = (1024, 1024)
# params.hires_model_path = "/path/to/upscaler.gguf"  # required for HiresUpscaler.MODEL

# img2img / inpaint / ControlNet
params.set_init_image(SDImage.load("input.png"))
params.set_mask_image(SDImage.load("mask.png"))
params.set_control_image(control_img, strength=0.8)

# LoRAs and reference images
params.set_loras([{"path": "lora.safetensors", "multiplier": 0.8}])
params.set_ref_images([ref_img1, ref_img2])

# Sample params (delegated to nested SDSampleParams)
sample = params.sample_params
sample.sample_steps = 20
sample.cfg_scale = 7.0
sample.sample_method = SampleMethod.COUNT
sample.scheduler = Scheduler.COUNT

See docs/stable_diffusion.md for the full property catalog (Photo Maker, ControlNet refs, full cache configuration, all hires-fix fields).

`SDSampleParams`¶

Sampling configuration. Usually accessed as gen_params.sample_params rather than instantiated directly.

from inferna.sd import SDSampleParams, SampleMethod, Scheduler

params = SDSampleParams()
params.sample_method = SampleMethod.COUNT
params.scheduler = Scheduler.COUNT
params.sample_steps = 20
params.cfg_scale = 7.0
params.eta = float('inf')        # inf = method-specific default
params.slg_scale = 0.0           # Skip layer guidance
params.flow_shift = float('inf') # Flow shift (SD3.x / Wan)

`Upscaler`¶

ESRGAN-based image upscaling.

from inferna.sd import Upscaler, SDImage

upscaler = Upscaler(
    "models/esrgan-x4.bin",
    n_threads=4,
    offload_to_cpu=False,
    direct=False,         # direct convolution
    tile_size=0,          # 0 = default
)

print(f"Factor: {upscaler.upscale_factor}x")

img = SDImage.load("input.png")
upscaled = upscaler.upscale(img)               # use model's native factor
upscaled = upscaler.upscale(img, factor=2)     # or override
upscaled.save("upscaled.png")

Upscaler is also usable as a context manager (with Upscaler(...) as up:).

`convert_model()`¶

Convert models between formats / quantize.

from inferna.sd import convert_model, SDType

convert_model(
    input_path="sd-v1-5.safetensors",
    output_path="sd-v1-5-q4_0.gguf",
    output_type=SDType.Q4_0,
    vae_path="vae-ft-mse.safetensors",   # optional
    tensor_type_rules=None,              # optional per-tensor type rules
    convert_name=False,                  # convert tensor names
)

Raises FileNotFoundError if the input is missing, RuntimeError on conversion failure.

`canny_preprocess()`¶

Canny edge detection for ControlNet conditioning. Modifies the image in place.

from inferna.sd import SDImage, canny_preprocess

img = SDImage.load("photo.png")
success = canny_preprocess(
    img,
    high_threshold=0.8,
    low_threshold=0.1,
    weak=0.5,
    strong=1.0,
    inverse=False,
)

Callbacks¶

from inferna.sd import (
    set_log_callback,
    set_progress_callback,
    set_preview_callback,
    PreviewMode,
)

# Logging: callback receives (LogLevel, str)
def log_cb(level, text):
    print(f'[{level.name}] {text}', end='')
set_log_callback(log_cb)

# Progress: callback receives (step, total_steps, time_seconds)
def progress_cb(step, steps, time_s):
    pct = (step / steps) * 100 if steps > 0 else 0
    print(f'Step {step}/{steps} ({pct:.1f}%) - {time_s:.2f}s')
set_progress_callback(progress_cb)

# Preview: callback receives (step, frames: List[SDImage], is_noisy: bool)
def preview_cb(step, frames, is_noisy):
    for i, frame in enumerate(frames):
        frame.save(f"preview_{step}_{i}.png")
set_preview_callback(
    preview_cb,
    mode=PreviewMode.TAE,
    interval=5,
    denoised=True,
    noisy=False,
)

# Pass None to clear any of them.
set_log_callback(None)
set_progress_callback(None)
set_preview_callback(None)

Enums¶

SampleMethod

EULER, EULER_A, HEUN, DPM2, DPMPP2S_A, DPMPP2M, DPMPP2Mv2
IPNDM, IPNDM_V, LCM, DDIM_TRAILING, TCD
RES_MULTISTEP, RES_2S, ER_SDE
COUNT (auto-detect sentinel)

Scheduler

DISCRETE, KARRAS, EXPONENTIAL, AYS, GITS
SGM_UNIFORM, SIMPLE, SMOOTHSTEP, KL_OPTIMAL, LCM, BONG_TANGENT
COUNT (auto-detect sentinel)

Prediction

EPS, V, EDM_V, FLOW, FLUX_FLOW, FLUX2_FLOW, COUNT

SDType: Data types for model weights / quantization

F32, F16, BF16
Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1
Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K
COUNT (auto-detect sentinel)

RngType: STD_DEFAULT, CUDA, CPU

LogLevel: DEBUG, INFO, WARN, ERROR

PreviewMode: NONE, PROJ, TAE, VAE

LoraApplyMode: AUTO, IMMEDIATELY, AT_RUNTIME

HiresUpscaler: hires-fix upscaler modes

NONE
LATENT, LATENT_NEAREST, LATENT_NEAREST_EXACT, LATENT_ANTIALIASED, LATENT_BICUBIC, LATENT_BICUBIC_ANTIALIASED
LANCZOS, NEAREST
MODEL (external upscaler model — set hires_model_path)

Utility Functions¶

from inferna.sd import (
    get_num_cores,
    get_system_info,
    type_name,
    sample_method_name,
    scheduler_name,
    ggml_backend_load_all,
)

ggml_backend_load_all()  # call before get_system_info() so GPU backends register
print(f"CPU cores: {get_num_cores()}")
print(get_system_info())

print(type_name(SDType.Q4_0))                  # "q4_0"
print(sample_method_name(SampleMethod.EULER))  # "euler"
print(scheduler_name(Scheduler.KARRAS))        # "karras"

CLI Tool¶

# txt2img (alias: generate)
python -m inferna.sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png \
    --steps 4 --cfg 1.0

# img2img / inpaint / ControlNet / video
python -m inferna.sd img2img --model M --init INPUT --prompt "..." --output OUT
python -m inferna.sd inpaint --model M --init INPUT --mask MASK --prompt "..." --output OUT
python -m inferna.sd controlnet --model M --control-net CN --control-image C --prompt "..." --output OUT
python -m inferna.sd video --model M --prompt "..." --output frames/

# Upscale image
python -m inferna.sd upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_4x.png

# Convert model
python -m inferna.sd convert \
    --input sd-v1-5.safetensors \
    --output sd-v1-5-q4_0.gguf \
    --type q4_0

# Show system info
python -m inferna.sd info

Supported Models¶

SD 1.x/2.x: Standard Stable Diffusion models
SDXL/SDXL Turbo: Stable Diffusion XL (use cfg_scale=1.0, steps=1-4 for Turbo)
SD3/SD3.5: Stable Diffusion 3.x
FLUX: FLUX.1 models (dev, schnell)
Wan/CogVideoX: Video generation models (use generate_video())
LoRA: Low-rank adaptation files
ControlNet: Conditional generation with control images
ESRGAN: Image upscaling models

Error Handling¶

All inferna functions raise appropriate Python exceptions:

from inferna import complete, LLM

try:
    response = complete("Hello", model_path="nonexistent.gguf")
except FileNotFoundError:
    print("Model file not found")
except RuntimeError as e:
    print(f"Runtime error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

# LLM with error handling
try:
    gen = LLM("models/llama.gguf")
    response = gen("What is Python?")
except Exception as e:
    print(f"Generation failed: {e}")

Type Hints¶

All functions include comprehensive type hints for IDE support:

from typing import List, Dict, Optional, Iterator, Callable, Tuple
from inferna import (
    complete,          # str | Iterator[str]
    chat,              # str | Iterator[str]
    LLM,               # class
    GenerationConfig,  # @dataclass
)

Performance Tips¶

1. Model Reuse¶

# BAD: Reloads model each time (slow)
for prompt in prompts:
    response = complete(prompt, model_path="model.gguf")

# GOOD: Reuses loaded model (fast)
gen = LLM("model.gguf")
for prompt in prompts:
    response = gen(prompt)

2. Batch Processing¶

from inferna import batch_generate, GenerationConfig

# BAD: Sequential processing
responses = [generate(p, model_path="model.gguf") for p in prompts]

# GOOD: Parallel batch processing (3-10x faster)
prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(
    prompts,
    model_path="model.gguf",
    n_seq_max=8,  # Max parallel sequences
    config=GenerationConfig(max_tokens=50, temperature=0.7)
)

3. GPU Offloading¶

# Estimate optimal layers
from inferna import estimate_gpu_layers

estimate = estimate_gpu_layers("model.gguf", available_vram_mb=8000)

# Use recommended settings
config = GenerationConfig(n_gpu_layers=estimate.n_gpu_layers)
gen = LLM("model.gguf", config=config)

4. Context Sizing¶

# Auto-size context (recommended)
config = GenerationConfig(n_ctx=None, max_tokens=200)

# Manual sizing (for control)
config = GenerationConfig(n_ctx=2048, max_tokens=200)

5. Streaming for Long Outputs¶

# Non-streaming: waits for complete response
response = complete("Write a long essay", model_path="model.gguf", max_tokens=2000)

# Streaming: see output as it generates
for chunk in complete("Write a long essay", model_path="model.gguf",
                     max_tokens=2000, stream=True):
    print(chunk, end="", flush=True)

Version Compatibility¶

Python: >=3.10 (tested on 3.13)
llama.cpp: b8833
Platform: macOS, Linux, Windows

Inferna API Reference¶

Table of Contents¶

High-Level Generation API¶

complete()¶

chat()¶

apply_chat_template()¶

get_chat_template()¶

Response Class¶

to_dict()¶

to_json()¶

GenerationStats Class¶

LLM Class¶

__call__()¶

chat()¶

get_chat_template()¶

MCP client methods¶

GenerationConfig Dataclass¶

GenerationStats Dataclass¶

Async API¶

AsyncLLM Class¶

__call__() / generate()¶

stream()¶

generate_with_stats()¶

complete_async()¶

chat_async()¶

stream_complete_async()¶

Framework Integrations¶

OpenAI-Compatible API¶

OpenAICompatibleClient Class¶

LangChain Integration¶

InfernaLLM Class¶

Memory Utilities¶

estimate_gpu_layers()¶

estimate_memory_usage()¶

MemoryEstimate Dataclass¶

Core llama.cpp API¶

Core Classes¶

LlamaModel¶

LlamaContext¶

LlamaSampler¶

LlamaVocab¶

LlamaBatch¶

Backend Management¶

Advanced Features¶

GGUF File Manipulation¶

GGUFContext Class¶

JSON Schema to Grammar¶

Model Download¶

N-gram Cache¶

Speculative Decoding¶

Server Implementations¶

Embedded Server (C/Mongoose) — recommended¶

Python Server (fallback)¶

Using the server with the OpenAI client¶

LlamaServer (subprocess wrapper)¶

Multimodal Support¶

Whisper Integration¶

Quick Start¶

Key Classes¶

WhisperContext Methods¶

Audio Requirements¶

Stable Diffusion Integration¶

Quick Start¶

text_to_image()¶

text_to_images()¶

image_to_image()¶

SDContext¶

SDContextParams¶

SDImage¶

SDImageGenParams¶

SDSampleParams¶

Upscaler¶

convert_model()¶

canny_preprocess()¶

Callbacks¶

Enums¶

Utility Functions¶

CLI Tool¶

Supported Models¶

Error Handling¶

`complete()`¶

`chat()`¶

`apply_chat_template()`¶

`get_chat_template()`¶

`Response` Class¶

`to_dict()`¶

`to_json()`¶

`GenerationStats` Class¶

`LLM` Class¶

`call()`¶

`chat()`¶

`get_chat_template()`¶

`GenerationConfig` Dataclass¶

`GenerationStats` Dataclass¶

`AsyncLLM` Class¶

`call()` / `generate()`¶

`stream()`¶

`generate_with_stats()`¶

`complete_async()`¶

`chat_async()`¶

`stream_complete_async()`¶

`OpenAICompatibleClient` Class¶

`InfernaLLM` Class¶

`estimate_gpu_layers()`¶

`estimate_memory_usage()`¶

`MemoryEstimate` Dataclass¶

`LlamaModel`¶

`LlamaContext`¶

`LlamaSampler`¶

`LlamaVocab`¶

`LlamaBatch`¶

`GGUFContext` Class¶

`text_to_image()`¶

`text_to_images()`¶

`image_to_image()`¶

`SDContext`¶

`SDContextParams`¶

`SDImage`¶

`SDImageGenParams`¶

`SDSampleParams`¶

`Upscaler`¶

`convert_model()`¶

`canny_preprocess()`¶