Inferna Server Usage Examples

Inferna ships an embedded OpenAI-compatible HTTP server with a built-in chat web UI (a rebrand of llama.cpp's reference webui), plus a pure-Python fallback for environments without the compiled mongoose extension.

Server Types

1. Embedded Server (EmbeddedServer)

  • C networking via the Mongoose library, bound to Python through nanobind
  • Single-threaded poll loop on the main thread; per-stream worker threads for concurrent token generation
  • Serves the upstream llama-server web UI at GET / (gzipped at build time, served with Content-Encoding: gzip)
  • Compiled as part of the standard make build

2. Python Server (PythonServer)

  • Pure Python HTTP server (stdlib http.server)
  • No compiled mongoose dependency
  • Useful for development and environments where the embedded extension can't be loaded

Basic Usage

Start the Embedded Server (default)

python -m inferna.llama.server -m models/Llama-3.2-1B-Instruct-Q8_0.gguf

Open http://127.0.0.1:8080/ in a browser to use the chat UI.

Use the Python Server

python -m inferna.llama.server -m models/Llama-3.2-1B-Instruct-Q8_0.gguf --server-type python

CLI Flags

Flag                  Default          Description
-m, --model           (required)       Path to a .gguf model file
--host                127.0.0.1        Bind address (use 0.0.0.0 to expose on the LAN)
--port                8080             Port to listen on
--ctx-size            2048             Context window size in tokens
--gpu-layers          -1               GPU layers to offload (-1 = all)
--n-parallel          1                Number of concurrent processing slots
--model-alias         (filename stem)  Identifier shown in the UI's "Model" field and /v1/models[].id. Defaults to the model filename without extension.
--mongoose-log-level  1 (errors only)  Mongoose internal log verbosity: 0 = none, 1 = errors only (default), 2 = info, 3 = debug (every accept/read/write/close; useful for HTTP-level debugging), 4 = verbose
--server-type         embedded         embedded or python
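
When the server is launched from a Python script rather than a shell, the flags above can be assembled into an argument list. The sketch below is only an illustration: build_server_argv is a hypothetical helper, not part of inferna, and it uses only the flag names and defaults listed in the table.

```python
import sys


def build_server_argv(model, host="127.0.0.1", port=8080, ctx_size=2048,
                      gpu_layers=-1, n_parallel=1, model_alias=None,
                      server_type="embedded"):
    """Return an argv list for `python -m inferna.llama.server`.

    Hypothetical helper: the flag names and defaults mirror the CLI table.
    """
    argv = [
        sys.executable, "-m", "inferna.llama.server",
        "-m", str(model),
        "--host", host,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--gpu-layers", str(gpu_layers),
        "--n-parallel", str(n_parallel),
        "--server-type", server_type,
    ]
    # --model-alias is optional; when omitted the server falls back to the
    # model filename without its extension.
    if model_alias is not None:
        argv += ["--model-alias", model_alias]
    return argv
```

The resulting list can be passed to subprocess.Popen(...) to start the server as a child process.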

Advanced Configuration

LAN-accessible server with multiple slots

python -m inferna.llama.server \
    -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --n-parallel 4 \
    --ctx-size 4096

GPU offload + custom display name

python -m inferna.llama.server \
    -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    --gpu-layers 32 \
    --ctx-size 4096 \
    --model-alias my-llama

Verbose mongoose tracing for HTTP debugging

python -m inferna.llama.server -m models/llama.gguf --mongoose-log-level 3

Logging

By default the server emits one access-log line per HTTP request on the inferna.llama.server.embedded.access stdlib logger:

INFO:inferna.llama.server.embedded.access:GET /props 200 285B 0.1ms
INFO:inferna.llama.server.embedded.access:POST /v1/chat/completions 200 242B 0.4ms
INFO:inferna.llama.server.embedded.access:stream-done conn=14a82e4f0 model=Llama-3.2-1B-Instruct-Q8_0 bytes=8563 elapsed=917.2ms

Streaming chat completions emit two lines: one when the dispatcher returns (covers headers + the role-only opener), and a second stream-done (or stream-cancel if the client dropped) when the SSE stream finishes, with the cumulative byte count and end-to-end timing.

Mongoose's built-in tracer reports only errors by default (level 1). Raise it with --mongoose-log-level 3 if you need to see the underlying connection events.
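
Because the access log goes through the stdlib logging module, the lines shown above can be collected and parsed in-process. The following is a minimal sketch, assuming the message text matches the samples exactly. The regular expressions, parse_access_message, and AccessCollector are illustrative names, and the key=value layout of stream-cancel lines is assumed to match stream-done.

```python
import logging
import re

# Matches e.g. "GET /props 200 285B 0.1ms"
REQUEST_RE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) "
    r"(?P<bytes>\d+)B (?P<ms>\d+(?:\.\d+)?)ms$"
)
# Matches e.g. "stream-done conn=... model=... bytes=8563 elapsed=917.2ms"
STREAM_RE = re.compile(r"^(?P<event>stream-done|stream-cancel) (?P<fields>.*)$")


def parse_access_message(message):
    """Parse one access-log message into a dict, or return None."""
    m = REQUEST_RE.match(message)
    if m:
        return {
            "kind": "request",
            "method": m["method"],
            "path": m["path"],
            "status": int(m["status"]),
            "bytes": int(m["bytes"]),
            "elapsed_ms": float(m["ms"]),
        }
    m = STREAM_RE.match(message)
    if m:
        # Stream fields are kept as raw strings (e.g. elapsed="917.2ms").
        fields = dict(p.split("=", 1) for p in m["fields"].split() if "=" in p)
        return {"kind": m["event"], **fields}
    return None


class AccessCollector(logging.Handler):
    """Logging handler that stores parsed access-log events in a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        parsed = parse_access_message(record.getMessage())
        if parsed is not None:
            self.events.append(parsed)
```

To use it, attach an AccessCollector instance to logging.getLogger("inferna.llama.server.embedded.access") before starting the server.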

Web UI

Open the server's root URL (/) in a browser. The UI is the upstream llama-server SPA, served from the inferna package as gzipped static assets:

Route                   Content
GET / and /index.html   UI shell (HTML)
GET /bundle.css         UI stylesheet
GET /bundle.js          UI bundle (Svelte SPA, ~6.5 MB raw / ~1.7 MB gzipped)
GET /loading.html       UI loading screen

The UI calls /props, /v1/models, and /v1/chat/completions (with stream: true) to operate. Cancellation, conversation history (in IndexedDB), and per-conversation parameter overrides are upstream features that work out of the box.

API Endpoints

Health

curl http://127.0.0.1:8080/health
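
Scripts that start the server often need to wait until it is reachable before sending requests. Here is a minimal polling sketch using only the standard library. It assumes that /health answers with HTTP 200 once the server is ready, which the text above does not spell out. health_ok and wait_until_ready are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def health_ok(base_url="http://127.0.0.1:8080", timeout=2.0):
    """Return True if GET /health answers with HTTP 200 (assumed 'ready')."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_ready(probe=health_ok, deadline_s=60.0, interval_s=0.5):
    """Call probe() repeatedly until it returns True or the deadline passes."""
    end = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

Passing the probe as a parameter keeps the retry loop independent of the network, so it can be exercised with a stub.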

Server properties (UI bootstrap)

curl http://127.0.0.1:8080/props | python3 -m json.tool

Returns model metadata the UI uses to render its sidebar:

{
  "default_generation_settings": {"n_ctx": 2048, "temperature": 0.8, "top_p": 0.9, "min_p": 0.05},
  "total_slots": 1,
  "model_path": "models/Llama-3.2-1B-Instruct-Q8_0.gguf",
  "model_alias": "Llama-3.2-1B-Instruct-Q8_0",
  "chat_template": "",
  "build_info": "inferna",
  "n_ctx": 2048,
  "n_ctx_train": 2048
}

Slots

curl http://127.0.0.1:8080/slots
[{"id": 0, "is_processing": false, "task_id": null}]

Metrics

curl http://127.0.0.1:8080/metrics

Returns an empty Prometheus exposition (200 with Content-Type: text/plain; version=0.0.4). The UI calls this; populating it with real series is a future enhancement.

Models

curl http://127.0.0.1:8080/v1/models

Chat completion (non-streaming)

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 100
  }'
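
The same request can be built from Python with the standard library. This is a sketch rather than an official client. build_chat_request and first_choice_text are hypothetical helpers, and the choices[0].message.content path assumes the usual OpenAI response shape, which this page does not show for the non-streaming case.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://127.0.0.1:8080",
                       max_tokens=None, stream=False):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"messages": messages}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if stream:
        payload["stream"] = True
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_choice_text(response_body):
    """Extract the assistant text, assuming the OpenAI response shape."""
    return response_body["choices"][0]["message"]["content"]


def ask(question, **kwargs):
    """Send one user message and return the reply text (needs a live server)."""
    req = build_chat_request([{"role": "user", "content": question}], **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return first_choice_text(json.load(resp))
```

With a server running locally, ask("What is 2+2?", max_tokens=100) mirrors the curl call above.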

Chat completion (streaming)

curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

The response is OpenAI-shape Server-Sent Events:

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}],...}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":" 1"},"finish_reason":null}],...}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}],...}

...

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],...}

data: [DONE]

Tokens arrive on the wire as they're generated (worker-thread + main-poll-loop drain), so SSE clients see real-time streaming.
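
On the client side, the event stream shown above can be consumed line by line. The sketch below assumes each event is a single data: line, as in the sample output. iter_sse_deltas is an illustrative name, not part of inferna.

```python
import json


def iter_sse_deltas(lines):
    """Yield content fragments from an OpenAI-style SSE line stream.

    `lines` may be any iterable of str or bytes lines, for example the
    response object returned by urllib.request.urlopen.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

The role-only opener and the final finish_reason chunk carry no content, so they produce no output. Joining the yielded fragments reconstructs the full reply.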

When max_tokens is omitted, null, 0, or negative, generation runs until EOS or the context limit (this honors the upstream web UI's -1 "unlimited" convention). Pass a positive integer to cap the output length.
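
The max_tokens rule can be restated as a small function, which may help when writing client code that mirrors it. This is an illustration of the documented behavior only, not the server's actual implementation. effective_max_tokens is a hypothetical name.

```python
def effective_max_tokens(value):
    """Return a positive token cap, or None for "until EOS or context limit".

    Mirrors the documented rule: omitted/null, 0, and negative values all
    mean "no explicit cap".
    """
    if value is None or value <= 0:
        return None
    return int(value)
```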

Embeddings

The /v1/embeddings endpoint is available when the server is started with embedding=True in ServerConfig (Python API; the CLI does not yet expose this flag).

# Single text
curl -X POST http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "hello world", "model": "nomic-embed-text"}'

# Batch input
curl -X POST http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["first text", "second text"], "model": "nomic-embed-text"}'

Response format:

{
  "object": "list",
  "data": [
    {"object": "embedding", "embedding": [0.1, 0.2, ...], "index": 0}
  ],
  "model": "nomic-embed-text",
  "usage": {"prompt_tokens": 3, "total_tokens": 3}
}
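
A common next step is to compare the returned vectors. The sketch below reads the embeddings out of a response in the format shown above and computes cosine similarity in pure Python. embeddings_by_index and cosine_similarity are hypothetical helper names.

```python
import math


def embeddings_by_index(response_body):
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For a batch request, embeddings_by_index returns the vectors in input order, so the first and second inputs can be compared with cosine_similarity(vecs[0], vecs[1]).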

Embedding Server Configuration

To serve embeddings, enable the embedding flag on ServerConfig and optionally specify a dedicated embedding model:

from inferna.llama.server.python import ServerConfig
from inferna.llama.server.embedded import EmbeddedServer

config = ServerConfig(
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedding=True,

    # Optional: dedicated embedding model (defaults to model_path)
    embedding_model_path="models/bge-small-en-v1.5-q8_0.gguf",

    # Embedding-specific parameters (same options as the Embedder class)
    embedding_n_ctx=512,          # Context size (match model training)
    embedding_n_batch=512,
    embedding_n_gpu_layers=-1,    # -1 = all
    embedding_pooling="mean",     # mean, cls, last, none
    embedding_normalize=True,
)

with EmbeddedServer(config) as server:
    server.wait_for_shutdown()

The same config works with PythonServer:

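import time

from inferna.llama.server.python import PythonServer

with PythonServer(config) as server:
    # Keep the main thread alive until interrupted (Ctrl+C); leaving the
    # with block then shuts the server down.
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass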

Embedding Configuration Reference

Parameter               Default  Description
embedding               False    Enable the /v1/embeddings endpoint
embedding_model_path    None     Path to the embedding model (uses model_path if None)
embedding_n_ctx         512      Context size for the embedding model
embedding_n_batch       512      Batch size for the embedding model
embedding_n_gpu_layers  -1       GPU layers for the embedding model (-1 = all)
embedding_pooling       "mean"   Pooling strategy: mean, cls, last, none
embedding_normalize     True     L2-normalize output embeddings
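
To make the embedding_pooling and embedding_normalize options concrete, here is a conceptual pure-Python sketch of mean pooling followed by L2 normalization. It illustrates the general technique only and makes no claim about how inferna computes embeddings internally. mean_pool and l2_normalize are hypothetical names.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector ("mean" pooling)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)
```

With normalized outputs, the dot product of two embeddings equals their cosine similarity.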

When to Use Each Server

Use EmbeddedServer (default) when

  • You want the built-in web UI
  • You need to serve multiple concurrent users or streams
  • You are deploying to production
  • Throughput matters

Use PythonServer when

  • The compiled mongoose extension isn't available (sdist install on a platform without a wheel, etc.)
  • Debugging the HTTP layer with stdlib tooling
  • You don't need the web UI

The two servers expose the same JSON API (/v1/models, /v1/chat/completions, /v1/embeddings, /health). Only EmbeddedServer serves the web UI and the /props, /slots, and /metrics endpoints that the UI requires.
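
Since both servers share the JSON API, an application can prefer EmbeddedServer and fall back to PythonServer when the compiled extension cannot be loaded. The sketch below assumes that a missing extension surfaces as an ImportError at import time, which is not confirmed above. first_importable and SERVER_CANDIDATES are illustrative names. The module paths are the ones used in the configuration examples.

```python
import importlib

# Preferred server first; module paths as used in the examples above.
SERVER_CANDIDATES = [
    ("inferna.llama.server.embedded", "EmbeddedServer"),
    ("inferna.llama.server.python", "PythonServer"),
]


def first_importable(candidates):
    """Return the first attribute that can be imported from `candidates`.

    Each candidate is a (module_name, attribute_name) pair. Modules that
    raise ImportError or lack the attribute are skipped.
    """
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    names = ", ".join(f"{m}.{a}" for m, a in candidates)
    raise ImportError(f"none of the candidates could be imported: {names}")
```

A caller would then write server_cls = first_importable(SERVER_CANDIDATES) and use it as with server_cls(config) as server: ..., keeping in mind that only EmbeddedServer provides the web UI.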