Inferna Server Usage Examples
Inferna ships an embedded OpenAI-compatible HTTP server with a built-in chat web UI (a rebrand of llama.cpp's reference webui), plus a pure-Python fallback for environments without the compiled mongoose extension.
Server Types
1. Embedded Server (EmbeddedServer)
- C networking via the Mongoose library, bound to Python through nanobind
- Single-threaded poll loop on the main thread; per-stream worker threads for concurrent token generation
- Serves the upstream llama-server web UI at GET / (gzipped at build time, served with Content-Encoding: gzip)
- Compiled as part of the standard make build
2. Python Server (PythonServer)
- Pure Python HTTP server (stdlib http.server)
- No compiled mongoose dependency
- Useful for development and environments where the embedded extension can't be loaded
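Basic Usage
Start the Embedded Server (default)
python -m inferna.llama.server -m models/Llama-3.2-1B-Instruct-Q8_0.gguf
Open http://127.0.0.1:8080/ in a browser to use the chat UI.
Use the Python Server
python -m inferna.llama.server -m models/Llama-3.2-1B-Instruct-Q8_0.gguf --server-type python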
CLI Flags
| Flag | Default | Description |
|---|---|---|
| -m, --model | (required) | Path to a .gguf model file |
| --host | 127.0.0.1 | Bind address (use 0.0.0.0 to expose on the LAN) |
| --port | 8080 | Port to listen on |
| --ctx-size | 2048 | Context window size in tokens |
| --gpu-layers | -1 | GPU layers to offload (-1 = all) |
| --n-parallel | 1 | Number of concurrent processing slots |
| --model-alias | (filename stem) | Identifier shown in the UI's "Model" field and in /v1/models[].id. Defaults to the model filename without its extension. |
| --mongoose-log-level | 1 (errors only) | Mongoose internal log verbosity: 0=none, 1=errors only (default), 2=info, 3=debug (every accept/read/write/close; useful for HTTP-level debugging), 4=verbose |
| --server-type | embedded | embedded or python |
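When the server is launched from a script or test harness, the flags in the table can be assembled programmatically. The sketch below builds an argument list for python -m inferna.llama.server; build_server_command is a hypothetical helper name (not part of inferna), and it assumes every flag takes exactly one value, as in the table above.

```python
import sys


def build_server_command(model, server_type="embedded", **flags):
    """Return an argv list for launching the inferna server CLI.

    Hypothetical helper, not an inferna API. Keyword names map to CLI
    flags by replacing "_" with "-", e.g. ctx_size=4096 -> "--ctx-size 4096".
    """
    if server_type not in ("embedded", "python"):
        raise ValueError(f"unknown server type: {server_type!r}")
    cmd = [sys.executable, "-m", "inferna.llama.server",
           "-m", str(model), "--server-type", server_type]
    for name, value in flags.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


cmd = build_server_command(
    "models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    host="0.0.0.0", n_parallel=4, ctx_size=4096,
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.Popen(cmd) to start the server as a child process.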
Advanced Configuration
LAN-accessible server with multiple slots
python -m inferna.llama.server \
-m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-parallel 4 \
--ctx-size 4096
GPU offload + custom display name
python -m inferna.llama.server \
-m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
--gpu-layers 32 \
--ctx-size 4096 \
--model-alias my-llama
Verbose mongoose tracing for HTTP debugging
python -m inferna.llama.server \
-m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
--mongoose-log-level 3
Logging
By default the server emits one access-log line per HTTP request on the inferna.llama.server.embedded.access stdlib logger:
INFO:inferna.llama.server.embedded.access:GET /props 200 285B 0.1ms
INFO:inferna.llama.server.embedded.access:POST /v1/chat/completions 200 242B 0.4ms
INFO:inferna.llama.server.embedded.access:stream-done conn=14a82e4f0 model=Llama-3.2-1B-Instruct-Q8_0 bytes=8563 elapsed=917.2ms
Streaming chat completions emit two lines: one when the dispatcher returns (covers headers + the role-only opener), and a second stream-done (or stream-cancel if the client dropped) when the SSE stream finishes, with the cumulative byte count and end-to-end timing.
By default Mongoose's built-in tracer logs errors only (level 1). Raise it with --mongoose-log-level 3 if you need to see the underlying connection events.
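Because the access log goes through a stdlib logger, it can be captured and parsed in-process. The sketch below attaches a handler to the logger name given above and turns each message into a dict. The regular expression is inferred from the three sample lines only, so the exact message format is an assumption; parse_access_message and AccessCollector are hypothetical names.

```python
import logging
import re

# Inferred from sample lines such as "GET /props 200 285B 0.1ms" (assumption).
REQUEST_RE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) "
    r"(?P<bytes>\d+)B (?P<ms>[\d.]+)ms$"
)


def parse_access_message(message):
    """Parse one access-log message into a dict, or return None if unrecognized."""
    m = REQUEST_RE.match(message)
    if m:
        return {"kind": "request", "method": m["method"], "path": m["path"],
                "status": int(m["status"]), "bytes": int(m["bytes"]),
                "ms": float(m["ms"])}
    head, _, rest = message.partition(" ")
    if head in ("stream-done", "stream-cancel"):
        # key=value pairs; values are kept as strings
        fields = dict(p.split("=", 1) for p in rest.split() if "=" in p)
        return {"kind": head, **fields}
    return None


class AccessCollector(logging.Handler):
    """Collect parsed access-log entries in memory."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def emit(self, record):
        parsed = parse_access_message(record.getMessage())
        if parsed is not None:
            self.rows.append(parsed)


collector = AccessCollector()
access_logger = logging.getLogger("inferna.llama.server.embedded.access")
access_logger.setLevel(logging.INFO)
access_logger.addHandler(collector)
```

To silence per-request lines instead, setting the same logger's level to logging.WARNING should be enough, assuming the server logs them at INFO as the samples suggest.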
Web UI
Open the server's root URL (/) in a browser. The UI is the upstream llama-server SPA, served from the inferna package as gzipped static assets:
| Route | Content |
|---|---|
| GET / and /index.html | UI shell (HTML) |
| GET /bundle.css | UI stylesheet |
| GET /bundle.js | UI bundle (Svelte SPA, ~6.5 MB raw / ~1.7 MB gzipped) |
| GET /loading.html | UI loading screen |
The UI calls /props, /v1/models, and /v1/chat/completions (with stream: true) to operate. Cancellation, conversation history (in IndexedDB), and per-conversation parameter overrides are upstream features that work out of the box.
API Endpoints
Health
curl http://127.0.0.1:8080/health
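A common use of the health endpoint is waiting for the server to come up before sending requests. The sketch below polls /health with the standard library. It assumes that an HTTP 200 means the server is ready and does not inspect the response body, since its shape is not documented here; wait_until_healthy is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, timeout=30.0, interval=0.5,
                       urlopen=urllib.request.urlopen):
    """Poll GET {base_url}/health until it returns 200 or the timeout expires.

    Assumes a 200 status means "ready". `urlopen` is injectable so the
    function can be exercised without a running server.
    """
    deadline = time.monotonic() + timeout
    url = base_url.rstrip("/") + "/health"
    while True:
        try:
            with urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or an error status
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Typical use would be wait_until_healthy("http://127.0.0.1:8080") right after starting the server process.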
Server properties (UI bootstrap)
curl http://127.0.0.1:8080/props
Returns model metadata the UI uses to render its sidebar:
{
  "default_generation_settings": {"n_ctx": 2048, "temperature": 0.8, "top_p": 0.9, "min_p": 0.05},
  "total_slots": 1,
  "model_path": "models/Llama-3.2-1B-Instruct-Q8_0.gguf",
  "model_alias": "Llama-3.2-1B-Instruct-Q8_0",
  "chat_template": "",
  "build_info": "inferna",
  "n_ctx": 2048,
  "n_ctx_train": 2048
}
Slots
curl http://127.0.0.1:8080/slots
Metrics
curl http://127.0.0.1:8080/metrics
Returns an empty Prometheus exposition (200 with Content-Type: text/plain; version=0.0.4). The UI calls this; populating it with real series is a future enhancement.
Models
curl http://127.0.0.1:8080/v1/models
Chat completion (non-streaming)
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 100
}'
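The same request can be sent from Python with only the standard library. This is a minimal sketch: build_chat_request and chat are hypothetical helper names, and the payload mirrors the curl example above.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=None, stream=False):
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {"messages": messages}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if stream:
        payload["stream"] = True
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, messages, max_tokens=None):
    """Send a non-streaming request and return the decoded JSON response."""
    req = build_chat_request(base_url, messages, max_tokens=max_tokens)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In an OpenAI-compatible response the reply text is conventionally found at choices[0]["message"]["content"]; the non-streaming body is not shown in this document, so treat that path as an assumption.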
Chat completion (streaming)
curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}'
The response is a stream of OpenAI-style Server-Sent Events:
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}],...}
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":" 1"},"finish_reason":null}],...}
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}],...}
...
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],...}
data: [DONE]
Tokens arrive on the wire as they're generated (worker-thread + main-poll-loop drain), so SSE clients see real-time streaming.
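A client consumes this stream by reading data: lines until the [DONE] sentinel and concatenating each chunk's delta content. The sketch below does that over any iterable of lines; iter_sse_chunks and collect_text are hypothetical helper names, and the chunk layout is taken from the sample events above.

```python
import json


def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(data)


def collect_text(lines):
    """Concatenate the delta content of every chunk in the stream."""
    parts = []
    for chunk in iter_sse_chunks(lines):
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

The response object returned by urllib.request.urlopen iterates over lines as bytes, so it can be passed to iter_sse_chunks directly to print tokens as they arrive.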
When max_tokens is omitted, null, 0, or negative, generation runs until EOS or the context limit (the upstream webui's -1 "unlimited" convention is honored). Pass a positive integer to cap the output.
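The rule can be restated as a small function, which is handy for validating client-side settings before sending them. This is an illustration of the documented behavior, not the server's own code; effective_max_tokens is a hypothetical name.

```python
def effective_max_tokens(value):
    """Return a positive token cap, or None for "until EOS or context limit"."""
    if value is None:
        return None  # omitted or JSON null
    value = int(value)
    return value if value > 0 else None  # 0 and negatives mean unlimited
```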
Embeddings
The /v1/embeddings endpoint is available when the server is started with embedding=True in ServerConfig (Python API; the CLI does not yet expose this flag).
# Single text
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "hello world", "model": "nomic-embed-text"}'
# Batch input
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": ["first text", "second text"], "model": "nomic-embed-text"}'
Response format:
{
"object": "list",
"data": [
{"object": "embedding", "embedding": [0.1, 0.2, ...], "index": 0}
],
"model": "nomic-embed-text",
"usage": {"prompt_tokens": 3, "total_tokens": 3}
}
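Once a response in this shape has been decoded, the vectors can be pulled out in input order and compared. The sketch below assumes only the fields shown in the response format above; vectors_by_index and cosine_similarity are hypothetical helper names.

```python
import math


def vectors_by_index(response):
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For a batch request, cosine_similarity(*vectors_by_index(response)[:2]) compares the first two inputs.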
Embedding Server Configuration
To serve embeddings, enable the embedding flag on ServerConfig and optionally specify a dedicated embedding model:
from inferna.llama.server.python import ServerConfig
from inferna.llama.server.embedded import EmbeddedServer

config = ServerConfig(
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedding=True,
    # Optional: dedicated embedding model (defaults to model_path)
    embedding_model_path="models/bge-small-en-v1.5-q8_0.gguf",
    # Embedding-specific parameters (same options as the Embedder class)
    embedding_n_ctx=512,        # Context size (match model training)
    embedding_n_batch=512,
    embedding_n_gpu_layers=-1,  # -1 = all
    embedding_pooling="mean",   # mean, cls, last, none
    embedding_normalize=True,
)

# Block until the server shuts down
with EmbeddedServer(config) as server:
    server.wait_for_shutdown()
The same config works with PythonServer:
import time

from inferna.llama.server.python import PythonServer

with PythonServer(config) as server:
    # Keep the main thread alive while the server runs
    while True:
        time.sleep(1)
Embedding Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| embedding | False | Enable the /v1/embeddings endpoint |
| embedding_model_path | None | Path to the embedding model (uses model_path if None) |
| embedding_n_ctx | 512 | Context size for the embedding model |
| embedding_n_batch | 512 | Batch size for the embedding model |
| embedding_n_gpu_layers | -1 | GPU layers for the embedding model (-1 = all) |
| embedding_pooling | "mean" | Pooling strategy: mean, cls, last, none |
| embedding_normalize | True | L2-normalize output embeddings |
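To make the pooling and normalization options concrete, the sketch below shows what they conventionally compute on a list of per-token vectors. It illustrates the usual meaning of these terms (cls = first token, last = final token, none = per-token vectors left as-is) and is not inferna's implementation; pool and l2_normalize are hypothetical names.

```python
import math


def pool(token_vectors, strategy="mean"):
    """Reduce per-token vectors to one vector using the named strategy."""
    if strategy == "mean":
        n = len(token_vectors)
        # Element-wise average across tokens
        return [sum(col) / n for col in zip(*token_vectors)]
    if strategy == "cls":
        return token_vectors[0]   # first token
    if strategy == "last":
        return token_vectors[-1]  # final token
    if strategy == "none":
        return token_vectors      # no pooling: one vector per token
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)
```

With normalization on, the dot product of two embeddings equals their cosine similarity, which is why it is usually left enabled for retrieval.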
When to Use Each Server
Use EmbeddedServer (default) when
- You want the built-in web UI
- You need to serve multiple concurrent users or streams
- You are deploying to production
- Throughput matters
Use PythonServer when
- The compiled mongoose extension isn't available (sdist install on a platform without a wheel, etc.)
- Debugging the HTTP layer with stdlib tooling
- You don't need the web UI
The two servers expose the same JSON API (/v1/models, /v1/chat/completions, /v1/embeddings, /health). Only EmbeddedServer serves the web UI and the /props, /slots, and /metrics endpoints the UI requires.
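Code that uses the Python API can pick a server class at runtime by trying the embedded server first and falling back to the pure-Python one. The sketch below uses the module paths and class names from the examples above. It assumes a missing compiled extension surfaces as an ImportError when the embedded module is imported; first_importable is a hypothetical helper name.

```python
import importlib

# Module paths and class names as they appear in this document's examples.
CANDIDATES = [
    ("inferna.llama.server.embedded", "EmbeddedServer"),
    ("inferna.llama.server.python", "PythonServer"),
]


def first_importable(candidates=CANDIDATES):
    """Return the first class that imports cleanly, or None if none do."""
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # e.g. compiled extension not available (assumption)
        cls = getattr(module, class_name, None)
        if cls is not None:
            return cls
    return None
```

The returned class can then be used as in the examples above, for instance with server_cls(config) as server, keeping in mind that only EmbeddedServer serves the web UI.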