Cyllama Server Usage Examples¶

Cyllama provides two server implementations, both offering the same OpenAI-compatible API:

Server Types¶

1. Mongoose C Server (EmbeddedServer)¶

Native C networking via the Mongoose web server library
High concurrency and low overhead
Compiled as part of the standard make build

2. Python HTTP Server (PythonServer)¶

Pure Python HTTP server (stdlib http.server)
No compiled dependencies beyond the core Cython extensions
Subject to Python GIL limitations

Basic Usage¶

Start Default Embedded Server¶

python -m cyllama.llama.server -m models/Llama-3.2-1B-Instruct-Q8_0.gguf

Start High-Performance Mongoose Server¶

python -m cyllama.llama.server -m models/Llama-3.2-1B-Instruct-Q8_0.gguf --server-type mongoose

Advanced Configuration¶

Custom Host and Port¶

python -m cyllama.llama.server \
    -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    --server-type mongoose \
    --host 0.0.0.0 \
    --port 8080

Multiple Parallel Processing Slots¶

python -m cyllama.llama.server \
    -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    --server-type mongoose \
    --n-parallel 4 \
    --ctx-size 2048

GPU Configuration¶

python -m cyllama.llama.server \
    -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    --server-type mongoose \
    --gpu-layers 32 \
    --ctx-size 4096

API Endpoints¶

Both server implementations provide the same OpenAI-compatible API:

Health Check¶

curl http://localhost:8080/health

List Models¶

curl http://localhost:8080/v1/models

Chat Completion¶

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "max_tokens": 100
  }'

Embeddings¶

The /v1/embeddings endpoint is available when the server is started with embedding=True. It accepts single strings or batches and returns OpenAI-compatible responses.

# Single text
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "hello world", "model": "nomic-embed-text"}'

# Batch input
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["first text", "second text"], "model": "nomic-embed-text"}'

Response format:

{
  "object": "list",
  "data": [
    {"object": "embedding", "embedding": [0.1, 0.2, ...], "index": 0}
  ],
  "model": "nomic-embed-text",
  "usage": {"prompt_tokens": 3, "total_tokens": 3}
}

Embedding Server Configuration¶

To serve embeddings, enable the embedding flag in ServerConfig and optionally specify a dedicated embedding model:

from cyllama.llama.server.python import ServerConfig, PythonServer

config = ServerConfig(
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedding=True,

    # Optional: use a dedicated embedding model (defaults to model_path)
    embedding_model_path="models/bge-small-en-v1.5-q8_0.gguf",

    # Embedding-specific parameters (same options as the Embedder class)
    embedding_n_ctx=512,          # Context size (match model training)
    embedding_n_batch=512,        # Batch size
    embedding_n_gpu_layers=-1,    # GPU layers (-1 = all)
    embedding_pooling="mean",     # Pooling: mean, cls, last, none
    embedding_normalize=True,     # L2 normalize output vectors
)

with PythonServer(config) as server:
    # Server now handles both /v1/chat/completions and /v1/embeddings
    import time
    while True:
        time.sleep(1)

The same configuration works with EmbeddedServer (Mongoose):

from cyllama.llama.server.embedded import EmbeddedServer

with EmbeddedServer(config) as server:
    server.wait_for_shutdown()

Embedding Configuration Reference¶

Parameter	Default	Description
`embedding`	`False`	Enable `/v1/embeddings` endpoint
`embedding_model_path`	`None`	Path to embedding model (uses `model_path` if `None`)
`embedding_n_ctx`	`512`	Context size for embedding model
`embedding_n_batch`	`512`	Batch size for embedding model
`embedding_n_gpu_layers`	`-1`	GPU layers for embedding model (-1 = all)
`embedding_pooling`	`"mean"`	Pooling strategy: `mean`, `cls`, `last`, `none`
`embedding_normalize`	`True`	L2 normalize output embeddings

Performance Comparison¶

Feature	EmbeddedServer (Mongoose)	PythonServer
Networking	Native C	Python HTTP
Concurrency	High	GIL limited
Memory Usage	Lower	Higher
Startup Time	Fast	Fast
Best For	Production, high-throughput	Development, simplicity