RAG Support¶

Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from a knowledge base before generation. cyllama provides a complete RAG solution using:

llama.cpp for both embedding generation and text generation
sqlite-vector for high-performance vector similarity search
SQLite FTS5 for hybrid keyword + semantic search

Architecture¶

                    +-----------------+
                    |   RAG Pipeline  |
                    +--------+--------+
                             |
         +-------------------+-------------------+
         |                   |                   |
+--------v--------+ +--------v--------+ +--------v--------+
|    Embedder     | |SqliteVectorStore| |   Generator     |
| (embedding LLM) | | (retrieval)     | | (generation LLM)|
+-----------------+ +-----------------+ +-----------------+

Quick Start¶

The simplest way to use RAG is through the high-level RAG class:

from cyllama.rag import RAG

# Initialize with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)

# Add documents to the knowledge base
rag.add_texts([
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning uses algorithms to learn patterns from data.",
    "Neural networks are inspired by biological brain structures."
])

# Or load from files
rag.add_documents(["docs/guide.md", "docs/api.txt"])

# Query the knowledge base
response = rag.query("What is Python?")
print(response.text)
print(f"Sources: {len(response.sources)}")

# Stream the response
for chunk in rag.stream("Explain machine learning"):
    print(chunk, end="", flush=True)

# Clean up
rag.close()

Using Context Managers¶

For proper resource cleanup, use the context manager:

from cyllama.rag import RAG

with RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf"
) as rag:
    rag.add_texts(["Your documents here..."])
    response = rag.query("Your question?")
    print(response.text)
# Resources automatically cleaned up

Pluggable Backends¶

RAG and RAGPipeline accept an injected embedder and vector store via the embedder= and store= constructor parameters:

from cyllama.rag import RAG, SqliteVectorStore

rag = RAG(
    embedding_model="",  # ignored when embedder= is supplied
    generation_model="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedder=my_embedder,                                  # any EmbedderProtocol
    store=SqliteVectorStore(dimension=1536, db_path="x.db"),  # any VectorStoreProtocol
)

Both slots are typed as structural protocols (EmbedderProtocol, VectorStoreProtocol in cyllama.rag.types). Alternative backends — OpenAI embeddings, Qdrant, Chroma, pgvector, an in-house service — only need to implement the handful of methods the RAG layer actually calls to become drop-in replacements. See:

Omit the argument to fall back to the defaults (Embedder over a local GGUF embedding model and SqliteVectorStore).

Components Overview¶

Core Components¶

Component	Description
`RAG`	High-level interface with sensible defaults
`AsyncRAG`	Async wrapper for non-blocking operations
`RAGPipeline`	Lower-level orchestration of retrieval + generation
`RAGConfig`	Configuration for retrieval and generation

Storage & Retrieval¶

Component	Description
`Embedder`	Generate vector embeddings from text
`SqliteVectorStore`	SQLite-based vector storage with sqlite-vector (default backend; implements `VectorStoreProtocol`). `VectorStore` remains as a deprecated alias.
`QdrantVectorStore`	Qdrant adapter for `VectorStoreProtocol` (optional: `uv sync --group qdrant`). Reference example for multi-backend support.
`HybridStore`	Combined FTS5 + vector search

Text Processing¶

Component	Description
`TextSplitter`	Recursive character text splitting
`TokenTextSplitter`	Token-based splitting
`MarkdownSplitter`	Markdown-aware splitting

Document Loaders¶

Component	Description
`TextLoader`	Plain text files
`MarkdownLoader`	Markdown with frontmatter
`JSONLoader`	JSON with configurable extraction
`JSONLLoader`	JSON Lines with lazy loading
`DirectoryLoader`	Batch loading from directories
`PDFLoader`	PDF files (requires `docling`)

Advanced Features¶

Component	Description
`Reranker`	Cross-encoder reranking
`create_rag_tool`	Agent integration
`async_search_knowledge`	Async search helper

Embedding Models¶

cyllama uses llama.cpp embedding models in GGUF format. Recommended models:

Model	Dimension	Size	Notes
bge-small-en-v1.5	384	~130MB	Good quality/size balance
bge-base-en-v1.5	768	~440MB	Higher quality
snowflake-arctic-embed-s	384	~130MB	Fast, accurate
all-MiniLM-L6-v2	384	~90MB	Lightweight
nomic-embed-text-v1.5	768	~550MB	Long context (8192)

Downloading Models¶

# Using huggingface-cli
huggingface-cli download BAAI/bge-small-en-v1.5-gguf bge-small-en-v1.5-q8_0.gguf

# Or directly with wget
wget https://huggingface.co/BAAI/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf

Serving Embeddings over HTTP¶

The Embedder can also be served via the built-in OpenAI-compatible server (PythonServer or EmbeddedServer). This lets lightweight clients generate embeddings over HTTP without installing cyllama or having GPU access locally:

from cyllama.llama.server.python import ServerConfig, PythonServer

config = ServerConfig(
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedding=True,
    embedding_model_path="models/bge-small-en-v1.5-q8_0.gguf",
)

with PythonServer(config) as server:
    # Serves /v1/chat/completions and /v1/embeddings
    import time
    while True:
        time.sleep(1)

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "hello world"}'

See Embedder docs and Server Usage for configuration details.

Command-Line Interface¶

The cyllama rag command provides command-line RAG without writing any Python:

# Single query against a directory
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure the system?"

# Index specific files and enter interactive mode (omit -p)
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md

# Stream output and show source chunks
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "Summarize the architecture" --stream --sources

# Custom system instruction and retrieval settings
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -s "Answer in one paragraph" -k 3 --threshold 0.4

Options¶

Flag	Description	Default
`-m, --model`	Path to GGUF generation model	(required)
`-e, --embedding-model`	Path to GGUF embedding model	(required)
`-f, --file`	File to index (repeatable)
`-d, --dir`	Directory to index (repeatable)
`--glob`	Glob pattern for directory loading	`*/`
`-p, --prompt`	Single query (omit for interactive mode)
`-s, --system`	System instruction prepended to the prompt template
`-n, --max-tokens`	Maximum tokens to generate	200
`--temperature`	Generation temperature	0.8
`-k, --top-k`	Number of chunks to retrieve	5
`--threshold`	Minimum similarity threshold	(none)
`-ngl, --n-gpu-layers`	GPU layers to offload	-1
`--stream`	Stream output tokens	off
`--sources`	Show source chunks with similarity scores	off
`--db PATH`	Persist the vector index to a SQLite file (see below)	(in-memory)
`--rebuild`	Delete the `--db` file and re-index from `-f`/`-d`	off
`--no-chat-template`	Use raw-completion prompting instead of the model's chat template	off (chat template on)
`--show-think`	Keep `<think>...</think>` reasoning blocks in the output	off (stripped)
`--repetition-threshold N`	Stop generation after the same n-gram repeats N times. `0` disables.	2
`--repetition-ngram N`	Word-level n-gram length for repetition detection	5
`--repetition-window N`	Rolling word-window size for repetition detection	300

At least one document source (-f or -d) is required on the first run. With --db PATH, subsequent runs may omit -f/-d to query the existing index.

In interactive mode, type your questions at the > prompt. Press Ctrl+C or EOF to exit.

Persistent Vector Store (CLI)¶

By default cyllama rag builds the index in memory and rebuilds it on every invocation. With --db PATH, the index is persisted to a SQLite file and reused on subsequent runs, so the corpus is embedded only once:

# First run: index the corpus and persist it
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db rag.db -f corpus.txt -p "What is in the corpus?"

# Subsequent runs: reuse the persisted index without re-embedding
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db rag.db -p "Another question?"

# Re-running with the same -f is a true no-op on indexing — the
# file's content hash is already in the dedup table:
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db rag.db -f corpus.txt -p "..."
# > reusing N chunks from rag.db (1 unchanged)

# Switched embedding models or chunking? Use --rebuild:
cyllama rag -m models/llama.gguf -e models/bge-base.gguf \
    --db rag.db --rebuild -f corpus.txt -p "..."

Decision matrix:

Args	Behavior
`--db PATH` only, DB exists	Reuse existing index, no indexing
`--db PATH` + `-f/-d`, DB missing	Create DB, index sources
`--db PATH` + `-f/-d`, DB exists	Reopen DB, append (dedup-skipping unchanged sources)
`--db PATH --rebuild` + `-f/-d`	Delete DB, recreate, index sources
`--db PATH --rebuild` without `-f/-d`	Error (rebuild needs sources)
`--db PATH` missing, no `-f/-d`	Error (nothing to query)

If the embedding model basename, chunk size, or chunk overlap on a reopen does not match what's stored in the DB's metadata table, cyllama rag exits with a clear error pointing at --rebuild. See SqliteVectorStore — Metadata Validation for details.

Generation Defaults Worth Knowing¶

The CLI flips three RAGConfig fields from their library defaults because they fix common failure modes for chat-tuned and reasoning-tuned models. See RAG Pipeline — RAGConfig for the underlying fields.

Behavior	CLI default	Disable with
Native chat-template prompting	on	`--no-chat-template`
`<think>` block stripping	on	`--show-think`
N-gram repetition guard	on (`threshold=2`)	`--repetition-threshold 0`

Next Steps¶

Embedder - Generating embeddings
SqliteVectorStore - Vector storage and search
Text Processing - Document splitting and loading
RAG Pipeline - RAG pipeline configuration
Advanced RAG Features - Async, hybrid search, agent integration