RAG Support¶
Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from a knowledge base before generation. cyllama provides a complete RAG solution using:
- llama.cpp for both embedding generation and text generation
- sqlite-vector for high-performance vector similarity search
- SQLite FTS5 for hybrid keyword + semantic search
Architecture¶
                    +-----------------+
                    |  RAG Pipeline   |
                    +--------+--------+
                             |
         +-------------------+-------------------+
         |                   |                   |
+--------v--------+ +--------v--------+ +--------v--------+
|    Embedder     | |SqliteVectorStore| |    Generator    |
| (embedding LLM) | |   (retrieval)   | | (generation LLM)|
+-----------------+ +-----------------+ +-----------------+
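
To make the three stages in the diagram concrete, here is a minimal, dependency-free sketch of the same flow: embed, retrieve, then build the prompt a generator would receive. It does not use cyllama. The bag-of-words `embed` function is a toy stand-in for an embedding model, and every function name here is illustrative only.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Retrieval stage: rank stored documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Generation stage input: retrieved context is placed ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


docs = [
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning uses algorithms to learn patterns from data.",
    "Neural networks are inspired by biological brain structures.",
]
top = retrieve("What is Python?", docs, k=1)
prompt = build_prompt("What is Python?", top)
print(prompt)
```

In the real pipeline the embedding and generation steps are performed by GGUF models, and the ranking is delegated to the vector store.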
Quick Start¶
The simplest way to use RAG is through the high-level RAG class:
from cyllama.rag import RAG

# Initialize with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)

# Add documents to the knowledge base
rag.add_texts([
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning uses algorithms to learn patterns from data.",
    "Neural networks are inspired by biological brain structures."
])

# Or load from files
rag.add_documents(["docs/guide.md", "docs/api.txt"])

# Query the knowledge base
response = rag.query("What is Python?")
print(response.text)
print(f"Sources: {len(response.sources)}")

# Stream the response
for chunk in rag.stream("Explain machine learning"):
    print(chunk, end="", flush=True)

# Clean up
rag.close()
Using Context Managers¶
For proper resource cleanup, use the context manager:
from cyllama.rag import RAG

with RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf"
) as rag:
    rag.add_texts(["Your documents here..."])
    response = rag.query("Your question?")
    print(response.text)
# Resources automatically cleaned up
Pluggable Backends¶
RAG and RAGPipeline accept an injected embedder and vector store via the embedder= and store= constructor parameters:
from cyllama.rag import RAG, SqliteVectorStore

rag = RAG(
    embedding_model="",  # ignored when embedder= is supplied
    generation_model="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedder=my_embedder,  # any EmbedderProtocol
    store=SqliteVectorStore(dimension=1536, db_path="x.db"),  # any VectorStoreProtocol
)
Both slots are typed as structural protocols (EmbedderProtocol and VectorStoreProtocol in cyllama.rag.types). Alternative backends, such as OpenAI embeddings, Qdrant, Chroma, pgvector, or an in-house service, only need to implement the handful of methods the RAG layer actually calls to become drop-in replacements.
Omit either argument to fall back to its default: Embedder over a local GGUF embedding model, or SqliteVectorStore.
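
As a rough illustration of what "structural protocol" means here, the sketch below defines a hypothetical `EmbedderLike` protocol with a single `embed` method and a toy class that satisfies it without inheriting from anything. The method name and signature are assumptions for illustration; the actual members of EmbedderProtocol are defined in cyllama.rag.types and may differ.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbedderLike(Protocol):
    """Hypothetical stand-in for EmbedderProtocol; the real method set may differ."""

    def embed(self, text: str) -> list[float]: ...


class HashEmbedder:
    """Toy backend: no inheritance needed, it only has to match the method shape."""

    def __init__(self, dimension: int = 8) -> None:
        self.dimension = dimension

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dimension
        for token in text.lower().split():
            # Deterministic bucket per token (the built-in hash() is salted per process)
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dimension] += 1.0
        return vec


def index(embedder: EmbedderLike, texts: list[str]) -> list[list[float]]:
    """The caller depends only on the protocol, not on a concrete class."""
    return [embedder.embed(t) for t in texts]


vectors = index(HashEmbedder(), ["hello world", "hello"])
print(isinstance(HashEmbedder(), EmbedderLike), len(vectors[0]))
```

Because the check is structural, an adapter around a remote embedding service would qualify in the same way as long as it exposes the expected methods.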
Components Overview¶
Core Components¶
| Component | Description |
|---|---|
| `RAG` | High-level interface with sensible defaults |
| `AsyncRAG` | Async wrapper for non-blocking operations |
| `RAGPipeline` | Lower-level orchestration of retrieval + generation |
| `RAGConfig` | Configuration for retrieval and generation |
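
One common way to make a blocking call non-blocking is to run it in a worker thread, sketched below with `asyncio.to_thread`. This is only an illustration of the general pattern; whether AsyncRAG uses threads internally is an assumption, and `blocking_query` is a hypothetical stand-in for a synchronous query call.

```python
import asyncio
import time


def blocking_query(question: str) -> str:
    """Hypothetical stand-in for a synchronous, slow query call."""
    time.sleep(0.05)
    return f"answer to: {question}"


async def async_query(question: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(blocking_query, question)


async def main() -> list[str]:
    # Both queries are in flight at the same time
    return await asyncio.gather(async_query("a"), async_query("b"))


answers = asyncio.run(main())
print(answers)
```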
Storage & Retrieval¶
| Component | Description |
|---|---|
| `Embedder` | Generate vector embeddings from text |
| `SqliteVectorStore` | SQLite-based vector storage with sqlite-vector (default backend; implements `VectorStoreProtocol`). `VectorStore` remains as a deprecated alias. |
| `QdrantVectorStore` | Qdrant adapter for `VectorStoreProtocol` (optional: `uv sync --group qdrant`). Reference example for multi-backend support. |
| `HybridStore` | Combined FTS5 + vector search |
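
Hybrid search has to merge a keyword ranking with a vector ranking. The sketch below uses reciprocal rank fusion (RRF), a widely used merging technique chosen here purely for illustration; it is not necessarily how HybridStore combines its results, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from a full-text (FTS5) query
vector_hits = ["doc_b", "doc_d", "doc_a"]   # e.g. from a nearest-neighbour search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists accumulate score from each, so they rise above documents found by only one method.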
Text Processing¶
| Component | Description |
|---|---|
| `TextSplitter` | Recursive character text splitting |
| `TokenTextSplitter` | Token-based splitting |
| `MarkdownSplitter` | Markdown-aware splitting |
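
The idea behind recursive character splitting can be sketched in a few lines: split on the coarsest separator first, and only fall back to finer separators for pieces that are still too large. This is a simplified illustration, not TextSplitter's implementation; it has no chunk overlap, and the separator list and `chunk_size` value are assumptions.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
chunks = recursive_split(text, chunk_size=20)
print(chunks)
```

Paragraph boundaries are preserved where possible, and only the over-long middle paragraph is broken at word boundaries.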
Document Loaders¶
| Component | Description |
|---|---|
| `TextLoader` | Plain text files |
| `MarkdownLoader` | Markdown with frontmatter |
| `JSONLoader` | JSON with configurable extraction |
| `JSONLLoader` | JSON Lines with lazy loading |
| `DirectoryLoader` | Batch loading from directories |
| `PDFLoader` | PDF files (requires docling) |
Advanced Features¶
| Component | Description |
|---|---|
| `Reranker` | Cross-encoder reranking |
| `create_rag_tool` | Agent integration |
| `async_search_knowledge` | Async search helper |
Embedding Models¶
cyllama uses llama.cpp embedding models in GGUF format. Recommended models:
| Model | Dimension | Size | Notes |
|---|---|---|---|
| bge-small-en-v1.5 | 384 | ~130MB | Good quality/size balance |
| bge-base-en-v1.5 | 768 | ~440MB | Higher quality |
| snowflake-arctic-embed-s | 384 | ~130MB | Fast, accurate |
| all-MiniLM-L6-v2 | 384 | ~90MB | Lightweight |
| nomic-embed-text-v1.5 | 768 | ~550MB | Long context (8192) |
Downloading Models¶
# Using huggingface-cli
huggingface-cli download BAAI/bge-small-en-v1.5-gguf bge-small-en-v1.5-q8_0.gguf
# Or directly with wget
wget https://huggingface.co/BAAI/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf
Serving Embeddings over HTTP¶
The Embedder can also be served via the built-in OpenAI-compatible server (PythonServer or EmbeddedServer). This lets lightweight clients generate embeddings over HTTP without installing cyllama or having GPU access locally:
import time

from cyllama.llama.server.python import ServerConfig, PythonServer

config = ServerConfig(
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    embedding=True,
    embedding_model_path="models/bge-small-en-v1.5-q8_0.gguf",
)

with PythonServer(config) as server:
    # Serves /v1/chat/completions and /v1/embeddings
    while True:
        time.sleep(1)  # keep the process alive until interrupted

With the server running, request an embedding from any HTTP client:

curl -X POST http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "hello world"}'
See Embedder docs and Server Usage for configuration details.
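
For clients that prefer Python over curl, the same request can be issued with the standard library alone. The sketch below mirrors the curl example's URL, header, and body. The response parsing assumes an OpenAI-style payload (`{"data": [{"embedding": [...]}]}`), which is an assumption based on the server being described as OpenAI-compatible; the helper names are illustrative.

```python
import json
import urllib.request


def build_embedding_request(text: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text: str, base_url: str = "http://localhost:8080") -> list[float]:
    """Send the request to a running server and return the first embedding."""
    with urllib.request.urlopen(build_embedding_request(text, base_url), timeout=30) as resp:
        payload = json.load(resp)
    # Assumption: OpenAI-style response shape
    return payload["data"][0]["embedding"]


request = build_embedding_request("hello world")
print(request.get_method(), request.full_url)
```

Calling `fetch_embedding("hello world")` requires the server to be running; building the request does not.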
Command-Line Interface¶
The cyllama rag command provides command-line RAG without writing any Python:
# Single query against a directory
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure the system?"

# Index specific files and enter interactive mode (omit -p)
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md

# Stream output and show source chunks
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "Summarize the architecture" --stream --sources

# Custom system instruction and retrieval settings
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -s "Answer in one paragraph" -k 3 --threshold 0.4
Options¶
| Flag | Description | Default |
|---|---|---|
| `-m, --model` | Path to GGUF generation model | (required) |
| `-e, --embedding-model` | Path to GGUF embedding model | (required) |
| `-f, --file` | File to index (repeatable) | |
| `-d, --dir` | Directory to index (repeatable) | |
| `--glob` | Glob pattern for directory loading | `**/*` |
| `-p, --prompt` | Single query (omit for interactive mode) | |
| `-s, --system` | System instruction prepended to the prompt template | |
| `-n, --max-tokens` | Maximum tokens to generate | 200 |
| `--temperature` | Generation temperature | 0.8 |
| `-k, --top-k` | Number of chunks to retrieve | 5 |
| `--threshold` | Minimum similarity threshold | (none) |
| `-ngl, --n-gpu-layers` | GPU layers to offload | -1 |
| `--stream` | Stream output tokens | off |
| `--sources` | Show source chunks with similarity scores | off |
| `--db PATH` | Persist the vector index to a SQLite file (see below) | (in-memory) |
| `--rebuild` | Delete the `--db` file and re-index from `-f`/`-d` | off |
| `--no-chat-template` | Use raw-completion prompting instead of the model's chat template | off (chat template on) |
| `--show-think` | Keep `<think>...</think>` reasoning blocks in the output | off (stripped) |
| `--repetition-threshold N` | Stop generation after the same n-gram repeats N times. 0 disables. | 2 |
| `--repetition-ngram N` | Word-level n-gram length for repetition detection | 5 |
| `--repetition-window N` | Rolling word-window size for repetition detection | 300 |
At least one document source (-f or -d) is required on the first run. With --db PATH, subsequent runs may omit -f/-d to query the existing index.
In interactive mode, type your questions at the > prompt. Press Ctrl+C or send EOF (Ctrl+D on Unix-like systems) to exit.
Persistent Vector Store (CLI)¶
By default cyllama rag builds the index in memory and rebuilds it on every invocation. With --db PATH, the index is persisted to a SQLite file and reused on subsequent runs, so the corpus is embedded only once:
# First run: index the corpus and persist it
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db rag.db -f corpus.txt -p "What is in the corpus?"

# Subsequent runs: reuse the persisted index without re-embedding
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db rag.db -p "Another question?"

# Re-running with the same -f is a true no-op on indexing, because the
# file's content hash is already in the dedup table:
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db rag.db -f corpus.txt -p "..."
# > reusing N chunks from rag.db (1 unchanged)

# Switched embedding models or chunking? Use --rebuild:
cyllama rag -m models/llama.gguf -e models/bge-base.gguf \
    --db rag.db --rebuild -f corpus.txt -p "..."
Decision matrix:
| Args | Behavior |
|---|---|
| `--db PATH` only, DB exists | Reuse existing index, no indexing |
| `--db PATH` + `-f`/`-d`, DB missing | Create DB, index sources |
| `--db PATH` + `-f`/`-d`, DB exists | Reopen DB, append (dedup-skipping unchanged sources) |
| `--db PATH --rebuild` + `-f`/`-d` | Delete DB, recreate, index sources |
| `--db PATH --rebuild` without `-f`/`-d` | Error (rebuild needs sources) |
| `--db PATH` missing, no `-f`/`-d` | Error (nothing to query) |
If the embedding model basename, chunk size, or chunk overlap on a reopen does not match what's stored in the DB's metadata table, cyllama rag exits with a clear error pointing at --rebuild. See SqliteVectorStore — Metadata Validation for details.
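
The two safeguards described in this section, content-hash deduplication and settings validation on reopen, can be sketched with the standard `sqlite3` and `hashlib` modules. The table names, column names, and the chunk size of 512 are hypothetical; cyllama's actual schema and error messages are not shown in this document and will differ.

```python
import hashlib
import sqlite3


def open_index(conn: sqlite3.Connection, embedding_model: str, chunk_size: int) -> None:
    """Create bookkeeping tables, or verify stored settings on a reopen."""
    conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS sources (content_hash TEXT PRIMARY KEY, name TEXT)")
    wanted = {"embedding_model": embedding_model, "chunk_size": str(chunk_size)}
    stored = dict(conn.execute("SELECT key, value FROM meta"))
    if not stored:
        conn.executemany("INSERT INTO meta VALUES (?, ?)", list(wanted.items()))
    elif stored != wanted:
        raise ValueError(f"index settings changed ({stored} != {wanted}); rebuild required")


def add_source(conn: sqlite3.Connection, name: str, content: str) -> bool:
    """Return True if the source was indexed, False if identical content was already present."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if conn.execute("SELECT 1 FROM sources WHERE content_hash = ?", (digest,)).fetchone():
        return False  # unchanged content: skip chunking and embedding
    # A real indexer would chunk, embed, and store vectors here
    conn.execute("INSERT INTO sources VALUES (?, ?)", (digest, name))
    return True


conn = sqlite3.connect(":memory:")
open_index(conn, "bge-small.gguf", 512)
first = add_source(conn, "corpus.txt", "some text")
second = add_source(conn, "corpus.txt", "some text")
print(first, second)
```

Reopening with a different embedding model raises an error in this sketch, mirroring the behaviour that points users at `--rebuild`.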
Generation Defaults Worth Knowing¶
The CLI flips three RAGConfig fields from their library defaults because they fix common failure modes for chat-tuned and reasoning-tuned models. See RAG Pipeline — RAGConfig for the underlying fields.
| Behavior | CLI default | Disable with |
|---|---|---|
| Native chat-template prompting | on | `--no-chat-template` |
| `<think>` block stripping | on | `--show-think` |
| N-gram repetition guard | on (threshold=2) | `--repetition-threshold 0` |
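
Two of these behaviours are easy to picture in code. The sketch below strips `<think>` blocks with a regular expression and flags looping output with a word-level n-gram counter, using the CLI default values as parameters. It is an illustration under assumptions: here the guard trips once any n-gram has occurred `threshold` times inside the window, and the real implementation may count repeats differently.

```python
import re
from collections import Counter

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning blocks, keeping only the visible answer."""
    return THINK_BLOCK.sub("", text)


def is_repeating(text: str, threshold: int = 2, ngram: int = 5, window: int = 300) -> bool:
    """Check the last `window` words for any n-gram seen at least `threshold` times."""
    if threshold <= 0:
        return False  # 0 disables the guard
    words = text.split()[-window:]
    grams = Counter(tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1))
    return any(count >= threshold for count in grams.values())


looping = "the cat sat on the mat " * 3
normal = "Python is a high-level programming language known for its simplicity."
print(is_repeating(looping), is_repeating(normal))
print(strip_think("<think>plan the answer</think>\nAnswer: 42"))
```

A streaming generator would run a check like `is_repeating` on the accumulated output after each token and stop when it returns True.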
Next Steps¶
- Embedder - Generating embeddings
- SqliteVectorStore - Vector storage and search
- Text Processing - Document splitting and loading
- RAG Pipeline - RAG pipeline configuration
- Advanced RAG Features - Async, hybrid search, agent integration