# RAG Support
Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from a knowledge base before generation. cyllama provides a complete RAG solution using:
- llama.cpp for both embedding generation and text generation
- sqlite-vector for high-performance vector similarity search
- SQLite FTS5 for hybrid keyword + semantic search
## Architecture

```
                    +-----------------+
                    |  RAG Pipeline   |
                    +--------+--------+
                             |
         +-------------------+-------------------+
         |                   |                   |
+--------v--------+ +--------v--------+ +--------v--------+
|    Embedder     | |   VectorStore   | |    Generator    |
| (embedding LLM) | |   (retrieval)   | | (generation LLM)|
+-----------------+ +-----------------+ +-----------------+
```
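The flow through these three components can be sketched in plain Python. Note that `embed` below is a toy bag-of-characters stand-in for illustration only; cyllama's actual `Embedder` runs a GGUF embedding model:

```python
import math

def embed(text: str) -> list[float]:
    # Toy character-frequency "embedding", L2-normalized.
    # A real Embedder would run an embedding LLM instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# "VectorStore": documents stored alongside their embeddings
docs = ["Python is a programming language", "Neural networks learn patterns"]
store = [(doc, embed(doc)) for doc in docs]

# Retrieval: rank documents by cosine similarity to the query embedding
query = "What is Python?"
query_vec = embed(query)
ranked = sorted(store, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
context = ranked[0][0]

# "Generator": the retrieved context is prepended to the prompt before generation
prompt = f"Context: {context}\n\nQuestion: {query}"
```

The pipeline's job is exactly this orchestration: embed the query, retrieve the closest stored documents, and assemble them into the generation prompt.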
## Quick Start

The simplest way to use RAG is through the high-level `RAG` class:

```python
from cyllama.rag import RAG

# Initialize with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
)

# Add documents to the knowledge base
rag.add_texts([
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning uses algorithms to learn patterns from data.",
    "Neural networks are inspired by biological brain structures.",
])

# Or load from files
rag.add_documents(["docs/guide.md", "docs/api.txt"])

# Query the knowledge base
response = rag.query("What is Python?")
print(response.text)
print(f"Sources: {len(response.sources)}")

# Stream the response
for chunk in rag.stream("Explain machine learning"):
    print(chunk, end="", flush=True)

# Clean up
rag.close()
```
## Using Context Managers

For proper resource cleanup, use the context manager:

```python
from cyllama.rag import RAG

with RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf",
) as rag:
    rag.add_texts(["Your documents here..."])
    response = rag.query("Your question?")
    print(response.text)
# Resources automatically cleaned up
```
## Components Overview

### Core Components

| Component     | Description                                         |
|---------------|-----------------------------------------------------|
| `RAG`         | High-level interface with sensible defaults         |
| `AsyncRAG`    | Async wrapper for non-blocking operations           |
| `RAGPipeline` | Lower-level orchestration of retrieval + generation |
| `RAGConfig`   | Configuration for retrieval and generation          |
### Storage & Retrieval

| Component     | Description                                    |
|---------------|------------------------------------------------|
| `Embedder`    | Generate vector embeddings from text           |
| `VectorStore` | SQLite-based vector storage with sqlite-vector |
| `HybridStore` | Combined FTS5 + vector search                  |
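Hybrid search needs a way to merge the keyword ranking (FTS5) with the semantic ranking (sqlite-vector) into one result list. A common technique for this is reciprocal rank fusion (RRF); the sketch below illustrates the idea and is not necessarily the scoring cyllama's `HybridStore` uses:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc IDs ordered best-first.
    # RRF score for a doc: sum over rankings of 1 / (k + rank).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # e.g. from an FTS5 BM25 query
vector_hits = ["doc1", "doc5", "doc3"]   # e.g. from a vector similarity query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear near the top of both rankings (here `doc1` and `doc3`) rise above documents that only one ranking found, which is exactly the behavior hybrid search is after.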
### Text Processing

| Component           | Description                        |
|---------------------|------------------------------------|
| `TextSplitter`      | Recursive character text splitting |
| `TokenTextSplitter` | Token-based splitting              |
| `MarkdownSplitter`  | Markdown-aware splitting           |
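Recursive character splitting tries a hierarchy of separators (paragraph breaks, then sentences, then words), recursing to a finer separator only when a piece is still too long, and greedily merging small pieces back up to the chunk size. A minimal sketch of the idea follows; the separators and behavior here are illustrative, not cyllama's `TextSplitter` defaults:

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    # Base case: short enough, or no finer separators left to try.
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Piece is itself too long: flush and recurse with a finer separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            # Greedily merge small adjacent pieces up to chunk_size.
            current = current + sep + piece if current else piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "Alpha beta gamma. Delta epsilon zeta. Eta theta."
chunks = recursive_split(text, chunk_size=40)
```

The first two sentences fit together under the 40-character limit and are merged into one chunk, while the third becomes its own chunk; splitting at natural boundaries like this keeps retrieved chunks coherent.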
### Document Loaders

| Component         | Description                       |
|-------------------|-----------------------------------|
| `TextLoader`      | Plain text files                  |
| `MarkdownLoader`  | Markdown with frontmatter         |
| `JSONLoader`      | JSON with configurable extraction |
| `JSONLLoader`     | JSON Lines with lazy loading      |
| `DirectoryLoader` | Batch loading from directories    |
| `PDFLoader`       | PDF files (requires docling)      |
### Advanced Features

| Component                | Description             |
|--------------------------|-------------------------|
| `Reranker`               | Cross-encoder reranking |
| `create_rag_tool`        | Agent integration       |
| `async_search_knowledge` | Async search helper     |
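Cross-encoder reranking scores each (query, candidate) pair jointly and re-sorts the initially retrieved set, which is typically more accurate than the embedding similarity used for first-stage retrieval. The sketch below shows the re-sort step with a toy token-overlap scorer standing in for the cross-encoder model that cyllama's `Reranker` would actually run:

```python
def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Score each (query, doc) pair, then keep the top_k best.
    # Token overlap is a toy stand-in for a real cross-encoder score.
    def score(doc: str) -> float:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        return len(q_tokens & d_tokens) / max(len(q_tokens), 1)
    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    "Neural networks are inspired by the brain",
    "Python is a programming language",
    "SQLite stores data in a single file",
]
top = rerank("what language is python", candidates, top_k=2)
```

The usual pattern is to retrieve a generous candidate set (say, 20) cheaply with vector search, then pay the higher per-pair cost of reranking only on those candidates.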
## Embedding Models

cyllama uses llama.cpp embedding models in GGUF format. Recommended models:

| Model                    | Dimension | Size   | Notes                     |
|--------------------------|-----------|--------|---------------------------|
| bge-small-en-v1.5        | 384       | ~130MB | Good quality/size balance |
| bge-base-en-v1.5         | 768       | ~440MB | Higher quality            |
| snowflake-arctic-embed-s | 384       | ~130MB | Fast, accurate            |
| all-MiniLM-L6-v2         | 384       | ~90MB  | Lightweight               |
| nomic-embed-text-v1.5    | 768       | ~550MB | Long context (8192)       |
### Downloading Models

```shell
# Using huggingface-cli
huggingface-cli download BAAI/bge-small-en-v1.5-gguf bge-small-en-v1.5-q8_0.gguf

# Or directly with wget
wget https://huggingface.co/BAAI/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf
```
## Next Steps