RAG Support

Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from a knowledge base before generation. cyllama provides a complete RAG solution using:

  • llama.cpp for both embedding generation and text generation
  • sqlite-vector for high-performance vector similarity search
  • SQLite FTS5 for hybrid keyword + semantic search

Architecture

                    +-----------------+
                    |   RAG Pipeline  |
                    +--------+--------+
                             |
         +-------------------+-------------------+
         |                   |                   |
+--------v--------+ +--------v--------+ +--------v--------+
|    Embedder     | |  VectorStore    | |   Generator     |
| (embedding LLM) | | (retrieval)     | | (generation LLM)|
+-----------------+ +-----------------+ +-----------------+

Quick Start

The simplest way to use RAG is through the high-level RAG class:

```python
from cyllama.rag import RAG

# Initialize with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)

# Add documents to the knowledge base
rag.add_texts([
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning uses algorithms to learn patterns from data.",
    "Neural networks are inspired by biological brain structures."
])

# Or load from files
rag.add_documents(["docs/guide.md", "docs/api.txt"])

# Query the knowledge base
response = rag.query("What is Python?")
print(response.text)
print(f"Sources: {len(response.sources)}")

# Stream the response
for chunk in rag.stream("Explain machine learning"):
    print(chunk, end="", flush=True)

# Clean up
rag.close()
```

Using Context Managers

For proper resource cleanup, use the context manager:

```python
from cyllama.rag import RAG

with RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf"
) as rag:
    rag.add_texts(["Your documents here..."])
    response = rag.query("Your question?")
    print(response.text)
# Resources automatically cleaned up
```

Components Overview

Core Components

| Component   | Description                                          |
|-------------|------------------------------------------------------|
| RAG         | High-level interface with sensible defaults          |
| AsyncRAG    | Async wrapper for non-blocking operations            |
| RAGPipeline | Lower-level orchestration of retrieval + generation  |
| RAGConfig   | Configuration for retrieval and generation           |

Storage & Retrieval

| Component   | Description                                    |
|-------------|------------------------------------------------|
| Embedder    | Generate vector embeddings from text           |
| VectorStore | SQLite-based vector storage with sqlite-vector |
| HybridStore | Combined FTS5 + vector search                  |

Text Processing

| Component         | Description                        |
|-------------------|------------------------------------|
| TextSplitter      | Recursive character text splitting |
| TokenTextSplitter | Token-based splitting              |
| MarkdownSplitter  | Markdown-aware splitting           |

Document Loaders

| Component       | Description                      |
|-----------------|----------------------------------|
| TextLoader      | Plain text files                 |
| MarkdownLoader  | Markdown with frontmatter        |
| JSONLoader      | JSON with configurable extraction|
| JSONLLoader     | JSON Lines with lazy loading     |
| DirectoryLoader | Batch loading from directories   |
| PDFLoader       | PDF files (requires docling)     |

Advanced Features

| Component              | Description              |
|------------------------|--------------------------|
| Reranker               | Cross-encoder reranking  |
| create_rag_tool        | Agent integration        |
| async_search_knowledge | Async search helper      |

Embedding Models

cyllama uses llama.cpp embedding models in GGUF format. Recommended models:

| Model                    | Dimension | Size   | Notes                     |
|--------------------------|-----------|--------|---------------------------|
| bge-small-en-v1.5        | 384       | ~130MB | Good quality/size balance |
| bge-base-en-v1.5         | 768       | ~440MB | Higher quality            |
| snowflake-arctic-embed-s | 384       | ~130MB | Fast, accurate            |
| all-MiniLM-L6-v2         | 384       | ~90MB  | Lightweight               |
| nomic-embed-text-v1.5    | 768       | ~550MB | Long context (8192 tokens)|

Downloading Models

```bash
# Using huggingface-cli
huggingface-cli download BAAI/bge-small-en-v1.5-gguf bge-small-en-v1.5-q8_0.gguf

# Or directly with wget
wget https://huggingface.co/BAAI/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf
```

Next Steps