RAG Pipeline

The RAG pipeline orchestrates the complete retrieval-augmented generation process: embedding queries, retrieving relevant documents, formatting prompts, and generating responses.

High-Level RAG Class

The RAG class provides the simplest interface:

from cyllama.rag import RAG, RAGConfig

# Initialize
rag = RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf",
    chunk_size=512,          # Text splitting
    chunk_overlap=50,
    db_path=":memory:",      # Vector store location
    config=RAGConfig(        # Optional config
        top_k=5,
        temperature=0.7
    )
)

# Add documents
rag.add_texts([
    "Python was created by Guido van Rossum.",
    "Python emphasizes code readability.",
    "Python supports multiple programming paradigms."
])

# Or from files
rag.add_documents(["guide.md", "tutorial.txt"])

# Query
response = rag.query("Who created Python?")
print(response.text)

# Stream response
for chunk in rag.stream("Explain Python's philosophy"):
    print(chunk, end="", flush=True)

# Retrieve without generation
sources = rag.retrieve("Python creator")
for source in sources:
    print(f"[{source.score:.2f}] {source.text}")

# Direct vector search
results = rag.search("programming language", k=3)

# Clean up
rag.close()
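
The `chunk_size` and `chunk_overlap` arguments in the example above control how documents are split before embedding. The sketch below illustrates the general idea of overlapping windows. It is not cyllama's splitter: the function name `split_text` is hypothetical, and it counts characters, whereas the library may count tokens or respect sentence boundaries.

```python
def split_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 1200 characters of sample text, split with the settings used above
chunks = split_text("abcdefghij" * 120, chunk_size=512, chunk_overlap=50)
for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.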

RAGConfig

Configure retrieval and generation parameters:

from cyllama.rag import RAGConfig

config = RAGConfig(
    # Retrieval settings
    top_k=5,                      # Number of documents to retrieve
    similarity_threshold=0.5,     # Minimum similarity score

    # Generation settings
    max_tokens=512,               # Maximum response length
    temperature=0.7,              # Sampling temperature (0.0 = deterministic)

    # Prompt formatting
    prompt_template="""Use the context to answer the question.

Context:
{context}

Question: {question}

Answer:""",
    context_separator="\n\n",     # String placed between retrieved documents
    include_metadata=False        # Include metadata in context
)
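
To make the two retrieval settings concrete, here is a minimal sketch of how `top_k` and `similarity_threshold` could interact. The `Hit` class and `select_hits` function are hypothetical stand-ins, and the order of operations (threshold first, then top-k, with an inclusive comparison) is an assumption rather than documented cyllama behavior.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for a retrieved source: text plus similarity score."""
    text: str
    score: float


def select_hits(hits: list[Hit], top_k: int, similarity_threshold: float) -> list[Hit]:
    """Drop hits below the threshold, then keep the top_k highest-scoring ones."""
    kept = [h for h in hits if h.score >= similarity_threshold]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:top_k]


candidates = [
    Hit("Python emphasizes code readability.", 0.55),
    Hit("Bananas are rich in potassium.", 0.12),
    Hit("Python was created by Guido van Rossum.", 0.82),
]

selected = select_hits(candidates, top_k=5, similarity_threshold=0.5)
for hit in selected:
    print(f"[{hit.score:.2f}] {hit.text}")
```

With these settings a query can return fewer than `top_k` sources: the threshold removes weak matches even when there is room for them.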

Custom Prompt Templates

A template is a string containing `{context}` and `{question}` placeholders, which are filled with the retrieved text and the user's query:

# Simple template
config = RAGConfig(
    prompt_template="""Based on these facts:
{context}

Answer this: {question}"""
)

# With metadata
config = RAGConfig(
    include_metadata=True,
    prompt_template="""Sources:
{context}

Given the above sources, answer: {question}"""
)
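
The following sketch shows one way the `{context}` and `{question}` placeholders might be filled. It assumes plain `str.format` substitution and a made-up metadata layout; `render_prompt` is a hypothetical helper, not part of cyllama, and the library's actual formatting of metadata may differ.

```python
def render_prompt(template, question, sources, context_separator="\n\n", include_metadata=False):
    """Join source texts into a context string and substitute it into the template.

    `sources` is a list of (text, metadata_dict) pairs.
    """
    parts = []
    for text, metadata in sources:
        if include_metadata and metadata:
            # Hypothetical layout: a bracketed key=value line above the text
            tags = ", ".join(f"{key}={value}" for key, value in sorted(metadata.items()))
            parts.append(f"[{tags}]\n{text}")
        else:
            parts.append(text)
    context = context_separator.join(parts)
    return template.format(context=context, question=question)


TEMPLATE = """Based on these facts:
{context}

Answer this: {question}"""

sources = [
    ("Python was created by Guido van Rossum.", {"source": "guide.md"}),
    ("Python emphasizes code readability.", {}),
]

prompt = render_prompt(TEMPLATE, "Who created Python?", sources, include_metadata=True)
print(prompt)
```

If substitution does work through `str.format`, any literal braces in a template would need to be doubled (`{{` and `}}`) to avoid being read as placeholders.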

RAGResponse

Query responses include text, sources, and statistics:

response = rag.query("What is Python?")

# Generated text
print(response.text)

# Original query
print(response.query)

# Retrieved sources
for source in response.sources:
    print(f"ID: {source.id}")
    print(f"Text: {source.text}")
    print(f"Score: {source.score}")
    print(f"Metadata: {source.metadata}")

# Generation stats (if available)
if response.stats:
    print(f"Tokens: {response.stats.generated_tokens}")
    print(f"Time: {response.stats.total_time}s")

# Serialize to dict
data = response.to_dict()
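
Because `to_dict()` returns a plain dictionary, a response can presumably be logged or stored as JSON. The sketch below uses stand-in dataclasses (`Source` and `Response`, both hypothetical) so that it runs without a model. The field names mirror the attributes printed above, but the exact shape of the dictionary cyllama produces is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Source:
    """Stand-in for a retrieved source."""
    id: str
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for RAGResponse, without generation stats."""
    text: str
    query: str
    sources: list = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() converts nested dataclasses (the sources) recursively
        return asdict(self)


response = Response(
    text="Guido van Rossum created Python.",
    query="Who created Python?",
    sources=[
        Source(
            id="doc-1",
            text="Python was created by Guido van Rossum.",
            score=0.82,
            metadata={"source": "guide.md"},
        )
    ],
)

payload = json.dumps(response.to_dict(), indent=2)  # serialize for logging or storage
restored = json.loads(payload)                      # read it back as plain dicts
print(restored["sources"][0]["text"])
```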

RAGPipeline (Low-Level)

For more control, use RAGPipeline directly:

from cyllama import LLM
from cyllama.rag import Embedder, VectorStore, RAGPipeline, RAGConfig

# Create components
embedder = Embedder("models/bge-small.gguf")
store = VectorStore(dimension=embedder.dimension)
llm = LLM("models/llama.gguf")

# Index documents
texts = ["Doc 1 content", "Doc 2 content"]
embeddings = embedder.embed_batch(texts)
store.add(embeddings, texts)

# Create pipeline
pipeline = RAGPipeline(
    embedder=embedder,
    store=store,
    generator=llm,
    config=RAGConfig(top_k=3)
)

# Query
response = pipeline.query("Your question?")
print(response.text)

# Retrieve only (no generation)
sources = pipeline.retrieve("Your question?")

# Stream
for chunk in pipeline.stream("Your question?"):
    print(chunk, end="", flush=True)

# Override config for specific query
custom_config = RAGConfig(top_k=10, temperature=0.2)
response = pipeline.query("Question?", config=custom_config)

# Clean up
embedder.close()
store.close()
llm.close()
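
To show how the three components fit together, here is a toy version of the same flow (embed, retrieve, format, generate) that runs without any model files. Everything in it is a stand-in: `embed` is a bag-of-words counter rather than a dense embedding, `ToyStore` does a brute-force cosine search, and `toy_generate` only reports the prompt size. It sketches the data flow and makes no claim about how `RAGPipeline` is implemented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real embedder returns dense vectors)."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """Stand-in vector store: keeps (embedding, text) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, embeddings, texts):
        self.items.extend(zip(embeddings, texts))

    def search(self, query_embedding, k):
        scored = [(cosine(query_embedding, emb), text) for emb, text in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def toy_generate(prompt: str) -> str:
    """Stand-in for the LLM: reports the prompt size instead of generating text."""
    return f"<answer based on {len(prompt)} prompt characters>"


def run_query(question: str, store: ToyStore, top_k: int = 3):
    query_embedding = embed(question)                 # 1. embed the query
    hits = store.search(query_embedding, k=top_k)     # 2. retrieve nearest documents
    context = "\n\n".join(text for _, text in hits)   # 3. format the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    return toy_generate(prompt), hits                 # 4. generate


texts = [
    "Python was created by Guido van Rossum.",
    "Bananas are rich in potassium.",
]
store = ToyStore()
store.add([embed(t) for t in texts], texts)

answer, hits = run_query("Who created Python?", store, top_k=1)
print(hits[0][1])
print(answer)
```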

Query Override

Pass a `config` argument to `query()`, `stream()`, or `retrieve()` to override the instance's configuration for a single call:

# Default config
rag = RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf",
    config=RAGConfig(top_k=5, temperature=0.7)
)

# Override for specific query
precise_config = RAGConfig(top_k=10, temperature=0.1)
response = rag.query("Technical question?", config=precise_config)

# Creative query
creative_config = RAGConfig(top_k=3, temperature=0.9)
response = rag.query("Write a poem about...", config=creative_config)
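
A config built from scratch, such as `RAGConfig(top_k=10, temperature=0.1)`, sets only the fields you pass; this page does not say how the remaining fields are resolved. If you want per-query variants that keep every other setting from a base config, one option is to derive them with `dataclasses.replace`. The sketch below uses a hypothetical `QueryConfig` dataclass as a stand-in, since it is an assumption that `RAGConfig` itself is a dataclass.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical stand-in for RAGConfig with a few of its fields."""
    top_k: int = 5
    similarity_threshold: float = 0.5
    max_tokens: int = 512
    temperature: float = 0.7


# Shared settings for the whole application
base = QueryConfig(top_k=5, similarity_threshold=0.3, max_tokens=256)

# Per-query variants: only the named fields change, the rest come from base
precise = replace(base, top_k=10, temperature=0.1)
creative = replace(base, top_k=3, temperature=0.9)

print(precise)
print(creative)
```

Because the stand-in is frozen, `replace` returns a new object each time and `base` is never modified, so one query's override cannot leak into the next.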

Complete Example

from cyllama.rag import RAG, RAGConfig

# Custom configuration
config = RAGConfig(
    top_k=5,
    similarity_threshold=0.3,
    max_tokens=256,
    temperature=0.7,
    prompt_template="""You are a helpful assistant. Use the following context to answer the user's question. If the context doesn't contain relevant information, say so.

Context:
{context}

User Question: {question}

Answer:"""
)

# Initialize RAG
with RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf",
    config=config
) as rag:
    # Build knowledge base
    rag.add_texts([
        "Python was created by Guido van Rossum in 1991.",
        "Python is known for its clear syntax and readability.",
        "Python supports object-oriented, functional, and procedural programming.",
        "The Python Package Index (PyPI) hosts thousands of third-party modules.",
    ])

    # Run a fixed list of questions
    questions = [
        "Who created Python?",
        "What is Python known for?",
        "What programming paradigms does Python support?"
    ]

    for question in questions:
        print(f"\nQ: {question}")
        response = rag.query(question)
        print(f"A: {response.text}")
        print(f"   Sources: {len(response.sources)}")
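
The `with` block above releases the models even if a query raises. The low-level components in the RAGPipeline example are closed by hand, so an exception before the clean-up lines would skip them. A sketch of one way to get the same guarantee uses the standard library's `contextlib.ExitStack` and `closing`, which only require that each object has a `close()` method. The `Resource` class is a hypothetical stand-in for `Embedder`, `VectorStore`, and `LLM`.

```python
from contextlib import ExitStack, closing


class Resource:
    """Hypothetical stand-in for a component that exposes close()."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def close(self):
        self.log.append(f"closed {self.name}")


log = []
try:
    with ExitStack() as stack:
        # closing() turns any object with close() into a context manager
        embedder = stack.enter_context(closing(Resource("embedder", log)))
        store = stack.enter_context(closing(Resource("store", log)))
        llm = stack.enter_context(closing(Resource("llm", log)))
        raise RuntimeError("query failed")  # simulate an error mid-pipeline
except RuntimeError:
    pass

print(log)
```

`ExitStack` closes the registered resources in reverse order of registration, so the generator is released first and the embedder last.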

RAG Methods Summary

| Method | Description |
|---|---|
| `add_texts(texts, metadata, split)` | Add text strings to the knowledge base |
| `add_documents(paths, split)` | Load and add files |
| `add_document(document, split)` | Add a single Document object |
| `query(question, config)` | Get a RAGResponse with generated text |
| `stream(question, config)` | Stream response tokens |
| `retrieve(question, config)` | Get relevant sources only |
| `search(query, k, threshold)` | Direct vector search |
| `count` | Number of documents |
| `clear()` | Remove all documents |
| `close()` | Release resources |