Advanced RAG Features
cyllama provides advanced RAG features including async support, agent integration, hybrid search, and reranking.
AsyncRAG
For non-blocking RAG operations in async applications:
import asyncio

from cyllama.rag import AsyncRAG

async def main():
    async with AsyncRAG(
        embedding_model="models/bge-small.gguf",
        generation_model="models/llama.gguf"
    ) as rag:
        # Async document ingestion
        await rag.add_texts([
            "Python is a programming language.",
            "Machine learning uses neural networks."
        ])
        # Async query
        response = await rag.query("What is Python?")
        print(response.text)
        # Async streaming
        async for chunk in rag.stream("Explain ML"):
            print(chunk, end="", flush=True)
        # Async retrieve
        sources = await rag.retrieve("neural networks")
        for source in sources:
            print(f"[{source.score:.2f}] {source.text}")

asyncio.run(main())
AsyncRAG Methods
| Method | Description |
|---|---|
| `add_texts(texts, metadata, split)` | Async add texts |
| `add_documents(paths, split)` | Async load files |
| `query(question, config)` | Async query |
| `stream(question, config)` | Async token stream |
| `retrieve(question, config)` | Async retrieval |
| `search(query, k, threshold)` | Async vector search |
| `clear()` | Async clear all |
| `close()` | Async cleanup |
Using with FastAPI
from fastapi import FastAPI
from cyllama.rag import AsyncRAG

app = FastAPI()
rag = None

@app.on_event("startup")
async def startup():
    global rag
    rag = AsyncRAG(
        embedding_model="models/bge-small.gguf",
        generation_model="models/llama.gguf"
    )
    await rag.add_documents(["knowledge_base/"])

@app.on_event("shutdown")
async def shutdown():
    await rag.close()

@app.get("/query")
async def query(q: str):
    response = await rag.query(q)
    return {
        "answer": response.text,
        "sources": [s.text for s in response.sources]
    }
Agent Integration
Create RAG tools for use with cyllama agents:
from cyllama import LLM
from cyllama.rag import RAG, create_rag_tool
from cyllama.agents import ReActAgent

# Create RAG instance
rag = RAG(
    embedding_model="models/bge-small.gguf",
    generation_model="models/llama.gguf"
)
# Add knowledge
rag.add_texts([
    "The company was founded in 2020.",
    "Our main product is an AI assistant.",
    "We have 50 employees worldwide."
])
# Create tool from RAG
knowledge_tool = create_rag_tool(
    rag,
    name="search_knowledge",
    description="Search the company knowledge base for information.",
    top_k=3,
    include_scores=True
)
# Use with agent
llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[knowledge_tool])
result = agent.run("When was the company founded and how many employees do we have?")
print(result)
Tool Options
tool = create_rag_tool(
    rag,
    name="search_docs",         # Tool name
    description="Search docs",  # Tool description
    top_k=5,                    # Results to retrieve
    include_scores=True         # Show similarity scores
)
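To make the options above concrete, here is a minimal sketch of what a retrieval tool does conceptually: it takes the agent's query, fetches the top results, and returns them as text. This is an illustration only, not how `create_rag_tool` is implemented. The names `Hit`, `make_search_tool`, and `fake_retrieve`, as well as the dictionary the factory returns, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    """Hypothetical search result: a text chunk and its similarity score."""
    text: str
    score: float


def make_search_tool(
    retrieve: Callable[[str, int], List[Hit]],
    name: str,
    description: str,
    top_k: int = 5,
    include_scores: bool = True,
):
    """Wrap a retrieval function as a named tool that returns plain text."""

    def run(query: str) -> str:
        hits = retrieve(query, top_k)
        if not hits:
            return "No results found."
        lines = []
        for i, hit in enumerate(hits, start=1):
            # Prefix each result with its score only when requested
            prefix = f"[{hit.score:.2f}] " if include_scores else ""
            lines.append(f"{i}. {prefix}{hit.text}")
        return "\n".join(lines)

    return {"name": name, "description": description, "run": run}


def fake_retrieve(query: str, k: int) -> List[Hit]:
    """Stand-in retriever that ignores the query and returns fixed results."""
    corpus = [
        Hit("The company was founded in 2020.", 0.91),
        Hit("We have 50 employees worldwide.", 0.47),
    ]
    return corpus[:k]


tool = make_search_tool(fake_retrieve, "search_docs", "Search docs", top_k=1)
output = tool["run"]("When was the company founded?")
print(output)
```

In this sketch `top_k` caps how much text is handed back to the agent, and `include_scores` lets the agent see how strong each match was.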
HybridStore
Combine vector similarity with keyword search using FTS5:
from cyllama.rag import HybridStore, Embedder

embedder = Embedder("models/bge-small.gguf")
# Create hybrid store
store = HybridStore(
    dimension=embedder.dimension,
    db_path="hybrid.db",
    alpha=0.5  # Balance between vector (1.0) and FTS (0.0)
)
# Add documents
texts = [
    "Python programming tutorial for beginners",
    "Advanced Python decorators and metaclasses",
    "JavaScript async/await patterns",
    "Database optimization techniques"
]
embeddings = embedder.embed_batch(texts)
store.add(embeddings, texts)
# Hybrid search - combines semantic + keyword
query_emb = embedder.embed("Python programming")
results = store.search(
    query_embedding=query_emb,
    query_text="Python",  # FTS keyword search
    k=3,
    alpha=0.7  # Override: more weight on vectors
)
for r in results:
    print(f"[{r.score:.3f}] {r.text}")
store.close()
embedder.close()
How Hybrid Search Works
- Vector Search: Finds semantically similar documents
- FTS5 Search: Finds documents with matching keywords
- Reciprocal Rank Fusion: Combines rankings from both methods
The alpha parameter controls the balance:
- alpha=1.0: Pure vector search
- alpha=0.5: Equal weight (default)
- alpha=0.0: Pure keyword search
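The fusion step can be sketched in plain Python. This is a minimal illustration of weighted reciprocal rank fusion, not cyllama's actual implementation. The function name `weighted_rrf`, the smoothing constant `k=60` (a value commonly used for RRF), and the exact way `alpha` weights the two rankings are all assumptions.

```python
from collections import defaultdict


def weighted_rrf(vector_ranked, keyword_ranked, alpha=0.5, k=60):
    """Fuse two ranked lists of document ids (best first).

    Each document receives weight / (k + rank) from every list it appears in,
    with weight alpha for the vector list and (1 - alpha) for the keyword list.
    """
    scores = defaultdict(float)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] += alpha / (k + rank)
    for rank, doc_id in enumerate(keyword_ranked, start=1):
        scores[doc_id] += (1.0 - alpha) / (k + rank)
    # Highest fused score first; ties broken by document id for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # ranking from semantic search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # ranking from keyword search

fused = weighted_rrf(vector_hits, keyword_hits, alpha=0.5)
for doc_id, score in fused:
    print(f"[{score:.4f}] {doc_id}")
```

Because only ranks are used, the two searches do not need comparable score scales. Under this formulation, documents that rank well in both lists (`doc_a`, `doc_c`) rise above documents found by only one method.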
When to Use Hybrid Search
Hybrid search is a good fit when:
- Exact keyword matches are important (product codes, names)
- Users search with specific technical terms
- Semantic search misses important lexical matches
Reranker
Cross-encoder reranking for improved result quality:
from cyllama.rag import Reranker, VectorStore, Embedder

# Initial retrieval (fast but less precise)
embedder = Embedder("models/bge-small.gguf")
store = VectorStore(dimension=embedder.dimension)
# ... add documents ...
# Get initial results (retrieve more than needed)
query = "machine learning algorithms"
query_emb = embedder.embed(query)
initial_results = store.search(query_emb, k=20)
# Rerank for precision (slower but more accurate)
reranker = Reranker("models/bge-reranker.gguf")
reranked = reranker.rerank(query, initial_results, top_k=5)
for r in reranked:
    print(f"[{r.score:.3f}] {r.text}")
reranker.close()
How Reranking Works
- Initial Retrieval: Fast bi-encoder similarity search
- Reranking: Slower cross-encoder scores each query-document pair
- Final Results: Reordered by cross-encoder scores
Cross-encoders are more accurate because they process the query and document together, but they are slower: each score depends on the query, so it cannot be precomputed at indexing time.
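The two-stage flow can be sketched with toy scoring functions. Nothing here uses real models: `encode`, `cheap_score`, and `pair_score` are hypothetical stand-ins for a bi-encoder, a similarity measure, and a cross-encoder, chosen only to show why a joint look at query and document can reorder the shortlist.

```python
def encode(text):
    """Stand-in for a bi-encoder: encodes one text in isolation as a word set."""
    return set(text.lower().split())


def cheap_score(query_vec, doc_vec):
    """Stage-1 score: Jaccard overlap of two independently computed encodings."""
    union = query_vec | doc_vec
    return len(query_vec & doc_vec) / len(union) if union else 0.0


def pair_score(query, doc):
    """Stand-in for a cross-encoder: sees both texts, so it can reward the
    query occurring as an exact phrase, which the word sets cannot capture."""
    overlap = cheap_score(encode(query), encode(doc))
    phrase_bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return overlap + phrase_bonus


def retrieve_then_rerank(query, docs, doc_vecs, k_initial, top_k):
    query_vec = encode(query)
    # Stage 1: rank every document using the precomputed encodings
    candidates = sorted(
        range(len(docs)),
        key=lambda i: cheap_score(query_vec, doc_vecs[i]),
        reverse=True,
    )[:k_initial]
    # Stage 2: rescore only the shortlisted candidates, one pair at a time
    reranked = sorted(
        candidates, key=lambda i: pair_score(query, docs[i]), reverse=True
    )
    return [docs[i] for i in reranked[:top_k]]


docs = [
    "learning machine repair",
    "an overview of machine learning algorithms for beginners",
    "cooking recipes",
]
doc_vecs = [encode(d) for d in docs]  # computed once, before any query arrives

top = retrieve_then_rerank("machine learning", docs, doc_vecs, k_initial=2, top_k=1)
print(top)
```

Stage 1 prefers the short first document because of its high word overlap. Stage 2 promotes the second document because the pair scorer sees the exact phrase. The expensive scorer runs only `k_initial` times per query rather than once per document in the corpus.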
Reranker Methods
# Score a single pair
score = reranker.score(query, document_text)
# Rerank a list of results
reranked = reranker.rerank(
    query="search query",
    results=initial_results,
    top_k=5  # Return top 5 after reranking
)
Complete Advanced Example
import asyncio

from cyllama import LLM
from cyllama.rag import (
    AsyncRAG,
    HybridStore,
    Embedder,
    RAG,
    RAGPipeline,
    RAGConfig,
    Reranker,
    create_rag_tool
)
from cyllama.agents import ReActAgent

# 1. Build a knowledge base with hybrid search
embedder = Embedder("models/bge-small.gguf")
with HybridStore(
    dimension=embedder.dimension,
    db_path="knowledge.db"
) as store:
    documents = [
        "Python 3.12 introduced new performance improvements.",
        "The typing module provides type hints for Python.",
        "FastAPI is a modern Python web framework.",
        "PyTorch is used for deep learning research.",
    ]
    embeddings = embedder.embed_batch(documents)
    store.add(embeddings, documents)
    # Hybrid search
    query = "Python performance"
    query_emb = embedder.embed(query)
    results = store.search(query_emb, query_text="Python", k=2)
    print("Hybrid search results:")
    for r in results:
        print(f"  [{r.score:.3f}] {r.text}")

# 2. Async RAG with agent tool
async def agent_demo():
    async with AsyncRAG(
        embedding_model="models/bge-small.gguf",
        generation_model="models/llama.gguf"
    ) as rag:
        await rag.add_texts([
            "Our API rate limit is 100 requests per minute.",
            "Authentication requires an API key in the header.",
            "The /users endpoint returns user data."
        ])
        # Create tool for agent (built here on a separate synchronous RAG instance)
        sync_rag = RAG(
            embedding_model="models/bge-small.gguf",
            generation_model="models/llama.gguf"
        )
        sync_rag.add_texts(["API docs..."])
        tool = create_rag_tool(sync_rag)
        llm = LLM("models/llama.gguf")
        agent = ReActAgent(llm=llm, tools=[tool])
        result = agent.run("What is the API rate limit?")
        print(f"\nAgent result: {result}")
        sync_rag.close()
        llm.close()

asyncio.run(agent_demo())
embedder.close()
Performance Tips
- Initial Retrieval: Retrieve 2-4x more documents than needed, then rerank
- Hybrid Search: Use when keyword matches matter
- Async: Use `AsyncRAG` in web applications to avoid blocking
- Quantization: Call `store.quantize()` for datasets >10k vectors
- Caching: Reuse embedder and store instances across queries
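To illustrate the async tip: a blocking call made directly inside a coroutine stalls every other task on the event loop. The sketch below uses `time.sleep` as a stand-in for a slow synchronous query (`slow_sync_query` is hypothetical) and shows that offloading it with the standard library's `asyncio.to_thread` (Python 3.9+) lets two requests overlap. It says nothing about how `AsyncRAG` works internally; it only demonstrates the problem that an async interface avoids.

```python
import asyncio
import time


def slow_sync_query(question: str) -> str:
    """Stand-in for a blocking call, such as a synchronous model query."""
    time.sleep(0.2)
    return f"answer to {question!r}"


async def blocking_handler(question: str) -> str:
    # Runs on the event loop thread: nothing else can progress meanwhile
    return slow_sync_query(question)


async def offloaded_handler(question: str) -> str:
    # Runs in a worker thread: the event loop stays free for other tasks
    return await asyncio.to_thread(slow_sync_query, question)


async def timed(handler):
    """Run two requests concurrently and measure the total wall time."""
    start = time.perf_counter()
    answers = await asyncio.gather(handler("q1"), handler("q2"))
    return answers, time.perf_counter() - start


async def main():
    _, blocking_time = await timed(blocking_handler)
    answers, offloaded_time = await timed(offloaded_handler)
    return answers, blocking_time, offloaded_time


answers, blocking_time, offloaded_time = asyncio.run(main())
print(f"blocking:  {blocking_time:.2f}s")
print(f"offloaded: {offloaded_time:.2f}s")
```

The blocking version serves the two requests one after the other, while the offloaded version lets their waiting periods overlap.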