Skip to content

Text Processing

cyllama provides utilities for splitting documents into chunks and loading various file formats.

Text Splitters

TextSplitter

Recursive character text splitting with configurable chunk size and overlap:

from cyllama.rag import TextSplitter

splitter = TextSplitter(
    chunk_size=512,      # Target chunk size in characters
    chunk_overlap=50,    # Overlap between chunks
    separators=None,     # Use default separators
    keep_separator=True  # Keep separators in output
)

text = """
This is a long document that needs to be split into smaller chunks.
Each chunk will be embedded separately and stored in the vector database.

The splitter tries to maintain semantic coherence by splitting on
natural boundaries like paragraphs and sentences.
"""

chunks = splitter.split(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")

Default Separators

The splitter uses a hierarchy of separators:

  1. \n\n - Paragraph breaks
  2. \n - Line breaks
  3. . - Sentences
  4. ! - Exclamations
  5. ? - Questions
  6. ; - Semicolons
  7. , - Commas
  8. `` - Words
  9. `` - Characters

TokenTextSplitter

Split based on token count using a custom tokenizer:

from cyllama.rag import TokenTextSplitter

# Define tokenizer function
def my_tokenizer(text):
    return text.split()  # Simple word tokenizer

splitter = TokenTextSplitter(
    chunk_size=100,           # Tokens per chunk
    chunk_overlap=20,         # Token overlap
    tokenizer=my_tokenizer
)

chunks = splitter.split(long_text)

MarkdownSplitter

Markdown-aware splitting that respects document structure:

from cyllama.rag import MarkdownSplitter

splitter = MarkdownSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

markdown_text = """
# Introduction

This is the introduction section.

## Getting Started

Here's how to get started...

```python
code_block = "preserved"
  • List item 1
  • List item 2 """

chunks = splitter.split(markdown_text)

Headers, code blocks, and lists are preserved where possible

The MarkdownSplitter:

- Preserves code blocks (```)
- Respects header hierarchy (#, ##, ###)
- Keeps list items together
- Maintains frontmatter

## Document Loaders

### TextLoader

Load plain text files:

```python
from cyllama.rag import TextLoader

loader = TextLoader()
documents = loader.load("document.txt")

for doc in documents:
    print(f"Text: {doc.text[:100]}...")
    print(f"Metadata: {doc.metadata}")

MarkdownLoader

Load Markdown files with optional frontmatter parsing:

from cyllama.rag import MarkdownLoader

loader = MarkdownLoader(parse_frontmatter=True)
documents = loader.load("README.md")

# Frontmatter becomes metadata
print(documents[0].metadata)
# {'title': 'My Doc', 'author': 'John', ...}

JSONLoader

Load JSON files with configurable text extraction:

from cyllama.rag import JSONLoader

# Simple usage - extract 'content' field
loader = JSONLoader(text_key="content")
docs = loader.load("data.json")

# With jq-like filtering for nested data
loader = JSONLoader(
    text_key="text",
    jq_filter=".articles[].body"
)
docs = loader.load("nested.json")

JSONLLoader

Load JSON Lines files with lazy loading:

from cyllama.rag import JSONLLoader

# Regular loading
loader = JSONLLoader(text_key="content")
docs = loader.load("data.jsonl")

# Lazy loading for large files
for doc in loader.load_lazy("large.jsonl"):
    # Process one at a time
    process(doc)

PDFLoader

Load PDF files (requires docling package):

# Install: pip install cyllama[pdf]

from cyllama.rag import PDFLoader

loader = PDFLoader()
documents = loader.load("document.pdf")

for doc in documents:
    print(f"Text: {doc.text[:200]}...")
    print(f"Source: {doc.metadata['source']}")

DirectoryLoader

Batch load files from directories:

from cyllama.rag import DirectoryLoader

loader = DirectoryLoader(glob="**/*.md")  # Pattern to match

documents = loader.load("docs/")
print(f"Loaded {len(documents)} documents")

Convenience Functions

load_document()

Auto-detect format and load:

from cyllama.rag import load_document

# Automatically uses correct loader
docs = load_document("file.md")
docs = load_document("data.json", text_key="content")
docs = load_document("report.pdf")  # Requires docling

load_directory()

Load all matching files:

from cyllama.rag import load_directory

docs = load_directory("docs/", glob="**/*.txt")

Complete Example

from cyllama.rag import (
    TextSplitter,
    DirectoryLoader,
    Embedder,
    VectorStore
)

# Load documents
loader = DirectoryLoader(glob="**/*.md")
documents = loader.load("knowledge_base/")

# Split into chunks
splitter = TextSplitter(chunk_size=512, chunk_overlap=50)
chunks = []
for doc in documents:
    doc_chunks = splitter.split(doc.text)
    chunks.extend(doc_chunks)

print(f"Created {len(chunks)} chunks from {len(documents)} documents")

# Embed and store
embedder = Embedder("models/bge-small.gguf")

with VectorStore(dimension=embedder.dimension, db_path="kb.db") as store:
    embeddings = embedder.embed_batch(chunks)
    store.add(embeddings, chunks)
    print(f"Indexed {len(store)} chunks")

embedder.close()

Data Types

Document

from cyllama.rag import Document

doc = Document(
    text="Document content here",
    metadata={"source": "file.txt", "page": 1}
)

Chunk

from cyllama.rag import Chunk

chunk = Chunk(
    text="Chunk content",
    metadata={"source": "file.txt"},
    start=0,      # Start position in original
    end=100       # End position in original
)

Best Practices

  1. Chunk Size: 256-1024 characters works well for most use cases
  2. Overlap: 10-20% of chunk size helps maintain context
  3. Markdown: Use MarkdownSplitter for structured documents
  4. Large Files: Use lazy loading with JSONLLoader.load_lazy()
  5. Metadata: Preserve source information for citation