
Embedder

The Embedder class generates vector embeddings from text using llama.cpp embedding models in GGUF format.

Basic Usage

from cyllama.rag import Embedder

# Initialize with an embedding model
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")

# Embed a single text
embedding = embedder.embed("What is machine learning?")
print(f"Dimension: {len(embedding)}")  # e.g., 384

# Embed multiple texts efficiently
texts = [
    "Python is a programming language.",
    "Machine learning uses neural networks.",
    "Data science involves statistics."
]
embeddings = embedder.embed_batch(texts)
print(f"Generated {len(embeddings)} embeddings")

# Clean up
embedder.close()

Constructor Options

embedder = Embedder(
    model_path="models/bge-small.gguf",
    n_ctx=512,           # Context size (match model training)
    n_batch=512,         # Batch size for processing
    n_gpu_layers=-1,     # GPU layers (-1 = all)
    pooling="mean",      # Pooling strategy
    normalize=True       # L2 normalize embeddings
)

Pooling Strategies

Strategy   Description
---------  -----------------------------------------
mean       Average of all token embeddings (default)
cls        Use the first token embedding (CLS token)
last       Use the last token embedding
none       Return all token embeddings (no pooling)

from cyllama.rag import Embedder, PoolingType

# Using enum
embedder = Embedder(
    "model.gguf",
    pooling=PoolingType.CLS
)

# Or string
embedder = Embedder(
    "model.gguf",
    pooling="cls"
)
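Pooling decides how the model's per-token vectors collapse into a single vector for the whole text. A toy illustration of the three single-vector strategies on a 3-token, 4-dimensional matrix (plain Python, independent of any model):

```python
# Toy per-token embeddings: 3 tokens, 4 dimensions each
tokens = [
    [1.0, 0.0, 2.0, 0.0],   # first token (CLS position)
    [0.0, 2.0, 2.0, 1.0],
    [2.0, 4.0, 2.0, 2.0],   # last token
]

# mean: average each dimension across all tokens
mean_pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
print(mean_pooled)   # [1.0, 2.0, 2.0, 1.0]

# cls: take the first token's vector
cls_pooled = tokens[0]

# last: take the final token's vector
last_pooled = tokens[-1]
```

With pooling="none" no such reduction happens and you get the full token matrix back.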

Methods

embed()

Embed a single text string:

embedding = embedder.embed("Your text here")
# Returns: list[float] of dimension n_embd

embed_batch()

Embed multiple texts efficiently:

embeddings = embedder.embed_batch([
    "First document",
    "Second document",
    "Third document"
])
# Returns: list[list[float]]

embed_documents()

Embed documents with optional progress tracking:

embeddings = embedder.embed_documents(
    ["doc1", "doc2", "doc3"],
    show_progress=True  # Display progress bar
)

embed_with_info()

Get embedding with additional metadata:

result = embedder.embed_with_info("Your text here")
print(f"Embedding: {result.embedding[:5]}...")
print(f"Token count: {result.token_count}")
print(f"Truncated: {result.truncated}")
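The result object behaves like a small record with the three fields used above. A sketch of an equivalent structure (field names taken from the example; the actual class in cyllama.rag may differ):

```python
from dataclasses import dataclass

@dataclass
class EmbeddingInfo:
    embedding: list[float]   # the pooled vector
    token_count: int         # tokens consumed from the input
    truncated: bool          # True if the input exceeded n_ctx and was cut

# Hypothetical result for a short input
result = EmbeddingInfo(embedding=[0.1, 0.2, 0.3], token_count=5, truncated=False)
print(result.token_count)  # 5
```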

embed_iter()

Generator for memory-efficient batch embedding:

for text, embedding in zip(large_text_list, embedder.embed_iter(large_text_list, batch_size=32)):
    # Process each embedding as it is produced
    store.add_one(embedding, text)
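Memory-efficient iteration amounts to slicing the input into fixed-size batches and processing one slice at a time. A minimal pure-Python sketch of that pattern (not the actual embed_iter implementation):

```python
def iter_batches(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Ten items in batches of four: two full batches plus a remainder
batches = list(iter_batches(list(range(10)), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Only one batch of embeddings needs to be in memory at a time, which is what makes this suitable for large datasets.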

Properties

# Get embedding dimension
print(f"Dimension: {embedder.dimension}")  # e.g., 384

# Check if normalized
print(f"Normalized: {embedder.normalize}")

Context Manager

Use context manager for automatic cleanup:

from cyllama.rag import Embedder

with Embedder("models/bge-small.gguf") as embedder:
    embeddings = embedder.embed_batch(texts)
# Resources automatically released
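The with-block is equivalent to an explicit try/finally around close(). A sketch of the underlying context-manager protocol using a stand-in class (hypothetical, not the real Embedder):

```python
class FakeEmbedder:
    """Stand-in that records whether close() was called."""
    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Called on block exit, even if an exception was raised
        self.close()

    def close(self):
        self.closed = True

with FakeEmbedder() as e:
    pass

print(e.closed)  # True
```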

Normalization

By default, embeddings are L2-normalized (unit vectors). This is important for cosine similarity:

import math

embedder = Embedder("model.gguf", normalize=True)
embedding = embedder.embed("test")

# Verify normalization
norm = math.sqrt(sum(x*x for x in embedding))
print(f"Norm: {norm}")  # Should be ~1.0
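For unit vectors, cosine similarity reduces to a plain dot product, which is why normalization is the default. A quick check with hand-normalized vectors:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = l2_normalize([3.0, 4.0])   # [0.6, 0.8]
b = l2_normalize([4.0, 3.0])   # [0.8, 0.6]

dot = sum(x * y for x, y in zip(a, b))
print(round(dot, 4))  # 0.96, the cosine similarity of the original vectors
```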

To disable normalization:

embedder = Embedder("model.gguf", normalize=False)

Example: Semantic Search

from cyllama.rag import Embedder, VectorStore

# Initialize
embedder = Embedder("models/bge-small.gguf")

# Documents to index
documents = [
    "Python is a versatile programming language.",
    "JavaScript runs in web browsers.",
    "Rust provides memory safety without garbage collection.",
    "Go was designed for concurrent programming.",
]

# Generate embeddings and store
embeddings = embedder.embed_batch(documents)

with VectorStore(dimension=embedder.dimension) as store:
    store.add(embeddings, documents)

    # Search
    query = "Which language is good for web development?"
    query_embedding = embedder.embed(query)

    results = store.search(query_embedding, k=2)
    for result in results:
        print(f"[{result.score:.3f}] {result.text}")

embedder.close()
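A vector search like store.search() above boils down to scoring the query against every stored vector and keeping the k best matches. A minimal brute-force sketch (not the actual VectorStore implementation; assumes unit-normalized vectors so the score is a dot product):

```python
def top_k(query, vectors, texts, k=2):
    """Score each stored vector by dot product and return the k best."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), text)
        for vec, text in zip(vectors, texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
texts = ["x-axis", "y-axis", "diagonal"]

results = top_k([0.0, 1.0], vectors, texts, k=2)
print(results)  # [(1.0, 'y-axis'), (0.8, 'diagonal')]
```

Real stores typically replace the linear scan with an index, but the scoring logic is the same.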

Performance Tips

  1. Batch Processing: Use embed_batch() instead of multiple embed() calls
  2. GPU Acceleration: Set n_gpu_layers=-1 to use all GPU layers
  3. Context Size: Match n_ctx to your model's training context
  4. Memory Efficiency: Use embed_iter() for large datasets