Embedder¶
The Embedder class generates vector embeddings from text using llama.cpp embedding models in GGUF format.
Basic Usage¶
```python
from cyllama.rag import Embedder

# Initialize with an embedding model
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")

# Embed a single text
embedding = embedder.embed("What is machine learning?")
print(f"Dimension: {len(embedding)}")  # e.g., 384

# Embed multiple texts efficiently
texts = [
    "Python is a programming language.",
    "Machine learning uses neural networks.",
    "Data science involves statistics.",
]
embeddings = embedder.embed_batch(texts)
print(f"Generated {len(embeddings)} embeddings")

# Clean up
embedder.close()
```
Constructor Options¶
```python
embedder = Embedder(
    model_path="models/bge-small.gguf",
    n_ctx=512,        # Context size (match model training)
    n_batch=512,      # Batch size for processing
    n_gpu_layers=-1,  # GPU layers (-1 = all)
    pooling="mean",   # Pooling strategy
    normalize=True,   # L2-normalize embeddings
)
```
Pooling Strategies¶
| Strategy | Description |
|---|---|
| `mean` | Average all token embeddings (default) |
| `cls` | Use first token embedding (CLS token) |
| `last` | Use last token embedding |
| `none` | Return all token embeddings |
```python
from cyllama.rag import Embedder, PoolingType

# Using the enum
embedder = Embedder(
    "model.gguf",
    pooling=PoolingType.CLS,
)

# Or a string
embedder = Embedder(
    "model.gguf",
    pooling="cls",
)
```
Methods¶
embed()¶
Embed a single text string, returning one vector as `list[float]`. See the Basic Usage example above.
embed_batch()¶
Embed multiple texts efficiently:
```python
embeddings = embedder.embed_batch([
    "First document",
    "Second document",
    "Third document",
])
# Returns: list[list[float]]
```
embed_documents()¶
Embed documents with optional progress tracking:
```python
embeddings = embedder.embed_documents(
    ["doc1", "doc2", "doc3"],
    show_progress=True,  # Display progress bar
)
```
embed_with_info()¶
Get embedding with additional metadata:
```python
result = embedder.embed_with_info("Your text here")
print(f"Embedding: {result.embedding[:5]}...")
print(f"Token count: {result.token_count}")
print(f"Truncated: {result.truncated}")
```
embed_iter()¶
Generator for memory-efficient batch embedding:
```python
# Pair each text with its embedding as it is produced
for text, embedding in zip(large_text_list, embedder.embed_iter(large_text_list, batch_size=32)):
    store.add_one(embedding, text)
```
Properties¶
```python
# Get embedding dimension
print(f"Dimension: {embedder.dimension}")  # e.g., 384

# Check if normalized
print(f"Normalized: {embedder.normalize}")
```
Context Manager¶
Use context manager for automatic cleanup:
```python
from cyllama.rag import Embedder

with Embedder("models/bge-small.gguf") as embedder:
    embeddings = embedder.embed_batch(texts)
# Resources automatically released
```
Normalization¶
By default, embeddings are L2-normalized (unit vectors). This is important for cosine similarity:
```python
import math

embedder = Embedder("model.gguf", normalize=True)
embedding = embedder.embed("test")

# Verify normalization
norm = math.sqrt(sum(x * x for x in embedding))
print(f"Norm: {norm}")  # Should be ~1.0
```
To disable normalization, pass `normalize=False` to the constructor.
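If you disable it and later need unit vectors for cosine similarity, you can normalize manually. A minimal L2-normalization helper, independent of cyllama:

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (returned unchanged if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

unit = l2_normalize([3.0, 4.0])  # → [0.6, 0.8], which has norm 1.0
```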
Example: Semantic Search¶
```python
from cyllama.rag import Embedder, VectorStore

# Initialize
embedder = Embedder("models/bge-small.gguf")

# Documents to index
documents = [
    "Python is a versatile programming language.",
    "JavaScript runs in web browsers.",
    "Rust provides memory safety without garbage collection.",
    "Go was designed for concurrent programming.",
]

# Generate embeddings and store
embeddings = embedder.embed_batch(documents)
with VectorStore(dimension=embedder.dimension) as store:
    store.add(embeddings, documents)

    # Search
    query = "Which language is good for web development?"
    query_embedding = embedder.embed(query)
    results = store.search(query_embedding, k=2)
    for result in results:
        print(f"[{result.score:.3f}] {result.text}")

embedder.close()
```
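The search above ranks documents by the similarity between query and document vectors. With L2-normalized embeddings, cosine similarity reduces to a plain dot product; here is a self-contained sketch of the scoring (an illustration, not `VectorStore`'s actual implementation):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # → 1.0 (same direction)
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # → 0.0 (orthogonal)
```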
Performance Tips¶
- **Batch Processing**: Use `embed_batch()` instead of multiple `embed()` calls
- **GPU Acceleration**: Set `n_gpu_layers=-1` to use all GPU layers
- **Context Size**: Match `n_ctx` to your model's training context
- **Memory Efficiency**: Use `embed_iter()` for large datasets
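The memory savings of `embed_iter()` come from processing `batch_size` texts at a time instead of materializing every embedding at once. The idea can be pictured with a simple chunking generator (`batched` is a hypothetical helper, not part of cyllama):

```python
def batched(items: list[str], batch_size: int):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

list(batched(["a", "b", "c", "d", "e"], 2))  # → [["a", "b"], ["c", "d"], ["e"]]
```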