# Context Caching and Resource Management
This document describes how cyllama manages LLM contexts and resources for optimal performance and memory efficiency.
## Overview
The LLM class manages the lifecycle of llama.cpp contexts, which hold the KV cache and other state needed for text generation. Starting with v0.1.14, contexts are intelligently cached and reused to avoid unnecessary allocation overhead.
## Context Lifecycle

### Automatic Context Reuse
When you call an LLM instance multiple times, the context is reused when possible:
```python
from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=100)

# First call creates a context
response1 = llm("Hello", config=config)

# Second call reuses the same context (KV cache is cleared)
response2 = llm("Hi there", config=config)

# Context is recreated only if a larger size is needed
large_config = GenerationConfig(max_tokens=1000)
response3 = llm("Tell me a story", config=large_config)
```
### When Contexts Are Recreated
A new context is created when:
- No context exists yet (first generation)
- The required context size exceeds the current context size
- After calling `reset_context()` explicitly
### KV Cache Clearing

When a context is reused, the KV cache is automatically cleared via `kv_cache_clear()`. This ensures each generation starts with a clean state while avoiding the overhead of context recreation.
## Resource Management

### Context Manager (Recommended)

The recommended way to use `LLM` is as a context manager:
```python
from cyllama import LLM, GenerationConfig

with LLM("models/llama.gguf") as llm:
    config = GenerationConfig(max_tokens=50)
    response = llm("What is Python?", config=config)
    print(response)
# Resources are automatically released here
```
### Explicit Cleanup

For more control, use the `close()` method:
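A minimal sketch (the `try`/`finally` pattern is just one way to guarantee the call even if generation raises):

```python
from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=50)
try:
    response = llm("What is Python?", config=config)
    print(response)
finally:
    llm.close()  # release the context and sampler resources
```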
### Automatic Cleanup

The `LLM` class implements `__del__` for automatic cleanup when the object is garbage collected. However, relying on garbage collection is not recommended for timely resource release.
## API Reference

### LLM Methods

#### close()
Release the context and sampler resources.
- Safe to call multiple times
- The model remains loaded for potential reuse
- After `close()`, the instance can still be used (a new context will be created), as sketched below
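A short sketch of that reuse-after-close behavior (model path as in the earlier examples):

```python
from cyllama import LLM

llm = LLM("models/llama.gguf")
llm("First prompt")   # context created on first use
llm.close()           # context and sampler released; the model stays loaded
llm("Second prompt")  # still usable: a new context is created on demand
llm.close()
```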
#### reset_context()
Force recreation of the context on the next generation.
```python
llm = LLM("model.gguf")

llm("First conversation")
llm.reset_context()  # Clear all state
llm("New conversation")  # Fresh context created
```
Use this when you want to ensure a completely fresh start without any cached state.
### LlamaContext Methods

#### kv_cache_clear(clear_data=True)
Clear the KV cache without recreating the context.
```python
from cyllama.llama.llama_cpp import LlamaContext, LlamaModel, LlamaContextParams

# params and ctx_params are model/context parameter objects configured elsewhere
model = LlamaModel("model.gguf", params)
ctx = LlamaContext(model, ctx_params)

# ... use context for generation ...
ctx.kv_cache_clear()  # Clear KV cache for reuse
# ... use context for new generation ...
```
Parameters:
- `clear_data` (bool): If `True` (default), also clear the data buffers. If `False`, only clear metadata.
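For instance, continuing the example above, to reset the cache bookkeeping while leaving the data buffers untouched:

```python
ctx.kv_cache_clear(clear_data=False)  # reset cache metadata only; keep data buffers
```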
## Performance Considerations

### Benefits of Context Reuse
- Reduced allocation overhead: Creating a new context involves GPU memory allocation, which can be slow
- Consistent memory usage: Reusing contexts prevents memory fragmentation
- Faster subsequent generations: Only the KV cache needs to be cleared, not the entire context (see the timing sketch below)
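A rough way to observe the difference (a sketch; actual timings depend on the model, hardware, and backend):

```python
import time

from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=50)

start = time.perf_counter()
llm("Warm-up prompt", config=config)  # first call: a context is allocated
first = time.perf_counter() - start

start = time.perf_counter()
llm("Second prompt", config=config)  # reused context: only the KV cache is cleared
second = time.perf_counter() - start

print(f"first call: {first:.2f}s, reused context: {second:.2f}s")
llm.close()
```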
### When to Force Recreation

Use `reset_context()` when:
- Starting a completely new conversation with no relation to previous ones
- Debugging generation issues
- Switching between very different prompt lengths (though automatic recreation handles this)
### Memory Management Tips
- Use context managers for automatic cleanup
- Call `close()` when done in long-running applications
- Monitor memory with tools like `nvidia-smi` for GPU memory
- Set an appropriate `n_ctx` in `GenerationConfig` to avoid oversized contexts (see the sketch below)
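A minimal sketch of the last tip (assuming `n_ctx` is passed as a keyword to `GenerationConfig`, per the tip above):

```python
from cyllama import LLM, GenerationConfig

# Size the context to what the workload actually needs rather than an oversized default
config = GenerationConfig(max_tokens=128, n_ctx=2048)

with LLM("models/llama.gguf") as llm:
    print(llm("Summarize context caching in one sentence.", config=config))
```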
## Example: Long-Running Application
```python
from cyllama import LLM, GenerationConfig

def serve_requests(model_path: str):
    """Example of efficient context reuse in a server."""
    with LLM(model_path) as llm:
        config = GenerationConfig(max_tokens=200)
        while True:
            prompt = get_next_request()  # Your request handling
            if prompt is None:
                break
            # Context is reused efficiently across requests
            response = llm(prompt, config=config)
            send_response(response)
    # Resources automatically cleaned up
```
## Example: Multiple Independent Conversations
```python
from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=100)

# Conversation 1
response1 = llm("What is the capital of France?", config=config)
print(f"Conv 1: {response1}")

# Force fresh context for completely independent conversation
llm.reset_context()

# Conversation 2 (no KV cache contamination from Conv 1)
response2 = llm("Explain quantum computing", config=config)
print(f"Conv 2: {response2}")

llm.close()
```
## Comparison with Previous Behavior
| Aspect | Before v0.1.14 | v0.1.14+ |
|---|---|---|
| Context per generation | New context created | Reused when size permits |
| KV cache management | Discarded with context | Cleared via kv_cache_clear() |
| Resource cleanup | Implicit (GC) | Explicit close() + context manager |
| Memory efficiency | Lower | Higher |
| Generation latency | Higher (allocation) | Lower (reuse) |
## Troubleshooting

### Memory Not Being Released
If GPU memory isn't released after generation:
```python
# Ensure explicit cleanup
llm.close()

# Or use context manager
with LLM("model.gguf") as llm:
    ...  # use llm
# Memory released here
```
### Unexpected Behavior Between Generations
If generations seem to be affected by previous ones:
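Force a fresh context so no cached state from earlier generations can carry over (a minimal sketch, assuming `llm` and `config` from the earlier examples):

```python
# Rule out carried-over state before generating again
llm.reset_context()
response = llm("New, unrelated prompt", config=config)
```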
### Context Size Errors
If you get context size errors:
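One common fix is to request a larger context explicitly (a sketch, assuming the prompt plus `max_tokens` must fit within `n_ctx`, and `llm` from the earlier examples; `long_prompt` is a placeholder for your own input):

```python
# Request a context large enough for the prompt plus the generated tokens
config = GenerationConfig(max_tokens=512, n_ctx=4096)
response = llm(long_prompt, config=config)  # long_prompt: your own (long) input text
```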