# Context Caching and Resource Management
This document describes how cyllama manages LLM contexts and resources for optimal performance and memory efficiency.
## Overview
The LLM class manages the lifecycle of llama.cpp contexts, which hold the KV cache and other state needed for text generation. Starting with v0.1.14, contexts are intelligently cached and reused to avoid unnecessary allocation overhead.
## Context Lifecycle

### Automatic Context Reuse
When you call an LLM instance multiple times, the context is reused when possible:
```python
from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=100)

# First call creates a context
response1 = llm("Hello", config=config)

# Second call reuses the same context (KV cache is cleared)
response2 = llm("Hi there", config=config)

# Context is recreated only if a larger size is needed
large_config = GenerationConfig(max_tokens=1000)
response3 = llm("Tell me a story", config=large_config)
```
### When Contexts Are Recreated
A new context is created when:
- No context exists yet (first generation)
- The required context size exceeds the current context size
- After calling `reset_context()` explicitly
### KV Cache Clearing

When a context is reused, the KV cache is automatically cleared via `kv_cache_clear()`. This ensures each generation starts with a clean state while avoiding the overhead of context recreation.
## Resource Management

### Context Manager (Recommended)

The recommended way to use `LLM` is as a context manager:
```python
from cyllama import LLM, GenerationConfig

with LLM("models/llama.gguf") as llm:
    config = GenerationConfig(max_tokens=50)
    response = llm("What is Python?", config=config)
    print(response)
# Resources are automatically released here
```
### Explicit Cleanup

For more control, use the `close()` method:
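A minimal sketch (the `try`/`finally` pattern is just one way to guarantee the call even if generation raises):

```python
from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=50)
try:
    response = llm("What is Python?", config=config)
    print(response)
finally:
    llm.close()  # release the context and sampler resources
```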
### Automatic Cleanup

The `LLM` class implements `__del__` for automatic cleanup when the object is garbage collected. However, relying on garbage collection is not recommended for timely resource release.
## API Reference

### LLM Methods

#### close()
Release the context and sampler resources.
- Safe to call multiple times
- The model remains loaded for potential reuse
- After `close()`, the instance can still be used (a new context will be created), as sketched below
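A short sketch of that reuse-after-close behavior (model path as in the earlier examples):

```python
from cyllama import LLM

llm = LLM("models/llama.gguf")
llm("First prompt")   # context created on first use
llm.close()           # context and sampler released; the model stays loaded
llm("Second prompt")  # still usable: a new context is created on demand
llm.close()
```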
#### reset_context()
Force recreation of the context on the next generation.
```python
llm = LLM("model.gguf")

llm("First conversation")
llm.reset_context()  # Clear all state
llm("New conversation")  # Fresh context created
```
Use this when you want to ensure a completely fresh start without any cached state.
### LlamaContext Methods

#### kv_cache_clear(clear_data=True)
Clear the KV cache without recreating the context.
```python
from cyllama.llama.llama_cpp import LlamaContext, LlamaModel, LlamaContextParams

# params and ctx_params are model/context parameter objects configured elsewhere
model = LlamaModel("model.gguf", params)
ctx = LlamaContext(model, ctx_params)

# ... use context for generation ...
ctx.kv_cache_clear()  # Clear KV cache for reuse
# ... use context for new generation ...
```
Parameters:
- `clear_data` (bool): If `True` (default), also clear the data buffers. If `False`, only clear metadata.
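For instance, continuing the example above, to reset the cache bookkeeping while leaving the data buffers untouched:

```python
ctx.kv_cache_clear(clear_data=False)  # reset cache metadata only; keep data buffers
```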
## Performance Considerations

### Benefits of Context Reuse
- Reduced allocation overhead: Creating a new context involves GPU memory allocation, which can be slow
- Consistent memory usage: Reusing contexts prevents memory fragmentation
- Faster subsequent generations: Only the KV cache needs to be cleared, not the entire context (see the timing sketch below)
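A rough way to observe the difference (a sketch; actual timings depend on the model, hardware, and backend):

```python
import time

from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=50)

start = time.perf_counter()
llm("Warm-up prompt", config=config)  # first call: a context is allocated
first = time.perf_counter() - start

start = time.perf_counter()
llm("Second prompt", config=config)  # reused context: only the KV cache is cleared
second = time.perf_counter() - start

print(f"first call: {first:.2f}s, reused context: {second:.2f}s")
llm.close()
```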
### When to Force Recreation

Use `reset_context()` when:
- Starting a completely new conversation with no relation to previous ones
- Debugging generation issues
- Switching between very different prompt lengths (though automatic recreation handles this)
### Memory Management Tips
- Use context managers for automatic cleanup
- Call `close()` when done in long-running applications
- Monitor memory with tools like `nvidia-smi` for GPU memory
- Set an appropriate `n_ctx` in `GenerationConfig` to avoid oversized contexts (see the sketch below)
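A minimal sketch of the last tip (assuming `n_ctx` is passed as a keyword to `GenerationConfig`, per the tip above):

```python
from cyllama import LLM, GenerationConfig

# Size the context to what the workload actually needs rather than an oversized default
config = GenerationConfig(max_tokens=128, n_ctx=2048)

with LLM("models/llama.gguf") as llm:
    print(llm("Summarize context caching in one sentence.", config=config))
```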
## Example: Long-Running Application
```python
from cyllama import LLM, GenerationConfig

def serve_requests(model_path: str):
    """Example of efficient context reuse in a server."""
    with LLM(model_path) as llm:
        config = GenerationConfig(max_tokens=200)
        while True:
            prompt = get_next_request()  # Your request handling
            if prompt is None:
                break
            # Context is reused efficiently across requests
            response = llm(prompt, config=config)
            send_response(response)
    # Resources automatically cleaned up
```
## Example: Multiple Independent Conversations
```python
from cyllama import LLM, GenerationConfig

llm = LLM("models/llama.gguf")
config = GenerationConfig(max_tokens=100)

# Conversation 1
response1 = llm("What is the capital of France?", config=config)
print(f"Conv 1: {response1}")

# Force fresh context for completely independent conversation
llm.reset_context()

# Conversation 2 (no KV cache contamination from Conv 1)
response2 = llm("Explain quantum computing", config=config)
print(f"Conv 2: {response2}")

llm.close()
```
## Comparison with Previous Behavior
| Aspect | Before v0.1.14 | v0.1.14+ |
|---|---|---|
| Context per generation | New context created | Reused when size permits |
| KV cache management | Discarded with context | Cleared via kv_cache_clear() |
| Resource cleanup | Implicit (GC) | Explicit close() + context manager |
| Memory efficiency | Lower | Higher |
| Generation latency | Higher (allocation) | Lower (reuse) |
## Troubleshooting

### Memory Not Being Released
If GPU memory isn't released after generation:
```python
# Ensure explicit cleanup
llm.close()

# Or use context manager
with LLM("model.gguf") as llm:
    ...  # use llm
# Memory released here
```
### Unexpected Behavior Between Generations
If generations seem to be affected by previous ones:
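Force a fresh context so no cached state from earlier generations can carry over (a minimal sketch, assuming `llm` and `config` from the earlier examples):

```python
# Rule out carried-over state before generating again
llm.reset_context()
response = llm("New, unrelated prompt", config=config)
```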
### Context Size Errors
If you get context size errors:
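One common fix is to request a larger context explicitly (a sketch, assuming the prompt plus `max_tokens` must fit within `n_ctx`, and `llm` from the earlier examples; `long_prompt` is a placeholder for your own input):

```python
# Request a context large enough for the prompt plus the generated tokens
config = GenerationConfig(max_tokens=512, n_ctx=4096)
response = llm(long_prompt, config=config)  # long_prompt: your own (long) input text
```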