Cancelling generation¶
LLM supports thread-safe cancellation of an in-flight generation at two
layers:
- Between tokens: a threading.Event polled in the per-token loop. Sub-millisecond latency in steady-state generation.
- Mid-decode: a nogil ggml_abort_callback reads a C-level flag and aborts the in-progress llama_decode from inside ggml's compute graph. This is what makes cancellation responsive during long prompt prefill, where a single decode call may run for seconds.
Both layers are wired by a single call: llm.cancel().
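The between-token layer can be pictured with the standard library alone. The sketch below is not cyllama's implementation: fake_generate and its sleep-based token delay are hypothetical stand-ins for the real per-token loop. Only the pattern comes from the description above: one Event check per token, with the flag set from another thread.

```python
import threading
import time


def fake_generate(n_tokens, cancel_event, token_delay=0.01):
    """Yield up to n_tokens fake tokens, polling cancel_event once per token.

    Hypothetical stand-in for a real per-token loop; time.sleep simulates
    the cost of decoding one token.
    """
    for i in range(n_tokens):
        if cancel_event.is_set():  # the between-token check
            return
        time.sleep(token_delay)
        yield f"tok{i} "


cancel_event = threading.Event()

# Request cancellation from another thread after roughly 50 ms.
timer = threading.Timer(0.05, cancel_event.set)
timer.start()

tokens = list(fake_generate(1000, cancel_event))
timer.join()

print(f"stopped after {len(tokens)} of 1000 tokens")
```

In this model the worst-case cancellation latency is one token's decode time, which is why a second, mid-decode layer is needed when a single decode call can run for seconds.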
What "abort" means¶
ggml_abort_callback is cooperative: when it returns non-zero, ggml stops
scheduling further ops in the current graph and llama_decode returns
early. The process is not killed. Control returns to Python normally,
the partially-produced tokens are yielded, and the LLM object remains
reusable for the next call. Only the in-progress batch is discarded.
The cancel flag auto-clears at the start of each generation, so a stale
cancel() does not leak into the next call.
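These semantics can be modelled in a few lines. ToyEngine below is a hypothetical pure-Python model, not cyllama or ggml code: the inner loop stands in for ggml scheduling ops, the Event check plays the role of the abort callback, and the cancel_at parameter exists only to simulate cancel() arriving mid-decode.

```python
import threading


class ToyEngine:
    """Hypothetical model of cooperative abort; not cyllama's implementation."""

    def __init__(self, ops_per_decode=4):
        self._cancel = threading.Event()
        self.ops_per_decode = ops_per_decode

    def cancel(self):
        self._cancel.set()

    def _decode(self, step, cancel_at=None):
        """Run one 'graph' of ops; return False if the abort check fired."""
        for op in range(self.ops_per_decode):
            if cancel_at == (step, op):
                self.cancel()          # simulate cancel() arriving mid-decode
            if self._cancel.is_set():  # plays the role of the abort callback
                return False           # stop scheduling further ops
        return True

    def generate(self, n_tokens, cancel_at=None):
        self._cancel.clear()           # auto-clear: a stale cancel() is dropped
        for step in range(n_tokens):
            if not self._decode(step, cancel_at):
                return                 # only the in-progress batch is discarded
            yield f"tok{step}"


engine = ToyEngine()

# Cancel arrives during the third decode (step 2, second op).
partial = list(engine.generate(10, cancel_at=(2, 1)))
print(partial)  # ['tok0', 'tok1']

# A stale cancel() issued between calls does not affect the next generation.
engine.cancel()
full = list(engine.generate(3))
print(full)  # ['tok0', 'tok1', 'tok2']
```

The tokens from completed decodes survive, the aborted batch produces nothing, and the same object serves the next call.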
API¶
- LLM.cancel(): request cancellation. Safe from any thread.
- LLM.cancel_requested: read-only bool property.
- LLM.install_sigint_handler(): opt-in Ctrl-C handler. Returns a context manager / handle with .restore().
- LlamaContext.cancel: read/write bool mirror of the C-level flag, for direct lower-level use.
Examples¶
1. Cancel from another thread¶
import threading
from cyllama import LLM, GenerationConfig
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
config = GenerationConfig(max_tokens=512, temperature=0.0)
threading.Timer(0.1, llm.cancel).start()
chunks = list(llm("Write a long essay about cats.", config=config, stream=True))
print(f"got {len(''.join(chunks))} chars before cancel")
# The LLM is still usable.
followup = llm("Say hi.", config=GenerationConfig(max_tokens=10))
print(followup)
2. Ctrl-C handler — interrupts even mid-prefill¶
from cyllama import LLM, GenerationConfig
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
huge_prompt = "..." * 10_000 # forces a long prefill
with llm.install_sigint_handler():
    for chunk in llm(huge_prompt, config=GenerationConfig(max_tokens=200), stream=True):
        print(chunk, end="", flush=True)
# After Ctrl-C: prior SIGINT handler is restored, llm still usable.
print("\n-- back to normal --")
print(llm("ok?", config=GenerationConfig(max_tokens=5)))
install_sigint_handler() is opt-in by design; cyllama does not touch
signal handlers otherwise. The previous handler is saved and restored
on .restore() / __exit__, so it composes with Click, Jupyter,
asyncio, etc. It must be called from the main thread, because Python
only allows signal.signal() to be called there.
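The save-and-restore behaviour described above can be sketched with the standard signal module. sigint_calls is a hypothetical helper written for illustration; it is not cyllama's install_sigint_handler(), and a threading.Event stands in for the cancel flag.

```python
import signal
import threading
from contextlib import contextmanager


@contextmanager
def sigint_calls(on_interrupt):
    """Route SIGINT to on_interrupt() for the duration of the with-block.

    Hypothetical helper illustrating the save/restore pattern; it is not
    cyllama's install_sigint_handler().
    """
    if threading.current_thread() is not threading.main_thread():
        raise RuntimeError("signal handlers can only be installed from the main thread")

    def handler(signum, frame):
        on_interrupt()

    previous = signal.signal(signal.SIGINT, handler)  # returns the old handler
    try:
        yield
    finally:
        signal.signal(signal.SIGINT, previous)  # put the old handler back


cancel_event = threading.Event()  # stand-in for the cancel flag
before = signal.getsignal(signal.SIGINT)

with sigint_calls(cancel_event.set):
    signal.raise_signal(signal.SIGINT)  # simulate Ctrl-C (Python 3.8+)

after = signal.getsignal(signal.SIGINT)
print(cancel_event.is_set(), after == before)
```

Because the previous handler is reinstated on exit, whatever Click, Jupyter, or the default KeyboardInterrupt machinery had installed is back in effect as soon as the block ends.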
3. Cancel-on-disconnect in a FastAPI / SSE sidecar¶
The motivating use case: a streaming HTTP server should free the GPU
when the client closes the connection, instead of running to
max_tokens.
import asyncio
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from cyllama import LLM, GenerationConfig
app = FastAPI()
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
@app.get("/stream")
async def stream(request: Request, prompt: str):
    async def gen():
        loop = asyncio.get_running_loop()
        it = iter(llm(prompt, config=GenerationConfig(max_tokens=2048), stream=True))
        try:
            while True:
                if await request.is_disconnected():
                    llm.cancel()  # aborts mid-decode
                    break
                chunk = await loop.run_in_executor(None, next, it, None)
                if chunk is None:
                    break
                yield f"data: {chunk}\n\n"
        finally:
            llm.cancel()  # idempotent; safe on normal exit too
    return StreamingResponse(gen(), media_type="text/event-stream")
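As written, the handler above appears to check for disconnection only between chunks: while it awaits the next chunk the coroutine is suspended, so a client that leaves during a long prefill would go unnoticed until the first chunk arrives. One possible variation is to keep polling while the chunk is still pending. The sketch below uses only the standard library, and every name in it is a hypothetical stand-in: slow_chunks plays the role of the blocking generator, is_disconnected the role of request.is_disconnected, and cancel_event.set the role of llm.cancel.

```python
import asyncio
import threading
import time

cancel_event = threading.Event()  # stand-in for the flag behind llm.cancel()


def slow_chunks(n_chunks, steps_per_chunk=20, step_delay=0.01):
    """Hypothetical blocking generator: each chunk takes many small steps,
    and the cancel flag is polled at every step (like a per-op abort check)."""
    for i in range(n_chunks):
        for _ in range(steps_per_chunk):
            if cancel_event.is_set():
                return
            time.sleep(step_delay)
        yield f"chunk{i}"


async def stream(is_disconnected, poll_interval=0.02):
    """Yield chunks, polling is_disconnected() even while a chunk is pending."""
    loop = asyncio.get_running_loop()
    it = slow_chunks(100)
    try:
        while True:
            fut = loop.run_in_executor(None, next, it, None)
            while True:
                done, _ = await asyncio.wait({fut}, timeout=poll_interval)
                if done:
                    break
                if await is_disconnected():
                    cancel_event.set()  # worker notices at its next step
                    await fut           # let the worker thread unwind
                    return
            chunk = fut.result()
            if chunk is None:           # generator exhausted
                return
            yield chunk
    finally:
        cancel_event.set()


async def main():
    start = time.monotonic()

    async def is_disconnected():
        # Pretend the client goes away 0.3 s into the stream.
        return time.monotonic() - start > 0.3

    return [chunk async for chunk in stream(is_disconnected)]


received = asyncio.run(main())
print(received)
```

The trade-off is one extra wake-up per poll interval while a chunk is in flight, in exchange for noticing a disconnect within about one interval rather than one chunk.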
4. Direct use of LlamaContext.cancel¶
For callers working below the LLM API:
from cyllama import LLM
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
list(llm("warm up", stream=True)) # forces _ensure_context()
ctx = llm._ctx
ctx.cancel = True # sets the C bint
assert ctx.cancel is True
ctx.cancel = False # clear before next call
Notes and caveats¶
- Performance. The between-token check is one Event.is_set() per token (sub-microsecond). The mid-decode callback is noexcept nogil and does a single indirect load per ggml op poll. Overhead is not measurable against decode time.
- Memory model. The C flag is a plain bint, not a C11 atomic. Aligned word writes are atomic on every CPU cyllama targets; a stale read just delays cancellation by one op poll. This is acceptable for a one-shot "abort now" signal.
- Custom abort callbacks. LLM auto-installs the cancel callback on every context creation. Calling LlamaContext.set_abort_callback() with a Python callable overrides it. To combine user logic with cancellation, consult ctx.cancel (or your own state) inside that Python callback.
- Stable Diffusion. cyllama.sd does not currently support cancellation; generate_image() is a single blocking C call with no abort path. Tracked against upstream leejet/stable-diffusion.cpp#1124.
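For the custom abort callback note above, combining user logic with cancellation might look like the following sketch. FakeContext and make_abort_callback are hypothetical names, and the zero-argument callback signature is an assumption: the signature expected by LlamaContext.set_abort_callback() is not stated here, so check it before adapting this.

```python
import time


class FakeContext:
    """Hypothetical stand-in for a context object exposing a bool cancel attribute."""

    def __init__(self):
        self.cancel = False


def make_abort_callback(ctx, timeout_s):
    """Return a zero-argument callable that is True when decoding should stop.

    Combines a user rule (a wall-clock deadline) with the cancel flag.
    The zero-argument signature is an assumption made for this sketch.
    """
    deadline = time.monotonic() + timeout_s

    def should_abort():
        return ctx.cancel or time.monotonic() > deadline

    return should_abort


ctx = FakeContext()
should_abort = make_abort_callback(ctx, timeout_s=60.0)

first = should_abort()   # neither condition holds yet
ctx.cancel = True
second = should_abort()  # the cancel flag now triggers an abort

# A deadline that has already passed triggers an abort on its own.
expired = make_abort_callback(FakeContext(), timeout_s=-1.0)()
print(first, second, expired)  # False True True
```

Keeping the cancel check inside the user callback preserves the behaviour of cancel() even though the auto-installed callback has been overridden.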