Quickstart
Get started with cyllama in under 5 minutes.
Installation
# Clone and build
git clone https://github.com/shakfu/cyllama.git
cd cyllama
make
# Download a test model
make download
Your First Generation
from cyllama import complete
response = complete(
    "What is Python?",
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)
print(response)
Streaming Output
See tokens as they're generated:
from cyllama import complete
for chunk in complete(
    "Tell me a short story",
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    stream=True
):
    print(chunk, end="", flush=True)
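When streaming, it is often useful to keep the full text as well as echo it. The following is a minimal sketch, assuming each chunk is a string (as the `print(chunk, ...)` call above suggests). `collect_stream` is a hypothetical helper, not part of cyllama, and the list of chunks stands in for `complete(..., stream=True)` so the example runs without a model:

```python
import io
import sys

def collect_stream(chunks, out=sys.stdout):
    """Echo chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        out.write(chunk)
        out.flush()
        parts.append(chunk)
    return "".join(parts)

# Stand-in for complete(..., stream=True); any iterable of strings works.
fake_chunks = ["Once ", "upon ", "a ", "time."]
buffer = io.StringIO()
text = collect_stream(fake_chunks, out=buffer)
```

With a real model, the same helper could presumably be given the streaming call from the example above in place of `fake_chunks`.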
Chat Conversations
Multi-turn chat with message history:
from cyllama import chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
print(response)
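The example above sends a single user message. To continue the conversation, each reply is typically appended to the history before the next user message is added. The sketch below shows that loop under stated assumptions: `run_conversation` and `fake_reply` are hypothetical names, the `"assistant"` role follows the common chat convention rather than anything confirmed for cyllama, and the reply is assumed to convert to text with `str()`:

```python
def run_conversation(reply_fn, system_prompt, user_inputs):
    """Feed user inputs one at a time, growing the message history."""
    messages = [{"role": "system", "content": system_prompt}]
    for text in user_inputs:
        messages.append({"role": "user", "content": text})
        reply = str(reply_fn(messages))  # assumes the reply converts to text
        messages.append({"role": "assistant", "content": reply})
    return messages

# Stand-in for the model: reports how many messages it was given.
def fake_reply(messages):
    return f"(reply after {len(messages)} messages)"

history = run_conversation(
    fake_reply,
    "You are a helpful assistant.",
    ["What is machine learning?", "Give one example."],
)
for message in history:
    print(f"{message['role']}: {message['content']}")
```

To try this against a model, `reply_fn` could be something like `lambda m: chat(m, model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")`, reusing the `chat` call shown above.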
Reusable LLM Instance
For multiple generations, keep the model loaded:
from cyllama import LLM
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
# Model stays loaded between calls
response1 = llm("What is 2+2?")
response2 = llm("What is the capital of France?")
response3 = llm("Explain gravity in one sentence.")
Async Generation
Non-blocking generation for async applications:
import asyncio
from cyllama import AsyncLLM
async def main():
    async with AsyncLLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf") as llm:
        response = await llm("What is Python?")
        print(response)

        # Async streaming
        async for chunk in llm.stream("Tell me a joke"):
            print(chunk, end="", flush=True)

asyncio.run(main())
Using Agents
Build tool-using AI agents:
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    # eval() runs arbitrary Python: fine for a local demo, unsafe for untrusted input
    return str(eval(expression))

llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)  # expected: "100"
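The `calculate` tool above passes a model-supplied string to `eval()`, which will execute any Python expression. For anything beyond a local demo, a restricted evaluator is a safer choice. The following is one possible sketch using the standard-library `ast` module; `safe_calculate` is a hypothetical name and is not part of cyllama:

```python
import ast
import operator

# Whitelisted operators; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculate(expression: str) -> str:
    """Evaluate a purely arithmetic expression without calling eval()."""
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        # Only plain int/float literals are allowed (no bools, strings, names).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](visit(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    # Malformed input raises SyntaxError from ast.parse.
    return str(visit(ast.parse(expression.strip(), mode="eval")))

print(safe_calculate("25 * 4"))
```

The function keeps the same `str -> str` signature as `calculate`, so it could presumably be decorated with `@tool` in the same way. Exponentiation is left out deliberately, since very large powers can consume unbounded time and memory.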
Configuration Options
Customize generation behavior:
from cyllama import LLM, GenerationConfig
config = GenerationConfig(
    max_tokens=200,       # Maximum tokens to generate
    temperature=0.7,      # Randomness (0.0 = deterministic)
    top_p=0.95,           # Nucleus sampling
    top_k=40,             # Top-k sampling
    repeat_penalty=1.1,   # Penalize repetition
)
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf", config=config)
response = llm("Write a haiku about programming")
Or pass parameters directly:
from cyllama import LLM
llm = LLM("models/Llama-3.2-1B-Instruct-Q8_0.gguf", temperature=0.9, max_tokens=100)
response = llm("Be creative!")
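To build intuition for `temperature`, `top_k`, and `top_p`, the sketch below applies the three steps to a small list of logits in plain Python. It is an illustration of the general techniques only, not cyllama's or llama.cpp's actual sampler: the function name is hypothetical, and real implementations may order or renormalise the steps differently.

```python
import math

def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p.

    Requires temperature > 0 (a value of 0 would mean greedy selection).
    """
    # Temperature rescales logits: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [(i, w / total) for i, w in enumerate(weights)]

    # Top-k: keep only the k most likely tokens.
    probs.sort(key=lambda pair: pair[1], reverse=True)
    probs = probs[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for index, p in probs:
        kept.append((index, p))
        mass += p
        if mass >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}

logits = [2.0, 1.0, 0.5, -1.0]
print(filter_distribution(logits, temperature=1.0, top_k=3, top_p=0.8))
```

With these inputs only the two most likely tokens survive the top-p cut. Lowering `temperature` concentrates more probability on the top token, which is why small values give more repeatable output.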
Batch Processing
Process multiple prompts efficiently:
from cyllama import batch_generate
prompts = [
    "What is 2+2?",
    "What is 3+3?",
    "What is 4+4?"
]
responses = batch_generate(
    prompts,
    model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf"
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
Image Generation
Generate images with Stable Diffusion:
from cyllama.sd import text_to_image
images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
images[0].save("cat.png")
Speech-to-Text
Transcribe audio with Whisper:
from cyllama.whisper import WhisperContext, WhisperFullParams
ctx = WhisperContext("models/ggml-base.en.bin")
params = WhisperFullParams()
params.language = "en"
# samples must be 16 kHz mono float32 audio, prepared before this call
ctx.full(samples, params)
for i in range(ctx.full_n_segments()):
    text = ctx.full_get_segment_text(i)
    print(text)
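The snippet above expects `samples` to exist already. One way to obtain 16 kHz mono float32 samples is to read a 16-bit PCM WAV file with the standard library, as sketched below. `load_wav_float32` is a hypothetical helper, not part of cyllama, and it is an assumption that `ctx.full` accepts an `array.array` of floats; a NumPy array may be required instead.

```python
import array
import sys
import wave

def load_wav_float32(path):
    """Read a 16-bit PCM, 16 kHz, mono WAV file as float32 samples in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sample rate")
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw = wav.readframes(wav.getnframes())

    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian

    # Typecode "f" stores 32-bit floats; scale int16 values into [-1, 1).
    return array.array("f", (sample / 32768.0 for sample in pcm))
```

Usage would look like `samples = load_wav_float32("audio.wav")`, where the file name is a placeholder. If a NumPy array turns out to be needed, `numpy.asarray(samples, dtype=numpy.float32)` converts the result. Audio at another sample rate or with several channels has to be resampled or downmixed first, which this sketch does not attempt.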
Next Steps
- User Guide - Comprehensive feature documentation
- API Reference - Complete API documentation
- Cookbook - Practical recipes and patterns
- Agents Overview - Building tool-using agents