fluffbuzz onboard. This page is the opinionated guide for higher-end local stacks and custom OpenAI-compatible local servers.
Recommended: LM Studio + large local model (Responses API)
Best current local stack. Load a large model in LM Studio (for example, a full-size Qwen, DeepSeek, or Llama build), enable the local server (defaulthttp://127.0.0.1:1234), and use Responses API to keep reasoning separate from final text.
- Install LM Studio: https://lmstudio.ai
- In LM Studio, download the largest model build available (avoid “small”/heavily quantized variants), start the server, confirm
http://127.0.0.1:1234/v1/modelslists it. - Replace
my-local-modelwith the actual model ID shown in LM Studio. - Keep the model loaded; cold-load adds startup latency.
- Adjust
contextWindow/maxTokensif your LM Studio build differs. - For WhatsApp, stick to Responses API so only final text is sent.
models.mode: "merge" so fallbacks stay available.
Hybrid config: hosted primary, local fallback
Local-first with hosted safety net
Swap the primary and fallback order; keep the same providers block andmodels.mode: "merge" so you can fall back to Sonnet or Opus when the local box is down.
Regional hosting / data routing
- Hosted MiniMax/Kimi/GLM variants also exist on OpenRouter with region-pinned endpoints (e.g., US-hosted). Pick the regional variant there to keep traffic in your chosen jurisdiction while still using
models.mode: "merge"for Anthropic/OpenAI fallbacks. - Local-only remains the strongest privacy path; hosted regional routing is the middle ground when you need provider features but want control over data flow.
Other OpenAI-compatible local proxies
vLLM, LiteLLM, OAI-proxy, or custom gateways work if they expose an OpenAI-style/v1 endpoint. Replace the provider block above with your endpoint and model ID:
models.mode: "merge" so hosted models stay available as fallbacks.
Behavior note for local/proxied /v1 backends:
- FluffBuzz treats these as proxy-style OpenAI-compatible routes, not native OpenAI endpoints
- native OpenAI-only request shaping does not apply here: no
service_tier, no Responsesstore, no OpenAI reasoning-compat payload shaping, and no prompt-cache hints - hidden FluffBuzz attribution headers (
originator,version,User-Agent) are not injected on these custom proxy URLs
- Some servers accept only string
messages[].contenton Chat Completions, not structured content-part arrays. Setmodels.providers.<provider>.models[].compat.requiresStringContent: truefor those endpoints. - Some smaller or stricter local backends are unstable with FluffBuzz’s full
agent-runtime prompt shape, especially when tool schemas are included. If the
backend works for tiny direct
/v1/chat/completionscalls but fails on normal FluffBuzz agent turns, first tryagents.defaults.experimental.localModelLean: trueto drop heavyweight default tools likebrowser,cron, andmessage; this is an experimental flag, not a stable default-mode setting. See Experimental Features. If that still fails, trymodels.providers.<provider>.models[].compat.supportsTools: false. - If the backend still fails only on larger FluffBuzz runs, the remaining issue is usually upstream model/server capacity or a backend bug, not FluffBuzz’s transport layer.
Troubleshooting
- Gateway can reach the proxy?
curl http://127.0.0.1:1234/v1/models. - LM Studio model unloaded? Reload; cold start is a common “hanging” cause.
- FluffBuzz warns when the detected context window is below 32k and blocks below 16k. If you hit that preflight, raise the server/model context limit or choose a larger model.
- Context errors? Lower
contextWindowor raise your server limit. - OpenAI-compatible server returns
messages[].content ... expected a string? Addcompat.requiresStringContent: trueon that model entry. - Direct tiny
/v1/chat/completionscalls work, butfluffbuzz infer model runfails on Gemma or another local model? Disable tool schemas first withcompat.supportsTools: false, then retest. If the server still crashes only on larger FluffBuzz prompts, treat it as an upstream server/model limitation. - Safety: local models skip provider-side filters; keep agents narrow and compaction on to limit prompt injection blast radius.