inferrs can serve local models behind an OpenAI-compatible /v1 API. FluffBuzz works with inferrs through the generic openai-completions path. inferrs is currently best treated as a custom self-hosted OpenAI-compatible backend, not a dedicated FluffBuzz provider plugin.

Getting started

1. Start inferrs with a model

inferrs serve google/gemma-4-E2B-it \
  --host 127.0.0.1 \
  --port 8080 \
  --device metal
2. Verify the server is reachable

curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/v1/models
3. Add a FluffBuzz provider entry

Add an explicit provider entry and point your default model at it. See the full config example below.

Full config example

This example uses Gemma 4 on a local inferrs server.
{
  agents: {
    defaults: {
      model: { primary: "inferrs/google/gemma-4-E2B-it" },
      models: {
        "inferrs/google/gemma-4-E2B-it": {
          alias: "Gemma 4 (inferrs)",
        },
      },
    },
  },
  models: {
    mode: "merge",
    providers: {
      inferrs: {
        baseUrl: "http://127.0.0.1:8080/v1",
        apiKey: "inferrs-local",
        api: "openai-completions",
        models: [
          {
            id: "google/gemma-4-E2B-it",
            name: "Gemma 4 E2B (inferrs)",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 131072,
            maxTokens: 4096,
            compat: {
              requiresStringContent: true,
            },
          },
        ],
      },
    },
  },
}

Advanced configuration

Some inferrs Chat Completions routes accept only string messages[].content, not structured content-part arrays.
If FluffBuzz runs fail with an error like:
messages[1].content: invalid type: sequence, expected a string
set compat.requiresStringContent: true in your model entry.
compat: {
  requiresStringContent: true
}
FluffBuzz will flatten pure text content parts into plain strings before sending the request.
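Conceptually, the flattening step works like the sketch below. This is a hypothetical helper for illustration, not FluffBuzz's actual implementation; it only collapses content arrays whose parts are all text, leaving string content and mixed-media content untouched.

```python
def flatten_text_content(messages):
    """Collapse text-only content-part arrays into plain strings.

    Illustrative sketch of the requiresStringContent behavior; the real
    FluffBuzz implementation may differ in detail.
    """
    flattened = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and all(
            isinstance(part, dict) and part.get("type") == "text" for part in content
        ):
            # Join the text parts into one plain string.
            msg = {**msg, "content": "".join(part.get("text", "") for part in content)}
        flattened.append(msg)
    return flattened
```

For example, `[{"role": "user", "content": [{"type": "text", "text": "What is 2 + 2?"}]}]` becomes `[{"role": "user", "content": "What is 2 + 2?"}]`, which satisfies backends that require string `messages[].content`.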
Some current inferrs + Gemma combinations accept small direct /v1/chat/completions requests but still fail on full FluffBuzz agent-runtime turns. If that happens, try this first:
compat: {
  requiresStringContent: true,
  supportsTools: false
}
That disables FluffBuzz’s tool schema surface for the model and can reduce prompt pressure on stricter local backends. If tiny direct requests still work but normal FluffBuzz agent turns continue to crash inside inferrs, the remaining issue is usually upstream model/server behavior rather than FluffBuzz’s transport layer.
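As a rough mental model, turning off supportsTools means the OpenAI-style tool fields never reach the request body. The sketch below is illustrative only (the field names follow the OpenAI chat-completions request shape; the helper is hypothetical, not FluffBuzz code):

```python
def apply_tool_compat(payload, supports_tools):
    """Drop OpenAI-style tool fields when the model entry sets supportsTools: false.

    Hypothetical sketch; shows the effect, not FluffBuzz's actual request pipeline.
    """
    if supports_tools:
        return payload
    # Remove the tool schema surface from the outgoing request.
    return {k: v for k, v in payload.items() if k not in ("tools", "tool_choice")}
```

With the tool fields removed, strict local backends see a plain chat request with only the model, messages, and sampling parameters.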
Once configured, test both layers:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"google/gemma-4-E2B-it","messages":[{"role":"user","content":"What is 2 + 2?"}],"stream":false}'
fluffbuzz infer model run \
  --model inferrs/google/gemma-4-E2B-it \
  --prompt "What is 2 + 2? Reply with one short sentence." \
  --json
If the first command works but the second fails, check the troubleshooting section below.
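When inspecting the raw response from the curl probe, the body follows the standard OpenAI chat-completions shape. A sketch of pulling the assistant reply out of a captured response (the body below is a fixture for illustration, not real inferrs output, which will include additional fields such as id and usage):

```python
import json

# Fixture in the OpenAI chat-completions response shape (illustrative only).
raw = json.dumps({
    "choices": [
        {"message": {"role": "assistant", "content": "2 + 2 = 4."}}
    ]
})

body = json.loads(raw)
# The assistant reply lives at choices[0].message.content.
reply = body["choices"][0]["message"]["content"]
print(reply)  # -> 2 + 2 = 4.
```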
inferrs is treated as a proxy-style OpenAI-compatible /v1 backend, not a native OpenAI endpoint.
  • Native OpenAI-only request shaping does not apply here
  • No service_tier, no Responses store, no prompt-cache hints, and no OpenAI reasoning-compat payload shaping
  • Hidden FluffBuzz attribution headers (originator, version, User-Agent) are not injected on custom inferrs base URLs

Troubleshooting

  • Connection failures: inferrs is not running, not reachable, or not bound to the expected host/port. Make sure the server is started and listening on the address you configured.
  • Errors like "invalid type: sequence, expected a string": set compat.requiresStringContent: true in the model entry. See the requiresStringContent section above for details.
  • Tool-schema errors: try setting compat.supportsTools: false to disable the tool schema surface. See the Gemma tool-schema caveat above.
  • Crashes on larger agent turns: if FluffBuzz no longer gets schema errors but inferrs still crashes, treat it as an upstream inferrs or model limitation. Reduce prompt pressure or switch to a different local backend or model.
For general help, see Troubleshooting and FAQ.

Local models

Running FluffBuzz against local model servers.

Gateway troubleshooting

Debugging local OpenAI-compatible backends that pass probes but fail agent runs.

Model selection

Overview of all providers, model refs, and failover behavior.