
Media Generation and Understanding

FluffBuzz generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.
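To illustrate the "each tool only appears when at least one backing provider is configured" behavior, here is a minimal sketch of conditional tool registration. The names (`MEDIA_TOOLS`, `available_tools`) are illustrative, not FluffBuzz's actual API; the provider lists come from the capability table below.

```python
# Hypothetical sketch: a media tool is only exposed to the agent when at
# least one of its backing providers is configured.
# MEDIA_TOOLS and available_tools are illustrative names, not FluffBuzz's API.

MEDIA_TOOLS = {
    "image_generate": {"comfyui", "fal", "google", "minimax", "openai", "vydra", "xai"},
    "video_generate": {"alibaba", "byteplus", "comfyui", "fal", "google", "minimax",
                       "openai", "qwen", "runway", "together", "vydra", "xai"},
    "music_generate": {"comfyui", "google", "minimax"},
    "tts": {"elevenlabs", "microsoft", "minimax", "openai", "xai"},
}

def available_tools(configured_providers: set[str]) -> list[str]:
    """Return only the media tools that have at least one configured backend."""
    return [tool for tool, backends in MEDIA_TOOLS.items()
            if backends & configured_providers]

# With only OpenAI configured, music_generate stays hidden because
# OpenAI does not back the music tool.
print(available_tools({"openai"}))
```

The point of the set intersection is that adding or removing a provider in config changes the tool surface without any per-tool wiring.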

Capabilities at a glance

| Capability | Tool | Providers | What it does |
| --- | --- | --- | --- |
| Image generation | image_generate | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
| Video generation | video_generate | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
| Music generation | music_generate | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
| Text-to-speech (TTS) | tts | ElevenLabs, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio |
| Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |

Provider capability matrix

This table shows which providers support which media capabilities across the platform.
| Provider | Image | Video | Music | TTS | STT / Transcription | Media Understanding |
| --- | --- | --- | --- | --- | --- | --- |
| Alibaba | | Yes | | | | |
| BytePlus | | Yes | | | | |
| ComfyUI | Yes | Yes | Yes | | | |
| Deepgram | | | | | Yes | |
| ElevenLabs | | | | Yes | Yes | |
| fal | Yes | Yes | | | | |
| Google | Yes | Yes | Yes | | | Yes |
| Microsoft | | | | Yes | | |
| MiniMax | Yes | Yes | Yes | Yes | | |
| Mistral | | | | | Yes | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes |
| Qwen | | Yes | | | | |
| Runway | | Yes | | | | |
| Together | | Yes | | | | |
| Vydra | Yes | Yes | | | | |
| xAI | Yes | Yes | | Yes | Yes | Yes |
Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
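A minimal sketch of the routing this paragraph describes: prefer the active reply model when it can see or hear the media, otherwise try a dedicated provider, and fall back to a CLI tool as a last resort. All names here (`Model`, `pick_describer`, `cli-fallback`) are hypothetical, not FluffBuzz's actual config schema.

```python
# Hypothetical media-understanding routing: use the active reply model if it
# is multimodal for the media kind, otherwise a configured fallback model,
# otherwise a local CLI. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    vision: bool = False   # can describe images and video frames
    audio: bool = False    # can transcribe/summarize audio

def pick_describer(active: Model, fallbacks: list[Model], kind: str) -> str:
    """kind is 'image', 'audio', or 'video'; audio needs an audio-capable model."""
    needs_audio = kind == "audio"
    for m in [active] + fallbacks:
        if m.audio if needs_audio else m.vision:
            return m.name
    return "cli-fallback"  # e.g. a local transcription or captioning tool
```

The design choice being sketched is that media understanding is not a tool the agent calls; it runs automatically against whatever capable model is already configured.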

How async generation works

Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls video_generate or music_generate, FluffBuzz submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, FluffBuzz wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.

Speech-to-text and provider surfaces

Deepgram, ElevenLabs, Mistral, OpenAI, and xAI can all transcribe inbound audio through the batch tools.media.audio path when configured. The same five providers also register Voice Call streaming STT, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording.

OpenAI maps to FluffBuzz's image, video, batch TTS, batch STT, Voice Call streaming STT, realtime voice, and memory embedding surfaces. xAI currently maps to FluffBuzz's image, video, search, code-execution, batch TTS, batch STT, and Voice Call streaming STT surfaces. Realtime voice is an upstream xAI capability, but it is not registered in FluffBuzz until the shared realtime voice contract can represent it.
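The async flow described above can be sketched as a small task ledger: submit the job, record the task ID against the originating channel, and hand the result back when the provider reports completion. `TaskLedger` and its method names are hypothetical, not FluffBuzz's actual internals.

```python
# Minimal sketch of the async generation lifecycle: submit, track by task ID,
# then wake the agent with the result. Names are illustrative.
import uuid

class TaskLedger:
    def __init__(self):
        self.pending = {}  # task_id -> channel the finished media should go to

    def submit(self, kind: str, prompt: str, channel: str) -> str:
        """Submit a video/music job; return a task ID immediately."""
        task_id = str(uuid.uuid4())   # in practice the provider issues this
        self.pending[task_id] = channel
        return task_id                # agent keeps handling other messages

    def on_provider_done(self, task_id: str, media_url: str) -> tuple[str, str]:
        """Called when the provider reports completion; wakes the agent."""
        channel = self.pending.pop(task_id)
        return channel, media_url     # agent posts the media into this channel
```

Image generation and TTS would bypass this ledger entirely, since they complete inline with the reply.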