How to Build a One-Prompt AI Video Workflow: Script, Voice, Avatar, and Edit

From a Single Prompt to a Finished Video: What’s Actually Possible Now

A complete YouTube video — written, narrated, presented by an AI avatar, and edited — used to take a team of people several days. Now it can happen in under an hour, triggered by a single prompt.

That’s not marketing copy. It’s a practical outcome of chaining Claude’s writing ability with ElevenLabs’ voice synthesis, HeyGen’s AI avatar rendering, and FFmpeg’s video processing. The result is a one-prompt AI video workflow that handles everything from script to final export without a human touching a timeline.

This post breaks down exactly how that workflow is built: what each tool does, how they connect, where the handoffs happen, and what you need to get it running. Whether you’re a solo creator or running a content operation at scale, the mechanics here are worth understanding.

What “One Prompt” Actually Means

Before getting into the tools, let’s be precise about what a one-prompt workflow means — because it’s easy to oversell.

You type something like: “Create a 3-minute explainer video about how solar panels work, targeted at homeowners, with a professional tone.”

From that, the workflow:

Generates a structured video script with an intro, three main sections, and a call-to-action
Converts the script into natural-sounding spoken audio
Renders a lifelike AI avatar reading that script on screen
Assembles the audio, avatar footage, and any B-roll or captions into a finished video file

You don’t touch a script editor, audio recording booth, video editor, or rendering queue. The prompt goes in, the video comes out.

The word “single” is slightly aspirational in practice — most setups require a few configuration decisions upfront. But once those are made, the workflow runs end-to-end from one input.

The Four Tools That Make This Work

Claude: Scriptwriter and Orchestrator

Claude handles the thinking layer. In this workflow, Claude’s role isn’t just to write a good script — it’s to write a script formatted for production.

That means outputting structured JSON or clearly delimited sections: an intro block, timed speaking segments, callout text for lower thirds, and suggested visual descriptions. A raw essay won’t work downstream. The output needs to be machine-readable enough that the next tool can pick it up without human reformatting.

A well-structured prompt to Claude might specify:

Target length in seconds or words (voice synthesis tools bill by character)
Tone and pacing preferences
Any branding language or phrases to include or avoid
Output format — usually JSON with fields like segment_id, speaker_text, caption_text, and visual_note

Claude is also used at the orchestration level in more advanced setups — evaluating whether generated sections meet quality criteria, reformatting outputs from one tool to feed into the next, and writing fallback content if something fails.

ElevenLabs: Voice Synthesis

ElevenLabs converts the script text into audio. Its strength is prosody — the rhythm, emphasis, and natural pausing that makes AI voice sound like a real person rather than a GPS system.

For video workflows, ElevenLabs’ API accepts plain text or SSML (a markup language for speech) and returns an MP3 or WAV file. You can clone a specific voice using a few minutes of audio samples, or use one of their pre-built voices.

Key settings to configure:

Voice selection — a consistent voice across all videos builds recognizable brand audio
Stability vs. similarity — higher stability produces more consistent output; lower stability sounds more expressive but less predictable
Model selection — their Turbo models generate faster (useful in automated pipelines); their higher-quality models sound better

One thing to handle carefully: ElevenLabs is sensitive to formatting. Bullet points, headers, and markdown in the input text produce weird results. Claude’s script output needs to be stripped of all formatting before it goes into the voice API call.
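A minimal sketch of that stripping step, assuming the script text only contains common markdown (headers, bullets, numbered lists, links, emphasis markers). The function name `strip_formatting` is illustrative and does not come from any of the tools discussed here.

```python
import re


def strip_formatting(text: str) -> str:
    """Remove common markdown so only speakable sentences remain."""
    # Drop header markers such as "# " or "## " at the start of a line.
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)
    # Drop bullet and numbered-list markers at the start of a line.
    text = re.sub(r"^\s*(?:[-*+]|\d+[.)])\s+", "", text, flags=re.MULTILINE)
    # Replace markdown links with their visible label.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Remove emphasis and inline-code markers (this also removes underscores).
    text = re.sub(r"[*_`]+", "", text)
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Because the emphasis rule removes every underscore and asterisk, this version is only suitable for prose; a script that legitimately contains those characters would need a narrower pattern.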

HeyGen: AI Avatar Rendering

HeyGen takes the audio file and renders a video of a human avatar speaking it. You can use one of their stock avatars or create a custom one from a recorded video of yourself.

The workflow here is:

Upload the audio file generated by ElevenLabs
Select your avatar
Submit a render job via the HeyGen API
Poll for completion, then download the rendered MP4

HeyGen’s rendering takes a few minutes per video — it’s not instant. For automated pipelines, you need to handle asynchronous job completion. The API returns a job ID when you submit, and you check its status on a loop until it’s done.

HeyGen also supports scene-based rendering, where different sections can use different backgrounds, overlays, or avatar poses. For more complex videos, you’d split the script into segments and render each separately before assembling them in the edit step.

FFmpeg: Video Assembly and Export

FFmpeg is a command-line video processing tool that handles the final assembly. It’s free, runs locally or on a server, and can do almost anything: concatenate clips, add captions, overlay B-roll, mix audio tracks, adjust resolution, and export to any format.

In a one-prompt video workflow, FFmpeg typically handles:

Merging clips — if you rendered multiple HeyGen segments, FFmpeg stitches them together
Adding captions — subtitle files (.srt or .vtt) generated from the transcript get burned into or attached to the video
Logo and lower third overlays — static image overlays for branding
Audio normalization — ensuring consistent volume levels
Final export — encoding to H.264/AAC at the right resolution for the target platform

FFmpeg runs entirely via command line, which makes it easy to invoke from Python, JavaScript, or any automation script. You build the FFmpeg command string dynamically based on what assets were generated, then execute it.
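One way to sketch that dynamic command construction in Python is shown below. It assumes captions and a logo are optional assets, and it builds an argument list rather than a single string so that file paths with spaces do not need shell quoting. The function name and parameters are hypothetical.

```python
from typing import List, Optional


def build_ffmpeg_command(
    video: str,
    output: str,
    subtitles: Optional[str] = None,
    logo: Optional[str] = None,
) -> List[str]:
    """Build an ffmpeg argument list based on which assets exist."""
    cmd = ["ffmpeg", "-y", "-i", video]
    if logo:
        cmd += ["-i", logo]
    if subtitles and logo:
        # Burn captions first, then overlay the logo in the top-right corner.
        graph = f"[0:v]subtitles={subtitles}[v];[v][1:v]overlay=W-w-10:10"
        cmd += ["-filter_complex", graph]
    elif logo:
        cmd += ["-filter_complex", "overlay=W-w-10:10"]
    elif subtitles:
        cmd += ["-vf", f"subtitles={subtitles}"]
    # Encode to H.264 video and AAC audio for the final export.
    cmd += ["-c:v", "libx264", "-c:a", "aac", output]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Subtitle paths containing colons, quotes, or backslashes need extra escaping inside the filter string, which this sketch does not attempt.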

Building the Workflow: Step by Step

Here’s how the full pipeline works in sequence.

Step 1: Receive and Parse the Input Prompt

The workflow starts when it receives the input from a user, a form submission, a scheduled trigger, or an API call. The raw prompt goes to Claude with a system instruction that defines the expected output format.

Example system prompt to Claude:

You are a video script writer. Given a topic and parameters, output a production-ready script as a JSON array. Each element should include:
- segment_id (integer)
- speaker_text (string, exactly as it should be spoken)
- caption_text (string, shorter version for subtitle display)
- visual_note (optional string describing suggested background or B-roll)

Keep each segment between 30 and 60 seconds when spoken at a conversational pace. Use plain sentences. No markdown, bullets, or headers in speaker_text.

Step 2: Generate and Validate the Script

Claude returns the structured script. Before moving on, the workflow should validate:

All required fields are present in every segment
speaker_text contains no formatting characters
Total estimated duration is within the target range (rough estimate: 150 words ≈ 1 minute)

If validation fails, the workflow can either retry with Claude or flag the error for review. Automated retries work well for formatting issues; content problems usually need a different prompt approach.
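The three checks above could be sketched roughly as follows. The field names come from the system prompt in Step 1; the function names, the formatting pattern, and the idea of returning a list of error strings are assumptions made for illustration.

```python
import re
from typing import Any, Dict, List

REQUIRED_FIELDS = ("segment_id", "speaker_text", "caption_text")
WORDS_PER_MINUTE = 150  # rough conversational pace used for estimates
# Characters and line prefixes that usually indicate leftover markdown.
FORMATTING = re.compile(r"[#*_`]|^\s*(?:[-+]|\d+\.)\s", re.MULTILINE)


def estimate_seconds(text: str) -> float:
    """Estimate spoken duration from word count."""
    return len(text.split()) / WORDS_PER_MINUTE * 60


def validate_script(
    segments: List[Dict[str, Any]], min_seconds: float, max_seconds: float
) -> List[str]:
    """Return a list of problems; an empty list means the script passed."""
    errors: List[str] = []
    total = 0.0
    for index, segment in enumerate(segments):
        missing = [name for name in REQUIRED_FIELDS if name not in segment]
        if missing:
            errors.append(f"segment {index}: missing {', '.join(missing)}")
            continue
        if FORMATTING.search(segment["speaker_text"]):
            errors.append(f"segment {index}: formatting characters in speaker_text")
        total += estimate_seconds(segment["speaker_text"])
    if not min_seconds <= total <= max_seconds:
        errors.append(
            f"estimated duration {total:.0f}s is outside {min_seconds}-{max_seconds}s"
        )
    return errors
```

Returning every problem at once, instead of raising on the first, gives a retry prompt something specific to quote back to Claude.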

Step 3: Convert Script to Audio

The validated speaker_text from each segment goes to ElevenLabs. Depending on your setup, you can:

Send all segments as one concatenated text block (simpler, less control)
Send each segment individually and stitch the audio later (more control over pacing between sections)

The individual approach is generally better for longer videos because it lets you add natural pauses between segments and makes it easier to re-render a single section if there’s a quality issue.

Store each returned audio file with a filename that matches its segment_id so the assembly step can reconstruct the order.
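A small sketch of that naming scheme, assuming the voice API call is wrapped in a function that takes text and returns audio bytes. The `synthesize` callable is a hypothetical stand-in for your ElevenLabs client code, not a real library function.

```python
from pathlib import Path
from typing import Callable, Dict, List


def audio_path(out_dir: Path, segment_id: int, ext: str = "mp3") -> Path:
    """Zero-pad the id so alphabetical order matches playback order."""
    return out_dir / f"segment_{segment_id:03d}.{ext}"


def synthesize_segments(
    segments: List[Dict],
    synthesize: Callable[[str], bytes],
    out_dir: Path,
) -> List[Path]:
    """Generate one audio file per segment and return the paths in order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for segment in sorted(segments, key=lambda s: s["segment_id"]):
        path = audio_path(out_dir, segment["segment_id"])
        path.write_bytes(synthesize(segment["speaker_text"]))
        paths.append(path)
    return paths
```

Passing the synthesis function in as an argument keeps the file handling testable without network access, and re-rendering a single segment only overwrites that segment's file.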

Step 4: Render the Avatar Video

With audio files ready, submit render jobs to HeyGen. Pass each audio file along with your avatar configuration. If you’re using a consistent avatar across all segments, this is straightforward — just loop through the audio files and submit each as a separate job.

Keep track of all job IDs. Then poll the HeyGen API for completion. A simple polling loop checks every 30–60 seconds; most renders complete within 3–5 minutes.
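A polling loop over several jobs might look like the sketch below. The `get_status` callable, the `"completed"` and `"failed"` status strings, and the `video_url` key are all assumptions for illustration; check the HeyGen API reference for the actual endpoint and response fields.

```python
import time
from typing import Callable, Dict


def wait_for_renders(
    job_ids: Dict[int, str],
    get_status: Callable[[str], dict],
    interval: float = 30.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, str]:
    """Poll every pending job until all finish; map segment_id to video URL."""
    pending = dict(job_ids)
    video_urls: Dict[int, str] = {}
    deadline = time.monotonic() + timeout
    while pending:
        for segment_id, job_id in list(pending.items()):
            result = get_status(job_id)  # hypothetical wrapper around the status call
            if result["status"] == "completed":
                video_urls[segment_id] = result["video_url"]
                del pending[segment_id]
            elif result["status"] == "failed":
                raise RuntimeError(f"render failed for segment {segment_id} ({job_id})")
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{len(pending)} render(s) still pending")
            sleep(interval)
    return video_urls
```

Keying the jobs by `segment_id` means the downloaded files can be put back in script order regardless of which render finishes first.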

Once all segments are rendered, download the MP4 files.

Step 5: Generate Captions

You have two options for captions:

Derive them from the script — use the caption_text field Claude already generated and convert it to SRT format with estimated timestamps based on audio duration
Generate them from audio — use a transcription API (OpenAI Whisper works well) on the ElevenLabs audio files for more accurate word-level timing

Generating captions from the audio produces better-synced results but adds a step. For most automated workflows, deriving them from the script is accurate enough, especially since the source text is known.
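Deriving captions from the script can be sketched as below, assuming you already have a list of caption cues paired with their durations in seconds (for example, measured from the generated audio files). The function names are illustrative.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(cues: List[Tuple[str, float]]) -> str:
    """Lay cues end to end, each lasting its given duration."""
    blocks = []
    start = 0.0
    for index, (text, duration) in enumerate(cues, start=1):
        end = start + duration
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
        start = end
    return "\n".join(blocks)
```

A full 30 to 60 second segment is too long for a single on-screen caption, so in practice each segment's caption_text would likely be split into shorter cues, with the segment's duration divided among them in proportion to word count.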

Step 6: Assemble the Final Video

FFmpeg takes all the rendered segments and assembles the final video:

# Concatenate all segments (stream copy, no re-encoding)
ffmpeg -f concat -safe 0 -i filelist.txt -c copy combined.mp4

# Burn captions into the video; audio is copied unchanged
ffmpeg -i combined.mp4 -vf subtitles=captions.srt -c:a copy output_with_captions.mp4

# Overlay the logo 10 px from the top-right corner
ffmpeg -i output_with_captions.mp4 -i logo.png \
  -filter_complex "overlay=W-w-10:10" -c:a copy final_output.mp4

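The filelist.txt is a plain text file with one line per segment in playback order, each of the form file 'segment_001.mp4'. FFmpeg reads them sequentially and concatenates without re-encoding, which is faster and preserves quality but requires every segment to share the same codec, resolution, and frame rate.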
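Writing that list file from Python could look like this sketch. The function name is hypothetical; the `file '...'` line format and the quote escaping follow FFmpeg's concat demuxer conventions.

```python
from pathlib import Path
from typing import Iterable


def write_concat_list(segment_paths: Iterable[Path], list_path: Path) -> None:
    """Write one "file '<path>'" line per segment, in the order given."""
    lines = []
    for path in segment_paths:
        # A single quote inside a quoted path is written as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Relative paths in the list are resolved against the directory that contains the list file, so keeping filelist.txt next to the segment files avoids path surprises.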

Step 7: Deliver the Output

The finished video file gets routed to its destination — an S3 bucket, a Google Drive folder, a publishing queue, or wherever the workflow is configured to send it. An automated message or notification can include a preview link, metadata, and any quality flags logged during the run.

Handling the Async Problem

The biggest technical challenge in this workflow isn’t any single tool — it’s managing asynchronous processes. HeyGen renders take time. You can’t just chain API calls synchronously and wait.

The practical solutions:

Polling loops: Check job status every N seconds until complete. Simple, but it ties up a process or thread for the duration of each wait if you’re running many jobs.

Webhooks — HeyGen supports webhook callbacks that fire when a render completes. This is cleaner for production systems — the render job completes on HeyGen’s infrastructure, and your workflow resumes when notified.

Queue-based architecture — For high-volume setups, submit jobs to a queue, process completions as they arrive, and correlate outputs back to the originating request using job IDs. This scales well and handles failures gracefully.
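The correlation step in a queue-based setup can be sketched with Python's standard `queue` module. The event shape (`job_id`, `video_url`) and the function name are assumptions; a production system would more likely use an external message queue, but the bookkeeping is similar.

```python
import queue
from typing import Dict, List


def collect_completions(
    events: queue.Queue,
    requests_by_job: Dict[str, str],
    timeout: float = 5.0,
) -> Dict[str, List[str]]:
    """Group finished video URLs by the request that submitted each job."""
    outstanding = dict(requests_by_job)
    results: Dict[str, List[str]] = {rid: [] for rid in set(requests_by_job.values())}
    while outstanding:
        try:
            event = events.get(timeout=timeout)
        except queue.Empty:
            break  # nothing arrived in time; return what we have so far
        request_id = outstanding.pop(event["job_id"], None)
        if request_id is None:
            continue  # unknown or duplicate job id; ignore it
        results[request_id].append(event["video_url"])
    return results
```

Popping each job id as it completes makes duplicate notifications harmless, which matters when a webhook or worker retries delivery.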

For most creators and small teams running a handful of videos per day, a polling approach with reasonable intervals is perfectly fine.

Where MindStudio Fits Into This Workflow

Building this pipeline from scratch means writing custom API integration code for Claude, ElevenLabs, and HeyGen, handling authentication, managing async state, and deploying it somewhere. That’s a reasonable project for a developer — maybe a few days of work.

MindStudio’s AI Media Workbench makes this significantly faster. It’s a workspace where you can chain media generation steps visually, without writing infrastructure code. The platform has built-in integrations with the major AI models and handles the async coordination, retries, and delivery logic that’s otherwise tedious to build yourself.

You can build the one-prompt video workflow described in this article as a MindStudio agent — connecting Claude for script generation, ElevenLabs for voice, HeyGen for avatar rendering, and FFmpeg-style video processing tools — all through a visual workflow builder. The 24+ built-in media tools cover most of what you’d otherwise invoke FFmpeg for: clip merging, subtitle generation, and video export.

The result is a deployable workflow that anyone on your team can trigger from a form or API call, without needing to understand what’s happening under the hood. You can try it free at mindstudio.ai.

For developers who want more control, MindStudio’s Agent Skills Plugin lets you call these capabilities from your own Claude Code or LangChain agents as typed method calls — so you can integrate the media production layer into a larger agentic system without rebuilding the infrastructure.

Common Mistakes (and How to Avoid Them)

Script formatting bleeds into audio

If Claude’s script output includes markdown, numbered lists, or headers and those make it into ElevenLabs, the voice synthesis produces garbage: “hashtag introduction,” “one period,” etc. Always strip formatting from speaker_text before the audio step.

Over-long segments

Segments longer than 90 seconds tend to lose vocal energy even with ElevenLabs’ more expressive settings. Break content into shorter chunks. This also helps with re-rendering — if one segment sounds off, you only re-render that piece.

Not validating JSON output from Claude

Claude usually returns well-formed JSON, but not always. Wrap the output parsing in error handling. If the parse fails, retry with an explicit instruction to Claude: “Return only valid JSON, no explanation text before or after.”
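A rough sketch of that parse-and-retry pattern follows. The `call_model` callable is a hypothetical wrapper around whatever Claude client you use; it takes a prompt string and returns the raw response text.

```python
import json
from typing import Callable, List

RETRY_SUFFIX = "Return only valid JSON, no explanation text before or after."


def parse_script(raw: str) -> List[dict]:
    """Extract the outermost JSON array, tolerating text around it."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    return json.loads(raw[start : end + 1])


def generate_script(
    prompt: str, call_model: Callable[[str], str], max_attempts: int = 3
) -> List[dict]:
    """Call the model, retrying with a stricter instruction if parsing fails."""
    last_error: Exception = ValueError("no attempts made")
    for attempt in range(max_attempts):
        text = prompt if attempt == 0 else f"{prompt}\n\n{RETRY_SUFFIX}"
        try:
            return parse_script(call_model(text))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Slicing from the first `[` to the last `]` handles the common case where the model wraps the array in a sentence or a code fence, while still failing loudly when the array itself is malformed.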

Ignoring render failures

HeyGen jobs occasionally fail — usually due to audio file issues (format, length, silence at the start). Build in status checking that catches failed jobs and surfaces them, rather than silently producing an incomplete video.

Skipping audio normalization

If different segments were rendered on different runs or with slightly different ElevenLabs settings, volume levels may vary. A quick FFmpeg normalization pass before the final export prevents jarring volume jumps.
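One possible normalization pass uses FFmpeg's `loudnorm` filter, sketched here as a Python helper that builds the command. The target values are common defaults rather than anything prescribed by this workflow, and the function name is hypothetical.

```python
from typing import List


def build_normalize_command(src: str, dst: str, target_lufs: float = -16.0) -> List[str]:
    """Build an ffmpeg command that loudness-normalizes audio, copying video."""
    # I = integrated loudness target, TP = true-peak ceiling, LRA = loudness range.
    audio_filter = f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11"
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", audio_filter,
        "-c:v", "copy",  # leave the video stream untouched
        "-c:a", "aac",   # filtered audio has to be re-encoded
        dst,
    ]
```

This is a single-pass use of the filter, which is usually adequate for speech. Running it on each segment with the same target before concatenation keeps the levels consistent across the whole video.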

Frequently Asked Questions

How long does the full one-prompt video workflow take to run?

End-to-end time depends mostly on HeyGen’s render queue. For a 3-minute video split into 4–5 segments, expect 10–20 minutes total: 1–2 minutes for Claude + ElevenLabs, 8–15 minutes for HeyGen rendering, and 1–2 minutes for FFmpeg assembly. High-volume periods on HeyGen’s platform can push render times higher.

Do I need to know how to code to build this?

For a fully custom pipeline, yes — you’d be writing Python or JavaScript to stitch together the APIs. But platforms like MindStudio let you build the same workflow visually without writing API integration code, handling authentication and async logic for you. Basic configuration decisions (which avatar, which voice, output format) don’t require programming knowledge either way.

How realistic do AI-generated avatars look in 2025?

HeyGen’s current generation avatars are convincing in standard conditions: consistent lighting, forward-facing camera angle, professional framing. Rapid head movements or extreme close-ups show more artifacts. For explainer videos, tutorials, and marketing content, the output quality is generally sufficient for professional use. Custom avatars (trained on your own footage) look better than stock avatars for most viewers.

What does this kind of video workflow cost to run?

Rough per-video costs for a 3-minute video:

Claude — less than $0.10 for a 1,000-token script
ElevenLabs — approximately $0.30–0.60 depending on character count and model
HeyGen — pricing is credit-based; expect $2–5 per rendered minute depending on plan
FFmpeg — free; compute cost is minimal

Summing those line items, with HeyGen at $2–5 per rendered minute across three minutes, a fully automated 3-minute video costs roughly $6–16 in API fees at current pricing.

Can I use a custom voice clone instead of a stock voice?

Yes. ElevenLabs allows you to create a cloned voice from 1–3 minutes of clean audio. Once created, the clone is available via API and works identically to stock voices in automated workflows. This is useful for brand consistency — every video sounds like the same person, whether that’s you or a designated brand voice.

What’s the best way to handle B-roll and visual variety?

Pure avatar-to-camera output gets monotonous for videos longer than 90 seconds. Options include: using HeyGen’s background scene features to vary the visual setting per segment, overlaying generated images or screen recordings via FFmpeg, or integrating an image generation step (FLUX or Midjourney via API) to produce visuals for each segment. The visual_note field in Claude’s script output is a good place to store prompts for this.

Key Takeaways

A one-prompt AI video workflow chains Claude, ElevenLabs, HeyGen, and FFmpeg to go from text input to finished video without manual production steps.
Claude acts as both scriptwriter and output formatter — structuring content so it’s machine-readable for downstream tools.
ElevenLabs handles voice synthesis; strip all formatting from input text before calling the API.
HeyGen renders AI avatar video from audio; expect asynchronous job completion and build polling or webhook handling accordingly.
FFmpeg assembles the final video — merging segments, adding captions, and applying overlays.
The hardest part isn’t any single API — it’s managing async state and error handling across the full chain.
MindStudio’s AI Media Workbench lets you build this entire workflow visually, with built-in media tools and async coordination handled automatically. Start free at mindstudio.ai.