AI Video for Ecommerce: from generation to delivery in an integrated pipeline

by Federico | January 15, 2026 | 6 min read

ai
video
ecommerce

Over the last eighteen months, AI-generated video has shifted from conference-demo curiosity to working marketing tool. What makes the difference, as is often the case, is integration: a Kling clip rendered in five minutes is worth little if it then has to be downloaded by hand, captioned by hand, exported by hand and uploaded to Meta Ads by hand. The real value emerges when the entire pipeline — from prompt to delivery — becomes a single governed flow.

In this article I describe a pattern I implemented recently for a mid-market fashion brand that was producing an average of twenty video variants per month, at an internal hourly cost that was hard to justify. The goal was to reduce time-to-publish for each AI video asset from roughly four hours to under forty minutes, while keeping creative control over the prompt and traceability over performance.

The starting point: where the chain breaks

When a marketing team starts using tools like Kling, Runway or Sora on a continuous basis, the bottleneck moves quickly. It is no longer the generation of the video — the model handles that in minutes — but everything around it: writing prompts consistent with brand identity, managing asynchronous jobs (renders take one to eight minutes), adding synchronized captions, exporting in the formats required by advertising platforms, and, most importantly, understanding two weeks later which videos actually generated ROAS.

Most companies solve this problem with a single person acting as a manual glue layer. It works as long as volumes are low. Above fifteen to twenty clips per week, the glue layer becomes the real operational cost.

The architecture: nine components, one flow

The pipeline I describe is made up of nine functional blocks, orchestrated by Supabase Edge Functions in Deno with a Lovable frontend for control. Each block is isolated, idempotent, and communicates with the others via events.

The first block is a cinematic prompt generator. Rather than letting the team write prompts freely, we defined four recurring clip types (product showcase, before/after, features highlight, lifestyle story) and, for each one, a Gemini template that incorporates the color palette, the brand’s photographic style and product context. The marketer chooses type and product, and the template produces the cinematic prompt. This removes most of the unwanted variability between videos.
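To make the idea concrete, here is a minimal sketch of how the text handed to the template could be assembled. The four clip types come from the project; the field names, the shot directions and the function name are illustrative assumptions, and the call to Gemini is left out.

```typescript
// Illustrative sketch: type names, fields and direction texts are assumptions,
// not the templates used in the project.
type ClipType =
  | "product_showcase"
  | "before_after"
  | "features_highlight"
  | "lifestyle_story";

interface BrandContext {
  palette: string[]; // localized color names, e.g. "cipria"
  photographicStyle: string;
}

interface Product {
  sku: string;
  name: string;
  description: string;
}

// One fixed direction per clip type keeps marketers from free-writing prompts.
const SHOT_DIRECTIONS: Record<ClipType, string> = {
  product_showcase: "Slow orbit around the product on a clean set.",
  before_after: "A 'before' beat and an 'after' beat separated by a hard cut.",
  features_highlight: "Three close-up shots, one per key feature.",
  lifestyle_story: "Handheld shots of the product worn in an everyday setting.",
};

// Builds the structured text that would be passed to the prompt model.
function buildTemplateInput(
  type: ClipType,
  product: Product,
  brand: BrandContext,
): string {
  return [
    `Clip type: ${type}`,
    `Direction: ${SHOT_DIRECTIONS[type]}`,
    `Product: ${product.name} (${product.sku}). ${product.description}`,
    `Color palette: ${brand.palette.join(", ")}`,
    `Photographic style: ${brand.photographicStyle}`,
  ].join("\n");
}
```

The marketer only picks the clip type and the product. Everything else comes from brand configuration, which is what makes code review on templates possible.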

The second block is the Kling orchestrator. This is the most technically delicate part: Kling v3 Omni is asynchronous, renders take minutes, and you have to manage submit, polling and callback without saturating the Edge workers. The pattern I adopted is submit-and-forget with webhook callback: the Edge Function launches the job, saves the reference, and when the webhook arrives with the ready video, it starts the next phase. For multi-clip jobs, the orchestrator handles reuse of clips already generated for the same SKU, avoiding spending credits again on assets that already exist.
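The submit-and-forget flow can be sketched as follows. This is an in-memory illustration under stated assumptions: `SubmitFn` is a hypothetical stand-in for the render client (it is not Kling's actual API), and in the real pipeline the two maps would be database tables.

```typescript
type JobStatus = "pending" | "ready" | "failed";

interface RenderJob {
  jobId: string;
  key: string; // `${sku}:${promptHash}`, identifies the asset
  status: JobStatus;
  videoUrl?: string;
}

// Hypothetical client signature: submits a render and resolves to a job id.
type SubmitFn = (prompt: string, callbackUrl: string) => Promise<string>;

class RenderOrchestrator {
  private jobs = new Map<string, RenderJob>();
  private byAsset = new Map<string, Promise<RenderJob>>();
  private submit: SubmitFn;
  private callbackUrl: string;

  constructor(submit: SubmitFn, callbackUrl: string) {
    this.submit = submit;
    this.callbackUrl = callbackUrl;
  }

  // Submits once per (SKU, prompt) pair; repeated requests reuse the same job.
  requestClip(sku: string, prompt: string, promptHash: string): Promise<RenderJob> {
    const key = `${sku}:${promptHash}`;
    const existing = this.byAsset.get(key);
    if (existing) return existing; // no second submit, no extra credits

    const created = this.submit(prompt, this.callbackUrl).then((jobId) => {
      const job: RenderJob = { jobId, key, status: "pending" };
      this.jobs.set(jobId, job);
      return job;
    });
    // If the submit itself fails, forget the key so a later retry can resubmit.
    created.catch(() => this.byAsset.delete(key));
    this.byAsset.set(key, created);
    return created;
  }

  // Webhook handler: returns true only for the first delivery of a pending job,
  // so duplicate webhooks never trigger the next phase twice.
  handleWebhook(jobId: string, status: "ready" | "failed", videoUrl?: string): boolean {
    const job = this.jobs.get(jobId);
    if (!job || job.status !== "pending") return false;
    job.status = status;
    job.videoUrl = videoUrl;
    if (status === "failed") this.byAsset.delete(job.key); // allow a re-render
    return true;
  }
}
```

The boolean returned by `handleWebhook` is the idempotency guard: the caller starts captioning only when it is `true`.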

The third block is AI captioning, integrated with ElevenLabs Scribe v2 for word-level transcription. Word-level matters: it is needed to generate karaoke-style captions where each word lights up in sync with the voice-over. For Italian voice-over videos we use ElevenLabs Multilingual v2, whose output is close enough to broadcast quality to be acceptable for the Italian market without going through a professional studio.
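As an illustration of why word-level timing matters, here is a sketch that turns timed words into karaoke lines in ASS subtitle format, where each `\k` tag carries a duration in centiseconds. The `Word` shape (text plus start and end in seconds) is an assumption about the transcription output, and the grouping thresholds are arbitrary defaults.

```typescript
interface Word { text: string; start: number; end: number } // times in seconds
interface Cue { start: number; end: number; words: Word[] }

// Splits the transcript into short caption lines: a new line starts when the
// current one is full or when the speaker pauses longer than maxGap seconds.
function groupWords(words: Word[], maxWords = 4, maxGap = 0.6): Cue[] {
  const cues: Cue[] = [];
  let current: Word[] = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (prev && (current.length >= maxWords || w.start - prev.end > maxGap)) {
      cues.push({ start: current[0].start, end: prev.end, words: current });
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) {
    const last = current[current.length - 1];
    cues.push({ start: current[0].start, end: last.end, words: current });
  }
  return cues;
}

// Formats seconds as an ASS timestamp: H:MM:SS.cc
function assTime(seconds: number): string {
  const cs = Math.round(seconds * 100);
  const pad = (n: number) => (n < 10 ? "0" : "") + n;
  const h = Math.floor(cs / 360000);
  const m = Math.floor((cs % 360000) / 6000);
  const s = Math.floor((cs % 6000) / 100);
  return `${h}:${pad(m)}:${pad(s)}.${pad(cs % 100)}`;
}

// Emits one Dialogue line; an empty \k segment covers any pause before a word
// so the highlight switches when the word is actually spoken.
function toAssDialogue(cue: Cue): string {
  let cursor = cue.start;
  const parts = cue.words.map((w) => {
    const gap = Math.round((w.start - cursor) * 100);
    const dur = Math.max(0, Math.round((w.end - w.start) * 100));
    cursor = w.end;
    return `${gap > 0 ? `{\\k${gap}}` : ""}{\\k${dur}}${w.text}`;
  });
  return `Dialogue: 0,${assTime(cue.start)},${assTime(cue.end)},Default,,0,0,0,,${parts.join(" ")}`;
}
```

With sentence-level timestamps only the first function would be possible. The per-word highlight needs a start and an end for every word.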

The fourth block is the FFmpeg export. Here the architectural choice was to run FFmpeg in the background on a dedicated worker rather than inside the main Edge Function, because exporting a 1080p vertical video with captions can take thirty to sixty seconds, and blocking an Edge Function for that long is an anti-pattern.
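A sketch of the export step, limited to building the FFmpeg argument list that the worker would then pass to a child process. The encoder settings (libx264, CRF 20, AAC at 128k) are reasonable defaults I am assuming for illustration, not the project's tuned values, and the `subtitles` filter requires an FFmpeg build with libass.

```typescript
interface ExportSpec {
  input: string; // rendered clip
  captions: string; // ASS or SRT file to burn in
  output: string;
  width?: number; // defaults to 1080x1920 (vertical)
  height?: number;
}

// Builds the argument list for a vertical export with burned-in captions.
function buildFfmpegArgs(spec: ExportSpec): string[] {
  // Filter arguments treat these characters specially; reject them instead of
  // attempting escaping in this sketch.
  if (/[:,'\\\[\]]/.test(spec.captions)) {
    throw new Error(`captions path needs escaping: ${spec.captions}`);
  }
  const w = spec.width ?? 1080;
  const h = spec.height ?? 1920;
  const filter = [
    // Fit inside the target frame without distortion, then pad to exact size.
    `scale=${w}:${h}:force_original_aspect_ratio=decrease`,
    `pad=${w}:${h}:(ow-iw)/2:(oh-ih)/2`,
    `subtitles=${spec.captions}`,
  ].join(",");
  return [
    "-y",
    "-i", spec.input,
    "-vf", filter,
    "-c:v", "libx264", "-preset", "medium", "-crf", "20",
    "-pix_fmt", "yuv420p", // broad player compatibility
    "-c:a", "aac", "-b:a", "128k",
    "-movflags", "+faststart", // moov atom first, so playback starts sooner
    spec.output,
  ];
}
```

Keeping the argument builder as a pure function makes it easy to test without running FFmpeg. The Edge Function only enqueues the spec, and the worker does the spawning.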

For interactive preview we added Remotion on the client side, which lets the marketer see the final composition (video + audio + captions + overlay) without waiting for the server-side render. This alone cut the number of re-renders by about 30%, because composition errors get caught earlier.

The part most teams skip: measuring the output

Generating AI video is easy. Understanding which ones work is the hard part. To close the loop I built a marketing video analytics dashboard that cross-references each clip’s metadata (type, product, dominant color palette, length, presence of voice-over) with Meta Ads and Google Ads performance, joined through UTM parameters and the platforms’ native APIs. After six weeks of data, the brand could see, for example, that before/after videos under fifteen seconds had a median CTR roughly 40% higher than thirty-second lifestyle stories, and recalibrated the prompt mix accordingly.
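The comparison behind a finding like that one is a grouped median. A minimal sketch, assuming the ad-platform numbers have already been joined to clip metadata; the record shape and function names are hypothetical.

```typescript
interface ClipStats {
  clipId: string;
  type: string; // e.g. "before_after"
  lengthSec: number;
  impressions: number;
  clicks: number;
}

// Median is used rather than mean so one viral clip does not skew a group.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Groups clips by an arbitrary key and returns the median CTR of each group.
function medianCtrBy(
  clips: ClipStats[],
  keyOf: (c: ClipStats) => string,
): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const c of clips) {
    if (c.impressions === 0) continue; // CTR undefined without impressions
    const key = keyOf(c);
    const ctrs = groups.get(key) ?? [];
    ctrs.push(c.clicks / c.impressions);
    groups.set(key, ctrs);
  }
  const result = new Map<string, number>();
  groups.forEach((ctrs, key) => result.set(key, median(ctrs)));
  return result;
}
```

A key function such as ``(c) => `${c.type}:${c.lengthSec < 15 ? "short" : "long"}` `` gives the type-by-length split. Because the key is arbitrary, the same function also answers questions about palette or voice-over.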

Another useful piece is automatic color palette extraction from product photos, with color names localized in Italian (e.g. “cipria”, “antracite”, “verde salvia”). The palette feeds into the Kling prompt and keeps the hero shot and the AI video visually consistent, without anyone having to write “main color is dusty pink” by hand every time.
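The naming step can be sketched as a nearest-color lookup. This assumes the dominant colors have already been extracted from the photo as RGB triples; the reference RGB values below are my own approximations of the three names, not the brand's swatches, and plain RGB distance is a simplification of perceptual color matching.

```typescript
type Rgb = [number, number, number];

// Assumed reference values; a real table would be curated with the brand team.
const NAMED_COLORS: { name: string; rgb: Rgb }[] = [
  { name: "cipria", rgb: [232, 196, 192] },
  { name: "antracite", rgb: [56, 62, 66] },
  { name: "verde salvia", rgb: [156, 175, 136] },
];

// Squared Euclidean distance in RGB space.
function distanceSq(a: Rgb, b: Rgb): number {
  const dr = a[0] - b[0];
  const dg = a[1] - b[1];
  const db = a[2] - b[2];
  return dr * dr + dg * dg + db * db;
}

// Returns the localized name of the closest reference color.
function nearestName(color: Rgb): string {
  let best = NAMED_COLORS[0];
  for (const candidate of NAMED_COLORS) {
    if (distanceSq(color, candidate.rgb) < distanceSq(color, best.rgb)) {
      best = candidate;
    }
  }
  return best.name;
}

// Names a whole palette, dropping duplicates while preserving order.
function namePalette(colors: Rgb[]): string[] {
  const names: string[] = [];
  for (const c of colors) {
    const name = nearestName(c);
    if (names.indexOf(name) === -1) names.push(name);
  }
  return names;
}
```

The output of `namePalette` is the kind of list that can be dropped straight into the prompt template as the color palette field.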

When it makes sense, when it doesn’t

To be candid: this type of pipeline has a break-even point below which it does not pay to build. Under ten to fifteen clips per month, a freelance video editor on call is cheaper and probably produces creatively better results. The AI pipeline becomes rational when you exceed twenty variants per month, when you need rapid variations for advertising A/B tests, or when you manage multi-product catalogs that would require physical shoots that are unrealistic to plan.

It should also be said that Kling’s quality, good as it is, does not yet replace a professional shoot for the brand’s hero assets. The rule I adopted with the client was: homepage hero, top-of-funnel campaigns and editorial content remain classical production; all the middle-of-funnel performance material — Meta Ads variants, creative tests, retargeting — goes through the AI pipeline.

Take-aways for anyone evaluating a similar investment

The first point is structural: treat AI video as a software pipeline, not as a tool. That means code reviews on prompt templates, monitoring on failed renders, observability on credit costs. Without these, three months in you end up with an unmanageable Drive full of MP4s and no idea how much each clip is costing you.

The second point is organizational: you need a product owner for the pipeline, even part-time, who bridges the marketing team and the tech team. Without this person, templates age, prompts fragment, and visual consistency is lost within a quarter.

The third point is about the stack: Kling for rendering, Gemini for prompt engineering, ElevenLabs for audio and transcription, FFmpeg for export, Remotion for preview, Supabase Edge Functions as the glue. Replacing one of these components is feasible; changing three at the same time will cost you more than the benefit.

The fourth point is the most important: always measure return at the asset level, not at the pipeline level. The pipeline exists to let you discover faster what works. If you do not close the loop on performance metrics, you are just producing more content, not better content.