AI for ecommerce content: catalog enrichment, color palettes, video, voice

by Federico · May 7, 2026 · 6 min read

AI applied to ecommerce content is the category I get the most questions about, and where I see the most waste. Between “AI agent that runs the whole catalog” and “AI that writes your Shopify descriptions” there is a world of real work nobody talks about: gateways, prompt versioning, model selection by cost tier, measurable acceptance criteria. This article describes the AI content patterns I have shipped on a fashion/footwear catalog — where they earn their keep, where they do not, and how to measure the difference.

Four layers of AI in ecommerce content

When we talk about “AI for the catalog” we are really talking about four very different layers, with different economics and risks.

Text layer: product page enrichment. Descriptions, titles, feature bullets, translations. Most mature layer, easiest to industrialize. The pattern I use is tool-calling: the model calls typed tools (set_title, set_description, add_feature) that validate format and apply brand rules. No broken Markdown, no lengths past Shopify limits.

Vision layer: extracting metadata from images. The most useful case I have shipped is automatic color-palette extraction from product photos, with names in the local language and HEX codes. Not glamorous, but exactly the kind of metadata that powers filters, “available in X colors” badges, automated merchandising. Automatic extraction, validation against a brand palette, human fallback on the images where the model hesitates.

Audio layer: multilingual TTS. Voice-overs for short-form videos, social content, or catalog audio guides. ElevenLabs Multilingual v2 produces broadcast-quality Italian that until a year ago required a professional voice actor. MP3 saved to storage, reusable across the whole video production pipeline.

Image and video layer: generation and enhancement. Virtual staging of products, enhancement of existing photos, lifestyle image generation, cinematic video clip generation. This is the most expensive layer, with the highest error margin, and the one where it is easiest to produce unusable output.

The pattern that works: gateway + tool-calling + versioning

The single decision with the biggest impact on the AI content projects I see is to not call models directly from the front-end or from Make scenarios. Between the app and the model there should be a gateway, and the gateway should host three things: provider routing, typed tool-calling, and a versioned prompt library.

Routing matters because no model wins at everything. For Italian product descriptions I use a different LLM than for color palettes. The gateway abstracts the choice: the app asks “enrich this product”, the gateway decides which model to call based on cost, latency, and historical quality.
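To make the routing idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the production gateway: the model names, costs, latencies, and acceptance rates are invented placeholders, and `ModelProfile`, `REGISTRY`, and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str               # hypothetical model identifier
    cost_per_call: float    # average cost per request, in EUR (placeholder)
    p95_latency_s: float    # observed 95th-percentile latency (placeholder)
    acceptance_rate: float  # historical share of outputs accepted in review


# Hypothetical registry: candidate models per task, with made-up numbers.
REGISTRY = {
    "describe_product_it": [
        ModelProfile("large-model-a", 0.020, 6.0, 0.91),
        ModelProfile("small-model-b", 0.004, 2.0, 0.78),
    ],
    "name_color": [
        ModelProfile("large-model-a", 0.020, 6.0, 0.95),
        ModelProfile("small-model-b", 0.004, 2.0, 0.93),
    ],
}


def route(task, min_acceptance=0.85, max_latency_s=10.0):
    """Return the cheapest model that meets the quality and latency floors."""
    candidates = [
        m for m in REGISTRY[task]
        if m.acceptance_rate >= min_acceptance and m.p95_latency_s <= max_latency_s
    ]
    if not candidates:
        raise LookupError(f"no model meets the thresholds for task {task!r}")
    return min(candidates, key=lambda m: m.cost_per_call)
```

With these placeholder numbers, `route("describe_product_it")` picks the larger model because the small one falls below the quality floor, while `route("name_color")` picks the small one because it is good enough and cheaper. The app never sees that decision.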

Typed tool-calling matters because without it, AI writes text that someone has to parse. With tool-calling the model calls set_title("...") and you receive a validated structure. Sounds minor until you work without it.
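A sketch of what the receiving side of typed tool-calling could look like. The tool names set_title, set_description, and add_feature come from the pattern described earlier; everything else (the limits, the Markdown check, `ProductDraft`, `apply_tool_call`, and the shape of the `call` dictionary) is an assumption for illustration and not tied to any specific provider's API.

```python
import re
from dataclasses import dataclass, field

MAX_TITLE_LEN = 255   # assumption: verify against your store's real limit
MAX_FEATURES = 6      # assumption: stand-in for a brand rule


class ToolError(ValueError):
    """A tool call broke a format or brand rule."""


@dataclass
class ProductDraft:
    title: str = ""
    description: str = ""
    features: list = field(default_factory=list)


def _plain_text(value, max_len, label):
    """Strip the value and reject empty, over-long, or Markdown-looking text."""
    if not isinstance(value, str):
        raise ToolError(f"{label} must be a string")
    text = value.strip()
    if not text or len(text) > max_len:
        raise ToolError(f"{label} must be 1-{max_len} characters")
    if re.search(r"[*_#`]", text):
        raise ToolError(f"{label} must not contain Markdown markup")
    return text


def set_title(draft, text):
    draft.title = _plain_text(text, MAX_TITLE_LEN, "title")


def set_description(draft, text):
    draft.description = _plain_text(text, 5000, "description")


def add_feature(draft, text):
    if len(draft.features) >= MAX_FEATURES:
        raise ToolError("too many feature bullets")
    draft.features.append(_plain_text(text, 120, "feature"))


TOOLS = {"set_title": set_title, "set_description": set_description,
         "add_feature": add_feature}


def apply_tool_call(draft, call):
    """Dispatch one tool call shaped like {"name": ..., "arguments": {...}}."""
    tool = TOOLS.get(call["name"])
    if tool is None:
        raise ToolError(f"unknown tool {call['name']!r}")
    try:
        tool(draft, **call["arguments"])
    except TypeError as exc:  # wrong or missing argument names
        raise ToolError(f"bad arguments for {call['name']}: {exc}") from exc
```

The point of the design is that a rejected call raises a `ToolError` the gateway can send back to the model for a retry, instead of a malformed title reaching the store.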

Prompt versioning — an “AI Library” with a master editor — matters because prompts change: new model, new brand tone rule, new edge case. Without versioning you lose track of “why quality was better two weeks ago”. With it (version, author, diff, rollback tag) every change is auditable.
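The four fields mentioned above (version, author, diff, rollback tag) fit in a very small data model. The following is an in-memory sketch under assumptions: `PromptLibrary` and its methods are hypothetical names, and a real library would persist the history in a database rather than a dictionary.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    version: int
    author: str
    text: str
    tag: Optional[str] = None   # e.g. "stable", used as a rollback target


class PromptLibrary:
    def __init__(self):
        self._history = {}   # prompt name -> list of PromptVersion, oldest first
        self._active = {}    # prompt name -> version number currently served

    def publish(self, name, text, author, tag=None):
        """Append a new version and make it the active one."""
        versions = self._history.setdefault(name, [])
        entry = PromptVersion(len(versions) + 1, author, text, tag)
        versions.append(entry)
        self._active[name] = entry.version
        return entry

    def active(self, name):
        return self._history[name][self._active[name] - 1]

    def diff(self, name, old, new):
        """Return a unified diff between two versions as a list of lines."""
        a = self._history[name][old - 1].text.splitlines()
        b = self._history[name][new - 1].text.splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))

    def rollback(self, name, tag):
        """Re-activate the most recent version carrying the given tag."""
        for entry in reversed(self._history[name]):
            if entry.tag == tag:
                self._active[name] = entry.version
                return entry
        raise KeyError(f"no version of {name!r} tagged {tag!r}")
```

Nothing is ever overwritten: a rollback only moves the active pointer, so the question of why quality was better two weeks ago can be answered by diffing two version numbers.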

Concrete example: color palettes with local-language names

The case I always bring up is palette extraction on a fashion catalog. Every product has one to five photos. For each photo you want to extract the three to five dominant colors, with a name in the local language the customer understands (“sage green”, not “#8FA882”) and a precise HEX code for the front-end.

The naive approach, a single call to a vision LLM that returns everything, works, but it is expensive and the model invents color names that change between calls (“military green” today, “army green” tomorrow). The pattern I have shipped:

1. HEX extraction with a deterministic algorithm (color clustering on the photo, no AI). Fast, free, reproducible.
2. Local-language naming through a “quick lookup” map of roughly 200 canonical colors. 95% of colors land here.
3. AI fallback on the remaining 5%, with a very tight prompt that receives the HEX and a list of allowed names, and has to pick the closest one.
4. Italian-to-English translation with the same logic: a map for common cases (“verde salvia” to “sage green”) and AI fallback for new ones.
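The first step in the list above needs no model at all. As a hedged illustration, the sketch below uses fixed-grid bucketing as a simple stand-in for color clustering: it groups pixels into coarse RGB cells and averages each cell. This is a named substitution, not necessarily the algorithm used on the real catalog, and `dominant_colors`, `k`, and `step` are hypothetical names. Pixels are assumed to arrive as (r, g, b) tuples, however you decode the image.

```python
from collections import defaultdict


def dominant_colors(pixels, k=5, step=32):
    """Return up to k dominant colors as HEX strings, most frequent first.

    pixels: iterable of (r, g, b) tuples with channels in 0..255.
    step:   width of each RGB bucket; larger values merge more shades.
    """
    # bucket key -> [pixel count, sum of r, sum of g, sum of b]
    buckets = defaultdict(lambda: [0, 0, 0, 0])
    for r, g, b in pixels:
        acc = buckets[(r // step, g // step, b // step)]
        acc[0] += 1
        acc[1] += r
        acc[2] += g
        acc[3] += b

    # Sort by count (descending), then by bucket key so ties are reproducible.
    ranked = sorted(buckets.items(), key=lambda kv: (-kv[1][0], kv[0]))

    result = []
    for _, (n, sum_r, sum_g, sum_b) in ranked[:k]:
        # The bucket's representative color is the mean of its pixels.
        result.append("#{:02X}{:02X}{:02X}".format(
            round(sum_r / n), round(sum_g / n), round(sum_b / n)))
    return result
```

Because there is no random initialization, the same photo always yields the same palette, which is the property that matters here: reproducible HEX codes that downstream naming can cache.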

The result is that roughly 95% of colors are named without calling an LLM. Only the problematic residue pays the AI cost, and that residue is also where AI quality is most visible.
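The lookup-then-fallback naming can be sketched as follows. Treat it as an illustration under assumptions: the four-entry `CANONICAL` map stands in for the roughly 200 canonical colors, `MAX_DISTANCE` is an invented threshold, plain Euclidean distance in RGB is a crude substitute for a perceptual color distance, and `llm_pick` is a hypothetical callable wrapping whatever model the gateway routes to.

```python
# Tiny sample of a canonical map: HEX -> (Italian name, English name).
CANONICAL = {
    "#8FA882": ("verde salvia", "sage green"),
    "#000000": ("nero", "black"),
    "#FFFFFF": ("bianco", "white"),
    "#C8102E": ("rosso", "red"),
}
MAX_DISTANCE = 40.0   # assumption: tune on real catalog data


def hex_to_rgb(hex_code):
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def nearest(hex_code):
    """Return (closest canonical HEX, Euclidean RGB distance to it)."""
    target = hex_to_rgb(hex_code)

    def dist(candidate):
        return sum((a - b) ** 2
                   for a, b in zip(hex_to_rgb(candidate), target)) ** 0.5

    best = min(CANONICAL, key=dist)
    return best, dist(best)


def name_color(hex_code, llm_pick=None):
    """Return (italian, english, source) for one extracted color."""
    best, distance = nearest(hex_code)
    if distance <= MAX_DISTANCE:
        return (*CANONICAL[best], "lookup")       # the cheap, common path
    if llm_pick is None:
        return (None, None, "needs_review")       # no model wired in
    by_italian = {it: en for it, en in CANONICAL.values()}
    choice = llm_pick(hex_code, sorted(by_italian))   # model sees allowed names only
    if choice not in by_italian:
        return (None, None, "needs_review")       # reject invented names
    return (choice, by_italian[choice], "llm")
```

The model can only choose from the allowed list, and anything outside it goes to a human. That constraint is what stops “military green” today becoming “army green” tomorrow.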

When AI imagery is worth it, and when it is not

AI imagery (staging, enhancement, generation) is the most fascinating layer and the one with the most volatile ROI. The question to ask: is the unit cost per acceptable image lower than the cost of the manual process it replaces?
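The phrase “per acceptable image” carries most of the weight in that question, so it helps to write the arithmetic down. A minimal sketch, assuming every generated image costs the same to produce and to review, and that rejected images are simply regenerated; the function names and all numbers used with them are hypothetical.

```python
def cost_per_acceptable_image(cost_per_generation, acceptance_rate,
                              review_cost_per_image=0.0):
    """Expected spend to obtain one image that passes review.

    acceptance_rate is the share of generated images that are usable (0..1].
    On average 1 / acceptance_rate generations are needed per accepted image,
    and each of them is paid for and reviewed.
    """
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    expected_attempts = 1 / acceptance_rate
    return expected_attempts * (cost_per_generation + review_cost_per_image)


def ai_is_worth_it(cost_per_generation, acceptance_rate,
                   review_cost_per_image, manual_cost_per_image):
    """True when the AI route is cheaper than the manual process it replaces."""
    return cost_per_acceptable_image(
        cost_per_generation, acceptance_rate, review_cost_per_image
    ) < manual_cost_per_image
```

For example, at 0.10 per generation, 0.40 of review time per image, and one usable image in four, an acceptable image costs 2.00, not 0.10. The low acceptance rate, not the API price, is what makes the ROI volatile.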

For enhancement of existing photos (upscaling, color correction, light cleanup) the math is almost always positive: a few cents per image replacing minutes of Photoshop. Adopt immediately.

For virtual staging — adapting the same product photo to different contexts (the same chair in three room moods) — the math depends on the number of markets. With a single market the manual approach often wins. With five markets and a thousand SKUs, break-even arrives early.
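Why the number of markets decides the outcome becomes clearer with a break-even calculation. This is a sketch under a deliberately simple assumption: the AI route has a one-off setup cost (pipeline, prompts, brand tuning) plus a per-image cost, the manual route has only a per-image cost, and one staged image is needed per SKU per market. All figures are invented placeholders.

```python
import math


def break_even_images(setup_cost, ai_cost_per_image, manual_cost_per_image):
    """Number of images after which the AI route becomes cheaper.

    Returns None when AI is not cheaper per image, so it never catches up.
    """
    saving_per_image = manual_cost_per_image - ai_cost_per_image
    if saving_per_image <= 0:
        return None
    return math.ceil(setup_cost / saving_per_image)


def staging_pays_off(skus, markets, setup_cost,
                     ai_cost_per_image, manual_cost_per_image):
    """True when the required volume (one image per SKU per market) reaches break-even."""
    needed = break_even_images(setup_cost, ai_cost_per_image, manual_cost_per_image)
    return needed is not None and skus * markets >= needed
```

With a placeholder setup cost of 6,000, 0.50 per AI image, and 4.50 per manual image, break-even sits at 1,500 images: a thousand SKUs in one market do not reach it, the same thousand SKUs in five markets pass it more than three times over.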

For lifestyle generation from scratch I am very cautious. Quality has improved enormously but brand consistency is still hard. The use case I see working is A/B tests and moodboards, not hero assets.

For AI video the situation is the same multiplied by ten. It works for high-volume, low-effort social content where each experiment is cheap. It does not yet work for hero brand videos that require perfection.

How to measure ROI

A metric that works, which I have computed on several projects: for the text layer (enrichment plus translations), a 2,000-SKU catalog with descriptions in three languages traditionally takes about 300 hours of copywriting. With the gateway + tool-calling + selective human review pattern (the human copywriter intervenes only on pages above a value threshold or below an AI confidence threshold), hours drop to about 60. In this example that is 80% of the time recovered; 70-80% is the range I see across projects.
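The selective review rule and the time-recovered figure are both one-liners, which is part of why they are easy to agree on up front. In this sketch the two thresholds, the `GeneratedPage` fields, and the way page value and AI confidence are measured are all assumptions; only the 300-hour and 60-hour figures come from the example above.

```python
from dataclasses import dataclass

VALUE_THRESHOLD = 5000.0      # assumption: e.g. yearly revenue of the page
CONFIDENCE_THRESHOLD = 0.80   # assumption: pipeline confidence score, 0..1


@dataclass
class GeneratedPage:
    sku: str
    page_value: float      # hypothetical business value of the product page
    ai_confidence: float   # hypothetical confidence reported by the pipeline


def needs_human_review(page):
    """Route to a copywriter when the page is valuable or the AI is unsure."""
    return (page.page_value > VALUE_THRESHOLD
            or page.ai_confidence < CONFIDENCE_THRESHOLD)


def time_recovered(baseline_hours, new_hours):
    """Share of the traditional effort that is no longer needed."""
    return (baseline_hours - new_hours) / baseline_hours
```

`time_recovered(300, 60)` gives 0.8 for the catalog in the example. Moving either threshold changes both the review queue and that number, so the thresholds belong in versioned configuration, not in someone's head.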

The number not to look at in isolation is “time saved”. The number to look at alongside it is “acceptance rate of generated pages”: how many pages pass review without significant changes. Below 75%, the pipeline is not producing real value; it is just shifting work from writing to correcting. Above 85%, the system is doing what it should.
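Tracking the acceptance rate takes very little code once reviewers record a yes/no per page. A sketch with hypothetical names; the 75% and 85% cut-offs come from the paragraph above, while the "borderline" label for the band in between is my own placeholder.

```python
def acceptance_rate(review_outcomes):
    """Share of generated pages that passed review without significant changes.

    review_outcomes: iterable of booleans, True when the page was accepted.
    """
    outcomes = list(review_outcomes)
    if not outcomes:
        raise ValueError("no reviewed pages yet")
    return sum(outcomes) / len(outcomes)


def verdict(rate):
    """Map an acceptance rate onto the thresholds discussed above."""
    if rate < 0.75:
        return "shifting work"   # writing effort has become correcting effort
    if rate > 0.85:
        return "healthy"
    return "borderline"          # placeholder label for the 75-85% band
```

Computing this per layer and per prompt version is what connects the metric back to the gateway: a drop after a prompt change points straight at the diff that caused it.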

The common mistake

The single mistake I see most is “let’s buy licenses for some AI catalog tool and see what happens”. Without a gateway, without versioning, without measurable acceptance criteria, without human fallback on low-confidence cases, you are in no man’s land. You spend money, produce variable-quality output, and after three months nobody can say whether the system is helping or not.

The alternative pattern — proprietary gateway on Supabase edge functions, versioned prompts, typed tool-calling, acceptance metrics tracked per layer — costs more upfront. It pays back in transparency, governability, and the ability to improve over time. It is the difference between “we have AI in the catalog” and “we know what AI is doing, why, and at what quality”.