
What if the Biggest Problem With AI Video... Isn't the AI?

by Iliya Oblakov · Published on March 28, 2026 · 8 min read

I built a system that creates YouTube videos from scratch. I wrote about the first iteration here. The scripts were wild. The voice narration sounded human. But I was honest about one thing — the visuals were the weak link.
Here’s how I fixed it. And honestly? The second live iteration is a completely different beast.
The Problem Nobody Talks About
Most people blame the AI when an AI-generated video looks bad. The script sounds robotic. The voice is flat. The pacing is off.
But the thing is… that’s not what gives it away. Not anymore.
What gives it away is the visuals. Specifically — stock footage.
The script says “this system handles 10,000 requests per second.” The best stock clip I could find? A generic server room with blinking lights. It doesn’t show the system. It doesn’t show the scale. It just sits there, filling space.
And that’s the problem. Stock footage is decoration. It doesn’t connect to the words. Viewers feel it instantly — even if they can’t explain why. The narration says one thing. The screen shows something vaguely related. That disconnect is what makes AI video look like AI video.
I spent weeks trying to fix this in v1. Smarter keyword extraction. Broader stock libraries. A triple quality gate that checked relevance, quality, and variety for every single clip. It helped. But it was a band-aid.
Stock footage is generic by definition. My scripts aren’t. So I had a choice — keep polishing a system with a hard ceiling, or burn it down and build something that could actually grow.
I chose fire.
What I Actually Built
I ripped out the entire visual layer. Stock APIs — gone. FFmpeg filter graphs — gone. The old Python-based card overlays — gone.
In their place? A React-based rendering engine called Remotion. Instead of assembling clips through command-line video tools, I’m now building videos as React components. TypeScript. Declarative animations. Full control over every pixel, every frame.
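The core idea behind Remotion fits in a few lines: every visual property is a pure function of the current frame number. Remotion supplies that frame through a hook; this standalone sketch just takes it as a parameter, so the names and constants here are illustrative rather than the project's actual code.

```typescript
// Declarative animation in a nutshell: a visual property (here, opacity)
// is a pure function of the frame number. No timelines, no keyframe files —
// just math over frames.
function fadeInOpacity(
  frame: number,
  startFrame: number,
  durationFrames: number
): number {
  const t = (frame - startFrame) / durationFrames;
  return Math.min(1, Math.max(0, t)); // clamp to [0, 1]
}
```

At 30 fps, `fadeInOpacity(frame, 30, 15)` fades an element in over half a second, starting one second into the scene — and because it's just a function, it's trivially testable outside the renderer.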
Here’s what that means in practice.
Every visual is generated from the script. Code cards with syntax highlighting. Terminal windows with typing animations. Architecture diagrams that build themselves node by node. Summary boards with progressive bullet reveals. There’s no stock footage API call. No hoping a search returns something useful. The visuals come directly from what the script describes.
Animations sync to the narrator’s voice. This is the part that gets me excited. The voice engine returns word-level timestamps. The renderer uses those timestamps to drive every text reveal, every card animation, every camera move. When the narrator says “and here’s the architecture,” the diagram starts building at that exact moment. Not a second before. Not a second after.
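The timestamp-to-frame mapping is simple once the voice engine hands back word timings. The data shape and function names below are assumptions for illustration, not the pipeline's real API:

```typescript
// Sketch: driving text reveals from word-level voice timestamps.
type Word = { text: string; startSec: number };

// Convert a word's audio timestamp into the frame where its reveal begins.
const wordStartFrame = (w: Word, fps: number): number =>
  Math.round(w.startSec * fps);

// All words whose reveal has started by the given frame.
const wordsVisibleAt = (frame: number, fps: number, words: Word[]): Word[] =>
  words.filter((w) => wordStartFrame(w, fps) <= frame);

// Hypothetical timing data for one narrated phrase.
const narration: Word[] = [
  { text: "and", startSec: 0.0 },
  { text: "here's", startSec: 0.25 },
  { text: "the", startSec: 0.5 },
  { text: "architecture", startSec: 0.7 },
];
```

At 30 fps, frame 20 sits about 0.66 seconds into the audio, so `wordsVisibleAt(20, 30, narration)` returns the first three words — "architecture" lands on frame 21, exactly when the narrator says it.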
The camera never stops. The old system had preset zoom levels based on mood. Reflective scenes got a gentle drift. Excited scenes got a bigger push. But it was mechanical — the same motion every time. The new system tracks focus targets dynamically. It drifts to the active code line. Zooms into the highlighted node. Pulls back for overview shots. There’s no dead time. The frame is always alive.
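One way to implement focus tracking is a list of timestamped focus targets that the camera interpolates between. This is a minimal linear sketch under assumed names — the real system presumably uses eased or spring-damped motion rather than straight lines:

```typescript
// Sketch: a camera that drifts between focus targets over time.
type Focus = { frame: number; x: number; y: number; zoom: number };

// Linearly interpolate camera state between the surrounding keyframes,
// clamping before the first target and after the last.
function cameraAt(
  frame: number,
  targets: Focus[]
): { x: number; y: number; zoom: number } {
  if (frame <= targets[0].frame) return targets[0];
  const last = targets[targets.length - 1];
  if (frame >= last.frame) return last;
  const i = targets.findIndex((t) => t.frame > frame);
  const a = targets[i - 1];
  const b = targets[i];
  const t = (frame - a.frame) / (b.frame - a.frame);
  return {
    x: a.x + (b.x - a.x) * t,
    y: a.y + (b.y - a.y) * t,
    zoom: a.zoom + (b.zoom - a.zoom) * t,
  };
}
```

Because the camera is just a function of the frame, "no dead time" falls out for free: between any two targets the frame is always in motion toward the next focus point.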
Everything moves with natural bounce. Elements don’t just appear — they spring into place. Bullet points cascade with staggered physics. Diagram nodes bounce into position. It’s subtle. But it’s the difference between a slideshow and something that feels produced.
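The staggered bounce can be sketched with two ingredients: a per-item delay and an easing curve that overshoots past 1.0 before settling. The constants and the ease-out-back curve below are illustrative stand-ins for whatever spring physics the renderer actually uses:

```typescript
// Sketch: staggered "bounce in" progress for a list of items.
const STAGGER_FRAMES = 6; // each item starts 6 frames after the previous one

// Ease-out-back: overshoots 1.0 slightly, then settles — a spring-like bounce.
function easeOutBack(t: number): number {
  const c1 = 1.70158;
  const c3 = c1 + 1;
  return 1 + c3 * Math.pow(t - 1, 3) + c1 * Math.pow(t - 1, 2);
}

// Animation progress of item `i` at `frame`; 0 before its turn, 1 once settled,
// briefly above 1 mid-animation (that's the bounce).
function itemProgress(frame: number, i: number, durationFrames = 15): number {
  const local = frame - i * STAGGER_FRAMES;
  if (local <= 0) return 0;
  if (local >= durationFrames) return 1;
  return easeOutBack(local / durationFrames);
}
```

Map the progress value to scale or translate-Y and each bullet springs into place a beat after its neighbor — the cascade effect is just the `i * STAGGER_FRAMES` offset.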
The Visual Toolkit
Let me walk you through what the pipeline generates now.
Code Cards — These look like a real code editor. Syntax highlighting. Character-by-character reveal with a cursor tracking across the screen. Focus lines that draw your eye to the key part. And here’s what’s wild — the card “breathes” with the narrator’s voice. When the voice gets louder, the UI responds. When it drops, the UI calms down.
Terminal Cards — Terminal windows where commands type themselves out at realistic speed. Output lines reveal one by one. It looks like someone is actually working — not a screenshot pretending to be interactive.
Architecture Diagrams — Node layouts that build themselves while the narrator explains. Each node gets an icon and a label, and curved connectors draw between them. The whole thing assembles with staggered spring animations, piece by piece.
Summary Boards — Clean bullet reveals with glass-panel styling. Each bullet springs in as the narrator hits the talking point. An accent bar tracks progress through the list.
Kinetic Text — Full-screen text with word-by-word reveal synced to the voice. Key phrases get bold uppercase emphasis with scale pops. Impossible to miss — and timed to land at the exact right moment.
Section Titles — Full-screen cards that mark transitions between major parts of the video. Breathing room. Structure. They help viewers track where they are.
Every one of these is a React component. Every one responds to the audio timeline. No stock footage. No guessing.
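The "breathes with the voice" trick from the code cards reduces to a single mapping: audio amplitude in, UI scale out. The constants below are assumptions chosen to stay subtle; the real ranges are whatever the author tuned:

```typescript
// Sketch: mapping narration loudness to a subtle UI "breathe" scale.
// amplitude is normalized to 0..1 (silence..peak); output is a CSS-style
// scale factor kept deliberately small so the effect is felt, not seen.
function breatheScale(amplitude: number): number {
  const clamped = Math.min(1, Math.max(0, amplitude)); // guard bad input
  return 1 + clamped * 0.03; // 1.00 at silence, 1.03 at peak loudness
}
```

Feed it a per-frame amplitude sampled from the narration track and the card swells slightly on loud words and relaxes in pauses — the same audio timeline that drives the text reveals drives this too.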
The Script Pipeline Still Runs Everything
The visual layer is new. But the brain behind the scripts? That’s the same three-stage editorial pipeline — and honestly, it matters even more now.
The Writer creates the first draft from the research. The Reviewer tears it apart — flags robotic language, weak hooks, pacing issues, missing structure, and false authority claims the AI might’ve made up. The Director polishes it — adds natural connectors, enforces contractions, fixes readability, tightens the arc from hook to payoff.
Here’s why this matters more now: when every visual is generated from the script’s directives, the script IS the video. A well-structured script with clear visual cues produces a clean video. A sloppy script produces visual noise.
The quality gates haven’t changed. Readability capped at 7th-grade level. Anti-plagiarism checks. Hook validation. Mood distribution rules. All the guardrail tests still pass before anything renders.
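The gate pattern itself is simple: every check is a named predicate, and nothing renders until the failure list is empty. The two checks below are illustrative stand-ins, not the pipeline's actual readability or hook rules:

```typescript
// Sketch of the quality-gate pattern: a script must pass every gate
// before rendering. Real gates (readability grade, plagiarism, hook
// validation, mood distribution) would slot into the same shape.
type Gate = { name: string; pass: (script: string) => boolean };

const gates: Gate[] = [
  // Stand-in hook check: reject empty scripts and boilerplate openers.
  {
    name: "has-hook",
    pass: (s) => s.trimStart().length > 0 && !s.startsWith("In this video"),
  },
  // Stand-in readability proxy: cap sentence length at 25 words.
  {
    name: "max-sentence-length",
    pass: (s) =>
      s.split(/[.!?]/).every((x) => x.trim().split(/\s+/).length <= 25),
  },
];

// Returns the names of every failed gate; empty array means render away.
function runGates(script: string): string[] {
  return gates.filter((g) => !g.pass(script)).map((g) => g.name);
}
```

Because the gates run before the renderer, a failing script costs seconds, not a wasted render — the same guardrail economics as any CI pipeline.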
What changed is that the visuals finally deserve the scripts.
The Honest Take
I’ve been straight about where things stand since day one. That’s not changing.
Scripts? Still excellent. Three editorial passes is the right call. Every iteration confirms it.
Voice? Better than v1. Mood-adaptive settings are more refined. Pausing feels natural. Emphasis hits harder.
Visuals? This is where the massive leap happened. Going from stock footage to audio-synced generated visuals isn’t a small upgrade — it’s a category change. The videos look intentional now. Every frame connects to what’s being said.
Is it perfect? No. The renderer is still being refined. Some transitions need more polish. The camera system has room to grow. But here’s the difference — every improvement compounds now. I’m not fighting the ceiling of “whatever stock footage exists.” I’m building on a foundation that gets better with every change.
That’s what makes this exciting.
What’s Next
The foundation is solid. Now it’s about depth.
  • Audio-reactive UI — Components that pulse with the narrator’s voice. Some of this already works. More is coming.
  • Smarter choreography — Richer camera sequences and cursor movements driven straight from script directives.
  • Pacing bursts — Rapid 1.5-second cuts during high-energy moments, contrasted with slower reveals during explanations.
  • More visual types — Comparison boards, pillar layouts, and hero scenes for more diversity across longer videos.
The goal hasn’t changed: get to where someone watches the output and can’t tell whether a human edited it.
After this iteration? That goal feels a lot closer.
Watch It Yourself
Compare it to the first iteration. The difference speaks for itself.
If you want to follow the full progression — technical breakdowns, video comparisons, and what’s coming next — check my portfolio where I document everything.
About Me
I’m a full-stack developer. I build web products, SaaS tools, and automation systems — mostly with Ruby on Rails and Next.js. This project sits at the intersection of everything I enjoy: backend architecture, API integrations, AI workflows, and solving real problems.
Honestly? I’m more excited about this project now than when I started. And the question I keep coming back to: How good can this actually get?
We’re about to find out.
Iliya Oblakov is a freelance full-stack developer specializing in Ruby on Rails and Next.js development, based in Bulgaria. He builds web products, SaaS tools, and automation systems for clients worldwide.
