I Built a System That Creates YouTube Videos From Scratch. Here's What Happened.
By Iliya Oblakov · Published March 17, 2026 · 9 min read

A few weeks ago I had one of those 2 AM ideas that wouldn’t let me sleep.
What if I could build something that takes a single example — just a transcript from an existing video — and turns it into a completely new, original YouTube video? Not a clone. Not a remix. A genuinely new piece of content, with its own script, its own voice, its own visuals, all stitched together automatically.
So I built it. And honestly? The results surprised even me.
The Problem I Was Trying to Solve
If you’ve ever tried to run a YouTube channel — or any content channel, really — you know the pain. Scripting alone takes hours. Then you need voiceover, visuals, editing, music, pacing… Each video is a small production. For a solo creator or a small team, it’s brutal.
I wanted to see if AI and automation could handle the heavy lifting while still producing something that actually looks and sounds good. Not AI slop. Not those robotic text-to-speech compilations with random stock footage. Something you’d actually watch.
How It Works
Here’s the full pipeline, step by step.
1. Research
You feed the system a single example transcript. It reads it, understands the topic, and then goes off to research that subject on its own. Think of it like handing a brief to a very fast, very focused research assistant. They come back with their own angle, their own findings — not a copy of what you gave them.
The system scores potential topics, weighs what’s performing well in a given niche, and picks the strongest direction. It’s not random. There’s actual logic behind what it decides to write about.
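To make the idea concrete, here's a minimal sketch of what that scoring logic could look like. The signals and weights here are my illustration, not the system's actual formula:

```python
from dataclasses import dataclass

@dataclass
class TopicCandidate:
    title: str
    relevance: float          # how closely it matches the source transcript's subject (0-1)
    niche_performance: float  # how well similar topics perform in the niche (0-1)
    freshness: float          # how recently this angle has been covered (0-1, higher = fresher)

def score_topic(t: TopicCandidate) -> float:
    """Weighted blend of the signals; the weights are placeholders."""
    return 0.5 * t.relevance + 0.3 * t.niche_performance + 0.2 * t.freshness

def pick_strongest(candidates: list[TopicCandidate]) -> TopicCandidate:
    return max(candidates, key=score_topic)
```

The point is just that "picks the strongest direction" means a deterministic ranking over scored candidates, not a dice roll.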
2. Script Writing (With a Real Editorial Process)
This is where I leaned heavily on AI — specifically Claude. But here’s the thing: one AI pass isn’t enough. Anyone who’s used ChatGPT or Claude knows that first drafts are… first drafts. They need work.
So I built a three-stage editorial loop:
- The Writer creates the initial script from scratch based on the research.
- The Reviewer reads it with fresh eyes and tears it apart — structure, pacing, clarity, engagement, everything.
- The Director takes the feedback and polishes the final version. Tightens sentences. Fixes rhythm. Makes it sound human.
Three separate passes, three different jobs. If the script doesn’t hit the bar, it goes back.
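The loop above can be sketched in a few lines. This is a simplified illustration, assuming a `generate(role, prompt)` helper that wraps whatever LLM you call; the real prompts and bar check are more involved:

```python
def editorial_loop(research: str, generate, max_rounds: int = 3) -> str:
    """Writer drafts, Reviewer critiques, Director revises.
    If the Director's version still misses the bar, it goes back in."""
    draft = generate("writer", f"Write a video script based on:\n{research}")
    for _ in range(max_rounds):
        feedback = generate("reviewer", f"Critique structure, pacing, clarity:\n{draft}")
        draft = generate("director", f"Revise using this feedback:\n{feedback}\n\nScript:\n{draft}")
        verdict = generate("reviewer", f"Does this meet the bar? yes/no:\n{draft}")
        if verdict.strip().lower().startswith("yes"):
            break
    return draft
```

The key design choice is that the three roles never share a prompt: the Reviewer only ever sees the draft, so its critique stays honest.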
On top of that, there’s a built-in originality check. The system measures how similar the new script is to the original source material. If there’s too much overlap — even at the phrase level — it gets flagged and reworked. This isn’t about spinning someone else’s content. It’s about creating something genuinely new.
There’s also a readability check. If the script reads above a seventh-grade level, it gets simplified. Good content is content people can follow without effort. Nobody wants to feel like they’re reading a textbook.
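The standard way to measure this is a readability formula like Flesch-Kincaid grade level. Here's a rough self-contained version using a crude syllable heuristic; production readability libraries do the counting more carefully:

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level of the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

def too_complex(text: str, max_grade: float = 7.0) -> bool:
    return fk_grade(text) > max_grade
```

Anything scoring above the grade cap gets sent back for simplification.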
3. Voice Narration
Each scene gets narrated through a text-to-speech API. But here’s the part I’m genuinely proud of — the voice isn’t flat.
Every scene in the script has a mood tag. Excited. Serious. Curious. Urgent. Reflective. The narration engine reads that tag and adjusts how it speaks. Excited parts are a little faster, a little more energetic. Reflective parts slow down, get quieter. It’s subtle, but it makes a massive difference. It’s the difference between a video that sounds like a robot reading a teleprompter and one that sounds like someone actually talking to you.
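In code, that mood-to-delivery mapping can be as simple as a lookup table. The numbers below are illustrative placeholders; real TTS APIs expose different knobs (speaking rate, stability, pitch), so treat these as stand-in parameters:

```python
# Illustrative mood -> delivery profiles; not the system's actual values.
MOOD_PROFILES = {
    "excited":    {"rate": 1.10, "volume_db": +2.0},
    "serious":    {"rate": 0.95, "volume_db":  0.0},
    "curious":    {"rate": 1.00, "volume_db": +1.0},
    "urgent":     {"rate": 1.15, "volume_db": +2.5},
    "reflective": {"rate": 0.88, "volume_db": -2.0},
}

def narration_settings(mood: str) -> dict:
    """Fall back to a neutral delivery for unknown tags."""
    return MOOD_PROFILES.get(mood, {"rate": 1.0, "volume_db": 0.0})
```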
4. Visual Sourcing (The Triple Quality Gate)
For every scene, the system connects to stock video and image libraries and searches for footage that matches the content. Not random B-roll. It uses keywords extracted directly from the narration to find relevant visuals.
Here’s where I got obsessive. Every visual goes through three quality gates:
- Relevance — Does this clip actually match what’s being talked about?
- Quality — Is it high resolution? Is it visually clean?
- Variety — Have we already used something too similar in a previous scene?
If a clip fails any gate, the system automatically runs another search. And another. And another. It doesn’t settle. This was non-negotiable for me because bad visuals are usually what makes AI-generated content instantly recognizable as AI-generated.
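The gate-and-retry loop looks roughly like this. The thresholds and clip fields are my illustration, and `search(query, attempt)` stands in for the stock-library API call:

```python
def passes_gates(clip: dict, used_tags: set) -> bool:
    relevant = clip["relevance"] >= 0.7             # matches the narration keywords
    quality = clip["height"] >= 1080                # resolution floor
    varied = not (set(clip["tags"]) & used_tags)    # not too similar to earlier picks
    return relevant and quality and varied

def find_clip(search, query: str, used_tags: set, max_attempts: int = 5):
    """Re-search until a clip clears all three gates, or give up."""
    for attempt in range(max_attempts):
        clip = search(query, attempt)
        if clip and passes_gates(clip, used_tags):
            used_tags.update(clip["tags"])
            return clip
    return None
```

Note that `used_tags` persists across scenes, which is what enforces the variety gate over the whole video rather than per search.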
The system also calculates exactly how many visual clips each scene needs based on the audio duration. Longer scenes get more cuts. Shorter ones stay focused. This keeps the pacing feeling natural — not too fast, not too slow.
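That calculation amounts to a cut-rate target with floor and ceiling. A minimal sketch, with illustrative numbers:

```python
import math

def clips_for_scene(audio_seconds: float, seconds_per_cut: float = 5.0,
                    min_clips: int = 1, max_clips: int = 8) -> int:
    """Target roughly one cut every few seconds; bounds keep pacing sane."""
    return max(min_clips, min(max_clips, math.ceil(audio_seconds / seconds_per_cut)))
```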
5. Video Assembly
This is where the pieces come together. A set of Python scripts takes everything — the narrated audio, the visual clips, background music — and assembles them scene by scene.
Each scene is rendered individually first. Transitions between clips within a scene. Transitions between scenes. Subtle motion effects on still images so nothing feels static. Background music that ducks under the voice when someone’s speaking and comes back up during pauses.
Then all the scenes get stitched into one final video.
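The music ducking, for instance, reduces to a gain envelope driven by voice activity. A toy version of the idea, assuming per-segment activity flags (a real pipeline ramps the gain smoothly instead of stepping it):

```python
def duck_music(music_gain_db: float, voice_active: list, duck_db: float = -12.0) -> list:
    """Per-segment music gain: drop the music while the voice is speaking,
    restore it during pauses."""
    return [music_gain_db + (duck_db if speaking else 0.0) for speaking in voice_active]
```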
The output is a single MP4 file — 1080p, properly encoded, ready to upload to YouTube without any additional processing.
6. The Full YouTube Package
The system doesn’t just spit out a video file. It generates everything you need to publish:
- A YouTube-optimized title
- A full description with SEO keywords and chapter markers
- Timed chapter markers that match the video’s sections
- A pinned comment draft for engagement
- A thumbnail
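The chapter markers fall out of the assembly step almost for free, since the pipeline already knows each section's duration. A sketch of the idea — the input format here is my assumption:

```python
def chapter_markers(sections: list) -> str:
    """Build YouTube chapter lines ('MM:SS Title') from (title, seconds) pairs.
    YouTube requires the first chapter to start at 00:00."""
    lines, t = [], 0.0
    for title, seconds in sections:
        m, s = divmod(int(t), 60)
        lines.append(f"{m:02d}:{s:02d} {title}")
        t += seconds
    return "\n".join(lines)
```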
You could literally take the output folder and upload everything as-is. That was the goal — zero manual steps between “run the pipeline” and “publish.”
What Makes This Different From Other AI Video Tools
I’ve tested a lot of AI video generators. Most of them produce content you can spot as AI-made within five seconds. Here’s what I specifically designed to avoid that:
Real editorial judgment, not one-shot generation. Three QA stages on the script. Three quality gates on visuals. The system is designed to be picky, not fast.
Mood awareness across the entire pipeline. The voice changes tone. The visuals match the energy. Even the pacing of cuts adapts to whether a moment is meant to be exciting or contemplative. This consistency is what makes a video feel intentional rather than randomly assembled.
A strict anti-plagiarism system. The script gets checked against the source material for overlap. Not just exact matches — similar phrasing gets caught too. The output is meant to be original content inspired by a topic, not a reshuffled version of someone else’s work.
Readability as a hard requirement. If the language is too complex, it gets simplified. Period. Content should be accessible.
The Honest Results
Let me be upfront — iteration one isn’t perfect. Nobody ships a perfect v1.
The script quality genuinely impressed me. That three-stage review loop produces scripts that are coherent, well-paced, and actually interesting to listen to. That part exceeded my expectations.
The voice narration is solid. The mood-adaptive settings make a noticeable difference. It doesn’t sound like a robot reading text — it sounds like someone telling you a story.
The visuals are the weakest link right now. Stock footage has its limits. Sometimes the search finds exactly the right clip. Sometimes you get something that’s… close enough. I’m actively working on improving the visual matching logic and exploring ways to generate custom visuals where stock falls short.
The assembly and encoding work flawlessly. Once you have good inputs, the rendering pipeline handles everything cleanly.
Overall? It’s way better than I expected for a first iteration. And each version is getting noticeably better.
Why I Built This (And Why It Matters)
I’m a full-stack developer. I build web products, SaaS tools, and automation systems — mostly with Ruby on Rails and Next.js. If you’re curious about my work, you can check out my portfolio or read some of my recent writing:
- What it takes to become a software developer. Do you have it?
- Scoring the first developer job — tears, sweat and satisfaction
This video automation project sits at the intersection of everything I enjoy: backend architecture, API integrations, AI workflows, and solving real problems that save people real time.
Content creation is one of the biggest bottlenecks for businesses and creators right now. Everyone knows they should be making videos. Almost nobody has the time or budget to do it consistently. If this pipeline reaches production quality, it changes the equation entirely.
What’s Coming Next
I’m iterating in public. Every version gets documented, compared to the last, and shared — warts and all.
Here’s what I’m focused on for the next iterations:
- Better visual matching — smarter keyword extraction, broader source libraries, and potentially AI-generated visuals for scenes where stock footage falls short.
- Tighter pacing — fine-tuning the relationship between voice speed, cut frequency, and scene length.
- Style consistency — making sure the visual tone stays cohesive across an entire video, not just scene by scene.
- Series support — automatically splitting longer topics into multi-part episodes with proper continuity.
The goal is simple: reach a point where someone watches the output and genuinely can’t tell whether a human edited it or not.
Will it get there? That’s what we’re about to find out.
Want to See the Output?
I’m sharing the actual videos from each iteration — starting with v1. No cherry-picking. No showing only the best parts. The full thing, so you can judge for yourself.
If you want to follow the progression, the best way is to check my site where I’ll be posting detailed breakdowns of each version. Or just reach out — I’m always happy to talk shop.
The question I keep asking myself: How good can this actually get?
I guess we’ll find out together.
Iliya Oblakov is a freelance full-stack developer specializing in Ruby on Rails and Next.js development, based in Bulgaria. He builds web products, SaaS tools, and automation systems for clients worldwide.