I Built An AI Instagram Carousel Generator In One Weekend Susiloharjo

I built a self-hosted AI Instagram carousel generator in one weekend. The whole thing runs on Next.js, Docker Compose, and Google Gemini, and it takes a 10-slide carousel from a one-line prompt to a scheduled Instagram post without me touching it after I hit submit.

You have done this grind: stare at a blank page, type a hook, get bored by slide three, push out the rest at 2am. The carousel takes three hours you do not have. The quality is whatever your tired brain produced at 1am. I have done it enough times that I decided to write the tool that does it for me.

The bet I made: a self-hosted stack — Next.js, Postgres, Redis, MinIO, and the Gemini API — can produce a publishable 10-slide carousel faster than I can outline one, at quality I would not be embarrassed to post. The rest of this post is the proof.

What it actually does

You give the app a topic, a number of slides, a tone of voice, and a language. The app does everything else. It calls Gemini to generate a storyline — title, per-slide heading and body, caption, hashtags, and a call to action. It calls Gemini again, once per slide, to generate the visual. It puts the slides into an editor where you can rewrite text, regenerate individual slides, drag to reorder, and preview the whole thing in an iPhone-style mockup. When you are happy, you hit Post Now or Schedule. The app pushes the carousel to Instagram via the Graph API, and the status updates in real time.
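
As a rough sketch of the storyline step, the model's reply could be parsed and checked before any image generation starts. The field names (`slides`, `callToAction`, and so on) and the `parseStoryline` helper are assumptions made for illustration, chosen to mirror the pieces listed above; the post does not show the actual schema.

```typescript
// Hypothetical shape for the storyline; field names are illustrative only.
interface SlideText {
  heading: string;
  body: string;
}

interface Storyline {
  title: string;
  slides: SlideText[];
  caption: string;
  hashtags: string[];
  callToAction: string;
}

// Read a required, non-empty string field or fail with a clear message.
function requireString(obj: Record<string, unknown>, key: string): string {
  const value = obj[key];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`field "${key}" must be a non-empty string`);
  }
  return value;
}

// Parse the model's raw JSON text and reject anything that does not match
// the expected shape or slide count, so bad output fails early and cheaply.
function parseStoryline(raw: string, expectedSlides: number): Storyline {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("storyline must be a JSON object");
  }
  const o = data as Record<string, unknown>;

  const hashtags = o["hashtags"];
  if (!Array.isArray(hashtags) || !hashtags.every((h) => typeof h === "string")) {
    throw new Error("hashtags must be an array of strings");
  }

  const rawSlides = o["slides"];
  if (!Array.isArray(rawSlides) || rawSlides.length !== expectedSlides) {
    throw new Error(`expected ${expectedSlides} slides`);
  }
  const slides: SlideText[] = rawSlides.map((s: unknown, i: number) => {
    if (typeof s !== "object" || s === null) {
      throw new Error(`slide ${i + 1} must be an object`);
    }
    const slide = s as Record<string, unknown>;
    return { heading: requireString(slide, "heading"), body: requireString(slide, "body") };
  });

  return {
    title: requireString(o, "title"),
    slides,
    caption: requireString(o, "caption"),
    hashtags: hashtags as string[],
    callToAction: requireString(o, "callToAction"),
  };
}
```

Checking the slide count up front matters because every slide triggers its own image call; a malformed storyline is cheaper to reject here than after ten images have been generated.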

The split-view editor is the part that makes the difference between a generator and a tool I actually use. The first version I built was a one-shot — it generated the carousel, showed it to me, and that was it. I never used it. The reason is the same reason I never used the AI agents I built before shipping them: a 70 percent result you cannot improve is worse than a 60 percent result you can finish yourself. The editor is the part that lets me hit 95 percent in two minutes.

Why self-hosted, not SaaS

I had two reasons going in for building this on my own box instead of using Canva, Predis, or one of the SaaS carousel tools. The first is cost. The self-hosted Next.js + Docker stack on the M720q costs Rp 75,000 a month total. Gemini API usage comes to under Rp 30,000 a month. The equivalent SaaS plan would be $30 a month minimum.

The second reason: I do not want my content drafts living in someone else’s database. Every carousel starts as a draft with a hook I have not committed to, a CTA I am not sure I want to ship, slides I might delete. I do not want those used to train someone else’s model or subject to a TOS change. The whole reason I run the homelab is to keep this under my own control.

The third reason, which I did not anticipate: the self-hosted version is faster to iterate. The first weekend I shipped a working prototype. The second weekend I added the iPhone preview. The third weekend I added scheduling. On a SaaS I would be filing feature requests and waiting for someone else’s roadmap.

The stack

I went with the boring choices because I wanted the prototype to work in a weekend. The stack is:

**Next.js 16 with the App Router**, running both frontend and API routes in a single process. The App Router handles the server/client component split without me wiring up two deployments.

**PostgreSQL 16** for persistent data: carousels, slides, scheduled jobs, tokens. Eight tables, no ORM. I have learned that an ORM saves 30 minutes on the first query and costs 3 hours debugging the N+1 problem it generated later.

**Redis 7 with BullMQ** for the queue. Image generation is slow: 10 slides at 5 seconds each would hold an API request open for about 50 seconds. Instead, I push each slide onto a BullMQ queue for a worker to process, and the API returns immediately with a carousel ID the client polls.

**MinIO** for image storage. S3-compatible, so the same code works locally or on a real S3 bucket.

**Google Gemini** for both text (`gemini-2.5-flash-lite`) and image generation (`gemini-2.5-flash-image`). The flash variants are cheap, and the image model produces portrait 3:4 visuals that fit the feed format without cropping.

**Docker Compose** to run the whole thing — five services, one `docker compose up -d`.
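
The queue hand-off described in the Redis and BullMQ item above can be sketched as two pure functions: one that builds per-slide job descriptors in the `{ name, data, opts }` shape that BullMQ's `Queue.addBulk()` accepts, and one that turns per-slide statuses into the summary a polling client would read. The job name, data fields, retry settings, and status values are all assumptions for illustration, not the project's actual code.

```typescript
// Hypothetical payload for one slide's image job.
interface SlideJobData {
  carouselId: string;
  slideIndex: number;
  prompt: string;
}

// Descriptor in the { name, data, opts } shape used by BullMQ's Queue.addBulk().
interface SlideJob {
  name: string;
  data: SlideJobData;
  opts: {
    jobId: string;
    attempts: number;
    backoff: { type: "exponential"; delay: number };
    removeOnComplete: boolean;
  };
}

function buildSlideJobs(carouselId: string, prompts: string[]): SlideJob[] {
  return prompts.map((prompt, slideIndex): SlideJob => ({
    name: "generate-slide",
    data: { carouselId, slideIndex, prompt },
    opts: {
      // BullMQ ignores an add whose jobId already exists in the queue, so a
      // double submit does not queue the same slide twice. No ":" in the ID.
      jobId: `${carouselId}-slide-${slideIndex}`,
      attempts: 3,
      backoff: { type: "exponential", delay: 2000 },
      // Remove finished jobs so the same ID can be reused for a regenerate.
      removeOnComplete: true,
    },
  }));
}

type SlideStatus = "queued" | "rendering" | "done" | "failed";

interface CarouselProgress {
  status: "pending" | "ready" | "failed";
  done: number;
  total: number;
}

// What the polling endpoint could return for a carousel ID.
function summarizeCarousel(statuses: SlideStatus[]): CarouselProgress {
  const total = statuses.length;
  const done = statuses.filter((s) => s === "done").length;
  if (statuses.some((s) => s === "failed")) {
    return { status: "failed", done, total };
  }
  return { status: total > 0 && done === total ? "ready" : "pending", done, total };
}
```

In an API route, the array from `buildSlideJobs` would be passed to `addBulk` on a BullMQ `Queue`, and the route would respond with the carousel ID right away while a separate worker process does the slow image calls.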

The hard parts

There were three things that took longer than I expected. The first was the Instagram Graph API. The flow is: get a Facebook login token, exchange it for a long-lived Instagram token, create a media container for each image, group those containers as children of a carousel container, then publish. The docs are good if you already know the flow. The docs are useless if you are reading them for the first time at 2am on a Sunday. I burned 4 hours on token refresh logic that should have taken 30 minutes, because the error messages from the Graph API are not designed to be helpful.
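
For readers meeting the publish flow for the first time, here is a minimal sketch of its three container steps written as plain request descriptions. The `GraphRequest` type and the function names are hypothetical; the paths and parameter names (`image_url`, `is_carousel_item`, `media_type`, `children`, `creation_id`) follow the Graph API's content publishing endpoints as I understand them, and token handling is left out entirely.

```typescript
// One Graph API POST, described as data so the sequence can be inspected
// before anything is sent. The caller adds the access token and API version.
interface GraphRequest {
  path: string;
  params: Record<string, string>;
}

// Step 1: one item container per image. Each image must be reachable at a
// public URL, which matters when the files live on a homelab MinIO.
function itemContainerRequests(igUserId: string, imageUrls: string[]): GraphRequest[] {
  return imageUrls.map((imageUrl): GraphRequest => ({
    path: `/${igUserId}/media`,
    params: { image_url: imageUrl, is_carousel_item: "true" },
  }));
}

// Step 2: one carousel container that references the IDs returned by step 1.
function carouselContainerRequest(
  igUserId: string,
  childIds: string[],
  caption: string
): GraphRequest {
  if (childIds.length < 2) {
    throw new Error("a carousel needs at least two items");
  }
  return {
    path: `/${igUserId}/media`,
    params: { media_type: "CAROUSEL", children: childIds.join(","), caption },
  };
}

// Step 3: publish, using the container ID returned by step 2.
function publishRequest(igUserId: string, creationId: string): GraphRequest {
  return { path: `/${igUserId}/media_publish`, params: { creation_id: creationId } };
}
```

Each step depends on IDs returned by the previous one, so the calls have to run in order; only the item containers in step 1 are independent of each other.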

The second was image consistency. The first version generated a different visual style for every slide, because each call to the image model was independent. I solved this by passing the previous slide’s style and composition as a hint in the next call’s prompt. The hint is a one-paragraph string that says “match the palette, lighting, and visual rhythm of the previous slide.” It is not perfect — about 20 percent of the carousels still have a visual jump on slide 5 or 6 — but it raised the hit rate from 50 percent to 80 percent.
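
A minimal sketch of how such a hint might be threaded into the prompt, assuming the previous slide's style is available as a short text description. `buildImagePrompt`, the prompt layout, and the `previousStyle` argument are hypothetical; only the hint sentence itself comes from the description above.

```typescript
interface SlidePromptInput {
  heading: string;
  body: string;
}

// The one-line instruction quoted in the text above.
const STYLE_HINT = "Match the palette, lighting, and visual rhythm of the previous slide.";

// Build the image prompt for one slide. The first slide has no predecessor,
// so it gets no hint; every later slide carries the previous slide's style.
function buildImagePrompt(slide: SlidePromptInput, previousStyle?: string): string {
  const parts = [
    "Portrait 3:4 illustration for an Instagram carousel slide.",
    `Heading: ${slide.heading}`,
    `Body: ${slide.body}`,
  ];
  if (previousStyle) {
    parts.push(`${STYLE_HINT} Previous slide style: ${previousStyle}`);
  }
  return parts.join("\n");
}
```

Because each prompt depends on the slide before it, slides generated this way have to be produced in order, or the style description has to be fixed before the jobs are queued.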

The third was the editor state model. I had two options: ship the editor as a controlled form with React state, or use a more sophisticated state model with auto-save. I shipped the controlled form first. It broke the moment two users opened the same carousel. I rewrote it with a debounced auto-save on blur, plus a per-slide optimistic update. The whole thing took a Saturday. The lesson, as always, was that any stateful UI that involves persistence and concurrency is twice the work you think it is, and you should plan for that from the first commit.
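
The per-slide optimistic update can be sketched as a small set of pure state transitions: apply the edit locally, keep the last confirmed copy of that slide, then either drop the copy when the save succeeds or restore it when the save fails. The state shape and function names are assumptions for illustration, and the debounce timer that would trigger the save is left out.

```typescript
interface SlideDraft {
  id: string;
  heading: string;
  body: string;
}

interface EditorState {
  slides: SlideDraft[];
  // Last server-confirmed copy of each slide that has an unsaved edit.
  pending: Record<string, SlideDraft>;
}

// Apply the edit locally right away and remember the confirmed copy.
function applyOptimistic(
  state: EditorState,
  id: string,
  patch: Partial<Omit<SlideDraft, "id">>
): EditorState {
  const current = state.slides.find((s) => s.id === id);
  if (!current) return state;
  return {
    slides: state.slides.map((s) => (s.id === id ? { ...s, ...patch } : s)),
    // Keep the oldest snapshot if several edits pile up before a save returns.
    pending: { ...state.pending, [id]: state.pending[id] ?? current },
  };
}

// Save succeeded: drop the snapshot, keep the edited slide.
function confirmSave(state: EditorState, id: string): EditorState {
  const { [id]: _saved, ...pending } = state.pending;
  return { slides: state.slides, pending };
}

// Save failed: restore the snapshot for that one slide only.
function rollback(state: EditorState, id: string): EditorState {
  const snapshot = state.pending[id];
  if (!snapshot) return state;
  const { [id]: _dropped, ...pending } = state.pending;
  return {
    slides: state.slides.map((s) => (s.id === id ? snapshot : s)),
    pending,
  };
}
```

Scoping the snapshot to a single slide means a failed save on slide 4 does not undo an edit in progress on slide 7.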

What I would not recommend

I would not recommend this stack if you are shipping this as a product to people who are not you. The auth model is mine — there is no user abstraction, no team, no roles, no billing. Adding 1,000 users means redesigning the data model, queue, storage, and auth. That is a different product and a different 6 months.

I would not recommend the image hint approach as a substitute for proper image conditioning. If I needed 100 percent consistent carousels, I would fine-tune a model on my own visual style. The hint approach is good enough for the 80 percent of carousels where visual jumps are minor. The rest get regenerated until they pass.

I would not recommend shipping a self-hosted AI product to a non-technical customer. The whole reason this runs on my homelab is that I know how to fix it when it breaks. A non-technical user would be stuck the first time the BullMQ worker crashes or the Gemini API rate-limits them. Self-hosted AI is for builders, not consumers.

The point of the post

I built this in a weekend, and I built it because I needed it. I am sharing it because the pattern is reusable: take a content workflow you do by hand, find the slow part, and replace it with an LLM call wrapped in a tool you control. The whole stack — Next.js, Docker, Postgres, Redis, MinIO, Gemini — fits in a single docker-compose.yml. The total cost is less than one Notion subscription. The hardest part is admitting that the content you produce at 2am is not as good as what you would produce with a tool that does the boring middle for you.

The bet I made was right. I now publish more carousels, with higher quality, in less time. The M720q in my living room runs the whole stack. The Gemini bill is under Rp 30,000 a month. I get to spend my time on the parts I am actually good at (the hook, the angle, the visual taste) instead of everything in between.

If you have a content workflow that is grinding you down, the answer is probably not “do more of it.” The answer is probably “write the tool that does it for you, run it on a homelab, and let an LLM handle the boring middle.” The whole stack fits in a single docker-compose.yml. You can build it this weekend.

For the homelab + Docker Compose setup this runs on, see my $0/month blog stack breakdown. The pattern here — build a tool that uses an LLM to do the boring middle — is the same one I wrote about in shipping AI agents before they are perfect.

If this resonated, you might also enjoy my takes on how my AI agent almost broke the ERP database and on what I do when AI generates garbage code.
