How the New OpenAI Image Generation Works
OpenAI went viral with Ghibli-style AI images. Why is the new image generation so good and accurate, and what's different from diffusion models?
If you are on X or LinkedIn, you have probably seen the wave of “Ghibli images.” The latest image generator in OpenAI's ChatGPT has sparked a flood of memes featuring artwork inspired by Studio Ghibli, the renowned Japanese animation studio behind classics like My Neighbor Totoro and Princess Mononoke.
This is what I’m talking about:
OpenAI just changed the game completely with their new 4o Image Generation. It creates very high-quality visuals, can render text (which previously looked like weird alien scribbles), and allows editing of images or accurate recreation of real people.
Instead of using the same old diffusion approach that powered previous DALL-E models, OpenAI switched to something called an "autoregressive architecture," and it makes a massive difference in what AI images can do now.
I made a few notes about why this is a big deal and what is different technically from the previous ways we used to generate images with AI.
Diffusion vs Autoregression
Diffusion approach
This is the approach used by DALL-E 2 & 3, Stable Diffusion, and Midjourney. Diffusion processes the entire image simultaneously, starting with random noise and gradually refining it into a coherent image – similar to watching a Polaroid develop.
The process involves two key stages:
Forward Diffusion: Systematically adds noise to an image until it becomes pure noise.
Reverse Diffusion: The neural network learns to remove noise step-by-step to reconstruct an image.
During training, the model learns to predict the noise that was added at each step of the forward process. At inference time, we can start with pure noise and iteratively apply the learned denoising process to generate new images.
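To make those two stages concrete, here is a heavily simplified, toy-scale sketch of the diffusion recipe in PyTorch. The 16x16 "image", the 100-step schedule, and the little MLP denoiser are placeholder assumptions for illustration (real diffusion models use much larger U-Nets or transformers), but the structure is the same: noise images during training, learn to predict that noise, then denoise from pure static at sampling time.

```python
# A heavily simplified DDPM-style sketch (not OpenAI's code). The 16x16 "image",
# the 100-step schedule, and the MLP denoiser are toy assumptions for illustration.
import torch
import torch.nn as nn

T = 100                                     # number of diffusion steps (assumption)
betas = torch.linspace(1e-4, 0.02, T)       # noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)    # cumulative products used for noising

denoiser = nn.Sequential(                   # toy noise-prediction network
    nn.Linear(16 * 16 + 1, 128), nn.ReLU(), nn.Linear(128, 16 * 16)
)

def forward_diffusion(x0, t):
    """Forward process: jump straight to the t-th noisy version of a clean image x0."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
    return x_t, noise

def training_loss(x0, t):
    """The model learns to predict the noise that was added at step t (MSE objective)."""
    x_t, noise = forward_diffusion(x0, t)
    t_embed = torch.full((x0.shape[0], 1), t / T)
    pred_noise = denoiser(torch.cat([x_t, t_embed], dim=1))
    return ((pred_noise - noise) ** 2).mean()

@torch.no_grad()
def sample(n=1):
    """Reverse process: start from pure noise and denoise step by step."""
    x = torch.randn(n, 16 * 16)
    for t in reversed(range(T)):
        t_embed = torch.full((n, 1), t / T)
        pred_noise = denoiser(torch.cat([x, t_embed], dim=1))
        # Remove the predicted noise (simplified DDPM update rule)...
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * pred_noise) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # ...and re-inject a little
    return x

print(sample().shape)   # torch.Size([1, 256]) -- an (untrained) generated "image"
```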
Autoregressive approach
Now used by OpenAI's GPT-4o. It constructs the image sequentially, token by token (roughly left to right, top to bottom), similar to how GPT generates text. This explains why the generation appears as a curtain drawing downward and why it's slower but often more precise.
The process involves:
Image Tokenization: Converts the image into a grid of discrete "visual tokens" using techniques like VQ-VAE.
Sequential Prediction: Predicts each token based on all previously generated tokens, maintaining consistency throughout the image.
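Here is a rough toy sketch of that two-step recipe in PyTorch: patch embeddings get snapped to the nearest entry of a discrete codebook (the VQ idea), and a small causal transformer then samples one visual token at a time, each conditioned on everything generated so far. The codebook size, grid size, and model are made-up placeholders; GPT-4o's actual tokenizer and architecture are not public.

```python
# A toy sketch of autoregressive image generation over discrete visual tokens.
# The codebook, grid size, and tiny transformer are illustrative assumptions only.
import torch
import torch.nn as nn

VOCAB = 512            # size of the visual-token codebook (assumption)
GRID = 8               # tiny 8x8 token grid to keep the toy fast
SEQ_LEN = GRID * GRID  # 64 visual tokens per "image"

# Step 1: tokenization. Each patch embedding is snapped to its nearest codebook entry.
codebook = torch.randn(VOCAB, 8)

def tokenize(patch_embeddings):
    """VQ-style quantization: return one discrete token id per patch."""
    dists = torch.cdist(patch_embeddings, codebook)   # (num_patches, VOCAB)
    return dists.argmin(dim=-1)

# Step 2: sequential prediction with a small causal transformer.
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
tok_embed = nn.Embedding(VOCAB + 1, 64)   # +1 slot for a <bos> token
to_logits = nn.Linear(64, VOCAB)

@torch.no_grad()
def generate_image_tokens():
    """Generate visual tokens one at a time, each conditioned on all previous ones."""
    tokens = [VOCAB]                                    # start from <bos>
    for _ in range(SEQ_LEN):
        x = tok_embed(torch.tensor(tokens)[None, :])    # (1, t, 64)
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        h = decoder(x, mask=mask)                       # causal self-attention
        probs = to_logits(h[0, -1]).softmax(dim=-1)     # next-token distribution
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens[1:]                                   # decode these back to pixels

fake_patches = torch.randn(SEQ_LEN, 8)                  # stand-in encoder output
print("tokenized:", tokenize(fake_patches)[:8].tolist())
print("generated:", generate_image_tokens()[:8])
```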
The difference shown on an example
I found a diagram from an academic paper that demonstrates nicely how the approaches differ:
Top Section: AR Decoder (What GPT-4o Uses)
The output for the X-ray is generated token by token, starting with <bos> (beginning of sequence). Each new token ("the" → "heart" → "is" → "normal") depends on the previous ones. The arrows show this direct dependency between tokens - that is very important! It's building sequentially.
Bottom Section: Diffusion Model (What DALL-E 2 & 3 Used)
It starts with "Initial Gaussian Noise" (random static) at the bottom and works on the entire output at once through denoising steps. There are no sequential dependencies between elements.

Native Multimodal Model
With this switch to the autoregressive approach, we get what OpenAI calls "native multimodal image generation." Of course, it's OpenAI, so they don't reveal any code or paper behind this approach. But "native multimodal" is just a fancy way of saying the text and images are processed by the same "brain" now.
The AI doesn't need to switch between different systems to understand your request and then make an image - it's all happening in one place.
This connection brings some really impressive improvements.
The model can actually write proper text in images because it's using the same system that's so good at generating text in chats. Because everything shares the same neural network, the model actually understands the context of your entire chat. Ask it to make a cat, then say "now put a hat on it," and it knows what you're talking about without starting over.
You can give it much more detailed prompts with multiple elements (like "draw 10 animals in a specific arrangement"), and it actually follows them correctly. It does eventually get confused if you try to cram in too many things (more than 10-20 objects).
The image doesn't lose quality or consistency when you request changes because the model understands both the image it made and your new requests.
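One toy way to picture the "one brain" setup: text tokens and image tokens live in a single shared vocabulary and a single flat sequence, so the model that read your whole chat is the same one that emits the next image's tokens. Everything below (vocabulary sizes, special markers, the fake tokenizer) is an assumption for illustration, since OpenAI hasn't published how GPT-4o actually interleaves modalities.

```python
# A toy illustration of the shared-vocabulary idea. All sizes, markers, and the
# fake tokenizer below are assumptions; OpenAI hasn't published GPT-4o's scheme.
TEXT_VOCAB = 50_000                    # assumed text vocabulary size
IMAGE_VOCAB = 8_192                    # assumed visual-token codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB         # special "begin image" marker
EOI = BOI + 1                          # special "end image" marker

def encode_text(s: str) -> list[int]:
    """Stand-in tokenizer: one fake token id per word."""
    return [hash(w) % TEXT_VOCAB for w in s.lower().split()]

def image_token(vq_id: int) -> int:
    """Shift a visual-token id past the text ids so both share one vocabulary."""
    return TEXT_VOCAB + vq_id

# Pretend these came out of the image tokenizer for the first generated picture.
cat_image_tokens = [image_token(i) for i in (17, 512, 44, 908)]

# The whole conversation is one flat token sequence the model attends over:
conversation = (
    encode_text("draw a cat")                # user turn 1 (text tokens)
    + [BOI] + cat_image_tokens + [EOI]       # the model's first image (visual tokens)
    + encode_text("now put a hat on it")     # user turn 2, referring back to the cat
)
# The next image's tokens are predicted conditioned on ALL of the above,
# which is why the model "remembers" the cat instead of starting over.
print(len(conversation), "tokens in one shared sequence")
```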
Why wasn’t autoregression used before?
Autoregression isn't some brand-new AI technique. It dates back to 1927, when it was first used to model sunspot cycles. And if you've been using ChatGPT, you've been using an autoregressive model all along - that's exactly how language models generate text.
So why hasn't it been used for image generation until now? The simple answer: it was too slow and computationally expensive.
So what made autoregressive image generation finally practical? While OpenAI hasn't revealed anything, developers on social media suggest there's an iterative refinement system at work rather than single-pass generation. That would mean a multi-step approach where the image isn't generated in just one complete autoregressive pass but through multiple iterations that progressively improve the quality. People also speculate that better hardware and more efficient image tokenization play a role.
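Purely to illustrate what "multiple passes" could mean, here is a toy sketch of that speculated refinement loop, where each pass regenerates the image tokens conditioned on the previous draft. This is community speculation dressed up as code, not anything OpenAI has confirmed.

```python
# Toy sketch of the speculated multi-pass idea: not anything OpenAI has confirmed.
import random

def toy_autoregressive_pass(context: list[int], length: int = 64) -> list[int]:
    """Stand-in for one full left-to-right generation of visual tokens."""
    return [random.randrange(512) for _ in range(length)]

def generate_with_refinement(prompt_tokens: list[int], passes: int = 3) -> list[int]:
    """Each pass regenerates the image tokens conditioned on the previous draft,
    so later passes can correct earlier mistakes instead of committing to a
    single one-shot run."""
    draft = toy_autoregressive_pass(prompt_tokens)               # first full pass
    for _ in range(passes - 1):
        draft = toy_autoregressive_pass(prompt_tokens + draft)   # condition on the draft
    return draft

print(len(generate_with_refinement([1, 2, 3])), "visual tokens after refinement")
```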
Whatever the exact technique, the result is still computationally intensive (hence the slower generation) but now just feasible enough to be practical. I just really want to know what OpenAI put in there…
Downsides
The autoregressive approach isn't perfect yet. The biggest tradeoff, as mentioned, is speed. If you've tried 4o image generation, you've noticed it's painfully slow compared to the old DALL-E models - it can take 30 seconds to a full minute (sometimes even longer) to generate a single image.
That's no surprise given what we said: the model is building the image piece by piece in sequence rather than working on the whole image at once.
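A quick back-of-the-envelope comparison shows why. The numbers below are assumptions, not GPT-4o's real token or step counts:

```python
# Made-up but plausible numbers; GPT-4o's real token count and step count aren't public.
tokens_per_image = 32 * 32    # assume a 32x32 grid of visual tokens
diffusion_steps = 50          # a typical number of denoising passes
print(f"autoregressive: {tokens_per_image} sequential token predictions")
print(f"diffusion:      {diffusion_steps} passes over the whole image at once")
```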
The model can also get confused if you ask for too many objects (more than 10-20 things), sometimes crops images weirdly, and struggles with languages other than English.
What’s next?
There are also some pretty big ethical questions. The model can copy celebrity faces, brand logos, and artistic styles (like Studio Ghibli's distinctive look), which raises all sorts of copyright and consent issues. OpenAI does include special metadata (C2PA) in all generated images to show they're AI-made, but it can be removed easily.
Sam Altman (OpenAI's CEO) seems pretty relaxed (based on his posts on X) about these risks. He’s just a chill guy.
Regardless, this shift shows us where AI is heading: everything is going to merge together. For regular users, this means more natural conversations with AI and the ability to refine creative work through normal dialogue instead of learning complex prompting tricks.
For society, it's another huge step toward a world where any media can be manipulated by anyone with minimal technical skill. We're entering an era where "seeing is believing" becomes increasingly meaningless. The importance of trusting your sources rather than just trusting what you see will only grow.
Whether the benefits outweigh the risks is still an open question, but one thing is for sure - OpenAI just changed the game in a fundamental way, and everyone else will likely follow.
P. S. If you are a video person more than a text person, I recommend this YouTube video explaining the difference between diffusion and autoregressive models.
What is Ghibli, btw?
I think in the age of AI-generated content, it’s good to spread knowledge about the origins of the Ghibli visual style, given it went so viral. For those who don't know, Studio Ghibli was co-founded by Hayao Miyazaki in 1985. Miyazaki is a legendary Japanese animator, filmmaker, and manga artist who created beloved films like "My Neighbor Totoro", "Kiki's Delivery Service", and "Spirited Away".
These films have a distinctive visual style that's instantly recognizable - and now AI can copy it with a simple text prompt. That's impressive technically, but it's worth noting that Miyazaki himself has been openly critical of AI animation, reportedly calling it "an insult to life itself" when shown examples of AI-generated imagery.