Mercury Coder and Language Diffusion Modeling: Something to Watch
I’ve written about masked diffusion models before, but I didn’t expect someone to release a practical demo so soon for what seemed like an experimental technology still in early R&D. That day has come with Mercury Coder, produced by Inception Labs. Mercury Coder is a new class of LLM built on a novel set of principles. Instead of the “next token prediction” of autoregressive transformer models, it uses diffusion, similar to the Stable Diffusion approach used for image generation, except applied to the domain of language tokens rather than images.
Mercury’s diffusion process, loosely inspired by thermodynamics, lets it explore a much vaster space of possible outputs simultaneously. Picture a gas expanding to fill the volume of a container, except each particle is a token: at first semi-random noise, then gradually stabilizing into coherent text. Because tokens are refined in parallel rather than emitted one at a time, generation is blazingly fast, processing 1,109 tokens per second compared to 59 tokens per second for GPT-4o Mini.
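To make that parallel-refinement idea concrete, here is a toy sketch of how iterative masked-diffusion decoding works in general. This is not Inception Labs’ actual implementation (Mercury’s internals aren’t public); the `denoiser` function is a random stand-in for a trained model, purely for illustration.

```python
# Toy sketch of masked-diffusion decoding: start from all-masked "noise" and
# unmask tokens in parallel over a few steps, keeping the most confident guesses.
import random

VOCAB = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]
MASK = "<mask>"

def denoiser(tokens):
    """Stand-in for a trained denoising model: for every masked position,
    propose a token and a confidence score. Here it's just random."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=10, steps=4):
    tokens = [MASK] * length                       # start from pure "noise" (all masks)
    for step in range(steps):
        proposals = denoiser(tokens)               # predict every masked slot in parallel
        if not proposals:
            break
        # commit only the most confident fraction this step; the rest stay masked
        keep = max(1, len(proposals) // (steps - step))
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _) in ranked[:keep]:
            tokens[i] = tok
        print(f"step {step}: {' '.join(tokens)}")
    return tokens

diffusion_decode()
```

Each pass refines the whole sequence at once, which is why this style of decoding can reach the output in a handful of steps instead of one forward pass per token.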
On the benchmarks, the 8-billion-parameter Mercury Coder performs at the level of a middle-of-the-road autoregressive LLM, which is not too shabby for an early demo and leaves open the possibility that it has room to scale.
Playing with the app is interesting. The developers have included a “diffusion process” viewer setting that, when turned on, shows an animation of streaming text racing down the screen as it goes from more random to more ordered, finally resolving into the finished output. I tried to include some screenshots of the effect, but it goes so fast it’s hard to capture. Seeing structured, usable code get generated before you blink is impressive.
This diffusion-based LLM, with its expansive, combination-exploring nature, could have many as-yet-unimagined applications. I could imagine it crushing it on certain search problems, or on cryptographic tasks that require trying out many combos at once. It’s further proof that the path to innovation in AI may lie beyond simply scaling transformers, in exploring the space of possible alternative architectures instead, a bit like how Mercury Coder itself works!