Text-to-image AI exploded this year as technical advances greatly improved the fidelity of the art that AI systems could create. Controversial as systems like Stable Diffusion and OpenAI’s DALL-E 2 are, platforms including DeviantArt and Canva have embraced them to power creative tools, personalize branding, and even come up with new products.
But the technology at the heart of these systems is capable of much more than generating art. Called diffusion, it’s being used by some intrepid research groups to produce music, synthesize DNA sequences, and even discover new drugs.
So what is diffusion, exactly, and why is it such a massive leap over the previous state of the art? As the year winds down, it’s worth taking a look at diffusion’s origins and how it progressed over time to become the influential force it is today. The diffusion story isn’t over (refinements to the techniques arrive with each passing month), but the last year or two especially brought notable progress.
The birth of diffusion
You might remember the wave of deepfaking apps from several years ago: apps that inserted people’s portraits into existing images and videos to create realistic-looking replacements for the original subjects in that target content. Using AI, the apps would “insert” a person’s face, or in some cases their entire body, into a scene, often convincingly enough to fool someone at first glance.
Most of these apps were based on an AI technology called generative adversarial networks, or GANs for short. GANs consist of two parts: a generator that produces synthetic examples (for example, images) from random data, and a discriminator that attempts to distinguish the synthetic examples from real examples in a training data set. (Typical GAN training data sets consist of hundreds to millions of examples of the thing the GAN is expected to eventually capture.) Both the generator and the discriminator improve in their respective abilities until the discriminator can no longer tell the real examples from the synthesized ones with better than the 50% accuracy expected of chance.
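The two-player training loop described above can be sketched end to end on toy data. This is a minimal, hypothetical illustration (not any production GAN): the "data" is just numbers drawn from a Gaussian, the generator is a scale-and-shift of noise, and the discriminator is a one-feature logistic regression. The point is only the adversarial alternation: the discriminator learns to tell real from fake, and the generator learns to fool it.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: samples from N(4, 1). The generator must learn to mimic this.
def real_batch(n):
    return rng.normal(4.0, 1.0, n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Generator: maps uniform noise z to gw*z + gb (two learnable scalars).
# Discriminator: p(real) = sigmoid(dw*x + db), a tiny logistic regression.
gw, gb = 1.0, 0.0
dw, db = 0.1, 0.0
lr = 0.05

for step in range(2000):
    z = rng.uniform(-1, 1, 64)
    fake = gw * z + gb
    real = real_batch(64)

    # Discriminator step: push p(real) toward 1 and p(fake) toward 0.
    for x, label in ((real, 1.0), (fake, 0.0)):
        p = sigmoid(dw * x + db)
        grad = p - label                  # d(binary cross-entropy)/d(logit)
        dw -= lr * float(np.mean(grad * x))
        db -= lr * float(np.mean(grad))

    # Generator step: adjust gw, gb so the discriminator calls fakes real.
    fake = gw * z + gb
    p = sigmoid(dw * fake + db)
    grad = (p - 1.0) * dw                 # chain rule through the discriminator
    gw -= lr * float(np.mean(grad * z))
    gb -= lr * float(np.mean(grad))

# After training, fake samples should be centered near the real mean (~4).
print(round(float(np.mean(gw * rng.uniform(-1, 1, 1000) + gb)), 2))
```

In practice the instability mentioned below (mode collapse, oscillation) shows up exactly in this alternation: neither player has a fixed loss surface, because each one's objective depends on the other's current parameters.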
High-performing GANs can create, for example, snapshots of fictional apartment buildings. StyleGAN, a system Nvidia developed a few years ago, can generate high-resolution headshots of fictional people by learning attributes such as facial pose, freckles, and hair. Beyond images, GANs have been applied to 3D modeling and vector sketches, showing an aptitude for generating video clips as well as speech and even looping instrument samples in songs.
However, in practice, GANs suffered from a number of shortcomings due to their architecture. Simultaneous training of the generator and discriminator models was inherently unstable; sometimes the generator would “collapse” and generate many similar-looking samples. GANs also required a lot of data and computing power to run and train, making them difficult to scale.
Enter diffusion.

How diffusion works
Diffusion takes its name from a process in physics, where something moves from a region of higher concentration to a region of lower concentration, like a sugar cube dissolving in coffee. The sugar granules are initially concentrated at the top of the liquid, but gradually become distributed throughout.
Diffusion systems borrow specifically from diffusion in non-equilibrium thermodynamics, where the process increases the entropy, or randomness, of the system over time. Consider a gas: it will eventually spread out to fill an entire space evenly through random motion. Similarly, data such as images can be transformed into a uniform noise distribution by randomly adding noise.
Diffusion systems slowly destroy the structure of data by adding noise until nothing but noise is left.
In physics, diffusion is spontaneous and irreversible: sugar that diffuses into coffee cannot be restored to its cube shape. But diffusion systems in machine learning aim to learn a kind of “reverse diffusion” process to restore destroyed data, gaining the ability to recover data from noise.
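The noise-adding "forward" process described above has a simple closed form in the standard denoising-diffusion formulation: after t steps of a fixed noise schedule, a data point is a weighted mix of the original signal and fresh Gaussian noise. Here is a minimal numpy sketch (the 8×8 "image" and the linear schedule are illustrative choices, not any particular model's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Data": a simple 8x8 grayscale image (a bright square on a dark background).
x0 = np.zeros((8, 8))
x0[2:6, 2:6] = 1.0

# A linear noise schedule: beta_t is how much noise is mixed in at step t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative "signal kept" after t steps

def noisy_sample(x0, t):
    """Jump straight to step t of the forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Early on, most of the image survives; by the final step it is essentially
# pure Gaussian noise -- the "structure destroyed" state described above.
early, late = noisy_sample(x0, 10), noisy_sample(x0, T - 1)
print(alpha_bar[10], alpha_bar[T - 1])  # close to 1 early, close to 0 late
```

The learned part of a diffusion model is the reverse: a neural network is trained to predict (and subtract) the noise at each step, which is what lets the system start from pure noise and walk back to structured data.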
Diffusion systems have been around for almost a decade. But a relatively recent OpenAI innovation called CLIP (short for “Contrastive Language-Image Pre-training”) made them much more practical for everyday applications. CLIP classifies data such as images to “score” each step of the diffusion process based on how likely the data is to match a given text prompt (e.g., “a sketch of a dog on a flowery lawn”).
At first, the data has a very low CLIP score, because it is mostly noise. But as the diffusion system reconstructs the data from the noise, it slowly comes closer to matching the prompt. A useful analogy is uncarved marble: like a master sculptor telling a novice where to carve, CLIP guides the diffusion system toward an image that scores higher.
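The scoring idea can be made concrete with a toy stand-in for CLIP. In the real model, separate image and text encoders map both modalities into one shared vector space, and the score is essentially a similarity between the two embeddings. Everything below is hypothetical (random vectors standing in for encoder outputs); it only demonstrates how the score rises as noise gives way to an image that matches the prompt:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity, the usual CLIP-style matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: a text prompt, and an image that matches it
# (its embedding lies close to the prompt's in the shared space).
prompt_emb = rng.standard_normal(64)
matching_img_emb = prompt_emb + 0.1 * rng.standard_normal(64)

# Simulate denoising: the image embedding starts as unrelated noise and is
# blended toward the matching image over the reverse steps.
noise = rng.standard_normal(64)
scores = []
for step in np.linspace(0.0, 1.0, 5):   # 0 = pure noise, 1 = clean image
    emb = (1 - step) * noise + step * matching_img_emb
    scores.append(cosine(prompt_emb, emb))

print([round(s, 2) for s in scores])    # the score climbs as noise is removed
```

In guided diffusion, the gradient of such a score with respect to the image is what nudges each denoising step toward the prompt, playing the sculptor's role in the analogy above.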
OpenAI introduced CLIP alongside DALL-E, its image-generating system. CLIP has since made its way into DALL-E’s successor, DALL-E 2, as well as open source alternatives like Stable Diffusion.
What can diffusion do?
So what can CLIP-guided diffusion models do? Well, as mentioned above, they’re pretty good at generating art, from photorealistic images to sketches, drawings, and paintings in the style of virtually any artist. In fact, there is evidence to suggest that they problematically regurgitate some of their training data.
But the models’ talents, controversial as they are, don’t end there.
Researchers have also experimented with using guided diffusion models to compose new music. Harmonai, an organization financially backed by Stability AI, the London-based startup behind Stable Diffusion, has released a diffusion-based model that can generate music clips by training on hundreds of hours of existing songs. More recently, developers Seth Forsgren and Hayk Martiros created a hobby project called Riffusion that cleverly uses a diffusion model trained on audio spectrograms (visual representations of audio) to generate ditties.
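The trick behind treating audio as an image is the spectrogram: a 2-D array of how much energy each frequency carries over time, which an image diffusion model can generate like any other picture. A minimal numpy sketch of computing one (the tone, sample rate, and window size are illustrative, not Riffusion's actual parameters):

```python
import numpy as np

# A 440 Hz tone sampled at 8 kHz -- a stand-in for a clip of music.
sr = 8000
t = np.arange(sr) / sr                      # one second of audio
audio = np.sin(2 * np.pi * 440 * t)

# Magnitude spectrogram: slice the signal into windows and FFT each one.
win = 256
frames = audio[: len(audio) // win * win].reshape(-1, win)
spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))

# The result is a (time frames x frequency bins) array -- exactly the kind of
# 2-D "image" a diffusion model can be trained on. The 440 Hz energy lands
# near bin 440 * win / sr = 14.
print(spec.shape, int(np.argmax(spec.mean(axis=0))))
```

Generating audio then runs this in reverse: the diffusion model produces a spectrogram image, and an inverse transform converts it back into a waveform.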
Beyond the realm of music, several labs are trying to apply diffusion technology to biomedicine in the hope of discovering new treatments for diseases. Startup Generate Biomedicines and a team from the University of Washington have trained diffusion-based models to produce protein designs with specific properties and functions, as MIT Tech Review reported earlier this month.
The models work in different ways. Generate Biomedicines’ model adds noise by unraveling the chains of amino acids that make up a protein, then joins random chains together to form a new protein, guided by constraints specified by the researchers. The University of Washington model, on the other hand, starts with a scrambled structure and uses information about how the pieces of a protein should fit together, provided by a separate AI system trained to predict protein structure.
They have already achieved some success. The model designed by the University of Washington group was able to find a protein that can bind to parathyroid hormone, the hormone that controls calcium levels in the blood, better than existing drugs.
Meanwhile, at OpenBioML, a Stability AI-backed effort to bring machine learning approaches to biochemistry, researchers have developed a system called DNA-Diffusion to generate cell-type-specific regulatory DNA sequences, segments of nucleic acid molecules that influence the expression of specific genes within an organism. If all goes to plan, DNA-Diffusion will generate regulatory DNA sequences from text instructions such as “A sequence that will activate a gene to its highest expression level in cell type X” and “A sequence that activates a gene in the liver and the heart, but not in the brain.”
What might the future hold for diffusion models? The sky may well be the limit. Researchers have already applied them to generating videos, compressing images, and synthesizing speech. That’s not to say diffusion won’t eventually be replaced by a higher-performing, more efficient machine learning technique, just as GANs were by diffusion. But it’s the architecture of the day for a reason; diffusion is nothing if not versatile.