Diffusion model

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure.[1] The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data.[2] A trained diffusion model can be sampled in many ways, with different efficiency and quality.

There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[3] They are typically trained using variational inference.[4] The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but they are typically U-nets or transformers.

As of 2024, diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise.[2][5] The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.

Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation.[6]

Other than computer vision, diffusion models have also found applications in natural language processing[7][8] such as text generation[9][10] and summarization,[11] sound generation,[12] and reinforcement learning.[13][14]

Denoising diffusion model

[edit]

Non-equilibrium thermodynamics

[edit]

Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion.[15]

Consider, for example, how one might model the distribution of all naturally-occurring photos. Each image is a point in the space of all images, and the distribution of naturally-occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution . A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.

The equilibrium distribution is the Gaussian distribution , with pdf . This is just the Maxwell–Boltzmann distribution of particles in a potential well at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they will all fall to the origin, collapsing the distribution.

Denoising Diffusion Probabilistic Model (DDPM)

[edit]

The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference.[4][16]

Forward diffusion

[edit]

To present the model, we need some notation.

  • are fixed constants.
  • is the normal distribution with mean and variance , and is the probability density at .
  • A vertical bar denotes conditioning.

A forward diffusion process starts at some starting point , where is the probability distribution to be learned, then repeatedly adds noise to it bywhere are IID samples from . This is designed so that for any starting distribution of , we have converging to .

The entire diffusion process then satisfiesorwhere is a normalization constant and often omitted. In particular, we note that is a gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with gaussian process, In particular, notice that for large , the variable converges to . That is, after a long enough diffusion process, we end up with some that is very close to , with all traces of the original gone.

For example, sincewe can sample directly "in one step", instead of going through all the intermediate steps .

Derivation by reparameterization

We know is a gaussian, and is another gaussian. We also know that these are independent. Thus we can perform a reparameterization: where are IID gaussians.

There are 5 variables and two linear equations. The two sources of randomness are , which can be reparameterized by rotation, since the IID gaussian distribution is rotationally symmetric.

By plugging in the equations, we can solve for the first reparameterization: where is a gaussian with mean zero and variance one.

To find the second one, we complete the rotational matrix:

Since rotational matrices are all of the form , we know the matrix must be and since the inverse of rotational matrix is its transpose,

Plugging back, and simplifying, we have

Backward diffusion

[edit]

The key idea of DDPM is to use a neural network parametrized by . The network takes in two arguments , and outputs a vector and a matrix , such that each step in the forward diffusion process can be approximately undone by . This then gives us a backward diffusion process defined byThe goal now is to learn the parameters such that is as close to as possible. To do that, we use maximum likelihood estimation with variational inference.

Variational inference

[edit]

The ELBO inequality states that , and taking one more expectation, we getWe see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.

Define the loss functionand now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to[17]where does not depend on the parameter, and thus can be ignored. Since also does not depend on the parameter, the term can also be ignored. This leaves just with to be minimized.

Noise prediction network

[edit]

Since , this suggests that we should use ; however, the network does not have access to , and so it has to estimate it instead. Now, since , we may write , where is some unknown gaussian noise. Now we see that estimating is equivalent to estimating .

Therefore, let the network output a noise vector , and let it predictIt remains to design . The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value , where either yielded similar performance.

With this, the loss simplifies to which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss functionresulted in better models.

Backward diffusion process

[edit]

After a noise prediction network is trained, it can be used for generating data points in the original distribution in a loop as follows:

  1. Compute the noise estimate
  2. Compute the original data estimate
  3. Sample the previous data
  4. Change time

Score-based generative model

[edit]

Score-based generative model is another formulation of diffusion modelling. They are also called noise conditional score network (NCSN) or score-matching with Langevin dynamics (SMLD).[18][19][20][21]

Score matching

[edit]

The idea of score functions

[edit]

Consider the problem of image generation. Let represent an image, and let be the probability distribution over all possible images. If we have itself, then we can say for certain how likely a certain image is. However, this is intractable in general.

Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors — e.g. how much more likely is an image of cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?

Consequently, we are actually quite uninterested in itself, but rather, . This has two major effects:

  • One, we no longer need to normalize , but can use any , where is any unknown constant that is of no concern to us.
  • Two, we are comparing neighbors , by

Let the score function be ; then consider what we can do with .

As it turns out, allows us to sample from using thermodynamics. Specifically, if we have a potential energy function , and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution . At temperature , the Boltzmann distribution is exactly .

Therefore, to model , we may start with a particle sampled at any convenient distribution (such as the standard gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation and the Boltzmann distribution is, by Fokker-Planck equation, the unique thermodynamic equilibrium. So no matter what distribution has, the distribution of converges in distribution to as .

Learning the score function

[edit]

Given a density , we wish to learn a score function approximation . This is score matching.[22] Typically, score matching is formalized as minimizing Fisher divergence function . By expanding the integral, and performing an integration by parts, giving us a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent.

Annealing the score function

[edit]

Suppose we need to model the distribution of images, and we want , a white-noise image. Now, most white-noise images do not look like real images, so for large swaths of . This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function at that point, then we cannot impose the time-evolution equation on a particle:To deal with this problem, we perform annealing. If is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.

Continuous diffusion processes

[edit]

Forward diffusion process

[edit]

Consider again the forward diffusion process, but this time in continuous time:By taking the limit, we obtain a continuous diffusion process, in the form of a stochastic differential equation:where is a Wiener process (multidimensional Brownian motion).

Now, the equation is exactly a special case of the overdamped Langevin equationwhere is diffusion tensor, is temperature, and is potential energy field. If we substitute in , we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.

Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to at time , then after a long time, the cloud of particles would settle into the stable distribution of . Let be the density of the cloud of particles at time , then we haveand the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.

By Fokker-Planck equation, the density of the cloud evolves according towhere is the dimension of space, and is the Laplace operator.

Backward diffusion process

[edit]

If we have solved for time , then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density , and let the particles in the cloud evolve according tothen by plugging into the Fokker-Planck equation, we find that . Thus this cloud of points is the original cloud, evolving backwards.[23]

Noise conditional score network (NCSN)

[edit]

At the continuous limit, and so In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling , then get . That is, we can quickly sample for any .

Now, define a certain probability distribution over , then the score-matching loss function is defined as the expected Fisher divergence: After training, , so we can perform the backwards diffusion process by first sampling , then integrating the SDE from to : This may be done by any SDE integration method, such as Euler–Maruyama method.

The name "noise conditional score network" is explained thus:

  • "network", because is implemented as a neural network.
  • "score", because the output of the network is interpreted as approximating the score function .
  • "noise conditional", because is equal to blurred by an added gaussian noise that increases with time, and so the score function depends on the amount of noise added.

Their equivalence

[edit]

DDPM and score-based generative models are equivalent.[19][2][24] This means that a network trained using DDPM can be used as a NCSN, and vice versa.

We know that , so by Tweedie's formula, we have As described previously, the DDPM loss function is with where . By a change of variables, and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have

Thus, a score-based network can be used for denoising diffusion.

Conversely, the continuous limit of the backward equation gives us precisely the same equation as score-based diffusion: Thus, a denoising network can be used as for score-based diffusion.

Main variants

[edit]

Noise schedule

[edit]
Illustration for a linear diffusion noise schedule. With settings .

In DDPM, the sequence of numbers is called a (discrete time) noise schedule. In general, consider a strictly increasing monotonic function of type , such as the sigmoid function. In that case, a noise schedule is a sequence of real numbers . It then defines a sequence of noises , which then derives the other quantities .

In order to use arbitrary noise schedules, instead of training a noise prediction model , one trains .

Similarly, for the noise conditional score network, instead of training , one trains .

Denoising Diffusion Implicit Model (DDIM)

[edit]

The original DDPM method for generating images is slow, since the forward diffusion process usually takes to make the distribution of to appear close to gaussian. However this means the backward diffusion process also take 1000 steps. Unlike the forward diffusion process, which can skip steps as is gaussian for all , the backward diffusion process does not allow skipping steps. For example, to sample requires the model to first sample . Attempting to directly sample would require us to marginalize out , which is generally intractable.

DDIM[25] is a method to take any model trained on DDPM loss, and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If we generate the Markovian chain case in DDPM to non-Markovian case, DDIM corresponds to the case that the reverse process has variance equals to 0. In other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM.

In detail, the DDIM sampling method is as follows. Start with the forward diffusion process . Then, during the backward denoising process, given , the original data is estimated as then the backward diffusion process can jump to any step , and the next denoised sample is where is an arbitrary real number within the range , and is a newly sampled gaussian noise.[17] If all , then the backward process becomes deterministic, and this special case of DDIM is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already very similar to ones generated with 1000 steps on the high-level.

The original paper recommended defining a single "eta value" , such that . When , this is the original DDPM. When , this is the fully deterministic DDIM. For intermediate values, the process interpolates between them.

By the equivalence, the DDIM algorithm also applies for score-based diffusion models.

Latent diffusion model (LDM)

[edit]

Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.[26]

The encoder-decoder pair is most often a variational autoencoder (VAE).

Architectural improvements

[edit]

[27] proposed various architectural improvements. For example, they proposed log-space interpolation during backward sampling. Instead of sampling from , they recommended sampling from for a learned parameter .

In the v-prediction formalism, the noising formula is reparameterised by an angle such that and a "velocity" defined by . The network is trained to predict the velocity , and denoising is by .[28] This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. ) and then reverse it, whereas the standard parameterization never reaches total noise since is always true.[29]

Classifier guidance

[edit]

Classifier guidance was proposed in 2021 to improve class-conditional generation by using a classifier. The original publication used CLIP text encoders to improve text-conditional image generation.[30]

Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution , where ranges over images, and ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).

Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image conditional on description , we imagine that the requester really had in mind an image , but the image is passed through a noisy channel and came out garbled, as . Image generation is then nothing but inferring which the requester had in mind.

In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in noisy-channel model, we use Bayes theorem to get in other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". In the equation for backward diffusion, the score can be replaced by where is the score function, trained as previously described, and is found by using a differentiable image classifier.

During the diffusion process, we need to condition on the time, givingAlthough, usually the classifier model does not depend on time, in which case .

Classifier guidance is defined for the gradient of score function, thus for score-based diffusion network, but as previously noted, score-based diffusion models are equivalent to denoising models by , and similarly, . Therefore, classifier guidance works for denoising diffusion as well, using the modified noise prediction:[30]

With temperature

[edit]

The classifier-guided diffusion model samples from , which is concentrated around the maximum a posteriori estimate . If we want to force the model to move towards the maximum likelihood estimate , we can use where is interpretable as inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high would force the model to sample from a distribution concentrated around . This sometimes improves quality of generated images.[30]

This gives a modification to the previous equation:For denoising models, it corresponds to[31]

Classifier-free guidance (CFG)

[edit]

If we do not have a classifier , we could still extract one out of the image model itself:[31] Such a model is usually trained by presenting it with both and , allowing it to model both and .

Note that for CFG, the diffusion model cannot be merely a generative model of the entire data distribution . It must be a conditional generative model . For example, in stable diffusion, the diffusion backbone takes as input both a noisy model , a time , and a conditioning vector (such as a vector encoding a text prompt), and produces a noise prediction .

For denoising models, it corresponds toAs sampled by DDIM, the algorithm can be written as[32]A similar technique applies to language model sampling. Also, if the unconditional generation is replaced by , then it results in negative prompting, which pushes the generation away from condition.[33][34]

Samplers

[edit]

Given a diffusion model, one may regard it either as a continuous process, and sample from it by integrating a SDE, or one can regard it as a discrete process, and sample from it by iterating the discrete steps. The choice of the "noise schedule" can also affect the quality of samples. A noise schedule is a function that sends a natural number to a noise level: A noise schedule is more often specified by a map . The two definitions are equivalent, since .

In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling.[35] One can interpolate between noise and no noise. The amount of noise is denoted ("eta value") in the DDIM paper, with denoting no noise (as in deterministic DDIM), and denoting full noise (as in DDPM).

In the perspective of SDE, one can use any of the numerical integration methods, such as Euler–Maruyama method, Heun's method, linear multistep methods, etc. Just as in the discrete case, one can add an adjustable amount of noise during the integration.[36]

A survey and comparison of samplers in the context of image generation is in.[37]

Other examples

[edit]

Notable variants include[38] Poisson flow generative model,[39] consistency model,[40] critically-damped Langevin diffusion,[41] GenPhys,[42] cold diffusion,[43] discrete diffusion,[44][45] etc.

Flow-based diffusion model

[edit]

Abstractly speaking, the idea of diffusion model is to take an unknown probability distribution (the distribution of natural-looking images), then progressively convert it to a known probability distribution (standard gaussian distribution), by building an absolutely continuous probability path connecting them. The probability path is in fact defined implicitly by the score function .

In denoising diffusion models, the forward process adds noise, and the backward process removes noise. Both the forward and backward processes are SDEs, though the forward process is integrable in closed-form, so it can be done at no computational cost. The backward process is not integrable in closed-form, so it must be integrated step-by-step by standard SDE solvers, which can be very expensive. The probability path in diffusions model is defined through an Itô process and one can retrieve the deterministic process by using the Probability ODE flow formulation.[2]

In flow-based diffusion models, the forward process is a deterministic flow along a time-dependent vector field, and the backward process is also a deterministic flow along the same vector field, but going backwards. Both processes are solutions to ODEs. If the vector field is well-behaved, the ODE will also be well-behaved.

Given two distributions and , a flow-based model is a time-dependent velocity field in , such that if we start by sampling a point , and let it move according to the velocity field: we end up with a point . The solution of the above ODE define a probability path by the pushforward measure operator. In particular, .

The probability path and the velocity field also satisfy the continuity equation, in the sense of probability distribution: