Classifier-Free Guidance with Flow Matching: Mathematical Foundations

1. Background: Flow Matching for Generative Modeling

Flow matching is a framework for generative modeling that learns a vector field (or “flow”) to transform a simple base distribution (e.g., Gaussian noise) into a complex data distribution. Let $x_0 \sim p_0(x)$ be a sample from the base distribution, and $x_1 \sim p_1(x)$ a sample from the data distribution. The goal is to learn a time-dependent vector field $v_\theta(x, t)$ such that the solution to the ordinary differential equation (ODE):

\[\frac{dx}{dt} = v_\theta(x, t), \qquad x(0) = x_0\]

transports $x_0$ to $x_1$ as $t$ goes from $0$ to $1$.

The probability density $p_t(x)$ of $x(t)$ evolves according to the continuity equation:

\[\frac{\partial p_t(x)}{\partial t} + \nabla_x \cdot (p_t(x) v_\theta(x, t)) = 0\]

The flow $v_\theta$ is trained to match the marginal distributions $p_0$ and $p_1$ at $t=0$ and $t=1$, respectively, and to interpolate between them for $t \in (0, 1)$.
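
As a concrete illustration, the following sketch implements one training step of (unconditional) flow matching under the commonly used linear interpolation path $x_t = (1 - t)\,x_0 + t\,x_1$, whose regression target is simply $x_1 - x_0$. The network `v_theta`, its two-argument interface, and the PyTorch setup are illustrative assumptions, not prescribed by the text.

```python
import torch

def flow_matching_step(v_theta, x1, optimizer):
    """One flow-matching training step (sketch).

    Assumes a linear path x_t = (1 - t) * x0 + t * x1, whose target
    velocity is x1 - x0, and data x1 of shape (batch, dim).
    """
    x0 = torch.randn_like(x1)          # base sample  x0 ~ p0 = N(0, I)
    t = torch.rand(x1.shape[0], 1)     # t ~ Uniform[0, 1], one per sample
    xt = (1 - t) * x0 + t * x1         # point on the interpolation path
    target = x1 - x0                   # target velocity for this path
    loss = ((v_theta(xt, t) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```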

2. Conditional Generation and Classifier-Free Guidance

Suppose we wish to generate samples conditioned on some attribute $y$ (e.g., class label, caption). Let $p_1(x|y)$ denote the conditional data distribution. In conditional generative modeling, we want to learn a flow $v_\theta(x, t, y)$ that transports $p_0(x)$ to $p_1(x|y)$.

Classifier-free guidance is a technique to steer the generative process toward a desired condition $y$ without relying on an external classifier. Instead, the model is trained to predict both the unconditional and conditional flows, and at sampling time, the two are combined to “guide” the generation.

2.1. Conditional and Unconditional Flows

  • Conditional flow: $v_\theta(x, t, y)$ — trained to transport $p_0(x)$ to $p_1(x|y)$.
  • Unconditional flow: $v_\theta(x, t, \varnothing)$ — trained to transport $p_0(x)$ to the marginal $p_1(x)$ (i.e., ignoring $y$).

During training, the model is randomly provided with $y$ (conditional) or $\varnothing$ (unconditional), so it learns both behaviors.

3. Mathematical Formulation of Classifier-Free Guidance

At sampling time, we wish to bias the generative process toward the condition $y$ more strongly. This is achieved by interpolating between the conditional and unconditional flows:

\[v_{\text{guided}}(x, t, y) = (1 + w) \cdot v_\theta(x, t, y) - w \cdot v_\theta(x, t, \varnothing)\]

where $w \geq 0$ is the guidance weight (hyperparameter). For $w=0$, we recover the standard conditional flow; for $w > 0$, the process is “pushed” more strongly toward $y$.
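
A minimal sketch of this combination; the function and argument names are illustrative, and `null_token` stands for whatever input the model uses for the unconditional branch (discussed in Section 8):

```python
def guided_velocity(v_theta, x, t, y, null_token, w):
    """Classifier-free-guided velocity: (1 + w) * v(x,t,y) - w * v(x,t,null).

    w = 0 recovers the plain conditional flow; larger w pushes harder
    toward the condition y.
    """
    v_cond = v_theta(x, t, y)             # conditional flow
    v_uncond = v_theta(x, t, null_token)  # unconditional flow
    return (1.0 + w) * v_cond - w * v_uncond
```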

3.1. Motivation: Log-Probability Gradient Interpretation

Under suitable conditions (and up to the time-dependent reparameterization that relates velocity fields to score functions), the optimal flow for transporting $p_0(x)$ to $p_1(x|y)$ can be identified with the conditional score:

\[v^*(x, t, y) = \nabla_x \log p_t(x|y)\]

Similarly, the unconditional flow is:

\[v^*(x, t, \varnothing) = \nabla_x \log p_t(x)\]

The difference between the conditional and unconditional flows is:

\[v^*(x, t, y) - v^*(x, t, \varnothing) = \nabla_x \log \frac{p_t(x|y)}{p_t(x)} = \nabla_x \log p_t(y|x)\]

Thus, the guidance term is proportional to the gradient of the log-probability of $y$ given $x$ at time $t$.

3.2. Guided Flow as a Log-Posterior Gradient

The guided flow can be written as:

\[\begin{align*} v_{\text{guided}}(x, t, y) &= v_\theta(x, t, y) + w \left[ v_\theta(x, t, y) - v_\theta(x, t, \varnothing) \right] \\ &\approx \nabla_x \log p_t(x|y) + w \nabla_x \log p_t(y|x) \\ &= \nabla_x \log \left[ p_t(x|y) \cdot p_t(y|x)^w \right] \\ &= \nabla_x \log \left[ p_t(x|y)^{1+w} \cdot p_t(x)^{-w} \right] \end{align*}\]

This shows that classifier-free guidance amplifies the conditional likelihood $p_t(x|y)$ relative to the marginal $p_t(x)$, biasing the generative process toward samples more likely under the condition $y$.

4. Training Objective for Classifier-Free Guidance in Flow Matching

During training, the model $v_\theta(x, t, y)$ is optimized to match the target flow for both conditional and unconditional cases. A typical loss is:

\[\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, x_1, y} \left[ \lambda_{\text{cond}} \| v_\theta(x, t, y) - v^*_{\text{cond}}(x, t, y) \|^2 + \lambda_{\text{uncond}} \| v_\theta(x, t, \varnothing) - v^*_{\text{uncond}}(x, t) \|^2 \right]\]

where $v^*_{\text{cond}}$ and $v^*_{\text{uncond}}$ are the target conditional and unconditional flows (e.g., obtained from data or analytic forms), and $\lambda_{\text{cond}}$, $\lambda_{\text{uncond}}$ are weighting coefficients.
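
In practice the two terms are usually realized by a single network trained with random condition dropping, as described in Section 2.1. The sketch below assumes a linear interpolation path (target velocity $x_1 - x_0$), integer class labels, and a reserved label index `null_token` for the unconditional case; all of these are illustrative choices.

```python
import torch

def cfg_flow_matching_loss(v_theta, x1, y, null_token, p_uncond=0.1):
    """Flow-matching loss with random condition dropping (sketch).

    x1: data batch of shape (batch, dim); y: integer labels of shape (batch,);
    null_token: reserved integer index for the unconditional branch.
    """
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    # Replace y by the null token with probability p_uncond, per sample.
    drop = torch.rand(y.shape[0]) < p_uncond
    y_in = torch.where(drop, torch.full_like(y, null_token), y)
    return ((v_theta(xt, t, y_in) - target) ** 2).mean()
```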

5. Sampling Algorithm with Classifier-Free Guidance

Given a trained model, sampling proceeds by integrating the guided ODE:

  1. Initialize: $x_0 \sim p_0(x)$
  2. For $t$ from $0$ to $1$:
    • Compute $v_\theta(x, t, y)$ and $v_\theta(x, t, \varnothing)$
    • Form $v_{\text{guided}}(x, t, y)$ as above
    • Update $x$ via ODE step: $x \leftarrow x + v_{\text{guided}}(x, t, y) \, dt$
  3. Return: $x_1$ as the generated sample conditioned on $y$
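
A minimal sampler following these steps, using plain Euler integration and the nulltoken convention for the unconditional branch; the interface and step count are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_cfg(v_theta, y, null_token, shape, w=2.0, n_steps=100):
    """Integrate the guided ODE from t = 0 to t = 1 with Euler steps (sketch).

    y: integer labels of shape (batch,); shape: state shape, e.g. (batch, dim).
    """
    x = torch.randn(shape)                     # x(0) ~ p0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0], 1), i * dt)
        v_cond = v_theta(x, t, y)
        v_uncond = v_theta(x, t, torch.full_like(y, null_token))
        v_guided = (1.0 + w) * v_cond - w * v_uncond
        x = x + v_guided * dt                  # Euler update
    return x                                   # approximate sample from p1(x | y)
```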

6. Summary Table

| Symbol | Meaning |
| --- | --- |
| $v_\theta(x, t, y)$ | Conditional flow (learned) |
| $v_\theta(x, t, \varnothing)$ | Unconditional flow (learned) |
| $w$ | Guidance weight (hyperparameter) |
| $v_{\text{guided}}(x, t, y)$ | Guided flow used for sampling |
| $p_t(x \mid y)$ | Conditional distribution at time $t$ |
| $p_t(x)$ | Marginal distribution at time $t$ |

7. Key Takeaways

  • Classifier-free guidance in flow matching is a principled way to steer generative models toward desired conditions by combining conditional and unconditional flows.
  • The guidance mechanism can be interpreted as amplifying the gradient of the log-posterior of the condition, biasing samples toward higher likelihood under $y$.
  • The method is fully differentiable and does not require an external classifier, making it efficient and easy to implement in flow-based generative models.

References:

  • Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. arXiv:2207.12598
  • Lipman, Y., et al. (2022). Flow Matching for Generative Modeling. arXiv:2210.02747
  • Song, Y., et al. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456

8. The “Nulltoken” Trick: Introduction and Mathematical Pitfalls

How the Nulltoken is Introduced

In classifier-free guidance for conditional generative models (including flow matching and diffusion), the “nulltoken” trick is a practical device for implementing unconditional generation within a unified model architecture. The idea is as follows:

  • The model $v_\theta(x, t, y)$ is trained to accept a condition $y$ (e.g., a class label, text prompt, or other side information).
  • To enable unconditional generation, a special “null” or “empty” token (often denoted as $\varnothing$ or a reserved embedding) is introduced.
  • During training, with some probability, the condition $y$ is replaced by the nulltoken, and the model is trained to predict the unconditional flow $v_\theta(x, t, \varnothing)$.
  • At sampling time, both $v_\theta(x, t, y)$ and $v_\theta(x, t, \varnothing)$ are evaluated by passing either the actual condition or the nulltoken, and combined for guidance.

This approach is simple and effective in practice, as it allows a single model to handle both conditional and unconditional cases by overloading the conditioning input.
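
One common way to realize this (a sketch, not the only possible implementation) is to reserve one extra embedding row for the nulltoken and swap it in whenever the condition is dropped:

```python
import torch
import torch.nn as nn

class ConditionEmbedding(nn.Module):
    """Label embedding with a reserved nulltoken index (sketch)."""

    def __init__(self, num_classes, dim):
        super().__init__()
        self.null_idx = num_classes                  # reserved index for the nulltoken
        self.embed = nn.Embedding(num_classes + 1, dim)

    def forward(self, y, drop_mask=None):
        # drop_mask: optional boolean tensor marking samples whose condition
        # is replaced by the nulltoken (used during training and for the
        # unconditional branch at sampling time).
        if drop_mask is not None:
            y = torch.where(drop_mask, torch.full_like(y, self.null_idx), y)
        return self.embed(y)
```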

Why the Nulltoken is Mathematically Incorrect

While the nulltoken trick is widely used, it is not mathematically principled from the perspective of probability theory and conditional modeling. The core issue is that the nulltoken does not correspond to a well-defined marginalization over the condition. Let’s see why.

1. True Marginalization vs. Nulltoken
  • True unconditional flow: The correct unconditional flow should be \(v^*_{\text{uncond}}(x, t) = \mathbb{E}_{y \sim p(y)} \left[ v^*_{\text{cond}}(x, t, y) \right]\) That is, the unconditional flow is the expectation of the conditional flow over the true data distribution of $y$.

  • Nulltoken flow: In the nulltoken trick, the unconditional flow is defined as \(v_\theta(x, t, \varnothing)\) where $\varnothing$ is a special token, not a sample from $p(y)$.

2. The Mathematical Error

The nulltoken does not represent marginalization over $y$:

  • The model is never exposed to actual samples $y$ when the nulltoken is used; it is only trained to map the nulltoken to some “unconditional” behavior.
  • There is no guarantee that $v_\theta(x, t, \varnothing) = \mathbb{E}_{y \sim p(y)}[v_\theta(x, t, y)]$.
  • In fact, the nulltoken is an out-of-distribution input: it is not part of the true support of $y$.
3. Consequences
  • The unconditional flow $v_\theta(x, t, \varnothing)$ may not match the true marginal flow, and may even be inconsistent with the conditional flows.
  • The guidance formula \(v_{\text{guided}}(x, t, y) = (1 + w) v_\theta(x, t, y) - w v_\theta(x, t, \varnothing)\) is not guaranteed to correspond to the gradient of the log-conditional density, except in the special case where $v_\theta(x, t, \varnothing)$ exactly matches the marginal flow.
4. Example: What Would Be Correct?

The mathematically correct unconditional flow is: \(v^*_{\text{uncond}}(x, t) = \mathbb{E}_{y \sim p(y)} [v^*_{\text{cond}}(x, t, y)]\) But the nulltoken flow is: \(v_\theta(x, t, \varnothing) \neq v^*_{\text{uncond}}(x, t)\) unless the model is specifically trained so that $v_\theta(x, t, \varnothing)$ matches the marginalization, which is not generally the case.

Summary Table: Nulltoken vs. True Marginalization

| Approach | What is computed? | Is it correct? |
| --- | --- | --- |
| True marginal | $\mathbb{E}_{y \sim p(y)}[v_\theta(x, t, y)]$ | Yes |
| Nulltoken trick | $v_\theta(x, t, \varnothing)$ (special token input) | No |

Practical Implications

  • The nulltoken trick is a pragmatic hack: it works well empirically, but is not theoretically justified.
  • For most applications, the error is small if the model is sufficiently expressive and the nulltoken is handled carefully.
  • For applications requiring strict probabilistic correctness, the nulltoken trick should be replaced by explicit marginalization or other principled approaches.

Mathematical Derivation: Nulltoken as Measure-Zero Extension

Let $Y$ denote the support of the true label distribution $p(y)$, i.e., $Y = \{y : p(y) > 0\}$. The nulltoken trick augments $Y$ with a special symbol $\varnothing \notin Y$, so the model is defined on $Y' = Y \cup \{\varnothing\}$.

The true unconditional flow is \(v^*_{\text{uncond}}(x, t) = \int_Y v^*_{\text{cond}}(x, t, y)\, p(y)\, dy\) where the integral is over the support $Y$ of $p(y)$.

The nulltoken trick defines \(v_\theta(x, t, \varnothing)\) where, by construction, $p(y = \varnothing) = 0$.

Mathematically, this is equivalent to extending the measure $p(y)$ to a new measure $\tilde{p}(y)$ on $Y’$ such that \(\tilde{p}(A) = p(A \cap Y) + \alpha \cdot \mathbb{I}[\varnothing \in A]\) where $\alpha = 0$ and $\mathbb{I}$ is the indicator function. Thus, the nulltoken is a measure-zero singleton: \(\tilde{p}(\{\varnothing\}) = 0\)

If we now define the “unconditional” flow as \(v^*_{\text{uncond}}(x, t) = \int_{Y'} v^*_{\text{cond}}(x, t, y)\, d\tilde{p}(y)\) then \(v^*_{\text{uncond}}(x, t) = \int_Y v^*_{\text{cond}}(x, t, y)\, p(y)\, dy + v^*_{\text{cond}}(x, t, \varnothing) \cdot \tilde{p}(\{\varnothing\}) = \int_Y v^*_{\text{cond}}(x, t, y)\, p(y)\, dy\) since $\tilde{p}(\{\varnothing\}) = 0$.

Interpretation:

  • The nulltoken trick is mathematically equivalent to assigning all measure-zero subsets of $Y'$ (including $\{\varnothing\}$) to the nulltoken.
  • In effect, the model is trained to produce a flow for an event of probability zero, and then this flow is used as a proxy for the true marginalization over $y$.
  • This is only justified if $v_\theta(x, t, \varnothing) = \int_Y v_\theta(x, t, y)\, p(y)\, dy$, which is not guaranteed in general.

Conclusion:

  • The nulltoken trick corresponds to treating the nulltoken as the representative of all measure-zero events in $y$.
  • For strict probabilistic applications, this is not a valid marginalization, and the correct approach is to integrate over the true support $Y$ of $p(y)$.

Correct Marginalization: Monte Carlo Simulation

The theoretically correct way to obtain the unconditional flow is to explicitly marginalize over $y$ according to its distribution $p(y)$. This can be achieved via Monte Carlo (MC) simulation:

  1. Sample $y^{(1)}, \ldots, y^{(N)}$ independently from $p(y)$.
  2. Compute the conditional flows $v_\theta(x, t, y^{(i)})$ for each sample.
  3. Estimate the unconditional flow as the empirical average: \(v^*_{\text{uncond}}(x, t) \approx \frac{1}{N} \sum_{i=1}^N v_\theta(x, t, y^{(i)})\)

As $N \to \infty$, this converges to the true marginalization (see the sketch below): \(v^*_{\text{uncond}}(x, t) = \mathbb{E}_{y \sim p(y)}[v_\theta(x, t, y)] = \int_Y v_\theta(x, t, y)\, p(y)\, dy\)
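
A sketch of this estimator, following the three steps above; `sample_y(batch_size)` is an assumed sampler for the prior $p(y)$ and is not specified in the text.

```python
import torch

def mc_unconditional_flow(v_theta, x, t, sample_y, n_samples=16):
    """Monte Carlo estimate of the unconditional flow (sketch):
    average conditional flows over labels drawn from p(y)."""
    estimates = []
    for _ in range(n_samples):
        y = sample_y(x.shape[0])                # y^(i) ~ p(y)
        estimates.append(v_theta(x, t, y))      # conditional flow at y^(i)
    return torch.stack(estimates).mean(dim=0)   # empirical average over N samples
```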

Challenge:
This approach requires knowledge of the true label distribution $p(y)$. In practice, $p(y)$ may be unknown, ill-defined, or difficult to estimate, especially for complex, continuous, or structured $y$.

Mathematical Proposals to Address the Requirement of $p(y)$

To overcome the challenge of requiring $p(y)$, several mathematically principled strategies can be considered:

1. Empirical Distribution

If a dataset $\{y^{(j)}\}_{j=1}^M$ is available, use the empirical distribution: \(\hat{p}(y) = \frac{1}{M} \sum_{j=1}^M \delta(y - y^{(j)})\) Then, \(v^*_{\text{uncond}}(x, t) \approx \frac{1}{M} \sum_{j=1}^M v_\theta(x, t, y^{(j)})\)

2. Uniform or Heuristic Prior

If $Y$ is finite or can be discretized, use a uniform prior: \(p_{\text{unif}}(y) = \frac{1}{|Y|} \mathbb{I}[y \in Y]\) so that \(v^*_{\text{uncond}}(x, t) = \frac{1}{|Y|} \sum_{y \in Y} v_\theta(x, t, y)\) For continuous $y$, use a heuristic prior $p_{\text{heur}}(y)$, e.g., a Gaussian or other parametric form.

3. Learned or Estimated Prior

Estimate $p(y)$ from data using a model $\hat{p}_\phi(y)$ (e.g., a density estimator or generative model): \(v^*_{\text{uncond}}(x, t) = \int_Y v_\theta(x, t, y)\, \hat{p}_\phi(y)\, dy \approx \frac{1}{N} \sum_{i=1}^N v_\theta(x, t, y^{(i)}), \quad y^{(i)} \sim \hat{p}_\phi(y)\)

4. Variational or Importance-Weighted Marginalization

If $y$ is latent or $p(y)$ is intractable, use an approximate posterior $q(y|x)$ and importance weights: \(v^*_{\text{uncond}}(x, t) = \int_Y v_\theta(x, t, y)\, p(y)\, dy = \int_Y v_\theta(x, t, y)\, \frac{p(y)}{q(y|x)} q(y|x)\, dy\) Estimate via importance sampling: \(v^*_{\text{uncond}}(x, t) \approx \frac{1}{N} \sum_{i=1}^N v_\theta(x, t, y^{(i)})\, w^{(i)}, \quad y^{(i)} \sim q(y|x), \quad w^{(i)} = \frac{p(y^{(i)})}{q(y^{(i)}|x)}\)
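
A sketch of the importance-sampling estimator; `sample_q`, `log_p`, and `log_q` are assumed callables for the proposal $q(y|x)$ and the two log-densities, none of which are specified in the text.

```python
import torch

def is_unconditional_flow(v_theta, x, t, sample_q, log_p, log_q, n_samples=16):
    """Importance-weighted estimate of the unconditional flow (sketch):
    draw y ~ q(y|x) and reweight by w = p(y) / q(y|x)."""
    est = torch.zeros_like(x)
    for _ in range(n_samples):
        y = sample_q(x)                                # y^(i) ~ q(y | x)
        w = torch.exp(log_p(y) - log_q(y, x))          # importance weight p(y)/q(y|x)
        est = est + w.unsqueeze(-1) * v_theta(x, t, y)
    return est / n_samples
```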

5. Mixture or Ensemble Approximation

Select a set of representative labels $\{y^{(l)}\}_{l=1}^L$ and assign weights $\alpha_l$ (e.g., based on frequency or relevance), with $\sum_{l=1}^L \alpha_l = 1$: \(v^*_{\text{uncond}}(x, t) \approx \sum_{l=1}^L \alpha_l\, v_\theta(x, t, y^{(l)})\)

6. Task-Specific or Redefined Marginalization

Redefine the “unconditional” flow to marginalize over a subset $\tilde{Y} \subseteq Y$ or according to a task-specific prior $p_{\text{task}}(y)$: \(v^*_{\text{uncond}}(x, t) = \int_{\tilde{Y}} v_\theta(x, t, y)\, p_{\text{task}}(y)\, dy\)


Summary:
While the nulltoken trick is a practical shortcut, Monte Carlo marginalization is the correct and principled method for obtaining the unconditional flow in classifier-free guidance: \(v^*_{\text{uncond}}(x, t) = \mathbb{E}_{y \sim p(y)}[v_\theta(x, t, y)]\) However, its application is often limited by the difficulty of specifying or sampling from the true $p(y)$. The above mathematical strategies provide principled ways to approximate or replace $p(y)$ in practice, depending on the available data and the requirements of the application.

9. Guidance Weight as an Explicit Input Parameter

A principled extension of classifier-free guidance is to treat the guidance weight $w$ as an explicit input to the flow model. That is, we define a family of flows parameterized by $w$: \(v_\theta(x, t, y, w) := (1 + w)\, v_\theta(x, t, y) - w\, v^*_{\text{uncond}}(x, t)\) where $w \in \mathbb{R}$ is a continuous parameter controlling the strength of guidance, and $v^*_{\text{uncond}}(x, t)$ is the unconditional flow.

9.1. Special Cases

  • No guidance ($w = 0$): \(v_\theta(x, t, y, 0) = v_\theta(x, t, y)\), i.e., the standard conditional flow for condition $y$.

  • Standard classifier-free guidance ($w > 0$): \(v_\theta(x, t, y, w) = v_\theta(x, t, y) + w \left[ v_\theta(x, t, y) - v^*_{\text{uncond}}(x, t) \right]\)

  • Unconditional flow ($w = 0$, $y = \varnothing$): if a nulltoken input is retained, \(v_\theta(x, t, \varnothing, 0) = v^*_{\text{uncond}}(x, t)\)

9.2. 1-Sample Monte Carlo Estimation of the Unconditional Flow

In practice, the unconditional flow $v^*_{\text{uncond}}(x, t)$ is often intractable to compute exactly, especially for complex or continuous $y$. A simple and efficient approximation is to use a 1-sample Monte Carlo (MC) estimate: \(v^*_{\text{uncond}}(x, t) \approx v_\theta(x, t, y'), \qquad y' \sim p(y)\) That is, for each data point $(x, t)$, we sample a label $y'$ from the prior $p(y)$ and use $v_\theta(x, t, y')$ as a stochastic estimate of the unconditional flow.

This approach is unbiased in expectation: \(\mathbb{E}_{y' \sim p(y)}[v_\theta(x, t, y')] = v^*_{\text{uncond}}(x, t)\) and is simple to implement in minibatch training.

9.3. Practical Implementation

  • During training, for each data point $(x, t, y)$, sample a random $w \sim p(w)$ (e.g., uniform on $[0, w_{\max}]$).
  • Independently sample $y' \sim p(y)$ for the 1-sample MC estimate of the unconditional flow.
  • Concatenate $w$ (and $y$) as inputs to the neural network parameterizing $v_\theta$ (a minimal sketch follows).
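
A minimal sketch of a $w$-conditioned velocity network for vector-valued data; the MLP architecture, the dimensions, and the choice to feed $w$ as an extra scalar input next to $t$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GuidanceConditionedFlow(nn.Module):
    """Velocity network v_theta(x, t, y, w) with w as an explicit input (sketch)."""

    def __init__(self, x_dim, num_classes, hidden=256, emb=64):
        super().__init__()
        self.y_embed = nn.Embedding(num_classes, emb)
        self.net = nn.Sequential(
            nn.Linear(x_dim + 2 + emb, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, t, y, w):
        # x: (batch, x_dim); t, w: (batch, 1); y: (batch,) integer labels.
        return self.net(torch.cat([x, t, w, self.y_embed(y)], dim=-1))
```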

10. Loss Function for Guidance-Parameterized Flow with 1-Sample MC Unconditional

To train the model to produce correct flows for all values of $w$, we propose a guidance-weighted loss function using the 1-sample MC estimate for the unconditional flow. For each training sample $(x, t, y)$, sample $w \sim p(w)$ and $y' \sim p(y)$, and define:

\[v^*_{\text{target}}(x, t, y, w, y') = (1 + w)\, v^*_{\text{cond}}(x, t, y) - w\, v_\theta(x, t, y')\]

where $v^*_{\text{cond}}(x, t, y)$ is the target conditional flow (e.g., from score matching or supervision), and $v_\theta(x, t, y')$ is the 1-sample MC estimate of the unconditional flow.

The loss is:

\[\mathcal{L}_{\text{guidance-MC}}(\theta) = \mathbb{E}_{(x, t, y),\, w \sim p(w),\, y' \sim p(y)} \left[ \left\| v_\theta(x, t, y, w) - v^*_{\text{target}}(x, t, y, w, y') \right\|^2 \right]\]

Remarks:

  • This loss encourages the model to learn the correct flow for any guidance strength $w$, using a stochastic but unbiased estimate of the unconditional flow.
  • The expectation over $w$ and $y'$ can be implemented by sampling for each training batch.
  • This approach is scalable and practical for large or continuous label spaces.
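
A sketch of one training step with this loss; it assumes the linear-path target $v^*_{\text{cond}} = x_1 - x_0$, a model with the $(x, t, y, w)$ interface of Section 9 (evaluated at $w = 0$ to obtain the plain conditional flow at $y'$), an assumed prior sampler `sample_y`, and a stop-gradient on the MC term so it acts as a fixed regression target.

```python
import torch

def guidance_mc_loss(model, x1, y, sample_y, w_max=4.0):
    """Guidance-weighted loss with a 1-sample MC unconditional estimate (sketch)."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)
    t = torch.rand(b, 1)
    xt = (1 - t) * x0 + t * x1
    v_cond_target = x1 - x0                   # target conditional flow (linear path)
    w = torch.rand(b, 1) * w_max              # guidance weight  w ~ Uniform[0, w_max]
    y_prime = sample_y(b)                     # y' ~ p(y), independent of y
    with torch.no_grad():                     # treat the MC term as a fixed target
        v_uncond_mc = model(xt, t, y_prime, torch.zeros(b, 1))  # w = 0: plain flow at y'
    target = (1 + w) * v_cond_target - w * v_uncond_mc
    pred = model(xt, t, y, w)
    return ((pred - target) ** 2).mean()
```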

Summary:
By introducing $w$ as an explicit input and using a 1-sample MC estimate for the unconditional flow, the model can flexibly and robustly generate flows for any desired guidance strength, including $w = 0$ (unguided) and $w > 0$ (guided), without requiring explicit marginalization over $y$.


11. Finetuning Pre-Trained Models to Reduce Guidance Bias

Pre-trained models using the nulltoken trick or other approximations for the unconditional flow may exhibit bias in the guidance direction, especially for large $w$. To mitigate this, we can finetune such models using the guidance-weighted loss above, leveraging the 1-sample MC estimator for the unconditional flow.

11.1. Finetuning Procedure

  1. Start from a pre-trained model $v_{\theta_0}$ (e.g., trained with nulltoken or standard classifier-free guidance).
  2. Freeze or partially freeze some layers if desired, or allow full finetuning.
  3. For each minibatch:
    • Sample $(x, t, y)$ from the data.
    • Sample $w \sim p(w)$ and $y' \sim p(y)$.
    • Compute the 1-sample MC estimate $v_\theta(x, t, y')$ for the unconditional flow.
    • Compute the guidance-weighted target $v^*_{\text{target}}(x, t, y, w, y')$.
    • Compute the loss $\mathcal{L}_{\text{guidance-MC}}$ and update $\theta$.
  4. Repeat for several epochs or until convergence.

11.2. Advantages

  • Reduces bias: The model learns to produce flows that are correct for arbitrary $w$, not just $w=0$ or the nulltoken approximation.
  • Improves sample quality: Especially for large guidance weights, where bias can degrade generation.
  • Minimal compute: Only requires sampling $y'$ per batch; no need for full marginalization.

11.3. Optional: Regularization Toward Pre-Trained Solution

To preserve useful features from the pre-trained model, add a regularization term:

\[\mathcal{L}_{\text{finetune}}(\theta) = \mathcal{L}_{\text{guidance-MC}}(\theta) + \lambda_{\text{reg}} \left\| v_\theta - v_{\theta_0} \right\|^2\]

where $\lambda_{\text{reg}}$ controls the strength of regularization.
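
A sketch of one finetuning step combining the guidance-MC loss with this regularizer; it reuses the `guidance_mc_loss` sketch above, assumes the pre-trained model has been wrapped to the same $(x, t, y, w)$ interface, and evaluates the otherwise abstract norm $\| v_\theta - v_{\theta_0} \|^2$ on the current batch, all of which are implementation assumptions.

```python
import torch

def finetune_step(model, pretrained, x1, y, sample_y, optimizer,
                  w_max=4.0, lam_reg=0.1):
    """One finetuning step: guidance-MC loss + output-space regularization (sketch)."""
    loss = guidance_mc_loss(model, x1, y, sample_y, w_max=w_max)
    # Regularize toward the frozen pre-trained flow on the current batch.
    b = x1.shape[0]
    x0 = torch.randn_like(x1)
    t = torch.rand(b, 1)
    xt = (1 - t) * x0 + t * x1
    w = torch.rand(b, 1) * w_max
    with torch.no_grad():
        ref = pretrained(xt, t, y, w)         # v_{theta_0}, kept fixed
    reg = ((model(xt, t, y, w) - ref) ** 2).mean()
    total = loss + lam_reg * reg
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```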


Summary:
Finetuning pre-trained models with the guidance-weighted loss and 1-sample MC unconditional flow enables principled reduction of guidance bias, leading to more accurate and robust conditional generation across a range of guidance strengths.