Model Predictive Path Integral (MPPI)
MPPI Derivation
Consider a stochastic differential equation (SDE) describing the system dynamics:
\[dx_t = (f(x_t, t) + G(x_t, t)u_t)dt + B(x_t, t) d\omega_t\]where:
- $ x_t \in \mathbb{R}^n $ is the state,
- $ u_t \in \mathbb{R}^m $ is the control input,
- $ f(x_t, t) $ is the drift (uncontrolled dynamics),
- $ G(x_t, t) $ is the control matrix,
- $ B(x_t, t) $ is the diffusion matrix,
- $ d\omega_t $ is the increment of a standard Wiener process.
We analyze this from a functional analytic perspective by regarding $f$, $G$, and $B$ as elements of function spaces. We formalize $f$, $G$, and $B$ as mappings into spaces of linear operators, making the functional analytic structure explicit:
\[\begin{align*} &f: \mathbb{R}^n \times [0, T] \to \mathbb{R}^n, \qquad f \in C^k(\mathbb{R}^n \times [0, T]; \mathbb{R}^n) \\ &G: \mathbb{R}^n \times [0, T] \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n), \qquad G \in C^k(\mathbb{R}^n \times [0, T]; \mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)) \\ &B: \mathbb{R}^n \times [0, T] \to \mathcal{L}(\mathbb{R}^p, \mathbb{R}^n), \qquad B \in C^k(\mathbb{R}^n \times [0, T]; \mathcal{L}(\mathbb{R}^p, \mathbb{R}^n)) \end{align*}\]where:
- $\mathbb{R}^n$ is the state space, and $\mathcal{X} = C([0, T]; \mathbb{R}^n)$ denotes the Banach space of continuous state trajectories in which the sample paths live,
- $\mathcal{L}(\mathbb{R}^m, \mathbb{R}^n)$ is the space of linear operators (matrices) from $\mathbb{R}^m$ to $\mathbb{R}^n$,
- $C^k$ denotes the space of $k$-times continuously differentiable functions.
Thus, $G$ and $B$ are functions that, for each $(x, t)$, return a linear map (matrix) acting on the control or noise space, i.e., they are mappings into linear function spaces.
Cost Functional
We define the cost functional as a mapping on the space of admissible control functions $u \in \mathcal{U}$, where $\mathcal{U}$ is typically a suitable Banach space of measurable functions $u: [0, T] \to \mathbb{R}^m$. The cost functional $J$ is given by:
\[J(u) = \mathbb{E} \left[ \phi(x_T) + \int_0^T q(x_t, u_t, t)\,dt \right]\]where:
- $q: \mathbb{R}^n \times \mathbb{R}^m \times [0, T] \to \mathbb{R}$ is the running cost, assumed to be continuous (or $C^k$) in its arguments,
- $\phi: \mathbb{R}^n \to \mathbb{R}$ is the terminal cost, also assumed to be continuous (or $C^k$),
- the expectation $\mathbb{E}$ is taken over the probability measure induced by the SDE dynamics and the control $u$.
In the language of functional analysis, $J$ is a functional:
\[J: \mathcal{U} \to \mathbb{R}\]mapping each admissible control function $u$ to a real number representing the expected total cost. The optimal control problem is then to find $u^* \in \mathcal{U}$ such that
\[u^* = \arg\min_{u \in \mathcal{U}} J(u)\]subject to the SDE constraint on the state trajectory $x_t$.
This perspective highlights that the control problem is an optimization over a function space, with $J$ acting as a (possibly nonlinear) operator on $\mathcal{U}$, and the dynamics and cost are encoded as functionals on the path space $\mathcal{X}$ and control space $\mathcal{U}$.
Hamilton-Jacobi-Bellman Equation
The Hamilton-Jacobi-Bellman (HJB) equation provides a fundamental characterization of the value function in optimal control theory. In the context of stochastic control, the value function $V(x, t)$ represents the minimal expected cost-to-go, starting from state $x$ at time $t$ and optimizing over all admissible control functions $u$:
\[V(x, t) = \inf_{u \in \mathcal{U}} \mathbb{E} \left[ \phi(x_T) + \int_t^T q(x_s, u_s, s)\,ds \;\Big|\; x_t = x \right]\]where the expectation is taken over the stochastic trajectories induced by the SDE dynamics under control $u$.
The HJB equation is a (typically nonlinear) partial differential equation (PDE) that the value function $V$ must satisfy. For the controlled SDE described above, the HJB equation takes the form:
\[\begin{align*} - \frac{\partial V}{\partial t}(x, t) &= \inf_{u \in \mathbb{R}^m} \Bigg\{ q(x, u, t) + \nabla_x V(x, t)^\top \left[ f(x, t) + G(x, t) u \right] \\ &\qquad\qquad + \frac{1}{2} \operatorname{Tr} \left[ B(x, t) B(x, t)^\top \nabla_x^2 V(x, t) \right] \Bigg\} \\ V(x, T) &= \phi(x) \end{align*}\]Here, $\nabla_x V$ and $\nabla_x^2 V$ denote the gradient and Hessian of $V$ with respect to $x$, and $\operatorname{Tr}$ is the trace operator.
Example (PDE language):
For a deterministic system (no noise, $B \equiv 0$), the HJB equation reduces to the first-order PDE
\[- \frac{\partial V}{\partial t}(x, t) = \inf_{u \in \mathbb{R}^m} \left\{ q(x, u, t) + \nabla_x V(x, t)^\top \left[ f(x, t) + G(x, t) u \right] \right\}, \qquad V(x, T) = \phi(x).\]
For a stochastic system, the second-order term involving the Hessian appears, reflecting the diffusion.
Functional Analytic Perspective:
The value function $V$ is a mapping $V: \mathbb{R}^n \times [0, T] \to \mathbb{R}$, and the HJB equation is a PDE operator acting on this function. The optimal control problem, as described in the previous section, is thus recast as finding a function $V$ that solves the HJB PDE with the appropriate terminal condition.
The stochastic value function encodes the minimal expected cost, and the HJB equation provides a necessary condition for optimality. The optimal control $u^*(x, t)$ at each point is obtained by minimizing the right-hand side of the HJB equation with respect to $u$.
This connection between the cost functional, the value function, and the HJB equation is central to both the theory and numerical methods for stochastic optimal control, including sampling-based approaches like MPPI, which approximate the value function and its minimizer through trajectory sampling rather than direct PDE solution.
Quadratic Action Cost
From a functional analytic perspective, the quadratic action cost can be viewed as a functional on the space of admissible control functions. Instead of writing the action cost as $u^\top R u$ for a matrix $R$, we generalize to the case where $R$ is a positive definite operator or kernel, allowing for richer dependencies (e.g., time, state, or even nonlocal effects).
Let $u \in \mathcal{U}$ be a control function, and let $R$ be a symmetric, positive definite kernel $R: [0,T] \times [0,T] \to \mathbb{R}^{m \times m}$. The quadratic action cost over a time interval $[0,T]$ is then given by the bilinear form:
\[\mathcal{A}[u] = \frac{1}{2} \int_0^T \int_0^T u_t^\top R(t, s) u_s \, ds \, dt\]This defines a quadratic functional on the space of control functions, generalizing the standard $u^\top R u$ to possibly time-varying or even nonlocal (in time) penalties.
In the running cost, we can write:
\[q(x, u, t) = q_{\text{state}}(x, t) + \frac{1}{2} \int_0^T u_t^\top R(t, s) u_s \, ds\]where $q_{\text{state}}(x, t)$ is the state-dependent cost, and the second term is the action cost as a quadratic form induced by the kernel $R$ (integrating this term over $t$ recovers $\mathcal{A}[u]$).
Special case:
If $R(t, s) = r(t)\, \delta(t-s)\, I$, where $r(t) > 0$ and $\delta$ is the Dirac delta, we recover the standard pointwise quadratic cost:
\[\mathcal{A}[u] = \frac{1}{2} \int_0^T r(t)\, \|u_t\|^2 \, dt,\]i.e., the running-cost contribution at time $t$ is simply $\frac{1}{2} r(t) \|u_t\|^2$.
Summary:
- The action cost is a quadratic functional on the control space, defined via a positive definite kernel $R$.
- This perspective allows for more general regularization and coupling between control values at different times.
- In practice, $R$ is often chosen to be diagonal in time (local), but the functional analytic view enables more general formulations.
Thus, the running cost $q$ in MPPI can be written as:
\[q(x, u, t) = q_{\text{state}}(x, t) + \frac{1}{2} \langle u, R u \rangle\]where $\langle u, R u \rangle$ denotes the quadratic form induced by the positive definite operator (kernel) $R$ on the control function $u$.
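To make the kernel form concrete, here is a minimal numpy sketch. It is illustrative only: the time discretization, the squared-exponential kernel, and all numerical values are assumptions, not part of the derivation. It evaluates $\mathcal{A}[u]$ for a discretized control signal and checks that a diagonal-in-time kernel reproduces the pointwise quadratic cost.

```python
import numpy as np

def action_cost(u, R_kernel, dt):
    """Discretized quadratic action cost 0.5 * sum_{k,l} u_k^T R(t_k, t_l) u_l * dt^2.

    u        : (N, m) array of control samples u_k = u(t_k)
    R_kernel : callable (t, s) -> (m, m) positive definite matrix
    dt       : time step
    """
    N, m = u.shape
    t = np.arange(N) * dt
    total = 0.0
    for k in range(N):
        for l in range(N):
            total += u[k] @ R_kernel(t[k], t[l]) @ u[l]
    return 0.5 * total * dt * dt

def rbf_kernel(t, s, ell=0.1, r=1.0, m=1):
    """Illustrative nonlocal kernel: squared-exponential correlation in time, times identity."""
    return r * np.exp(-0.5 * (t - s) ** 2 / ell ** 2) * np.eye(m)

def local_kernel(t, s, r=1.0, dt=0.01, m=1):
    """Grid approximation of the local case R(t,s) = r * delta(t-s) * I (weight r/dt on the diagonal)."""
    return (r / dt) * np.eye(m) if abs(t - s) < 1e-12 else np.zeros((m, m))

if __name__ == "__main__":
    dt, N, m = 0.01, 100, 1
    u = np.ones((N, m))                       # constant control over a horizon of T = 1
    print("nonlocal kernel cost:", action_cost(u, lambda t, s: rbf_kernel(t, s, m=m), dt))
    print("local kernel cost   :", action_cost(u, lambda t, s: local_kernel(t, s, dt=dt, m=m), dt))
    # The local case reproduces 0.5 * integral of r * |u|^2 dt = 0.5 for r = 1, u = 1, T = 1.
```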
Deriving the Value Function PDE
Building on the operator (kernel) perspective for the quadratic action cost, we can now derive the value function PDE (the HJB equation) in this more general setting.
Recall the value function:
\[V(x, t) = \inf_{u \in \mathcal{U}} \mathbb{E} \left[ \int_t^T q(x_s, u_s, s) ds + \phi(x_T) \,\Big|\, x_t = x \right]\]where the running cost is
\[q(x, u, t) = q_{\text{state}}(x, t) + \frac{1}{2} \langle u, R u \rangle\]with $R$ a positive definite operator (kernel) on the control space.
The dynamic programming principle leads to the HJB equation:
\[- \frac{\partial V}{\partial t}(x, t) = \inf_{u} \left\{ q(x, u, t) + \nabla_x V(x, t)^\top [f(x, t) + G(x, t) u] + \frac{1}{2} \operatorname{Tr}\left[ B(x, t) B(x, t)^\top \nabla_x^2 V(x, t) \right] \right\}\]Operator-Quadratic Cost Minimization:
The minimization over $u$ now involves the operator $R$:
\[\inf_{u} \left\{ \frac{1}{2} \langle u, R u \rangle + \nabla_x V(x, t)^\top G(x, t) u \right\}\]This is a quadratic functional in $u$, and the minimizer $u^*$ is given by solving the first-order optimality condition:
\[R u^* + G(x, t)^\top \nabla_x V(x, t) = 0 \implies u^* = - R^{-1} G(x, t)^\top \nabla_x V(x, t)\]where $R^{-1}$ is the inverse operator (or kernel) of $R$.
Plugging back into the HJB equation:
Substitute $u^*$ into the HJB equation:
\[\begin{align*} - \frac{\partial V}{\partial t}(x, t) =\; & q_{\text{state}}(x, t) + \nabla_x V(x, t)^\top f(x, t) \\ & - \frac{1}{2} \left\langle G(x, t)^\top \nabla_x V(x, t),\, R^{-1} G(x, t)^\top \nabla_x V(x, t) \right\rangle \\ & + \frac{1}{2} \operatorname{Tr}\left[ B(x, t) B(x, t)^\top \nabla_x^2 V(x, t) \right] \end{align*}\]Here, the inner product $\langle \cdot, \cdot \rangle$ is defined by the operator $R^{-1}$, and the term
\[\left\langle G^\top \nabla_x V,\, R^{-1} G^\top \nabla_x V \right\rangle\]captures the generalization to nonlocal or time-varying quadratic costs.
Summary:
- The value function PDE (HJB) in the operator setting is structurally the same as the standard case, but the quadratic term involves the inverse operator $R^{-1}$.
- The optimal control is given by $u^* = - R^{-1} G^\top \nabla_x V$.
- This framework allows for general quadratic costs, including nonlocal and time-dependent penalties, by appropriate choice of $R$.
This operator-theoretic HJB equation forms the basis for advanced control and sampling-based algorithms like MPPI, where the cost structure can be tailored via the choice of $R$.
Ito’s Lemma
Let us now assume the kernel $R$ is local in time, i.e., $R(t, s) = R(t)\,\delta(t-s)$ with $R(t)$ a positive definite matrix (possibly time- and state-dependent). This will allow us to combine the quadratic terms in the PDE for the transformed function $\Phi(x, t)$ introduced below.
Introduce the exponential (log) transformation of the value function:
\[V(x, t) = -\lambda \log \Phi(x, t)\]where $\lambda > 0$ is a constant (temperature-like) parameter.
Ito’s Lemma
For a smooth function $h(x, t)$ and SDE $dx_t = a(x_t, t)\,dt + B(x_t, t)\,dW_t$,
\[dh(x_t, t) = \left( \frac{\partial h}{\partial t} + a^\top \nabla_x h + \frac{1}{2} \operatorname{Tr}\left[ B B^\top \nabla_x^2 h \right] \right) dt + (\nabla_x h)^\top B\, dW_t\]Expanding the HJB with Log Transformation
After minimizing over $u$, the HJB is:
\[- \frac{\partial V}{\partial t} = q_{\text{state}}(x, t) + \nabla_x V^\top f(x, t) - \frac{1}{2} \nabla_x V^\top G(x, t) R^{-1} G(x, t)^\top \nabla_x V + \frac{1}{2} \operatorname{Tr}\left[ B B^\top \nabla_x^2 V \right]\]With $V(x, t) = -\lambda \log \Phi(x, t)$, the derivatives are:
- $\frac{\partial V}{\partial t} = -\lambda \frac{1}{\Phi} \frac{\partial \Phi}{\partial t}$
- $\nabla_x V = -\lambda \frac{1}{\Phi} \nabla_x \Phi$
- $\nabla_x^2 V = -\lambda \frac{1}{\Phi} \nabla_x^2 \Phi + \lambda \frac{1}{\Phi^2} \nabla_x \Phi \nabla_x \Phi^\top$
Plugging these into the HJB and multiplying both sides by $\Phi/\lambda$ gives:
\[\begin{align*} - \frac{\partial \Phi}{\partial t} &= f^\top \nabla_x \Phi + \frac{1}{2} \operatorname{Tr}\left[ B B^\top \nabla_x^2 \Phi \right] - \frac{1}{\lambda} q_{\text{state}}(x, t) \Phi \\ &\quad + \frac{\lambda}{2} \frac{1}{\Phi^2} \nabla_x \Phi^\top G R^{-1} G^\top \nabla_x \Phi \, \Phi \\ &\quad - \frac{1}{2} \frac{1}{\Phi} \nabla_x \Phi^\top B B^\top \nabla_x \Phi \end{align*}\]Combining the Quadratic Terms
Both quadratic terms involve $\nabla_x \Phi^\top (\cdot) \nabla_x \Phi$. We can write:
\[\frac{\lambda}{2} \frac{1}{\Phi^2} \nabla_x \Phi^\top G R^{-1} G^\top \nabla_x \Phi \, \Phi = \frac{\lambda}{2} \frac{1}{\Phi} \nabla_x \Phi^\top G R^{-1} G^\top \nabla_x \Phi\]So, the two quadratic terms can be summed together:
\[\frac{1}{2} \frac{1}{\Phi} \nabla_x \Phi^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) \nabla_x \Phi\]Final PDE for $\Phi(x, t)$ (Linear Kernel Case)
The PDE simplifies to:
\[\boxed{ \begin{aligned} - \frac{\partial \Phi}{\partial t} =\;& f(x, t)^\top \nabla_x \Phi + \frac{1}{2} \operatorname{Tr}\left[ B(x, t) B(x, t)^\top \nabla_x^2 \Phi \right] - \frac{1}{\lambda} q_{\text{state}}(x, t) \Phi \\ &+ \frac{1}{2} \frac{1}{\Phi} \nabla_x \Phi^\top \left( \lambda G(x, t) R^{-1} G(x, t)^\top - B(x, t) B(x, t)^\top \right) \nabla_x \Phi \end{aligned} }\]Summary:
- For a linear (matrix) kernel $R$, the two quadratic gradient terms combine into a single quadratic form.
- The resulting PDE for $\Phi(x, t)$ is more compact and highlights the interplay between control authority ($G R^{-1} G^\top$) and noise ($B B^\top$).
- This form is the basis for path integral control and MPPI in the linear-quadratic setting.
This explicit PDE is the correct starting point for sampling-based control algorithms under the log transformation.
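As a sanity check, and anticipating the simplifying assumption $B B^\top = \lambda G R^{-1} G^\top$ referenced later in the FBSDE discussion, note that under this assumption the combined quadratic gradient term vanishes and the PDE becomes linear in $\Phi$, which is the classical setting where the Feynman-Kac formula applies directly:
\[- \frac{\partial \Phi}{\partial t} = f(x, t)^\top \nabla_x \Phi + \frac{1}{2} \operatorname{Tr}\left[ B(x, t) B(x, t)^\top \nabla_x^2 \Phi \right] - \frac{1}{\lambda} q_{\text{state}}(x, t)\, \Phi\]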
Forward-Backward Stochastic Differential Equations (FBSDE) Perspective
To connect the above PDE for $\Phi(x, t)$ to a Backward Stochastic Differential Equation (BSDE), we use the Feynman-Kac formula, which relates certain parabolic PDEs to expectations over stochastic processes.
1. The Forward SDE
Let $X_t$ evolve according to the uncontrolled dynamics:
\[dX_t = f(X_t, t)\,dt + B(X_t, t)\,dW_t\]where $W_t$ is a standard Brownian motion.
2. The BSDE Representation
We seek a pair of processes $(Y_t, Z_t)$ such that:
- $Y_t = \Phi(X_t, t)$,
- $Z_t = \nabla_x \Phi(X_t, t)^\top B(X_t, t)$.
The BSDE associated with the PDE is:
\[\begin{aligned} dY_t &= -\left[ -\frac{1}{\lambda} q_{\text{state}}(X_t, t) Y_t + \frac{1}{2} \frac{1}{Y_t} Z_t^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_t \right] dt + Z_t^\top dW_t \end{aligned}\]with terminal condition $Y_T = \exp\left(-\frac{1}{\lambda} q_{\text{final}}(X_T)\right)$.
3. Explicit BSDE to Solve
Summary of the explicit BSDE:
- Forward SDE: $dX_t = f(X_t, t)\,dt + B(X_t, t)\,dW_t$
- Backward SDE: $dY_t = -\left[ -\frac{1}{\lambda} q_{\text{state}}(X_t, t) Y_t + \frac{1}{2} \frac{1}{Y_t} Z_t^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_t \right] dt + Z_t^\top dW_t$, with terminal condition $Y_T = \exp\left(-\frac{1}{\lambda} q_{\text{final}}(X_T)\right)$
This BSDE, together with the forward SDE, provides a probabilistic representation of the solution to the original PDE for $\Phi(x, t)$.
4. Solution as an Expectation over Trajectories
The solution $\Phi(x, t)$ can be written as an expectation over all possible trajectories of the forward SDE, weighted by the running and terminal costs:
\[\Phi(x, t) = \mathbb{E} \left[ \exp\left( -\frac{1}{\lambda} \left( q_{\text{final}}(X_T) + \int_t^T q_{\text{state}}(X_s, s)\,ds \right) + \frac{1}{2} \int_t^T \frac{1}{Y_s^2} Z_s^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_s\,ds \right) \Bigg| X_t = x \right]\]- The expectation $\mathbb{E}[\cdot]$ is taken over all realizations of the Brownian motion $W_s$ (i.e., all possible trajectories $X_s$ starting from $X_t = x$).
- The exponential includes both the accumulated state cost and the quadratic form involving $Z_s$ along the trajectory.
This expectation formulation is the basis for sampling-based algorithms (such as MPPI), where $\Phi(x, t)$ is estimated by averaging over many simulated trajectories.
References:
- Feynman-Kac formula for nonlinear PDEs
- BSDEs in stochastic control (see e.g. Yong & Zhou, “Stochastic Controls: Hamiltonian Systems and HJB Equations”, Ch. 6)
5. Simulating the System for the Generalized MPPI (No Simplifying Assumptions)
We now consider the fully general MPPI scheme derived above, where all terms—including the quadratic form in $Z_t$—are retained, and no simplifying assumptions (such as $B B^\top = \lambda G R^{-1} G^\top$) are made.
Discretized Forward-Backward SDEs
Let time be discretized as $t_k = t_0 + k \Delta t$, $k = 0, 1, …, N$.
- Forward SDE (Euler-Maruyama):
\[X_{k+1} = X_k + f(X_k, t_k)\, \Delta t + B(X_k, t_k)\, \Delta W_k\]where $\Delta W_k \sim \mathcal{N}(0, \Delta t \cdot I)$.
- Backward SDE (explicit, from above):
\[Y_{k} = Y_{k+1} + \left[ -\frac{1}{\lambda} q_{\text{state}}(X_k, t_k) Y_k + \frac{1}{2} \frac{1}{Y_k} Z_k^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_k \right] \Delta t - Z_k^\top \Delta W_k\]with terminal condition:
\[Y_N = \exp\left(-\frac{1}{\lambda} q_{\text{final}}(X_N)\right)\]
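A minimal numpy sketch of this forward-backward discretization for a scalar system follows. It is illustrative only: the dynamics, costs, and the choice $Z_k \equiv 0$ are placeholders, and the implicit $Y_k$ in the drift of the backward step is approximated by $Y_{k+1}$ (a semi-explicit scheme); in a real run $Z_k$ must come from an estimator, as discussed in the remarks further down.

```python
import numpy as np

def forward_sde(x0, f, B, dt, N, rng):
    """Euler-Maruyama simulation of dX = f dt + B dW (scalar state for simplicity)."""
    X = np.zeros(N + 1)
    dW = rng.normal(0.0, np.sqrt(dt), size=N)
    X[0] = x0
    for k in range(N):
        X[k + 1] = X[k] + f(X[k], k * dt) * dt + B(X[k], k * dt) * dW[k]
    return X, dW

def backward_sde(X, dW, Z, q_state, q_final, lam, M_term, dt):
    """Backward recursion for Y_k; the implicit Y_k in the drift is replaced by Y_{k+1}.

    Z      : array of Z_k estimates along the trajectory (must be supplied, e.g. by regression)
    M_term : callable (x, t) -> scalar value of (lambda * G R^{-1} G^T - B B^T) at (x, t)
    """
    N = len(dW)
    Y = np.zeros(N + 1)
    Y[N] = np.exp(-q_final(X[N]) / lam)
    for k in reversed(range(N)):
        drift = (-q_state(X[k], k * dt) * Y[k + 1] / lam
                 + 0.5 * Z[k] * M_term(X[k], k * dt) * Z[k] / max(Y[k + 1], 1e-12))
        Y[k] = Y[k + 1] + drift * dt - Z[k] * dW[k]
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dt, N, lam = 0.01, 100, 1.0
    f = lambda x, t: -x                  # illustrative drift
    B = lambda x, t: 0.3                 # illustrative diffusion
    q_state = lambda x, t: x ** 2        # illustrative running state cost
    q_final = lambda x: x ** 2           # illustrative terminal cost
    M_term = lambda x, t: 0.0            # placeholder: e.g. lambda*G*R^{-1}*G^T equal to B*B^T
    X, dW = forward_sde(1.0, f, B, dt, N, rng)
    Z = np.zeros(N)                      # placeholder: a real run needs a Z_k estimator
    Y = backward_sde(X, dW, Z, q_state, q_final, lam, M_term, dt)
    print("Y_0 (one-trajectory sample of Phi at x0):", Y[0])
```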
Discretized Expectation for $\Phi(x, t)$
The solution $\Phi(x, t)$ is estimated as:
\[\Phi(x, t) \approx \frac{1}{M} \sum_{i=1}^M \exp\left( -\frac{1}{\lambda} \left[ q_{\text{final}}(X_N^{(i)}) + \sum_{k=0}^{N-1} q_{\text{state}}(X_k^{(i)}, t_k) \Delta t \right] + \frac{1}{2} \sum_{k=0}^{N-1} \frac{1}{(Y_k^{(i)})^2} (Z_k^{(i)})^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_k^{(i)} \Delta t \right)\]where each trajectory $i$ is simulated using the coupled forward-backward SDEs.
Generalized MPPI Algorithm (Discretized, All Terms Retained)
- For each of the $M$ sample trajectories:
- Initialize $X_0^{(i)} = x$ and simulate forward $X_k^{(i)}$ using the discretized SDE.
- Set the terminal condition $Y_N^{(i)} = \exp\left(-\frac{1}{\lambda} q_{\text{final}}(X_N^{(i)})\right)$.
- Simulate backward $Y_k^{(i)}, Z_k^{(i)}$ using the discretized BSDE (this requires an estimator for $Z_k^{(i)}$, e.g., regression or finite differences).
- Accumulate all cost terms, including the quadratic form in $Z_k$.
- Compute weights:
\[w^{(i)} = \exp\left( -\frac{1}{\lambda} \left[ q_{\text{final}}(X_N^{(i)}) + \sum_{k=0}^{N-1} q_{\text{state}}(X_k^{(i)}, t_k) \Delta t \right] + \frac{1}{2} \sum_{k=0}^{N-1} \frac{1}{(Y_k^{(i)})^2} (Z_k^{(i)})^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_k^{(i)} \Delta t \right)\]
- Update control:
\[\delta u_k^* = \frac{\sum_{i=1}^M w^{(i)} \delta u_k^{(i)}}{\sum_{i=1}^M w^{(i)}}, \qquad u_k^* = u_{\text{nominal},k} + \delta u_k^*\]where $\delta u_k^{(i)}$ denotes the control perturbation sampled for trajectory $i$ about the nominal sequence.
- Apply $u_0^*$ to the real system, shift the horizon, and repeat.
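The weighting and update steps can be written compactly; below is a minimal numpy sketch assuming the per-trajectory exponents (the full bracketed quantity inside $w^{(i)}$) and the sampled perturbations $\delta u_k^{(i)}$ have already been computed. The max-subtraction before exponentiation is a standard numerical safeguard and cancels in the normalized weights.

```python
import numpy as np

def mppi_update(exponents, du, u_nominal):
    """Weighted control update from per-trajectory exponents.

    exponents : (M,) array, the full exponent of w^(i) for each trajectory
                (-cost/lambda plus the Z-dependent quadratic correction)
    du        : (M, N, m) sampled control perturbations delta u_k^(i)
    u_nominal : (N, m) nominal control sequence
    """
    # Subtracting the maximum exponent avoids overflow; it cancels in the normalized weights.
    w = np.exp(exponents - exponents.max())
    w /= w.sum()
    du_star = np.einsum('i,ikm->km', w, du)   # sum_i w^(i) * delta u_k^(i)
    return u_nominal + du_star, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, N, m = 64, 50, 2
    exponents = -rng.random(M)                # placeholder exponents
    du = 0.1 * rng.normal(size=(M, N, m))     # placeholder perturbations
    u_nominal = np.zeros((N, m))
    u_new, w = mppi_update(exponents, du, u_nominal)
    print(u_new.shape, float(w.sum()))
```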
Remarks
- The presence of the $Z_k$-dependent quadratic term means that, in practice, one must estimate $Z_k$ along each trajectory. This can be done via regression methods or by using the relationship between $Z_k$ and the increments of $Y_k$ and $W_k$.
- This generalized MPPI includes all terms from the BSDE and does not assume any special structure for $B$, $G$, or $R$.
Interpretation of $Y$ and $Z$ in the Optimal Control Framework
In the context of stochastic optimal control and the forward-backward stochastic differential equation (FBSDE) formulation, the processes $Y$ and $Z$ have precise mathematical and practical meanings:
- $Y_k$ (Value Function Estimate): At each time step $k$, $Y_k$ represents the (conditional) expected cost-to-go, starting from the current state $X_k$ and time $t_k$, under the optimal control policy. More formally, $Y_k$ is the value function evaluated along the sampled trajectory:
\[Y_k = V(X_k, t_k)\]where $V(x, t)$ is the solution to the Hamilton-Jacobi-Bellman (HJB) equation associated with the stochastic control problem. (In the construction above, $Y_k = \Phi(X_k, t_k) = \exp\left(-V(X_k, t_k)/\lambda\right)$, the exponentiated or "desirability" form of the value function.) In the FBSDE framework, $Y_k$ evolves backward in time, aggregating the running and terminal costs along each trajectory.
- $Z_k$ (Value Function Gradient / Control Sensitivity): The process $Z_k$ encodes the sensitivity of the value function with respect to the underlying stochasticity (i.e., the Brownian motion increments). Mathematically, $Z_k$ is related to the gradient of the value function with respect to the state, projected onto the diffusion directions:
\[Z_k = \nabla_x V(X_k, t_k)^\top B(X_k, t_k)\]where $B$ is the diffusion matrix in the system dynamics (in the construction above, the same relation holds with $\Phi$ in place of $V$). In other words, $Z_k$ quantifies how changes in the noise affect the expected future cost, and it appears in the backward SDE as the coefficient of the stochastic (martingale) term.
In the context of control, $Z_k$ is crucial for:
- Computing the optimal control update: The optimal control law often depends on the gradient of the value function, which is captured by $Z_k$.
- Correcting for stochasticity: The $Z_k$-dependent quadratic term in the cost ensures that the control policy accounts for the effect of noise on the system.
Summary Table:
Symbol | Meaning | Role in Algorithm |
---|---|---|
$Y_k$ | Value function along trajectory ($V(X_k, t_k)$) | Backward cost-to-go, used for weighting and loss computation |
$Z_k$ | Projected gradient of value function ($\nabla_x V^\top B$) | Appears in BSDE, used for control update and cost correction |
In Practice:
- $Y_k$ is estimated as the expected cumulative cost from $k$ to the horizon, given the current state.
- $Z_k$ is estimated via regression or neural networks, using the relationship between increments of $Y_k$ and the observed noise, or by differentiating a learned value function.
This interpretation connects the FBSDE-based MPPI algorithm directly to the classical notions of value function and its gradient in stochastic optimal control, providing both theoretical grounding and practical guidance for implementation.
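To illustrate the regression-based estimate of $Z_k$ mentioned above: dropping (or pre-subtracting) the drift contribution, the one-step increment satisfies $\Delta Y_k \approx Z_k^\top \Delta W_k$ across trajectories, so $Z_k$ can be recovered by least squares. The sketch below uses synthetic increments; all numbers are illustrative.

```python
import numpy as np

def estimate_Z(dY, dW):
    """Least-squares estimate of Z_k from cross-trajectory increments at one time step.

    dY : (M,) increments Y_{k+1}^(i) - Y_k^(i) (drift contribution removed or assumed small)
    dW : (M, p) Brownian increments for the same step
    Returns the (p,) vector Z_k minimizing sum_i (dY_i - dW_i . Z)^2.
    """
    Z, *_ = np.linalg.lstsq(dW, dY, rcond=None)
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M, p, dt = 500, 3, 0.01
    true_Z = np.array([0.5, -1.0, 2.0])
    dW = rng.normal(0.0, np.sqrt(dt), size=(M, p))
    dY = dW @ true_Z + 0.01 * rng.normal(size=M)   # synthetic martingale increments
    print("estimated Z_k:", estimate_Z(dY, dW))
```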
6. Importance Sampling via Girsanov’s Theorem and Numerical Simulation
In high-dimensional or rare-event regimes, direct Monte Carlo estimation of the expectation for $\Phi(x, t)$ can be highly inefficient due to the exponential weighting of trajectories. Importance sampling provides a principled way to reduce variance by simulating trajectories under an alternative measure, and then reweighting them appropriately. The mathematical foundation for this change of measure in the context of SDEs is provided by Girsanov’s theorem.
6.1. Girsanov’s Theorem (Strict Statement)
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space supporting a standard $p$-dimensional Brownian motion $W_t$ adapted to a filtration ${\mathcal{F}_t}$. Consider the SDEs:
- Original (uncontrolled) SDE: \(dX_t = f(X_t, t)\,dt + B(X_t, t)\,dW_t, \qquad X_0 = x\)
- Controlled SDE (under the new measure): \(dX_t = f(X_t, t)\,dt + B(X_t, t)\, u_t\,dt + B(X_t, t)\,dW_t, \qquad X_0 = x\) where $u_t$ is an adapted, integrable drift process acting through the diffusion channel (if the physical control enters as $G \tilde{u}_t$, this requires $G \tilde{u}_t$ to lie in the range of $B$).
Girsanov’s Theorem:
Suppose $u_t$ is progressively measurable and satisfies Novikov’s condition:
\[\mathbb{E}\left[ \exp\left( \frac{1}{2} \int_0^T \|u_s\|^2\, ds \right) \right] < \infty.\]Then the process
\[\Lambda_T = \exp\left( \int_0^T u_s^\top dW_s - \frac{1}{2} \int_0^T \|u_s\|^2 ds \right)\]is a strictly positive martingale and defines a new probability measure $\mathbb{Q}$ on $(\Omega, \mathcal{F}_T)$ by $d\mathbb{Q} = \Lambda_T\, d\mathbb{P}$. Under $\mathbb{Q}$, the process
\[\tilde{W}_t = W_t - \int_0^t u_s\, ds\]is a standard Brownian motion; substituting $dW_t = d\tilde{W}_t + u_t\,dt$ into the original SDE shows that, under $\mathbb{Q}$, $X_t$ has the law of the controlled SDE above, driven by $\tilde{W}_t$.
6.2. Importance Sampling Representation for $\Phi(x, t)$
Let $u_s$ be a chosen drift (importance-sampling control) process. Simulate trajectories under the controlled SDE:
\[dX_s = f(X_s, s)\,ds + B(X_s, s)\, u_s\,ds + B(X_s, s)\,d\tilde{W}_s, \qquad X_t = x\]where $\tilde{W}_s$ is the Brownian motion actually used in the simulation (a $\mathbb{Q}$-Brownian motion). Then, by Girsanov’s theorem, the expectation for $\Phi(x, t)$ can be rewritten as:
\[\Phi(x, t) = \mathbb{E}^{\mathbb{Q}} \left[ \exp\left( -\frac{1}{\lambda} \left( q_{\text{final}}(X_T) + \int_t^T q_{\text{state}}(X_s, s)\,ds \right) + \frac{1}{2} \int_t^T \frac{1}{Y_s^2} Z_s^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_s\,ds \right) \cdot \Lambda_T^{-1} \Bigg| X_t = x \right]\]where
\[\Lambda_T^{-1} = \exp\left( -\int_t^T u_s^\top d\tilde{W}_s - \frac{1}{2} \int_t^T \|u_s\|^2 ds \right)\]and the expectation $\mathbb{E}^{\mathbb{Q}}$ is over trajectories generated by the controlled SDE.
6.3. Discretized Importance Sampling Estimator
Let $u_k$ be the chosen control at time $t_k$. For $M$ simulated trajectories, the estimator becomes:
\[\Phi(x, t) \approx \frac{1}{M} \sum_{i=1}^M \exp\Bigg( -\frac{1}{\lambda} \Big[ q_{\text{final}}(X_N^{(i)}) + \sum_{k=0}^{N-1} q_{\text{state}}(X_k^{(i)}, t_k) \Delta t \Big] + \frac{1}{2} \sum_{k=0}^{N-1} \frac{1}{(Y_k^{(i)})^2} (Z_k^{(i)})^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) Z_k^{(i)} \Delta t - \sum_{k=0}^{N-1} (u_k^{(i)})^\top \Delta W_k^{(i)} - \frac{1}{2} \sum_{k=0}^{N-1} \|u_k^{(i)}\|^2 \Delta t \Bigg)\]where each trajectory is generated by:
\[X_{k+1}^{(i)} = X_k^{(i)} + f(X_k^{(i)}, t_k)\, \Delta t + B(X_k^{(i)}, t_k) u_k^{(i)} \Delta t + B(X_k^{(i)}, t_k) \Delta W_k^{(i)}\]with $\Delta W_k^{(i)} \sim \mathcal{N}(0, \Delta t \cdot I)$.
6.4. Numerical Simulation Procedure
- Choose an importance sampling control $u_k$ (e.g., from a nominal policy, previous iteration, or heuristic).
- For $i = 1, \ldots, M$ trajectories:
- Simulate the forward SDE with drift $u_k$.
- Simultaneously solve the backward SDE for $Y_k, Z_k$.
- Accumulate the running and terminal costs, the quadratic $Z_k$ terms, and the Girsanov likelihood ratio.
- Compute the weighted average as above to estimate $\Phi(x, t)$.
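A minimal numpy sketch of this procedure for a scalar system is given below. It is illustrative only: the dynamics, costs, and sampling drift are placeholders, and the $Z$-dependent quadratic correction is omitted (the special case $\lambda G R^{-1} G^\top = B B^\top$), so only the state/terminal costs and the Girsanov likelihood ratio are accumulated.

```python
import numpy as np

def phi_importance_sampling(x0, f, B, u, q_state, q_final, lam, dt, N, M, rng):
    """Monte Carlo estimate of Phi(x0, t0) with an importance-sampling drift u(x, t).

    Simulates dX = (f + B*u) dt + B dW and reweights each trajectory by the Girsanov
    likelihood ratio exp(-sum u_k dW_k - 0.5 * sum u_k^2 dt).  The Z-dependent quadratic
    correction is omitted (the special case lambda * G R^{-1} G^T = B B^T).
    """
    vals = np.zeros(M)
    for i in range(M):
        x, cost, girsanov = x0, 0.0, 0.0
        for k in range(N):
            t = k * dt
            dW = rng.normal(0.0, np.sqrt(dt))
            uk = u(x, t)
            cost += q_state(x, t) * dt
            girsanov += -uk * dW - 0.5 * uk ** 2 * dt
            x = x + (f(x, t) + B(x, t) * uk) * dt + B(x, t) * dW
        cost += q_final(x)
        vals[i] = np.exp(-cost / lam + girsanov)
    return vals.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    est = phi_importance_sampling(
        x0=1.0, f=lambda x, t: -x, B=lambda x, t: 0.3,
        u=lambda x, t: -0.5 * x,             # illustrative sampling drift
        q_state=lambda x, t: x ** 2, q_final=lambda x: x ** 2,
        lam=1.0, dt=0.01, N=100, M=2000, rng=rng)
    print("Phi estimate:", est)
```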
6.5. Theoretical Guarantees
- Unbiasedness: The estimator is unbiased for any admissible choice of $u_k$ satisfying Novikov’s condition.
- Variance Reduction: The optimal choice of $u_k$ (in the sense of minimizing estimator variance) is the drift that makes the exponent as constant as possible, i.e., the optimal control.
6.6. Theorem (Unbiased Importance Sampling for Path Integrals)
Theorem:
Let $u_t$ be any progressively measurable process such that Novikov’s condition holds. Then the estimator
\[\hat{\Phi}_M(x, t) = \frac{1}{M} \sum_{i=1}^M F[X^{(i)}]\, \Lambda_T^{-1}[W^{(i)}],\]where $F[X^{(i)}]$ is the exponential of the cost and quadratic terms along trajectory $i$, and $\Lambda_T^{-1}[W^{(i)}]$ is the Girsanov likelihood ratio, satisfies
\[\mathbb{E}^{\mathbb{Q}}[\hat{\Phi}_M(x, t)] = \Phi(x, t)\]and converges almost surely to $\Phi(x, t)$ as $M \to \infty$.
References:
- L.C.G. Rogers and D. Williams, “Diffusions, Markov Processes and Martingales”, Vol. 2, Ch. 5
- B. Øksendal, “Stochastic Differential Equations”, Ch. 8
- Kappen, H.J., “Path integrals and symmetry breaking for optimal control theory”, J. Stat. Mech. (2005)
- The Feynman-Kac and Girsanov theorems in stochastic control
7. Neural Network Approaches for Generalized MPPI
Recent advances in deep learning enable us to leverage neural networks to improve the efficiency and scalability of the generalized MPPI algorithm, especially in high-dimensional or nonlinear systems. Below, we outline several innovative strategies to incorporate neural networks into the MPPI framework, along with principled loss functions for training.
7.1. Supervised Regression Using Monte Carlo Targets
A straightforward approach is to use the Monte Carlo (MC) estimator of $\Phi(x, t)$ as a regression target for a neural network $\hat{\Phi}_\theta(x, t)$ parameterized by $\theta$.
- Data Generation: For a batch of initial states $\{x^{(j)}\}$ and times $\{t^{(j)}\}$, run $M$ MC simulations per sample to estimate $\Phi(x^{(j)}, t^{(j)})$ using the importance-sampled estimator described above.
- Loss Function: Use the mean squared error (MSE) between the network prediction and the MC estimate:
\[\mathcal{L}_{\text{MC}}(\theta) = \frac{1}{N} \sum_{j=1}^N \left| \hat{\Phi}_\theta(x^{(j)}, t^{(j)}) - \hat{\Phi}_{\text{MC}}(x^{(j)}, t^{(j)}) \right|^2\]where $\hat{\Phi}_{\text{MC}}$ is the MC estimate for sample $j$ (a minimal training sketch is given after the remarks below).
-
Remarks:
- This approach is simple and stable, but the MC estimator can be noisy and expensive to compute for each training sample.
- The network can be used at test time to rapidly approximate $\Phi(x, t)$, reducing the need for repeated MC sampling.
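A minimal PyTorch sketch of this regression step is shown below; the network architecture, the Softplus output enforcing $\Phi > 0$, and the synthetic targets are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhiNet(nn.Module):
    """Small MLP approximating Phi_theta(x, t); the architecture is an arbitrary choice."""
    def __init__(self, n_state, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Softplus())   # Softplus keeps Phi > 0

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def train_mc_regression(model, x, t, phi_mc, epochs=500, lr=1e-3):
    """Minimize the MSE between the network prediction and the MC targets phi_mc."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.mean((model(x, t) - phi_mc) ** 2)
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    n_state, n_samples = 2, 256
    x = torch.randn(n_samples, n_state)
    t = torch.rand(n_samples, 1)
    phi_mc = torch.exp(-x.pow(2).sum(dim=1, keepdim=True))  # placeholder MC targets
    model = PhiNet(n_state)
    print("final MSE:", train_mc_regression(model, x, t, phi_mc))
```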
7.2. Physics-Informed Neural Networks (PINNs) for the Path Integral PDE
Instead of regressing to MC targets, we can train a neural network to directly satisfy the path integral PDE (or the associated BSDE) by minimizing the residual of the equation at sampled points.
- Loss Function (PDE Residual): For the boxed PDE for $\Phi$ derived above, define the residual at $(x, t)$:
\[\mathcal{R}(x, t; \theta) = -\frac{\partial \hat{\Phi}_\theta}{\partial t} - f^\top \nabla_x \hat{\Phi}_\theta - \frac{1}{2} \operatorname{Tr}\left[ B B^\top \nabla_x^2 \hat{\Phi}_\theta \right] + \frac{1}{\lambda} q_{\text{state}}(x, t) \hat{\Phi}_\theta - \frac{1}{2} \frac{1}{\hat{\Phi}_\theta} (\nabla_x \hat{\Phi}_\theta)^\top \left( \lambda G R^{-1} G^\top - B B^\top \right) \nabla_x \hat{\Phi}_\theta\]The loss is then:
\[\mathcal{L}_{\text{PDE}}(\theta) = \frac{1}{N} \sum_{j=1}^N \left| \mathcal{R}(x^{(j)}, t^{(j)}; \theta) \right|^2\]plus boundary/terminal condition penalties (a minimal autodiff sketch is given after the remarks below).
-
Remarks:
- This approach enforces the structure of the control problem and can generalize better, especially in regions with few MC samples.
- Automatic differentiation can be used to compute derivatives of $\hat{\Phi}_\theta$.
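Below is a minimal one-dimensional PyTorch sketch of the residual computation via automatic differentiation; the network, the constant diffusion/control gains, and the collocation points are illustrative assumptions, and the terminal-condition penalty is left as a comment.

```python
import torch
import torch.nn as nn

# Illustrative network for Phi_theta(x, t) with a positive output; scalar state for simplicity.
phi_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1), nn.Softplus())

def pde_residual(x, t, f, B, G, R, lam, q_state):
    """Residual of the Phi-PDE at collocation points (x, t) for a one-dimensional state."""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    phi = phi_net(torch.cat([x, t], dim=-1))
    phi_sum = phi.sum()
    dphi_dt = torch.autograd.grad(phi_sum, t, create_graph=True)[0]
    dphi_dx = torch.autograd.grad(phi_sum, x, create_graph=True)[0]
    d2phi_dx2 = torch.autograd.grad(dphi_dx.sum(), x, create_graph=True)[0]
    quad = lam * G ** 2 / R - B ** 2          # scalar version of (lambda G R^{-1} G^T - B B^T)
    return (-dphi_dt - f(x, t) * dphi_dx - 0.5 * B ** 2 * d2phi_dx2
            + q_state(x, t) * phi / lam
            - 0.5 * quad * dphi_dx ** 2 / phi)

if __name__ == "__main__":
    x = torch.rand(128, 1) * 2.0 - 1.0        # collocation points in state
    t = torch.rand(128, 1)                    # collocation points in time
    res = pde_residual(x, t, f=lambda x, t: -x, B=0.3, G=1.0, R=1.0, lam=1.0,
                       q_state=lambda x, t: x ** 2)
    loss_pde = res.pow(2).mean()              # add a terminal-condition penalty in practice
    print("PDE residual loss:", loss_pde.item())
```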
7.3. Learning the Value Function and Its Gradient via FBSDE
Recall that in the FBSDE formulation, $Y_k \approx V(X_k, t_k)$ and $Z_k \approx \nabla_x V(X_k, t_k)^\top B(X_k, t_k)$. We can parameterize $V_\theta(x, t)$ with a neural network and train it to match the FBSDE dynamics.
- Loss Function (FBSDE Consistency): For simulated trajectories $\{X_k, Y_k, Z_k\}$, define the one-step FBSDE residual:
\[\mathcal{L}_{\text{FBSDE}}(\theta) = \frac{1}{N} \sum_{j=1}^N \sum_{k=0}^{N-1} \left| Y_k^{(j)} - \left( Y_{k+1}^{(j)} + \Delta t \cdot \mathcal{F}_\theta(X_k^{(j)}, t_k) - Z_k^{(j)\top} \Delta W_k^{(j)} \right) \right|^2\]where $\mathcal{F}_\theta$ is the drift term of the backward SDE, $Y_k^{(j)} = V_\theta(X_k^{(j)}, t_k)$, and $Z_k^{(j)} = \nabla_x V_\theta(X_k^{(j)}, t_k)^\top B(X_k^{(j)}, t_k)$ (a minimal sketch follows the remarks below).
-
Remarks:
- This approach can be combined with MC regression or PINN losses for improved accuracy.
- The network learns both the value function and its gradient, which are needed for optimal control.
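A minimal PyTorch sketch of the one-step consistency loss is given below; the network, the placeholder backward drift $\mathcal{F}_\theta$, and the single forward Euler step are illustrative assumptions (scalar state, so $Z_k$ is just $\partial_x V_\theta \cdot B$).

```python
import torch
import torch.nn as nn

# Network for V_theta(x, t); Z_k is obtained by differentiating V_theta (scalar state).
v_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def fbsde_residual(Xk, Xk1, dWk, tk, dt, B, drift_fn):
    """One-step consistency loss |Y_k - (Y_{k+1} + F_theta dt - Z_k dW_k)|^2, averaged over samples.

    drift_fn(Y, Z, x, t) is a placeholder for the backward-SDE drift F_theta; B is a scalar diffusion.
    """
    Xk = Xk.requires_grad_(True)
    Yk = v_net(torch.cat([Xk, tk], dim=-1))
    Yk1 = v_net(torch.cat([Xk1, tk + dt], dim=-1))
    dV_dx = torch.autograd.grad(Yk.sum(), Xk, create_graph=True)[0]
    Zk = dV_dx * B
    target = Yk1 + drift_fn(Yk, Zk, Xk, tk) * dt - Zk * dWk
    return ((Yk - target) ** 2).mean()

if __name__ == "__main__":
    M, dt = 64, 0.01
    Xk = torch.randn(M, 1)
    dWk = torch.randn(M, 1) * dt ** 0.5
    Xk1 = Xk - Xk * dt + 0.3 * dWk                # one Euler-Maruyama step of the forward SDE
    tk = torch.zeros(M, 1)
    drift = lambda Y, Z, x, t: -Y * x ** 2        # placeholder backward drift F_theta
    print("FBSDE loss:", fbsde_residual(Xk, Xk1, dWk, tk, dt, B=0.3, drift_fn=drift).item())
```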
7.4. Direct Policy Learning via Differentiable Path Integral
Alternatively, parameterize the control policy $u_\theta(x, t)$ with a neural network and train it to minimize the expected cost using the path integral estimator, leveraging the reparameterization trick for differentiability.
- Loss Function (Expected Cost): Train the policy to minimize the expected total cost along rollouts,
\[\mathcal{L}_{\text{PI}}(\theta) = \mathbb{E}_{\text{trajectories under } u_\theta} \left[ q_{\text{final}}(X_N) + \sum_{k=0}^{N-1} q_{\text{state}}(X_k, t_k) \Delta t + \frac{1}{2} \sum_{k=0}^{N-1} u_\theta(X_k, t_k)^\top R\, u_\theta(X_k, t_k)\, \Delta t \right]\]where the expectation is estimated by sampling trajectories using the current policy $u_\theta$, and gradients can be computed via backpropagation through the simulation (model-based) or using policy gradient methods (model-free); a minimal rollout sketch follows the remarks below.
-
Remarks:
- This approach directly optimizes the control policy, bypassing the need to estimate $\Phi$ or $V$.
- Can be combined with importance sampling for variance reduction.
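A minimal PyTorch sketch of this end-to-end approach for an illustrative one-dimensional system ($f = -x$, $G = 1$, constant diffusion, quadratic costs) is given below; the reparameterized noise makes the rollout differentiable, so the expected cost can be minimized directly by gradient descent.

```python
import torch
import torch.nn as nn

# Policy network u_theta(x, t), trained by backpropagating through a differentiable rollout.
policy = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def expected_cost(x0, dt, N, M, B=0.3, R=1.0):
    """Sample M rollouts under u_theta with reparameterized noise and average the total cost."""
    x = x0.repeat(M, 1)
    cost = torch.zeros(M, 1)
    for k in range(N):
        t = torch.full((M, 1), k * dt)
        u = policy(torch.cat([x, t], dim=-1))
        cost = cost + (x ** 2 + 0.5 * R * u ** 2) * dt   # state cost + quadratic control cost
        dW = torch.randn(M, 1) * dt ** 0.5               # reparameterization trick
        x = x + (-x + u) * dt + B * dW                   # illustrative dynamics: f = -x, G = 1
    return (cost + x ** 2).mean()                        # add terminal cost q_final = x^2

if __name__ == "__main__":
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for step in range(200):
        opt.zero_grad()
        loss = expected_cost(torch.tensor([[1.0]]), dt=0.02, N=50, M=32)
        loss.backward()
        opt.step()
    print("final expected cost:", loss.item())
```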
7.5. Hybrid and Advanced Methods
- Multi-Task Learning: Jointly train networks for $\Phi_\theta$, $V_\theta$, and $u_\theta$ with a combined loss:
\[\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{MC}} + \beta \mathcal{L}_{\text{PDE}} + \gamma \mathcal{L}_{\text{FBSDE}} + \delta \mathcal{L}_{\text{PI}}\]with tunable weights $\alpha, \beta, \gamma, \delta$.
- Uncertainty Quantification: Use ensembles or Bayesian neural networks to estimate uncertainty in $\Phi_\theta(x, t)$, guiding exploration or adaptive sampling.
- Adaptive Sampling: Focus MC samples and PDE residual points in regions where the network error or uncertainty is high.
Summary Table of Neural Network Approaches
Method | Network Output | Loss Function | Notes |
---|---|---|---|
MC Regression | $\hat{\Phi}_\theta$ | $\mathcal{L}_{\text{MC}}$ | Simple, requires MC labels |
PINN (PDE Residual) | $\hat{\Phi}_\theta$ | $\mathcal{L}_{\text{PDE}}$ | Physics-informed, no MC labels needed |
FBSDE Consistency | $V_\theta$, $\nabla_x V_\theta$ | $\mathcal{L}_{\text{FBSDE}}$ | Learns value and gradient |
Direct Policy | $u_\theta$ | $\mathcal{L}_{\text{PI}}$ | End-to-end policy learning |
Hybrid | Multiple | $\mathcal{L}_{\text{total}}$ | Combine strengths of all above |
References:
- Han, J., Jentzen, A., & E, W. (2018). “Solving high-dimensional PDEs using deep learning.” PNAS.
- Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear PDEs.” JCP.
- Exarchos, I. P., Theodorou, E. A. (2018). “Stochastic Optimal Control via Forward and Backward Stochastic Differential Equations and Deep Learning.” arXiv:1807.09341