Flow Model Using MDP Homomorphism

Mathematical derivation and definitions: Continuous MDP Homomorphisms

Step 1. Continuous Markov Decision Processes (MDPs)

Definition (Continuous MDP): A continuous MDP is defined as a tuple:

\[\mathcal{M} = (\mathcal{S}, \Sigma, \mathcal{A}, \{\tau_a\}_{a\in\mathcal{A}}, R, \gamma)\]

where:

  • $\mathcal{S}$ is a measurable state space (often $\mathbb{R}^n$).
  • $\Sigma$ is the sigma-algebra on $\mathcal{S}$.
  • $\mathcal{A}$ is a measurable action space (typically locally compact metric space).
  • $\tau_a: \mathcal{S}\times\Sigma\to [0,1]$ is, for each action $a\in\mathcal{A}$, a transition probability kernel such that for every state $s \in \mathcal{S}$, $\tau_a(\cdot \mid s)$ is a probability measure on $(\mathcal{S},\Sigma)$.
  • $R: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function.
  • $\gamma\in[0,1)$ is the discount factor.

A stochastic policy $\pi$ is a measurable mapping:

\[\pi : \mathcal{S} \to \Delta(\mathcal{A})\]

The objective is to maximize expected return:

\[J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\,R(s_t,a_t)\right],\qquad s_0\sim\rho_0,\quad a_t\sim\pi(\cdot\mid s_t),\quad s_{t+1}\sim\tau_{a_t}(\cdot\mid s_t)\]

where $\rho_0$ is the initial state distribution.
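To make the objective concrete, here is a minimal Monte Carlo estimator of $J(\pi)$; the `env` and `policy` interfaces (`reset`, `step`, `sample`) are hypothetical placeholders, not part of the formal definition.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, n_episodes=32, horizon=1000):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^t R(s_t, a_t)].

    Hypothetical interfaces:
      env.reset()      -> s0 sampled from rho_0
      env.step(a)      -> (s', r, done)
      policy.sample(s) -> a sampled from pi(. | s)
    """
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.sample(s)
            s, r, done = env.step(a)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```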

Step 2. Continuous MDP Homomorphisms

Definition (Continuous MDP Homomorphism): Let $\tilde{\mathcal{M}} = (\tilde{\mathcal{S}}, \tilde{\Sigma}, \tilde{\mathcal{A}}, \{\tilde{\tau}_{\tilde a}\}_{\tilde a\in\tilde{\mathcal{A}}}, \tilde{R}, \gamma)$ be an abstract continuous MDP. A measurable map:

\[h = (f, g),\quad\text{where}\quad f:\mathcal{S}\to\tilde{\mathcal{S}}, \quad g:\mathcal{S}\times\mathcal{A}\to\tilde{\mathcal{A}}\]

is a continuous MDP homomorphism from $\mathcal{M}$ to $\tilde{\mathcal{M}}$ if $f$ is surjective, $g(s,\cdot):\mathcal{A}\to\tilde{\mathcal{A}}$ is surjective for every $s\in\mathcal{S}$, and the following two properties hold:

(1) Reward Invariance:

\[R(s,a) = \tilde{R}(f(s), g(s,a)) \quad \text{for all } (s,a)\in\mathcal{S}\times\mathcal{A}\]

(2) Transition Equivariance:

For any measurable set $B\in\tilde{\Sigma}$ and all $(s,a)\in\mathcal{S}\times\mathcal{A}$, we have:

\[\tau_a(f^{-1}(B)\mid s) = \tilde{\tau}_{g(s,a)}(B\mid f(s))\]

This ensures that transitions in the abstract space faithfully represent transitions in the original space once states and actions are mapped through the homomorphism.
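In practice these two conditions are enforced only approximately through training losses. The sketch below is one such surrogate, assuming hypothetical encoder networks `f_enc` and `g_enc`, a learned abstract reward `R_tilde`, and a sampler `tau_tilde_sample` for the abstract kernel: reward invariance becomes a regression loss, and transition equivariance is approximated by a crude per-sample match between encoded successors and abstract-model samples (stronger distributional losses such as MMD or flow-matching objectives are used in practice).

```python
import torch

def homomorphism_losses(batch, f_enc, g_enc, R_tilde, tau_tilde_sample):
    """Surrogate losses for the two homomorphism conditions.

    batch            : tensors (s, a, s_next, r) from the original MDP
    f_enc, g_enc     : state/action encoders f_phi, g_eta          (hypothetical)
    R_tilde          : learned abstract reward R~(z, u)             (hypothetical)
    tau_tilde_sample : sampler for the abstract kernel tau~(.|z, u) (hypothetical)
    """
    s, a, s_next, r = batch
    z, u = f_enc(s), g_enc(s, a)

    # (1) Reward invariance: R(s, a) = R~(f(s), g(s, a)).
    reward_loss = torch.mean((R_tilde(z, u) - r) ** 2)

    # (2) Transition equivariance surrogate: encoded successors f(s') should be
    #     distributed as tau~(. | f(s), g(s, a)). A per-sample L2 match is a
    #     crude proxy; distributional losses (MMD, flow matching) are stronger.
    z_next_data = f_enc(s_next)
    z_next_model = tau_tilde_sample(z, u)
    transition_loss = torch.mean((z_next_data - z_next_model) ** 2)

    return reward_loss, transition_loss
```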


Step 3. Lifted (Push-forward) Policy

Definition (Lifted Policy): Given a stochastic policy $\tilde\pi$ defined in the abstract MDP:

\[\tilde\pi : \tilde{\mathcal{S}}\to\Delta(\tilde{\mathcal{A}})\]

a lifted policy $\pi^\uparrow:\mathcal{S}\to\Delta(\mathcal{A})$ is any policy whose push-forward under $g(s,\cdot)$ recovers $\tilde\pi$:

\[\pi^\uparrow(g^{-1}(s,B)\mid s) = \tilde\pi(B\mid f(s)), \quad \text{for any measurable } B\subseteq\tilde{\mathcal{A}}\]

where:

  • $g^{-1}(s,B) = \{a\in\mathcal{A}\mid g(s,a)\in B\}$ is the set-valued preimage of $B$ under $g(s,\cdot)$.
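One practical way to realize a lifted policy, anticipating the decoder $h_\psi$ of Step 6, is to sample an abstract action and decode it back to a concrete one. The sketch below assumes hypothetical `f_enc`, `pi_tilde`, and `decoder` callables, with the decoder trained so that $g(s, h_\psi(s,\tilde a)) \approx \tilde a$.

```python
def sample_lifted_action(s, f_enc, pi_tilde, decoder):
    """Sample a ~ pi_up(. | s) by routing through the abstract MDP.

    f_enc(s)      -> abstract state z = f(s)               (hypothetical)
    pi_tilde(z)   -> sample abstract action u ~ pi~(. | z) (hypothetical)
    decoder(s, u) -> concrete action a with g(s, a) ~= u   (hypothetical)

    If the decoder is a right inverse of g(s, .), the push-forward of the
    resulting action distribution under g(s, .) equals pi~(. | f(s)),
    which is exactly the lifted-policy condition above.
    """
    z = f_enc(s)
    u = pi_tilde(z)
    return decoder(s, u)
```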

Step 4. Value Equivalence Theorem

Theorem (Value Equivalence): Let $h=(f,g)$ be a continuous MDP homomorphism from $\mathcal{M}$ to $\tilde{\mathcal{M}}$, let $\tilde\pi$ be a policy on $\tilde{\mathcal{M}}$, and let $\pi^\uparrow$ be a corresponding lifted policy. Then for all state-action pairs $(s,a)\in\mathcal{S}\times\mathcal{A}$:

\[Q^{\pi^\uparrow}(s,a)=Q^{\tilde\pi}(f(s),g(s,a))\]

Proof (Sketch): Induction on Bellman recursion.

  • Base case:
\[Q_0^{\pi^\uparrow}(s,a)=R(s,a)=\tilde{R}(f(s),g(s,a))=Q_0^{\tilde\pi}(f(s),g(s,a))\]
  • Inductive step: Assume equality holds at step $n$. For step $n+1$:
\[Q_{n+1}^{\pi^\uparrow}(s,a)=R(s,a)+\gamma\int_{\mathcal{S}}\tau_a(ds'|s)\int_{\mathcal{A}}\pi^\uparrow(da'|s')Q_n^{\pi^\uparrow}(s',a')\]

Change variables under $h=(f,g)$ and apply the inductive hypothesis and the push-forward (lifted-policy) condition to conclude equality at step $n+1$; the claim follows by letting $n\to\infty$. $\square$
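Spelling out the change of variables in the inductive step (a sketch, with measurability details omitted): by the inductive hypothesis and the lifted-policy condition,

\[\int_{\mathcal{A}}\pi^\uparrow(da'\mid s')\,Q_n^{\pi^\uparrow}(s',a')=\int_{\mathcal{A}}\pi^\uparrow(da'\mid s')\,Q_n^{\tilde\pi}(f(s'),g(s',a'))=\int_{\tilde{\mathcal{A}}}\tilde\pi(d\tilde a'\mid f(s'))\,Q_n^{\tilde\pi}(f(s'),\tilde a'),\]

which depends on $s'$ only through $f(s')$. Transition equivariance then moves the outer integral to the abstract space,

\[\int_{\mathcal{S}}\tau_a(ds'\mid s)\int_{\tilde{\mathcal{A}}}\tilde\pi(d\tilde a'\mid f(s'))\,Q_n^{\tilde\pi}(f(s'),\tilde a')=\int_{\tilde{\mathcal{S}}}\tilde\tau_{g(s,a)}(d\tilde s'\mid f(s))\int_{\tilde{\mathcal{A}}}\tilde\pi(d\tilde a'\mid \tilde s')\,Q_n^{\tilde\pi}(\tilde s',\tilde a'),\]

and together with reward invariance this gives $Q_{n+1}^{\pi^\uparrow}(s,a)=Q_{n+1}^{\tilde\pi}(f(s),g(s,a))$.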


Step 5. Stochastic Homomorphic Policy Gradient (HPG) Theorem

Theorem (Stochastic HPG): The gradient of the performance measure in the original MDP can be computed solely in the abstract space:

\[\nabla_\theta J(\pi_\theta^\uparrow)=\int_{\tilde{\mathcal{S}}}\rho^{\tilde\pi_\theta}(\tilde s)\int_{\tilde{\mathcal{A}}}Q^{\tilde\pi_\theta}(\tilde s,\tilde a)\,\nabla_\theta\tilde\pi_\theta(d\tilde a\mid\tilde s)\,d\tilde s\]
  • Here, $\rho^{\tilde\pi_\theta}$ is the state visitation (occupancy) distribution induced by $\tilde\pi_\theta$ in the abstract MDP, and $Q^{\tilde\pi_\theta}$ is the abstract action-value function.
  • The theorem follows from the Value Equivalence theorem and the push-forward (lifted-policy) condition, which together let every term of the standard policy gradient be evaluated in the abstract MDP.
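As an illustration of how this gradient can be estimated from samples, the sketch below implements a likelihood-ratio (REINFORCE-style) surrogate loss entirely in the abstract space. The `f_enc`, `pi_tilde`, and `Q_tilde` networks are hypothetical, and `pi_tilde` is assumed to expose `sample` and `log_prob`; flow-matching actors as in Step 6 require their own likelihood or reparameterization machinery, which this sketch does not cover.

```python
import torch

def abstract_policy_gradient_loss(states, f_enc, pi_tilde, Q_tilde):
    """Surrogate loss whose gradient estimates the HPG expression.

    states   : batch of original-MDP states drawn from the visitation distribution
    f_enc    : state encoder f_phi                       (hypothetical network)
    pi_tilde : abstract policy returning a distribution  (hypothetical network)
    Q_tilde  : abstract critic Q(z, u)                   (hypothetical network)
    """
    z = f_enc(states)            # abstract states  \tilde s = f(s)
    dist = pi_tilde(z)           # \tilde pi_theta(. | \tilde s)
    u = dist.sample()            # abstract actions \tilde a
    q = Q_tilde(z, u).detach()   # critic values, no gradient through the critic
    # Likelihood-ratio estimator: grad log pi * Q, averaged over the batch.
    return -(dist.log_prob(u) * q).mean()
```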

Step 6. Practical realization (Flow-Matching approach)

Use modern Flow Matching (FM) methods to instantiate $\tilde\pi_\theta$, $\tilde\tau$, and the lifting/decoding pipeline:

| Component | Implementation |
| --- | --- |
| $f_\phi$ | State encoder (NN) |
| $g_\eta$ | Action encoder (NN) |
| $\tilde\pi_\theta$ | Flow-matching actor (EFM/ReinFlow) |
| $\tilde\tau$ | Flow-matching transition dynamics |
| $h_\psi$ | Flexible non-invertible decoder |

The theoretical guarantees above carry over provided the training losses enforce the two homomorphism properties (reward invariance and transition equivariance); with approximate losses they hold only approximately.
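As one example of how such a loss can be instantiated, the sketch below trains the $\tilde\tau$ component with a standard conditional flow-matching objective on encoded transitions, assuming a hypothetical velocity network `v_net` and the linear interpolation path used in rectified flow / conditional flow matching; the actor row of the table can be trained with an analogous objective over actions.

```python
import torch

def latent_transition_fm_loss(z, u, z_next, v_net):
    """Conditional flow-matching loss for the abstract dynamics tau~(. | z, u).

    z, u   : encoded states f(s) and actions g(s, a)
    z_next : encoded successor states f(s')
    v_net  : velocity field v(x_t, t, z, u)              (hypothetical network)

    Uses the linear path x_t = (1 - t) x_0 + t x_1 with x_0 ~ N(0, I) and
    x_1 = z_next, whose target velocity is x_1 - x_0.
    """
    x0 = torch.randn_like(z_next)                        # noise sample
    t = torch.rand(z_next.shape[0], 1, device=z_next.device)
    xt = (1.0 - t) * x0 + t * z_next                     # point on the path
    target_v = z_next - x0                               # path velocity
    pred_v = v_net(xt, t, z, u)
    return torch.mean((pred_v - target_v) ** 2)
```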


Summary

This derivation gives a rigorous mathematical foundation for learning and exploiting compact latent MDPs via homomorphisms: value functions and policy gradients of lifted policies in the original MDP are recovered exactly from the abstract MDP, and each component admits a modern flow-based implementation for representation, dynamics, and control.
