Neural Operator for RL
Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings (ICML 2024): Functional and Transformer Neural Operator Perspectives
The ICML 2024 paper “Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings” introduces a framework where reward functions are encoded as functionals over trajectories or policies, and leverages neural operators—specifically, transformer-based neural operators—to parameterize these functionals. This enables zero-shot generalization to new tasks specified by reward functionals.
1. Functional Reward Encodings: Mathematical Perspective
In classical RL, the reward is a function: \(r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space.
Functional reward encoding generalizes this by defining the reward as a functional: \(\mathcal{R}: \mathcal{F} \to \mathbb{R}\) where $\mathcal{F}$ is a space of function-like objects such as trajectories $\tau = (s_0, a_0, s_1, a_1, \ldots)$, policies, or occupancy measures.
- Linear functional reward: For a trajectory distribution $f \in \mathcal{F}$, \(\mathcal{R}(f) = \int_{\mathcal{S} \times \mathcal{A}} r(s, a) f(s, a) \, d\mu(s, a)\) where $f(s, a)$ is the occupancy measure and $\mu$ is a reference measure.
- Nonlinear functional reward: For more complex reward structures, \(\mathcal{R}(f) = \Phi\left( \int_{\mathcal{S} \times \mathcal{A}} \psi(s, a, f(s, a)) \, d\mu(s, a) \right)\) where $\psi$ and $\Phi$ are nonlinear functions, allowing for rewards that depend on global or distributional properties (see the sketch after this list).
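A minimal Monte Carlo sketch of these two forms, assuming we can draw $(s, a)$ samples from the occupancy measure $f$; the function names and the goal-reaching reward are illustrative, not from the paper, and the nonlinear case is simplified so that $\psi$ does not depend on $f(s, a)$:

```python
import numpy as np

def linear_functional_reward(r, samples):
    """Monte Carlo estimate of R(f) = E_{(s,a) ~ f}[r(s, a)]
    from (s, a) pairs sampled from the occupancy measure f."""
    return np.mean([r(s, a) for s, a in samples])

def nonlinear_functional_reward(psi, Phi, samples):
    """Monte Carlo estimate of R(f) = Phi( E_{(s,a) ~ f}[psi(s, a)] ),
    a simplified special case where psi does not depend on f(s, a)."""
    return Phi(np.mean([psi(s, a) for s, a in samples]))

# Illustrative reward: indicator for reaching a small region around the origin.
r = lambda s, a: float(np.linalg.norm(s) < 0.1)
samples = [(np.random.randn(2), np.zeros(2)) for _ in range(1000)]
print(linear_functional_reward(r, samples))          # ~ P(goal reached) under f
print(nonlinear_functional_reward(r, np.tanh, samples))
```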
2. Transformer Neural Operator Perspective
A transformer neural operator is a neural network architecture that learns mappings between function spaces by processing sets or sequences of function evaluations (e.g., trajectory steps) using transformer layers. In this framework, the reward functional $\mathcal{R}$ is parameterized as a transformer neural operator: \(\mathcal{R}_\phi(f) \approx \mathcal{R}(f)\)
Encoding Trajectories as Sets/Sequences
Given a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$, we represent it as a sequence of tokens: \(x_t = \text{Embed}(s_t, a_t), \quad t = 0, \ldots, T-1\) or as a set $\{x_t\}_{t=0}^{T-1}$.
Transformer Neural Operator Architecture
The transformer neural operator processes the sequence or set $\{x_t\}$:
- Input Embedding: Each $(s_t, a_t)$ is embedded into a vector $x_t$.
- Transformer Layers: The sequence $\{x_t\}$ is passed through several self-attention layers: \(\{h_t\} = \text{Transformer}_\phi(\{x_t\})\)
- Pooling/Readout: The outputs $\{h_t\}$ are aggregated (e.g., via mean pooling, attention pooling, or a learned readout token) to produce a global representation $h_\text{traj}$.
- Reward Prediction: The final reward is predicted as: \(\mathcal{R}_\phi(f) = \text{MLP}_\phi(h_\text{traj})\)
This architecture allows the reward functional to flexibly represent both local and global dependencies on the trajectory, and to capture complex, permutation-invariant or sequence-dependent reward structures.
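A minimal PyTorch sketch of such an architecture is given below. It follows the embed / self-attend / pool / predict steps above, using a learned readout token for pooling; the class name, layer sizes, and batched tensor interface are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TransformerRewardFunctional(nn.Module):
    """Illustrative transformer neural operator R_phi mapping a batch of
    trajectories (as (s_t, a_t) tokens) to scalar rewards. Without positional
    encodings the model is permutation invariant (a set encoder); add them to
    make it sequence-aware."""

    def __init__(self, state_dim, action_dim, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)      # input embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)        # self-attention layers
        self.readout = nn.Parameter(torch.zeros(1, 1, d_model))      # learned readout token
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 1))             # reward prediction MLP

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        x = self.embed(torch.cat([states, actions], dim=-1))         # tokens x_t
        x = torch.cat([self.readout.expand(x.size(0), -1, -1), x], dim=1)
        h = self.encoder(x)                                          # contextualized {h_t}
        return self.head(h[:, 0]).squeeze(-1)                        # pool via readout token

# Example: scoring a batch of 8 random length-50 trajectories.
model = TransformerRewardFunctional(state_dim=4, action_dim=2)
rewards = model(torch.randn(8, 50, 4), torch.randn(8, 50, 2))        # shape (8,)
```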
Advantages of Transformer Neural Operators
- Permutation invariance/equivariance: Can handle unordered sets (e.g., occupancy measures) or ordered sequences (trajectories).
- Expressivity: Self-attention enables modeling of long-range and high-order interactions across the trajectory.
- Generalization: The operator can generalize to new, unseen reward functionals by learning from a diverse set of reward specifications during training.
3. Unsupervised RL and Zero-Shot Generalization
During the unsupervised RL phase, the agent learns a universal policy $\pi_\theta(a \mid s, z)$, where $z$ is a latent variable indexing diverse behaviors.
At test time, given a new reward functional $\mathcal{R}^*$ (possibly specified as a transformer neural operator), the agent selects $z^*$ to maximize expected reward: \(z^* = \arg\max_z \mathbb{E}_{\tau \sim \pi_\theta(\cdot|\cdot, z)} [\mathcal{R}^*(\tau)]\)
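One simple way to realize this selection is a Monte Carlo search over candidate latents, sketched below. Here `reward_fn` (an estimate of $\mathcal{R}^*$) and `rollout_fn` (running $\pi_\theta(\cdot \mid \cdot, z)$ in the environment) are assumed interfaces; the paper's actual selection mechanism may differ.

```python
def select_latent(reward_fn, rollout_fn, candidate_zs, n_rollouts=4):
    """Pick z* = argmax_z E_{tau ~ pi_theta(.|., z)}[R*(tau)] by Monte Carlo search.

    reward_fn(states, actions) -> scalar estimate of R*(tau);
    rollout_fn(z) -> (states, actions) collected by running pi_theta(.|., z).
    Both are hypothetical interfaces used only for illustration."""
    best_z, best_value = None, float("-inf")
    for z in candidate_zs:
        value = sum(float(reward_fn(*rollout_fn(z))) for _ in range(n_rollouts)) / n_rollouts
        if value > best_value:
            best_z, best_value = z, value
    return best_z

# Toy usage with stand-in functions: the best z is the one whose "trajectory" is closest to 1.
z_star = select_latent(lambda s, a: -abs(s - 1.0), lambda z: (z, None), [0.0, 0.5, 1.0])
```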
4. Policy Optimization with Transformer Neural Operator Rewards
The RL objective becomes: \(J(\pi) = \mathbb{E}_{\tau \sim \pi} [\mathcal{R}_\phi(\tau)]\) where $\mathcal{R}_\phi$ is a transformer neural operator. Standard policy-gradient methods apply by treating $\mathcal{R}_\phi(\tau)$ as the return signal, via the score-function (REINFORCE) estimator: \(\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \mathcal{R}_\phi(\tau) \, \nabla_\theta \log \pi_\theta(\tau) \right]\)
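A minimal sketch of the corresponding surrogate loss, assuming per-step log-probabilities `log_probs` of shape (B, T) from $\pi_\theta$ and trajectory rewards from a reward model like the `TransformerRewardFunctional` sketch above (no baseline or discounting, for brevity):

```python
import torch

def reinforce_loss(log_probs, functional_reward):
    """Surrogate loss whose gradient matches the score-function estimator
    grad_theta J = E_tau[ R_phi(tau) * sum_t grad_theta log pi_theta(a_t | s_t, z) ].

    log_probs:         (B, T) log pi_theta(a_t | s_t, z) for each trajectory step
    functional_reward: (B,)   R_phi(tau), detached so it acts as a fixed reward signal"""
    return -(functional_reward.detach() * log_probs.sum(dim=1)).mean()

# Example wiring (names from the earlier sketches):
# rewards = model(states, actions)        # R_phi(tau), shape (B,)
# loss = reinforce_loss(log_probs, rewards)
# loss.backward()
```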
5. Summary
- Functional reward encodings allow rewards to be specified as functionals over trajectories or policies, enabling expressive and compositional task definitions.
- Transformer neural operators provide a powerful and flexible way to parameterize these functionals, supporting generalization to new, possibly complex, reward specifications.
- This framework enables unsupervised zero-shot RL: the agent can generalize to new tasks at test time by optimizing for arbitrary reward functionals, thanks to the expressivity and generalization capacity of transformer neural operators.
Reference:
- ICML 2024: “Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings”