<aside> 💡

This paper introduces an elegant, practical regularization method for trajectory optimization in model-based RL that leverages denoising autoencoders to implicitly model the distribution of previously seen trajectories. By penalizing unrealistic candidate plans through the denoising reconstruction error, the optimizer is guided to safer and more reliable solutions. This improves generalization and robustness in model-based control, addressing a key failure mode where trajectory optimizers exploit inaccurate learned models.

The use of denoising autoencoders enables this without costly explicit density estimation or uncertainty modeling, making it broadly applicable. The paper presents solid theoretical motivation, practical algorithms, and strong empirical results validating the approach.

Where the paper innovates is in using the DAE as a measure of trajectory realism during trajectory optimization: the DAE does not directly predict the next action, but acts as an unsupervised realism regularizer in the optimization process, improving the stability and reliability of planned trajectories.


</aside>

Model-Based Reinforcement Learning and Trajectory Optimization

Model-based reinforcement learning (RL) is an approach where an agent learns a model of its environment’s dynamics, i.e., a function that predicts the next state (and reward) given the current state and action. This contrasts with model-free RL, where agents learn policies or value functions directly from experience without an explicit predictive model. The primary advantage of model-based RL is data efficiency, as the learned model enables simulating future outcomes and planning ahead without needing to interact with the environment excessively.

Trajectory optimization is the process by which the agent uses its learned model to search for the best sequence of actions (called a trajectory) that maximizes expected cumulative reward over some planning horizon. Formally, given a learned model $f$, the agent tries to find the trajectory $\tau = (o_t, a_t, ..., o_{t+H}, a_{t+H})$ that maximizes:

$$ G(\tau) = \sum_{i=t}^{t+H} r(o_i, a_i) $$

where $o_i$ is the observation/state at step $i$, $a_i$ is the action, and $r$ is the reward function. This optimization problem is non-trivial because the true environment dynamics are unknown and the learned model $f$ is only an approximation learned from finite data.
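A common way to carry out this search is random shooting: sample many candidate action sequences, roll each out through the learned model, and keep the sequence with the highest predicted return $G(\tau)$. The sketch below illustrates this with a hand-coded stand-in for the learned model $f$ and reward $r$; all names, dynamics, and hyperparameters here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative stand-ins for the learned dynamics model f and reward r.
def f(o, a):
    return o + 0.1 * a  # predicted next observation

def r(o, a):
    # prefer staying near the origin, with a small action cost
    return -np.sum(o**2) - 0.01 * np.sum(a**2)

def plan_random_shooting(o0, horizon=10, n_candidates=500, rng=None):
    """Sample candidate action sequences, roll them out through f,
    and return the sequence with the highest predicted return G(tau)."""
    rng = rng or np.random.default_rng(0)
    best_return, best_actions = -np.inf, None
    for _ in range(n_candidates):
        # for simplicity, the action has the same dimension as the observation
        actions = rng.uniform(-1, 1, size=(horizon, o0.shape[0]))
        o, total = o0, 0.0
        for a in actions:
            total += r(o, a)
            o = f(o, a)  # simulate forward under the learned model
        if total > best_return:
            best_return, best_actions = total, actions
    return best_actions, best_return

actions, G = plan_random_shooting(np.array([1.0, -1.0]))
```

Gradient-based optimizers and the cross-entropy method play the same role in practice; random shooting is just the simplest instance of searching over $\tau$ under the model.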


The Problem: Overfitting and Exploitation During Trajectory Optimization

One central challenge in model-based RL is that the learned model is an imperfect surrogate of the true dynamics. It may have inaccuracies and blind spots, especially in regions of the state-action space not sufficiently explored during training. When performing trajectory optimization, numerical optimizers (especially gradient-based ones) can "exploit" these imperfections, finding trajectories that look highly rewarding under the model but are unrealistic or even catastrophic in the real environment.

This phenomenon is akin to adversarial attacks on neural networks — the optimizer finds trajectories that maximize the predicted reward but correspond to states or sequences poorly modeled or outside the training distribution. Such trajectories are often highly unrealistic or risky, causing failures when deployed.


Core Idea: Regularize Trajectory Optimization Using Denoising Autoencoders (DAEs)

The paper proposes to counter this issue by adding a regularization term to the trajectory optimization objective that encourages solutions to stay close to "familiar" or "likely" trajectories—those similar to what the model has experienced during training.

The challenge is how to define and compute a realistic regularization term that reflects the likelihood or similarity of trajectories to the training data. Trajectories are complex, high-dimensional sequential data, making direct density estimation difficult.

The innovative solution is to use a Denoising Autoencoder (DAE) trained on trajectory segments sampled from past experience. A DAE is a neural network designed to reconstruct original inputs from corrupted versions. Crucially, DAEs can implicitly learn the structure and underlying distribution of data.
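As a sketch of the resulting objective, one simple form of the penalty is the DAE reconstruction error of the candidate trajectory: trajectories the DAE cannot reconstruct well are treated as unfamiliar and penalized. The snippet below uses a closed-form shrinkage denoiser (the optimal denoiser for Gaussian data) as a stand-in for a trained DAE; the toy distribution, reward, `lam`, and all names are illustrative assumptions:

```python
import numpy as np

# Stand-in denoiser: for data ~ N(mu, sigma2 * I) corrupted by Gaussian noise
# of variance eps2, the optimal denoiser shrinks inputs toward the mean.
mu, sigma2, eps2 = 0.0, 1.0, 0.25

def dae_denoise(x):
    # E[x_clean | x_noisy] for this toy distribution
    return mu + (sigma2 / (sigma2 + eps2)) * (x - mu)

def regularized_objective(tau, reward_fn, lam=1.0):
    """G(tau) minus a penalty on the DAE reconstruction error,
    which grows for trajectories far from the training distribution."""
    penalty = np.sum((dae_denoise(tau) - tau) ** 2)
    return reward_fn(tau) - lam * penalty

reward_fn = lambda tau: float(np.sum(tau))  # toy reward: prefers large values
familiar = np.zeros(4)             # near the data mean -> small penalty
unfamiliar = np.full(4, 100.0)     # far from the data  -> large penalty
```

With the penalty included, a trajectory with a higher raw reward but far from the training distribution can score worse than a familiar one, which is exactly the behavior the regularizer is meant to induce.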


What is a Denoising Autoencoder (DAE) and How Does It Work?

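A DAE is trained to map corrupted inputs back to clean ones. The minimal one-dimensional example below (a linear model chosen for illustration; the paper trains deeper networks on trajectory windows) shows the key property exploited here: reconstruction error is small near the training distribution and grows far from it, making it usable as an out-of-distribution signal:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(2.0, 1.0, size=5000)          # training data ~ N(2, 1)
noisy = clean + rng.normal(0.0, 0.5, size=5000)  # corrupted inputs

# DAE training for this linear model g(x) = w*x + b reduces to least squares:
# fit a map from noisy inputs back to the clean originals.
A = np.stack([noisy, np.ones_like(noisy)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, clean, rcond=None)

def reconstruction_error(x, noise_std=0.5, n=1000):
    """Average ||g(x + noise) - x||^2: small on familiar inputs,
    large on inputs far from the training distribution."""
    xs = x + np.random.default_rng(1).normal(0.0, noise_std, size=n)
    return float(np.mean((w * xs + b - x) ** 2))

err_in = reconstruction_error(2.0)    # in-distribution input
err_out = reconstruction_error(10.0)  # far out-of-distribution input
```

The fitted denoiser shrinks inputs toward the training data, so it reconstructs familiar points almost perfectly while systematically mis-reconstructing distant ones; that gap is what the trajectory regularizer measures.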