Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming
Published in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-24), 2024
Key Contributions
- Adaptable Safety Guarantee using RL: We developed a policy optimization framework whose resulting policy achieves an adaptable safety guarantee on unseen tasks.
- Improved Performance: We developed an adaptable-safety benchmark and evaluated our algorithm against various baselines.
🎥 Comparative Results on Adaptation to Unseen Tasks
Environment: Point-Button-Hazard



🛠 Method Overview — Safe Meta-Policy Optimization
The core of Meta-CPO is a bi-level constrained optimization framework that enables end-to-end meta-training while maintaining strict safety guarantees on unseen (but in-distribution) tasks.
1. Local Task Updates (Inner-Loop)
For each task \(\mathcal{T}_i\) among the given \(M\) tasks \(\{\mathcal{T}_i \}_{i=1}^{M}\), we first initialize a local learner \(\phi_i\) with the parameters of the meta-learner \(\theta\). At each local iteration \(k\), we solve the constrained policy optimization (CPO) subproblem using differentiable convex programming:
\[\phi^{k+1}_i = \text{argmax}_{\phi_i} \; g(\phi^k_i, D^{tr}_i)^\top (\phi_i - \phi^k_i)\] \[\qquad\text{s.t. } \frac{1}{2}\|\phi_i - \phi^k_i\|^2_H \leq \delta, \quad b_{\phi_i} + a(\phi^k_i, D^{tr}_i)^\top (\phi_i - \phi^k_i) \leq 0.\]
This ensures that task-specific exploration respects both the trust-region constraint and the safety constraint \(J_C(\pi) \leq h\).
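As an illustration only (not the paper's implementation), the linearized subproblem above can be solved in closed form up to a scalar dual variable: maximize \(g^\top x\) subject to \(\frac{1}{2}x^\top H x \leq \delta\) and \(b + a^\top x \leq 0\), with \(x = \phi_i - \phi^k_i\). The bisection scheme and all variable names below are our own simplification, and feasibility of the constraint set is assumed:

```python
import numpy as np

def cpo_step(g, a, b, H, delta):
    """Solve: max g^T x  s.t.  0.5 x^T H x <= delta,  b + a^T x <= 0.

    Sketch of one linearized CPO update (x = phi - phi_k). Assumes H is
    positive definite and the constraint set is feasible.
    """
    Hinv = np.linalg.inv(H)

    def tr_step(v):
        # Maximizer of v^T x over the trust region lies on its boundary,
        # in the natural-gradient direction H^{-1} v.
        q = Hinv @ v
        return np.sqrt(2.0 * delta / (v @ q)) * q

    x = tr_step(g)
    if b + a @ x <= 0:
        return x  # safety constraint inactive: plain trust-region step

    # Otherwise, tilt the objective by a multiplier nu >= 0 on the safety
    # constraint and bisect until the constraint holds on the boundary.
    lo, hi = 0.0, 1.0
    while b + a @ tr_step(g - hi * a) > 0:
        hi *= 2.0
    for _ in range(100):
        nu = 0.5 * (lo + hi)
        if b + a @ tr_step(g - nu * a) > 0:
            lo = nu
        else:
            hi = nu
    return tr_step(g - hi * a)
```

In the actual algorithm this solve is performed by a differentiable convex-programming layer, so gradients can flow through the argmax during meta-training.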
2. Differentiable Meta-Update (Outer-Loop)
We update the meta-learner \(\theta\) by maximizing the average performance across the \(M\) tasks. A major challenge in meta-learning is differentiating through the local learner updates; here, the computational graph generated by the differentiable constrained optimization in the local task updates lets us compute the meta-gradient via the chain rule:
\[\frac{dF}{d\theta} = \frac{1}{M} \sum_{i=1}^M \left( \prod_{k=0}^{K-1} \frac{d\text{Alg}_i(\phi^{k+1}_i)}{d\phi^k_i} \right) \cdot g(\phi^K_i, D^{tr}_i).\]
Having differentiated through the local learner updates, the algorithm updates the meta-learner by aggregating the backpropagated gradients and projecting onto the global safety constraint \(G(\theta) \leq 0\):
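The chain-rule structure of the meta-gradient can be sanity-checked on a toy problem. The sketch below is not the paper's method: it replaces the convex-program inner step \(\text{Alg}_i\) with plain gradient ascent on a quadratic per-task objective (the tasks, \(\alpha\), and \(A_i\), \(c_i\) are made-up illustrations), so the inner-step Jacobians are available in closed form and their product times the final gradient reproduces \(dF/d\theta\):

```python
import numpy as np

# Toy per-task objective f_i(phi) = -0.5 (phi - c_i)^T A_i (phi - c_i),
# with gradient g(phi) = A_i (c_i - phi). (c_i, A_i, alpha are made up.)
alpha, K = 0.1, 3
tasks = [(np.array([1.0, -1.0]), np.diag([1.0, 2.0])),
         (np.array([0.5, 2.0]), np.diag([2.0, 1.0]))]

def grad(phi, c, A):
    return A @ (c - phi)

def meta_grad(theta):
    """dF/dtheta = (1/M) sum_i (prod_k dAlg_i/dphi^k)^T g(phi^K_i)."""
    n, total = len(theta), np.zeros_like(theta)
    for c, A in tasks:
        phi, J = theta.copy(), np.eye(n)
        for _ in range(K):
            # Inner step phi^{k+1} = phi^k + alpha g(phi^k);
            # its Jacobian w.r.t. phi^k is I - alpha A.
            J = (np.eye(n) - alpha * A) @ J
            phi = phi + alpha * grad(phi, c, A)
        total += J.T @ grad(phi, c, A)  # backprop through K inner steps
    return total / len(tasks)
```

In Meta-CPO itself, the inner-step Jacobians come from differentiating through the convex-programming layer rather than from a hand-derived gradient step, but the product structure of the meta-gradient is the same.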
\[\theta' = \text{argmax}_{\theta'} \; \left( \frac{dF}{d\theta} \right)^\top (\theta' - \theta)\] \[\qquad\text{s.t. } \frac{1}{2}\|\theta' - \theta\|^2_H \leq \delta_\theta, \quad b_\theta + \left( \frac{dG}{d\theta} \right)^\top (\theta' - \theta) \leq 0.\]
This mechanism allows the policy to generalize to new, in-distribution tasks with an adaptable safety guarantee.
📖 BibTeX Citation
@inproceedings{cho2024constrained,
title={Constrained meta-reinforcement learning for adaptable safety guarantee with differentiable convex programming},
author={Cho, Minjae and Sun, Chuangchuang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={19},
pages={20975--20983},
year={2024}
}

Recommended citation: Cho, Minjae, and Chuangchuang Sun. "Constrained meta-reinforcement learning for adaptable safety guarantee with differentiable convex programming." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 19. 2024.
Download Paper
