Master's Thesis on Regularized Deterministic Policy Gradient for Recurrent Networks

Abstract

Most state-of-the-art reinforcement learning algorithms adopt entropic regularization techniques (e.g., the trust region of TRPO [1]) to prevent overly aggressive policy updates. These techniques are usually very effective and enable efficient policy updates. However, entropic regularization requires stochastic policies, which makes gradient estimation more challenging than for deterministic policies [2]. Partial observability exacerbates this issue, since gradient estimates already suffer from high uncertainty caused by noisy observations and from propagation issues in deep recurrent networks. In this thesis, we investigate a new regularization technique that prevents aggressive policy updates for deterministic policies. The student is expected to have knowledge of reinforcement learning and machine learning, as well as good programming skills in Python and an automatic differentiation library (e.g., PyTorch, TensorFlow, or JAX). The Master's thesis will be supervised by an international team of reinforcement learning experts.
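
As a brief background sketch (using standard notation from the cited papers; the form of the regularizer to be developed in the thesis is left open), trust region policy optimization [1] limits how far a stochastic policy $\pi_\theta$ can move at each update through a relative-entropy (KL) constraint:

$$\max_\theta \; \mathbb{E}_{s \sim \rho_{\pi_{\text{old}}},\, a \sim \pi_{\text{old}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A^{\pi_{\text{old}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta .$$

The deterministic policy gradient [2] instead updates a deterministic policy $\mu_\theta$ directly through the action-value function, avoiding the variance introduced by sampling actions:

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right].$$

The open question addressed by the thesis is how a trust-region-like proximity term can be imposed on $\mu_\theta$, since the KL divergence between successive deterministic policies is degenerate.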

References

  1. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.
  2. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International conference on machine learning (pp. 387-395). PMLR.

🎓 Supervisors

Samuele Tosatto (University of Alberta, Canada), Riad Akrour (INRIA, Lille, France), Joni Pajarinen (Aalto University, Finland)

✉ Contact

joni.pajarinen@aalto.fi, tosatto@ualberta.ca, akrouriad@gmail.com