Using offline data to speed up Reinforcement Learning in procedurally generated environments

Alain Andres*, Lukas Schäfer, Stefano V. Albrecht, Javier Del Ser

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

One of the key challenges of Reinforcement Learning (RL) is the ability of an agent to generalize its learned policy to unseen settings. Moreover, training an RL agent requires large numbers of interactions with the environment. Motivated by the success of Imitation Learning (IL), we conduct a study to investigate whether an agent can leverage offline data in the form of trajectories to improve sample efficiency in procedurally generated environments. We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyze the impact of the quality (optimality of trajectories), quantity, and diversity of available offline trajectories on the effectiveness of both approaches. Across four well-known sparse-reward tasks in the MiniGrid environment, we find that using IL both for pre-training and concurrently during online RL training consistently improves sample efficiency and, in some tasks, achieves higher returns than using either IL or RL alone. Furthermore, we show that training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Evaluation in two tasks of the Procgen environment further highlights that the diversity of the training data is more important than its quality. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.
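To make the two settings in the abstract concrete, the sketch below shows one common way they can be instantiated: behavioural cloning (BC) on offline trajectories for pre-training, and a weighted BC term added to an online policy-gradient loss for concurrent training. This is an illustrative sketch only, not the authors' implementation; the PyTorch policy network, the REINFORCE-style surrogate loss, and the `bc_weight` coefficient are assumptions made for brevity.

```python
# Illustrative sketch (not the paper's implementation) of:
# (1) pre-training a policy with behavioural cloning (BC) on offline trajectories
# (2) a concurrent update combining an RL loss with a weighted BC term
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small MLP policy producing action logits (sizes are arbitrary choices)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def bc_loss(policy: PolicyNet, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Behavioural cloning: cross-entropy between policy logits and demo actions."""
    return F.cross_entropy(policy(obs), actions)


def pretrain(policy: PolicyNet, demo_obs, demo_actions, epochs: int = 10, lr: float = 1e-3):
    """Setting (1): supervised pre-training on offline trajectories before online RL."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        bc_loss(policy, demo_obs, demo_actions).backward()
        opt.step()


def combined_update(policy, opt, online_batch, demo_batch, bc_weight: float = 0.1):
    """Setting (2): one gradient step on an RL surrogate plus a weighted BC term."""
    obs, actions, returns = online_batch
    log_probs = F.log_softmax(policy(obs), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    rl_loss = -(chosen * returns).mean()        # REINFORCE-style surrogate (assumption)
    il_loss = bc_loss(policy, *demo_batch)      # imitation term on offline data
    loss = rl_loss + bc_weight * il_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = PolicyNet(obs_dim=8, n_actions=4)
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    # Dummy offline demonstrations and online rollout data, for illustration only.
    demo_obs, demo_act = torch.randn(64, 8), torch.randint(0, 4, (64,))
    pretrain(policy, demo_obs, demo_act)                                    # setting (1)
    online = (torch.randn(32, 8), torch.randint(0, 4, (32,)), torch.randn(32))
    print(combined_update(policy, opt, online, (demo_obs, demo_act)))       # setting (2)
```

In practice the RL surrogate above would typically be replaced by the loss of whatever online algorithm is used (e.g. a PPO clipped objective), with the BC term simply added to it; the structure of the update stays the same.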

Original language: English
Article number: 129079
Journal: Neurocomputing
Volume: 618
DOIs
Publication status: Published - 14 Feb 2025

Keywords

  • Diversity
  • Generalization
  • Imitation Learning
  • Procedurally generated environments
  • Reinforcement Learning
