Abstract
One of the key challenges in Reinforcement Learning (RL) is an agent's ability to generalize its learned policy to unseen settings. Moreover, training an RL agent requires a large number of interactions with the environment. Motivated by the success of Imitation Learning (IL), we conduct a study to investigate whether an agent can leverage offline data in the form of trajectories to improve sample efficiency in procedurally generated environments. We consider two settings for using IL from offline data alongside RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyze the impact of the quality (optimality of trajectories), quantity, and diversity of the available offline trajectories on the effectiveness of both approaches. Across four well-known sparse-reward tasks in the MiniGrid environment, we find that using IL both for pre-training and concurrently during online RL training consistently improves sample efficiency, and in some tasks achieves higher returns than using either IL or RL alone. Furthermore, we show that training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Evaluation on two tasks of the Procgen environment further highlights that the diversity of the training data is more important than its quality. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.
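The two settings can be illustrated with a minimal sketch, assuming a discrete-action policy trained in PyTorch, a cross-entropy behavioral-cloning loss as the imitation term, and a simple policy-gradient term standing in for the online RL update; the observation/action dimensions, the `il_coef` weight, and all function names below are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the two settings studied in the paper (illustrative only):
# (1) IL pre-training on offline trajectories, then
# (2) concurrent IL + online RL with a combined loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 147, 7  # e.g. a flattened MiniGrid observation (assumed)

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# --- Setting (1): pre-train with behavioral cloning on (obs, action) pairs ---
def pretrain_bc(offline_obs, offline_actions, epochs=10):
    for _ in range(epochs):
        logits = policy(offline_obs)
        loss = F.cross_entropy(logits, offline_actions)  # imitation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# --- Setting (2): concurrent IL + online RL update ---------------------------
def concurrent_update(online_obs, online_actions, advantages,
                      offline_obs, offline_actions, il_coef=0.5):
    # Policy-gradient term from online rollouts (REINFORCE-style here for
    # brevity; any standard on-policy algorithm such as PPO would fit).
    logits = policy(online_obs)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, online_actions.unsqueeze(1)).squeeze(1)
    rl_loss = -(chosen * advantages).mean()

    # Imitation term computed on the offline trajectories.
    il_loss = F.cross_entropy(policy(offline_obs), offline_actions)

    loss = rl_loss + il_coef * il_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The `il_coef` weight controls how strongly the offline trajectories influence the online update; the study varies the quality, quantity, and diversity of those trajectories rather than prescribing a single weighting.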
| Original language | English |
|---|---|
| Article number | 129079 |
| Journal | Neurocomputing |
| Volume | 618 |
| Publication status | Published - 14 Feb 2025 |
Keywords
- Diversity
- Generalization
- Imitation Learning
- Procedurally generated environments
- Reinforcement Learning