
Qdagger

Overview

Qdagger, proposed in Reincarnating Reinforcement Learning (Agarwal et al., 2022), is an extension of DQN that reuses a pre-trained teacher policy instead of learning from scratch. The student agent is first trained in an offline phase on data collected with the teacher, and then fine-tuned online with an objective that adds a distillation loss toward the teacher's policy to DQN's usual one-step TD loss; the weight of the distillation term decays as the student's episodic return approaches the teacher's. Like DQN, it relies on a replay buffer and a target network to stabilize training.
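
At a high level, the training objective can be sketched as follows (our notation, following Agarwal et al., 2022; the exact form in the implementation may differ slightly):

$$ \mathcal{L}(\theta) = \underbrace{\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \big[ (Q_{\theta}(s, a) - y)^2 \big]}_{\text{TD loss}} + \lambda \, \underbrace{\mathbb{E}_{s \sim \mathcal{D}} \big[ D_{\mathrm{KL}}\big(\pi_{T}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\big) \big]}_{\text{distillation loss}}, $$

where \(\pi_{T}\) and \(\pi_{\theta}\) are softmax policies over the teacher's and the student's Q values, and the distillation coefficient \(\lambda\) decays toward 0 as the student's return approaches the teacher's.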

Original papers:

  • Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress (https://arxiv.org/abs/2206.01626)

Implemented Variants

| Variants Implemented | Description |
| --- | --- |
| qdagger_dqn_atari_impalacnn.py, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques. |
| qdagger_dqn_atari_jax_impalacnn.py, docs | For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques. |

Below are our single-file implementations of Qdagger:

qdagger_dqn_atari_impalacnn.py

The qdagger_dqn_atari_impalacnn.py has the following features:

  • For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques (a sketch of a typical pipeline follows this list).
  • Works with Atari's pixel Box observation space of shape (210, 160, 3)
  • Works with the Discrete action space
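
Below is a minimal sketch of the kind of Atari preprocessing pipeline such scripts typically use (wrapper names assume gymnasium and stable-baselines3 are installed; this is not a verbatim excerpt from qdagger_dqn_atari_impalacnn.py):

```python
import gymnasium as gym
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv,
    EpisodicLifeEnv,
    FireResetEnv,
    MaxAndSkipEnv,
    NoopResetEnv,
)

def make_env(env_id: str = "BreakoutNoFrameskip-v4"):
    env = gym.make(env_id)                        # raw (210, 160, 3) pixel observations
    env = NoopResetEnv(env, noop_max=30)          # random number of no-ops at reset
    env = MaxAndSkipEnv(env, skip=4)              # frame skipping with max-pooling
    env = EpisodicLifeEnv(env)                    # treat life loss as episode end
    if "FIRE" in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)                   # press FIRE to start games that require it
    env = ClipRewardEnv(env)                      # clip rewards to {-1, 0, +1}
    env = gym.wrappers.ResizeObservation(env, (84, 84))
    env = gym.wrappers.GrayScaleObservation(env)
    env = gym.wrappers.FrameStack(env, 4)         # stacked (4, 84, 84) observation
    return env
```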

Usage

# with poetry
poetry install -E atari
poetry run python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari-seed1
poetry run python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari-seed1

# with pip
pip install -r requirements/requirements-atari.txt
python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari-seed1
python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari-seed1

Explanation of the logged metrics

Running python cleanrl/qdagger_dqn_atari_impalacnn.py will automatically record various metrics such as losses and episodic returns in TensorBoard. Below is the documentation for these metrics:

  • charts/episodic_return: episodic return of the game
  • charts/SPS: number of steps per second
  • losses/td_loss: the mean squared error (MSE) between the Q values at timestep \(t\) and the Bellman update target estimated using the reward \(r_t\) and the Q values at timestep \(t+1\), thus minimizing the one-step temporal difference. Formally, it can be expressed by the equation below. $$ J(\theta^{Q}) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \big[ (Q(s, a) - y)^2 \big], $$ where the Bellman update target is \(y = r + \gamma \, \max_{a'} Q^{'}(s', a')\) and \(\mathcal{D}\) is the replay buffer.
  • losses/q_values: implemented as qf1(data.observations, data.actions).view(-1), it is the average Q values of the sampled data in the replay buffer; useful when gauging if under- or overestimation happens.
  • losses/distill_loss: the distillation loss that pulls the student's policy (derived from its Q values) toward the teacher's policy during the online phase.
  • losses/loss: the total online-phase loss, i.e. the TD loss plus the distillation loss weighted by the distillation coefficient (see the sketch after this list).
  • charts/teacher/avg_episodic_return: the average episodic return of the pre-trained teacher policy, used as the reference when decaying the distillation coefficient.
  • charts/offline/avg_episodic_return: the average episodic return of the student during the offline pre-training phase on teacher data.
  • charts/offline/q_loss: the TD loss during the offline pre-training phase.
  • charts/offline/distill_loss: the distillation loss during the offline pre-training phase.
  • charts/offline/loss: the total loss during the offline pre-training phase.
  • charts/distill_coeff: the distillation coefficient \(\lambda\) weighting the distillation loss; it decays toward 0 as the student's return approaches the teacher's return.
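
Below is a hedged sketch of how these quantities relate during the online phase (not a verbatim excerpt from qdagger_dqn_atari_impalacnn.py; the tensor names, temperature, and exact coefficient schedule are assumptions):

```python
import torch
import torch.nn.functional as F

def qdagger_losses(student_q, student_q_next_target, teacher_q,
                   rewards, dones, actions,
                   gamma=0.99, temperature=1.0, distill_coeff=1.0):
    """Sketch of the Qdagger objective: one-step TD loss + policy distillation.

    student_q:             student Q(s, .), shape (B, num_actions)
    student_q_next_target: student target network Q'(s', .), shape (B, num_actions)
    teacher_q:             frozen teacher Q(s, .), shape (B, num_actions)
    """
    # TD target: y = r + gamma * max_a' Q'(s', a'), zeroed at terminal states
    with torch.no_grad():
        td_target = rewards + gamma * student_q_next_target.max(dim=1).values * (1.0 - dones)
    q_values = student_q.gather(1, actions.long().unsqueeze(1)).squeeze(1)  # losses/q_values
    td_loss = F.mse_loss(q_values, td_target)                               # losses/td_loss

    # distillation: pull the student's softmax policy toward the teacher's
    teacher_policy = F.softmax(teacher_q / temperature, dim=1)
    student_log_policy = F.log_softmax(student_q / temperature, dim=1)
    distill_loss = F.kl_div(student_log_policy, teacher_policy, reduction="batchmean")  # losses/distill_loss

    # distill_coeff (charts/distill_coeff) is decayed outside this function, roughly as
    # max(1 - student_return / teacher_return, 0), so the teacher's influence fades over time
    loss = td_loss + distill_coeff * distill_loss                           # losses/loss
    return td_loss, distill_loss, loss
```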

Implementation details

WIP

Experiment results

Below are the average episodic returns for qdagger_dqn_atari_impalacnn.py.

| Environment | qdagger_dqn_atari_impalacnn.py 40M frames | (Agarwal et al., 2022)¹ 10M frames |
| --- | --- | --- |
| BreakoutNoFrameskip-v4 | 295.55 ± 12.30 | 275.15 ± 20.65 |
| PongNoFrameskip-v4 | 19.72 ± 0.20 | - |
| BeamRiderNoFrameskip-v4 | 9284.99 ± 242.28 | 6514.25 ± 411.1 |

Learning curves:

Tracked experiments and game play videos:

qdagger_dqn_atari_jax_impalacnn.py

The qdagger_dqn_atari_jax_impalacnn.py has the following features:

  • For playing Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
  • Works with Atari's pixel Box observation space of shape (210, 160, 3)
  • Works with the Discrete action space
  • Uses JAX instead of PyTorch for the neural networks and optimization

Usage

poetry install -E "atari jax"
poetry run pip install --upgrade "jax[cuda]==0.3.17" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
poetry run python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari_jax-seed1
poetry run python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari_jax-seed1
pip install -r requirements/requirements-atari.txt
pip install -r requirements/requirements-jax.txt
pip install --upgrade "jax[cuda]==0.3.17" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari_jax-seed1
python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari_jax-seed1
Warning

Note that JAX does not work on Windows. The official docs recommend using Windows Subsystem for Linux (WSL) to install JAX.

Explanation of the logged metrics

See related docs for qdagger_dqn_atari_impalacnn.py.

Implementation details

See related docs for qdagger_dqn_atari_impalacnn.py.

Experiment results

Below are the average episodic returns for qdagger_dqn_atari_jax_impalacnn.py.

| Environment | qdagger_dqn_atari_jax_impalacnn.py 40M frames | (Agarwal et al., 2022)¹ 10M frames |
| --- | --- | --- |
| BreakoutNoFrameskip-v4 | 335.08 ± 19.12 | 275.15 ± 20.65 |
| PongNoFrameskip-v4 | 18.75 ± 0.19 | - |
| BeamRiderNoFrameskip-v4 | 8024.75 ± 579.02 | 6514.25 ± 411.1 |

Learning curves:


  1. Agarwal, Rishabh, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G. Bellemare. “Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress.” arXiv, October 4, 2022. http://arxiv.org/abs/2206.01626.