Qdagger
Overview
Qdagger (Agarwal et al., 2022) is a *reincarnating reinforcement learning* method: instead of training from scratch, a student agent reuses a pretrained teacher policy to accelerate learning. The student is first trained offline with a distillation loss toward the teacher, then continues training online with the standard DQN objective while the distillation coefficient is annealed toward zero. Because Qdagger builds on DQN, it inherits DQN's main technical contributions, the replay buffer and the target network, both of which help improve the stability of the algorithm.
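Concretely, in both phases the student minimizes the usual TD objective plus a distillation term toward the teacher. A sketch of the objective in the spirit of Agarwal et al., 2022 (notation ours, not necessarily matching the implementation):

$$ \mathcal{L}(\theta) = \underbrace{\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \big[ (Q_{\theta}(s, a) - y)^2 \big]}_{\text{TD loss}} + \lambda \, \underbrace{\mathbb{E}_{s \sim \mathcal{D}} \big[ D_{\mathrm{KL}}\big( \pi_T(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \big) \big]}_{\text{distillation loss}}, $$

where \(\pi_T\) is the teacher's policy, \(\pi_{\theta}\) is the student's policy induced by its Q values, and the coefficient \(\lambda\) is annealed toward zero as the student's return approaches the teacher's.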
Original papers:
- Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress (Agarwal et al., 2022)
Implemented Variants
| Variants Implemented | Description |
|---|---|
| `qdagger_dqn_atari_impalacnn.py`, docs | For playing Atari games. It uses convolutional layers and common Atari pre-processing techniques. |
| `qdagger_dqn_atari_jax_impalacnn.py`, docs | For playing Atari games. It uses convolutional layers and common Atari pre-processing techniques. |
Below are our single-file implementations of Qdagger:
qdagger_dqn_atari_impalacnn.py
The qdagger_dqn_atari_impalacnn.py has the following features:
- For playing Atari games. It uses convolutional layers and common Atari pre-processing techniques.
- Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
Usage
```bash
poetry install -E atari
poetry run python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari-seed1
poetry run python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari-seed1
```

```bash
pip install -r requirements/requirements-atari.txt
python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari-seed1
python cleanrl/qdagger_dqn_atari_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari-seed1
```
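The `--teacher-policy-hf-repo` flag points at a Hugging Face Hub repository containing the pretrained teacher model. As a rough illustration of what fetching such a checkpoint looks like (the filename below is our assumption, not something this doc specifies; check the repo for the actual file name):

```python
# Hypothetical sketch: download a teacher checkpoint from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

teacher_model_path = hf_hub_download(
    repo_id="cleanrl/BreakoutNoFrameskip-v4-dqn_atari-seed1",  # value of --teacher-policy-hf-repo
    filename="dqn_atari.cleanrl_model",  # assumed checkpoint file name
)
print(teacher_model_path)  # local path to the downloaded checkpoint
```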
Explanation of the logged metrics
Running `python cleanrl/qdagger_dqn_atari_impalacnn.py` will automatically record various metrics such as losses and episodic returns in TensorBoard. Below is the documentation for these metrics:

- `charts/episodic_return`: episodic return of the game
- `charts/SPS`: number of steps per second
- `losses/td_loss`: the mean squared error (MSE) between the Q values at timestep \(t\) and the Bellman update target estimated using the reward \(r_t\) and the Q values at timestep \(t+1\), thus minimizing the one-step temporal difference. Formally, it can be expressed by the equation below. $$ J(\theta^{Q}) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \big[ (Q(s, a) - y)^2 \big], $$ where \(y = r + \gamma \, \max_{a'} Q^{'}(s', a')\) is the Bellman update target and \(\mathcal{D}\) is the replay buffer.
- `losses/q_values`: implemented as `qf1(data.observations, data.actions).view(-1)`, it is the average Q values of the sampled data in the replay buffer; useful when gauging if under or over estimation happens.
- `losses/distill_loss`: the distillation loss, which encourages the student's policy (induced by its Q values) to match the teacher's policy.
- `losses/loss`: the total training loss, i.e. the TD loss plus the distillation loss weighted by the distillation coefficient.
- `charts/teacher/avg_episodic_return`: the average episodic return of the pretrained teacher policy, used as the reference when annealing the distillation coefficient.
- `charts/offline/avg_episodic_return`: the student's average episodic return during the offline pre-training phase.
- `charts/offline/q_loss`: the TD loss during the offline pre-training phase.
- `charts/offline/distill_loss`: the distillation loss during the offline pre-training phase.
- `charts/offline/loss`: the total loss during the offline pre-training phase.
- `charts/distill_coeff`: the coefficient \(\lambda\) weighting the distillation loss; it is annealed toward zero as the student's average return approaches the teacher's.
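To make the relationship between these metrics concrete, below is a minimal sketch of how the logged losses could be computed for a batch. It assumes `student_q`, `teacher_q`, and `target_q` are networks mapping observations to per-action Q values; this is illustrative, not a copy of the actual implementation:

```python
# A minimal sketch of the Qdagger loss terms behind the logged metrics.
# `student_q` / `teacher_q` / `target_q` are assumed Q-networks; this is
# illustrative, not CleanRL's exact code.
import torch
import torch.nn.functional as F

def qdagger_losses(student_q, teacher_q, target_q, obs, actions, rewards,
                   next_obs, dones, distill_coeff, gamma=0.99, temperature=1.0):
    # losses/td_loss: one-step TD error against the target network.
    with torch.no_grad():
        next_q = target_q(next_obs).max(dim=1).values
        td_target = rewards + gamma * next_q * (1.0 - dones)
    q_values = student_q(obs)  # (batch, num_actions)
    q_taken = q_values.gather(1, actions.long().unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_taken, td_target)

    # losses/distill_loss: KL divergence from the teacher's softmax policy
    # (over its Q values) to the student's.
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_q(obs) / temperature, dim=1)
    student_log_probs = F.log_softmax(q_values / temperature, dim=1)
    distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # losses/loss: TD loss plus the distillation term weighted by
    # charts/distill_coeff.
    loss = td_loss + distill_coeff * distill_loss
    return td_loss, distill_loss, loss
```

One plausible schedule for the coefficient, in the spirit of the paper, is `distill_coeff = max(1.0 - avg_return / teacher_return, 0.0)`, so distillation fades out once the student matches the teacher's performance.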
Implementation details
WIP
Experiment results
Below are the average episodic returns for `qdagger_dqn_atari_impalacnn.py`.

| Environment | `qdagger_dqn_atari_impalacnn.py` 40M frames | (Agarwal et al., 2022)1 10M frames |
|---|---|---|
| BreakoutNoFrameskip-v4 | 295.55 ± 12.30 | 275.15 ± 20.65 |
| PongNoFrameskip-v4 | 19.72 ± 0.20 | - |
| BeamRiderNoFrameskip-v4 | 9284.99 ± 242.28 | 6514.25 ± 411.1 |
Learning curves:
(Learning curve plots for BreakoutNoFrameskip-v4, PongNoFrameskip-v4, and BeamRiderNoFrameskip-v4.)
Tracked experiments and game play videos:
qdagger_dqn_atari_jax_impalacnn.py
The qdagger_dqn_atari_jax_impalacnn.py has the following features:
- Uses Jax, Flax, and Optax instead of `torch`. `qdagger_dqn_atari_jax_impalacnn.py` is roughly 25%-50% faster than `qdagger_dqn_atari_impalacnn.py`
- For playing Atari games. It uses convolutional layers and common Atari pre-processing techniques.
- Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
- Works with the `Discrete` action space
Usage
```bash
poetry install -E "atari jax"
poetry run pip install --upgrade "jax[cuda]==0.3.17" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
poetry run python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari_jax-seed1
poetry run python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari_jax-seed1
```

```bash
pip install -r requirements/requirements-atari.txt
pip install -r requirements/requirements-jax.txt
pip install --upgrade "jax[cuda]==0.3.17" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id BreakoutNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/BreakoutNoFrameskip-v4-dqn_atari_jax-seed1
python cleanrl/qdagger_dqn_atari_jax_impalacnn.py --env-id PongNoFrameskip-v4 --teacher-policy-hf-repo cleanrl/PongNoFrameskip-v4-dqn_atari_jax-seed1
```
Warning
Note that JAX does not work on Windows. The official docs recommend using Windows Subsystem for Linux (WSL) to install JAX.
Explanation of the logged metrics
See related docs for `qdagger_dqn_atari_impalacnn.py`.
Implementation details
See related docs for `qdagger_dqn_atari_impalacnn.py`.
Experiment results
Below are the average episodic returns for `qdagger_dqn_atari_jax_impalacnn.py`.

| Environment | `qdagger_dqn_atari_jax_impalacnn.py` 40M frames | (Agarwal et al., 2022)1 10M frames |
|---|---|---|
| BreakoutNoFrameskip-v4 | 335.08 ± 19.12 | 275.15 ± 20.65 |
| PongNoFrameskip-v4 | 18.75 ± 0.19 | - |
| BeamRiderNoFrameskip-v4 | 8024.75 ± 579.02 | 6514.25 ± 411.1 |
Learning curves:
(Learning curve plots for BreakoutNoFrameskip-v4, PongNoFrameskip-v4, and BeamRiderNoFrameskip-v4.)