“Playing Atari with Deep Reinforcement Learning” Summarized


https://arxiv.org/abs/1312.5602 (2013-12-19)

1. Reinforcement Learning & Deep Learning

  1. Reinforcement Learning

    Learning to control agents directly from high-dimensional sensory inputs like vision and speech has typically relied on hand-crafted features.

  2. Deep Learning

    Advances in deep learning have made it possible to extract high-level features from raw sensory data.

  3. Challenges when combining the two

    1. Data
      • Deep learning typically requires large amounts of hand-labeled training data.
      • RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed.
    2. Independence
      • Deep learning algorithms assume the data samples to be independent.
      • In RL, one typically encounters sequences of highly correlated states.
    3. Change in data distribution
      • Deep learning assumes a fixed underlying data distribution.
      • In RL, the data distribution changes as the algorithm learns new behaviors.

2. Problem Formulation of RL

Return

Sum of discounted future rewards.

\[R_t = \sum^T_{t'=t} \gamma^{t'-t}r_{t'}\]
  • $T$: time-step at which the game terminates
  • $\gamma$: discount factor
  • $r_t$: reward at time-step $t$
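
As a concrete illustration, here is a minimal sketch of computing $R_t$ from a list of per-step rewards; the reward values and discount factor are made up for the example.

```python
def discounted_return(rewards, t, gamma=0.99):
    """R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for a finished episode."""
    return sum(gamma ** (tp - t) * rewards[tp] for tp in range(t, len(rewards)))

# Example: a sparse, delayed reward signal (values are illustrative).
rewards = [0.0, 0.0, 0.0, 1.0, 0.0, -1.0]
print(discounted_return(rewards, t=0))  # 1.0 * 0.99**3 - 1.0 * 0.99**5
```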

Action-Value function

The expected return achievable after seeing some sequence $s$ and taking some action $a$, then following policy $\pi$ thereafter.

\[Q(s,a) = E[R_t|s_t=s, a_t=a, \pi]\]
  • $s$: sequence
  • $a$: action
  • $\pi$: policy mapping sequences to actions

Optimal Action-Value function

Bellman equation

If the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step was known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ that maximizes the expected value of $r + \gamma Q^*(s', a')$.

\[Q^*(s,a) = \max_\pi E[R_t|s_t=s, a_t=a, \pi] \\ = E_{s' \sim e}[r + \gamma \max_{a'} Q^*(s', a')|s,a]\]
  • $e$: emulator (environment)
  • $s'$: sequence at the next time-step
  • $a'$: action at the next time-step

Iterative Update

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update.

\[Q_{i+1}(s,a) = E[r+\gamma \max_{a'}Q_i(s', a')|s, a]\]

$Q_i$ converges to $Q^*$ as $i$ goes to infinity.
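
A minimal tabular sketch of this update on a tiny made-up MDP; the two-state transition table, rewards, and discount are invented for illustration, and with a known deterministic model the expectation reduces to a table lookup.

```python
import itertools

gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]

# Hypothetical deterministic model: (state, action) -> (reward, next_state).
model = {
    ("s0", "a0"): (0.0, "s0"), ("s0", "a1"): (1.0, "s1"),
    ("s1", "a0"): (0.0, "s0"), ("s1", "a1"): (2.0, "s1"),
}

Q = {(s, a): 0.0 for s in states for a in actions}  # Q_0
for i in range(100):  # Q_{i+1}(s,a) = E[r + gamma * max_a' Q_i(s',a')]
    Q_next = {}
    for s, a in itertools.product(states, actions):
        r, s_next = model[(s, a)]
        Q_next[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q = Q_next

print(Q)  # approaches Q* as the number of iterations grows
```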

Function Approximator

We use a function approximator with parameters $\theta$ to estimate the action-value function. It can be a linear function, a neural network, etc.

\[Q(s, a; \theta) \approx Q^*(s, a)\]

Q-network Loss Function

A Q-network is a neural network function approximator with weights $\theta$. It can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$.

\[L_i(\theta_i) = E_{s,a\sim\rho(\cdot)}[(y_i-Q(s,a;\theta_i))^2]\] \[y_i = E_{s' \sim e}[r + \gamma \max_{a'} Q(s', a';\theta_{i-1})|s,a]\]
  • $\rho(s,a)$: behavior distribution, a probability distribution over sequences $s$ and actions $a$; in practice it is often generated by an $\epsilon$-greedy strategy.

Note that the target $y_i$ is calculated using the previously learned parameters $\theta_{i-1}$, which are held fixed while optimizing $L_i(\theta_i)$.
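
A sketch of this loss in PyTorch, assuming small fully-connected networks and a frozen copy of the weights playing the role of $\theta_{i-1}$; the sizes and shapes are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_inputs, n_actions = 4, 2  # illustrative sizes

q_net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))
prev_net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))
prev_net.load_state_dict(q_net.state_dict())  # plays the role of theta_{i-1}

def q_loss(states, actions, rewards, next_states, dones, gamma=0.99):
    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); no gradient flows into the target.
    with torch.no_grad():
        y = rewards + gamma * (1.0 - dones) * prev_net(next_states).max(dim=1).values
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a; theta_i)
    return F.mse_loss(q, y)
```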

Q-learning Algorithm

Differentiating the loss function with respect to the weights gives the gradient

\[\nabla_{\theta_i} L_i(\theta_i) = E_{s,a\sim\rho(\cdot);\,s' \sim e}\left[\left(r + \gamma \max_{a'} Q(s', a';\theta_{i-1}) - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right]\]

Rather than computing the full expectations in this gradient, it is computationally expedient to optimize the loss by stochastic gradient descent. If the weights are updated after every time-step, the expectations are replaced by single samples from the behavior distribution $\rho$ and the emulator $e$ respectively (see the sketch after the list below).

  • model-free: It solves the task directly using samples from the emulator $e$, without explicitly constructing an estimate of $e$.
  • off-policy: It learns about the greedy strategy $a = \arg\max_{a'} Q(s, a'; \theta)$ while following a behavior distribution that ensures adequate exploration of the state space.
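
A minimal sketch of the single-sample update with a linear function approximator; the state size, action count, and learning rate are made up, and the paper itself trains a convolutional network rather than a linear one.

```python
import numpy as np

n_actions, n_features = 2, 4                  # illustrative sizes
theta = np.zeros((n_actions, n_features))     # Q(s, a; theta) = theta[a] @ s

def q_values(s):
    return theta @ s                          # one Q-value per action

def single_sample_update(s, a, r, s_next, done, gamma=0.99, lr=0.01):
    """One Q-learning step from a single sampled transition (s, a, r, s')."""
    target = r if done else r + gamma * q_values(s_next).max()
    td_error = target - q_values(s)[a]
    # Gradient of 0.5 * (target - theta[a] @ s)^2 w.r.t. theta[a] is -td_error * s.
    theta[a] += lr * td_error * s
```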

3. Experience Replay

We store the agent's experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a replay memory, and apply Q-learning updates to samples of experience drawn at random from the pool of stored transitions.
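
A minimal replay-memory sketch; the capacity and minibatch size are arbitrary choices for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions e_t = (s_t, a_t, r_t, s_{t+1}) and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.memory, batch_size)

# Usage: after each emulator step, store the transition, then draw a random
# minibatch from the memory and apply a Q-learning update to it.
```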

Advantages

  1. Data Efficiency

    Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.

  2. Correlation

    Randomizing the samples breaks the correlations between the samples and reduces the variance of the updates.

  3. Variety of the input samples

    In standard online Q-learning, the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left, then the training samples will be dominated by samples from the left-hand side. As a consequence, unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum or even diverge catastrophically. With experience replay, on the other hand, the behavior distribution is averaged over many of its previous states, which smooths out learning and avoids oscillations or divergence in the parameters. Note that learning from replayed experience is necessarily off-policy, because the current parameters differ from those used to generate the samples.

4. Algorithm

Function $\phi$ from this algorithm applies three preprocessing steps to each of the last 4 frames of a history and stacks them to produce the network input (see the sketch after the list below).

  1. RGB -> gray-scale
  2. down-sampling: 210 x 160 -> 110 x 84
  3. crop: 110 x 84 -> 84 x 84
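
A sketch of this preprocessing with OpenCV; the exact crop offset is not given in this summary, so the region below is a guess that roughly keeps the playing area.

```python
import cv2
import numpy as np

def preprocess(frame):
    """frame: 210 x 160 x 3 RGB array -> 84 x 84 gray-scale array."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # 210 x 160
    small = cv2.resize(gray, (84, 110))              # 110 x 84 (dsize is width, height)
    return small[18:102, :]                          # crop to 84 x 84 (offset is a guess)

def phi(last_4_frames):
    """Stack the 4 most recent preprocessed frames into the network input."""
    return np.stack([preprocess(f) for f in last_4_frames], axis=0)  # 4 x 84 x 84
```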

5. Model Input & Output

  • Input: state & action / Output: Q-value

    This approach matches the Q-value function, which maps history-action pairs to scalar estimates of their Q-value. However, since such an architecture requires a separate forward pass to compute the Q-value of each action, the authors used the input-output scheme below instead.

  • Input: state / Outputs: Q-value for each possible action

The authors used a convolutional neural network for the model and call this approach a Deep Q-Network (DQN).
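
A sketch of such a network in PyTorch with one output unit per action; the layer sizes are illustrative choices for an 84 x 84 x 4 input, not necessarily the exact architecture used by the authors.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 preprocessed frames to one Q-value per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),                              # one Q-value per action
        )

    def forward(self, x):  # x: (batch, 4, 84, 84)
        return self.head(self.features(x))

# A single forward pass yields the Q-values of every action for a given state.
q = DQN(n_actions=4)(torch.zeros(1, 4, 84, 84))
```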

6. Convergence & Stability

1. Reward

Total reward the agent collects in a game averaged over a number of games.

The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.

2. Estimated Q

It provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state.

The estimated Q increases smoothly. This suggests that, despite lacking any theoretical convergence guarantees, the authors' method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.

7. Visualizing the Value Function

The predicted Q-value jumps when an enemy appears (A), etc.

8. Comparison

DQN gave state-of-the-art results on six of the seven Atari 2600 games it was tested on, with no adjustment of the architecture or hyperparameters, and it even surpassed a human expert on three of them.
