“Playing Atari with Deep Reinforcement Learning” Summarized
https://arxiv.org/abs/1312.5602 (2013-12-19)
1. Reinforcement Learning & Deep Learning
- Reinforcement Learning: Learning to control agents directly from high-dimensional sensory inputs such as vision and speech has traditionally relied on hand-crafted features.
- Deep Learning: Advances in deep learning have made it possible to extract high-level features from raw sensory data.
- Challenges when combining the two
  - Data
    - Deep learning needs large amounts of hand-labeled training data.
    - RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed.
  - Independence
    - Deep learning algorithms assume the data samples to be independent.
    - In RL, one typically encounters sequences of highly correlated states.
  - Change in data distribution
    - Deep learning algorithms assume a fixed underlying distribution.
    - In RL, the data distribution changes as the algorithm learns new behaviors.
2. Problem Formulation of RL
Return
Sum of discounted future rewards.
\[R_t = \sum^T_{t'=t} \gamma^{t'-t}r_{t'}\]
- $T$: time-step at which the game terminates
- $\gamma$: discount factor
- $r_t$: reward at time-step $t$
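As a quick illustration of this definition, here is a minimal Python sketch computing $R_t$ from a reward trace (the reward values and $\gamma$ below are made up for illustration, not taken from the paper):

```python
# Minimal sketch: discounted return R_t for a finite reward trace.
def discounted_return(rewards, gamma=0.99, t=0):
    """Sum of gamma^(t'-t) * r_{t'} from time-step t until the episode terminates."""
    return sum(gamma ** (tp - t) * r for tp, r in enumerate(rewards) if tp >= t)

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]                 # hypothetical reward trace
print(discounted_return(rewards, gamma=0.99, t=0))  # 0.99^2 * 1 + 0.99^4 * 1 ≈ 1.94
```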
Action-Value function
Expected return achievable after seeing some sequence and taking some action.
\[Q(s,a) = E[R_t|s_t=s, a_t=a, \pi]\]
- $s$: sequence
- $a$: action
- $\pi$: policy mapping sequences to actions
Optimal Action-Value function
Bellman equation
If the optimal value function $Q^*(s', a')$ at the next time-step were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ that maximizes the expected value of $r + \gamma Q^*(s', a')$.
\[\begin{aligned} Q^*(s,a) &= \max_\pi E[R_t|s_t=s, a_t=a, \pi] \\ &= \max_{a'}E_{s' \sim e}[r + \gamma Q^*(s', a')|s,a] \\ &= E_{s' \sim e}[r + \gamma \max_{a'} Q^*(s', a')|s,a] \end{aligned}\]
- $e$: emulator (environment)
- $s’$: sequence at the next time-step
- $a’$: action at the next time-step
Iterative Update
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function, by using the Bellman equation as an iterative update.
\[Q_{i+1}(s,a) = E[r+\gamma \max_{a'}Q_i(s', a')|s, a]\]
$Q_i$ converges to $Q^*$ as $i$ goes to infinity.
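For intuition, here is a minimal tabular sketch of this iterative update on a tiny toy MDP with known, deterministic dynamics (the states, actions and rewards below are invented for illustration and have nothing to do with Atari):

```python
import numpy as np

# Toy deterministic MDP: P[s][a] = (next_state, reward); state 2 is absorbing.
P = {0: {0: (1, 0.0), 1: (2, 0.0)},
     1: {0: (2, 1.0), 1: (0, 0.0)},
     2: {0: (2, 0.0), 1: (2, 0.0)}}
gamma = 0.9
Q = np.zeros((3, 2))

# Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a'); the expectation drops out
# because the transitions here are deterministic.
for _ in range(100):
    Q_next = np.zeros_like(Q)
    for s in P:
        for a in P[s]:
            s_prime, r = P[s][a]
            Q_next[s, a] = r + gamma * Q[s_prime].max()
    Q = Q_next

print(Q)  # the iterates approach Q*
```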
Function Approximator
We use a function with parameter $\theta$ to estimate the action-value function. It can be a linear function, a neural network function, etc.
\[Q(s, a; \theta) \approx Q^*(s, a)\]
Q-network Loss Function
A Q-network is a neural network function approximator. It can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$.
\[L_i(\theta_i) = E_{s,a\sim\rho(\cdot)}[(y_i-Q(s,a;\theta_i))^2]\]
\[y_i = E_{s' \sim e}[r + \gamma \max_{a'} Q(s', a';\theta_{i-1})|s,a]\]
- $\rho(s,a)$: behavior distribution (a probability distribution over sequences $s$ and actions $a$); in practice, actions are often selected by an epsilon-greedy strategy.
Note that target $y_i$ is calculated by using the previously learned parameters $\theta_{i-1}$.
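A minimal PyTorch-style sketch of this loss, assuming a `q_net` holding the current parameters $\theta_i$ and a frozen copy `q_net_prev` holding $\theta_{i-1}$; the batch layout and names are assumptions made for illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, q_net_prev, batch, gamma=0.99):
    """L_i(theta_i) = E[(y_i - Q(s, a; theta_i))^2], with y_i built from theta_{i-1}."""
    s, a, r, s_next, done = batch                           # assumed tensor layout of a sampled batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta_i)
    with torch.no_grad():                                   # the target is held fixed; no gradient flows through y_i
        y = r + gamma * (1 - done.float()) * q_net_prev(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```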
Q-learning Algorithm
We compute the gradient by differentiating the loss function with respect to the weights $\theta$.
We optimize the loss with stochastic gradient descent. If the weights are updated after every time-step and the expectations are replaced by single samples from the behavior distribution $\rho$ and the emulator $e$ respectively, this recovers the familiar Q-learning update.
- model-free: It directly uses samples from the emulator $e$ without explicitly constructing an estimate of $e$.
- off-policy: It learns about the greedy strategy $a = \arg\max_{a'} Q(s, a'; \theta)$, while following a behavior distribution that ensures adequate exploration of the state space.
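The behavior distribution that supplies these samples is typically an epsilon-greedy policy, as noted above; a small sketch of the action selection (the `q_values` array is a hypothetical vector of Q-values for the current state):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Pick the greedy action w.r.t. Q with probability 1 - epsilon, otherwise explore uniformly."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploratory random action
    return int(np.argmax(q_values))              # greedy action argmax_a Q(s, a; theta)
```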
3. Experience Replay
We store the agent’s experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$ into a replay memory. We apply Q-learning updates to samples of experience drawn at random from the pool of stored samples.
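A minimal sketch of such a replay memory (the paper stores roughly the one million most recent transitions and samples minibatches of size 32; the `done` flag is a common practical addition, not part of the paper's tuple):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions e_t = (s_t, a_t, r_t, s_{t+1}) and samples them uniformly at random."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are discarded first

    def push(self, s, a, r, s_next, done):       # `done` marks terminal transitions (practical addition)
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```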
Advantages
- Data efficiency: Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
- Correlation: Randomizing the samples breaks the correlations between consecutive samples and reduces the variance of the updates.
- Variety of the input samples: In standard online Q-learning, the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left, then the training samples will be dominated by samples from the left-hand side. As a consequence, the parameters could get stuck in a poor local minimum or even diverge catastrophically. With experience replay, by contrast, the behavior distribution is averaged over many of its previous states and the updates are off-policy; this smooths out learning and avoids oscillations or divergence in the parameters.
4. Algorithm
The function $\phi$ in this algorithm applies three preprocessing steps to each of the last 4 frames of a history and stacks the results to produce the network input (a rough sketch follows the list):
- RGB -> gray-scale
- down-sampling: 210 x 160 -> 110 x 84
- crop: 110 x 84 -> 84 x 84
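A rough numpy sketch of these steps; the luminance weights, the nearest-neighbour down-sampling and the crop offset are my own assumptions for illustration, not the exact operations used by the authors:

```python
import numpy as np

def preprocess_frame(frame):
    """frame: (210, 160, 3) uint8 RGB Atari frame -> (84, 84) gray-scale array."""
    gray = frame @ np.array([0.299, 0.587, 0.114])   # RGB -> gray-scale (assumed luminance weights)
    rows = np.linspace(0, 209, 110).astype(int)      # naive down-sampling to 110 x 84
    cols = np.linspace(0, 159, 84).astype(int)
    small = gray[rows][:, cols]
    return small[18:102, :]                          # crop an 84 x 84 region roughly covering the playing area

def phi(last_4_frames):
    """Stack the last 4 preprocessed frames into the (4, 84, 84) network input."""
    return np.stack([preprocess_frame(f) for f in last_4_frames])
```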
5. Model Input & Output
- Input: state & action / Output: a single Q-value
  This layout matches the action-value function, which maps history-action pairs to scalar estimates of their Q-value. However, since this architecture needs a separate forward pass to compute the Q-value of each action, the authors used the layout below instead.
- Input: state / Output: one Q-value for each possible action
  The authors used a convolutional neural network for the model and call this approach the Deep Q-Network (DQN).
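A sketch of a convolutional Q-network with this state-in, Q-value-per-action-out layout, following the layer sizes reported in the paper (16 8x8 filters with stride 4, 32 4x4 filters with stride 2, a 256-unit fully connected layer, one linear output per action); it is written in PyTorch here purely for readability, which is not what the authors used:

```python
import torch.nn as nn

class DQN(nn.Module):
    """Input: a stack of 4 preprocessed 84x84 frames. Output: one Q-value per valid action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # Q(s, a) for every action in a single forward pass
        )

    def forward(self, x):
        return self.net(x)
```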
6. Convergence & Stability
1. Reward
Total reward the agent collects in a game averaged over a number of games.
The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.
2. Estimated Q
It provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state.
Q increases smoothly. This suggests that, despite lacking any theoretical convergence guarantees, the authors' method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.
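One way to track this metric, sketched under the assumption that a fixed set of held-out states was collected before training (`held_out_states` and `q_net` are hypothetical names):

```python
import torch

@torch.no_grad()
def average_max_q(q_net, held_out_states):
    """Average of max_a Q(s, a; theta) over a fixed set of held-out states."""
    q = q_net(held_out_states)                # shape: (num_states, n_actions)
    return q.max(dim=1).values.mean().item()
```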
7. Visualizing the Value Function
The predicted Q-value jumps when an enemy appears (A), etc.
8. Comparison
DQN achieved state-of-the-art results on six of the seven Atari 2600 games it was tested on, with no adjustment of the architecture or hyperparameters, and even surpassed a human expert on three of them.