# Implementing Policy Gradients in Caffe

- 6 minsRecently a large portion of my research has come to involve reinforcement learning, in particular, *policy gradient methods*. These methods show great promise and lie at the heart of recent advancements, perhaps most notably by Google DeepMind. During my work, the need arose to implement *Vanilla Policy Gradient* in the auto-diff framework Caffe which lacks Tensorflow’s more advanced computational graphs. The rigidity of Caffe makes implementing the algorithm a bit complicated.

The implementation used Caffe’s Matlab interface *MatCaffe* but this guide should apply to *PyCaffe* as well – in truth doing this in PyCaffe is probably easier.

## Vanilla Policy Gradient

This blog post assumes that the reader is familiar with policy gradient methods. For a *really* good and accessible explanation check out Andrej Karpathy’s famous blog post on the subject. If you are interested in the mathematical background of auto-diff for reinforcement learning I recommend my colleague Aleksis Pirinen’s blog post.

As a quick reminder, the loss function for Vanilla Policy Gradient is defined as

\[ L(\theta) = -\sum_i A_i \log \pi_\theta(a_i | s_i) \tag{1}\label{eq:loss} \]

where \(\pi_\theta\) is the policy represented by our neural network parameterized by \(\theta\) and \(A_i\) is the advantage for taking the action \(a_i\) while in state \(s_i\).

## The Issues

When trying to implement a policy gradient method in Caffe you’ll encounter three fundamental problems:

- Caffe is built for computing the forward and backward in one step
- Caffe lacks support for matching advantages with actions
- Caffe has a fixed batch size defined in the network prototxt file

## Generating Trajectories

In classical reinforcement learning the experiences are generated sequentially by interacting with the environment through exploration.

Implementing the forward pass to generate trajectories is quite straight forward. We just create a network that takes the state as the input with a standard softmax layer at the top, from which we then sample (discrete) actions.

We store the state as well as the sampled action for each experience since we’ll use these later during the backward pass that will train the network.

Given the sequential nature of reinforcement learning, it is reasonable to have a batch size of one. However we will later note that this will become problematic in the backward pass, but for now, assume it is one.

## The Backward Pass

The reason we can’t do the forward and the backward pass in one go is since the advantage isn’t usually known until *after* the end of each trajectory. So we need to first generate the trajectories by computing many forward passes. Afterward, we then use the saved states, the sampled actions, and the resulting advantages to compute the backward pass and update the network parameters \(\theta\).

Note that we need to add another input layer to hold the advantage for the selected action.

## Dealing with the Batch Size

Two questions issues remain: how do we match the advantage to the probability of the sampled action as well how do we increase the batch size.

To solve both of these issues we create an *advantage matrix* used during the backward pass. It contains zeros in all places except for indices corresponding to sampled actions. We then perform element-wise matrix multiplication between the advantage matrix and the log probabilities.

\[ \begin{bmatrix} A_1^1 & 0 & 0 \

A_2^1 & 0 & 0 \

0 & 0 & A_3^3 \

0 & A_4^2 & 0 \end{bmatrix} \odot \begin{bmatrix} \log\pi_\theta(a_1 | s_1 ) & \log\pi_\theta(a_2 | s_1 ) & \log\pi_\theta(a_3 | s_1 ) \

\log\pi_\theta(a_1 | s_2 ) & \log\pi_\theta(a_2 | s_2 ) & \log\pi_\theta(a_3 | s_2 ) \

\log\pi_\theta(a_1 | s_3 ) & \log\pi_\theta(a_2 | s_3 ) & \log\pi_\theta(a_3 | s_3 ) \

\log\pi_\theta(a_1 | s_4 ) & \log\pi_\theta(a_2 | s_4 ) & \log\pi_\theta(a_3 | s_4 ) \

\end{bmatrix} \]

In the example above the row dimension is the batch dimension and columns contain the different type of actions. The right matrix is the log probability matrix and its rows contain the logarithm of the softmax outputs and the left matrix is the advantage matrix containing the advantages corresponding to the sampled actions. In this specific example, we see that during the first two experiences we sampled action \(a_1\), then in the third experience we sampled action \(a_3\) and in the fourth experience we sampled action \(a_2\).

Finally, we sum over the columns of the product of the two matrices yielding a column vector containing the loss for each experience in the batch. Summing once more over the batch dimension gives us the loss \((\ref{eq:loss})\) for the entire batch.

## Optimizing the Implementation

While generating the trajectories by interacting with the environment we only have a single state to work with, i.e. the current state the agent is in. Therefore we simply set the input to zero in all except for the first element in the batch dimension. As you can tell this means that for a batch size of 20 there will be 19 empty inputs in the batch. This inefficiency can be addressed by generating several trajectories in parallel. Again, if the network has a batch size of 20, then during the forward pass we would explore 20 parallel trajectories – one independent trajectory in every batch dimension.

Furthermore, we see that the advantage matrix is very sparse. If this matrix becomes too large the easiest solution would be to implement a custom layer. This layer would take the advantage vector along with the indices for the sampled action for each experience. Essentially we gather the log probabilities by index and then multiply them by their corresponding advantages.

## Final Notes

It should be noted that since we recompute the action probabilities for the backward pass they might diverge slightly from the original distributions used during the interaction with the environment. As an example, imagine that we have 40 experiences \(\{A_i, s_i, a_i\}\) that we intend to backpropagate in batches of 20. During the first batch the network parameters \(\theta\) are the same as when the experiences were generated. However, the backward pass of the first batch will update the network parameters \(\theta \rightarrow \theta'\). Therefore during the second batch the log probabilities multiplicated with the advantage matrix might be slightly different from the probabilities sampled from during trajectory generation. Experimentally we could show that this didn’t impede learning. However, a possible solution would be to collect the gradients from all experiences and then manually applying them. This, however, isn’t possible in MatCaffe but might be possible using another interface.

## Code

I’ve attached a prototxt file of a very simple network that prototypes the solution described in this blog post. The prototxt file assumes a batch size of 20 and a state dimension of 1 and an action space of 2.

Along with the prototxt file you will also need to write a program in using MatCaffe or PyCaffe that handles the interaction with the environment, the storage of the states and the sampled actions and the advantages. If you aren’t familiar with implementing policy gradients I suggest this excellent blog post on how to do it in Tensorflow.