2 changes: 1 addition & 1 deletion _posts/2016-05-31-rl.markdown
@@ -179,7 +179,7 @@ Policy gradients to the rescue! We'll think about the part of the network that d
<img src="/assets/rl/nondiff2.png" width="600">
</div>

- **Trainable Memory I/O**. You'll also find this idea in many other papers. For example, a [Neural Turing Machine](https://arxiv.org/abs/1410.5401) has a memory tape that they it read and write from. To do a write operation one would like to execute something like `m[i] = x`, where `i` and `x` are predicted by an RNN controller network. However, this operation is non-differentiable because there is no signal telling us what would have happened to the loss if we were to write to a different location `j != i`. Therefore, the NTM has to do *soft* read and write operations. It predicts an attention distribution `a` (with elements between 0 and 1 and summing to 1, and peaky around the index we'd like to write to), and then doing `for all i: m[i] = a[i]*x`. This is now differentiable, but we have to pay a heavy computational price because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!
+ **Trainable Memory I/O**. You'll also find this idea in many other papers. For example, a [Neural Turing Machine](https://arxiv.org/abs/1410.5401) has a memory tape that they read and write from. To do a write operation one would like to execute something like `m[i] = x`, where `i` and `x` are predicted by an RNN controller network. However, this operation is non-differentiable because there is no signal telling us what would have happened to the loss if we were to write to a different location `j != i`. Therefore, the NTM has to do *soft* read and write operations. It predicts an attention distribution `a` (with elements between 0 and 1 and summing to 1, and peaky around the index we'd like to write to), and then doing `for all i: m[i] = a[i]*x`. This is now differentiable, but we have to pay a heavy computational price because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!
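The soft write described above can be sketched in a few lines of numpy. This is only an illustration of the `for all i: m[i] = a[i]*x` update from the text, with assumed shapes (N memory slots of dimension D), not the full NTM write (which also interpolates with the old memory contents via an erase vector):

```python
import numpy as np

def peaky_attention(logits):
    # Softmax turns controller logits into an attention distribution:
    # elements in (0, 1), summing to 1, peaked at the largest logit.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def soft_write(m, a, x):
    # Soft write from the text: for all i, m[i] = a[i] * x.
    # m: (N, D) memory, a: (N,) attention, x: (D,) write vector.
    # Note every row of m is touched -- O(N*D) work per write.
    return a[:, None] * x  # broadcast: row i becomes a[i] * x

m = np.zeros((4, 3))
a = peaky_attention(np.array([0.1, 8.0, 0.1, 0.1]))  # peaked at index 1
x = np.array([1.0, 2.0, 3.0])
m = soft_write(m, a, x)  # row 1 is close to x; all other rows get a tiny copy
```

Because the update is a product of `a` and `x`, gradients flow back into both the attention and the written value, which is exactly why the NTM uses it.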

However, we can use policy gradients to circumvent this problem (in theory), as done in [RL-NTM](http://arxiv.org/abs/1505.00521). We still predict an attention distribution `a`, but instead of doing the soft write we sample locations to write to: `i = sample(a); m[i] = x`. During training we would do this for a small batch of `i`, and in the end make whatever branch worked best more likely. The large computational advantage is that we now only have to read/write at a single location at test time. However, as pointed out in the paper this strategy is very difficult to get working because one must accidentally stumble by working algorithms through sampling. The current consensus is that PG works well only in settings where there are a few discrete choices so that one is not hopelessly sampling through huge search spaces.
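A rough numpy sketch of the sampled (hard) write, `i = sample(a); m[i] = x` (shapes and values are illustrative, not from the RL-NTM paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_write(m, a, x):
    # Hard write: sample one location from the attention distribution
    # and write only there -- O(D) work instead of O(N*D).
    # The sampling step is non-differentiable, so the location choice
    # would be trained with a REINFORCE-style policy gradient, roughly
    # reward * grad(log a[i]): make branches that worked out more likely.
    i = rng.choice(len(a), p=a)  # i = sample(a)
    m = m.copy()
    m[i] = x                     # m[i] = x
    return m, i

a = np.array([0.05, 0.85, 0.05, 0.05])  # peaky attention (assumed values)
x = np.array([1.0, 2.0, 3.0])
m, i = hard_write(np.zeros((4, 3)), a, x)
```

After the write, only row `i` of the memory is nonzero; at test time one could skip sampling entirely and take `i = argmax(a)`.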
