Simple Conclusion

Statement

Let $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ be the sequence of states and actions we get when following policy $\pi_\theta$, which is sampled from $\tau \sim P(\tau;\theta)$

Then the objective is

$$J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\right]$$

and we can calculate the gradient by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

Intuition

When $R(\tau)$ is high, increase the probability of the actions we took, and vice versa

Steps

  • Initialize random weights $\theta$
  • Repeat:
    • Collect trajectories $\tau^{(1)}, \dots, \tau^{(N)}$ and rewards $R(\tau^{(i)})$ using policy $\pi_\theta$
    • Compute $\hat{g} = \frac{1}{N}\sum_{i=1}^{N} R(\tau^{(i)}) \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
    • Gradient ascent on $\theta$: $\theta \leftarrow \theta + \alpha \hat{g}$
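The loop above can be sketched in plain Python on a toy problem. Everything here (the 2-armed bandit, its reward means, the step size and batch size) is made up for illustration; the policy is a softmax over per-arm logits $\theta$, for which $\nabla_\theta \log \pi_\theta(a)$ has the simple closed form $e_a - \pi_\theta$:

```python
import math
import random

random.seed(0)

# Illustrative toy environment: a 2-armed bandit (single state).
# A trajectory is a single action; R(tau) is the noisy arm reward.
TRUE_MEANS = [0.2, 0.8]          # arm 1 is the better arm (made-up numbers)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_arm(probs):
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a
    return len(probs) - 1

theta = [0.0, 0.0]               # initialize weights
alpha, N = 0.1, 50               # step size, trajectories per batch

for step in range(200):
    grad = [0.0, 0.0]
    for _ in range(N):
        probs = softmax(theta)
        a = sample_arm(probs)                       # collect trajectory
        R = TRUE_MEANS[a] + random.gauss(0, 0.1)    # and its reward
        # grad_theta log pi(a) for a softmax policy is (e_a - probs)
        for k in range(2):
            grad[k] += R * ((1.0 if k == a else 0.0) - probs[k]) / N
    # gradient ascent on theta
    theta = [t + alpha * g for t, g in zip(theta, grad)]

print(softmax(theta))   # probability mass should concentrate on the better arm
```

Note there is no baseline subtraction here; the vanilla estimator still works on this toy problem but has high variance in general.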

Derivation

Objective

We want to find an expression for $\nabla_\theta J(\theta)$ that can be computed effectively, i.e. estimated from sampled trajectories

Derivation

Revise Original Expression

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau)\right] = \nabla_\theta \int P(\tau;\theta)\, R(\tau)\, d\tau = \int \nabla_\theta P(\tau;\theta)\, R(\tau)\, d\tau = \int P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right]$$

The second to last "=" comes from the log-derivative trick:

$$\nabla_\theta \log P(\tau;\theta) = \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \quad\Longrightarrow\quad \nabla_\theta P(\tau;\theta) = P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)$$
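The log-derivative trick can be checked numerically on a toy distribution. Below, $X \sim \mathrm{Bernoulli}(\sigma(\theta))$ and $f$ plays the role of the reward; the particular values of $\theta$ and $f$ are made up for illustration. The exact gradient of $\mathbb{E}[f(X)]$ is compared with the score-function Monte Carlo estimate:

```python
import math
import random

random.seed(0)

# For X ~ Bernoulli(p) with p = sigmoid(theta), verify
#   grad_theta E[f(X)] = E[ f(X) * grad_theta log P(X; theta) ].
theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))      # sigmoid
f = {0: 1.0, 1: 3.0}                    # arbitrary "reward" per outcome

# Exact gradient: E[f] = (1-p) f(0) + p f(1), and dp/dtheta = p (1-p)
exact = (f[1] - f[0]) * p * (1 - p)

# Score-function estimate: for this parameterization,
# grad_theta log P(x; theta) = x - p
N = 200_000
est = 0.0
for _ in range(N):
    x = 1 if random.random() < p else 0
    est += f[x] * (x - p) / N

print(exact, est)   # the two values should agree closely
```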

Compute $\nabla_\theta \log P(\tau;\theta)$

The trajectory probability factorizes as

$$P(\tau;\theta) = \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

so

$$\log P(\tau;\theta) = \sum_{t=0}^{T} \log P(s_{t+1} \mid s_t, a_t) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)$$

The transition probability $P(s_{t+1} \mid s_t, a_t)$ is impossible to compute because the transition is decided by the environment, which we can’t backpropagate through

However, surprisingly, this term is eliminated when computing the gradient, because it does not depend on $\theta$:

$$\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
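The vanishing of the transition term can be demonstrated numerically. In the sketch below, a tiny hand-made MDP (all transition probabilities and the trajectory are illustrative inventions) lets us compute $\log P(\tau;\theta)$ exactly; a finite-difference gradient of the full log-probability matches the policy-only score $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$:

```python
import math

# Tiny illustrative MDP: 2 states, 2 actions.
# T[s][a] = probability of landing in state 1; it has NO theta in it,
# so it must drop out of grad_theta log P(tau; theta).
T = [[0.9, 0.2], [0.4, 0.7]]

def pi(theta, a, s):
    # Bernoulli policy: prob of action 1 in state s is sigmoid(theta[s])
    p1 = 1.0 / (1.0 + math.exp(-theta[s]))
    return p1 if a == 1 else 1.0 - p1

def log_traj_prob(theta, traj):
    # log P(tau; theta) = sum_t [ log pi(a_t|s_t) + log P(s_{t+1}|s_t,a_t) ]
    lp = 0.0
    for (s, a, s_next) in traj:
        lp += math.log(pi(theta, a, s))
        lp += math.log(T[s][a] if s_next == 1 else 1.0 - T[s][a])
    return lp

theta = [0.5, -0.3]
traj = [(0, 1, 1), (1, 0, 0), (0, 0, 1)]   # fixed (s, a, s') triples

# Finite-difference gradient of the FULL log-probability w.r.t. theta[0]
eps = 1e-6
full = (log_traj_prob([theta[0] + eps, theta[1]], traj)
        - log_traj_prob([theta[0] - eps, theta[1]], traj)) / (2 * eps)

# Policy-only score: sum_t grad log pi(a_t|s_t); for this parameterization
# grad_{theta[s]} log pi(a|s) = a - sigmoid(theta[s])
policy_only = sum(a - 1.0 / (1.0 + math.exp(-theta[s]))
                  for (s, a, _) in traj if s == 0)

print(full, policy_only)   # equal: the transition term carries no theta
```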

Conclusion

Putting the two derivations together we have:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\left[R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$