めもめも

The contents of this blog are my personal views and do not necessarily represent the positions, strategies, or opinions of my organization.

Reinforcement Learning 2nd Edition: Exercise Solutions (Chapter 6 - Chapter 8)

Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning series)

Chapter 6


Exercise 6.4

github.com


Exercise 6.9, 6.10

github.com

Exercise 6.14

github.com

Chapter 7


Exercise 7.2

The "TD error sum" algorithm is worse than TD(n) because it doesn't get the benefit of newer value estimation achieved in the later steps.

github.com
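As a rough illustration (separate from the notebook above), the sketch below uses made-up numbers with gamma = 1 and n = 2: the two error terms coincide while the estimates stay fixed, but once V(B) is updated online between the two steps, the n-step error keeps bootstrapping from the freshest estimates while the sum of TD errors does not.

```python
# Minimal numeric sketch: n-step TD error vs. sum of one-step TD errors
# when the value estimates change during the n steps.  All numbers are
# illustrative only.
gamma = 1.0

# Trajectory A -> B -> C with rewards 0 and 1 (n = 2, updating state A).
r1, r2 = 0.0, 1.0

V_old = {"A": 0.0, "B": 0.5, "C": 0.2}   # estimates at step t
V_new = {"A": 0.0, "B": 0.8, "C": 0.6}   # estimates after an online update at step t+1

# n-step TD error: bootstraps with the latest estimates available at update time.
n_step_error = r1 + gamma * r2 + gamma**2 * V_new["C"] - V_new["A"]

# Sum of TD errors: each delta uses the estimates current at its own step.
delta_t  = r1 + gamma * V_old["B"] - V_old["A"]   # computed with the older V
delta_t1 = r2 + gamma * V_new["C"] - V_new["B"]   # computed with the newer V
td_error_sum = delta_t + gamma * delta_t1

print(n_step_error)   # 1.6 -- uses the newest estimates throughout
print(td_error_sum)   # 1.3 -- the old V(B) and the new V(B) no longer cancel
```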

Exercise 7.3

github.com

Chapter 8

Exercise 8.1

Dyna-Q has an advantage over the n-step method because the learned model is reused in later episodes, whereas in the n-step method the experience along a path is used only for a single set of updates. Dyna-Q can also update values on previously visited paths using the model learned from later paths.

github.com
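A minimal tabular Dyna-Q sketch (not the author's notebook linked above), assuming a toy deterministic chain environment and illustrative hyperparameters. The point is that the model dictionary persists across episodes, so the planning loop keeps reusing transitions observed in earlier episodes.

```python
import random
from collections import defaultdict

N_STATES, ACTIONS = 6, (-1, +1)          # chain 0..5, move left or right
ALPHA, GAMMA, EPS, PLAN_STEPS = 0.1, 0.95, 0.1, 10

Q = defaultdict(float)                   # Q[(state, action)]
model = {}                               # model[(state, action)] = (reward, next_state)

def step(s, a):
    """Toy deterministic environment: reward 1 for reaching the right end."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return (1.0, s2) if s2 == N_STATES - 1 else (0.0, s2)

def epsilon_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(50):
    s = 0
    while s != N_STATES - 1:
        a = epsilon_greedy(s)
        r, s2 = step(s, a)                                        # (a) real experience
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])  # (b) direct RL
        model[(s, a)] = (r, s2)                                   # (c) model learning
        for _ in range(PLAN_STEPS):                               # (d) planning reuses the
            ps, pa = random.choice(list(model))                   #     model, including
            pr, ps2 = model[(ps, pa)]                             #     transitions seen in
            best = max(Q[(ps2, b)] for b in ACTIONS)              #     earlier episodes
            Q[(ps, pa)] += ALPHA * (pr + GAMMA * best - Q[(ps, pa)])
        s = s2

print(max(Q[(0, a)] for a in ACTIONS))   # value of the start state after learning
```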

Exercise 8.2

This is because Dyna-Q+ is more exploratory: the bonus reward κ√τ added during planning keeps encouraging state-action pairs that have not been tried for a long time. The size of the actual advantage, however, depends on the hyperparameters and the environment.
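For reference, a sketch of the piece that makes Dyna-Q+ more exploratory, assuming a tau table that records how many time steps have passed since each state-action pair was last tried in the real environment (kappa and the table names are illustrative):

```python
import math
import random

KAPPA = 1e-3   # exploration-bonus weight (illustrative value)

def planning_step(Q, model, tau, actions, alpha, gamma):
    # During planning, Dyna-Q+ adds a bonus to the simulated reward that grows
    # with the time tau since the pair was last tried for real, so long-untried
    # pairs look increasingly attractive.
    s, a = random.choice(list(model))
    r, s2 = model[(s, a)]
    r_plus = r + KAPPA * math.sqrt(tau[(s, a)])           # bonus reward
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r_plus + gamma * best - Q[(s, a)])
```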

Exercise 8.3

Since Dyna-Q+ is more exploratory, once the optimal path has been found, Dyna-Q shows less fluctuation (i.e., it keeps following the optimal path) than Dyna-Q+, which continues to deviate in search of exploration bonuses.

Exercise 8.4

In terms of performance, the alternate approach is worse than Dyna-Q+, because adding the bonus only at action selection means the "long unvisited" information is never written into the action values and therefore is not propagated to other cells through the planning updates.

github.com
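A sketch of the alternate approach from the exercise, assuming the same Q and tau tables as in the Dyna-Q+ snippet above. The bonus only influences which real action is selected and never enters a Q update, so planning cannot spread the "unvisited" information to neighboring states:

```python
import math

KAPPA = 1e-3   # same illustrative bonus weight as above

def select_action(Q, tau, state, actions):
    # Greedy with respect to Q(s, a) + kappa * sqrt(tau(s, a)); the bonus is
    # applied only here, so Q itself (and hence neighboring states) never
    # learns that a region has gone unvisited for a long time.
    return max(actions, key=lambda a: Q[(state, a)] + KAPPA * math.sqrt(tau[(state, a)]))
```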

Exercise 8.6

It strengthens the case for sample updates. With a highly skewed branching distribution, a few dominant successor states account for most of the expected value, so sample updates reach a good estimate with far fewer samples than evaluating all b successors in a full expected update.
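A small sketch of this point, assuming b = 100 successor states with one of them occurring 90% of the time (numbers are made up): a handful of sampled backups typically lands close to the full expectation, which would require evaluating all b successors.

```python
import random

random.seed(0)
b = 100
values = [random.random() for _ in range(b)]            # arbitrary successor values
probs = [0.9] + [0.1 / (b - 1)] * (b - 1)               # highly skewed distribution

true_expectation = sum(p * v for p, v in zip(probs, values))

samples = random.choices(values, weights=probs, k=10)   # only 10 sampled backups
sample_estimate = sum(samples) / len(samples)

print(true_expectation, sample_estimate)                # typically close after 10 samples
```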

Exercise 8.8

Uniform sampling works better in highly non-deterministic environments, i.e., when the branching factor b is large.

github.com