- Author: Richard S. Sutton, Andrew G. Barto
- Publisher: A Bradford Book
- Release date: 2018/11/13
- Format: Hardcover
The "TD error sum" algorithm is worse than TD(n) because it does not benefit from the newer value estimates obtained in later steps.
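A short derivation may make this concrete. It is a sketch that assumes the usual definitions: $\delta_k$ is the one-step TD error, $G_{t:t+n}$ is the n-step return, and $V_k$ is the value estimate held at time $k$.

```latex
% With a fixed value function V, the n-step error telescopes into a sum of TD errors:
G_{t:t+n} - V(S_t)
  = \sum_{k=t}^{t+n-1} \gamma^{k-t}
    \bigl[ R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \bigr]
  = \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k .

% When V changes from step to step, each TD error is computed from the estimate of its own time:
\delta_k = R_{k+1} + \gamma V_k(S_{k+1}) - V_k(S_k),

% whereas the n-step error bootstraps from the most recent estimate:
G_{t:t+n} - V_{t+n-1}(S_t)
  = \sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}
    + \gamma^{n} V_{t+n-1}(S_{t+n}) - V_{t+n-1}(S_t).
```

The two quantities agree only when the estimates do not change during the n steps; otherwise the sum of TD errors is built partly from older estimates.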
Dyna-Q has an advantage over TD(n) because its learned model can be reused in later episodes, whereas in TD(n) the experienced path is used for only a single set of updates. Dyna-Q can also update values along earlier paths using data learned on later paths.
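The reuse of the model can be illustrated with a minimal sketch of the planning step, assuming a deterministic tabular model. The names `q_update`, `plan`, and the tiny three-step chain are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def q_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    """One-step Q-learning update on a single (s, a, r, s2) transition."""
    best_next = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def plan(Q, model, actions, n_planning, rng, alpha=0.5, gamma=0.9):
    """Replay transitions stored in the model, as in the Dyna-Q planning loop."""
    seen = list(model.keys())
    for _ in range(n_planning):
        s, a = rng.choice(seen)   # pick a previously observed state-action pair
        r, s2 = model[(s, a)]     # deterministic model: remembered reward and next state
        q_update(Q, s, a, r, s2, actions, alpha, gamma)


# Hypothetical chain 0 -> 1 -> 2 -> 3 (terminal), with reward 1 on the last step.
actions = ["right"]
Q = defaultdict(float)
model = {
    (0, "right"): (0.0, 1),
    (1, "right"): (0.0, 2),
    (2, "right"): (1.0, 3),
}
plan(Q, model, actions, n_planning=200, rng=random.Random(0))
```

Each transition was observed only once, yet repeated planning updates carry the final reward back to state 0. An n-step method would need to experience the path again to obtain further updates.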
This is because Dyna-Q+ is more exploratory, although the actual advantage depends on the hyperparameters and the environment.
Because Dyna-Q+ is more exploratory, once the optimal path has been found Dyna-Q fluctuates less (i.e., it keeps following the optimal path) than Dyna-Q+.
In terms of performance, the alternate approach is worse than Dyna-Q+ because it does not propagate the "unvisited" information to other cells.
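The difference between the two approaches can be sketched as follows, assuming the exploration bonus takes the form κ√τ, where τ is the number of time steps since a state-action pair was last tried. The function names and the small example tables are hypothetical.

```python
import math


def planning_reward(r, tau, kappa=0.001):
    """Dyna-Q+ style: the bonus is added to the modeled reward during planning,
    so it enters the Q-values and is backed up to predecessor states."""
    return r + kappa * math.sqrt(tau)


def select_action(Q, tau, s, actions, kappa=0.001):
    """Alternate approach: the bonus only biases the greedy choice in state s;
    the Q-values themselves are left unchanged."""
    return max(actions, key=lambda a: Q[(s, a)] + kappa * math.sqrt(tau[(s, a)]))


# Hypothetical example: action "b" has a lower value but has not been tried for 100 steps.
Q = {(0, "a"): 0.5, (0, "b"): 0.0}
tau = {(0, "a"): 0, (0, "b"): 100}
chosen = select_action(Q, tau, 0, ["a", "b"], kappa=0.1)
bonus_reward = planning_reward(0.0, 100, kappa=0.1)
```

In the alternate approach the bonus matters only when the agent is already in the state where the untried action is available, so nothing draws the agent toward that state from far away.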
It strengthens the case for sample updates. With a highly skewed distribution, you can get a better estimate with fewer samples.
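A rough way to see this is to compare the variance of a single sampled backup under a uniform and a skewed next-state distribution. The sketch below uses made-up successor values and probabilities; the effect depends on those numbers, so it is an illustration rather than a general proof.

```python
def expected_value(probs, values):
    """Expected update target: the probability-weighted mean over successor states."""
    return sum(p * v for p, v in zip(probs, values))


def single_sample_variance(probs, values):
    """Variance of one sampled target; the error after n samples scales as variance / n."""
    mean = expected_value(probs, values)
    return sum(p * v * v for p, v in zip(probs, values)) - mean * mean


# Hypothetical values of four successor states.
values = [0.0, 1.0, 2.0, 3.0]
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]

var_uniform = single_sample_variance(uniform, values)
var_skewed = single_sample_variance(skewed, values)
```

In this example the skewed distribution has the smaller single-sample variance, so fewer samples are needed to reach a given accuracy, while an expected update still has to sweep all four successors.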
Uniform sampling works better in highly non-deterministic environments.