DreamerV3

class srl.algorithms.dreamer_v3.Config(batch_size: int = 32, memory_capacity: int = 100000, memory_warmup_size: int = 1000, memory_compress: bool = True, memory_compress_level: int = -1, observation_mode: str | ~srl.base.define.ObservationModes = ObservationModes.ENV, override_observation_type: ~srl.base.define.SpaceTypes = SpaceTypes.UNKNOWN, override_action_type: str | ~srl.base.define.RLBaseActTypes = <RLBaseActTypes.NONE: 1>, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: ~typing.Type[ExtendWorker] | None = None, parameter_path: str = '', memory_path: str = '', use_rl_processor: bool = True, processors: ~typing.List[RLProcessor] = <factory>, render_image_processors: ~typing.List[RLProcessor] = <factory>, enable_state_encode: bool = True, enable_action_decode: bool = True, enable_reward_encode: bool = True, enable_done_encode: bool = True, window_length: int = 1, render_image_window_length: int = 1, enable_sanitize: bool = True, enable_assertion: bool = False, rssm_deter_size: int = 4096, rssm_stoch_size: int = 32, rssm_classes: int = 32, rssm_hidden_units: int = 1024, rssm_use_norm_layer: bool = True, rssm_use_categorical_distribution: bool = True, rssm_activation: ~typing.Any = 'silu', rssm_unimix: float = 0.01, reward_type: str = 'twohot', reward_twohot_bins: int = 255, reward_twohot_low: int = -20, reward_twohot_high: int = 20, reward_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), cont_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), critic_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), actor_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), dense_act: ~typing.Any = 'silu', use_symlog: bool = True, encoder_decoder_mlp: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), encoder_decoder_dist: str = 'linear', cnn_depth: int = 96, cnn_blocks: int = 0, cnn_activation: ~typing.Any = 'silu', cnn_normalization_type: str = 'layer', cnn_resize_type: str = 'stride', cnn_resized_image_size: int = 4, cnn_use_sigmoid: bool = False, free_nats: float = 1.0, loss_scale_pred: float = 1.0, loss_scale_kl_dyn: float = 0.5, loss_scale_kl_rep: float = 0.1, warmup_world_model: int = 0, critic_target_update_interval: int = 0, critic_target_soft_update: float = 0.02, critic_type: str = 'twohot', critic_twohot_bins: int = 255, critic_twohot_low: int = -20, critic_twohot_high: int = 20, actor_discrete_type: str = 'categorical', actor_discrete_unimix: float = 0.01, actor_continuous_enable_normal_squashed: bool = True, horizon: int = 15, horizon_policy: str = 'actor', critic_estimation_method: str = 'h-return', horizon_ewa_disclam: float = 0.1, horizon_h_return: float = 0.95, discount: float = 0.997, enable_train_model: bool = True, enable_train_critic: bool = True, enable_train_actor: bool = True, batch_length: int = 64, lr_model: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 0.0001, lr_critic: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 3e-05, lr_actor: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 3e-05, actor_loss_type: str = 'dreamer_v3', actor_reinforce_rate: float = 0.0, entropy_rate: float = 0.0003, reinforce_baseline: str = 'v', epsilon: float = 0, clip_rewards: str = 'none')

Bases: <ExperienceReplayBuffer>, <RLConfigComponentInput>
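
A minimal usage sketch based on the class path and parameter names in the signature above (assumes the srl package is installed; the hyperparameter values here are arbitrary examples, not recommendations):

```python
from srl.algorithms.dreamer_v3 import Config

# Construct the config; any keyword from the signature above can be passed.
rl_config = Config(
    batch_size=16,
    batch_length=64,
    horizon=15,
)

# Fields can also be set after construction.
rl_config.critic_estimation_method = "h-return"
rl_config.discount = 0.997
```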

rssm_deter_size: int = 4096

Number of units of the deterministic transition state (internally, the GRU unit count)

rssm_stoch_size: int = 32

Number of units of the stochastic transition state

rssm_classes: int = 32

Number of classes of the stochastic transition (effective only when rssm_use_categorical_distribution=True)

rssm_hidden_units: int = 1024

Number of hidden units

rssm_use_norm_layer: bool = True

If True, a LayerNormalization layer is added

rssm_use_categorical_distribution: bool = True

If False, the stochastic transition is represented by a Gaussian distribution; if True, by a categorical distribution

rssm_activation: Any = 'silu'

RSSM activation function

rssm_unimix: float = 0.01

Minimum probability guaranteed to each class of the categorical distribution (effective only when rssm_use_categorical_distribution=True)
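
As a hedged illustration (not the library's internal code; `apply_unimix` is a hypothetical name), the unimix floor can be sketched as mixing the softmax probabilities with a uniform distribution:

```python
import numpy as np

def apply_unimix(logits: np.ndarray, unimix: float = 0.01) -> np.ndarray:
    """Mix the categorical distribution with a uniform one so that every
    class keeps at least unimix / num_classes probability."""
    # Numerically stable softmax
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    uniform = np.ones_like(probs) / probs.shape[-1]
    return (1.0 - unimix) * probs + unimix * uniform

probs = apply_unimix(np.array([10.0, 0.0, 0.0, 0.0]), unimix=0.01)
# every class now has probability >= 0.01 / 4
```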

reward_type: str = 'twohot'

Type of distribution used to learn the reward

Options:
  • "linear" -- trained with MSE (affected by use_symlog)

  • "normal" -- trained as a Gaussian distribution (not affected by use_symlog)

  • "normal_fixed_scale" -- trained as a Gaussian distribution with the variance fixed at 1 (not affected by use_symlog)

  • "twohot" -- trained with two-hot encoding (affected by use_symlog)
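
The two-hot idea can be sketched as follows (a minimal illustration, not the library's exact code; the bin construction mirrors the reward_twohot_low/high/bins defaults below):

```python
import numpy as np

def twohot_encode(x: float, bins: np.ndarray) -> np.ndarray:
    """Distribute a scalar over the two nearest bins so that their
    probability-weighted average reconstructs the value exactly."""
    x = float(np.clip(x, bins[0], bins[-1]))
    k = int(np.searchsorted(bins, x, side="right") - 1)
    k = min(k, len(bins) - 2)
    w = (x - bins[k]) / (bins[k + 1] - bins[k])  # weight on the upper bin
    enc = np.zeros(len(bins))
    enc[k] = 1.0 - w
    enc[k + 1] = w
    return enc

bins = np.linspace(-20.0, 20.0, 255)  # mirrors the twohot defaults
enc = twohot_encode(3.3, bins)
# the weighted average of the bins recovers the original value
```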

reward_twohot_bins: int = 255

Effective only when reward_type is "twohot": number of bins

reward_twohot_low: int = -20

Effective only when reward_type is "twohot": lower bound of the bins

reward_twohot_high: int = 20

Effective only when reward_type is "twohot": upper bound of the bins

reward_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the reward model

cont_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the continue model

critic_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the critic model

actor_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the actor model

dense_act: Any = 'silu'

Activation function of each layer

use_symlog: bool = True

Whether to use symlog encoding
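
symlog compresses large magnitudes while staying roughly linear near zero; a minimal sketch of the transform pair used in DreamerV3:

```python
import numpy as np

def symlog(x):
    """Symmetric log: sign-preserving, near-linear around zero,
    logarithmic for large magnitudes."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog."""
    return np.sign(x) * np.expm1(np.abs(x))

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
recovered = symexp(symlog(x))  # round-trips back to x
```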

encoder_decoder_mlp: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers used when the input is not an IMAGE

encoder_decoder_dist: str = 'linear'

Distribution of the decoder output layer

Options:
  • "linear" -- MSE

  • "normal" -- Gaussian distribution

cnn_depth: int = 96

[When the input is an IMAGE] Number of Conv2D units

cnn_blocks: int = 0

[When the input is an IMAGE] Number of ResBlocks

cnn_activation: Any = 'silu'

[When the input is an IMAGE] Activation function

cnn_normalization_type: str = 'layer'

[When the input is an IMAGE] Whether to add a normalization layer

Options:
  • "none" -- does nothing

  • "layer" -- adds a LayerNormalization layer

cnn_resize_type: str = 'stride'

[When the input is an IMAGE] Algorithm used to downscale the image

Options:
  • "stride" -- downscales with strided Conv2D

  • "stride3" -- downscales with stride-3 Conv2D

cnn_resized_image_size: int = 4

[When the input is an IMAGE] Image size after downscaling

cnn_use_sigmoid: bool = False

[When the input is an IMAGE] If True, the image output layer uses sigmoid; if False, it is linear

free_nats: float = 1.0

Free bits: KL values below this threshold are not penalized

loss_scale_pred: float = 1.0

Scale of the reconstruction (prediction) loss

loss_scale_kl_dyn: float = 0.5

Scale of the dynamics KL loss

loss_scale_kl_rep: float = 0.1

Scale of the representation KL loss
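
How these scales and free_nats might combine into the world-model loss (an illustrative sketch; `world_model_loss` is a hypothetical name, and the real composition lives in the library's trainer):

```python
def world_model_loss(pred_loss, kl_dyn, kl_rep,
                     free_nats=1.0,
                     loss_scale_pred=1.0,
                     loss_scale_kl_dyn=0.5,
                     loss_scale_kl_rep=0.1):
    """Combine the prediction and KL losses with their scales."""
    # Free bits: KL terms are floored at free_nats, so once a KL is
    # small enough it contributes no gradient.
    kl_dyn = max(kl_dyn, free_nats)
    kl_rep = max(kl_rep, free_nats)
    return (loss_scale_pred * pred_loss
            + loss_scale_kl_dyn * kl_dyn
            + loss_scale_kl_rep * kl_rep)

loss = world_model_loss(2.0, 0.5, 0.5)  # both KLs floored to 1.0
```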

warmup_world_model: int = 0

For this many initial training steps, only the world model is trained

critic_target_update_interval: int = 0

Update interval of the critic target network

critic_target_soft_update: float = 0.02

Soft-update coefficient (tau) of the critic target network
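
The soft (Polyak) update with tau = critic_target_soft_update can be sketched as (illustrative; `soft_update` is a hypothetical name operating on plain parameter lists):

```python
def soft_update(target_params, online_params, tau=0.02):
    """Polyak averaging: target <- (1 - tau) * target + tau * online,
    applied per parameter."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]

updated = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.02)
```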

critic_type: str = 'twohot'

Type of the critic model

Options:
  • "linear" -- trained with MSE (affected by use_symlog)

  • "normal" -- Gaussian distribution (not affected by use_symlog)

  • "normal_fixed_scale" -- Gaussian distribution with the variance fixed at 1 (not affected by use_symlog)

  • "twohot" -- two-hot categorical distribution (affected by use_symlog)

critic_twohot_bins: int = 255

Effective only when critic_type is "twohot": number of bins

critic_twohot_low: int = -20

Effective only when critic_type is "twohot": lower bound of the bins

critic_twohot_high: int = 20

Effective only when critic_type is "twohot": upper bound of the bins

actor_discrete_type: str = 'categorical'

Type of the discrete actor model

Options:
  • "categorical" -- categorical distribution

  • "gumbel_categorical" -- Gumbel-categorical distribution

actor_discrete_unimix: float = 0.01

Minimum probability guaranteed to each class of the categorical distribution (effective only when the action type is DISCRETE)

actor_continuous_enable_normal_squashed: bool = True

For continuous actions, whether to squash the Gaussian distribution into [-1, 1] with tanh
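
A minimal sketch of tanh-squashing a Gaussian sample into (-1, 1) (illustrative only; a full implementation also needs the log-probability correction for the tanh change of variables, omitted here):

```python
import math
import random

def sample_squashed_normal(mean: float, std: float) -> float:
    """Sample from Normal(mean, std) and squash with tanh so the
    action always lies in (-1, 1)."""
    z = random.gauss(mean, std)
    return math.tanh(z)

action = sample_squashed_normal(0.0, 1.0)
```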

horizon: int = 15

Number of horizon (imagination) steps

horizon_policy: str = 'actor'

"actor" or "random"; "random" is for debugging

critic_estimation_method: str = 'h-return'

How values are estimated over the horizon

Options:
  • "simple" -- plain sum

  • "discount" -- discounted return

  • "ewa" -- exponentially weighted average

  • "h-return" -- λ-return
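
The "h-return" option is the λ-return computed backward over the imagined horizon. A sketch under the assumption that next_values[t] is the critic's value of the state reached after step t (function name hypothetical):

```python
def lambda_return(rewards, next_values, continues,
                  discount=0.997, lam=0.95):
    """R_t = r_t + discount * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1}),
    bootstrapped with the last value."""
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + discount * continues[t] * (
            (1.0 - lam) * next_values[t] + lam * next_return)
        next_return = returns[t]
    return returns

# With discount=1 and lam=1 this reduces to the undiscounted Monte Carlo sum.
rs = lambda_return([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], discount=1.0, lam=1.0)
```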

horizon_ewa_disclam: float = 0.1

EWA coefficient; smaller values weight recent values more heavily (effective only with "ewa")

horizon_h_return: float = 0.95

λ-return coefficient (effective only with "h-return")

discount: float = 0.997

Discount factor

enable_train_model: bool = True

dynamics model training flag

enable_train_critic: bool = True

critic model training flag

enable_train_actor: bool = True

actor model training flag

batch_length: int = 64

Sequence length (number of time steps) of each batch sample

lr_model: float | SchedulerConfig = 0.0001

<Scheduler> dynamics model learning rate

lr_critic: float | SchedulerConfig = 3e-05

<Scheduler> critic model learning rate

lr_actor: float | SchedulerConfig = 3e-05

<Scheduler> actor model learning rate

actor_loss_type: str = 'dreamer_v3'

How the actor loss is computed

Options:
  • "dreamer_v1" -- maximize V

  • "dreamer_v2" -- maximize V plus entropy

  • "dreamer_v3" -- dreamer_v2 plus percentile-based normalization
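
The percentile normalization in the "dreamer_v3" option can be sketched as dividing returns by the 5th-95th percentile range, while never amplifying small returns; this sketch omits the exponential moving average of the percentiles that a full implementation would track (`normalize_returns` is a hypothetical name):

```python
import numpy as np

def normalize_returns(returns: np.ndarray) -> np.ndarray:
    """Scale returns by the 5th-95th percentile range, with the divisor
    floored at 1 so small returns are never amplified."""
    scale = np.percentile(returns, 95) - np.percentile(returns, 5)
    return returns / max(scale, 1.0)

out = normalize_returns(np.linspace(0.0, 100.0, 101))
```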

actor_reinforce_rate: float = 0.0

Mixing ratio between REINFORCE and dynamics backpropagation when the action type is CONTINUOUS

entropy_rate: float = 0.0003

Entropy coefficient

reinforce_baseline: str = 'v'

Baseline used for the REINFORCE loss

Options:
  • "v" -- subtracts V as the baseline

  • other -- no baseline

epsilon: float = 0

ε-greedy on actions (for debugging)

clip_rewards: str = 'none'

Reward preprocessing

Options:
  • "none" -- no preprocessing

  • "tanh" -- applies tanh