DreamerV3
- class srl.algorithms.dreamer_v3.Config(batch_size: int = 32, memory_capacity: int = 100000, memory_warmup_size: int = 1000, memory_compress: bool = True, memory_compress_level: int = -1, observation_mode: str | ~srl.base.define.ObservationModes = ObservationModes.ENV, override_observation_type: ~srl.base.define.SpaceTypes = SpaceTypes.UNKNOWN, override_action_type: str | ~srl.base.define.RLBaseActTypes = <RLBaseActTypes.NONE: 1>, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: ~typing.Type[ExtendWorker] | None = None, parameter_path: str = '', memory_path: str = '', use_rl_processor: bool = True, processors: ~typing.List[RLProcessor] = <factory>, render_image_processors: ~typing.List[RLProcessor] = <factory>, enable_state_encode: bool = True, enable_action_decode: bool = True, enable_reward_encode: bool = True, enable_done_encode: bool = True, window_length: int = 1, render_image_window_length: int = 1, enable_sanitize: bool = True, enable_assertion: bool = False, rssm_deter_size: int = 4096, rssm_stoch_size: int = 32, rssm_classes: int = 32, rssm_hidden_units: int = 1024, rssm_use_norm_layer: bool = True, rssm_use_categorical_distribution: bool = True, rssm_activation: ~typing.Any = 'silu', rssm_unimix: float = 0.01, reward_type: str = 'twohot', reward_twohot_bins: int = 255, reward_twohot_low: int = -20, reward_twohot_high: int = 20, reward_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), cont_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), critic_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), actor_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), dense_act: ~typing.Any = 'silu', use_symlog: bool = True, encoder_decoder_mlp: ~typing.Tuple[int, ...] 
= (1024, 1024, 1024, 1024), encoder_decoder_dist: str = 'linear', cnn_depth: int = 96, cnn_blocks: int = 0, cnn_activation: ~typing.Any = 'silu', cnn_normalization_type: str = 'layer', cnn_resize_type: str = 'stride', cnn_resized_image_size: int = 4, cnn_use_sigmoid: bool = False, free_nats: float = 1.0, loss_scale_pred: float = 1.0, loss_scale_kl_dyn: float = 0.5, loss_scale_kl_rep: float = 0.1, warmup_world_model: int = 0, critic_target_update_interval: int = 0, critic_target_soft_update: float = 0.02, critic_type: str = 'twohot', critic_twohot_bins: int = 255, critic_twohot_low: int = -20, critic_twohot_high: int = 20, actor_discrete_type: str = 'categorical', actor_discrete_unimix: float = 0.01, actor_continuous_enable_normal_squashed: bool = True, horizon: int = 15, horizon_policy: str = 'actor', critic_estimation_method: str = 'h-return', horizon_ewa_disclam: float = 0.1, horizon_h_return: float = 0.95, discount: float = 0.997, enable_train_model: bool = True, enable_train_critic: bool = True, enable_train_actor: bool = True, batch_length: int = 64, lr_model: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 0.0001, lr_critic: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 3e-05, lr_actor: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 3e-05, actor_loss_type: str = 'dreamer_v3', actor_reinforce_rate: float = 0.0, entropy_rate: float = 0.0003, reinforce_baseline: str = 'v', epsilon: float = 0, clip_rewards: str = 'none')
<ExperienceReplayBuffer> <RLConfigComponentInput>
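A minimal usage sketch. The field names below come from the Config signature above; the `srl.Runner` call and the "Grid" environment name follow the library's typical quickstart and are assumptions that may differ by version:

```python
import srl
from srl.algorithms import dreamer_v3

# Field names are taken from the Config signature above.
rl_config = dreamer_v3.Config(
    batch_size=16,
    batch_length=64,
    horizon=15,
    rssm_deter_size=512,  # smaller world model for a quick experiment
)

# Assumed runner API; check the srl documentation for your version.
runner = srl.Runner("Grid", rl_config)
runner.train(max_train_count=10_000)
```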
- rssm_deter_size: int = 4096
Number of units in the deterministic transition (internally, the GRU hidden size)
- rssm_stoch_size: int = 32
Number of units in the stochastic transition
- rssm_classes: int = 32
Number of classes in the stochastic transition (effective when rssm_use_categorical_distribution=True)
- rssm_hidden_units: int = 1024
Number of hidden-state units
- rssm_use_norm_layer: bool = True
If True, a LayerNormalization layer is added
- rssm_use_categorical_distribution: bool = True
If False, the stochastic transition is represented as a Gaussian distribution; if True, as a categorical distribution
- rssm_activation: Any = 'silu'
Activation function used in the RSSM
- rssm_unimix: float = 0.01
Minimum probability guaranteed to each class of the categorical distribution (effective when rssm_use_categorical_distribution=True)
- reward_type: str = 'twohot'
Type of distribution used to learn the reward
- Parameters:
"linear" -- trained with MSE (affected by use_symlog)
"normal" -- trained as a Gaussian distribution (not affected by use_symlog)
"normal_fixed_scale" -- trained as a Gaussian distribution with the variance fixed at 1 (not affected by use_symlog)
"twohot" -- trained with two-hot encoding (affected by use_symlog)
- reward_twohot_bins: int = 255
Effective only when reward_type is "twohot"; number of bins
- reward_twohot_low: int = -20
Effective only when reward_type is "twohot"; lower bound of the bins
- reward_twohot_high: int = 20
Effective only when reward_type is "twohot"; upper bound of the bins
- reward_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)
Hidden layer sizes of the reward model
- cont_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)
Hidden layer sizes of the continue model
- critic_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)
Hidden layer sizes of the critic model
- actor_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)
Hidden layer sizes of the actor model
- dense_act: Any = 'silu'
Activation function for each dense layer
- use_symlog: bool = True
Whether to use the symlog transform
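With use_symlog=True, regression targets are squashed with the symlog transform and predictions are mapped back with its inverse (symexp), which keeps small values near-linear while compressing large magnitudes. A sketch of the transform pair (not the library's code):

```python
import math

def symlog(x: float) -> float:
    # Sign-preserving log: sign(x) * log(1 + |x|); near-linear around 0.
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x: float) -> float:
    # Inverse of symlog: sign(x) * (exp(|x|) - 1).
    return math.copysign(math.expm1(abs(x)), x)

roundtrip = symexp(symlog(123.4))  # recovers the original value
```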
- encoder_decoder_mlp: Tuple[int, ...] = (1024, 1024, 1024, 1024)
Hidden layers used when the input is not an IMAGE
- encoder_decoder_dist: str = 'linear'
Distribution of the decoder output layer
- Parameters:
"linear" -- MSE
"normal" -- normal distribution
- cnn_depth: int = 96
[When the input is IMAGE] Number of Conv2D filters
- cnn_blocks: int = 0
[When the input is IMAGE] Number of ResBlocks
- cnn_activation: Any = 'silu'
[When the input is IMAGE] Activation function
- cnn_normalization_type: str = 'layer'
[When the input is IMAGE] Whether to add a normalization layer
- Parameters:
"none" -- no normalization
"layer" -- a LayerNormalization layer is added
- cnn_resize_type: str = 'stride'
[When the input is IMAGE] Algorithm used to downscale the image
- Parameters:
"stride" -- downscales with strided Conv2D
"stride3" -- downscales with Conv2D using a stride of 3
- cnn_resized_image_size: int = 4
[When the input is IMAGE] Image size after downscaling
- cnn_use_sigmoid: bool = False
[When the input is IMAGE] If True, the image output layer uses sigmoid; if False, it is linear
- free_nats: float = 1.0
Free-nats threshold for the KL loss
- loss_scale_pred: float = 1.0
Scale of the reconstruction (prediction) loss
- loss_scale_kl_dyn: float = 0.5
Scale of the dynamics KL loss
- loss_scale_kl_rep: float = 0.1
Scale of the representation KL loss
- warmup_world_model: int = 0
Number of initial training steps during which only the world model is trained
- critic_target_update_interval: int = 0
Update interval of the critic target network
- critic_target_soft_update: float = 0.02
Tau for the critic target soft update
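critic_target_soft_update is the Polyak coefficient tau: on each update, the target weights move a fraction tau toward the online critic's weights. A sketch of the rule, with plain lists standing in for weight tensors:

```python
def soft_update(target, online, tau):
    # target <- tau * online + (1 - tau) * target, element-wise
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, 2.0]
target = soft_update(target, online, tau=0.02)  # moves 2% toward online
```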
- critic_type: str = 'twohot'
Type of distribution used by the critic model
- Parameters:
"linear" -- trained with MSE (affected by use_symlog)
"normal" -- normal distribution (not affected by use_symlog)
"normal_fixed_scale" -- normal distribution with the variance fixed at 1 (not affected by use_symlog)
"twohot" -- two-hot categorical distribution (affected by use_symlog)
- critic_twohot_bins: int = 255
Effective only when critic_type is "twohot"; number of bins
- critic_twohot_low: int = -20
Effective only when critic_type is "twohot"; lower bound of the bins
- critic_twohot_high: int = 20
Effective only when critic_type is "twohot"; upper bound of the bins
- actor_discrete_type: str = 'categorical'
Distribution type of the discrete actor
- Parameters:
"categorical" -- categorical distribution
"gumbel_categorical" -- Gumbel-categorical distribution
- actor_discrete_unimix: float = 0.01
Minimum probability guaranteed to each class of the categorical distribution (effective only when the action type is DISCRETE)
- actor_continuous_enable_normal_squashed: bool = True
For continuous actions, whether to squash the normal distribution into [-1, 1] with tanh
- horizon: int = 15
Number of steps in the imagination horizon
- horizon_policy: str = 'actor'
"actor" or "random" ("random" is for debugging)
- critic_estimation_method: str = 'h-return'
Method for estimating the value over the horizon
- Parameters:
"simple" -- simple sum
"discount" -- discounted reward
"ewa" -- exponentially weighted average
"h-return" -- λ-return
- horizon_ewa_disclam: float = 0.1
EWA coefficient; smaller values weight recent values more (effective only with "ewa")
- horizon_h_return: float = 0.95
λ-return coefficient (effective only with "h-return")
- discount: float = 0.997
Discount rate
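With critic_estimation_method="h-return", values over the imagined horizon are combined with the λ-return recursion, where horizon_h_return plays the role of λ and discount is γ (cont stands for the predicted continue probability). A minimal sketch (not the library's implementation), computed backwards from the bootstrap value:

```python
def lambda_return(rewards, values, cont, gamma, lam):
    """λ-return over an imagined horizon.

    rewards[t], cont[t]: per-step predictions for t = 0..H-1.
    values: per-step value predictions, with one extra bootstrap entry at the end.
    """
    H = len(rewards)
    ret = values[-1]  # bootstrap from the final value estimate
    out = [0.0] * H
    for t in reversed(range(H)):
        # Blend the one-step target with the longer λ-return tail.
        ret = rewards[t] + gamma * cont[t] * ((1 - lam) * values[t + 1] + lam * ret)
        out[t] = ret
    return out

# With lam=1 this reduces to the full discounted return to the bootstrap value.
rets = lambda_return([1.0, 1.0], [0.0, 0.0, 10.0], [1.0, 1.0], gamma=0.5, lam=1.0)
# rets == [4.0, 6.0]
```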
- enable_train_model: bool = True
Whether to train the dynamics model
- enable_train_critic: bool = True
Whether to train the critic model
- enable_train_actor: bool = True
Whether to train the actor model
- batch_length: int = 64
Sequence length of each training batch
- lr_model: float | SchedulerConfig = 0.0001
<Scheduler> dynamics model learning rate
- lr_critic: float | SchedulerConfig = 3e-05
<Scheduler> critic model learning rate
- lr_actor: float | SchedulerConfig = 3e-05
<Scheduler> actor model learning rate
- actor_loss_type: str = 'dreamer_v3'
Method for computing the actor loss
- Parameters:
"dreamer_v1" -- maximize V
"dreamer_v2" -- maximize V plus an entropy bonus
"dreamer_v3" -- the v2 objective plus percentile-based normalization
- actor_reinforce_rate: float = 0.0
Mixing ratio between REINFORCE and dynamics backprop when the action type is CONTINUOUS
- entropy_rate: float = 0.0003
Entropy bonus coefficient
- reinforce_baseline: str = 'v'
Baseline used for the REINFORCE term
- Parameters:
"v" -- uses -V as the baseline
other -- no baseline
- epsilon: float = 0
ε for ε-greedy action selection (for debugging)
- clip_rewards: str = 'none'
Reward preprocessing
- Parameters:
"none" -- none
"tanh" -- tanh