DreamerV3

class srl.algorithms.dreamer_v3.Config(batch_size: int = 32, memory_capacity: int = 100000, memory_warmup_size: int = 1000, memory_compress: bool = True, memory_compress_level: int = -1, observation_mode: str | ~srl.base.define.ObservationModes = ObservationModes.ENV, override_observation_type: ~srl.base.define.SpaceTypes = SpaceTypes.UNKNOWN, override_action_type: str | ~srl.base.define.RLBaseActTypes = <RLBaseActTypes.NONE: 1>, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: ~typing.Type[ExtendWorker] | None = None, parameter_path: str = '', memory_path: str = '', use_rl_processor: bool = True, processors: ~typing.List[RLProcessor] = <factory>, render_image_processors: ~typing.List[RLProcessor] = <factory>, enable_state_encode: bool = True, enable_action_decode: bool = True, enable_reward_encode: bool = True, enable_done_encode: bool = True, window_length: int = 1, render_image_window_length: int = 1, enable_sanitize: bool = True, enable_assertion: bool = False, rssm_deter_size: int = 4096, rssm_stoch_size: int = 32, rssm_classes: int = 32, rssm_hidden_units: int = 1024, rssm_use_norm_layer: bool = True, rssm_use_categorical_distribution: bool = True, rssm_activation: ~typing.Any = 'silu', rssm_unimix: float = 0.01, reward_type: str = 'twohot', reward_twohot_bins: int = 255, reward_twohot_low: int = -20, reward_twohot_high: int = 20, reward_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), cont_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), critic_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), actor_layer_sizes: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), dense_act: ~typing.Any = 'silu', use_symlog: bool = True, encoder_decoder_mlp: ~typing.Tuple[int, ...] = (1024, 1024, 1024, 1024), encoder_decoder_dist: str = 'linear', cnn_depth: int = 96, cnn_blocks: int = 0, cnn_activation: ~typing.Any = 'silu', cnn_normalization_type: str = 'layer', cnn_resize_type: str = 'stride', cnn_resized_image_size: int = 4, cnn_use_sigmoid: bool = False, free_nats: float = 1.0, loss_scale_pred: float = 1.0, loss_scale_kl_dyn: float = 0.5, loss_scale_kl_rep: float = 0.1, warmup_world_model: int = 0, critic_target_update_interval: int = 0, critic_target_soft_update: float = 0.02, critic_type: str = 'twohot', critic_twohot_bins: int = 255, critic_twohot_low: int = -20, critic_twohot_high: int = 20, actor_discrete_type: str = 'categorical', actor_discrete_unimix: float = 0.01, actor_continuous_enable_normal_squashed: bool = True, horizon: int = 15, horizon_policy: str = 'actor', critic_estimation_method: str = 'h-return', horizon_ewa_disclam: float = 0.1, horizon_h_return: float = 0.95, discount: float = 0.997, enable_train_model: bool = True, enable_train_critic: bool = True, enable_train_actor: bool = True, batch_length: int = 64, lr_model: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 0.0001, lr_critic: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 3e-05, lr_actor: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 3e-05, actor_loss_type: str = 'dreamer_v3', actor_reinforce_rate: float = 0.0, entropy_rate: float = 0.0003, reinforce_baseline: str = 'v', epsilon: float = 0, clip_rewards: str = 'none')

Bases: <ExperienceReplayBuffer>, <RLConfigComponentInput>
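
A minimal usage sketch based on the class path and parameter names in the signature above (assumes the srl package is installed; the hyperparameter values here are arbitrary examples, not recommendations):

```python
from srl.algorithms.dreamer_v3 import Config

# Construct the config; any keyword from the signature above can be passed.
rl_config = Config(
    batch_size=16,
    batch_length=64,
    horizon=15,
)

# Fields can also be set after construction.
rl_config.critic_estimation_method = "h-return"
rl_config.discount = 0.997
```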

rssm_deter_size: int = 4096

Number of units of the deterministic transition state (internally, the GRU unit count)

rssm_stoch_size: int = 32

Number of units of the stochastic transition state

rssm_classes: int = 32

Number of classes of the stochastic transition (effective only when rssm_use_categorical_distribution=True)

rssm_hidden_units: int = 1024

Number of hidden units

rssm_use_norm_layer: bool = True

If True, a LayerNormalization layer is added

rssm_use_categorical_distribution: bool = True

If False, the stochastic transition is represented by a Gaussian distribution; if True, by a categorical distribution

rssm_activation: Any = 'silu'

RSSM activation function

rssm_unimix: float = 0.01

Minimum probability guaranteed to each class of the categorical distribution (effective only when rssm_use_categorical_distribution=True)
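
As a hedged illustration (not the library's internal code; `apply_unimix` is a hypothetical name), the unimix floor can be sketched as mixing the softmax probabilities with a uniform distribution:

```python
import numpy as np

def apply_unimix(logits: np.ndarray, unimix: float = 0.01) -> np.ndarray:
    """Mix the categorical distribution with a uniform one so that every
    class keeps at least unimix / num_classes probability."""
    # Numerically stable softmax
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    uniform = np.ones_like(probs) / probs.shape[-1]
    return (1.0 - unimix) * probs + unimix * uniform

probs = apply_unimix(np.array([10.0, 0.0, 0.0, 0.0]), unimix=0.01)
# every class now has probability >= 0.01 / 4
```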

reward_type: str = 'twohot'

Type of distribution used to learn the reward

Options:
  • "linear" -- trained with MSE (affected by use_symlog)

  • "normal" -- trained as a Gaussian distribution (not affected by use_symlog)

  • "normal_fixed_scale" -- trained as a Gaussian distribution with the variance fixed at 1 (not affected by use_symlog)

  • "twohot" -- trained with two-hot encoding (affected by use_symlog)
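
The two-hot idea can be sketched as follows (a minimal illustration, not the library's exact code; the bin construction mirrors the reward_twohot_low/high/bins defaults below):

```python
import numpy as np

def twohot_encode(x: float, bins: np.ndarray) -> np.ndarray:
    """Distribute a scalar over the two nearest bins so that their
    probability-weighted average reconstructs the value exactly."""
    x = float(np.clip(x, bins[0], bins[-1]))
    k = int(np.searchsorted(bins, x, side="right") - 1)
    k = min(k, len(bins) - 2)
    w = (x - bins[k]) / (bins[k + 1] - bins[k])  # weight on the upper bin
    enc = np.zeros(len(bins))
    enc[k] = 1.0 - w
    enc[k + 1] = w
    return enc

bins = np.linspace(-20.0, 20.0, 255)  # mirrors the twohot defaults
enc = twohot_encode(3.3, bins)
# the weighted average of the bins recovers the original value
```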

reward_twohot_bins: int = 255

Effective only when reward_type is "twohot": number of bins

reward_twohot_low: int = -20

Effective only when reward_type is "twohot": lower bound of the bins

reward_twohot_high: int = 20

Effective only when reward_type is "twohot": upper bound of the bins

reward_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the reward model

cont_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the continue model

critic_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the critic model

actor_layer_sizes: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers of the actor model

dense_act: Any = 'silu'

Activation function of each layer

use_symlog: bool = True

Whether to use symlog encoding
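
symlog compresses large magnitudes while staying roughly linear near zero; a minimal sketch of the transform pair used in DreamerV3:

```python
import numpy as np

def symlog(x):
    """Symmetric log: sign-preserving, near-linear around zero,
    logarithmic for large magnitudes."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog."""
    return np.sign(x) * np.expm1(np.abs(x))

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
recovered = symexp(symlog(x))  # round-trips back to x
```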

encoder_decoder_mlp: Tuple[int, ...] = (1024, 1024, 1024, 1024)

Hidden layers used when the input is not an IMAGE

encoder_decoder_dist: str = 'linear'

Distribution of the decoder output layer

Options:
  • "linear" -- MSE

  • "normal" -- Gaussian distribution

cnn_depth: int = 96

[When the input is an IMAGE] Number of Conv2D units

cnn_blocks: int = 0

[When the input is an IMAGE] Number of ResBlocks

cnn_activation: Any = 'silu'

[When the input is an IMAGE] Activation function

cnn_normalization_type: str = 'layer'

[When the input is an IMAGE] Whether to add a normalization layer

Options:
  • "none" -- does nothing

  • "layer" -- adds a LayerNormalization layer

cnn_resize_type: str = 'stride'

[When the input is an IMAGE] Algorithm used to downscale the image

Options:
  • "stride" -- downscales with strided Conv2D

  • "stride3" -- downscales with stride-3 Conv2D

cnn_resized_image_size: int = 4

[When the input is an IMAGE] Image size after downscaling

cnn_use_sigmoid: bool = False

[When the input is an IMAGE] If True, the image output layer uses sigmoid; if False, it is linear

free_nats: float = 1.0

Free bits: KL values below this threshold are not penalized

loss_scale_pred: float = 1.0

Scale of the reconstruction (prediction) loss

loss_scale_kl_dyn: float = 0.5

Scale of the dynamics KL loss

loss_scale_kl_rep: float = 0.1

Scale of the representation KL loss
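
How these scales and free_nats might combine into the world-model loss (an illustrative sketch; `world_model_loss` is a hypothetical name, and the real composition lives in the library's trainer):

```python
def world_model_loss(pred_loss, kl_dyn, kl_rep,
                     free_nats=1.0,
                     loss_scale_pred=1.0,
                     loss_scale_kl_dyn=0.5,
                     loss_scale_kl_rep=0.1):
    """Combine the prediction and KL losses with their scales."""
    # Free bits: KL terms are floored at free_nats, so once a KL is
    # small enough it contributes no gradient.
    kl_dyn = max(kl_dyn, free_nats)
    kl_rep = max(kl_rep, free_nats)
    return (loss_scale_pred * pred_loss
            + loss_scale_kl_dyn * kl_dyn
            + loss_scale_kl_rep * kl_rep)

loss = world_model_loss(2.0, 0.5, 0.5)  # both KLs floored to 1.0
```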

warmup_world_model: int = 0

For this many initial training steps, only the world model is trained

critic_target_update_interval: int = 0

Update interval of the critic target network

critic_target_soft_update: float = 0.02

Soft-update coefficient (tau) of the critic target network
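
The soft (Polyak) update with tau = critic_target_soft_update can be sketched as (illustrative; `soft_update` is a hypothetical name operating on plain parameter lists):

```python
def soft_update(target_params, online_params, tau=0.02):
    """Polyak averaging: target <- (1 - tau) * target + tau * online,
    applied per parameter."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]

updated = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.02)
```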

critic_type: str = 'twohot'

Type of the critic model

Options:
  • "linear" -- trained with MSE (affected by use_symlog)

  • "normal" -- Gaussian distribution (not affected by use_symlog)

  • "normal_fixed_scale" -- Gaussian distribution with the variance fixed at 1 (not affected by use_symlog)

  • "twohot" -- two-hot categorical distribution (affected by use_symlog)

critic_twohot_bins: int = 255

Effective only when critic_type is "twohot": number of bins

critic_twohot_low: int = -20

Effective only when critic_type is "twohot": lower bound of the bins

critic_twohot_high: int = 20

Effective only when critic_type is "twohot": upper bound of the bins

actor_discrete_type: str = 'categorical'

Type of the discrete actor model

Options:
  • "categorical" -- categorical distribution

  • "gumbel_categorical" -- Gumbel-categorical distribution

actor_discrete_unimix: float = 0.01

Minimum probability guaranteed to each class of the categorical distribution (effective only when the action type is DISCRETE)

actor_continuous_enable_normal_squashed: bool = True

For continuous actions, whether to squash the Gaussian distribution into [-1, 1] with tanh
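
A minimal sketch of tanh-squashing a Gaussian sample into (-1, 1) (illustrative only; a full implementation also needs the log-probability correction for the tanh change of variables, omitted here):

```python
import math
import random

def sample_squashed_normal(mean: float, std: float) -> float:
    """Sample from Normal(mean, std) and squash with tanh so the
    action always lies in (-1, 1)."""
    z = random.gauss(mean, std)
    return math.tanh(z)

action = sample_squashed_normal(0.0, 1.0)
```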

horizon: int = 15

Number of horizon (imagination) steps

horizon_policy: str = 'actor'

"actor" or "random"; "random" is for debugging

critic_estimation_method: str = 'h-return'

How values are estimated over the horizon

Options:
  • "simple" -- plain sum

  • "discount" -- discounted return

  • "ewa" -- exponentially weighted average

  • "h-return" -- λ-return
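
The "h-return" option is the λ-return computed backward over the imagined horizon. A sketch under the assumption that next_values[t] is the critic's value of the state reached after step t (function name hypothetical):

```python
def lambda_return(rewards, next_values, continues,
                  discount=0.997, lam=0.95):
    """R_t = r_t + discount * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1}),
    bootstrapped with the last value."""
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + discount * continues[t] * (
            (1.0 - lam) * next_values[t] + lam * next_return)
        next_return = returns[t]
    return returns

# With discount=1 and lam=1 this reduces to the undiscounted Monte Carlo sum.
rs = lambda_return([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], discount=1.0, lam=1.0)
```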

horizon_ewa_disclam: float = 0.1

EWA coefficient; smaller values weight recent values more heavily (effective only with "ewa")

horizon_h_return: float = 0.95

λ-return coefficient (effective only with "h-return")

discount: float = 0.997

Discount factor

enable_train_model: bool = True

dynamics model training flag

enable_train_critic: bool = True

critic model training flag

enable_train_actor: bool = True

actor model training flag

batch_length: int = 64

Sequence length (number of time steps) of each batch sample

lr_model: float | SchedulerConfig = 0.0001

<Scheduler> dynamics model learning rate

lr_critic: float | SchedulerConfig = 3e-05

<Scheduler> critic model learning rate

lr_actor: float | SchedulerConfig = 3e-05

<Scheduler> actor model learning rate

actor_loss_type: str = 'dreamer_v3'

How the actor loss is computed

Options:
  • "dreamer_v1" -- maximize V

  • "dreamer_v2" -- maximize V plus entropy

  • "dreamer_v3" -- dreamer_v2 plus percentile-based normalization
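
The percentile normalization in the "dreamer_v3" option can be sketched as dividing returns by the 5th-95th percentile range, while never amplifying small returns; this sketch omits the exponential moving average of the percentiles that a full implementation would track (`normalize_returns` is a hypothetical name):

```python
import numpy as np

def normalize_returns(returns: np.ndarray) -> np.ndarray:
    """Scale returns by the 5th-95th percentile range, with the divisor
    floored at 1 so small returns are never amplified."""
    scale = np.percentile(returns, 95) - np.percentile(returns, 5)
    return returns / max(scale, 1.0)

out = normalize_returns(np.linspace(0.0, 100.0, 101))
```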

actor_reinforce_rate: float = 0.0

Mixing ratio between REINFORCE and dynamics backpropagation when the action type is CONTINUOUS

entropy_rate: float = 0.0003

Entropy coefficient

reinforce_baseline: str = 'v'

Baseline used for the REINFORCE loss

Options:
  • "v" -- subtracts V as the baseline

  • other -- no baseline

epsilon: float = 0

ε-greedy on actions (for debugging)

clip_rewards: str = 'none'

Reward preprocessing

Options:
  • "none" -- no preprocessing

  • "tanh" -- applies tanh