SAC (Soft Actor-Critic)
- class srl.algorithms.sac.Config(observation_mode: Literal['', 'render_image'] = '', override_env_observation_type: srl.base.define.SpaceTypes = <SpaceTypes.UNKNOWN: 0>, override_observation_type: Union[str, srl.base.define.RLBaseTypes] = <RLBaseTypes.NONE: 1>, override_action_type: Union[str, srl.base.define.RLBaseTypes] = <RLBaseTypes.NONE: 1>, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: Optional[Type[ForwardRef('ExtendWorker')]] = None, processors: List[ForwardRef('RLProcessor')] = <factory>, render_image_processors: List[ForwardRef('RLProcessor')] = <factory>, enable_rl_processors: bool = True, enable_state_encode: bool = True, enable_action_decode: bool = True, window_length: int = 1, render_image_window_length: int = 1, render_last_step: bool = True, render_rl_image: bool = True, render_rl_image_size: Tuple[int, int] = (128, 128), enable_sanitize: bool = True, enable_assertion: bool = False, dtype: str = 'float32', input_value_block: srl.rl.models.config.input_value_block.InputValueBlockConfig = <factory>, input_image_block: srl.rl.models.config.input_image_block.InputImageBlockConfig = <factory>, batch_size: int = 32, memory: srl.rl.memories.replay_buffer.ReplayBufferConfig = <factory>, discount: float = 0.9, lr_policy: float = 0.001, lr_policy_scheduler: srl.rl.schedulers.lr_scheduler.LRSchedulerConfig = <factory>, lr_q: float = 0.001, lr_q_scheduler: srl.rl.schedulers.lr_scheduler.LRSchedulerConfig = <factory>, lr_alpha: float = 0.001, lr_alpha_scheduler: srl.rl.schedulers.lr_scheduler.LRSchedulerConfig = <factory>, soft_target_update_tau: float = 0.02, hard_target_update_interval: int = 100, enable_normal_squashed: bool = True, entropy_alpha_auto_scale: bool = True, entropy_alpha: float = 0.2, entropy_bonus_exclude_q: float = False, enable_stable_gradients: bool = True, stable_gradients_scale_range: tuple = (1e-10, 10))
- input_value_block: InputValueBlockConfig
<MLPBlock> policy layer
- input_image_block: InputImageBlockConfig
<MLPBlock>
- batch_size: int = 32
Batch size
- memory: ReplayBufferConfig
- discount: float = 0.9
Discount factor
- lr_policy: float = 0.001
Policy learning rate
- lr_policy_scheduler: LRSchedulerConfig
- lr_q: float = 0.001
Q-network learning rate
- lr_q_scheduler: LRSchedulerConfig
- lr_alpha: float = 0.001
Learning rate for the entropy temperature alpha
- lr_alpha_scheduler: LRSchedulerConfig
- soft_target_update_tau: float = 0.02
Soft update rate (tau) for the target networks
- hard_target_update_interval: int = 100
Interval (in training steps) between hard target network updates
- enable_normal_squashed: bool = True
When the action space is continuous, whether to squash the sampled normal distribution into the range -1 to 1 with tanh
- entropy_alpha_auto_scale: bool = True
Whether to automatically tune the entropy alpha
- entropy_alpha: float = 0.2
Initial value of the entropy alpha
- entropy_bonus_exclude_q: bool = False
Exclude the entropy bonus from the Q-value calculation
- enable_stable_gradients: bool = True
Countermeasure against exploding gradients: clip the mean, variance, and sampled actions so they do not produce extreme values
- stable_gradients_scale_range: tuple = (1e-10, 10)
Clipping range for the standard deviation when enable_stable_gradients is enabled
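
A minimal usage sketch follows. It is not taken verbatim from the library's documentation: it assumes the `srl.Runner` API shown in the project's README, and the environment name "Pendulum-v1" is only an example of a continuous-action task; the config fields set below are the ones documented above.

```python
import srl
from srl.algorithms import sac

# Create the SAC config and override a few of the fields documented above.
rl_config = sac.Config()
rl_config.batch_size = 64
rl_config.discount = 0.99                  # discount factor
rl_config.lr_policy = 5e-4                 # policy learning rate
rl_config.lr_q = 5e-4                      # Q-network learning rate
rl_config.entropy_alpha_auto_scale = True  # auto-tune the entropy alpha
rl_config.enable_stable_gradients = True   # clip mean/std/actions against exploding gradients

# "Pendulum-v1" is an assumption here; any continuous-action environment
# usable with srl could be substituted.
runner = srl.Runner("Pendulum-v1", rl_config)
runner.train(timeout=30)      # train for roughly 30 seconds
rewards = runner.evaluate()   # episode rewards from the evaluation runs
print(rewards)
```

The default discount of 0.9 is fairly short-sighted; raising it (e.g. to 0.99, as above) is a common choice for longer-horizon continuous-control tasks.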