PPO (Proximal Policy Optimization)
- class srl.algorithms.ppo.Config(batch_size: int = 32, memory_capacity: int = 100000, memory_warmup_size: int = 1000, memory_compress: bool = True, memory_compress_level: int = -1, observation_mode: str | ~srl.base.define.ObservationModes = ObservationModes.ENV, override_observation_type: ~srl.base.define.SpaceTypes = SpaceTypes.UNKNOWN, override_action_type: str | ~srl.base.define.RLBaseActTypes = <RLBaseActTypes.NONE: 1>, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: ~typing.Type[ExtendWorker] | None = None, parameter_path: str = '', memory_path: str = '', use_rl_processor: bool = True, processors: ~typing.List[RLProcessor] = <factory>, render_image_processors: ~typing.List[RLProcessor] = <factory>, enable_state_encode: bool = True, enable_action_decode: bool = True, enable_reward_encode: bool = True, enable_done_encode: bool = True, window_length: int = 1, render_image_window_length: int = 1, enable_sanitize: bool = True, enable_assertion: bool = False, experience_collection_method: str = 'MC', discount: float = 0.9, gae_discount: float = 0.9, baseline_type: str = 'ave', surrogate_type: str = 'clip', policy_clip_range: float = 0.2, adaptive_kl_target: float = 0.01, enable_value_clip: float = False, value_clip_range: float = 0.2, lr: float | ~srl.rl.schedulers.scheduler.SchedulerConfig = 0.01, value_loss_weight: float = 1.0, entropy_weight: float = 0.1, enable_state_normalized: bool = False, global_gradient_clip_norm: float = 0.5, state_clip: ~typing.Tuple[float, float] | None = None, reward_clip: ~typing.Tuple[float, float] | None = None, enable_stable_gradients: bool = True, stable_gradients_scale_range: tuple = (1e-10, 10))
<ExperienceReplayBuffer> <RLConfigComponentInput>
- hidden_block: MLPBlockConfig
<MLPBlock> hidden layers
- value_block: MLPBlockConfig
<MLPBlock> value layers
- policy_block: MLPBlockConfig
<MLPBlock> policy layers
- experience_collection_method: str = 'MC'
How the discounted reward is computed (see the example after this entry)
- Parameters:
"MC" -- Monte Carlo method
"GAE" -- Generalized Advantage Estimator
- discount: float = 0.9
Discount factor
- gae_discount: float = 0.9
Discount factor used by GAE
- baseline_type: str = 'ave'
Baseline applied to the advantage (see the sketch after this entry)
- Parameters:
"none" ("") -- no baseline
"ave" -- (adv - mean)
"std" -- adv/std
"normal" -- (adv - mean)/std
"v" ("advantage") -- adv - v
- surrogate_type: str = 'clip'
Type of surrogate objective
- Parameters:
"" -- no surrogate
"clip" -- Clipped Surrogate Objective
"kl" -- Adaptive KLペナルティ
- policy_clip_range: float = 0.2
Clipping range (epsilon) of the Clipped Surrogate Objective; see the sketch below
- adaptive_kl_target: float = 0.01
Target constant used inside the adaptive KL penalty
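"clip" is the standard PPO clipped objective with policy_clip_range as the clipping epsilon, while "kl" adds a KL penalty whose coefficient is adapted toward adaptive_kl_target (the usual adaptive-KL variant of PPO). An illustrative sketch of the clipped variant (formula only, not the library code):

    import numpy as np

    def clipped_surrogate(ratio: np.ndarray, adv: np.ndarray, policy_clip_range: float = 0.2) -> np.ndarray:
        # ratio = pi_new(a|s) / pi_old(a|s)
        # objective to maximize: min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)
        clipped = np.clip(ratio, 1.0 - policy_clip_range, 1.0 + policy_clip_range)
        return np.minimum(ratio * adv, clipped * adv)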
- enable_value_clip: float = False
Whether to also clip the value estimate (value clipping)
- value_clip_range: float = 0.2
Clipping range used when enable_value_clip is enabled
- lr: float | SchedulerConfig = 0.01
<Scheduler> Learning rate
- value_loss_weight: float = 1.0
Weight of the state-value loss term
- entropy_weight: float = 0.1
Weight of the entropy bonus (see the sketch after this entry)
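These two weights combine the loss terms in the usual PPO fashion; the exact composition inside the library may differ, but they enter roughly like this:

    def ppo_total_loss(policy_loss: float, value_loss: float, entropy: float,
                       value_loss_weight: float = 1.0, entropy_weight: float = 0.1) -> float:
        # The value loss is scaled by value_loss_weight and the entropy bonus
        # (which encourages exploration) is subtracted with weight entropy_weight.
        return policy_loss + value_loss_weight * value_loss - entropy_weight * entropy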
- enable_state_normalized: bool = False
Whether to normalize states
- global_gradient_clip_norm: float = 0.5
Gradient clipping value on the global L2 norm (0 disables it)
- state_clip: Tuple[float, float] | None = None
State clipping range (None disables it; specify e.g. (-10, 10))
- reward_clip: Tuple[float, float] | None = None
Reward clipping range (None disables it; specify e.g. (-10, 10)); see the example below
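For example, enabling the clipping options described above (hypothetical values; rl_config is a ppo.Config instance as in the earlier sketch):

    rl_config.state_clip = (-10, 10)            # clip each observation element to [-10, 10]
    rl_config.reward_clip = (-10, 10)           # clip rewards to [-10, 10]
    rl_config.enable_state_normalized = True    # also normalize states
    rl_config.global_gradient_clip_norm = 0.5   # 0 disables global-norm gradient clipping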
- enable_stable_gradients: bool = True
Countermeasure against exploding gradients: clips the mean, variance, and sampled random actions so they do not produce extreme values
- stable_gradients_scale_range: tuple = (1e-10, 10)
Clipping range for the standard deviation when enable_stable_gradients is enabled
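Putting the pieces together, a minimal end-to-end sketch (the srl.Runner usage follows the library's common pattern and is an assumption here; the "Grid" environment name and all values are only illustrative):

    import srl
    from srl.algorithms import ppo

    rl_config = ppo.Config(batch_size=32, discount=0.99)
    rl_config.experience_collection_method = "GAE"
    rl_config.surrogate_type = "clip"
    rl_config.policy_clip_range = 0.2

    runner = srl.Runner("Grid", rl_config)      # "Grid" is assumed to be a registered sample env
    runner.train(max_train_count=10_000)        # train for a fixed number of updates
    rewards = runner.evaluate(max_episodes=10)  # evaluation episode rewards
    print(rewards)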