DDPG (Deep Deterministic Policy Gradient)
- class srl.algorithms.ddpg.Config(batch_size: int = 32, memory_capacity: int = 100000, memory_warmup_size: int = 1000, memory_compress: bool = True, memory_compress_level: int = -1, observation_mode: str | srl.base.define.ObservationModes = ObservationModes.ENV, override_observation_type: srl.base.define.SpaceTypes = SpaceTypes.UNKNOWN, override_action_type: str | srl.base.define.RLBaseActTypes = RLBaseActTypes.NONE, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: Type[ExtendWorker] | None = None, parameter_path: str = '', memory_path: str = '', use_rl_processor: bool = True, processors: List[RLProcessor] = <factory>, render_image_processors: List[RLProcessor] = <factory>, enable_state_encode: bool = True, enable_action_decode: bool = True, enable_reward_encode: bool = True, enable_done_encode: bool = True, window_length: int = 1, render_image_window_length: int = 1, enable_sanitize: bool = True, enable_assertion: bool = False, lr: float | srl.rl.schedulers.scheduler.SchedulerConfig = 0.005, discount: float = 0.9, soft_target_update_tau: float = 0.02, hard_target_update_interval: int = 100, noise_stddev: float = 0.2, target_policy_noise_stddev: float = 0.2, target_policy_clip_range: float = 0.5, actor_update_interval: int = 2)
<ExperienceReplayBuffer> <RLConfigComponentInput>
- policy_block: MLPBlockConfig
<MLPBlock> Policy network layers
- q_block: MLPBlockConfig
<MLPBlock> Q-network layers
- lr: float | SchedulerConfig = 0.005
<Scheduler> Learning rate
- discount: float = 0.9
Discount factor
- soft_target_update_tau: float = 0.02
Soft update rate (tau) for the target networks
- hard_target_update_interval: int = 100
Interval between hard target network updates
- noise_stddev: float = 0.2
Standard deviation of the exploration noise added to actions
- target_policy_noise_stddev: float = 0.2
Standard deviation of the noise added to the target policy action
- target_policy_clip_range: float = 0.5
Clip range for the target policy noise
- actor_update_interval: int = 2
Update interval of the actor (delayed policy update)
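
A minimal usage sketch of this config, assuming the standard srl Runner API and a continuous-action environment; the environment name, network sizes, and training budget below are illustrative, not defaults:

.. code-block:: python

    import srl
    from srl.algorithms import ddpg

    # Configure DDPG; the values shown mirror the defaults documented above.
    rl_config = ddpg.Config(
        lr=0.005,                      # learning rate (a SchedulerConfig may also be passed)
        discount=0.9,                  # discount factor
        soft_target_update_tau=0.02,   # soft target update rate
        noise_stddev=0.2,              # exploration noise stddev
        target_policy_noise_stddev=0.2,
        target_policy_clip_range=0.5,
        actor_update_interval=2,       # delayed actor updates
    )

    # Network sizes (assumes MLPBlockConfig.set; adjust to your srl version).
    rl_config.policy_block.set((64, 64))
    rl_config.q_block.set((64, 64))

    # Train on a continuous-control task and evaluate (illustrative budgets).
    runner = srl.Runner("Pendulum-v1", rl_config)
    runner.train(max_train_count=10_000)
    rewards = runner.evaluate(max_episodes=10)
    print(rewards)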