MuZero
- class srl.algorithms.muzero.Config(batch_size: int = 32, memory_capacity: int = 100000, memory_warmup_size: int = 1000, memory_compress: bool = True, memory_compress_level: int = -1, observation_mode: str | ObservationModes = ObservationModes.ENV, override_observation_type: SpaceTypes = SpaceTypes.UNKNOWN, override_action_type: str | RLBaseActTypes = RLBaseActTypes.NONE, action_division_num: int = 10, observation_division_num: int = 1000, frameskip: int = 0, extend_worker: Type[ExtendWorker] | None = None, parameter_path: str = '', memory_path: str = '', use_rl_processor: bool = True, processors: List[RLProcessor] = <factory>, render_image_processors: List[RLProcessor] = <factory>, enable_state_encode: bool = True, enable_action_decode: bool = True, enable_reward_encode: bool = True, enable_done_encode: bool = True, window_length: int = 1, render_image_window_length: int = 1, enable_sanitize: bool = True, enable_assertion: bool = False, num_simulations: int = 20, discount: float = 0.99, lr: float | SchedulerConfig = 0.001, v_min: int = -10, v_max: int = 10, policy_tau: float | SchedulerConfig = 0.25, unroll_steps: int = 3, root_dirichlet_alpha: float = 0.3, root_exploration_fraction: float = 0.25, c_base: float = 19652, c_init: float = 1.25, dynamics_blocks: int = 15, reward_dense_units: int = 0, weight_decay: float = 0.0001, enable_rescale: bool = True)
Bases: PriorityExperienceReplay, RLConfigComponentInput
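A minimal usage sketch, assuming the standard srl Runner API; the environment name and training budget here are illustrative:

    import srl
    from srl.algorithms import muzero

    # Configure MuZero; fields not set here keep the defaults listed below.
    rl_config = muzero.Config(
        num_simulations=20,
        discount=0.99,
        lr=0.001,
    )

    runner = srl.Runner("Grid", rl_config)  # "Grid": a sample environment
    runner.train(max_train_count=10000)
    print(runner.evaluate(max_episodes=10))  # episode rewards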
- num_simulations: int = 20
Number of MCTS simulations run per action selection.
- discount: float = 0.99
Discount factor for future rewards.
- lr: float | SchedulerConfig = 0.001
Learning rate. A SchedulerConfig can be passed instead of a float to anneal it during training.
- v_min: int = -10
Lower bound of the support used for the categorical value representation.
- v_max: int = 10
Upper bound of the support used for the categorical value representation; see the projection sketch after this list.
- policy_tau: float | SchedulerConfig = 0.25
Temperature parameter applied when converting MCTS visit counts into the action policy. A SchedulerConfig can be passed to anneal it.
- unroll_steps: int = 3
Number of steps the dynamics network is unrolled during training.
- root_dirichlet_alpha: float = 0.3
Alpha parameter of the Dirichlet noise added to the root prior for exploration.
- root_exploration_fraction: float = 0.25
Fraction of the root prior replaced by the exploration noise; see the noise sketch after this list.
- c_base: float = 19652
c_base constant of the PUCT selection formula.
- c_init: float = 1.25
c_init constant of the PUCT selection formula; the formula is sketched after this list.
- dynamics_blocks: int = 15
Number of residual blocks in the dynamics network.
- reward_dense_units: int = 0
Number of units in the dense layer of the reward prediction head.
- weight_decay: float = 0.0001
Weight decay (L2 regularization) coefficient for network training.
- enable_rescale: bool = True
Whether to rescale values and rewards with the invertible transform h(x) from the MuZero paper; sketched below.
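The v_min / v_max bounds define the discrete support onto which scalar values are projected. A minimal sketch of that projection, following the two-hot encoding described in the MuZero paper (the function name is illustrative):

    import numpy as np

    def scalar_to_categorical(x: float, v_min: int = -10, v_max: int = 10) -> np.ndarray:
        # Clip the scalar to the support, then split its mass between the
        # two nearest integer bins so the distribution's expectation
        # recovers the (clipped) value.
        x = float(np.clip(x, v_min, v_max))
        probs = np.zeros(v_max - v_min + 1)
        low = int(np.floor(x))
        frac = x - low
        probs[low - v_min] = 1.0 - frac
        if frac > 0.0:
            probs[low - v_min + 1] = frac
        return probs

    # Example: 3.7 becomes 0.3 mass at bin "3" and 0.7 at bin "4".
    print(scalar_to_categorical(3.7))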
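root_dirichlet_alpha and root_exploration_fraction control the exploration noise mixed into the root prior, as in the AlphaZero/MuZero papers. A hedged sketch of the usual combination, not this implementation's exact code:

    import numpy as np

    def add_root_noise(priors: np.ndarray, alpha: float = 0.3, fraction: float = 0.25) -> np.ndarray:
        # P'(a) = (1 - fraction) * P(a) + fraction * eta_a,  eta ~ Dirichlet(alpha)
        noise = np.random.dirichlet([alpha] * len(priors))
        return (1.0 - fraction) * priors + fraction * noise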
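c_base and c_init parameterize the PUCT selection rule from the MuZero paper (Appendix B); a minimal sketch:

    import math

    def puct_score(q: float, prior: float, parent_visits: int, child_visits: int,
                   c_base: float = 19652.0, c_init: float = 1.25) -> float:
        # The exploration weight c grows slowly with the parent's visit count.
        c = math.log((1 + parent_visits + c_base) / c_base) + c_init
        return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)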
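enable_rescale toggles the invertible value transform h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps * x (Pohlen et al., 2018), which MuZero uses to compress value and reward targets before the categorical projection. A sketch of the transform and its exact inverse:

    import math

    def rescale(x: float, eps: float = 0.001) -> float:
        # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
        return math.copysign(1.0, x) * (math.sqrt(abs(x) + 1.0) - 1.0) + eps * x

    def inverse_rescale(y: float, eps: float = 0.001) -> float:
        # Exact inverse of h, from the MuZero paper's appendix.
        n = math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0
        return math.copysign(1.0, y) * ((n / (2.0 * eps)) ** 2 - 1.0)

    # Round trip: inverse_rescale(rescale(5.0)) is approximately 5.0
    print(inverse_rescale(rescale(5.0)))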