Argument documentation

Clean up unused variables and adapt to environment 3.6
Modify the Critic_coef parameter
2024-03-02 17:36:33 +09:00 · 2024-01-24 17:07:45 +09:00 · 2023-11-23 15:29:13 +09:00
6 changed files with 124 additions and 22 deletions
--- a/Aimbot-PPO-Python/Pytorch/MultiNN-PPO.py
+++ b/Aimbot-PPO-Python/Pytorch/MultiNN-PPO.py
@@ -20,7 +20,7 @@ import torch.optim as optim
 SIDE_CHANNEL_UUID = uuid.UUID("8bbfb62a-99b4-457c-879d-b78b69066b5e")
 # tensorboard names
 GAME_NAME = "Aimbot_Hybrid_Full_MNN_MultiLevel"
-GAME_TYPE = "GotoOnly-Level2345"
+GAME_TYPE = "GotoOnly-3.6-Level0123-newModel-Onehot"

 if __name__ == "__main__":
    args = parse_args()
--- a/Aimbot-PPO-Python/Pytorch/aimemory.py
+++ b/Aimbot-PPO-Python/Pytorch/aimemory.py
@@ -58,6 +58,7 @@ class PPOMem:
            # print("Win! Broadcast reward!",rewardBF[-1])
            print(sum(thisRewardBF) / len(thisRewardBF))
            thisRewardBF[-1] = rewardBF[-1] - self.base_win_reward
+            # broadcast result reward, increase all reward in this round by remainTime * self.result_broadcast_ratio
            thisRewardBF = (np.asarray(thisRewardBF) + (remainTime * self.result_broadcast_ratio)).tolist()
        else:
            print("!!!!!DIDNT GET RESULT REWARD!!!!!!", rewardBF[-1])
@@ -88,7 +89,7 @@ class PPOMem:
                self.dones_bf[i].append(done[i])
                self.values_bf[i].append(value_cpu[i])
                if now_step % self.decision_period == 0:
-                    # on decision period, add last skiped round's reward
+                    # on decision period, add last skipped round's reward; only takes effect when decision_period != 1
                    self.rewards_bf[i].append(reward[i] + last_reward[i])
                else:
                    # not on decision period, only add this round's reward
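The win branch in the first `aimemory.py` hunk above can be sketched in isolation as follows. This is a minimal sketch, assuming plain Python lists in place of the real buffers and assuming the last elements of `rewardBF` and `thisRewardBF` refer to the same step. `broadcast_win_reward` is a hypothetical helper name, not a function in `aimemory.py`, and its default values simply mirror the documented defaults for `--base-win-reward` and `--result-broadcast-ratio`.

```python
def broadcast_win_reward(round_rewards, remain_time, base_win_reward=999.0,
                         result_broadcast_ratio=1 / 30):
    """Hypothetical helper mirroring the win branch shown in the diff."""
    rewards = list(round_rewards)
    # Strip the base win reward from the final step of the round.
    rewards[-1] = rewards[-1] - base_win_reward
    # Add the same time-based bonus to every step of the round.
    bonus = remain_time * result_broadcast_ratio
    return [r + bonus for r in rewards]


# A round won with 15 time units left on the clock.
shaped = broadcast_win_reward([0.0, 1.0, 1000.0], remain_time=15.0)
```

With these inputs every step receives a bonus of about 0.5, so rounds that are won faster push up the reward of every step in the round rather than only the last one.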
--- a/Aimbot-PPO-Python/Pytorch/arguments-cn.md
+++ b/Aimbot-PPO-Python/Pytorch/arguments-cn.md
@@ -0,0 +1,56 @@
+
+
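+This project uses the following command-line arguments to configure the runtime environment and the model training parameters:
+
+- `--seed <int>`: Random seed of the experiment. Default: `9331`.
+- `--path <str>`: Path to the environment. Default: `"./Build/3.6/Aimbot-ParallelEnv"`.
+- `--workerID <int>`: Unity worker ID. Default: `1`.
+- `--baseport <int>`: Port used to connect to the Unity environment. Default: `500`.
+- `--lr <float>`: Default learning rate of the optimizer. Default: `5e-5`.
+- `--cuda`: If enabled, cuda is used by default. Pass `true` or `false` to turn it on or off.
+- `--total-timesteps <int>`: Total number of timesteps of the experiment. Default: `3150000`.
+
+### Model parameters
+
+- `--train`: Whether to train the model. Enabled by default.
+- `--freeze-viewnet`: Whether to freeze the view network (raycast). Default: `False`.
+- `--datasetSize <int>`: Size of the training dataset; training starts once enough data has been collected. Default: `6000`.
+- `--minibatchSize <int>`: Minibatch size. Default: `512`.
+- `--epochs <int>`: Number of epochs (K) used to update the policy. Default: `3`.
+- `--annealLR`: Whether to anneal the learning rate of the policy and value networks. Default: `True`.
+- `--wandb-track`: Whether to track the run on wandb. Default: `False`.
+- `--save-model`: Whether to save the model. Default: `False`.
+- `--wandb-entity <str>`: Entity of the wandb project. Default: `"koha9"`.
+- `--load-dir <str>`: Directory to load the model from. Default: `None`.
+- `--decision-period <int>`: Interval, in timesteps, between action executions. Default: `1`.
+- `--result-broadcast-ratio <float>`: Ratio used to broadcast the result reward when a round is won. Default: `1/30`.
+- `--target-lr <float>`: Target value the learning rate is annealed down to. Default: `1e-6`.
+
+### Loss function parameters
+
+- `--policy-coef <float>`: Coefficient of the policy loss. Default: `[0.8, 0.8, 0.8, 0.8]`.
+- `--entropy-coef <float>`: Coefficient of the entropy loss. Default: `[0.05, 0.05, 0.05, 0.05]`.
+- `--critic-coef <float>`: Coefficient of the critic loss. Default: `[1.0, 1.0, 1.0, 1.0]`.
+- `--loss-coef <float>`: Coefficient of the total loss. Default: `[1.0, 1.0, 1.0, 1.0]`.
+
+### GAE loss parameters
+
+- `--gae`: Whether to use GAE for advantage computation. Enabled by default.
+- `--norm-adv`: Whether to normalize advantages. Default: `False`.
+- `--gamma <float>`: Discount factor gamma. Default: `0.999`.
+- `--gaeLambda <float>`: Lambda value for GAE. Default: `0.95`.
+- `--clip-coef <float>`: Surrogate clipping coefficient. Default: `0.11`.
+- `--clip-vloss`: Whether to use the clipped value function loss described in the paper. Enabled by default.
+- `--max-grad-norm <float>`: Maximum norm for gradient clipping. Default: `0.5`.
+
+### Environment parameters
+
+- `--target-num <int>`: Number of target types. Default: `4`.
+- `--env-timelimit <int>`: Time limit of each round. Default: `30`.
+- `--base-win-reward <int>`: Base reward for winning a round. Default: `999`.
+- `--base-lose-reward <int>`: Base reward for losing a round. Default: `-999`.
+- `--target-state-size <int>`: Size of the target state. Default: `6`.
+- `--time-state-size <int>`: Size of the remaining-game-time state. Default: `1`.
+- `--gun-state-size <int>`: Size of the gun state. Default: `1`.
+- `--my-state-size <int>`: Size of the agent's own state. Default: `4`.
+- `--total-target-size <int>`: Size of the total target state. Default: `12`.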
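A few of the flags documented in `arguments-cn.md` could be wired up with `argparse` roughly as below. This is a sketch under stated assumptions, not the contents of `arguments.py`: `str_to_bool` and `build_parser` are hypothetical names, the `--cuda` default of `True` is assumed, and the boolean parser is a hand-written stand-in for `distutils.util.strtobool`, since `distutils` was removed from the standard library in Python 3.12.

```python
import argparse


def str_to_bool(text):
    """Parse 'true'/'false'-style strings (stand-in for distutils' strtobool)."""
    value = text.strip().lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return True
    if value in ("n", "no", "f", "false", "off", "0"):
        return False
    raise argparse.ArgumentTypeError(f"invalid boolean value: {text!r}")


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=9331)
    parser.add_argument("--lr", type=float, default=5e-5)
    # nargs="?" with const=True lets both `--cuda` and `--cuda false` work.
    # The default of True is an assumption made for this sketch.
    parser.add_argument("--cuda", type=str_to_bool, nargs="?", const=True, default=True)
    parser.add_argument("--gamma", type=float, default=0.999)
    parser.add_argument("--decision-period", type=int, default=1)
    return parser


args = build_parser().parse_args(["--lr", "1e-4", "--cuda", "false"])
```

`argparse` turns the dashes in a flag name into underscores, so `--decision-period` is read back as `args.decision_period`.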
--- a/Aimbot-PPO-Python/Pytorch/arguments-jp.md
+++ b/Aimbot-PPO-Python/Pytorch/arguments-jp.md
@@ -0,0 +1,52 @@
+- `--seed <int>`: Random seed of the experiment. Default: `9331`.
+- `--path <str>`: Path to the environment. Default: `"./Build/3.6/Aimbot-ParallelEnv"`.
+- `--workerID <int>`: Unity worker ID. Default: `1`.
+- `--baseport <int>`: Port used to connect to the Unity environment. Default: `500`.
+- `--lr <float>`: Default learning rate of the optimizer. Default: `5e-5`.
+- `--cuda`: If enabled, cuda is used by default. Pass `true` or `false` to enable or disable it.
+- `--total-timesteps <int>`: Total number of timesteps of the experiment. Default: `3150000`.
+
+### Model parameters
+
+- `--train`: Whether to train the model. Enabled by default.
+- `--freeze-viewnet`: Freeze the view network (raycast). Default: `False`.
+- `--datasetSize <int>`: Size of the training dataset. Training starts once the dataset has collected enough data. Default: `6000`.
+- `--minibatchSize <int>`: Minibatch size. Default: `512`.
+- `--epochs <int>`: Number of epochs. Default: `3`.
+- `--annealLR`: Whether to anneal the learning rate of the policy and value networks. Default: `True`.
+- `--wandb-track`: Whether to track the run on wandb. Default: `False`.
+- `--save-model`: Whether to save the model. Default: `False`.
+- `--wandb-entity <str>`: Entity of the wandb project. Default: `"koha9"`.
+- `--load-dir <str>`: Directory to load the model from. Default: `None`.
+- `--decision-period <int>`: Interval, in timesteps, at which actions are actually executed. Default: `1`.
+- `--result-broadcast-ratio <float>`: Broadcast ratio of the reward when a round is won. Default: `1/30`.
+- `--target-lr <float>`: Target value when lowering the learning rate. Default: `1e-6`.
+
+### Loss function parameters
+
+- `--policy-coef <float>`: Coefficient of the policy loss. Default: `[0.8, 0.8, 0.8, 0.8]`.
+- `--entropy-coef <float>`: Coefficient of the entropy loss. Default: `[0.05, 0.05, 0.05, 0.05]`.
+- `--critic-coef <float>`: Coefficient of the critic loss. Default: `[1.0, 1.0, 1.0, 1.0]`.
+- `--loss-coef <float>`: Coefficient of the overall loss. Default: `[1.0, 1.0, 1.0, 1.0]`.
+
+### GAE loss parameters
+
+- `--gae`: Whether to use GAE to compute advantages. Enabled by default.
+- `--norm-adv`: Whether to normalize advantages. Default: `False`.
+- `--gamma <float>`: Discount factor gamma. Default: `0.999`.
+- `--gaeLambda <float>`: Lambda value for GAE. Default: `0.95`.
+- `--clip-coef <float>`: Surrogate clipping coefficient. Default: `0.11`.
+- `--clip-vloss`: Whether to use the clipped value function loss described in the paper. Enabled by default.
+- `--max-grad-norm <float>`: Maximum norm for gradient clipping. Default: `0.5`.
+
+### Environment parameters
+
+- `--target-num <int>`: Number of target types. Default: `4`.
+- `--env-timelimit <int>`: Time limit per round. Default: `30`.
+- `--base-win-reward <int>`: Base reward for winning a round. Default: `999`.
+- `--base-lose-reward <int>`: Base reward for losing a round. Default: `-999`.
+- `--target-state-size <int>`: Size of the target state. Default: `6`.
+- `--time-state-size <int>`: Size of the remaining-game-time state. Default: `1`.
+- `--gun-state-size <int>`: Size of the gun state. Default: `1`.
+- `--my-state-size <int>`: Size of the agent's own state. Default: `4`.
+- `--total-target-size <int>`: Size of the state of all targets. Default: `12`.
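The roles of `--gamma` and `--gaeLambda` can be illustrated with a textbook Generalized Advantage Estimation loop. This is a pure-Python sketch, not the project's implementation: `compute_gae` is a hypothetical name, and it assumes that `dones[t]` marks the episode ending right after step `t`, so no value is bootstrapped across that boundary.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.999, gae_lambda=0.95):
    """Generalized Advantage Estimation over one trajectory (pure Python)."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
        next_value = values[t]
    # Value targets for the critic are advantages plus the value estimates.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    last_value=5.0,
    gamma=1.0,
    gae_lambda=1.0,
)
```

With `gamma` and `gae_lambda` both set to 1 and zero value estimates, the advantages reduce to the plain reward-to-go, which makes the loop easy to check by hand. Because the final step is terminal, `last_value` is ignored.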
--- a/Aimbot-PPO-Python/Pytorch/arguments.py
+++ b/Aimbot-PPO-Python/Pytorch/arguments.py
@@ -4,7 +4,7 @@ import uuid
 from distutils.util import strtobool

 DEFAULT_SEED = 9331
-ENV_PATH = "../Build/3.4/Aimbot-ParallelEnv"
+ENV_PATH = "../Build/3.6/Aimbot-ParallelEnv"
 WAND_ENTITY = "koha9"
 WORKER_ID = 1
 BASE_PORT = 1000
@@ -16,15 +16,15 @@ TOTAL_STEPS = 3150000
 BATCH_SIZE = 512
 MAX_TRAINNING_DATASETS = 6000
 DECISION_PERIOD = 1
-LEARNING_RATE = 1.5e-4
-GAMMA = 0.99
+LEARNING_RATE = 5e-5
+GAMMA = 0.999
 GAE_LAMBDA = 0.95
 EPOCHS = 3
 CLIP_COEF = 0.11
 LOSS_COEF = [1.0, 1.0, 1.0, 1.0] # free go attack defence
-POLICY_COEF = [1.0, 1.0, 1.0, 1.0]
+POLICY_COEF = [0.8, 0.8, 0.8, 0.8]
 ENTROPY_COEF = [0.05, 0.05, 0.05, 0.05]
-CRITIC_COEF = [0.5, 0.5, 0.5, 0.5]
+CRITIC_COEF = [1.0, 1.0, 1.0, 1.0]
 TARGET_LEARNING_RATE = 1e-6

 FREEZE_VIEW_NETWORK = False
@@ -35,7 +35,7 @@ TRAIN = True
 SAVE_MODEL = True
 WANDB_TACK = True
 LOAD_DIR = None
-LOAD_DIR = "../PPO-Model/GotoOnly-Level1234_9331_1697122986/8.853553.pt"
+# LOAD_DIR = "../PPO-Model/GotoOnly-Level1234_9331_1697122986/8.853553.pt"

 # Unity Environment Parameters
 TARGET_STATE_SIZE = 6
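The per-target coefficient lists changed in `arguments.py` weight the individual loss terms. How the training loop combines them is not shown in this diff, so the sketch below assumes the common PPO form (policy loss minus an entropy bonus plus the value loss, scaled by the overall coefficient). `total_loss` is a hypothetical function and the input numbers are made up for illustration.

```python
POLICY_COEF = [0.8, 0.8, 0.8, 0.8]
ENTROPY_COEF = [0.05, 0.05, 0.05, 0.05]
CRITIC_COEF = [1.0, 1.0, 1.0, 1.0]
LOSS_COEF = [1.0, 1.0, 1.0, 1.0]  # free go attack defence


def total_loss(target_index, policy_loss, entropy, value_loss):
    """Weighted PPO loss for one target type (assumed combination)."""
    i = target_index
    weighted = (
        POLICY_COEF[i] * policy_loss
        - ENTROPY_COEF[i] * entropy  # entropy bonus encourages exploration
        + CRITIC_COEF[i] * value_loss
    )
    return LOSS_COEF[i] * weighted


# Index 1 is the "go" target type according to the comment on LOSS_COEF.
loss = total_loss(1, policy_loss=0.5, entropy=2.0, value_loss=0.25)
```

Under this assumed form, raising `CRITIC_COEF` from 0.5 to 1.0 while lowering `POLICY_COEF` from 1.0 to 0.8 shifts the gradient weight toward the value head.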
--- a/Aimbot-PPO-Python/Pytorch/ppoagent.py
+++ b/Aimbot-PPO-Python/Pytorch/ppoagent.py
@@ -14,6 +14,8 @@ def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    nn.init.constant_(layer.bias, bias_const)
    return layer

+neural_size_1 = 400
+neural_size_2 = 300

 class PPOAgent(nn.Module):
    def __init__(
@@ -31,15 +33,6 @@ class PPOAgent(nn.Module):
        self.unity_action_size = env.unity_action_size
        self.state_size = self.unity_observation_shape[0]
        self.agent_num = env.unity_agent_num
-        self.target_size = self.args.target_state_size
-        self.time_state_size = self.args.time_state_size
-        self.gun_state_size = self.args.gun_state_size
-        self.my_state_size = self.args.my_state_size
-        self.ray_state_size = env.unity_observation_shape[0] - self.args.total_target_size
-        self.state_size_without_ray = self.args.total_target_size
-        self.head_input_size = (
-                env.unity_observation_shape[0] - self.target_size - self.time_state_size - self.gun_state_size
-        )  # except target state input

        self.unity_discrete_type = env.unity_discrete_type
        self.discrete_size = env.unity_discrete_size
@@ -49,9 +42,9 @@ class PPOAgent(nn.Module):
        self.hidden_networks = nn.ModuleList(
            [
                nn.Sequential(
-                    layer_init(nn.Linear(self.state_size, 256)),
+                    layer_init(nn.Linear(self.state_size, neural_size_1)),
                    nn.LeakyReLU(),
-                    layer_init(nn.Linear(256, 128)),
+                    layer_init(nn.Linear(neural_size_1, neural_size_2)),
                    nn.LeakyReLU(),
                    )
                for i in range(self.target_num)
@@ -59,16 +52,16 @@ class PPOAgent(nn.Module):
        )

        self.actor_dis = nn.ModuleList(
-            [layer_init(nn.Linear(128, self.discrete_size), std=0.5) for i in range(self.target_num)]
+            [layer_init(nn.Linear(neural_size_2, self.discrete_size), std=0.5) for i in range(self.target_num)]
        )
        self.actor_mean = nn.ModuleList(
-            [layer_init(nn.Linear(128, self.continuous_size), std=0) for i in range(self.target_num)]
+            [layer_init(nn.Linear(neural_size_2, self.continuous_size), std=0) for i in range(self.target_num)]
        )
        self.actor_logstd = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, self.continuous_size)) for i in range(self.target_num)]
        )
        self.critic = nn.ModuleList(
-            [layer_init(nn.Linear(128, 1), std=0) for i in range(self.target_num)]
+            [layer_init(nn.Linear(neural_size_2, 1), std=0) for i in range(self.target_num)]
        )

    def get_value(self, state: torch.Tensor):
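To get a feel for how much the hidden layers grow when going from 256/128 to `neural_size_1 = 400` and `neural_size_2 = 300`, the parameter count of one hidden network can be computed by hand. The helpers below are hypothetical, and `STATE_SIZE = 100` is a made-up value: the real state size comes from the Unity observation shape, which this diff does not show.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def hidden_params(state_size, size_1, size_2):
    """Parameters of one two-layer hidden network (one exists per target type)."""
    return linear_params(state_size, size_1) + linear_params(size_1, size_2)


STATE_SIZE = 100  # hypothetical; the real value comes from the Unity observation shape

old = hidden_params(STATE_SIZE, 256, 128)  # 58752
new = hidden_params(STATE_SIZE, 400, 300)  # 160700
```

The `LeakyReLU` activations add no parameters. For this assumed state size, each hidden network grows by a factor of roughly 2.7, and the model holds one such network per target type.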
Author	SHA1	Message	Date
Koha9	573b09a920	Argument documentation	2024-03-02 17:36:33 +09:00
Koha9	9d9524429c	Clean up unused variables and adapt to environment 3.6	2024-01-24 17:07:45 +09:00
Koha9	5aa7e0936a	Modify the Critic_coef parameter	2023-11-23 15:29:13 +09:00