Preface

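What did I do today?

Ran experiments.

Which experiments did I run today, and what were they for?

Few-sample speed test: speed up a normal LLM-MAS run over a dataset
single agent vs. LLM-MAS comparison: verify that LLM-MAS is effective, i.e. that LLM-MAS accuracy > single-agent accuracy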

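1. Few-sample speed test

Using 5 samples from TruthfulQA, 3 rounds of competition, and 4 normal agents, I repeated a test -> modify loop until the LLM-MAS answering speed was acceptable.

Data-structure optimizations:

Each review is reduced to: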
reviewer_agent_id
target_agent_id
score
stance
main_reason

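The full reviews (every agent's evaluation of every other agent) are no longer passed to the next round. Instead, the reviews are summarized into prior_feedback, and only that summary is passed to the next round.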

{
  "previous_answer": {
    "selected_option_ids": ["B"],
    "reasoning": "..."
  },
  "feedback_summary": {
    "total_score": 18,
    "average_score": 6.0,
    "reviews": [
      {
        "reviewer_agent_id": "agent_1",
        "stance": "oppose",
        "score": 4,
        "main_reason": "The question stem specifies multiple_choice; B alone is incomplete"
      },
      {
        "reviewer_agent_id": "agent_3",
        "stance": "oppose",
        "score": 5,
        "main_reason": "C also fits the question and was left out"
      },
      {
        "reviewer_agent_id": "agent_4",
        "stance": "support",
        "score": 9,
        "main_reason": "B captures the core of the question stem"
      }
    ]
  }
}
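
The summarization step above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function name build_prior_feedback and the rounding of average_score are assumptions; only the field names come from the structure shown above.

```python
def build_prior_feedback(previous_answer, reviews, target_agent_id):
    """Summarize the reviews one agent received into a prior_feedback dict.

    `reviews` is a list of simplified review dicts with the keys
    reviewer_agent_id, target_agent_id, score, stance, main_reason.
    """
    # Keep only the reviews addressed to this agent.
    received = [r for r in reviews if r["target_agent_id"] == target_agent_id]
    scores = [r["score"] for r in received]
    return {
        "previous_answer": previous_answer,
        "feedback_summary": {
            "total_score": sum(scores),
            # Assumption: average rounded to 2 decimals, 0.0 when no reviews.
            "average_score": round(sum(scores) / len(scores), 2) if scores else 0.0,
            # target_agent_id is dropped: the receiver already knows it is the target.
            "reviews": [
                {
                    "reviewer_agent_id": r["reviewer_agent_id"],
                    "stance": r["stance"],
                    "score": r["score"],
                    "main_reason": r["main_reason"],
                }
                for r in received
            ],
        },
    }
```

With the three reviews from the example (scores 4, 5, and 9) this gives total_score 18 and average_score 6.0, matching the JSON above.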

Changes to the answering procedure:

Before: each question went through all 3 rounds before the next question started

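Question 1
  Round 1
    4 agents each answer
    4 agents review each other pairwise
    generate each agent's prior_feedback
  Round 2
    4 agents answer again, each based on the feedback it received in the previous round
    pairwise peer review
    update prior_feedback
  Round 3
    4 agents answer again
    pairwise peer review
  save the result for question 1

Question 2
  Round 1
  Round 2
  Round 3
  save the result for question 2

...

Question 5
  Round 1
  Round 2
  Round 3
  save the result for question 5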

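After the change:

All questions complete one round before the next round starts.
The dataset is split into batches.
Batches are not static; their size changes dynamically with the token count of the current questions.
Here the question-length limit is set to 5000 tokens (with 12000, the run stayed stuck in the same round for a long time).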
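
One way to form such dynamic batches is a greedy pass that closes a batch once the next question would exceed the token budget. The sketch below is an assumption about how this could work, not the code used in the experiment: make_batches is a hypothetical name, the 5000-token limit is treated as a per-batch budget, and the default whitespace tokenizer is only a stand-in for the real model tokenizer.

```python
def make_batches(questions, max_tokens=5000, count_tokens=None):
    """Greedily group questions so each batch stays within max_tokens."""
    if count_tokens is None:
        # Stand-in token counter (assumption); replace with the model's tokenizer.
        count_tokens = lambda text: len(text.split())

    batches, current, used = [], [], 0
    for question in questions:
        n = count_tokens(question)
        # Close the current batch if adding this question would exceed the budget.
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(question)
        used += n
    if current:
        batches.append(current)
    return batches
```

A single question longer than the budget still gets a batch of its own here, so nothing is dropped silently.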

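Round 1
  Question 1, question 2, ...
    4 agents each answer
    4 agents review each other pairwise
    generate each agent's prior_feedback

Round 2
    4 agents answer again, each based on the prior_feedback it received in the previous round
    pairwise peer review
    update prior_feedback

Round 3
    4 agents answer again, each based on the prior_feedback it received in the previous round
    pairwise peer review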
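
The round-major order above can be sketched as a loop in which the round is the outer loop and prior_feedback is carried forward per question and agent. This is an illustrative sketch only: run_round_major, answer_fn, and review_fn are hypothetical names, and the two callables stand in for the real LLM calls.

```python
def run_round_major(question_ids, agent_ids, num_rounds, answer_fn, review_fn):
    """Run every question through round r before any question enters round r+1."""
    prior_feedback = {}  # (question_id, agent_id) -> reviews received last round
    call_log = []        # order in which answers were produced
    answers, received = {}, {}

    for round_no in range(1, num_rounds + 1):
        # Step 1: every agent answers every question, using last round's feedback.
        answers = {}
        for qid in question_ids:
            for agent in agent_ids:
                feedback = prior_feedback.get((qid, agent))  # None in round 1
                answers[(qid, agent)] = answer_fn(agent, qid, feedback)
                call_log.append((round_no, qid, agent))

        # Step 2: pairwise peer review; each agent reviews every other agent.
        received = {key: [] for key in answers}
        for qid in question_ids:
            for reviewer in agent_ids:
                for target in agent_ids:
                    if reviewer != target:
                        review = review_fn(reviewer, target, qid, answers[(qid, target)])
                        received[(qid, target)].append(review)

        # Step 3: feedback is only carried forward if another round follows.
        if round_no < num_rounds:
            prior_feedback = received

    return answers, received, call_log
```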

Change in the number of questions answered per call

Round 1
  Batch 1: questions 1-5
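    agent_1 answers 5 questions in one call
    agent_2 answers 5 questions in one call
    agent_3 answers 5 questions in one call
    agent_4 answers 5 questions in one call
    agent_1 reviews the other agents' answers to these 5 questions in one call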
    ...
  Batch 2: questions 6-10
  ...

Round 2
  Batch 1
  Batch 2
  ...

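The theoretical number of LLM calls drops from 120 to 24.
The ordinary optimized version currently needs roughly:

5 questions × 3 rounds × (4 answer + 4 batch-review) = 120 calls

The batched version, with all 5 questions in the same batch, needs:

3 rounds × (4 batch-answer + 4 batch-review) = 24 calls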
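
The same arithmetic can be written as a small helper to check other configurations. This is a sketch under a simplifying assumption: llm_calls is a hypothetical name and it uses a fixed batch_size, whereas the real batches are sized by token count.

```python
import math


def llm_calls(num_questions, num_rounds, num_agents, batch_size=1):
    """Theoretical LLM call count: one answer call and one review call
    per agent, per batch, per round."""
    num_batches = math.ceil(num_questions / batch_size)
    calls_per_batch = num_agents + num_agents  # answers + batch reviews
    return num_batches * num_rounds * calls_per_batch


print(llm_calls(5, 3, 4))                # per-question version: 120
print(llm_calls(5, 3, 4, batch_size=5))  # all 5 questions in one batch: 24
```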

Result of the optimization: about half the run time is saved

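2. Comparing single agent and LLM-MAS

Setup: a single dataset, 50 questions, 3 rounds, 4 agents, seed=42

In this run LLM-MAS was more accurate than the single agent. Only with that established does attacking LLM-MAS afterwards make sense.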
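
For reference, a comparison of this kind could be set up along the following lines: draw the same 50 questions for both systems with a fixed seed, then score each system's selected options against the gold answers. The function names and the exact-match scoring rule are assumptions for illustration, not the evaluation code used here.

```python
import random


def sample_questions(dataset, k=50, seed=42):
    """Draw a reproducible sample so both systems see the same questions."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)


def accuracy(predictions, gold):
    """Fraction of questions whose selected option ids exactly match the gold set.

    Both arguments map question id -> list of option ids, e.g. {"q1": ["B"]}.
    Assumption: a missing prediction counts as wrong.
    """
    if not gold:
        return 0.0
    correct = sum(
        1 for qid, answer in gold.items()
        if set(predictions.get(qid, [])) == set(answer)
    )
    return correct / len(gold)
```

The two accuracies are only comparable if both systems are scored on the same sampled questions, which is what the shared seed is for.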
