[ICML 2025] Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

题目：Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

来源：ICML 2025

作者：宾夕法尼亚州立大学

摘要

近年来，基于大模型的多智能体系统（Multi-agent System）受到了广泛的关注。Multi-agent system 通常采用一种迭代式开发模式，即在特定场景中执行失败时再进行调整和优化。在这个环节中，如何进行故障归因（Failure Attribution）是提升 Multi-agent system 性能的关键一步。

现有的 Failure Attribution 依旧是一项人力工程，依赖于领域专家经验。先前的研究大多聚焦于提供一个 benchmark，给出多项指标来评估 multi-agent system，从而给予开发人员启发，然而还是没有解决最关键的问题：哪个组件是最需要提升的？文章认为评估不是终点，最终目的是提升 Multi-agent system。所以文章提出了一个全新的研究问题：如何自动化对 LLM multi-agent system 进行故障归因，目标是找出对故障负责的 Agent 和 Step。

文章构建了一个 Who&When benchmark，包含了手工构造和算法构造的两个数据集，记录了来自 127 个multi-agent system 的日志数据，每个样本都标注了对故障负责的 agent，step，以及 reason。基于此 benchmark，文章设计了基于 LLM 的三种故障归因方法，结果显示使用 LLM 进行故障归因仍然具有很大的挑战性，比如在手工构造的 multi-agent system 上识别故障 step 的准确率仅有 8.77%。

问题建模

文章首先将多智能体的推理过程建模为一个马尔可夫决策过程：

\[ M = \langle N, S, A, P, \phi \rangle \]

\(S\) 是一系列可能的状态，\(A\) 是所有可能的动作。每个 agent \(i\in N\) 的动作来源于动作集合 \(A_i \subseteq A\)。\(\phi(t)\) 代表在时间 t 激活某个 agent，这个 agent 会采取动作 \(a_t\)。\(P(s_{(t+1)}|s_t,a_t,\phi(t))\)是状态转移概率。最终，我们会得到一个 multi-agent system 的执行轨迹 \(\tau=(s_0, a_0, s_1, a_1,...,s_T)\)

执行轨迹的结果表示为：

\[ z(\tau)= \begin{cases} 1,& if\ the\ system\ ultimately\ fails\\ 0,& otherwise \end{cases} \]

当 \(Z(\tau)=1\)时，系统发生了故障，我们替换时刻 \(t\) 的 agent \(i\) 的动作 \(a_i\) 为正确的动作 \(\hat{a}_t\)，这个过程会产生一个全新的执行轨迹：

\[ \tau^{(i,t)} = \mathcal{I}_{(i,t)}(\tau) \]

\(\mathcal{I}_{(i,t)}\) 代表人工干预，如果 \(Z(\tau^{(i,t)})\)=0，表示执行轨迹修改成功了，这个过程可以被定义为：

\[ \triangle_{i,t}(\tau) = \begin{cases} 1,& if\ Z(\tau)=1\ and\ Z(\tau^{(i,t)})=0\\ 0,& otherwise \end{cases} \]

当 \(\triangle_{i,t}=1\) 时，表示修复成功，则此时的 agent 和 step \((i, t)\) 即为根因 \((i^*, t^*)\)，如果同时同时有多个 step 都与故障相关，则选择时刻最早的那个 step。整个过程的目标函数为：

\[ \begin{aligned} C(\tau) = \{(i,t)|\triangle_{i,t}(\tau)=1\},\\ (i^*, t^*)=arg \min\limits_{(i,t)\in C(\tau)} t \end{aligned} \]

目标即找到根因 agent 和 step：\((i^*, t^*)\)

Which&When 数据集

数据集收集了来自 127 个 multi-agent system 的故障日志，每一项包含如下几个部分：

Query：用户定义的任务
Failure log：multi-agent的执行日志
Agentic system information：multi-agent system 的基础信息
Annotations：根因 agent，step，以及 reason

{
    "is_correct": false,
    "question": "This spreadsheet contains a list of clients for a retractable awning company. Each client has ordered a new awning for the back of their house within the last 90 days. The company makes different designs depending on whether the awning is made to block sunrises or sunsets. In this region, houses with odd-numbered street addresses face east, and houses with even-numbered street addresses face west. How many of these clients will be receiving the sunset awning design?",
    "question_ID": "4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2",
    "level": "2",
    "ground_truth": "8",
    "history": [
        {
            "content": "You are given: (1) a task and advises from your manager with a specific plan and (2) a general task.\nCollect information from the general task, follow the suggestions from manager to solve the task.\n\n# General Task\nThis spreadsheet contains a list of clients for a retractable awning company. Each client has ordered a new awning for the back of their house within the last 90 days. The company makes different designs depending on whether the awning is made to block sunrises or sunsets. In this region, houses with odd-numbered street addresses face east, and houses with even-numbered street addresses face west. How many of these clients will be receiving the sunset awning design? Please solve the problem carefully.The attached file path is: ../2023/validation/4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2.xlsx.\n\n# Task and suggestions from manager\n## Task description\nAnalyze the provided spreadsheet to determine how many clients will be receiving the sunset awning design. In this region, houses with odd-numbered street addresses face east, and houses with even-numbered street addresses face west. The sunset awning design is for houses facing west.\n\n## Plan for solving the task\n1. Load the provided Excel file and read the client data.\n2. Identify the street addresses of the clients.\n3. Determine which addresses are even-numbered (facing west).\n4. Count the number of clients with even-numbered addresses to find out how many will be receiving the sunset awning design.\n\n## Output format\nThe number of clients receiving the sunset awning design.\n\n## Constraints and conditions for completion\n- The analysis should be based on the street addresses provided in the Excel file.\n- Ensure accurate identification of even-numbered addresses.\n\n## Results and reason from last response\nFile path: ../2023/validation/4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2.xlsx\n\nCode block for correct loading and processing:\n```python\nimport pandas as pd\n\n# Load the Excel file\nfile_path = \"../2023/validation/4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2.xlsx\"\ndata = pd.read_excel(file_path)\n\n# Print the column names to identify the Street Address column\nprint(data.columns)\n\n# Extract the street numbers and determine which ones are even\n# Assuming the column name is 'Street Address'\ndata['Street Number'] = data['Street Address'].str.extract(r'(\\d+)').astype(int)\neven_address_clients = data[data['Street Number'] % 2 == 0]\n\n# Count the number of clients with even-numbered addresses\nnum_even_address_clients = len(even_address_clients)\n\nprint(num_even_address_clients)\n``` ",
            "role": "assistant",
            "name": "Excel_Expert"
        },
        {
            "content": "exitcode: 0 (execution succeeded)\nCode output: Index(['ID', 'First Name', 'Last Name', 'Street Address', 'City', 'Zip',\n       'Phone', 'Email', 'Carrier Pigeon'],\n      dtype='object')\n4\n",
            "name": "Computer_terminal",
            "role": "user"
        },
        {
            "content": "We have verified that the column name for the street address is 'Street Address'. Following the business logic provided, we need to ensure we correctly identify the number of clients with even-numbered street addresses who will be receiving the sunset awning design.\n\nLet's break down and verify the code step-by-step:\n\n1. Load the provided Excel file and read the client data.\n2. Print column names for verification.\n3. Extract the street numbers from 'Street Address'.\n4. Identify even-numbered addresses.\n5. Count the number of clients with even-numbered addresses.\n\nHere is the correct Python code for the complete verification:\n\n```python\nimport pandas as pd\n\n# Load the Excel file\nfile_path = \"../2023/validation/4d51c4bf-4b0e-4f3d-897b-3f6687a7d9f2.xlsx\"\ndata = pd.read_excel(file_path)\n\n# Extract the street numbers and determine which ones are even\n# Assuming the column name is 'Street Address'\ndata['Street Number'] = data['Street Address'].str.extract(r'(\\d+)').astype(int)\neven_address_clients = data[data['Street Number'] % 2 == 0]\n\n# Count the number of clients with even-numbered addresses\nnum_even_address_clients = len(even_address_clients)\n\n# Output the result\nprint(num_even_address_clients)\n```\n\nLet's run the above script and verify the result.",
            "name": "BusinessLogic_Expert",
            "role": "user"
        },
        {
            "content": "exitcode: 0 (execution succeeded)\nCode output: 4\n",
            "name": "Computer_terminal",
            "role": "user"
        },
        {
            "content": "The code has successfully executed and returned the result.\n\n**Verification**:\n- The column name for street addresses ('Street Address') has been correctly identified.\n- Street numbers have been accurately extracted from the 'Street Address' column.\n- Clients with even-numbered street addresses have been identified.\n- The final count of clients with even-numbered street addresses (who will be receiving the sunset awning design) is 4.\n\nThe result has been confirmed to be correct and verifiable based on the provided data and the executed code.\n\n**Output**: \nThe number of clients receiving the sunset awning design is **4**.\n\nIf both of you are satisfied with this result, I can conclude the task.",
            "name": "DataVerification_Expert",
            "role": "user"
        },
        {
            "content": "TERMINATE",
            "name": "DataVerification_Expert",
            "role": "user"
        }
    ],
    "mistake_agent": "Excel_Expert",
    "mistake_step": "0",
    "mistake_reason": "The code failed to handle edge cases in the 'Street Address' data, leading to an incomplete and inaccurate count of even-numbered addresses.",
    "system_prompt": {
        "Excel_Expert": "## Your role\nExcel_Expert: Expert in analyzing and processing data from Excel files.\n\n## Task and skill instructions\n- Task Description: You will be responsible for scrutinizing, interpreting, and managing data from Excel files. The role requires a deep understanding of various functionalities within Excel including but not limited to pivot tables, data visualization, data cleaning, and advanced formulae.\n- Skill Description: You possess exceptional skills in Excel, enabling you to manipulate, analyze, and derive meaningful insights from complex datasets. You have the ability to automate repetitive tasks using VBA (Visual Basic for Applications) and generate comprehensive reports. Your expertise includes proficiency in handling large datasets, performing statistical analysis, and utilizing Excel's extensive array of tools to streamline data processing.\n",
        "DataVerification_Expert": "## Your role\nDataVerification_Expert specializes in employing Bing Search API for powerful online search capabilities. With an adept skill set in navigating, parsing, and analyzing search results, this expert seamlessly extracts precise information from an array of text content. A critical portion of their role involves vigilantly verifying the accuracy of this information, ensuring reliable and trustworthy data usage.\n\n## Task and skill instructions\n- You will utilize the Bing Search API to execute specialized search queries, combing through the vast data available on the internet. \n- Your skill in parsing and analyzing search results will be crucial in distilling the vast array of information to the most relevant details. \n- When it comes to extracting specific information from text content, your ability to pinpoint and retrieve exact data points is paramount.\n- An essential component of your work is the verification of the information's accuracy, which necessitates a keen eye for detail and a robust methodology for cross-checking facts.\n\nWith your combination of technical prowess in using Bing Search API and a meticulous approach to data verification, your role as DataVerification_Expert is essential for any operation that demands high accuracy and reliability in information gathering.\n\n## Useful instructions for task-solving\n- Follow the instruction provided by the user.\n- Solve the task step by step if you need to.\n- If a plan is not provided, explain your plan first.\n- If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.\n- When you find an answer, verify the answer carefully. \n- Include verifiable evidence in your response if possible.\n    \n## How to use code?\n- Suggest python code (in a python coding block) or shell script (in a sh coding block) for the Computer_terminal to execute.\n- When using code, you must indicate the script type in the code block.\n- Do not suggest incomplete code which requires users to modify.\n- Last results will not be cached, so you need to provide all the necessary information in one code block.\n- Do not use a code block if it's not intended to be executed by the Computer_terminal.\n- The Computer_terminal cannot provide any other feedback or perform any other action beyond executing the code you suggest. \n- The Computer_terminal can't modify your code.\n- Use 'print' function for the output when relevant. \n- Check the execution result returned by the user.\n- Do not ask users to copy and paste the result.\n- If the result indicates there is an error, fix the error and output the code again. \n- If you want the Computer_terminal to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line. ",
        "BusinessLogic_Expert": "## Your role\nBusinessLogic_Expert is an expert in understanding and applying business rules to data. This role involves comprehending complex business requirements and translating them into actionable data logic that aligns with organizational goals.\n\n## Task and skill instructions\n- Your task involves analyzing business requirements and interpreting them into specific business rules that can be implemented within a data framework.\n- You must possess a deep understanding of business processes and how they relate to data management and manipulation. The ability to apply these rules accurately to data sets is crucial for ensuring the integrity and efficiency of business operations.\n- Key skills include critical thinking, attention to detail, problem-solving, and the ability to work with various data tools and platforms. You should be adept at both written and verbal communication to effectively convey business logic to stakeholders and team members.\n- Additional knowledge in SQL, database design, and data analysis techniques will further enhance your capability in this role. \n\nBy leveraging your expertise, you will contribute significantly to optimizing business operations through precise and effective data logic application."
    }
}

文章定义的指标包括：

Agent-level Accuracy：命中根因 agent 的比例
Step-level Accuracy：命中根因 step 的比例
Step-level Accuracy with Tolerance：命中根因 step 一定范围内的比例

方法

文章提出了三个 baselines 用于解决 multi-agent system 的故障归因：

All-at-once：输入全部的故障日志以及query，直接给出根因 agent 和 step
Step-by-step：给定一个查询，然后逐个输入故障日志，LLM 会判断当前输入的日志是否有故障，如果有，则直接返回这个日志所属的 agent 以及此时的 step；否则，继续输入故障日志
Binary Search：基于二分查找算法，首先输入全部日志，让 LLM 判断故障日志在上半段还是下半段，直到无法划分，算法如下：

评估

文章对数据集进行了进一步的处理，分为了两类：

Ground Truth: 进行诊断的 LLM 知道正确轨迹的执行结果，比如用户需要 multi-agent system 计算一个表达式，multi-agent system 发生了 failure，则在debug时告诉诊断LLM 真正的表达式结果是什么
Without Ground Truth: LLM 无法知道正确的执行结果

模型采用 GPT-4o

性能实验

从上图中可以得出如下结论：

在 Agent-level Accuracy 上，All-at-once 表现最好，原因在于定位根因 agent 一般需要全局信息
在 Step-level Accuracy 上，Step-by-step 表现最好，原因在于step-level 过于细粒度，All-at-once 在超长的日志中找到根因日志犹如大海捞针
Binary Search 方法的性能居于两者之间
Ground Truth 对性能有较大帮助

一致性实验

文章将 GPT-4o 换成了几个开源和闭源模型，发现实验结论基本和性能实验一致，说明性能实验的结果在不同模型上是一致的

### 上下文长度的影响

文章也探究了不同日志长度对性能的影响，这个日志长度应该是指一共执行了多少步。如上图所示：

日志长度越长，故障归因性能越差
Step-level Accuracy 受日志长度影响最严重
Step-by-step 定位 step 的 distance （差距）比较小

不同 Tolerance 下的 Step-level Accuracy

第三个指标是 Step-level Accuracy with Tolerance，这个指标衡量的是定位到根因 step 一定范围内的能力，可以看到：

All-at-once 的性能随着 tolerance 的提高有着明显的提升
在 tolerance=0 or 1时，Step-by-step 的准确率最高，说明 Step-by-step 有着较高的 precision

能否组合不同方法

文章尝试组合不同方法，如上表所示，组合后 token 消耗明显增多，但是性能得到了极大的提高。