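Check it out, n8n has an update! Build your own LLM evaluation framework with n8n

Published by qimuai on 2025-12-16 22:03. First-hand compilation.

Source: https://blog.n8n.io/llm-evaluation-framework/

Summary:

Say goodbye to "guesswork tuning": a step-by-step guide to building an AI workflow evaluation framework with n8n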

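In generative AI application development, developers share a common pain point: a small change to a prompt, a model, or a workflow node can turn previously stable output into a mess. This mix of "delight" and "frustration" caused by AI non-determinism highlights the risk of deploying AI systems on experience alone. A repeatable, quantifiable evaluation framework has therefore become an engineering necessity for reliable AI workflows.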

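Why build an AI evaluation framework?
The core value of an evaluation framework is that it moves development from relying on guesswork to relying on evidence. Its main advantages fall into four areas:

Deploy with confidence: As a dedicated testing path, it runs tests against a consistent dataset to ensure long-term reliability and high-quality output, helping teams catch regressions before users do and handle edge cases gracefully.
Validate changes objectively: Did a prompt tweak actually improve results, or did it only change the writing style? An evaluation framework provides objective evidence in place of subjective judgment and shows clearly whether each change brought an improvement or introduced a new problem.
Experiment and iterate faster: An evaluation framework provides a safe sandbox in which teams can make radical changes to logic or prompts without affecting real users and get quantified feedback immediately, which supports rapid A/B testing and iteration.
Data-driven model decisions: As new models keep appearing, the framework supports quick side-by-side comparison. Teams can use it to decide whether to switch models for speed or cost-efficiency, or whether a smaller model (such as Gemini Flash Lite) already matches a larger model on a specific task.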

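Why use n8n to build an evaluation framework?
n8n treats evaluation as a continuous, workflow-native practice rather than a one-off benchmark. With built-in and custom metrics, automated feedback loops, and ongoing monitoring, it directly supports iterative improvement and production reliability. Its core advantages are:

Straightforward implementation on the canvas: There is no need to configure complex APIs, logging libraries, or external infrastructure. You build the evaluation flow by connecting nodes in the visual drag-and-drop interface, which greatly lowers the barrier to learning and implementation.
A dedicated workflow path: The evaluation sequence is designed as a dedicated branch within the existing workflow, which achieves separation of concerns. Developers can run tests independently of production triggers (such as webhooks or schedules) and focus on analysis without disturbing the logic that serves end users.
Highly customizable inputs and metrics: The framework can run the workflow against a variety of test inputs and observe the outputs. Crucially, the evaluation metrics are fully customizable and can measure anything relevant to a specific use case, such as output correctness, safety, tool-calling accuracy, and deterministic metrics like execution time and token count.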

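Core evaluation methods and practices
In n8n, evaluation combines context-aware qualitative judgment with quantifiable hard data. The main methods are:

LLM-as-a-Judge: For open-ended tasks (such as creative writing or summarization), this method uses a highly capable model (such as GPT-5 or Claude 4.5 Sonnet) to assess the quality of the target model's output. n8n's Evaluation node has built-in AI-based scoring options such as Correctness and Helpfulness, and also supports fully custom evaluation criteria.
Evaluating complex agent workflows: For workflows that use Retrieval-Augmented Generation or depend on tool calls, the whole system needs to be evaluated. n8n provides built-in metrics such as Tools Used, which can be combined with the AI-based Correctness metric to verify that answers are consistent with the facts in the documents.
Quantitative metrics: As a complement to qualitative evaluation, these provide unambiguous numeric data points, such as automatically tracked token count and execution time, the Categorization metric for checking matches in classification tasks, and string similarity for detecting formatting errors.
Real-time policy and safety evaluation: The Guardrails node validates user input or AI output in real time within the workflow, enforces safety, security, and content policies, and either routes violating content to a Fail branch or sanitizes it.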

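Hands-on case: building an evaluation framework for a sentiment analysis workflow
To demonstrate n8n's evaluation capabilities concretely, we built a workflow that performs sentiment analysis on inbound emails (Positive/Neutral/Negative) and routes them to the appropriate sales team. To make sure the classification is accurate, we built an evaluation workflow for it, with the aim of testing tricky cases and finding the cheapest model that performs the task reliably.

Implementation steps:

Establish the ground truth: Use n8n's Data Table feature to create a dataset of 10 test cases. The cases are deliberately designed to include tricky scenarios such as "competitor frustration" (actually a potential sales opportunity), sarcasm, and mixed signals, in order to stress-test the models.
Create the evaluation workflow: Fetch the records from the data table, loop over them, and send the data to the Sentiment Analysis node. The key step is to use the Check if Evaluating node to split the workflow into an evaluation path and a production path. On the evaluation path, use the Set Outputs option of the Evaluation node to write the analysis result back to the data table.
Compute metrics: Use the Set Metrics option of the Evaluation node and select the built-in Categorization metric, which compares the expected category with the actual output (1 for a match, 0 for a mismatch) and gives an at-a-glance view of performance.
Run the tests and decide: Run the tests on the canvas or from the dedicated Evaluations tab. In a comparison of Gemini 3 Pro, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite, all three models handled every tricky case perfectly, but performance differed significantly: Gemini 2.5 Flash Lite took only 650 milliseconds, making it the fastest and cheapest option and the best choice for this task. This shows the value of an evaluation framework for optimizing cost and performance.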

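Best practices for building an evaluation framework
To keep evaluations reliable and scalable, follow these principles:

Separate the logic: Always use the Check if Evaluating node to keep testing logic fully separate from production actions, which avoids "test pollution" and ensures metrics are computed only during test runs.
Curate a "Golden Dataset": Evaluation quality depends on data quality. The dataset should contain real-world edge cases, previous failure points, and adversarial inputs, and should be updated as new problems are found in production.
Combine qualitative and quantitative metrics: Avoid relying on a single metric. Pair deterministic metrics (such as execution time) with qualitative evaluation (such as LLM-as-a-Judge) to get the full picture.
Isolate test variables: When comparing performance, change only one variable at a time (for example, fix the prompt and test different models, or fix the model and adjust the prompt) so that the cause of any change is clear.
Keep a human in the loop: Periodically audit the decisions of the LLM judge, especially for subjective metrics such as Helpfulness, and refine the judge's own system prompt when necessary.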

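Closing thoughts
By building an evaluation framework in n8n, teams can turn the "unpredictable chaos" of AI output into a structured, engineering-grade process. The system lets developers catch regressions before production, quantify the impact of every prompt tweak, and compare models objectively to optimize cost and speed. It is both a safety net and a source of the confidence needed to innovate faster and keep AI agents behaving as expected. A good way to begin is to build a small test dataset and expand from there.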


English source:

If you’ve ever built an application powered by Generative AI, you know the feeling: one small change to a prompt, a model swap, or a slight tweak to a node can turn a perfectly functional workflow into an unpredictable mess. Unlike deterministic code, AI outputs introduce an element of delightful, yet frustrating, chaos.
This unpredictability is exactly why you can't just rely on guesswork when deploying AI. You need a dedicated, repeatable testing mechanism: an LLM evaluation framework.
In this hands-on tutorial, we'll guide you through the why and how of creating a low-code AI Evaluation Framework using n8n. You'll learn the key concepts, understand techniques such as “LLM-as-a-Judge”, and build a custom evaluation path that ensures you can deploy updates, test new models, and maintain quality with total confidence.
Why do you need an evaluation framework for your AI workflows?
An evaluation framework is the foundational practice that shifts your development process from relying on guesswork to relying on concrete, measurable evidence. Here are the four main reasons to build an evaluation framework into your workflows:

Deploy with confidence: An evaluation framework acts as a dedicated testing path. By running tests against a consistent dataset, you ensure long-term reliability and high-quality outputs. This allows you to catch regressions or new issues before your end users do. It also lets you test against "edge cases" to ensure your system handles unexpected data gracefully.
Validate changes objectively: When you tweak a prompt, did it actually improve the output, or did it just change the writing style? Without a framework, the answer is purely subjective. Evaluations give you evidence. You can definitively see if a prompt tweak or a fix for a specific error actually improved results or if it introduced new problems.
Experiment and iterate faster: Fear of breaking production often slows down experimentation. Evaluations provide a safe sandbox. You can test radical changes to your logic or prompts and see the quantified impact immediately without affecting real users. This allows for rapid A/B testing, like comparing two different system prompts, to see which scores better against your benchmarks.
Make data-driven decisions on models: New models are released constantly. An evaluation framework lets you quickly compare them. You can decide if switching to a new model makes sense for speed or cost-efficiency, or if a smaller model (like a Gemini Flash Lite) can perform just as well as a larger one for your specific task.
Why use n8n for LLM evaluation?
We’ll use n8n as an example for building your own LLM evaluation framework because it treats evaluation as a continuous, workflow-native practice rather than a one-off benchmark. With built-in and custom metrics, automated feedback loops, and ongoing monitoring, it shows how evaluation can directly support iterative improvement and production reliability.
Here are the key aspects of n8n’s flexible evaluation approach:
1. Straightforward implementation on the canvas
  Traditional monitoring and testing tools, such as LangSmith, often come with a steep learning curve that requires configuring APIs, logging libraries, and external infrastructure. n8n eliminates this friction by bringing AI evaluation directly to its canvas.
  This approach allows for straightforward and less error-prone implementations, relying on the visual, drag-and-drop interface you know and trust. You don't need to write custom Python scripts; you just need to connect nodes.
2. Evaluation as a dedicated workflow path
  In n8n, an evaluation sequence is structured as a dedicated path within your existing workflow. This design is crucial because it ensures:
Separation of concerns: You can execute the testing sequence separately from your production triggers (like webhooks or schedules).
Focus on iteration: Developers can focus purely on testing, analysis, and metric calculation without disrupting the production logic that serves end-users.
3. Customizable inputs and metrics
  The framework is highly flexible, allowing you to run a range of test inputs against your workflow and observe the outputs. Critically, the metrics applied are completely customizable. You can measure anything relevant to your specific use case, for example:
Output correctness: Is the generated answer factually accurate based on predefined criteria?
Safety and fairness: The presence of toxicity, bias, or alignment with safety guidelines.
Tool calling: Whether the AI agent correctly invoked the right external tool or function (essential for complex agents).
Deterministic metrics: Efficiency measurements like execution time or token count.
This collective data is essential for analyzing the effect of specific changes (e.g., swapping a model, modifying a system prompt) and comparing overall performance over time.
Key AI evaluation metrics and methods you can implement with n8n
LLM evaluation requires a nuanced approach, combining qualitative, context-aware assessments with quantifiable hard data. The flexibility of n8n allows you to implement both, often just by configuring the dedicated Evaluation node.
Here are the key methods you can deploy directly on your canvas:
1. LLM-as-a-Judge (the gold standard for open-ended tasks)
  This is the standard approach for open-ended tasks where traditional metrics fail (e.g., creative writing or summarization). It involves using a highly capable model (like GPT-5 or Claude 4.5 Sonnet) to evaluate the quality of outputs generated by a target model (often a smaller, more efficient model).
  How to implement it in n8n:
  You no longer need to manually configure a "Judge" LLM and parse JSON responses. Instead, open the Evaluation node and select one of the AI-based metrics:
Correctness (AI-based): automatically scores (1-5) whether the answer’s meaning is consistent with your reference answer.
Helpfulness (AI-based): scores (1-5) whether the response successfully addresses the initial query.
Custom Metrics: If you need to test for something specific, like "Did the AI adopt a pirate persona?", you can use the Custom Metrics option to define your own criteria.
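To make the idea behind LLM-as-a-Judge concrete, here is a minimal sketch of what a hand-rolled judge involves, which is the work the Evaluation node saves you. It is not n8n's implementation: the prompt wording, the function names, and the `fake_model` stand-in are all assumptions made for illustration, and a real setup would replace `fake_model` with a call to your judge model.

```python
import re
from typing import Callable

# Hypothetical grading prompt; the wording is an assumption, not n8n's.
JUDGE_TEMPLATE = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer from 1 (inconsistent) to 5 (fully consistent)."""


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template with one test case."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_score(reply: str) -> int:
    """Extract the first standalone digit from 1 to 5 in the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge reply: {reply!r}")
    return int(match.group(1))


def judge(question: str, reference: str, candidate: str,
          call_model: Callable[[str], str]) -> int:
    """Ask the judge model for a score and parse it."""
    prompt = build_judge_prompt(question, reference, candidate)
    return parse_score(call_model(prompt))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Score: 4"


print(judge("What is the capital of France?", "Paris", "It is Paris.", fake_model))
```

Parsing strictly and raising on a malformed reply is a deliberate choice here: a judge that silently defaults to some score would distort the averages you later compare.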
2. Evaluating complex agent workflows (RAG and tool use)
  If your workflow uses Retrieval-Augmented Generation (RAG) or relies on the LLM to call external tools, you need to evaluate the entire system, not just the final text generation.
  How to implement it in n8n:
Tool usage: Use the built-in Tools Used metric in the Evaluation node. This returns a score checking if the agent correctly triggered a tool call when expected.
RAG faithfulness: You can use the Correctness (AI-based) metric to verify that the generated answer aligns with the ground truth found in your documents.
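The article does not spell out how the Tools Used score is computed, so the following is only a sketch of one plausible rule: the fraction of expected tools that the agent actually called. The function name, the case-insensitive matching, and the tool names in the example are assumptions.

```python
def tools_used_score(expected_tools: list[str], called_tools: list[str]) -> float:
    """Return the fraction of expected tools that the agent actually called.

    Assumption: when no tool is expected, the score is 1.0 only if the
    agent also called nothing.
    """
    if not expected_tools:
        return 1.0 if not called_tools else 0.0
    called = {name.lower() for name in called_tools}
    hits = sum(1 for name in expected_tools if name.lower() in called)
    return hits / len(expected_tools)


# Hypothetical tool names for illustration.
print(tools_used_score(["search_docs", "calculator"], ["calculator"]))
```

A fractional score is more informative than a pass/fail flag when a test case expects several tools, because it separates "called nothing" from "missed one of three".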
3. Quantitative metrics
  These provide unambiguous data points that complement the qualitative assessments from the LLM-as-a-Judge.
  How to implement it in n8n:
Deterministic Metrics: These are tracked automatically by n8n’s evaluation process. Token Count is essential for tracking cost, and Execution Time is critical for monitoring user experience latency.
Categorization: Perfect for classification tasks (like our sentiment analysis example). It checks if the output exactly matches the expected class (returning 1 for a match, 0 for a miss).
String similarity: Measures the character-by-character distance between the result and the expectation. This is useful when you want to catch minor formatting errors or typos without penalizing the model for a valid answer.
Safety & performance: For metrics not yet built-in (like specific Toxicity checks or detailed execution latency), you can simply define them using the Custom Metrics feature within the Evaluation node.
Traditional ML Metrics: For structured tasks (like entity extraction), use the Custom Metrics feature to compare the output against a ground truth using classic metrics like Accuracy, Precision, Recall, or F1 Score.
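Two of the metrics above are easy to reason about once you see them written out. The sketch below shows a string similarity based on Levenshtein edit distance and a precision/recall/F1 calculation for an extraction task. Whether n8n normalizes its similarity score this way is not stated in the article, so treat the 0 to 1 scaling as an assumption; the entity names in the example are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def string_similarity(result: str, expected: str) -> float:
    """Scale edit distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(result), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(result, expected) / longest


def precision_recall_f1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Classic set-based metrics for tasks such as entity extraction."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(string_similarity("Negative", "negative"))
print(precision_recall_f1({"Acme", "Berlin", "2024"}, {"Acme", "Berlin"}))
```

The first call illustrates why string similarity is gentler than exact matching: a single capitalization difference lowers the score slightly instead of dropping it to zero.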
4. Policy and safety evaluation with the guardrails node
  For enforcing safety, security, and content policies in real-time, the Guardrails node is essential. You can use it to validate user input before sending it to an AI model, or to check the output from an AI model before it's used further in your workflow. This allows teams to validate AI responses in real-time, checking for content quality, safety, or custom rules before routing failures to fallback agents or human review.
  The node offers two primary operations:
Check text for violations: Any violation sends items to a “Fail” branch, which is ideal for evaluation where you want to halt the workflow on an issue.
Sanitize text: Detects and replaces violations like URLs, secret keys, or personally identifiable information (PII) with placeholders. This is useful for cleaning data within the workflow.
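The difference between the two operations can be sketched with a few regular expressions. This is an illustration of the check-versus-sanitize idea only, not how the Guardrails node works internally: the patterns, the placeholder labels, and the `sk-` key format are assumptions, and real guardrails cover far more cases.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = [
    ("[URL]", re.compile(r"https?://\S+")),
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[SECRET]", re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")),
]


def check_text(text: str) -> list[str]:
    """Return the labels of all violation types found (empty means pass)."""
    return [label for label, pattern in PATTERNS if pattern.search(text)]


def sanitize_text(text: str) -> str:
    """Replace every match with its placeholder and return the cleaned text."""
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text


message = "Mail jane@example.com or see https://example.com/docs now"
branch = "fail" if check_text(message) else "pass"
print(branch)
print(sanitize_text(message))
```

Checking suits evaluation runs, where you want a violation to stop the item; sanitizing suits production paths, where the item should continue with the sensitive parts masked.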
The true power of this approach lies in combination. You can set up a single Evaluation node to check Categorization (accuracy), Tools Used (logic), Helpfulness (quality), and safety together, which gives you an overall view of performance in one run.
How to build an LLM evaluation framework for a sentiment analysis workflow with n8n?
To illustrate the capabilities of n8n’s evaluation features, we are building a workflow that performs sentiment analysis on incoming emails, categorizes them as Positive, Neutral, or Negative, and routes them to the appropriate sales team.
We want to ensure the categorization works properly. To do that, we will build an evaluation workflow, feed it some tricky use cases, and compare different models to find the cheapest one that performs the task reliably. We will test Gemini 3 Pro, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. The latter is the most affordable option. If it can handle our workflow, that is a huge win.
One of the best parts of the n8n implementation is that it allows you to house the evaluation logic directly alongside your actual workflow.
Step 1: Setting up the ground truths using Data Tables
To run evaluations, we first need to establish test cases and ground truths. The new Data Table feature in n8n is perfect for this. Think of it as a database table that lives directly in n8n, which workflows can read from and write to.
For this example, I created 10 test cases. The goal is to find the smallest (and cheapest!) model that performs correctly, guiding us to tweak the system prompt for perfect accuracy.
To stress-test the models, the test cases are tricky in a few ways:
Competitor frustration: The text might list frustrations with a competitor’s solution. While it contains negative words, the intent is actually positive for us (they want to switch). Traditional ML often fails here, but LLMs should catch the nuance. We want to see if the smaller Flash Lite model can match the accuracy of Gemini 3 Pro on this.
Sarcasm: Phrases like "I was thrilled to see my project pipeline freeze for six hours yesterday" should be classified as Negative. Only a capable LLM will catch this tone.
Mixed signals: Combining a small compliment with a major complaint. This should be classified as Negative overall.
Once the data table is set up, it looks like this:
The expected column contains our ground truth, and the result column starts empty; this is where we will store the latest evaluation output.
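If it helps to picture the table as data, here is a rough sketch of the same structure in code. The `expected` and `result` columns come from the description above; the `email_text` column name and the second and third rows are invented for illustration (only the sarcasm sentence is taken from this article).

```python
from dataclasses import dataclass

ALLOWED_LABELS = {"Positive", "Neutral", "Negative"}


@dataclass
class EvalCase:
    email_text: str   # hypothetical name for the input column
    expected: str     # ground-truth label
    result: str = ""  # starts empty; the evaluation run writes here


def validate(cases: list[EvalCase]) -> None:
    """Fail fast if a ground-truth label is misspelled."""
    for index, case in enumerate(cases):
        if case.expected not in ALLOWED_LABELS:
            raise ValueError(f"row {index}: unknown label {case.expected!r}")


test_cases = [
    # Sarcasm example quoted in the article.
    EvalCase("I was thrilled to see my project pipeline freeze for six hours yesterday", "Negative"),
    # Invented rows in the spirit of the competitor-frustration and mixed-signal cases.
    EvalCase("Our current vendor keeps letting us down, so we would like to switch to you.", "Positive"),
    EvalCase("The interface looks nice, but exports have been broken for a week.", "Negative"),
]

validate(test_cases)
print(len(test_cases), "cases ready")
```

Validating labels up front is worth the few lines: with an exact-match metric, a typo in the ground truth would otherwise show up as a model failure.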
Step 2: Creating the evaluation workflow
Now, let’s build the evaluation workflow. We start by fetching all records from the data table and looping over them.
Inside the loop, we pass the data to the Sentiment Analysis node, configured to categorize emails into three buckets: Positive, Neutral, or Negative.
Normally, the workflow forwards the email to the appropriate team based on the category. However, we don't want to send real emails during an evaluation. To solve this, we use the Check if Evaluating node. This splits the workflow into two paths: one for the active evaluation, and one for the normal production run.
On the evaluation path, we use the Set Outputs option of the Evaluation node. We select our "Sentiment Analysis Evaluation" table and map the output of the analysis node to the result column.
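The branching logic of this step can be summarized in a few lines. This is a conceptual sketch, not n8n code: `handle_email` and its callback parameters are hypothetical names, and the lambda classifier stands in for the Sentiment Analysis node.

```python
from typing import Callable


def handle_email(
    email_text: str,
    classify: Callable[[str], str],
    is_evaluating: bool,
    record_result: Callable[[str], None],
    forward_to_team: Callable[[str, str], None],
) -> str:
    """Classify one email, then take exactly one of the two paths."""
    category = classify(email_text)
    if is_evaluating:
        # Evaluation path: store the output, send nothing.
        record_result(category)
    else:
        # Production path: route the real email.
        forward_to_team(category, email_text)
    return category


recorded: list[str] = []
forwarded: list[tuple[str, str]] = []

handle_email(
    "test email",
    classify=lambda text: "Negative",  # stand-in for the model call
    is_evaluating=True,
    record_result=recorded.append,
    forward_to_team=lambda category, text: forwarded.append((category, text)),
)
print(recorded, forwarded)
```

The point of the split is that the classifier runs identically on both paths, so what you measure in evaluation is what production will execute; only the side effects differ.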
Step 3: Computing metrics
Computing metrics is vital to understanding performance at a glance. We do this using the Set Metrics option of the Evaluation node. We can select the built-in Categorization metric, which is designed specifically for this use case.
This metric simply compares the expected answer with the actual one. It returns a 0 for a mismatch or a 1 for a match, exactly what we need.
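Written out, the comparison described here is a one-liner, and averaging it over all test cases gives an accuracy figure. The sketch below uses strict string equality; whether n8n normalizes case or whitespace before comparing is not stated in the article, so that strictness is an assumption.

```python
def categorization(expected: str, actual: str) -> int:
    """Return 1 for an exact match and 0 for a mismatch."""
    return 1 if expected == actual else 0


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Mean categorization score over (expected, actual) pairs."""
    if not pairs:
        raise ValueError("no test cases to score")
    return sum(categorization(e, a) for e, a in pairs) / len(pairs)


# Hypothetical run: three of four outputs match the ground truth.
pairs = [
    ("Negative", "Negative"),
    ("Positive", "Positive"),
    ("Negative", "Neutral"),
    ("Neutral", "Neutral"),
]
print(accuracy(pairs))
```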
Step 4: Running the tests
Now we can run the evaluation directly from the canvas to test it. Alternatively, we can use the new Evaluations tab at the top of the canvas. Runs started here are saved, providing a visual chart of metrics over time.
We ran this evaluation three times, yielding valuable insights: All three models handled the test cases perfectly, despite the tricky edge cases. However, the performance differed significantly:
Gemini 3 Pro took over 30 seconds.
Gemini 2.5 Flash took about 1.6 seconds.
Gemini 2.5 Flash Lite finished in just 650 milliseconds.
This makes the decision a no-brainer: Gemini 2.5 Flash Lite is accurate enough for this task and is the fastest and cheapest option. This is the clear benefit of having an evaluation framework!
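The decision rule we applied can be stated explicitly: keep only the models that meet the accuracy bar, then take the fastest. The sketch below encodes the three runs reported above; the 30.0 figure is a lower bound because the text only says "over 30 seconds", and the dictionary layout and `pick_model` name are assumptions. Price is not modeled, since the article gives no cost figures.

```python
runs = [
    # "over 30 seconds" in the text, so 30.0 is a lower bound
    {"model": "Gemini 3 Pro", "accuracy": 1.0, "seconds": 30.0},
    {"model": "Gemini 2.5 Flash", "accuracy": 1.0, "seconds": 1.6},
    {"model": "Gemini 2.5 Flash Lite", "accuracy": 1.0, "seconds": 0.65},
]


def pick_model(results: list[dict], min_accuracy: float = 1.0) -> str:
    """Return the fastest model among those meeting the accuracy threshold."""
    eligible = [r for r in results if r["accuracy"] >= min_accuracy]
    if not eligible:
        raise ValueError("no model meets the accuracy threshold")
    return min(eligible, key=lambda r: r["seconds"])["model"]


print(pick_model(runs))
```

Filtering on accuracy before ranking on speed matters: ranking on speed alone would happily select a fast model that fails the tricky cases.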
Best practices for building your LLM evaluation framework in n8n
Building an LLM evaluation framework is as much about process as it is about the tools. Here are five best practices to ensure your evaluations remain reliable and scalable:
Always separate evaluation logic: Never mix testing logic with production actions. Always use the Check if Evaluating node to create a clean separation. This prevents "test pollution"—like sending 50 test emails to your sales team—and ensures your metrics are calculated only during actual test runs.
Curate a "Golden Dataset": Your evaluation is only as good as your data. Don't just generate random strings; build a Data Table containing real-world edge cases, previous failure points, and tricky adversarial inputs (like the sarcasm example). As you discover new failures in production, add them to this table to prevent future regressions.
Combine qualitative and quantitative metrics: Reliance on a single metric can be misleading. A model might be fast (low latency) but hallucinate facts (low correctness). Always pair deterministic metrics (like Execution Time or JSON Validity) with qualitative ones (like LLM-as-a-Judge) to get the full picture.
Isolate variables during testing: When comparing performance, change only one variable at a time. If you swap the model and change the prompt simultaneously, you won't know which change caused the improvement (or regression). Test a prompt change on the same model first, then test different models with that fixed prompt.
Keep human-in-the-loop for the "Judge": While "LLM-as-a-Judge" is powerful, it isn't infallible. Periodically audit the decisions made by your Judge node, especially for subjective metrics like "Helpfulness." If the Judge is consistently misinterpreting your criteria, you may need to refine its system prompt just as you would for your main agent.
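The "isolate variables" practice is easy to enforce mechanically if you record each run's configuration. As a small sketch, assuming you keep configurations as plain dictionaries (the keys and values below are hypothetical), you could refuse to compare two runs that differ in more than one setting:

```python
def changed_variables(baseline: dict, candidate: dict) -> list[str]:
    """List the configuration keys whose values differ between two runs."""
    keys = sorted(set(baseline) | set(candidate))
    return [key for key in keys if baseline.get(key) != candidate.get(key)]


def is_fair_comparison(baseline: dict, candidate: dict) -> bool:
    """A comparison is attributable only if exactly one variable changed."""
    return len(changed_variables(baseline, candidate)) == 1


# Hypothetical run configurations.
baseline = {"model": "Gemini 2.5 Flash", "prompt_version": "v1"}
model_swap = {"model": "Gemini 2.5 Flash Lite", "prompt_version": "v1"}
both_changed = {"model": "Gemini 2.5 Flash Lite", "prompt_version": "v2"}

print(changed_variables(baseline, model_swap), is_fair_comparison(baseline, model_swap))
print(changed_variables(baseline, both_changed), is_fair_comparison(baseline, both_changed))
```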
Wrap up
We have moved from the "delightful chaos" of unpredictable AI outputs to a structured, engineering-grade process. By building an evaluation framework directly in n8n, you have shifted from guessing to knowing.
You now have a system that allows you to:
Catch regressions before they hit production.
Quantify the impact of every prompt tweak.
Compare models objectively to optimize for cost and speed.
This framework is your safety net, allowing you to innovate faster and deploy with the confidence that your AI agents will perform exactly as expected.
What’s next?
Now that you understand the concepts, the best way to learn is to see these workflows in action. We highly recommend watching these tutorials from the community to deepen your understanding:
Beginner's Guide to Workflow Evaluation in n8n (Stop Guessing!) – A fantastic overview of why evaluation matters and how to set up your first "exam" for your AI.
The Beginner’s Guide to n8n Evaluations (Optimize Your AI Agents) – A deep dive into optimizing agents using the evaluation tools we discussed.
Evaluate Your RAG System with N8N – Essential viewing if you are building Retrieval-Augmented Generation workflows and need to test factual accuracy.
Start small, build your first test dataset, and happy automating!
