
A scalable framework for evaluating health language models

Published by qimuai | First-hand compilation



Source: https://research.google/blog/a-scalable-framework-for-evaluating-health-language-models/

Summary:

Google research team introduces a new evaluation framework that substantially improves the efficiency of evaluating health language models

On August 26, Google Research scientists Ahmed A. Metwally and Daniel McDuff published a study proposing a scalable evaluation framework for language models in the health domain. Built around a technique called Adaptive Precise Boolean rubrics, the framework addresses a persistent industry pain point: evaluating large language models on medical and health tasks is expensive, slow, and inconsistent.

Traditional evaluation relies heavily on expert human review, using Likert scales to rate open-ended answers, an approach that is subjective, yields low scoring consistency across raters, and is labor intensive. The Precise Boolean rubric technique instead decomposes complex evaluation criteria into a series of fine-grained questions that can each be answered Yes or No, markedly improving inter-rater reliability (measured by the intra-class correlation coefficient, ICC) while cutting evaluation time by more than 50%.

The framework offers three key advantages: first, boolean rubrics replace traditional scales, reducing subjective bias through binary judgments; second, an adaptive mechanism automatically selects the evaluation criteria relevant to each specific query; third, a large language model automates that selection, enabling a fully automated pipeline while preserving evaluation accuracy.

In tests on data from 141 study participants with metabolic conditions, the framework captured subtle differences in response quality and was far more sensitive than traditional methods at detecting the quality degradation caused by missing personalized health data. Notably, the research is for scientific validation only; the models involved are not approved medical devices.

The researchers note that although the framework was validated in the health domain, its design generalizes to other fields. Future work could refine it by expanding the range of user profiles and health scenarios covered, offering a standardized template for evaluating large language models in specialized domains.

(Note: the research team includes more than ten scientists from Google Research, among them Neil Mallinar and A. Ali Heydari; * denotes contributions made while at Google.)


English source:

A scalable framework for evaluating health language models
August 26, 2025
Ahmed A. Metwally and Daniel McDuff, Staff Research Scientists, Google Research
Evaluation of language models in complex domains (such as health) can be expensive and labor intensive. We present a new adaptive and precise rubric methodology that saves time and increases inter-rater reliability compared to existing protocols.
Large language models can be used to analyze and interpret complex data. Our previous work has shown how they can be used to generate useful, personalized responses when provided with user-specific health information that encompasses lifestyle, biomarkers, and context. Rigorous and efficient evaluation methodologies are crucial to ensure the accuracy, precision, relevance, and safety of responses. However, current evaluation practices heavily rely on human experts, meaning they are cost-prohibitive, labor-intensive, and not scalable. Furthermore, tasks involving human judgement often require careful design to avoid biases and low inter-rater consistency.
With the above in mind, in “A Scalable Framework for Evaluating Health Language Models”, we introduce an evaluation framework that aims to streamline human and automated evaluation of open questions. Our method helps identify critical gaps in model responses using a minimal set of targeted rubric questions that break complex, multi-faceted evaluation questions into granular evaluation targets that can be answered via simple boolean responses. Specifically, we introduce Adaptive Precise Boolean rubrics as a paradigm for scalable health evaluations. We hypothesized that a small set of granular, boolean (Yes/No) criteria would enhance consistency and efficiency in complex query evaluation. Existing work has demonstrated that "granularizing" complex evaluation criteria into a larger set of focused, boolean rubrics improves rater reliability for general-domain tasks like summarization and dialogue. Our work extends these frameworks by applying them to the health domain, accounting for user personalization with health data in both the LLM responses and the evaluations. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity.
Designing Adaptive Precise Boolean rubrics
We first used an iterative process to transform rubric criteria characterized by high-complexity response options (e.g., open-ended text or multi-point Likert scales) into a more granular set of rubric criteria employing binary response options (i.e., boolean “Yes” or “No”) — an approach we call Precise Boolean rubrics. The primary objective in developing the Precise Boolean rubrics was to enhance inter-rater reliability in annotation tasks and to generate a more robust and actionable evaluation signal, thereby facilitating programmatic interpretation and response refinement. The increased granularity afforded by the simple Yes/No format mitigates subjective interpretation and fosters more consistent evaluations, even with a larger number of total questions.
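To make the transformation concrete, the sketch below shows what one such decomposition might look like in code. The criterion text, sub-questions, and scoring rule are illustrative assumptions, not the paper's actual rubric items.

```python
# A minimal sketch of the Likert -> Precise Boolean decomposition.
# The criterion and sub-questions are illustrative, not the paper's rubric.

likert_criterion = {
    "id": "accuracy",
    "question": "Rate the factual accuracy of the response.",
    "scale": [1, 2, 3, 4, 5],  # 1 = very inaccurate ... 5 = very accurate
}

# The same evaluation target, granularized into boolean (Yes/No) items.
precise_boolean_criteria = [
    {"id": "accuracy.1", "question": "Are all stated lab-value ranges correct?"},
    {"id": "accuracy.2", "question": "Is the response consistent with the user's health data?"},
    {"id": "accuracy.3", "question": "Are all causal claims supported by the provided context?"},
]

def rubric_score(answers: dict[str, bool]) -> float:
    """Aggregate boolean answers into a single score: the fraction of 'Yes'."""
    return sum(answers.values()) / len(answers)

print(rubric_score({"accuracy.1": True, "accuracy.2": True, "accuracy.3": False}))  # ~0.67
```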
Due to the granular nature of our rubric design, the resulting Precise Boolean rubrics consisted of a substantially larger number of evaluation criteria compared to the starting Likert-scale rubrics. While auto-eval techniques are well equipped to handle the increased volume of evaluation criteria, the completion of the proposed Precise Boolean rubrics by human annotators was prohibitively resource intensive. To mitigate this burden, we refined the Precise Boolean approach to dynamically filter the extensive set of rubric questions, retaining only the most pertinent criteria, conditioned on the specific data being evaluated. This data-driven adaptation, referred to as the Adaptive Precise Boolean rubric, enabled a reduction in the number of evaluations required for each LLM response. This is because user queries and corresponding LLM outputs often exhibit a focused topicality, thus requiring evaluation against only the subset of rubric criteria relevant to those themes.
To convert the Precise Boolean rubrics to Adaptive Precise Boolean ones, we leveraged Gemini as a zero-shot rubric question classifier. Input to the LLM includes the user query, the corresponding LLM response under evaluation, and a specific rubric criterion. The LLM then outputs whether the criterion is relevant or not. To validate this adaptive approach, we established a ground-truth dataset through rubric question classification annotations provided by three medical experts, with majority voting employed to determine the consensus annotation. Rubrics adapted using this ground-truth dataset are referred to as Human-Adaptive Precise Boolean rubrics.
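A minimal sketch of this filtering step is shown below, assuming a generic `generate(prompt)` wrapper around an LLM API such as Gemini's; the prompt wording and helper names are hypothetical, since the post does not publish them.

```python
from collections import Counter

def generate(prompt: str) -> str:
    """Placeholder for a call to an LLM API (e.g., Gemini); assumed, not the paper's interface."""
    raise NotImplementedError

def is_relevant(query: str, response: str, criterion: str) -> bool:
    """Zero-shot relevance check for a single rubric criterion."""
    prompt = (
        "You are filtering rubric questions for an evaluation.\n"
        f"User query: {query}\n"
        f"Model response: {response}\n"
        f"Rubric criterion: {criterion}\n"
        "Is this criterion relevant to evaluating this response? Answer Yes or No."
    )
    return generate(prompt).strip().lower().startswith("yes")

def adaptive_rubric(query: str, response: str, criteria: list[str]) -> list[str]:
    """Keep only the criteria the classifier judges relevant to this query/response pair."""
    return [c for c in criteria if is_relevant(query, response, c)]

def majority_vote(labels: list[bool]) -> bool:
    """Consensus label from multiple expert annotations (used to build the ground truth)."""
    return Counter(labels).most_common(1)[0][0]
```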
Key results
Enhanced inter-rater agreement and reduced evaluation time
Current evaluation of LLMs in health often uses Likert scales. We compared this baseline to our data-driven Precise Boolean rubrics. Our results showed significantly higher inter-rater reliability using Precise Boolean rubrics, measured by intra-class correlation coefficients (ICC), compared to traditional Likert rubrics.
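For reference, ICC values of this kind can be computed with an off-the-shelf package such as `pingouin`; the long-format layout and toy ratings below are assumptions for illustration, not the study's analysis code.

```python
import pandas as pd
import pingouin as pg

# Long format: one row per (item, rater) pair; boolean answers coded as 0/1.
# The numbers are toy values for illustration only.
ratings = pd.DataFrame({
    "item":  [f"resp{i}" for i in range(1, 6) for _ in range(3)],
    "rater": ["a", "b", "c"] * 5,
    "score": [1, 1, 1,  0, 0, 1,  1, 1, 1,  0, 0, 0,  1, 0, 1],
})

icc = pg.intraclass_corr(data=ratings, targets="item", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])  # the ICC2 / ICC2k rows quantify absolute agreement
```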
A key advantage of our approach is its efficiency. The Adaptive Precise Boolean rubrics preserved the high inter-rater agreement of the full Precise Boolean rubric while reducing evaluation time by over 50%. This efficiency gain makes our method faster than even Likert scale evaluations, enhancing the scalability of LLM assessment. The fact that this also provides higher inter-rater reliability supports the argument that this simpler scoring also provides a higher quality signal.
Improved sensitivity to response quality
To test the efficacy of our rubrics, we investigated their sensitivity to variations in response quality. We systematically augmented user queries with increasing levels of contextual health data, hypothesizing that richer queries would elicit higher-quality LLM responses; the results supporting this hypothesis are discussed in detail below.
Average ratings from Likert scales showed limited sensitivity to these improvements in input context, particularly in automated evaluations. This suggests a lack of granularity in Likert scales for capturing subtle variations in response quality. In contrast, the average scores from our boolean rubrics showed a clear, positive correlation with the amount of user data provided, indicating a superior ability to measure incremental improvements in response quality.
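One way to quantify this sensitivity, sketched below with illustrative numbers, is to aggregate each response's boolean answers into a mean score and correlate it with the amount of context provided (the post does not specify which correlation statistic was used).

```python
from scipy.stats import spearmanr

# Context level of each query (0 = no personal data ... 3 = full data) and the
# mean boolean rubric score of its response. Values are illustrative only.
context_level = [0, 0, 1, 1, 2, 2, 3, 3]
mean_score    = [0.55, 0.60, 0.66, 0.70, 0.74, 0.78, 0.85, 0.88]

rho, p = spearmanr(context_level, mean_score)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")  # positive rho -> scores track added context
```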
Auto-Adaptive Precise Boolean rubrics
The Precise Boolean rubric framework is comprehensive, but for any given query, only a subset of its questions are relevant. We automated this filtering process by using Gemini as a zero-shot classifier to predict the relevance of individual rubric questions based on the input query and the LLM response. The classifier achieved an average accuracy of 0.77 and an F1 score of 0.83 in identifying relevant questions. We found that the Auto-Adaptive Boolean rubrics, using this automated filter, maintained an equivalent improvement in ICC and showed similar scoring trends as the Human-Adaptive Boolean rubrics. This suggests that an imperfect but effective automated classifier is sufficient to capture the essential evaluation signal. This finding is critical for building fully automated and scalable evaluation pipelines.
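Given the expert consensus labels, these classifier metrics are straightforward to reproduce; the sketch below uses scikit-learn with made-up labels purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Majority-vote expert labels vs. the zero-shot classifier's relevance
# predictions for a handful of rubric questions (illustrative values only).
expert    = [1, 1, 0, 1, 0, 1, 1, 0]
predicted = [1, 1, 1, 1, 0, 1, 0, 0]

print("accuracy:", accuracy_score(expert, predicted))
print("F1:", f1_score(expert, predicted))
```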
Superior identification of response quality gaps
To demonstrate robustness, we evaluated our framework's ability to detect flaws in LLM responses generated from real research participants’ data. We used de-identified data from the Wearables for Metabolic Health (WEAR-ME) study, a large-scale (n≈1500) research project that collected wearable, biomarker, and questionnaire data conducted with approval from an Institutional Review Board (IRB). All participants provided electronic informed consent and a specific HIPAA Authorization via the Google Health Studies app before enrollment, acknowledging that their de-identified data would be used for research purposes.
For this specific analysis, we selected 141 participants with confirmed metabolic conditions (e.g., Class III obesity, diabetes, hypercholesterolemia) to test the framework's sensitivity. For each participant, we prompted an LLM to answer health queries under two conditions: once with the participant's complete health data included, and once with a corrupted query from which that personalized health data had been removed.

We then scored both responses with the automated evaluation system, using the Likert rubric and the Precise Boolean rubric in turn. A larger positive delta (full-data score minus corrupted-data score) indicates that an evaluation framework more effectively detects the degradation in quality.

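The sketch below illustrates this paired-delta check with made-up scores; a one-sided Wilcoxon signed-rank test is one reasonable way to ask whether the deltas are reliably positive, though the post does not specify which statistical test was used.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-participant rubric scores under the two conditions (illustrative values).
full_data_scores      = np.array([0.85, 0.90, 0.78, 0.88, 0.92])
corrupted_data_scores = np.array([0.60, 0.72, 0.55, 0.70, 0.66])

deltas = full_data_scores - corrupted_data_scores  # positive -> degradation detected
stat, p = wilcoxon(deltas, alternative="greater")  # are deltas reliably > 0?
print(f"mean delta={deltas.mean():.2f}, p={p:.4f}")
```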
The Precise Boolean framework consistently produced large, positive deltas, reliably detecting that the corrupted responses were of lower quality. In contrast, the deltas from the Likert rubric were small and inconsistent, failing to reliably identify the lower-quality responses. These results demonstrate that the Precise Boolean framework is more sensitive to the inclusion of personal data and is a more robust tool for automated evaluation pipelines.

Conclusions and future directions
Our results show that Adaptive Precise Boolean rubrics improve inter-rater reliability, cut evaluation time by more than half, and detect differences in response quality more reliably than Likert-scale evaluation.

This approach represents a significant step toward scalable, streamlined evaluation of LLMs in specialized domains. While LLMs show promise for health applications, this work highlights the pressing need for robust evaluation methods; it does not imply that the models discussed are approved medical devices.

Our framework is domain-agnostic and can extend beyond health and personalization. The health setting was used for demonstration and research purposes only, and this work is not associated with any specific product or service. The LLMs discussed were used in a controlled research environment, and any real-world health application would require independent validation and potentially regulatory review. The approach also has limitations: in some contexts, the nuanced scores of a Likert scale remain valuable. Future work could extend these results by incorporating richer user profiles and additional health domains. In addition, using LLMs to automate the generation of the initial boolean questions could make the framework more scalable from the outset.

Acknowledgements
The following researchers contributed to this work: Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, Javier L. Prieto, Daniel McDuff, and Ahmed A. Metwally. (* denotes work done while at Google)