Exploring the feasibility of conversational diagnostic AI in a real-world clinical study

Posted by qimuai on 2026-3-12 08:01. First-hand compilation.

Source: https://research.google/blog/exploring-the-feasibility-of-conversational-diagnostic-ai-in-a-real-world-clinical-study/

Summary:

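Google publishes its first clinical study of AMIE: conversational AI for diagnostic support shows promise in a real-world care setting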

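On March 11, 2026, Mike Schaekermann, Research Lead at Google Research, and Alan Karthikesalingam, Director and Principal Scientist at Google DeepMind, published the results of a first-of-its-kind clinical study. Conducted by Google in partnership with Beth Israel Deaconess Medical Center, the study is the first prospective evaluation of AMIE, Google's conversational medical AI system, in a real clinical environment.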

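The research team notes that AI systems capable of clinical reasoning and dialogue could substantially widen access to medical care while easing the burden on physicians. Bringing such technology into the clinic, however, requires a safety-centric, evidence-based approach. AMIE had previously shown diagnostic potential in simulated settings, but validation within real clinical workflows remained a key step.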

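The study, titled “A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic”, focuses on AMIE's use in a real outpatient setting. It used a pre-registered, IRB-approved, prospective single-arm design and enrolled 100 adult patients. Before their scheduled appointment, patients chatted with AMIE by text through a secure web link so that it could collect their medical history. Every conversation was supervised by a physician via live video and screen sharing to ensure safety.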

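The results show that deploying AMIE under supervision for history taking is feasible and raised no safety concerns: across all conversations, the human supervisors did not trigger a single safety stop. To assess clinical capability, an independent panel of clinical evaluators rated, in a blinded manner, the differential diagnoses and management plans produced by AMIE and by primary care physicians. The two were broadly similar in diagnostic quality and in the appropriateness and safety of management plans, but physicians performed better on the practicality and cost effectiveness of management plans.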

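In addition, AMIE's top 7 differential diagnoses included the final diagnosis in 90% of cases, and its top-ranked diagnosis matched the final diagnosis in 56% of cases. Patient surveys showed that attitudes toward AI became significantly more positive after interacting with AMIE, and patients found the AI polite and clear in explaining medical conditions. The treating physicians also reported that the AI-generated pre-visit summaries helped them prepare for visits and shifted the conversation with patients toward verifying information and shared decision-making.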

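The team also points out the current limitations: the system supports text interaction only and does not incorporate voice, video, or other multimodal clinical information; the study had no control group, so its effect on care outcomes cannot be compared quantitatively; and the influence of factors such as patients' health literacy and familiarity with technology on interactions with AI needs further study.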

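The study marks a key step for conversational medical AI toward real-world clinical use, providing initial evidence of its safety, feasibility, and user acceptance in a supervised setting. The team says it will continue to evaluate the practical utility of such systems for healthcare through larger, controlled studies.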

Original article (English):

Exploring the feasibility of conversational diagnostic AI in a real-world clinical study
March 11, 2026
Mike Schaekermann, Research Lead, Google Research, and Alan Karthikesalingam, Director/Principal Scientist, Google DeepMind
We present insights from a first-of-its-kind research study in partnership with Beth Israel Deaconess Medical Center towards prospective real-world assessment of AMIE, our conversational medical AI for clinical reasoning and dialogue.
AI systems capable of clinical reasoning and dialogue have the potential to dramatically increase access to medical expertise and care while giving physicians back time with their patients. Yet, bringing these innovations to life demands a safety-centric, evidence-based approach. In recent years, our work with the Articulate Medical Intelligence Explorer (AMIE) has explored the possibilities of conversational medical AI, starting with demonstrations of its diagnostic capabilities in simulated settings when assisting clinicians in diagnostic challenges and interacting with patient actors. However, as underscored by a recent review of AI in clinical medicine, translating these systems into clinical practice requires assessment in real-world workflows.
In our new work, “A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic”, we are sharing the results of a crucial milestone in our evidence roadmap: a prospective, single-center feasibility study conducted in partnership with Beth Israel Deaconess Medical Center (BIDMC). In this pre-registered, IRB-approved, prospective study, we specifically explored how AMIE can help gather information from a patient before a new ambulatory primary care visit, and sought to understand how both clinicians and patients perceive the use of an AI system within the care experience. This research represents our first step in moving beyond synthetic scenarios to rigorously evaluate AMIE's safety and practical feasibility when interacting with patients in a real-world clinical environment.
Testing AMIE in a real-world clinical workflow
Translating AI systems into clinical practice requires assessment in real-world care delivery settings with strict safety oversight. In this prospective, single-arm feasibility study, AMIE was deployed to conduct pre-visit clinical history taking with patients prior to their ambulatory primary care appointments at an academic medical center.
The clinical setting focused on patients booked to receive care for new, non-emergency, episodic complaints, either in person or via a telehealth platform. Patients were invited to participate in the study during their appointment booking process, where they were given ample time to review the study’s IRB-approved protocols and were assured that their decision on whether or not to participate would not impact their care.
Study participants interacted with the AMIE system via a secure web link before their physical consultation. These AI-driven text chats were overseen by a physician using a live video call with screen sharing. The overseeing physician (referred to as “AI supervisor” in the diagram below) was trained to be ready to intervene if required, based on a pre-defined set of structured safety criteria, providing a safeguard for clinical safety and protocol adherence.
Before the patient attended the urgent care appointment with their doctor, the system generated a transcript and summary to provide their clinician with a comprehensive overview of the pre-visit interaction. Oversight is a common tool for ensuring safety in clinical practice. For example, clinicians-in-training will have the opportunity to communicate with patients under close physician supervision and with patients’ consent so that they may obtain feedback from supervising doctors. Similarly, in this study, the AMIE system generated a transcript and summary before the participating patient attended their PCP appointment that, with patients’ consent, was provided to their clinician. The summary included an overview of the pre-visit interaction for the doctor to review.
Learnings on safety, performance, trust and experience
Evaluating a conversational medical AI system in a real-world clinical workflow requires assessing various criteria and involving perspectives from patients and clinicians alike. We assessed the system's performance across multiple dimensions, including its safety and feasibility in a real-world deployment, its clinical reasoning capabilities, and how the interaction was perceived by both patients and clinicians. The results demonstrate that supervised deployment of AMIE for this task is not only feasible, but also conversationally safe and well-received.
Participation
The study involved 100 adult patients who completed a pre-visit interaction with AMIE. Of these, 98 attended their scheduled ambulatory primary care appointment. The patient sample included diverse age and racial/ethnic groups, as well as people with varying levels of health and tech literacy. Compared to the total 1,452 urgent care visits during the study period, patients participating in this study tended to skew younger, as over half of total urgent care visits at the clinic during the study period were by patients over the age of 60. However, the total urgent care visit population during the study period skewed towards female and white patients, which was consistent with the patient sample in this study.
Safety
Human AI supervisors overseeing AMIE–patient interactions were trained to trigger a safety stop if any one of four pre-specified safety criteria was met:

Immediate concern for harm to self or others
Significant emotional distress exhibited by the patient related to the AI interaction
Potential for clinical harm identified by the supervisor based on the conversation
Explicit request from the patient to end the session
Across all AMIE-patient interactions in this study, zero safety stops were required by the human AI supervisors, providing evidence for the conversational safety of AMIE in this real-world deployment.
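The four stop criteria above lend themselves to a simple data model. The following is an illustrative sketch only: the enum member names and the functions `should_trigger_safety_stop` and `count_safety_stops` are hypothetical and not taken from the study, where the decision was made by a trained physician rather than by code.

```python
from enum import Enum


class SafetyCriterion(Enum):
    """The four pre-specified stop criteria (member names are hypothetical)."""
    HARM_TO_SELF_OR_OTHERS = "immediate concern for harm to self or others"
    EMOTIONAL_DISTRESS = "significant emotional distress related to the AI interaction"
    POTENTIAL_CLINICAL_HARM = "potential for clinical harm identified by the supervisor"
    PATIENT_REQUEST_TO_END = "explicit request from the patient to end the session"


def should_trigger_safety_stop(observed):
    """Return True if the supervisor flagged at least one criterion."""
    return any(isinstance(c, SafetyCriterion) for c in observed)


def count_safety_stops(sessions):
    """Count sessions in which at least one criterion was flagged."""
    return sum(1 for observed in sessions if should_trigger_safety_stop(observed))


# Hypothetical supervisor logs for illustration; these are not study data.
uneventful_sessions = [set(), set(), set()]
flagged_session = {SafetyCriterion.PATIENT_REQUEST_TO_END}
```

A single flagged criterion is enough to stop a session, which mirrors the "one of four" wording; a count of zero across all sessions corresponds to the outcome reported in the study.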
Clinical reasoning
To assess diagnostic and management capabilities, a panel of clinical evaluators who were not involved with urgent care consultations rated differential diagnoses and management plans from AMIE and PCPs in a blinded and randomized manner. Each patient case was reviewed and graded by a set of three clinical evaluators and results are based on aggregated gradings using the median across three evaluators per case.
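The median-based aggregation described above can be sketched in a few lines. This is a minimal illustration assuming numeric ratings on an ordinal scale; the function name `aggregate_case_ratings`, the case identifiers, and the example scores are hypothetical and do not come from the study.

```python
from statistics import median


def aggregate_case_ratings(ratings_by_case, evaluators_per_case=3):
    """Collapse each case's evaluator ratings into a single median score."""
    aggregated = {}
    for case_id, ratings in ratings_by_case.items():
        if len(ratings) != evaluators_per_case:
            raise ValueError(
                f"{case_id}: expected {evaluators_per_case} ratings, got {len(ratings)}"
            )
        aggregated[case_id] = median(ratings)
    return aggregated


# Hypothetical ratings from three evaluators per case (not study data).
example_ratings = {
    "case_001": [4, 5, 3],
    "case_002": [2, 2, 5],
}
aggregated_ratings = aggregate_case_ratings(example_ratings)
```

With three evaluators the median is always one of the ratings actually given, and a single outlying rating (the 5 in `case_002`) cannot move the aggregate on its own.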
Blinded assessment of differential diagnoses (DDx) and management (Mx) plans suggested similar overall DDx and Mx plan quality between AMIE and PCPs, without significant differences for DDx or for the appropriateness and safety of Mx plans. However, PCPs outperformed AMIE in the practicality and cost effectiveness of Mx plans. AMIE’s DDx included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy, and accuracy remained high for the subset of 46 patients where the final diagnosis was confirmed by a diagnostic test (laboratory, microbiological, pathological, or imaging).
These gaps in the cost effectiveness and practicality of management plans are expected given the differing contexts in which AMIE and PCPs were reasoning. AMIE did not have access to the patient's EHR, could not perform a physical exam, and could not integrate multimodal user input, such as the overall physical appearance of the patient. PCPs may have been able to use this richer context and their experience working in the specific clinical environment to construct a more cost-effective and practical management plan.
To further evaluate diagnostic performance, we compared AMIE's differential diagnoses to a final diagnosis, which was established through a chart review conducted eight weeks after the PCP urgent care visit. AMIE successfully matched the final diagnosis within its top 7 diagnostic possibilities in 90% of cases. Furthermore, the system accurately identified the final diagnosis as its single most likely diagnosis in 56% of all cases evaluated.
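The top-k figures above follow the usual definition: a case counts as a hit if the final diagnosis appears among the first k entries of the ranked differential. The sketch below illustrates that calculation under a simplifying assumption of exact string matching; the article does not describe how matches were adjudicated. The function name `top_k_accuracy` and all example cases are hypothetical.

```python
def top_k_accuracy(cases, k):
    """Fraction of cases whose final diagnosis is in the top-k of the ranked DDx.

    `cases` is a list of (ranked_ddx, final_diagnosis) pairs. Exact string
    matching is an assumption made for this sketch.
    """
    if not cases:
        raise ValueError("at least one case is required")
    hits = sum(1 for ranked_ddx, final_dx in cases if final_dx in ranked_ddx[:k])
    return hits / len(cases)


# Hypothetical cases for illustration; these are not study data.
example_cases = [
    (["migraine", "tension headache", "sinusitis"], "migraine"),
    (["GERD", "gastritis", "peptic ulcer"], "gastritis"),
    (["viral URI", "allergic rhinitis"], "strep pharyngitis"),
    (["ankle sprain", "ankle fracture"], "ankle sprain"),
]
top_1 = top_k_accuracy(example_cases, k=1)
top_3 = top_k_accuracy(example_cases, k=3)
```

Top-k accuracy can only stay the same or rise as k grows, which is why the reported top-1 (56%), top-3 (75%), and top-7 (90%) figures increase in that order.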
To better understand these diagnostic capabilities, we also performed a subgroup analysis based on how the final diagnosis was ultimately established. Cases were categorized by whether the final diagnosis was presumptive (made by the PCP without further testing) or more confirmatory (confirmed by a specialist referral or a diagnostic test, such as a laboratory, microbiological, pathological, or imaging result). While AMIE maintained strong diagnostic accuracy for cases requiring objective confirmation via tests or specialists, the system's overall accuracy tended to be even higher for cases where the final diagnosis was based purely on the PCP's presumptive diagnosis.
Trust and experience
Beyond demonstrating safety, we also probed patient and provider experiences with AMIE. Patients completed the General Attitudes towards AI Scale (GAAIS) prior to interacting with AMIE, after interacting with AMIE, and after the consultation with their provider. Attitudes became more positive after interacting with AMIE and remained at an elevated level after seeing the provider. This change was statistically significant for both sub-scales (perceived utility and concerns around AI) and for the overall scale.
Patient surveys and interviews indicated high satisfaction, with patients generally finding AMIE to be polite and effective at explaining medical conditions.
Clinicians reviewing AMIE's pre-visit transcripts found them useful, noting a perceived positive impact on visit preparedness. In qualitative interviews, PCPs noted that AMIE helped shift the visit dynamic from simple data gathering to data verification, allowing for more collaborative conversations and shared decision-making.
Limitations and future directions
This study provides evidence suggesting the initial feasibility, safety, and user acceptance of conversational medical AI as a helpful tool in a real-world setting, representing a crucial step towards potential clinical translation. This study was a single-center feasibility study, and it showed some nuanced limitations and areas for improvement, highlighting the importance of our safe and responsible approach to real-world evidence generation.
First, the study's text-only chat interface does not fully capture the rich, multimodal nature of clinical care. Future systems could benefit from integrating voice or video capabilities to better capture non-verbal cues and physical findings. Second, this study did not include controlled comparisons and therefore does not support quantitative claims on the efficacy of this intervention compared to a baseline workflow. Future studies may build on the findings from this work to quantify the impact of AI in healthcare systems with controlled comparisons. Finally, this study did not exhaustively explore how factors like pre-existing health literacy, tech literacy and familiarity with chatbots influence the interaction with AI systems in clinical settings; understanding how these systems are perceived by broader populations and how interaction is affected by such factors remains an essential area for future research.
In this study, the interaction between AMIE and patients was overseen “live” by a dedicated physician, which represents one paradigm for maximizing patient safety. We have also explored possibilities for asynchronous workflows in our OSCE studies of physician-centered oversight.
To conclude, this work provides important empirical evidence that conversational AI can be safe and helpful for patients and providers in the real world, and we look forward to further assessing the utility and impact of these systems in forthcoming larger studies with controlled comparisons.
Acknowledgements
This project was an extensive collaboration between Beth Israel Deaconess Medical Center, Beth Israel Lahey Health, and many teams at Google Research, Google DeepMind and Google for Health. We thank our many collaborators, sponsors and reviewers of this work. We are grateful for the many contributions of our co-authors throughout this research: Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno, Joseph Xu, Amy Wang, David Stutz, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn, Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica Williams, David Feinbloom, Renee Wong, Tao Tu, Petar Sirkovic, Alessio Orlandi, Christopher Semturs, Yun Liu, Juraj Gottweis, Dale R. Webster, Joëlle Barral, Katherine Chou, Pushmeet Kohli, Avinatan Hassidim, Yossi Matias, James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan, Adam Rodman.
