
Real-world medical questions stump AI chatbots

Published by qimuai · First-hand compilation


Source: https://www.sciencenews.org/article/medical-advice-ai-chatbots-symptoms

Content summary:

AI medical question answering meets real-world challenges: human-AI interaction is the bottleneck, and direct clinical use is not yet mature

A recent study published in Nature Medicine shows that although AI chatbots display a high level of medical knowledge in laboratory settings, both their diagnostic accuracy and the reliability of their advice drop sharply when they face questions from real users, falling below even traditional search engines. The finding underscores the safety risks and human-AI interaction barriers that stand in the way of using current AI medical tools directly in clinical care.

A University of Oxford team tested several state-of-the-art large language models, including GPT-4o. In standardized test scenarios, the models reached 95 percent diagnostic accuracy and recommended the correct action more than 56 percent of the time. Yet when nearly 1,300 volunteers described the same medical scenarios in natural conversation, diagnostic accuracy plunged to below 35 percent and only about 44 percent of recommendations were correct. By contrast, participants who used Google or other search engines diagnosed the problem more than 40 percent of the time, outperforming the chatbot groups.

Study lead Adam Mahdi, a mathematician, puts it this way: “AI has the medical knowledge, but people struggle to get useful advice from it.” The problem stems mainly from how people interact with the models: users tend to supply scattered information bit by bit, while the models are easily thrown off by irrelevant or partial details and can reach sharply different conclusions from subtle changes in wording. In one emergency case, describing a “sudden, excruciating headache” triggered the correct advice to seek emergency care, whereas “a severe headache” led the model to suspect a migraine and recommend rest, advice that could endanger the patient’s life.

The episode exposes the limits of AI’s “black box” decision-making. Mahdi is blunt: “None of the language models tested is ready for direct patient care.” The global patient-safety organization ECRI has likewise listed the use of AI chatbots in medical settings as the most significant health technology hazard for 2026, warning that they can offer wrong diagnoses, invent anatomical structures, recommend dangerous interventions and reinforce medical biases.

Despite the risks, AI is spreading rapidly into medical assistance. ChatGPT already fields more than 40 million healthcare questions a day, and several tech giants have rolled out healthcare-specific models. Scott Lucas, ECRI’s vice president for device safety, says commercial large language models are “not ready for primetime clinical use” and that relying solely on their output is unsafe.

Michelle Li, a medical AI researcher at Harvard Medical School, says the study confirms reliability concerns the field has long discussed. In Nature Medicine, she and her colleagues propose improvements to how models are trained, tested and implemented, changes that could make AI more stable across different medical settings.

Mahdi’s team plans cross-language and long-term follow-up studies to help developers build AI systems that better meet real-world needs. “The first step is to fix the measuring problem,” he says. “We have to measure how AI performs for real people in real settings; that is what matters.”

(Compiled from Science News coverage and the related Nature Medicine study)

Translated article:

Real-world medical questions stump AI chatbots
Chatbots fared worse than a Google search because of how volunteers prompted them
AI chatbots appear to have the medical knowledge, but they perform poorly when interacting with real people.
In the lab, AI chatbots identified medical issues with 95 percent accuracy and correctly recommended actions such as contacting a doctor or going to urgent care more than 56 percent of the time. But when people presented the medical scenarios to the AI conversationally, things got messier. Diagnostic accuracy fell below 35 percent, and only about 44 percent of the recommended actions were correct, researchers report February 9 in Nature Medicine.

The gap between the chatbots’ lab and real-world performance shows that “AI has the medical knowledge, but people struggle to get useful advice from it,” says Adam Mahdi, the mathematician who heads the University of Oxford Reasoning with Machines Lab, which conducted the study.

To test the bots’ diagnostic accuracy in the lab, Mahdi’s team fed scenario texts describing 10 medical conditions to the large language models GPT-4o, Command R+ and Llama 3, then tracked how well the chatbots diagnosed the problems and advised what to do about them.

The team then randomly assigned about 1,300 volunteers to work through the prepared medical scenarios using one of those large language models or some other method. Volunteers had to explain their decisions and say what they thought the medical problem was. Most participants who did not use a chatbot looked up the symptoms on Google or another search engine. Chatbot users performed worse not only than the chatbots evaluated in the lab but also than participants using search tools: those who consulted “Dr. Google” diagnosed the problem more than 40 percent of the time, compared with an average of 35 percent for chatbot users. The difference is statistically meaningful, Mahdi says.
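The article reports this roughly 40 percent versus 35 percent gap as statistically meaningful but does not give the group sizes or the test the authors used. As a rough, hypothetical illustration only, the sketch below applies a standard two-proportion z-test to assumed group sizes (the split of the roughly 1,300 volunteers is invented for the example, not taken from the study) to show that a gap of this size can clear the conventional 0.05 threshold at this scale.

```python
# Rough illustration of a two-proportion z-test for the reported gap
# (about 41% correct for search users vs. 35% for chatbot users).
# The group sizes below are ASSUMED for the example; the article does not
# report how the ~1,300 volunteers were split or which test the authors used.
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, computed via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical split: 400 search users at 41% correct, 900 chatbot users at 35%.
z, p = two_proportion_z_test(successes_a=164, n_a=400, successes_b=315, n_b=900)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")  # prints roughly z = 2.07, p = 0.04
```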

The AI chatbots in the study were state of the art as of late 2024, and their stock of medical knowledge was already strong enough that further gains would be hard to come by. “The problem was interaction with people,” Mahdi stresses.

In some cases, the chatbots gave incorrect, incomplete or misleading information. But the main problem seems to lie in how people interact with large language models. People tend to dole out information bit by bit rather than telling the whole story at once, Mahdi notes, and chatbots are easily distracted by irrelevant or partial information. Participants sometimes even ignored a chatbot’s correct diagnosis.

Subtle differences in how a scenario was described significantly changed the chatbots’ responses. For example, two participants describing a subarachnoid hemorrhage, a type of stroke, both told GPT-4o about headache, light sensitivity and a stiff neck. One volunteer said they had “suddenly developed the worst headache ever,” prompting GPT-4o to correctly advise seeking immediate medical care. The other described only a “terrible headache,” and GPT-4o judged it was likely a migraine and suggested resting in a dark, quiet room, advice that could cost the patient their life.

Why such wording differences trigger such divergent responses is unclear, Mahdi says; it is part of AI’s “black box” problem, in which even a model’s developers cannot trace its reasoning. The team concludes that “none of the tested language models were ready for deployment in direct patient care.”

Other organizations have reached similar conclusions. In a report published January 21, ECRI, a global nonprofit patient-safety organization, listed the use of medical AI chatbots at both ends of the stethoscope, by patients and clinicians alike, as the most significant health technology hazard for 2026. The report notes that AI chatbots can confidently give wrong diagnoses, invent body parts, recommend dangerous medical products or procedures, suggest unnecessary tests and treatments, and reinforce the biases and stereotypes that deepen health inequities. Studies have also shown that chatbots can commit ethical blunders when acting as therapists.

Scott Lucas, ECRI’s vice president for device safety, says most physicians already use chatbots in some fashion, such as transcribing medical records or interpreting test results. OpenAI and Anthropic launched healthcare versions of ChatGPT and Claude in January, and ChatGPT already fields more than 40 million healthcare questions a day.

Lucas is not surprised that people turn to medical chatbots: “They can access billions of data points and aggregate data and put it into a digestible, believable, compelling format that can give you pointed advice on nearly exactly the question that you were asking, and do it in a confident way.” But, he stresses, “commercial LLMs are not ready for primetime clinical use. To rely solely on the output of the LLM, that is not safe.”

Michelle Li, a medical AI researcher at Harvard Medical School, says the study confirms the machine learning community’s long-standing concerns about the safety and reliability of clinical large language models and exposes AI’s weaknesses in real medical settings. In a study published February 3 in Nature Medicine, Li and her colleagues propose ways to improve the training, testing and implementation of AI models that could make them more reliable across a range of medical contexts.

Mahdi plans further studies of AI interactions across languages and over time, findings he says could help developers design stronger models that give people accurate answers. “The first step is to fix the measuring problem,” he stresses. “We haven’t been measuring what matters,” namely how well AI actually serves real people.

English source:

Real-world medical questions stump AI chatbots
Chatbots had worse results than a Google search because of how volunteers prompted them
AI chatbots may seem medical-book smart but their grades falter when interacting with real people.
In the lab, AI chatbots could identify medical issues with 95 percent accuracy and correctly recommend actions such as calling a doctor or going to urgent care more than 56 percent of the time. When humans conversationally presented medical scenarios to the AI chatbots, things got messier. Accuracy dropped to less than 35 percent for diagnosing the condition and about 44 percent for identifying the right action, researchers report February 9 in Nature Medicine.
The drop in chatbots’ performance between the lab and real-world conditions indicates “AI has the medical knowledge, but people struggle to get useful advice from it,” says Adam Mahdi, a mathematician who runs the University of Oxford Reasoning with Machines Lab that conducted the study.
To test the bots’ accuracy in making diagnoses in the lab, Mahdi and colleagues fed scenarios describing 10 medical conditions to the large language models (LLMs) GPT-4o, Command R+ and Llama 3. They tracked how well the chatbot diagnosed the problem and advised what to do about it.
Then, the team randomly assigned almost 1,300 study volunteers to feed the crafted scenarios to one of those LLMs or use some other method to decide what to do in that situation. Volunteers were also asked why they reached their conclusion and what they thought the medical problem was. Most people who didn’t use chatbots plugged symptoms into Google or other search engines. Participants using chatbots not only performed worse than the chatbots assessing the scenario in the lab but also worse than participants using search tools. Participants who consulted Dr. Google diagnosed the problem more than 40 percent of the time compared with the average 35 percent for those who used bots. That’s a statistically meaningful difference, Mahdi says.
The AI chatbots were state-of-the-art in late 2024 when the study was done — so accurate that improving their medical knowledge would be difficult. “The problem was interaction with people,” Mahdi says.
In some cases, chatbots provided incorrect, incomplete or misleading information. But mostly the problem seems to be the way people engaged with the LLMs. People tend to dole out information slowly, instead of giving the whole story at once, Mahdi says. And chatbots can be easily distracted by irrelevant or partial information. Participants sometimes ignored chatbot diagnoses even when they were correct.
Small changes in the way people described the scenarios made a big difference in the chatbot’s response. For instance, two people were describing a subarachnoid hemorrhage, a type of stroke in which blood floods the space between the brain and tissues that cover it. Both participants told GPT-4o about headaches, light sensitivity and stiff necks. One volunteer said they’d “suddenly developed the worst headache ever,” prompting GPT-4o to correctly advise seeking immediate medical attention.
Another volunteer called it a “terrible headache.” GPT-4o suggested that person might have a migraine and should rest in a dark, quiet room — a recommendation that might kill the patient.
Why subtle changes in the description so dramatically changed the response isn’t known, Mahdi says. It’s part of AI’s black box problem in which even its creators can’t follow a model’s reasoning.
Results of the study suggest that “none of the tested language models were ready for deployment in direct patient care,” Mahdi and colleagues say.
Other groups have come to the same conclusion. In a report published January 21, the global nonprofit patient safety organization ECRI listed the use of AI chatbots used for medicine at both ends of the stethoscope as the most significant health technology hazard for 2026. The report cites AI chatbots confidently suggesting erroneous diagnoses, inventing body parts, recommending medical products or procedures that could be dangerous, advising unnecessary tests or treatments and reinforcing biases or stereotypes that can make health disparities worse. Studies have also demonstrated how chatbots can make ethical blunders when used as therapists.
Yet most physicians are now using chatbots in some fashion, such as for transcribing medical records or reviewing test results, says Scott Lucas, ECRI’s vice president for device safety. OpenAI announced ChatGPT for Healthcare and Anthropic launched Claude for Healthcare in January. ChatGPT already fields more than 40 million healthcare questions daily.
And it’s no wonder people turn to chatbots for medical assistance, Lucas says. “They can access billions of data points and aggregate data and put it into a digestible, believable, compelling format that can give you pointed advice on nearly exactly the question that you were asking and do it in a confident way.” But “commercial LLMs are not ready for primetime clinical use. To rely solely on the output of the LLM, that is not safe.”
Eventually both the AI models and users may become sophisticated enough to bridge the communications gap that Mahdi’s study highlights, Lucas says.
The study confirms concerns about safety and reliability of LLMs in patient care that the machine learning community has discussed for a long time, says Michelle Li, a medical AI researcher at Harvard Medical School. This and other studies have illustrated weakness of AI in real medical settings, she says. Li and colleagues published a study February 3 in Nature Medicine suggesting possible improvements in training, testing and implementation of AI models — changes that may make them more reliable in a variety of medical contexts.
Mahdi plans to do additional studies of AI interactions in other languages and over time. The findings may help AI developers design stronger models that people can get accurate answers from, he says.
“The first step is to fix the measuring problem,” Mahdi says. “We haven’t been measuring what matters,” which is how AI performs for real people.
