安全的AI助手是否可能实现?

内容来源:https://www.technologyreview.com/2026/02/11/1132768/is-a-secure-ai-assistant-possible/
内容总结:
AI个人助手安全难题待解,开源工具OpenClaw引发广泛担忧
随着大型语言模型(LLM)技术快速发展,能够自主执行任务的AI智能体(Agent)正成为下一个竞争焦点。然而,安全专家警告,当AI助手获得访问电子邮件、浏览器乃至本地文件的权限时,其潜在风险将被急剧放大。
近期,一款名为OpenClaw的开源AI助手工具在开发者社区迅速走红。该工具由独立软件工程师彼得·斯坦伯格于2025年11月发布,允许用户基于任意LLM定制专属的、24小时在线的个人助手,实现管理日程、处理邮件甚至在线购物等功能。但其设计机制要求用户授予大量敏感数据权限,已引发全球安全研究人员的密集警告。我国相关部门也已就OpenClaw的安全漏洞公开发布风险提示。
斯坦伯格本人承认,非技术用户不应使用该软件。然而,市场对强大AI助手的需求显而易见,且不限于具备安全审计能力的群体。这迫使有志于进入该领域的AI公司必须直面核心挑战:如何构建真正能保护用户数据安全的系统。
风险聚焦:提示词注入攻击
在诸多安全隐患中,一种名为“提示词注入”的新型攻击被专家视为最隐蔽且棘手的威胁。攻击者可通过在网页或邮件中嵌入恶意指令,诱导读取这些内容的AI助手执行非法操作,例如窃取数据或转账。多伦多大学教授尼古拉斯·帕珀诺特形象地比喻:“使用OpenClaw就像把钱包交给街上的陌生人。”
尽管尚未有公开的重大事故报告,但安全专家担忧,随着OpenClaw这类工具用户激增,提示词注入可能成为网络犯罪分子的高吸引力目标。
安全防线如何构建?
目前,业界正从三条路径探索防御方案:
- 模型训练:通过“后训练”过程,教导LLM识别并拒绝疑似注入指令。但难点在于平衡安全性与可用性,过度防范可能导致其拒绝正常指令。
- 输入检测:使用专门的检测模型在用户指令抵达主LLM前进行过滤。但研究表明,现有检测器仍可能漏判某些复杂攻击。
- 输出管控:为AI助手的行为设定严格策略,例如限制其仅能向预设地址发送邮件。但这会大幅限制其功能灵活性。杜克大学教授尼尔·龚指出,这本质是“效用与安全之间的权衡”。
业内对当前技术能否支撑起安全可靠的AI助手仍存分歧。加州大学伯克利分校教授、初创企业Virtue AI创始人宋晓东认为目前已可安全部署;而尼尔·龚则认为“我们尚未达到那个水平”。
面对质疑,斯坦伯格近期宣布已邀请安全专家加入OpenClaw项目。许多用户则在风险与便利间自行取舍:例如将助手运行于云端隔离环境,以避免本地数据被误删。一位名为乔治·皮克特的忠实用户坦言,他虽知风险但未专门防范提示词注入,认为“自己不太可能成为第一个被攻击的目标”。
显然,在AI助手真正步入寻常百姓家之前,一场关于安全、功能与风险的深度博弈才刚刚开始。
中文翻译:
一个安全的人工智能助手可能实现吗?
专家们在大型语言模型安全领域已取得进展,但有人质疑人工智能助手是否已臻成熟。人工智能代理本质上是高风险业务——即便被限制在聊天窗口内,大型语言模型仍会犯错甚至行为失控。一旦它们获得与外部世界交互的工具(如网络浏览器和电子邮件地址),这些错误的后果将变得极其严重。
这或许解释了为何首个突破性的大型语言模型个人助手并非出自需要顾虑声誉与法律责任的主流AI实验室,而是来自独立软件工程师彼得·斯坦伯格。2025年11月,斯坦伯格将其工具(现命名为OpenClaw)上传至GitHub,今年1月下旬该项目迅速引爆网络。
OpenClaw通过调用现有大型语言模型,允许用户创建定制化助手。对部分用户而言,这意味着需要提交海量个人数据——从多年累积的电子邮件到硬盘全部内容。这种操作令安全专家深感不安。OpenClaw引发的风险如此广泛,仅阅读过去几周涌现的相关安全博客文章就可能耗费近一周时间。中国政府已针对其安全漏洞发布公开警告。
面对这些担忧,斯坦伯格在X平台发文建议非技术人员勿使用该软件(他未回应本文的采访请求)。但市场对OpenClaw功能的需求显而易见,且远不止于能自行进行软件安全审计的群体。任何希望进军个人助手领域的AI公司都必须构建能保障用户数据安全的系统,为此需要借鉴智能体安全研究的前沿方法。
风险管理
OpenClaw本质上是大型语言模型的"机甲战衣"。用户可自主选择大型语言模型作为"驾驶员",该模型将获得增强的记忆能力,并能设定定期执行的自动化任务。与主流AI公司的代理产品不同,OpenClaw代理设计为全天候运行,用户可通过WhatsApp等通讯软件与之交互。这意味着它们能化身超级个人助手:每日清晨用个性化待办清单唤醒用户,在工作时段规划假期行程,闲暇时甚至能创建新应用程序。
但这种强大能力伴随风险。若希望AI个人助手管理收件箱,就必须授予其电子邮件访问权限——包括其中所有敏感信息。若需要它代为购物,则需提供信用卡数据。若期望它在计算机上执行编写代码等任务,就必须开放部分本地文件访问权限。
隐患主要体现在两方面:首先是AI助手可能犯错,例如据报道某用户的谷歌反重力编程代理曾误删其整个硬盘;其次是攻击者可能通过传统黑客工具入侵代理,窃取敏感数据或运行恶意代码。OpenClaw爆火后的数周内,安全研究人员已发现大量此类漏洞,使缺乏安全意识的用户面临风险。
尽管这些风险可通过隔离运行(如在独立计算机或云端部署OpenClaw代理)或成熟安全方案进行管控,但本文采访的专家更关注名为"提示词注入"的隐蔽安全威胁。这种攻击实质上是大型语言模型劫持:攻击者只需在模型可能浏览的网站发布恶意文本图像,或向模型读取的收件箱发送恶意内容,就能操纵模型执行指令。
若该模型已获取用户隐私信息,后果将不堪设想。"使用OpenClaw如同将钱包交给街头陌生人,"多伦多大学电气与计算机工程教授尼古拉斯·帕珀诺特警示道。主流AI公司能否放心推出个人助手,可能取决于其防御此类攻击的能力。
需要强调的是,提示词注入尚未引发公开报道的重大灾难。但随着数十万OpenClaw代理在互联网活跃,网络犯罪分子可能开始青睐这种攻击方式。"此类工具正在诱使恶意攻击者扩大目标范围,"帕珀诺特指出。
构建防护栏
"提示词注入"术语由知名大型语言模型博主西蒙·威利森在2022年创造,比ChatGPT发布早数月。当时已可预见,大型语言模型普及后将催生全新安全漏洞类型。这些模型无法区分用户指令与执行指令所需数据(如邮件和网络搜索结果)——对模型而言它们都是文本。若攻击者在邮件中嵌入特定语句被模型误判为用户指令,就能操控模型执行任意操作。
提示词注入是棘手难题且短期内难以根治。"我们目前尚无完美防御方案,"加州大学伯克利分校计算机科学教授宋晓东表示。但活跃的学术共同体正在推进相关研究,已提出若干可能最终保障AI个人助手安全的策略。
从技术层面看,当前使用OpenClaw避免提示词注入风险是可行的:只需断开网络连接。但禁止其读取邮件、管理日程和网络调研,将使AI助手的使用价值大打折扣。防护关键是在允许模型正常工作前提下,阻止其响应劫持企图。
一种策略是训练模型忽略提示词注入。在大型语言模型开发的后训练阶段,通过对恰当回答给予"奖励"、错误回应实施"惩罚"的方式,将文本生成模型转化为实用助手。这种隐喻式奖惩机制能使模型像动物学习般掌握拒绝特定提示词注入的能力。但需把握平衡:过度训练可能导致模型连合法用户请求也一并拒绝。且由于模型行为存在随机性,即便经过严格训练仍可能偶尔失误。
另一种方案是在恶意指令抵达模型前进行拦截,通常采用专用检测模型分析输入数据是否包含注入攻击。但近期研究表明,即使最优检测器也会漏判某些类型的提示词注入。
第三种策略更为复杂:不再通过检测输入内容实施控制,而是制定规范模型输出行为的策略体系。部分防御措施较为简单,例如限制模型仅能向预设地址发送邮件,但这会阻碍其完成诸如代用户联系潜在合作伙伴等有用任务。"核心挑战在于如何精准定义这些策略,"杜克大学电气与计算机工程教授龚尼尔指出,"这本质是效用与安全之间的权衡。"
从更宏观视角看,整个智能体领域都在应对这种权衡:智能体需要达到何种安全程度才能发挥效用?专家观点各异。开发智能体安全平台的初创公司Virtue AI创始人宋晓东认为当前已可安全部署AI个人助手,但龚尼尔教授表示:"我们尚未达到那个阶段。"
即便无法完全防范提示词注入,仍有多种风险缓释手段可供实施,其中部分技术或将在OpenClaw中落地。上周在旧金山首届ClawCon活动上,斯坦伯格宣布已邀请安全专家加入开发团队。
目前OpenClaw仍存在漏洞,但这并未阻挡众多热情用户。开源项目GitHub仓库志愿者维护者、工具爱好者乔治·皮克特表示已采取云端运行、设置防误删机制、禁止他人连接其助手等安全措施。但他坦言未专门防范提示词注入,虽知风险存在,却尚未见到OpenClaw相关案例报告。"或许我的看法不够明智,但成为首个被黑目标的可能性不大,"他如此说道。
深度解析
人工智能
将大型语言模型视为外星生命的新世代生物学家
通过将大语言模型当作生命体而非计算机程序进行研究,科学家首次揭示了它们的部分奥秘。
杨立昆的新创企业是对大语言模型的逆向押注
这位AI先驱在独家专访中分享了其巴黎新公司AMI实验室的发展规划。
Moltbook是人工智能戏剧的巅峰之作
这个病毒式传播的机器人社交网络,在揭示智能体未来发展的同时,更映照出当前社会对AI的狂热迷恋。
保持联系
获取《麻省理工科技评论》最新动态
探索特别优惠、头条新闻、近期活动等精彩内容。
英文来源:
Is a secure AI assistant possible?
Experts have made progress in LLM security. But some doubt AI assistants are ready for prime time.
AI agents are a risky business. Even when stuck inside the chatbox window, LLMs will make mistakes and behave badly. Once they have tools that they can use to interact with the outside world, such as web browsers and email addresses, the consequences of those mistakes become far more serious.
That might explain why the first breakthrough LLM personal assistant came not from one of the major AI labs, which have to worry about reputation and liability, but from an independent software engineer, Peter Steinberger. In November of 2025, Steinberger uploaded his tool, now called OpenClaw, to GitHub, and in late January the project went viral.
OpenClaw harnesses existing LLMs to let users create their own bespoke assistants. For some users, this means handing over reams of personal data, from years of emails to the contents of their hard drive. That has security experts thoroughly freaked out. The risks posed by OpenClaw are so extensive that it would probably take someone the better part of a week to read all of the security blog posts on it that have cropped up in the past few weeks. The Chinese government took the step of issuing a public warning about OpenClaw’s security vulnerabilities.
In response to these concerns, Steinberger posted on X that nontechnical people should not use the software. (He did not respond to a request for comment for this article.) But there’s a clear appetite for what OpenClaw is offering, and it’s not limited to people who can run their own software security audits. Any AI companies that hope to get in on the personal assistant business will need to figure out how to build a system that will keep users’ data safe and secure. To do so, they’ll need to borrow approaches from the cutting edge of agent security research.
Risk management
OpenClaw is, in essence, a mecha suit for LLMs. Users can choose any LLM they like to act as the pilot; that LLM then gains access to improved memory capabilities and the ability to set itself tasks that it repeats on a regular cadence. Unlike the agentic offerings from the major AI companies, OpenClaw agents are meant to be on 24-7, and users can communicate with them using WhatsApp or other messaging apps. That means they can act like a superpowered personal assistant who wakes you each morning with a personalized to-do list, plans vacations while you work, and spins up new apps in its spare time.
But all that power has consequences. If you want your AI personal assistant to manage your inbox, then you need to give it access to your email—and all the sensitive information contained there. If you want it to make purchases on your behalf, you need to give it your credit card info. And if you want it to do tasks on your computer, such as writing code, it needs some access to your local files.
There are a few ways this can go wrong. The first is that the AI assistant might make a mistake, as when a user’s Google Antigravity coding agent reportedly wiped his entire hard drive. The second is that someone might gain access to the agent using conventional hacking tools and use it to either extract sensitive data or run malicious code. In the weeks since OpenClaw went viral, security researchers have demonstrated numerous such vulnerabilities that put security-naïve users at risk.
Both of these dangers can be managed: Some users are choosing to run their OpenClaw agents on separate computers or in the cloud, which protects data on their hard drives from being erased, and other vulnerabilities could be fixed using tried-and-true security approaches.
But the experts I spoke to for this article were focused on a much more insidious security risk known as prompt injection. Prompt injection is effectively LLM hijacking: Simply by posting malicious text or images on a website that an LLM might peruse, or sending them to an inbox that an LLM reads, attackers can bend it to their will.
And if that LLM has access to any of its user’s private information, the consequences could be dire. “Using something like OpenClaw is like giving your wallet to a stranger in the street,” says Nicolas Papernot, a professor of electrical and computer engineering at the University of Toronto. Whether or not the major AI companies can feel comfortable offering personal assistants may come down to the quality of the defenses that they can muster against such attacks.
It’s important to note here that prompt injection has not yet caused any catastrophes, or at least none that have been publicly reported. But now that there are likely hundreds of thousands of OpenClaw agents buzzing around the internet, prompt injection might start to look like a much more appealing strategy for cybercriminals. “Tools like this are incentivizing malicious actors to attack a much broader population,” Papernot says.
Building guardrails
The term “prompt injection” was coined by the popular LLM blogger Simon Willison in 2022, a couple of months before ChatGPT was released. Even back then, it was possible to discern that LLMs would introduce a completely new type of security vulnerability once they came into widespread use. LLMs can’t tell apart the instructions that they receive from users and the data that they use to carry out those instructions, such as emails and web search results—to an LLM, they’re all just text. So if an attacker embeds a few sentences in an email and the LLM mistakes them for an instruction from its user, the attacker can get the LLM to do anything it wants.
Prompt injection is a tough problem, and it doesn’t seem to be going away anytime soon. “We don’t really have a silver-bullet defense right now,” says Dawn Song, a professor of computer science at UC Berkeley. But there’s a robust academic community working on the problem, and they’ve come up with strategies that could eventually make AI personal assistants safe.
Technically speaking, it is possible to use OpenClaw today without risking prompt injection: Just don’t connect it to the internet. But restricting OpenClaw from reading your emails, managing your calendar, and doing online research defeats much of the purpose of using an AI assistant. The trick of protecting against prompt injection is to prevent the LLM from responding to hijacking attempts while still giving it room to do its job.
One strategy is to train the LLM to ignore prompt injections. A major part of the LLM development process, called post-training, involves taking a model that knows how to produce realistic text and turning it into a useful assistant by “rewarding” it for answering questions appropriately and “punishing” it when it fails to do so. These rewards and punishments are metaphorical, but the LLM learns from them as an animal would. Using this process, it’s possible to train an LLM not to respond to specific examples of prompt injection.
But there’s a balance: Train an LLM to reject injected commands too enthusiastically, and it might also start to reject legitimate requests from the user. And because there’s a fundamental element of randomness in LLM behavior, even an LLM that has been very effectively trained to resist prompt injection will likely still slip up every once in a while.
Another approach involves halting the prompt injection attack before it ever reaches the LLM. Typically, this involves using a specialized detector LLM to determine whether or not the data being sent to the original LLM contains any prompt injections. In a recent study, however, even the best-performing detector completely failed to pick up on certain categories of prompt injection attack.
The third strategy is more complicated. Rather than controlling the inputs to an LLM by detecting whether or not they contain a prompt injection, the goal is to formulate a policy that guides the LLM’s outputs—i.e., its behaviors—and prevents it from doing anything harmful. Some defenses in this vein are quite simple: If an LLM is allowed to email only a few pre-approved addresses, for example, then it definitely won’t send its user’s credit card information to an attacker. But such a policy would prevent the LLM from completing many useful tasks, such as researching and reaching out to potential professional contacts on behalf of its user.
“The challenge is how to accurately define those policies,” says Neil Gong, a professor of electrical and computer engineering at Duke University. “It’s a trade-off between utility and security.”
On a larger scale, the entire agentic world is wrestling with that trade-off: At what point will agents be secure enough to be useful? Experts disagree. Song, whose startup, Virtue AI, makes an agent security platform, says she thinks it’s possible to safely deploy an AI personal assistant now. But Gong says, “We’re not there yet.”
Even if AI agents can’t yet be entirely protected against prompt injection, there are certainly ways to mitigate the risks. And it’s possible that some of those techniques could be implemented in OpenClaw. Last week, at the inaugural ClawCon event in San Francisco, Steinberger announced that he’d brought a security person on board to work on the tool.
As of now, OpenClaw remains vulnerable, though that hasn’t dissuaded its multitude of enthusiastic users. George Pickett, a volunteer maintainer of the OpenGlaw GitHub repository and a fan of the tool, says he’s taken some security measures to keep himself safe while using it: He runs it in the cloud, so that he doesn’t have to worry about accidentally deleting his hard drive, and he’s put mechanisms in place to ensure that no one else can connect to his assistant.
But he hasn’t taken any specific actions to prevent prompt injection. He’s aware of the risk but says he hasn’t yet seen any reports of it happening with OpenClaw. “Maybe my perspective is a stupid way to look at it, but it’s unlikely that I’ll be the first one to be hacked,” he says.
Deep Dive
Artificial intelligence
Meet the new biologists treating LLMs like aliens
By studying large language models as if they were living things instead of computer programs, scientists are discovering some of their secrets for the first time.
Yann LeCun’s new venture is a contrarian bet against large language models
In an exclusive interview, the AI pioneer shares his plans for his new Paris-based company, AMI Labs.
Moltbook was peak AI theater
The viral social network for bots reveals more about our own current mania for AI as it does about the future of agents.
Stay connected
Get the latest updates from
MIT Technology Review
Discover special offers, top stories, upcoming events, and more.