
This is the most misunderstood graph in AI.

Published by qimuai · First-hand compilation



Source: https://www.technologyreview.com/2026/02/05/1132254/this-is-the-most-misunderstood-graph-in-ai/

Summary:

AI's "most misunderstood graph" sparks debate: what does METR's time-horizon plot really tell us?

In the AI world, a "time horizon" plot published by the nonprofit research organization METR (Model Evaluation & Threat Research) has become a touchstone for techno-optimists and doomers alike. The plot shows leading large language models improving at an exponential rate on certain coding tasks, and some observers take that to mean artificial general intelligence (AGI), or even society-upending disruption, is close at hand. What the research actually means, however, is far less simple.

Why has the graph caused such a stir?
Since the graph was first released in March 2025, the AI community has waited for METR to update it every time OpenAI, Google, or Anthropic ships a new model. Its y-axis metric, the "time horizon," is the human completion time of the tasks a model can finish successfully about 50% of the time. After Anthropic released its latest model, Claude Opus 4.5, last November, for example, METR estimated that it could handle tasks that take humans roughly five hours, well beyond what the exponential trend would have predicted. The result even prompted some researchers to change direction, and it set off exclamations of alarm on social media.

Misreadings abound: the graph does not predict that AI will "replace humans"
Sydney Von Arx, a member of METR's technical staff, notes that the public often reads too much into the graph. For one thing, the metric carries substantial error bars: Opus 4.5's true ability could fall anywhere between tasks that take humans two hours and tasks that take 20. More importantly, the graph is based only on coding tasks, so it cannot be extrapolated directly to other domains. "A model can get better at coding, but it's not going to magically get better at anything else," stresses Daniel Kang, an assistant professor at the University of Illinois Urbana-Champaign.

Design intent: an attempt to quantify AI progress
To build the plot, the METR team assembled software engineering tasks ranging from multiple-choice questions to complex coding challenges, measured how long humans took to complete them to establish baselines, and then tested the models. Although some question whether longer human completion times necessarily mean harder tasks, the data show the time horizon of top models doubling roughly every seven months, from tasks that took humans about nine seconds in mid-2020 to tasks that took about 40 minutes by late 2024. That exponential trend is the core of the graph's influence.
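
As a quick back-of-the-envelope check on those figures, the implied doubling time can be computed directly from the two approximate end points quoted above (a minimal Python sketch; the dates and task times are the rounded values from this summary, not METR's underlying data):

    import math

    # ~9 human-seconds in mid-2020 vs. ~40 human-minutes in late 2024 (quoted above).
    growth = (40 * 60) / 9        # ratio of the two time horizons, in seconds
    months = 4.5 * 12             # mid-2020 to late 2024, roughly 54 months
    print(f"one doubling every ~{months / math.log2(growth):.1f} months")  # ~6.7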

The researchers' ambivalence and clarifications
Thomas Kwa, a lead author of the METR paper, concedes that although the team has repeatedly written about the graph's limitations, "the hype machine will basically, whatever we do, just strip out all the caveats." The graph has been cited by AI 2027, a viral speculative forecast positing that AI could wipe out humanity by 2030, and by firms such as Sequoia Capital to argue that AI capable of acting as an employee is imminent. The METR team says such sweeping conclusions miss the point: the graph is a tool for putting concrete numbers to the trend of AI progress, not a prophecy of social transformation.

Value and limits: imperfect but indispensable
Despite these shortcomings, researchers credit the work's rigor. Gary Marcus, a former NYU professor, has called much of the work that went into the plot "terrific." As Von Arx puts it: "This is a bunch of people trying their best to make a metric under a lot of constraints. It is deeply flawed in many ways, but I also think that it is one of the best things of its kind." In a fast-moving field, even an imperfect measuring tool offers a valuable reference point for understanding where the technology is headed.

(Compiled from reporting by MIT Technology Review.)



English source:

This is the most misunderstood graph in AI
To some, METR’s “time horizon plot” indicates that AI utopia—or apocalypse—is close at hand. The truth is more complicated.
MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.
Every time OpenAI, Google, or Anthropic drops a new frontier large language model, the AI community holds its breath. It doesn’t exhale until METR, an AI research nonprofit whose name stands for “Model Evaluation & Threat Research,” updates a now-iconic graph that has played a major role in the AI discourse since it was first released in March of last year. The graph suggests that certain AI capabilities are developing at an exponential rate, and more recent model releases have outperformed that already impressive trend.
That was certainly the case for Claude Opus 4.5, the latest version of Anthropic’s most powerful model, which was released in late November. In December, METR announced that Opus 4.5 appeared to be capable of independently completing a task that would have taken a human about five hours—a vast improvement over what even the exponential trend would have predicted. One Anthropic safety researcher tweeted that he would change the direction of his research in light of those results; another employee at the company simply wrote, “mom come pick me up i’m scared.”
But the truth is more complicated than those dramatic responses would suggest. For one thing, METR’s estimates of the abilities of specific models come with substantial error bars. As METR explicitly stated on X, Opus 4.5 might be able to regularly complete only tasks that take humans about two hours, or it might succeed on tasks that take humans as long as 20 hours. Given the uncertainties intrinsic to the method, it was impossible to know for sure.
“There are a bunch of ways that people are reading too much into the graph,” says Sydney Von Arx, a member of METR’s technical staff.
More fundamentally, the METR plot does not measure AI abilities writ large, nor does it claim to. In order to build the graph, METR tests the models primarily on coding tasks, evaluating the difficulty of each by measuring or estimating how long it takes humans to complete it—a metric that not everyone accepts. Claude Opus 4.5 might be able to complete certain tasks that take humans five hours, but that doesn’t mean it’s anywhere close to replacing a human worker.
METR was founded to assess the risks posed by frontier AI systems. Though it is best known for the exponential trend plot, it has also worked with AI companies to evaluate their systems in greater detail and published several other independent research projects, including a widely covered July 2025 study suggesting that AI coding assistants might actually be slowing software engineers down.
But the exponential plot has made METR’s reputation, and the organization appears to have a complicated relationship with that graph’s often breathless reception. In January, Thomas Kwa, one of the lead authors on the paper that introduced it, wrote a blog post responding to some criticisms and making clear its limitations, and METR is currently working on a more extensive FAQ document. But Kwa isn’t optimistic that these efforts will meaningfully shift the discourse. “I think the hype machine will basically, whatever we do, just strip out all the caveats,” he says.
Nevertheless, the METR team does think that the plot has something meaningful to say about the trajectory of AI progress. “You should absolutely not tie your life to this graph,” says Von Arx. “But also,” she adds, “I bet that this trend is gonna hold.”
Part of the trouble with the METR plot is that it’s quite a bit more complicated than it looks. The x-axis is simple enough: It tracks the date when each model was released. But the y-axis is where things get tricky. It records each model’s “time horizon,” an unusual metric that METR created—and that, according to Kwa and Von Arx, is frequently misunderstood.
To understand exactly what model time horizons are, it helps to know all the work that METR put into calculating them. First, the METR team assembled a collection of tasks ranging from quick multiple-choice questions to detailed coding challenges—all of which were somehow relevant to software engineering. Then they had human coders attempt most of those tasks and evaluated how long it took them to finish. In this way, they assigned the tasks a human baseline time. Some tasks took the experts mere seconds, whereas others required several hours.
When METR tested large language models on the task suite, they found that advanced models could complete the fast tasks with ease—but as the models attempted tasks that had taken humans more and more time to finish, their accuracy started to fall off. From a model’s performance, the researchers calculated the point on the time scale of human tasks at which the model would complete about 50% of the tasks successfully. That point is the model’s time horizon.
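
To make that procedure concrete, here is a minimal sketch of how a 50% time horizon could be read off per-task results. The task outcomes below are invented for illustration, and the choice to fit a logistic curve of success probability against the log of the human baseline time is an assumption about the general shape of the method, not METR's actual code or data:

    import math

    # Invented per-task results: (human baseline time in minutes, model succeeded?).
    results = [
        (0.5, True), (1, True), (2, True), (4, True), (8, True),
        (15, True), (15, False), (30, True), (30, False), (60, True),
        (60, False), (120, True), (120, False), (240, False), (240, False),
        (480, False), (960, False),
    ]

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Fit p(success) = sigmoid(a - b * log2(minutes)) by plain gradient ascent.
    a, b, lr = 0.0, 0.0, 0.1
    for _ in range(20000):
        grad_a = grad_b = 0.0
        for minutes, ok in results:
            x = math.log2(minutes)
            err = (1.0 if ok else 0.0) - sigmoid(a - b * x)
            grad_a += err
            grad_b += -err * x
        a += lr * grad_a / len(results)
        b += lr * grad_b / len(results)

    # The 50% time horizon is where the fitted curve crosses 0.5:
    # a - b * log2(t) = 0  =>  t = 2 ** (a / b).
    print(f"estimated 50% time horizon: ~{2 ** (a / b):.0f} human-minutes")
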
All that detail is in the blog post and the academic paper that METR released along with the original time horizon plot. But the METR plot is frequently passed around on social media without this context, and so the true meaning of the time horizon metric can get lost in the shuffle. One common misapprehension is that the numbers on the plot’s y-axis—around five hours for Claude Opus 4.5, for example—represent the length of time that the models can operate independently. They do not. They represent how long it takes humans to complete tasks that a model can successfully perform. Kwa has seen this error so frequently that he made a point of correcting it at the very top of his recent blog post, and when asked what information he would add to the versions of the plot circulating online, he said he would include the word “human” whenever the task completion time was mentioned.
As complex and widely misinterpreted as the time horizon concept might be, it does make some basic sense: A model with a one-hour time horizon could automate some modest portions of a software engineer’s job, whereas a model with a 40-hour horizon could potentially complete days of work on its own. But some experts question whether the amount of time that humans take on tasks is an effective metric for quantifying AI capabilities. “I don’t think it’s necessarily a given fact that because something takes longer, it’s going to be a harder task,” says Inioluwa Deborah Raji, a PhD student at UC Berkeley who studies model evaluation.
Von Arx says that she, too, was originally skeptical that time horizon was the right measure to use. What convinced her was seeing the results of her and her colleagues’ analysis. When they calculated the 50% time horizon for all the major models available in early 2025 and then plotted each of them on the graph, they saw that the time horizons for the top-tier models were increasing over time—and, moreover, that the rate of advancement was speeding up. Every seven-ish months, the time horizon doubled, which means that the most advanced models could complete tasks that took humans nine seconds in mid 2020, 4 minutes in early 2023, and 40 minutes in late 2024. “I can do all the theorizing I want about whether or not it makes sense, but the trend is there,” Von Arx says.
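
Where does "every seven-ish months" come from? A rough version of that calculation can be reproduced by fitting a line to the three approximate data points quoted above. This is only a back-of-the-envelope sketch: the real plot covers many models with exact release dates and confidence intervals, and the values below are the article's rounded figures, not METR's data:

    import math

    points = [             # (release date as a decimal year, time horizon in minutes)
        (2020.5, 9 / 60),  # mid-2020: ~9 human-seconds
        (2023.0, 4),       # early 2023: ~4 human-minutes
        (2024.9, 40),      # late 2024: ~40 human-minutes
    ]

    xs = [x for x, _ in points]
    ys = [math.log2(h) for _, h in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    print(f"doubling time: ~{12 / slope:.1f} months")   # ~6.5 months for these points

    # Naive extrapolation: when would a ~5-hour (300 human-minute) horizon arrive?
    print(f"trend reaches ~5 hours around {mx + (math.log2(300) - my) / slope:.1f}")  # ~mid-2026
    # METR's late-2025 estimate for Opus 4.5 landed ahead of this naive trend line.
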
It’s this dramatic pattern that made the METR plot such a blockbuster. Many people learned about it when they read AI 2027, a viral sci-fi story cum quantitative forecast positing that superintelligent AI could wipe out humanity by 2030. The writers of AI 2027 based some of their predictions on the METR plot and cited it extensively. In Von Arx’s words, “It’s a little weird when the way lots of people are familiar with your work is this pretty opinionated interpretation.”
Of course, plenty of people invoke the METR plot without imagining large-scale death and destruction. For some AI boosters, the exponential trend indicates that AI will soon usher in an era of radical economic growth. The venture capital firm Sequoia Capital, for example, recently put out a post titled “2026: This is AGI,” which used the METR plot to argue that AI that can act as an employee or contractor will soon arrive. “The provocation really was like, ‘What will you do when your plans are measured in centuries?’” says Sonya Huang, a general partner at Sequoia and one of the post’s authors.
Just because a model achieves a one-hour time horizon on the METR plot, however, doesn’t mean that it can replace one hour of human work in the real world. For one thing, the tasks on which the models are evaluated don’t reflect the complexities and confusion of real-world work. In their original study, Kwa, Von Arx, and their colleagues quantify what they call the “messiness” of each task according to criteria such as whether the model knows exactly how it is being scored and whether it can easily start over if it makes a mistake (for messy tasks, the answer to both questions would be no). They found that models do noticeably worse on messy tasks, although the overall pattern of improvement holds for both messy and non-messy ones.
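
For readers who want to picture what "quantifying messiness" can look like, here is a toy sketch. Only the two criteria named above come from the article; the Task fields, the count-based score, and the equal weighting are hypothetical stand-ins, not METR's actual rubric, which the article implies involves more criteria than these two:

    from dataclasses import dataclass

    @dataclass
    class Task:
        scoring_is_transparent: bool  # does the model know exactly how it is scored?
        can_easily_retry: bool        # can it cheaply start over after a mistake?

    def messiness(task: Task) -> int:
        # Higher = messier; a "messy" task answers "no" to both questions above.
        return sum([not task.scoring_is_transparent, not task.can_easily_retry])

    print(messiness(Task(True, True)), messiness(Task(False, False)))  # 0 2
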
And even the messiest tasks that METR considered can’t provide much information about AI’s ability to take on most jobs, because the plot is based almost entirely on coding tasks. “A model can get better at coding, but it’s not going to magically get better at anything else,” says Daniel Kang, an assistant professor of computer science at the University of Illinois Urbana-Champaign. In a follow-up study, Kwa and his colleagues did find that time horizons for tasks in other domains also appear to be on exponential trajectories, but that work was much less formal.
Despite these limitations, many people admire the group’s research. “The METR study is one of the most carefully designed studies in the literature for this kind of work,” Kang told me. Even Gary Marcus, a former NYU professor and professional LLM curmudgeon, described much of the work that went into the plot as “terrific” in a blog post.
Some people will almost certainly continue to read the METR plot as a prognostication of our AI-induced doom, but in reality it’s something far more banal: a carefully constructed scientific tool that puts concrete numbers to people’s intuitive sense of AI progress. As METR employees will readily agree, the plot is far from a perfect instrument. But in a new and fast-moving domain, even imperfect tools can have enormous value.
“This is a bunch of people trying their best to make a metric under a lot of constraints. It is deeply flawed in many ways,” Von Arx says. “I also think that it is one of the best things of its kind.”
