Distinct AI models seem to converge on how they encode reality

Summary:
Are different AI models converging on the same “understanding” of the real world? Scientists propose the “Platonic representation hypothesis”
Have you ever wondered whether an AI “seeing” a photo of a dog and “reading” a passage about dogs understands them in the same internal way? The “Platonic representation hypothesis,” recently proposed by researchers at MIT, has stirred wide discussion in the AI community. It holds that although different AI models are trained on different data and tasks, their internal representations of the real world are converging as the models grow more capable.
From the Allegory of the Cave to an AI “consensus”
The team borrows Plato’s famous Allegory of the Cave to describe the phenomenon: the real world is like the objects outside the cave, and each kind of data (text, images, and so on) is a “shadow” those objects cast. Different AI models are the prisoners in the cave, each watching one type of shadow. The hypothesis says that more capable models are gradually converging, through these different “data shadows,” on some shared internal representation of the single reality behind them.
“Why do the language model and the vision model align? Because they’re both shadows of the same world,” explained Phillip Isola, the paper’s senior author.
“Birds of a feather” in the math
Technically, a model’s representation of any input (a word, an image) is a vector in a high-dimensional space. Researchers assess how similar two models’ internal representations are by comparing the geometry of those vectors, for example, whether the “dog” vector sits closer to “cat” than to “jellyfish” in both models. Such studies find that more powerful models do show more similar representational structure.
Controversy and prospects
The hypothesis is not universally accepted, however. The main points of contention: which representations to compare, which mathematical methods to use, and whether biases in the datasets themselves skew the results. Alexei Efros, a researcher at the University of California, Berkeley, notes that the image-text data used in many tests is designed so that the two modalities carry closely matching content, which may overstate the “consensus” between models, while much of the information in the real world (the experience of art, for instance) resists full translation across modalities.
Even so, probing what models have in common has already proved useful. Researchers have, for example, managed to “translate” internal representations between different language models, and shared representations across modalities (such as images and text) could support more efficient ways to train multimodal AI.
The debate over whether AI systems are converging on a shared “world model” continues. As Isola puts it: “Half the community says this is obvious, and the other half says this is obviously wrong. We were happy with that response.” Whether or not the hypothesis is ultimately borne out, the debate itself is pushing people to think harder about how AI “understands” the world.
English source:
Distinct AI Models Seem To Converge On How They Encode Reality
Introduction
Read a story about dogs, and you may remember it the next time you see one bounding through a park. That’s only possible because you have a unified concept of “dog” that isn’t tied to words or images alone. Bulldog or border collie, barking or getting its belly rubbed, a dog can be many things while still remaining a dog.
Artificial intelligence systems aren’t always so lucky. These systems learn by ingesting vast troves of data in a process called training. Often, that data is all of the same type — text for language models, images for computer vision systems, and more exotic kinds of data for systems designed to predict the odor of molecules or the structure of proteins. So to what extent do language models and vision models have a shared understanding of dogs?
Researchers investigate such questions by peering inside AI systems and studying how they represent scenes and sentences. A growing body of research has found that different AI models can develop similar representations, even if they’re trained using different datasets or entirely different data types. What’s more, a few studies have suggested that those representations are growing more similar as models grow more capable. In a 2024 paper, four AI researchers at the Massachusetts Institute of Technology argued that these hints of convergence are no fluke. Their idea, dubbed the Platonic representation hypothesis, has inspired a lively debate among researchers and a slew of follow-up work.
The team’s hypothesis gets its name from a 2,400-year-old allegory by the Greek philosopher Plato. In it, prisoners trapped inside a cave perceive the world only through shadows cast by outside objects. Plato maintained that we’re all like those unfortunate prisoners. The objects we encounter in everyday life, in his view, are pale shadows of ideal “forms” that reside in some transcendent realm beyond the reach of the senses.
The Platonic representation hypothesis is less abstract. In this version of the metaphor, what’s outside the cave is the real world, and it casts machine-readable shadows in the form of streams of data. AI models are the prisoners. The MIT team’s claim is that very different models, exposed only to the data streams, are beginning to converge on a shared “Platonic representation” of the world behind the data.
“Why do the language model and the vision model align? Because they’re both shadows of the same world,” said Phillip Isola, the senior author of the paper.
Not everyone is convinced. One of the main points of contention involves which representations to focus on. You can’t inspect a language model’s internal representation of every conceivable sentence, or a vision model’s representation of every image. So how do you decide which ones are, well, representative? Where do you look for the representations, and how do you compare them across very different models? It’s unlikely that researchers will reach a consensus on the Platonic representation hypothesis anytime soon, but that doesn’t bother Isola.
“Half the community says this is obvious, and the other half says this is obviously wrong,” he said. “We were happy with that response.”
The Company Being Kept
If AI researchers don’t agree on Plato, they might find more common ground with his predecessor Pythagoras, whose philosophy supposedly started from the premise “All is number.” That’s an apt description of the neural networks that power AI models. Their representations of words or pictures are just long lists of numbers, each indicating the degree of activation of a specific artificial neuron.
To simplify the math, researchers typically focus on a single layer of a neural network in isolation, which is akin to taking a snapshot of brain activity in a specific region at a specific moment in time. They write down the neuron activations in this layer as a geometric object called a vector — an arrow that points in a particular direction in an abstract space. Modern AI models have many thousands of neurons in each layer, so their representations are high-dimensional vectors that are impossible to visualize directly. But vectors make it easy to compare a network’s representations: Two representations are similar if the corresponding vectors point in similar directions.
Within a single AI model, similar inputs tend to have similar representations. In a language model, for instance, the vector representing the word “dog” will be relatively close to vectors representing “pet,” “bark,” and “furry,” and farther from “Platonic” and “molasses.” It’s a precise mathematical realization of an idea memorably expressed more than 60 years ago by the British linguist John Rupert Firth: “You shall know a word by the company it keeps.”
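To make “pointing in similar directions” concrete: the standard measure is cosine similarity, the cosine of the angle between two vectors. The sketch below is a toy illustration in Python; the three-dimensional vectors and their values are made up for the example and are not taken from any real model, which would use thousands of dimensions.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two activation vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "activations"; a real layer has many thousands of dimensions.
vectors = {
    "dog":      np.array([0.9, 0.8, 0.1]),
    "pet":      np.array([0.8, 0.9, 0.2]),
    "molasses": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(vectors["dog"], vectors["pet"]))       # high: similar directions
print(cosine_similarity(vectors["dog"], vectors["molasses"]))  # low: different directions
```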
What about representations in different models? It doesn’t make sense to directly compare activation vectors from separate networks, but researchers have devised indirect ways to assess representational similarity. One popular approach is to embrace the lesson of Firth’s pithy quote and measure whether two models’ representations of an input keep the same company.
Imagine that you want to compare how two language models represent words for animals. First, you’ll compile a list of words — dog, cat, wolf, jellyfish, and so on. You’ll then feed these words into both networks and record their representations of each word. In each network, the representations will form a cluster of vectors. You can then ask: How similar are the overall shapes of the two clusters?
“It can kind of be described as measuring the similarity of similarities,” said Ilia Sucholutsky, an AI researcher at New York University.
In this simple example, you’d expect some similarity between the two models — the “cat” vector would probably be close to the “dog” vector in both networks, for instance, and the “jellyfish” vector would point in a different direction. But the two clusters probably won’t look exactly the same. Is “dog” more like “cat” than “wolf,” or vice versa? If your models were trained on different datasets, or built on different network architectures, they might not agree.
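One common way to make “how similar are the shapes of the two clusters” precise is a second-order comparison: build each model’s word-by-word cosine-similarity matrix, then correlate the two matrices, literally a similarity of similarities. The sketch below is a minimal illustration under my own simplifying assumptions (each model’s representations stacked into a matrix with one row per word, the same word order in both, and Pearson correlation as the final score); published studies use a variety of related metrics.

```python
import numpy as np

def pairwise_cosine(X: np.ndarray) -> np.ndarray:
    """Word-by-word cosine-similarity matrix for one model's representations.

    X has shape (num_words, embedding_dim); the rows follow the same word
    list (dog, cat, wolf, jellyfish, ...) in every model being compared.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def second_order_similarity(X_a: np.ndarray, X_b: np.ndarray) -> float:
    """Correlate two models' similarity structures: the "similarity of similarities"."""
    S_a, S_b = pairwise_cosine(X_a), pairwise_cosine(X_b)
    iu = np.triu_indices_from(S_a, k=1)   # keep each word pair once, skip the diagonal
    return float(np.corrcoef(S_a[iu], S_b[iu])[0, 1])

# Toy check: model B sees the same geometry as model A, just along different axes,
# so the second-order similarity is (close to) 1 even though the raw vectors differ.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(4, 8))                   # 4 animal words, 8-dim representations
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # a random rotation of the space
X_b = X_a @ Q
print(second_order_similarity(X_a, X_b))        # approximately 1.0
```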
Researchers began to explore representational similarity among AI models with this approach in the mid-2010s and found that different models’ representations of the same concepts were often similar, though far from identical. Intriguingly, a few studies found that more powerful models seemed to have more similarities in their representations than weaker ones. One 2021 paper dubbed this the “Anna Karenina scenario,” a nod to the opening line of the classic Tolstoy novel. Perhaps successful AI models are all alike, and every unsuccessful model is unsuccessful in its own way.
That paper, like much of the early work on representational similarity, focused only on computer vision, which was then the most popular branch of AI research. The advent of powerful language models was about to change that. For Isola, it was also an opportunity to see just how far representational similarity could go.
Convergent Evolution
The story of the Platonic representation hypothesis paper began in early 2023, a turbulent time for AI researchers. ChatGPT had been released a few months before, and it was increasingly clear that simply scaling up AI models — training larger neural networks on more data — made them better at many different tasks. But it was unclear why.
“Everyone in AI research was going through an existential life crisis,” said Minyoung Huh, an OpenAI researcher who was a graduate student in Isola’s lab at the time. He began meeting regularly with Isola and their colleagues Brian Cheung and Tongzhou Wang to discuss how scaling might affect internal representations.
Imagine a case where multiple models are trained on the same data, and the stronger models learn more similar representations. This isn’t necessarily because these models are creating a more accurate likeness of the world. They could just be better at grasping quirks of the training dataset.
Now consider models trained on different datasets. If their representations also converge, that would be more compelling evidence that models are getting better at grasping shared features of the world behind the data. Convergence between models that learned from entirely different data types, such as language and vision models, would provide even stronger evidence.
A year after their initial conversations, Isola and his colleagues decided to write a paper reviewing the evidence for convergent representations and presenting an argument for the Platonic representation hypothesis.
By then, other researchers had started studying similarities between vision and language model representations. Huh conducted his own experiment, in which he tested a set of five vision models and 11 language models of varying sizes on a dataset of captioned pictures from Wikipedia. He would feed the pictures into the vision models and the captions into the language models, and then compare the clusters of vectors from the two types of models. He observed a steady increase in representational similarity as models became more powerful. It was exactly what the Platonic representation hypothesis predicted.
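The details of that experiment are in the paper, but the flavor of such a cross-modal comparison can be sketched as follows: given a vision model’s embedding of each image and a language model’s embedding of the matching caption, score the pair of models by how often an item’s nearest neighbors agree across the two spaces, a mutual-nearest-neighbor style of measure. Everything below is a simplified stand-in for the real setup: the function names are mine, and the embeddings are assumed to be precomputed and row-aligned, image i with caption i.

```python
import numpy as np

def knn_indices(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors by cosine similarity (self excluded)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)            # never count an item as its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def cross_modal_alignment(vision_emb: np.ndarray, text_emb: np.ndarray, k: int = 10) -> float:
    """Average overlap of neighbor sets across the two embedding spaces.

    Row i of vision_emb: a vision model's embedding of image i.
    Row i of text_emb:   a language model's embedding of image i's caption.
    Returns a score in [0, 1]; higher means the two models "agree" more about
    which items resemble which.
    """
    nn_v = knn_indices(vision_emb, k)
    nn_t = knn_indices(text_emb, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_v, nn_t)]))
```

Computed for embeddings from a ladder of progressively stronger vision and language models on the same captioned images, a steadily rising score would be the kind of trend the experiment reports.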
Find the Universals
Of course, it’s never so simple. Measurements of representational similarity invariably involve a host of experimental choices that can affect the outcome. Which layers do you look at in each network? Once you have a cluster of vectors from each model, which of the many mathematical methods do you use to compare them? And which representations do you measure in the first place?
“If you only test one dataset, you don’t necessarily know how [the result] generalizes,” said Christopher Wolfram, a researcher at the University of Chicago who has studied representational similarity in language models. “Who knows what would happen if you did some weirder dataset?”
Isola acknowledged that the issue is far from settled. It’s not a question that any one paper can resolve: In principle, you can measure models’ representations of any picture or any sentence. To him, cases where models do exhibit convergence are more compelling than cases where they may not.
“The endeavor of science is to find the universals,” Isola said. “We could study the ways in which models are different or disagree, but that somehow has less explanatory power than identifying the commonalities.”
Other researchers argue that it’s more productive to focus on where models’ representations differ. Among them is Alexei Efros, a researcher at the University of California, Berkeley, who has been an adviser to three of the four members of the MIT team.
“They’re all good friends and they’re all very, very smart people,” Efros said. “I think they’re wrong, but that’s what science is about.”
Efros noted that in the Wikipedia dataset that Huh used, the images and text contained very similar information by design. But most data we encounter in the world has features that resist translation. “There is a reason why you go to an art museum instead of just reading the catalog,” he said.
Any intrinsic sameness across models doesn’t have to be perfect to be useful. Last summer, researchers devised a method to translate internal representations of sentences from one language model to another. And if language and vision model representations are to some extent interchangeable, that could lead to new ways to train models that learn from both data types. Isola and others explored one such training scheme in a recent paper.
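As a toy illustration of what “translating” representations between models could mean (not the method used in that work, whose technique is more involved): if you have both models’ embeddings of a shared set of training sentences, you can fit a linear map from one space to the other by least squares and apply it to new sentences. The function name and the synthetic data below are mine.

```python
import numpy as np

def fit_linear_translator(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Least-squares map W such that emb_a @ W approximates emb_b.

    emb_a, emb_b: model A's and model B's embeddings of the same sentences,
    with shapes (n_sentences, dim_a) and (n_sentences, dim_b).
    """
    W, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)
    return W

# Synthetic check: if model B's space really were a linear transform of model A's,
# the fitted map would carry held-out sentences across almost perfectly.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 16))        # "model A" embeddings of 200 sentences
B = A @ rng.normal(size=(16, 12))     # "model B" embeddings of the same sentences
W = fit_linear_translator(A[:150], B[:150])
print(np.allclose(A[150:] @ W, B[150:]))   # True in this idealized case
```

Real language and vision models are not exact linear transforms of one another, which is why the degree to which such maps work at all is informative.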
Despite these promising developments, other researchers think it’s unlikely that any single theory will fully capture the behavior of modern AI models.
“You can’t reduce a trillion-parameter system to simple explanations,” said Jeff Clune, an AI researcher at the University of British Columbia. “The answers are going to be complicated.”