
GIST Smart Sampling: Opening a New Chapter in Intelligent Sampling


Source: https://research.google/blog/introducing-gist-the-next-stage-in-smart-sampling/

Summary:

Google unveils the GIST algorithm: breaking the smart data sampling bottleneck to speed up large-scale AI training

On January 23, 2026, Google Research scientists Morteza Zadimoghaddam and Matthew Fahrbach introduced a new algorithm called GIST (Greedy Independent Set Thresholding). It targets a core problem in machine learning: how to efficiently select, from a massive dataset, a high-quality subset that is both representative and non-redundant as the foundation for subsequent model training.

As machine learning models, especially large language models and computer vision systems, continue to advance rapidly, processing huge and complex datasets has become the main cost bottleneck for further performance gains. Traditional subset selection methods must trade off "data diversity" (avoiding duplicate or near-identical examples) against "data utility" (choosing the examples most valuable for the task). The two goals often conflict, and finding an optimal subset that satisfies both is a computationally hard problem.

GIST's breakthrough is that it provides the first approximation guarantee with a rigorous mathematical proof for this tradeoff. Its core idea is to decompose the joint diversity-utility optimization problem into a series of tractable subproblems. The algorithm enforces dissimilarity among the selected points by imposing a range of minimum-distance thresholds, and for each threshold it runs a carefully designed bicriteria greedy algorithm that efficiently approximates the maximum-utility subset under that distance constraint. In theory, the utility of the subset GIST selects is at least half that of the optimal solution, which gives practical deployments a reliable mathematical safety net.

GIST also performs strongly in practice. In image classification experiments on datasets such as ImageNet, ResNet-56 models trained on subsets selected by GIST achieved noticeably higher accuracy than models trained on subsets chosen by random sampling, margin sampling, k-center, and other state-of-the-art baselines, indicating that GIST balances coverage and information density more effectively. Moreover, although the underlying problem is NP-hard, GIST's subset selection step itself is very fast: compared with training runs that take hours or even days, its overhead is negligible, so it integrates easily into large-scale training pipelines with billions of data points.

The research team notes that the framework is broadly applicable. For example, the core max-min diversity idea has been applied by the YouTube Home ranking team to increase the diversity of video recommendations, which improved long-term user value.

According to the Google team, GIST offers a single, efficient, provably effective framework for the fundamental tradeoff in data selection. The work lays a solid foundation for the next generation of scalable AI systems, ensuring that as data volumes keep growing, models can still be trained efficiently on subsets that are both highly informative and minimally redundant, pushing AI toward smarter and more resource-efficient training.


Original article:

Introducing GIST: The next stage in smart sampling
January 23, 2026
Morteza Zadimoghaddam and Matthew Fahrbach, Research Scientists, Google Research
We introduce GIST, a novel algorithm that provides provable guarantees for selecting a high-quality data subset that maximizes both data diversity and data utility.
Modern machine learning (ML) has unlocked unprecedented performance at the cost of having to process increasingly massive and complex datasets. From large language models (LLMs) to computer vision systems, there’s a common challenge: handling a massive amount of data that’s expensive to process.
This necessitates subset selection — the task of choosing a smaller, representative group of data points from the full dataset for the typical training task (as opposed to fine-tuning). The question is then: how can we be sure this subset contains enough information to train an accurate model?
At NeurIPS 2025, we introduced Greedy Independent Set Thresholding (GIST), a novel algorithm that helps solve this issue by balancing data “diversity” (ensuring the selected data is not redundant) and data “utility” (data that is relevant and useful for the task). GIST not only outperforms state-of-the-art benchmarks on tasks such as image classification, but it does so with a mathematical guarantee about its solution quality.
The conflict: Why smart sampling is hard
When selecting a subset of data, researchers must balance two often conflicting objectives: diversity and utility. Enforcing data diversity ensures the selected points aren’t redundant. Utility measures the overall usefulness or informational value of the selected subset.
For diversity, we focus on maximizing the minimum distance (typically in embedding space) between any two selected data points, also known as max-min diversity. If you choose two data points that are very similar (e.g., two almost identical pictures of a golden retriever), your diversity is low. Max-min diversity forces you to select points that are all as far apart from each other as possible, minimizing redundancy and ensuring broad coverage of the data landscape. For utility, we focus on the class of monotone submodular functions, which aim to maximize the total unique information covered by the subset.
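To make the two objectives concrete, here is a minimal illustrative sketch (not code from the paper): it computes the max-min diversity of a selected set and a facility-location utility, a standard example of a monotone submodular function, over points in an embedding space. The function names and the similarity choice are assumptions made for this example.

```python
import numpy as np

def max_min_diversity(points: np.ndarray) -> float:
    """Max-min diversity objective: the smallest pairwise distance within the selected set."""
    n = len(points)
    if n < 2:
        return float("inf")
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return float(dists[np.triu_indices(n, k=1)].min())

def facility_location_utility(selected: np.ndarray, dataset: np.ndarray) -> float:
    """A classic monotone submodular utility: each dataset point is credited with its
    similarity to the closest selected point, so the total measures how well the
    selection covers the full dataset."""
    dists = np.linalg.norm(dataset[:, None, :] - selected[None, :, :], axis=-1)
    sims = 1.0 / (1.0 + dists)        # similarity in (0, 1]; any monotone similarity works
    return float(sims.max(axis=1).sum())

# Toy example in a 2-D embedding space.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
subset = data[:5]
print(max_min_diversity(subset), facility_location_utility(subset, data))
```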
The difficulty lies in combining these two goals. A pure max-min strategy might select diverse but ultimately irrelevant data points, while a pure utility strategy might select a tight, highly relevant cluster of redundant points. Finding a subset that is both maximally spread out and maximally informative is a complex combinatorial problem that is known to be NP-hard, meaning that no algorithm can find the best solution efficiently, especially for massive datasets.
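One natural way to formalize the combined problem (a paraphrase for intuition; the paper's exact formulation and weighting may differ) is to pick at most k points that maximize a monotone submodular utility f plus a scaled max-min diversity term:

\max_{S \subseteq V,\; |S| \le k} \; f(S) \;+\; \lambda \cdot \min_{u \neq v \in S} \mathrm{dist}(u, v)

where V is the full dataset and \lambda controls how strongly diversity is rewarded relative to utility.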
This inherent conflict requires a clever approximation strategy.
How GIST works
Since finding a perfect subset is impractical, the goal shifts to finding an algorithm with a provable approximation guarantee — a mathematical safety net that assures the solution is always close to the true optimum. This is where GIST provides its groundbreaking solution.
GIST breaks the diversity–utility challenge into a series of simpler, but related, optimization problems:

  1. Thresholding the diversity component
    GIST starts by temporarily isolating the diversity component. Instead of trying to maximize the minimum distance between all points (the hard part), GIST tackles a simpler question: "For a certain fixed minimum distance, what is the best subset of data we can select?"
    By fixing the minimum required distance, GIST processes the data using a graph where two points are connected only if their distance is less than that specified distance. In this graph, any two connected points are considered too similar to be in the final subset.
  2. Approximating the independent set
    GIST then looks for the maximum-utility subset that can be chosen where no two points are connected in this graph: the classic maximum independent set problem. Imagine planning a dinner party where certain guests can’t sit together. Your goal is to invite the most interesting group of people possible, but you must follow one rule: no two people at the table can have a conflict. This is a massive puzzle because picking one guest might "block" you from inviting three other high-interest people. To find the best combination, you have to check an exponential number of groupings, which is why it is considered one of the hardest problems in computing.
    Since the max independent set problem itself is NP-complete (meaning that it’s widely believed there does not exist an efficient algorithm to find the absolute “perfect” answer for a massive dataset) and admits no reasonable approximation algorithm, GIST uses a carefully constructed bicriteria greedy algorithm to approximate the solution efficiently. It iterates through many possible distance thresholds, solving the corresponding independent set problem and ultimately selecting the best solution found across all thresholds. For any minimum distance d achieved by the optimum solution, GIST achieves a comparable utility to the optimum utility at distance threshold d/2.
    The bicriteria greedy algorithm acts like a systematic "tuning knob" that finds the right balance between data variety and value. Instead of guessing, it analyzes the actual distances between all data points to create a list of potential spacing rules. It then tests these rules one by one: for each rule, it "greedily" grabs the most valuable points it can find, provided they aren't too close to the points it has already picked. By running this process across every relevant distance and comparing the results, the algorithm identifies the specific "sweet spot" that captures the most useful information while ensuring the data is as spread out as possible.
    By cleverly approximating a series of these maximum independent set problems, GIST manages to satisfy the utility goal while respecting the minimum diversity requirement.
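The two steps above can be condensed into a short sketch. What follows is an illustrative reconstruction of the threshold-then-greedy pattern, not the authors' reference implementation; the function names, the quantile-based threshold grid, and the toy modular utility are assumptions made for this example.

```python
import numpy as np

def greedy_at_threshold(points, utility_gain, k, tau):
    """Bicriteria-style greedy for one distance threshold tau: repeatedly add the point
    with the largest utility gain, skipping any candidate within distance tau of a point
    that is already selected (the independent-set constraint on the threshold graph)."""
    selected, remaining = [], list(range(len(points)))
    while remaining and len(selected) < k:
        remaining = [i for i in remaining
                     if all(np.linalg.norm(points[i] - points[j]) >= tau for j in selected)]
        if not remaining:
            break
        best = max(remaining, key=lambda i: utility_gain(i, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

def gist_like_select(points, utility_gain, objective, k):
    """Sweep several candidate distance thresholds (derived from the observed pairwise
    distances), run the greedy at each one, and keep the subset that scores best under
    the combined diversity-utility objective."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    thresholds = np.quantile(dists[dists > 0], [0.1, 0.3, 0.5, 0.7, 0.9])
    candidates = [greedy_at_threshold(points, utility_gain, k, t) for t in thresholds]
    return max(candidates, key=objective)

# Toy run: a modular utility (per-example importance scores) plus a diversity bonus.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
scores = rng.random(300)

def gain(i, selected):           # marginal gain of a modular utility is just the point's score
    return scores[i]

def combined_objective(S):
    div = min((np.linalg.norm(X[a] - X[b]) for a in S for b in S if a < b), default=0.0)
    return scores[list(S)].sum() + 0.5 * div

print(gist_like_select(X, gain, combined_objective, k=20))
```

Sweeping a grid of thresholds mirrors the d/2 bicriteria idea described above: trying many candidate distances is what lets the algorithm match, at a slightly relaxed distance, whatever minimum distance the optimal subset achieves.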
Results
Strong guarantees
Our theoretical results are the most significant findings: GIST is the first algorithm to provide a strong, provable guarantee for this diversity-utility tradeoff. The GIST algorithm is guaranteed to find a data subset whose value is at least half the value of the absolute optimal solution (see paper for details). This strong guarantee provides practitioners with a necessary mathematical safety net, ensuring that the algorithm makes an efficient trade-off between maximizing utility and ensuring diversification. Further, we prove that it is NP-hard to find a subset with more than a 0.56 fraction of the optimal value.
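Paraphrasing the two statements above in symbols (see the paper for the precise constants and conditions): if S^{*} is an optimal subset and d is its minimum pairwise distance, GIST returns a subset S with

f(S) \;\ge\; \tfrac{1}{2}\, f(S^{*}) \qquad \text{and} \qquad \min_{u \neq v \in S} \mathrm{dist}(u, v) \;\ge\; d/2,

and it is NP-hard to guarantee more than a 0.56 fraction of the optimal value.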
Real-world impact
We also evaluated GIST against state-of-the-art benchmarks in various ML applications, focusing on scenarios where diversity and utility are both essential:
• Random: A simple and lightweight approach that promotes diversity in many settings and provides good solutions.
• Margin: Picks the data points the model is currently “uncertain” about; it is not incentivized to output a diverse set of training examples.
• k-center: Selects a subset of data points so that every single point in the original, massive dataset is as “close” as possible to one of the selected representatives. Instead of looking for the most important or interesting points, it tries to eliminate “blind spots.”
• Submod: Chooses “important” points (utility) while also making sure they aren’t “too similar” (diversity). However, it uses a slightly older mathematical way of defining diversity that can sometimes be inconsistent when the dataset becomes large.
We then combined GIST with the other strategies to see if it could make them better:
• GIST-margin: This takes the “picking the hard cases” strategy and forces it to follow GIST’s strict diversity rules. It says, “Pick the most confusing examples, but I forbid you from picking two confusing examples that are too similar to each other.”
• GIST-submod: Uses the GIST framework to handle the diversity part more rigorously than the original Submod approach could.
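As a small, hypothetical illustration of the GIST-margin idea (the scoring helper below and the way it plugs into the selection loop are assumptions, not the paper's recipe), margin uncertainty can be turned into a per-example utility and then handed to a diversity-constrained greedy such as the one sketched earlier:

```python
import numpy as np

def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Classic margin uncertainty: a small gap between the top-2 class probabilities
    means the model is unsure about the example, so it gets a higher score."""
    top2 = np.sort(probs, axis=1)[:, -2:]       # two largest probabilities per row
    return 1.0 - (top2[:, 1] - top2[:, 0])

# Toy check: 3 examples, 4 classes.
probs = np.array([[0.90, 0.05, 0.03, 0.02],     # confident prediction -> low score
                  [0.40, 0.35, 0.15, 0.10],     # uncertain prediction -> high score
                  [0.50, 0.30, 0.10, 0.10]])
print(margin_scores(probs))
# Used as the utility inside a threshold-constrained greedy, these scores favor the
# "hard" examples while the distance rule keeps near-duplicate hard examples apart.
```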
Data sampling for image classification
In experiments using a ResNet-56 model on datasets like ImageNet, GIST demonstrated significant advantages in single-shot subset selection. Single-shot data downsampling reduces the volume of a dataset — typically images, signals, or high-dimensional data — in one step while retaining critical information. Unlike iterative or multi-stage processes, this approach maximizes speed and efficiency and is often used to decrease computational burden or to optimize rendering performance in graphics-related tasks.
This is critical for data-intensive tasks where we need to select the most informative and diverse images once before beginning costly model training. GIST consistently selected subsets that led to higher model accuracy compared to previous methods, proving its improved ability to balance coverage and non-redundancy.
Running time
Despite the complexity of the underlying problem, the actual subset selection step of GIST is incredibly fast: its running time is often negligible compared to the hours or even days required for training the final ML model. This speed makes GIST practical for integration into large-scale training pipelines with billions of data points.
We also observed the value of the max-min diversity approach with the YouTube Home ranking team, which employed a similar principle to enhance the diversity of video recommendations and consequently improved long-term user value.
Conclusion: A foundation for scalable AI
The challenge of combining competing optimization goals — such as maximizing total utility while maintaining maximum diversity — has been a long-standing hurdle in computational science. The GIST algorithm successfully solves this fundamental trade-off for data selection by providing a single, highly efficient framework.
By breaking the challenging diversity–utility tradeoff problem into a sequence of simpler, approximable tasks, GIST provides the ML community with a provably effective tool. This research establishes a strong foundation for the next generation of scalable AI systems, guaranteeing that as data continues to grow, we can still train models on subsets that are both maximally informative and minimally redundant. The gist of it is, GIST ensures we are sampling data intelligently.
Acknowledgements
We thank Sara Ahmadian from Google Research; Srikumar Ramalingam, Giulia DeSalvo, and Gui Citovsky from Google DeepMind; and Cheenar Banerjee, Cristos Goodrow, Fabio Soldo, and Su-Lin Wu from YouTube, who contributed to this work. Matthew Fahrbach, Morteza Zadimoghaddam, and Srikumar Ramalingam contributed equally to this research project.
