
Accelerating scientific discovery with AI-powered empirical software


Source: https://research.google/blog/accelerating-scientific-discovery-with-ai-powered-empirical-software/

Summary:

Google's AI system aids research: automatically writing empirical software to drive breakthroughs across disciplines

On September 9, 2025, a Google Research team led by product manager Lizzie Dorfman and research scientist Michael Brenner announced a new AI system that helps scientists automatically generate expert-level empirical software, achieving breakthrough results on complex scientific problems across six different fields, including genomics, public health, and neuroscience.

In traditional research workflows, evaluating hypotheses often requires writing custom software, a time-consuming process that easily becomes a bottleneck. The new AI system, built on Google's Gemini large language models, takes a well-defined research question and evaluation criteria as input and can automatically generate code, optimize algorithms, and validate performance. Using a tree search strategy inspired by AlphaZero, it iterates over thousands of code variants in search of better solutions, compressing exploration that would otherwise take months into hours or days.

In concrete applications, the system demonstrated breakthroughs across multiple disciplines, spanning genomics, public health, geospatial analysis, neuroscience, time-series forecasting, and numerical analysis.

The code solutions it produces are verifiable, interpretable, and reproducible. The core results have been published in a paper, and an interactive website is available for the research community to inspect them. According to the researchers, the technology will substantially reduce the time scientists spend on programming and debugging, allowing them to focus on core scientific questions and innovation.

(Compiled from findings published by Google Research on September 9, 2025)

Full translation:

Accelerating scientific discovery with AI-powered empirical software
September 9, 2025
Lizzie Dorfman, Product Manager, and Michael Brenner, Research Scientist
Google Research

Our new AI system helps scientists write empirical software, achieving expert-level results on six diverse, challenging problems.


In scientific research, thoroughly evaluating hypotheses is essential to developing more robust and comprehensive answers, but the required work forms a bottleneck that slows the pace of discovery. In particular, much of modern research depends on computational experiments to model, simulate, and analyze complex phenomena, and hypothesis evaluation there often requires creating custom software, a slow and challenging task. Given the increasing capability of large language models (LLMs) on traditional coding tasks, we wondered whether they could similarly generate high-quality custom software for evaluating and iteratively improving scientific hypotheses.

Today we are releasing a paper describing an AI system, built with Gemini, designed to help scientists write expert-level empirical software. Taking as input a well-defined problem and a means of evaluation, the system acts as a systematic code-optimizing research engine: it proposes novel methodological and architectural concepts, implements them as executable code, and empirically validates their performance. It then searches and iterates through thousands of code variants, using tree search to optimize performance. We tested the system on six benchmarks representing distinct multidisciplinary challenges, spanning genomics, public health, geospatial analysis, neuroscience, time-series forecasting, and numerical analysis. The system achieves expert-level performance across all of these benchmarks.

Empirical software and scorable tasks

Scientific research is inherently iterative, often requiring researchers to test dozens or hundreds of models or parameters to achieve a breakthrough. Even for scientists who are experienced programmers, coding, debugging, and optimizing software is extremely time-consuming. Manually coding each new idea is slow and inefficient, making systematic exploration of potential solutions practically impossible.

At the heart of the system lies the foundational concept of empirical software. Unlike conventional software, which is usually judged by functional correctness alone, empirical software is designed with a primary objective: to maximize a predefined quality score. A problem or challenge that can be effectively addressed by empirical software is termed a scorable task, and such tasks are prevalent across science, applied mathematics, and engineering.

How it works

The input to the system is a scorable task, which includes a problem description, a scoring metric, and data suitable for training, validation, and evaluation. A user can also provide context, such as ideas from the literature or directives on which methodologies to prioritize.
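
To make the shape of a scorable task concrete, here is a minimal sketch of how such an input might be represented in Python. The class name, field names, and tuple layout of the data splits are illustrative assumptions, not the system's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ScorableTask:
    """Illustrative container for a scorable task (names are assumptions)."""
    description: str                       # natural-language problem statement
    score_fn: Callable[[Any, Any], float]  # (predictions, ground truth) -> score to maximize
    train_data: Any                        # data the generated code may fit on
    validation_data: Any                   # data used to score candidate programs during search
    holdout_data: Any                      # data reserved for final evaluation
    context: str = ""                      # optional hints, e.g., ideas from the literature

def evaluate(task: ScorableTask, predict_fn: Callable[[Any], Any]) -> float:
    """Run a candidate program's predict function and return its validation score.
    Assumes validation_data is an (inputs, targets) pair."""
    inputs, targets = task.validation_data
    return task.score_fn(predict_fn(inputs), targets)
```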

The system then generates research ideas, including programmatic reproduction, optimization, and recombination of known methods, leading to novel and highly performant approaches. Ideas are implemented as executable code, and the system uses a tree search strategy with an upper confidence bound (inspired by AlphaZero) to build a tree of software candidates and decide which candidates warrant further exploration. It uses an LLM to rewrite code in an attempt to improve its quality score, carrying out solution searches tirelessly and at an unprecedented scale, identifying high-quality solutions quickly and reducing exploration time from months to hours or days. Its outputs, as coded solutions, are verifiable, interpretable, and reproducible.
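
The sketch below shows the general shape of such an upper-confidence-bound tree search over code candidates, under stated assumptions: the `llm_rewrite` and `run_and_score` callables are hypothetical stand-ins, and the exploration constant and selection rule are illustrative choices, not the paper's actual implementation.

```python
import math

class Node:
    def __init__(self, code, parent=None):
        self.code, self.parent, self.children = code, parent, []
        self.visits, self.total_score = 0, 0.0

def ucb(node, c=1.4):
    """Mean score plus an exploration bonus (upper confidence bound)."""
    if node.visits == 0:
        return float("inf")
    mean = node.total_score / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits + 1) / node.visits)

def search(seed_code, llm_rewrite, run_and_score, budget=1000):
    """Grow a tree of code variants, always expanding the candidate with the best UCB."""
    root = Node(seed_code)
    best_code, best_score = seed_code, run_and_score(seed_code)
    root.visits, root.total_score = 1, best_score
    for _ in range(budget):
        # Selection: walk down the tree following the highest-UCB child.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: ask the LLM to rewrite the code, aiming for a higher score.
        child = Node(llm_rewrite(node.code), parent=node)
        node.children.append(child)
        # Evaluation: run the candidate empirically and record its score.
        score = run_and_score(child.code)
        child.visits, child.total_score = 1, score
        if score > best_score:
            best_code, best_score = child.code, score
        # Backpropagation: update visit counts and scores up to the root.
        while node is not None:
            node.visits += 1
            node.total_score += score
            node = node.parent
    return best_code, best_score
```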

Demonstrated effectiveness

Evaluation of code-generating AI systems has historically focused on tasks derived from competitive programming or software engineering, which, while valuable, fail to capture the full spectrum of challenges inherent in scientific discovery. We demonstrate not merely that the system writes syntactically correct code, but that it generates novel solutions to six diverse and challenging benchmark problems that push the boundaries of current computational methods and human expertise. The diversity of these benchmarks lets us collectively assess proficiency in areas such as zero-shot generalization, high-dimensional signal processing, uncertainty quantification, semantic interpretation of complex data, and systems-level modeling. The top-scoring solutions to each benchmark are openly available for anyone interested in reproducing our results, including an interactive website for exploring the full candidate solution trees.

Genomics: Batch integration of single-cell RNA sequencing data
Single-cell RNA sequencing (scRNA-seq) is a powerful technology that provides a high-resolution view of gene expression at the level of individual cells. A major challenge in jointly analyzing many disparate datasets is removing the complex batch effects present across samples while preserving true biological signal. Nearly 300 tools exist for batch integration of scRNA-seq data, and multiple benchmarks have been developed to assess metrics of batch-effect removal and conservation of biological variability. On the OpenProblems V2.0.0 batch integration benchmark, which combines 13 metrics into one overall score, our system discovered 40 novel methods that outperformed the top expert-developed methods. The highest-scoring solution achieved a 14% overall improvement over the best published method (ComBat) by successfully combining two existing methods (ComBat and BBKNN).
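
The released code contains the exact recipe; as a rough illustration only, chaining the two published methods with the widely used scanpy and bbknn packages might look like the sketch below. The `batch` key and the preprocessing steps are assumptions, not the benchmark solution's actual configuration.

```python
import scanpy as sc  # sc.external.pp.bbknn additionally requires the bbknn package

def integrate(adata, batch_key="batch", n_pcs=50):
    """Sketch: ComBat to correct expression values, then BBKNN to build a batch-balanced graph."""
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.combat(adata, key=batch_key)                 # remove batch effects in expression space
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, n_comps=n_pcs)
    sc.external.pp.bbknn(adata, batch_key=batch_key)   # batch-balanced k-NN graph on the PCA embedding
    sc.tl.umap(adata)                                  # embedding for visual inspection
    return adata
```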

公共卫生:美国COVID-19住院人数预测
美国COVID-19预测的核心基准是由疾控中心(CDC)协调的大型合作项目COVID-19预测中心(CovidHub)。该平台汇集来自数十个专家团队的多样化竞争性方案,任务需预测全美及各地区未来一个月内的COVID-19新增住院人数。预测结果采用平均加权区间评分(WIS)进行评估,该指标通过汇总模型在整个季节中每周对所有地区的预测表现来评估概率预测质量。最终各方案被聚合为CovidHub集合模型,该模型被视为美国COVID-19住院预测的金标准。我们的系统生成14个超越官方CovidHub集合模型的方案。
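
For reference, WIS has a standard definition: a weighted sum of the absolute error of the median and interval scores at several central coverage levels, averaged here over locations and weeks. A minimal sketch of that formula for a single forecast, with illustrative interval levels:

```python
def interval_score(y, lower, upper, alpha):
    """Interval score of a central (1 - alpha) prediction interval for observation y."""
    penalty_low = (2.0 / alpha) * max(lower - y, 0.0)
    penalty_high = (2.0 / alpha) * max(y - upper, 0.0)
    return (upper - lower) + penalty_low + penalty_high

def weighted_interval_score(y, median, intervals):
    """WIS for a single forecast.

    intervals: list of (alpha, lower, upper) central prediction intervals.
    Standard weights: 1/2 for the median term, alpha_k / 2 for each interval.
    """
    total = 0.5 * abs(y - median)
    for alpha, lower, upper in intervals:
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)

# Example: observed 120 hospitalizations, median forecast 100, 50% and 90% intervals.
print(weighted_interval_score(120, 100, [(0.5, 80, 130), (0.1, 60, 160)]))
```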

Geospatial analysis: Segmentation of remote sensing images
Semantic segmentation of high-resolution remote sensing images is a common problem in geospatial analysis and is essential for applications ranging from monitoring land use and assessing the environmental impact of human activity to managing natural disasters. The task, which involves accurately assigning a class label to every pixel in an image, requires a model to develop a spatial and contextual understanding of the scene, identifying not just which objects are present but precisely where their boundaries lie.

On the dense labeling remote sensing dataset (DLRSD) benchmark, which evaluates methods using mean intersection over union (mIoU), the top three solutions generated by our system are slightly better than the current state of the art, with mIoU greater than 0.80. All three build on existing models, libraries, and strategies: two use standard UNet++ and U-Net models paired with powerful encoders pre-trained on ImageNet, and the third uses SegFormer, a state-of-the-art Transformer-based architecture. All three employ extensive test-time augmentation (TTA).
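
As a rough illustration of that pattern (an ImageNet-pretrained encoder inside a UNet++ decoder plus flip-based TTA), here is a minimal sketch using the segmentation_models_pytorch library. The encoder, input size, and class count are placeholders rather than the benchmark solutions' actual configuration.

```python
import torch
import segmentation_models_pytorch as smp

# Placeholder configuration, not the benchmark solutions' actual settings.
model = smp.UnetPlusPlus(
    encoder_name="resnet50",      # ImageNet-pretrained encoder
    encoder_weights="imagenet",
    in_channels=3,
    classes=17,                   # adjust to the label set of the dataset actually used
)
model.eval()

@torch.no_grad()
def predict_with_tta(images):
    """Average logits over the identity and a horizontal flip (simple test-time augmentation)."""
    logits = model(images)
    flipped = model(torch.flip(images, dims=[-1]))
    logits = logits + torch.flip(flipped, dims=[-1])   # un-flip before averaging
    return (logits / 2).argmax(dim=1)                  # per-pixel class labels

masks = predict_with_tta(torch.randn(1, 3, 256, 256))  # dummy input, just to show the call shape
```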

Neuroscience: Whole-brain neural activity prediction
We applied our method to the Zebrafish Activity Prediction Benchmark (ZAPBench), a recent benchmark for forecasting the activity of over 70,000 neurons across an entire vertebrate brain. The system discovered a novel time-series forecasting model that achieves state-of-the-art performance, surpassing all existing baselines, including a computationally intensive video-based model that forecasts 3D volumes and was previously the top-performing solution. As a proof of concept, we also showed that the system can design hybrid models that incorporate a biophysical neuron simulator (Jaxley), paving the way for more interpretable predictive models.

While each of these examples is compelling in its own right, the system's ability to generate empirical software is striking in its generality. We additionally evaluated it in mathematics, on the numerical evaluation of difficult integrals: the system generated a solution that correctly evaluated 17 of 19 held-out integrals on which the standard numerical method failed. Lastly, we evaluated the system on the general problem of time-series forecasting using GIFT-Eval (General Time Series Forecasting Model Evaluation), a benchmark of 28 datasets spanning seven domains and 10 sampling frequencies, from seconds to years. The system successfully created a unified, general-purpose forecasting library from scratch by hill climbing with a single code base on the average mean absolute scaled error across the entire GIFT-Eval dataset. See the paper for more details.
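
For reference, the hill-climbing objective mentioned above, mean absolute scaled error (MASE), has a standard definition: forecast MAE divided by the in-sample MAE of a seasonal-naive forecast. A minimal sketch for a single series, with an illustrative seasonal period:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error: forecast MAE scaled by the in-sample
    seasonal-naive MAE with seasonal period m (m=1 gives the naive forecast)."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

# Example: lower is better; hill climbing would average this over all GIFT-Eval series.
history = np.array([10, 12, 14, 13, 15, 16, 18, 17])
print(mase(y_true=[19, 20], y_pred=[18.5, 19.0], y_train=history, m=1))
```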

Conclusion

Recent advances in LLMs have already given researchers worldwide new ways to engage with knowledge and ideas, and LLMs are increasingly being pursued as a means of automating the rote and toilsome aspects of scientific research. We explored whether LLMs could help with the ubiquitous, essential, and highly time-consuming task of producing custom software for evaluating and iteratively improving scientific hypotheses, motivated by the possibility of a future in which scientists can easily, rapidly, and systematically investigate hundreds or thousands of potential solutions to the questions that drive their research. Our system quickly generates expert-level solutions, reducing the time required to explore a set of ideas from months to hours or days. This promises to save significant time for scientists, from students to professors, letting them focus on truly creative and critical challenges and continue to define and prioritize the fundamental research questions and societal problems that scientific research can help address.

Acknowledgements
We thank and acknowledge the contributions of all co-authors of the manuscript. Thanks to Shibl Mourad, John Platt, Erica Brand, Katherine Chou, Ronit Levavi Morad, Yossi Matias, and James Manyika for their support and leadership.

