借助人工智能驱动的实证软件加速科学发现
内容来源:https://research.google/blog/accelerating-scientific-discovery-with-ai-powered-empirical-software/
内容总结:
谷歌研发AI系统助力科研 高效生成专业级实证软件加速科学发现
谷歌研究团队于2025年9月9日宣布开发出一款基于Gemini的人工智能系统,能够协助科学家自动生成专家级实证软件,在基因组学、公共卫生、神经科学等六个跨学科领域的高难度基准问题上均取得专家级成果。
传统科研过程中,假设验证常需耗费数月时间编写定制化软件,形成研究瓶颈。该AI系统通过接收明确定义的科学问题与评估标准,自动生成创新性方法论架构,将其转化为可执行代码并进行实证验证。系统采用树状搜索策略(受AlphaZero启发),对数千种代码变体进行迭代优化,将原本需要数月的探索过程压缩至数小时或数天。
在具体应用中,该系统展现出卓越的跨学科适应性:
- 在基因组学单细胞RNA测序数据整合任务中,系统发现40种超越现有专家开发方法的新方法,其中最高分方案通过结合ComBat与BBKNN两种现有方法,较最佳已发布方法(ComBat)实现14%的整体提升
- 在美国COVID-19住院人数预测任务中,生成的14个模型超越美国疾控中心官方集成模型的预测精度
- 在遥感图像语义分割任务中,前三名解决方案的mIoU指标均突破0.80,超越现有最优水平
- 在斑马鱼全脑神经活动预测任务中,发现了性能达到业界最优、超越所有现有基线的新型时间序列预测模型
研究团队强调,该系统不仅能生成语法正确的代码,更擅长创造性地解决推动科学边界的前沿问题。所有高分解决方案均开源发布,确保研究成果的可验证性与可复现性。
这项突破标志着人工智能正从辅助编程工具升级为科研创新伙伴,有望使科学家从繁琐的编码工作中解放,更专注于本质性科学问题探索与关键性社会挑战的应对。
中文翻译:
借助人工智能驱动的实证软件加速科学发现
2025年9月9日
Lizzie Dorfman(产品经理)与Michael Brenner(研究科学家)
谷歌研究院
我们全新的人工智能系统可协助科学家编写实证软件,在六个多元高难度课题中达到专家级成果。
科学研究中,全面验证假说是获得更可靠、更完整结论的关键,但所需工作往往形成瓶颈,阻碍发现进程。尤其现代科研多依赖计算实验来模拟、仿真和分析复杂现象。此类假说验证常需定制化软件开发,过程缓慢且充满挑战。鉴于大语言模型(LLM)在传统编程任务中日益展现的能力,我们探索其是否也能生成高质量定制软件,用于评估并迭代改进科学假说。
今日我们发布论文,介绍一款"基于Gemini构建、旨在协助科学家编写专家级实证软件"的AI系统。该系统以明确定义的问题和评估方法为输入,充当系统化的代码优化研究引擎:它能提出新颖的方法论与架构概念,将其实现为可执行代码,并通过实证验证性能。随后通过树搜索优化性能,对数千种代码变体进行搜索迭代。我们在六个代表多学科挑战的基准测试中验证系统,涵盖基因组学、公共卫生、地理空间分析、神经科学、时间序列预测和数值分析领域,系统在所有测试中均达到专家级表现。
实证软件与可评分任务
科研本质是迭代过程,常需测试数十乃至数百种模型或参数才能实现突破。即使对精通编程的科学家而言,编码、调试和优化软件也极其耗时。手动编写每个新想法不仅低效,更使系统性探索潜在解决方案几乎无法实现。
我们系统的核心是实证软件理念。与传统软件仅关注功能正确性不同,实证软件以最大化预设质量评分为核心目标。可通过实证软件有效解决的问题称为可评分任务,这类任务广泛存在于科学、应用数学和工程领域。
工作原理
系统输入为可评分任务,包含问题描述、评分指标以及适用于训练、验证和评估的数据。用户还可提供背景信息,如外部文献思路或优先方法指令。
系统随后生成研究思路(包括程序化复现、优化和已知方法重组),形成新颖高效的方法。这些思路被实现为可执行代码,系统采用受AlphaZero启发的上置信界树搜索策略构建候选软件树,决定哪些方案值得深入探索。系统利用LLM重写代码以提升质量评分,能够以前所未有的规模不知疲倦地穷举搜索解决方案,快速锁定优质方案,将探索时间从数月压缩至数小时或数天。最终输出的代码解决方案具备可验证、可解释与可复现特性。
实证效能
传统代码生成AI系统的评估多基于竞技编程或软件工程任务,虽具价值却未能涵盖科学发现中的全方位挑战。我们不仅证明系统能编写语法正确的代码,更在六大跨领域高难度基准问题上生成创新解决方案,突破当前计算方法和人类专业知识的边界。这些基准测试的多样性让我们能综合评估零样本泛化、高维信号处理、不确定性量化、复杂数据语义解析和系统级建模等能力。每个基准测试的最高分解决方案均已公开,可供复现研究结果,还提供交互式网站探索完整候选解决方案树。
基因组学:单细胞RNA测序数据的批次整合
单细胞RNA测序(scRNA-seq)是一项强大技术,可在单细胞层面提供基因表达的高分辨率视图。联合分析多个异构数据集时,关键挑战在于消除样本间复杂批次效应的同时保留真实生物信号。现有近300种scRNA-seq数据批次整合工具,且已开发多套基准测试用于评估批次效应消除与生物变异保留指标。使用融合13项指标的综合评分基准OpenProblems V2.0.0时,我们的系统发现40种新颖方法,其表现超越专家开发的顶尖方法。最高分方案通过成功结合两种现有方法(ComBat与BBKNN),实现了较最佳已发布方法(ComBat)14%的整体提升。
公共卫生:美国COVID-19住院人数预测
美国COVID-19预测的主要基准平台COVID-19 Forecast Hub(CovidHub)由疾控中心(CDC)协调运营,汇集数十个专家团队的多样化方法提交。任务需提前至多一个月预测美国各州及属地的新增COVID-19住院人数。预测结果采用平均加权区间评分(WIS)评估,该指标通过汇总模型在所有地点每周预测的表现来衡量概率预测质量。各提交方案随后被聚合为CovidHub集合模型,该模型被视为美国COVID-19住院预测的金标准。我们的系统生成14个超越官方CovidHub集合模型的方案。
地理空间分析:遥感图像分割
高分辨率遥感图像的语义分割是地理空间分析常见任务,对土地利用监测、人类活动环境影响评估及自然灾害管理等应用至关重要。该任务需为图像中每个像素精确分配类别标签,要求模型对场景建立空间与上下文理解,不仅识别对象类型,更精准定位其边界。
使用基于平均交并比(mIoU)评估的密集标注遥感数据集(DLRSD)基准时,本系统生成的前三解决方案均略优于当前最优技术(mIoU>0.80)。所有方案均基于现有模型、库和策略构建:两个方案采用标准UNet++和U-Net模型,但搭配ImageNet预训练的强编码器;第三个方案使用基于Transformer的最先进架构SegFormer。三者均采用广泛测试时增强(TTA)技术。
神经科学:全脑神经活动预测
我们将方法应用于斑马鱼活动预测基准(ZAPBench),该最新基准用于预测整个脊椎大脑中超过70,000个神经元的活动。系统发现了一种新颖的时间序列预测模型,其性能达到业界最优,超越所有现有基线(包括此前顶尖的基于视频计算密集型3D体积预测模型)。作为概念验证,我们还证明系统能设计融合生物物理神经元模拟器(Jaxley)的混合模型,为开发更具可解释性的预测模型铺平道路。
虽然每个案例本身都具说服力,但我们的实证软件生成系统最突出之处在于其普适性。我们还在数学领域针对困难积分数值计算任务评估系统,其生成的解决方案在标准数值方法失败的19个保留积分中正确计算出17个。最后,使用涵盖7大领域、10种频率(秒至年)的28个数据集构成的通用时间序列预测评估基准(GIFT-Eval)测试系统时,它以单一代码库在整个GIFT-Eval数据集的平均绝对缩放误差(MASE)上进行爬山优化,成功从头创建了统一的通用预测库。详见论文。
结论
LLM的最新进展已为全球研究者提供了获取知识与思想的新途径,其作为自动化科研中重复性劳动手段的应用日益增多。我们探索了LLM是否能在普遍、必需且极度耗时的定制软件开发任务中发挥作用——这些软件用于评估并迭代改进科学假说,其愿景是让科学家能轻松、快速、系统地探索数百乃至数千种潜在解决方案。我们的系统快速生成专家级方案,将创意探索时间从数月缩短至数小时或数天,有望为从学生到教授的科学家节省大量时间,使其专注于真正创造性与关键性的挑战,持续界定和优先处理那些科学研究能助力解决的基础性问题与社会挑战。
致谢
我们感谢论文所有合著者的贡献,并感谢Shibl Mourad、John Platt、Erica Brand、Katherine Chou、Ronit Levavi Morad、Yossi Matias和James Manyika的支持与领导。
英文来源:
Accelerating scientific discovery with AI-powered empirical software
September 9, 2025
Lizzie Dorfman, Product Manager, and Michael Brenner, Research Scientist, Google Research
Our new AI system helps scientists write empirical software, achieving expert-level results on six diverse, challenging problems.
In scientific research, thoroughly evaluating hypotheses is essential to developing more robust and comprehensive answers, but the required work forms a bottleneck, hindering the pace of discovery. In particular, much of modern scientific research depends on computational experiments to model, simulate, and analyze complex phenomena. Here, hypothesis evaluation often requires creating custom software, a slow and challenging task. Given the increasing capability of large language models (LLMs) to perform traditional coding tasks, we wondered if they could similarly generate high-quality custom software for evaluating and iteratively improving scientific hypotheses.
Today we are releasing a paper describing an "AI system designed to help scientists write expert-level empirical software", built using Gemini. Taking as input a well-defined problem and a means of evaluation, our system acts as a systematic code-optimizing research engine: it can propose novel methodological and architectural concepts, implement them as executable code and empirically validate their performance. It then searches and iterates through thousands of code variants, using tree search to optimize performance. We tested our system using six benchmarks representing distinct multidisciplinary challenges, spanning the fields of genomics, public health, geospatial analysis, neuroscience, time-series forecasting, and numerical analysis. Our system achieves expert-level performance across all of these benchmarks.
Empirical software and scorable tasks
Scientific research is inherently iterative, often requiring researchers to test dozens or hundreds of models or parameters to achieve a breakthrough. Even for scientists who are experienced programmers, coding, debugging, and optimizing software is incredibly time-consuming. Manually coding each new idea is slow and inefficient, making systematic exploration of potential solutions practically impossible.
At the heart of our system lies the foundational concept of empirical software. Unlike conventional software, which is often judged by functional correctness alone, empirical software is designed with a primary objective: to maximize a predefined quality score. A problem or challenge that can be effectively addressed and solved through the application of empirical software is termed a scorable task. These scorable tasks are prevalent across science, applied mathematics, and engineering.
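As a rough illustration, a scorable task can be thought of as a problem statement bundled with data splits and a score function to maximize. The sketch below uses hypothetical names (`ScorableTask`, `evaluate`) that are not from the paper; it only makes the interface concrete.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ScorableTask:
    """Hypothetical container for a scorable task (names are illustrative, not from the paper)."""
    description: str        # natural-language problem statement
    train_data: Any         # data a candidate solution may fit on
    validation_data: Any    # held-out data used to compute the quality score
    score: Callable[[Any, Any], float]  # (predictions, held_out_data) -> score to maximize

def evaluate(task: ScorableTask, predict: Callable[[Any], Any]) -> float:
    """Score one candidate solution: by construction of the task, higher is better."""
    predictions = predict(task.validation_data)
    return task.score(predictions, task.validation_data)
```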
How it works
The input to our system is a scorable task, which includes a problem description, a scoring metric, and data suitable for training, validation, and evaluation. A user can also provide context, such as ideas from external literature, or directives for methodologies to prioritize.
The system then generates research ideas, including programmatic reproduction, optimization, and recombination of known methods, leading to novel and highly performant approaches. Ideas are implemented as executable code and the system uses a tree search strategy with an upper confidence bound (inspired by AlphaZero) to create a tree of software candidates and decide which candidates warrant further exploration. It then uses an LLM to rewrite the code to attempt to improve its quality score, and can exhaustively and tirelessly carry out solution searches at an unprecedented scale, identifying high-quality solutions quickly, reducing exploration time from months to hours or days. Its outputs, as coded solutions, are verifiable, interpretable and reproducible.
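To make the search loop concrete, here is a minimal sketch of tree search with an upper-confidence-bound selection rule over code candidates. The node structure, the exploration constant, and the `rewrite_with_llm` / `run_and_score` hooks are illustrative assumptions, not the system's actual implementation.

```python
import math

class Node:
    """One candidate program in the search tree."""
    def __init__(self, code, parent=None):
        self.code = code
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_score = 0.0

def ucb(node, c=1.4):
    """Upper confidence bound: favor high-scoring nodes, but keep exploring rarely-visited ones."""
    if node.visits == 0:
        return float("inf")
    mean = node.total_score / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def search(root_code, rewrite_with_llm, run_and_score, iterations=1000):
    """Grow a tree of code variants and return the best (score, code) pair found."""
    root = Node(root_code)
    best = (run_and_score(root_code), root_code)
    root.visits, root.total_score = 1, best[0]
    for _ in range(iterations):
        # Selection: walk down the tree, always taking the child with the highest UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: ask the LLM to rewrite the selected candidate's code.
        child = Node(rewrite_with_llm(node.code), parent=node)
        node.children.append(child)
        # Evaluation: execute the rewritten code and measure its quality score.
        score = run_and_score(child.code)
        best = max(best, (score, child.code))
        # Backpropagation: propagate the score up to the root.
        while child is not None:
            child.visits += 1
            child.total_score += score
            child = child.parent
    return best
```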
Demonstrated effectiveness
The evaluation of code generating AI systems has historically focused on tasks derived from competitive programming or software engineering, which, while valuable, fail to capture the full spectrum of challenges inherent in scientific discovery. We demonstrate proficiency not merely in writing syntactically correct code, but in generating novel solutions to six diverse and challenging benchmark problems that push the boundaries of current computational methods and human expertise. The diversity of these benchmarks allows us to collectively assess proficiency in areas such as zero-shot generalization, high-dimensional signal processing, uncertainty quantification, semantic interpretation of complex data, and systems-level modeling. The top scoring solutions to each of these benchmark problems are openly available for anyone interested in reproducing our results, including as an interactive website to explore the full candidate solution trees.
Genomics: Batch integration of single cell RNA sequencing data
Single-cell RNA sequencing (scRNA-seq) is a powerful technology that provides a high-resolution view of gene expression at the individual cell level. A major challenge required to jointly analyze many disparate datasets is to remove complex batch effects present across samples while preserving true biological signals. Nearly 300 tools exist to perform batch integration of scRNA-seq data, and multiple benchmarks have been developed for assessing metrics of batch effect removal and conservation of biological variability. Using the OpenProblems V2.0.0 batch integration benchmark, which combines 13 metrics into one overall score, our system discovered 40 novel methods that outperformed top expert-developed methods. The highest-scoring solution achieved a 14% overall improvement over the best published method (ComBat) by successfully combining two existing methods (ComBat and BBKNN).
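For readers who want to see what combining ComBat and BBKNN looks like in practice, the pipeline below is an illustrative scanpy sketch (assuming a `batch` column in `adata.obs` and the optional `bbknn` and `leidenalg` packages); it is not the system's discovered solution.

```python
import scanpy as sc

# An AnnData object with raw counts and a `batch` column in adata.obs (assumption).
adata = sc.read_h5ad("scrna_dataset.h5ad")

# Standard preprocessing.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch")
adata = adata[:, adata.var.highly_variable].copy()

# Step 1: ComBat removes batch effects directly from the expression matrix.
sc.pp.combat(adata, key="batch")

# Step 2: BBKNN builds a batch-balanced neighborhood graph on the corrected PCA space.
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.external.pp.bbknn(adata, batch_key="batch")

# Downstream embedding and clustering then use the batch-balanced graph.
sc.tl.umap(adata)
sc.tl.leiden(adata)
```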
Public health: Prediction of U.S. COVID-19 hospitalizations
The primary U.S. benchmark for COVID-19 forecasting is the COVID-19 Forecast Hub (CovidHub), a large collaborative effort coordinated by the Centers for Disease Control and Prevention (CDC). CovidHub attracts competitive and methodologically diverse submissions from dozens of expert-led teams. Their task is to forecast new COVID-19 hospitalizations across all the U.S. states and its territories for up to a month ahead. These forecasts are evaluated using average weighted interval score (WIS), which assesses the quality of probabilistic forecasts by summarizing a model's performance across all locations for every weekly prediction over the season. Individual submissions are then aggregated into the CovidHub Ensemble model, which is considered the gold standard in the U.S. for forecasting COVID-19 hospitalizations. Our system generated 14 models that outperform the official CovidHub Ensemble.
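For reference, the weighted interval score for a single forecast can be computed from the predictive median and a set of central prediction intervals. The function below follows the standard WIS definition and is a generic sketch, not CovidHub's evaluation code.

```python
import numpy as np

def weighted_interval_score(y, median, lowers, uppers, alphas):
    """Weighted interval score for a single observation y (lower is better).

    lowers/uppers: bounds of the central (1 - alpha) prediction intervals,
    one pair per value in `alphas`.
    """
    lowers, uppers, alphas = map(np.asarray, (lowers, uppers, alphas))
    K = len(alphas)
    # Interval score for each interval: width plus penalties for observations outside it.
    widths = uppers - lowers
    below = (2.0 / alphas) * np.clip(lowers - y, 0.0, None)
    above = (2.0 / alphas) * np.clip(y - uppers, 0.0, None)
    interval_scores = widths + below + above
    # Weights alpha_k / 2 for each interval and 1/2 for the absolute error of the median.
    weighted = np.sum((alphas / 2.0) * interval_scores) + 0.5 * abs(y - median)
    return weighted / (K + 0.5)
```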
Geospatial analysis: Segmentation of remote sensing images
Semantic segmentation of high-resolution remote sensing images is a common problem in geospatial analysis, and is essential for diverse applications, ranging from monitoring land use and assessing the environmental impacts of human activity to managing natural disasters. This task, which involves accurately assigning class labels to individual pixels in an image, requires a model to develop a spatial and contextual understanding of the scene, identifying not just what objects are present, but precisely where their boundaries lie.
Using the dense labeling remote sensing dataset (DLRSD) benchmark, which evaluates methods using a mean intersection over union (mIoU), the top three solutions generated by our system are slightly better than current state of the art, with mIoU greater than 0.80. All three solutions build upon existing models, libraries and strategies. Two leverage standard UNet++ and U-Net models but paired with powerful encoders pre-trained on ImageNet. The third uses SegFormer, a state of the art Transformer-based architecture. All three employ extensive test-time augmentation (TTA).
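As an illustration of this kind of recipe (not the system's generated code), a UNet++ with an ImageNet-pretrained encoder plus flip-based test-time augmentation could be assembled with segmentation_models_pytorch as below; the ResNet-50 encoder and the 17-class output for DLRSD are assumptions made for the example.

```python
import torch
import segmentation_models_pytorch as smp

# UNet++ with a ResNet-50 encoder pre-trained on ImageNet (encoder choice is illustrative).
model = smp.UnetPlusPlus(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=3,
    classes=17,  # assumed number of DLRSD label classes; check the dataset definition
)
model.eval()

@torch.no_grad()
def predict_with_tta(images: torch.Tensor) -> torch.Tensor:
    """Average class probabilities over simple flip-based test-time augmentations."""
    flips = [
        lambda x: x,                    # identity
        lambda x: torch.flip(x, [-1]),  # horizontal flip
        lambda x: torch.flip(x, [-2]),  # vertical flip
    ]
    probs = 0.0
    for f in flips:
        logits = model(f(images))
        # Flips are their own inverse, so applying f again undoes the spatial transform.
        probs = probs + torch.softmax(f(logits), dim=1)
    return (probs / len(flips)).argmax(dim=1)  # per-pixel class labels
```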
Neuroscience: Whole-brain neural activity prediction
We applied our method to the Zebrafish Activity Prediction Benchmark (ZAPBench), a recent benchmark for forecasting the activity of over 70,000 neurons across an entire vertebrate brain. Our system discovered a novel time-series forecasting model that achieved state-of-the-art performance, surpassing all existing baselines. This includes a computationally intensive, video-based model that forecasts 3D volumes and was the previous top performing solution. As a proof of concept, we also demonstrated that our system can design hybrid models that incorporate a biophysical neuron simulator (Jaxley), paving the way for more interpretable predictive models.
While each of these examples is compelling in its own right, our system for generating empirical software is striking in its generalizability. We additionally evaluated our system in the context of mathematics on the task of numerical evaluation of difficult integrals. In this task, our system generated a solution that correctly evaluated 17 out of 19 held-out integrals on which the standard numerical method failed. Lastly, we evaluated our system on the general problem of time series forecasting, using the General Time Series Forecasting Model Evaluation (GIFT-Eval), a benchmark derived from 28 datasets spanning seven diverse domains, with 10 different frequencies, from seconds to years. Our system successfully created a unified, general-purpose forecasting library from scratch by hill climbing with a single codebase on the average mean absolute scaled error across the entire GIFT-Eval dataset. See the paper for more details.
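For context, the mean absolute scaled error used as the hill-climbing objective scales the forecast error by the in-sample error of a seasonal-naive baseline. The helper below is a generic sketch of that metric, not the GIFT-Eval harness.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute scaled error: forecast MAE divided by the in-sample MAE of a
    seasonal-naive forecast (predicting the value observed `season` steps earlier)."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae
```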
Conclusion
Recent advances in LLMs have already given researchers worldwide new ways to easily engage with knowledge and ideas, and LLMs are increasingly being pursued as a means of automating the rote and toilsome aspects of scientific research. We explored whether LLMs could be useful for the ubiquitous, essential, and highly time-consuming task of producing custom software for evaluating and iteratively improving scientific hypotheses, motivated by the possibility of a future where scientists can easily, rapidly, and systematically investigate hundreds or thousands of potential solutions to the questions and problems that motivate their research. Our system quickly generates expert-level solutions reducing the time required for exploration of a set of ideas from months to hours or days. This promises to save significant time for scientists, from students to professors, to focus on truly creative and critical challenges, and to continue to define and prioritize the fundamental research questions and societal challenges that scientific research can help address.
Acknowledgements
We thank and acknowledge the contributions from all of the co-authors of the manuscript. Thanks to Shibl Mourad, John Platt, Erica Brand, Katherine Chou, Ronit Levavi Morad, Yossi Matias, and James Manyika for their support and leadership.