Using AI to identify genetic variants in tumors with DeepSomatic

Posted by qimuai on 2025-10-17 08:01 (first-hand compilation)

Source: https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/

Summary:

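On October 16, 2025, Google Research announced DeepSomatic, an AI tool described in a paper in Nature Biotechnology. The tool uses deep learning to identify genetic mutations in tumors more accurately than current methods, offering a new option for precision cancer medicine.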

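Cancer is fundamentally a genetic disease: variants that accumulate in tumor cells drive its development. Clinicians now often sequence tumor genomes to analyze these mutations, but separating real variants from sequencing errors remains difficult. DeepSomatic uses a convolutional neural network that converts sequencing data into images and then distinguishes inherited (germline) variants, sequencing errors, and true tumor-specific (somatic) mutations.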

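Working with the UC Santa Cruz Genomics Institute, the National Cancer Institute, and other institutions, the team built CASTLE, a reference dataset for training and evaluation derived from four breast cancer and two lung cancer research cell lines. In testing, DeepSomatic models outperformed existing methods on all three major sequencing platforms (Illumina, PacBio, and Oxford Nanopore), with the largest gains on insertions and deletions (indels): on Illumina data the next-best method reached an F1-score of 80% while DeepSomatic reached 90%, and on PacBio data the next-best method scored below 50% while DeepSomatic scored above 80%.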

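The tool has two notable strengths. First, it works in tumor-only mode, for cases such as leukemia where a matched normal sample is hard to obtain. In an analysis with Children's Mercy in Kansas City of eight previously sequenced pediatric leukemia samples, it identified the previously known variants as well as 10 new ones. Second, it can handle samples preserved by formalin fixation and paraffin embedding (FFPE), which opens the way to analyzing lower-quality or historic tumor samples.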

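DeepSomatic is part of Google's broader efforts to apply AI to cancer, and both the tool and its training dataset are openly available to the research community. According to the team, detecting known variants could help clinicians choose among existing treatments such as chemotherapy and immunotherapy, while identifying new variants could lead to new therapies, furthering the goal of precision medicine.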

English source:

Using AI to identify genetic variants in tumors with DeepSomatic
October 16, 2025
Kishwar Shafin, Technical Lead, and Andrew Carroll, Product Lead, Google Research
DeepSomatic is an AI-powered tool that identifies cancer-related mutations in a tumor’s genetic sequence to help pinpoint what’s driving the cancer.
Cancer is fundamentally a genetic disease in which the genetic controls on cell division go awry. Many types of cancer exist, and each poses unique challenges as it can have distinct genetic underpinnings. A powerful way to study cancer, and a critical step toward creating a treatment plan, is to identify the genetic mutations in tumor cells. Indeed, clinicians will now often sequence the genomes of biopsied tumor cells to inform treatment plans that specifically disrupt how that cancer grows.
With partners at the University of California, Santa Cruz Genomics Institute and other federal and academic researchers, our new paper, “DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies” in Nature Biotechnology presents a tool that leverages machine learning to identify genetic variants in tumor cells more accurately than current methods. DeepSomatic is a flexible model that uses convolutional neural networks to identify tumor variants. It works on data from all major sequencing platforms, for different types of sample processing, and can extend its learning to cancer types not included in training.
We have made both the tool and the high-quality training dataset we created openly available to the research community. This work is part of broader Google efforts to develop AI methods to understand cancer and help scientists treat cancer, including analyzing mammogram images for breast cancer screening, CT scans for lung cancer screening, as well as a partnership aimed at using AI to advance research on gynecological cancers. Our hope is to speed cancer research and further the goal of precision medicine.
Genetic variation acquired after birth
Genome sequencing is used in research and medical clinics to identify genetic variations between an individual and the human reference genome. Distinguishing between real variants and simple errors made during the sequencing process is challenging. That’s why almost a decade ago Google Research introduced DeepVariant to identify inherited variants, also called germline variants, that came from parents and are found in all of the body’s cells.
The genetics of cancer is more complex. Cancer is often driven by variants acquired after birth. Environmental exposure that damages DNA, such as UV light or chemical carcinogens, as well as random errors that occur during DNA replication, can cause cells in the body, known as somatic cells, to acquire new variants. Sometimes, these acquired variants change the normal behavior of cells, and can cause them to replicate when they shouldn’t. This process drives the initial development of cancer, as well as its later progression to more fast-growing and invasive stages.
Identifying variants specific to some of a person’s somatic cells is much harder than identifying inherited variants. Tumor cells can contain a diverse set of acquired variants at different frequencies, and the error rate of sequencing can be higher than the rate a somatic variant is present in a sample.
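To make the last point concrete, here is a rough sketch of how often sequencing errors alone could mimic a low-frequency variant. It assumes independent per-read errors at a fixed rate, which real sequencers do not strictly obey; the function name and the example numbers are illustrative and do not come from the paper.

```python
from math import comb


def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """P(at least k of `depth` reads show a non-reference base by error alone)."""
    # Binomial tail: 1 - P(fewer than k erroneous reads at this site).
    below = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - below


# Hypothetical numbers: 60x coverage and a 1% per-base error rate.
# A somatic variant at 5% allele frequency would contribute about 3 reads.
p = prob_at_least_k_errors(depth=60, k=3, error_rate=0.01)
print(f"chance of >=3 error reads at one site: {p:.4f}")
```

Under these assumed numbers the chance is about 2% per site, which is large once multiplied across billions of genomic positions. A caller therefore needs more evidence than a raw count of mismatching reads.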
Training DeepSomatic to spot genetic variation in tumor cells
We developed DeepSomatic to address these challenges and accurately identify somatic variants. In most clinical and research settings, cancer is studied by sequencing the tumor cells acquired through biopsy, as well as normal cells that are unaffected by the tumor growth and contain more typical inherited genetic variations. DeepSomatic is trained to identify variations observed in tumor cells that are not inherited variants. These types of variations can provide critical insights about which variations are driving the tumor growth. DeepSomatic is also able to identify somatic variation in tumor-only mode where a non-tumor sequence is not available, for example in a blood cancer like leukemia where it is hard to get only normal cells from a blood draw. The ability to extend to different types of use-cases that follow common ways clinicians and researchers study cancer makes DeepSomatic applicable to many research and clinical settings.
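The tumor-normal idea can be illustrated with a deliberately naive baseline: treat anything called in the tumor but not in the matched normal as a somatic candidate. This is not how DeepSomatic works (it classifies candidates with a neural network); the `Variant` type and the example calls are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def naive_somatic_candidates(tumor: set, normal: set) -> set:
    """Variants seen in the tumor sample but absent from the matched normal."""
    return tumor - normal


# Hypothetical calls: the chr1 variant is inherited, so it appears in both samples.
tumor_calls = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr2", 500, "C", "T"),
    Variant("chr3", 42, "G", "GA"),
}
normal_calls = {Variant("chr1", 1000, "A", "G")}

candidates = naive_somatic_candidates(tumor_calls, normal_calls)
print(sorted(candidates))
```

Plain subtraction breaks down when sequencing errors appear in only one sample or when the normal sample has low coverage at a site, which is part of why a learned model is used instead. In tumor-only mode there is no normal set to subtract at all.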
Like our earlier tool, DeepVariant, the DeepSomatic model works by first turning genetic sequencing data into a set of images. The images represent the sequencing data, alignment along the chromosome, the quality of the output, and other variables. DeepSomatic then uses its convolutional neural network on data from tumor cells and non-cancerous cells to differentiate between the reference genome, the non-cancer germline variants in that individual, and the cancer-caused somatic variants in the tumor, while discarding variations caused by small errors acquired during the sequencing process. The result is a list of cancer-related variants, or mutations.
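As a toy illustration of turning aligned reads into an image-like input, the sketch below encodes a small pileup as three channels: base identity, base quality, and mismatch with the reference. The channel choice, the intensity values, and the function name are assumptions made for illustration; this is not the encoding DeepVariant or DeepSomatic actually uses.

```python
# Arbitrary intensities for the base-identity channel (an assumption).
BASE_TO_INTENSITY = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(reference, reads, qualities, max_quality=40):
    """Return channels[channel][read][column] for reads spanning one window."""
    base_channel, quality_channel, mismatch_channel = [], [], []
    for read, quals in zip(reads, qualities):
        if len(read) != len(reference) or len(quals) != len(reference):
            raise ValueError("each read and quality list must span the window")
        base_channel.append([BASE_TO_INTENSITY.get(b, 0.0) for b in read])
        # Clip and scale quality scores into the range [0, 1].
        quality_channel.append([min(q, max_quality) / max_quality for q in quals])
        mismatch_channel.append(
            [1.0 if b != r else 0.0 for b, r in zip(read, reference)]
        )
    return [base_channel, quality_channel, mismatch_channel]


reference = "ACGTA"
reads = ["ACGTA", "ACTTA", "ACTTA"]  # two reads carry a G>T change at column 2
qualities = [[30] * 5, [40, 40, 20, 40, 40], [35] * 5]
image = encode_pileup(reference, reads, qualities)
print(image[2])  # mismatch channel highlights the candidate column
```

A convolutional network can then treat the stacked channels much like a small picture, with one row per read and one column per genomic position.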
Training accurate models that can identify genetic variation for different cancer types requires comprehensive, high-quality data and truth sets. For this work we created a new training and evaluation dataset for detecting variants in tumor cells. With our partners at UC Santa Cruz and the National Cancer Institute, we sequenced tumor cells and accompanying normal cells from four breast cancer samples and two lung cancer samples from research cell lines.
To create an accurate training dataset, we did whole-genome sequencing of these six samples using three leading platforms: Illumina’s short-read sequencing, PacBio’s long-read sequencing, and Oxford Nanopore Technologies’ long-read sequencing. Output from all three platforms was combined to remove platform-specific errors and create a single, accurate reference dataset we call the Cancer Standards Long-read Evaluation dataset (CASTLE) for genetic diversity in tumor and normal cells.
Testing DeepSomatic’s ability to spot cancer-related variants
We trained DeepSomatic on three of the breast cancer genomes and the two lung cancer genomes in the CASTLE reference dataset. We then tested DeepSomatic’s performance in several ways, including on the single breast cancer genome that was not included in its training data, and on chromosome 1 from each sample, which we also excluded from the training.
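The hold-out scheme described above can be sketched as a simple partition: an example goes to the test set if it comes from the held-out sample or from chromosome 1, and to the training set otherwise. The sample labels and record layout below are hypothetical placeholders, not the actual cell line names or data format.

```python
def split_by_holdout(examples, holdout_samples, holdout_chroms):
    """Partition examples so held-out samples and chromosomes never reach training."""
    train, test = [], []
    for example in examples:
        held_out = (
            example["sample"] in holdout_samples
            or example["chrom"] in holdout_chroms
        )
        (test if held_out else train).append(example)
    return train, test


# Hypothetical labels: "breast_4" stands in for the breast cancer genome kept out of training.
examples = [
    {"sample": "breast_1", "chrom": "chr1"},
    {"sample": "breast_1", "chrom": "chr2"},
    {"sample": "lung_1", "chrom": "chr7"},
    {"sample": "breast_4", "chrom": "chr2"},
]
train, test = split_by_holdout(
    examples, holdout_samples={"breast_4"}, holdout_chroms={"chr1"}
)
print(len(train), len(test))
```

Holding out both a whole sample and a whole chromosome checks two kinds of generalization: to a genome the model has never seen, and to genomic regions it has never seen in any sample.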
Results show that DeepSomatic models developed for each of the three major sequencing platforms performed better than other methods, identifying more tumor variants with higher accuracy. The tools used for comparison on short-read sequencing data were SomaticSniper, MuTect2 and Strelka2 (with SomaticSniper specifically for single nucleotide variants, or SNVs). For long-read sequencing data we compared against ClairS, a deep learning model trained on synthetic data.
In our tests DeepSomatic identified 329,011 somatic variants across the six reference cell lines and a seventh preserved sample. DeepSomatic does particularly well at identifying cancer variations that involve insertions and deletions (“Indels”) of genetic code. For these types of variants, DeepSomatic substantially increased the F1-score, a balanced measure of how well the model finds true variants in a sample (recall) while not making false positives (precision). On Illumina sequencing data the next-best method scored 80% at identifying Indels, while DeepSomatic scored 90%. On Pacific Biosciences sequencing data, the next-best method scored less than 50% at identifying Indels, and DeepSomatic scored more than 80%.
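For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below computes it from counts of true positives, false positives, and false negatives; the counts in the example are made up to land on round numbers and are not results from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0  # no correct calls, so precision and recall are both zero
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct indel calls, 10 false calls, 10 missed variants.
score = f1_score(true_positives=90, false_positives=10, false_negatives=10)
print(f"F1 = {score:.2f}")
```

Because it is a harmonic mean, the score is pulled toward the weaker of the two components, so a caller cannot reach a high F1 by finding many variants while also reporting many false ones.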
The seventh sample was one of the previously used research cell lines of a breast cancer tumor that was preserved using formalin fixation and paraffin embedding (FFPE). This common preservation method introduces additional patterns of DNA damage that can complicate genetic analysis. This sample was also sequenced using whole exome sequencing (WES), a more affordable method that focuses only on the roughly 1% of the genome that codes for proteins. When DeepSomatic was trained on these types of sample data and then tested on chromosome 1, which was reserved from training, it again outperformed other tools, suggesting it can be used to identify variants in lower-quality or historic tumor samples, potentially rescuing samples that have been harder to sequence, and working on clinical data where only the exome was sequenced.
Applying DeepSomatic to other cancers
To test DeepSomatic’s performance on other types of cancers, we analyzed a single sample of glioblastoma, an aggressive form of brain cancer that arises from a small number of variants. DeepSomatic was able to pinpoint those variants, showing that it can generalize its learning to apply it to a different cancer type.
We also worked with partners at Children’s Mercy in Kansas City to analyze eight previously sequenced samples of pediatric leukemia, a cancer of the white blood cells that is the most common childhood cancer. Leukemia exists in the bloodstream, so a “normal” non-cancer blood sample is not possible. Despite that challenge, DeepSomatic identified the previously known variants as well as 10 new ones, showing that it can work with a tumor-only sample.
What’s next
Our hope is that research labs and clinicians can begin to use this tool. Detecting known cancer variants could help choose between existing treatments, such as chemotherapy, immunotherapy or other methods. Identifying new cancer variants could potentially lead to brand-new therapies. We hope people can take these tools and learn more about each cancer tumor, find what’s driving it, and ultimately deliver the most effective treatments to patients.
Acknowledgments
We thank all research participants whose participation in research programs and donation of cell lines made this work and other biomedical research possible. We thank our collaborators at UC Santa Cruz Genomics Institute, the National Cancer Institute, the Frederick National Laboratory for Cancer Research, Children’s Mercy Hospital, and NYU. We thank Hannah Hickey for writing contributions. We thank Avinatan Hassidim, Katherine Chou, Lizzie Dorfman, and Yossi Matias for research leadership support. We thank Resham Parikh and Isha Mishra for communications support.
