
Snapseed adds interactive on-device image segmentation

Published by qimuai · First-hand translation



Source: https://research.google/blog/introducing-interactive-on-device-segmentation-in-snapseed/

Summary:

[Tech brief] Google's Snapseed adds AI-powered smart selection, bringing real-time, precise on-device masking to mobile photo editing

On October 1, 2025, Google Cloud software engineers Ben Hahn and Florian Kübler announced Object Brush, a new feature in Snapseed for iOS. Built on on-device interactive segmentation, the tool lets everyday users make professional-grade, precise selections with a single stroke, marking a new stage of intelligent, interactive editing on mobile.

Selectively adjusting a person, the sky, or a specific object has long been a pain point of mobile photo editing: imprecise touch input and limited processing power make fine-grained edits difficult. The new Object Brush is built on Google's Interactive Segmenter AI model. Running on the MediaPipe framework with LiteRT GPU acceleration, it needs only a light stroke over an object to detect the full object and generate an accurate mask in under 20 milliseconds, producing real-time selections with natural boundaries for people, pets, or clouds alike.

To get there, the team adopted the Big Transfer training approach: a high-accuracy teacher model was first fine-tuned on roughly 30,000 precisely annotated images spanning more than 350 object categories, and its knowledge was then distilled, using a large weakly annotated dataset of about 2 million images, into a compact model purpose-built for on-device use that balances accuracy and speed. By splitting image encoding and interactive decoding into two sub-models, the system retains the semantic understanding of a large model while reaching a 7.4-millisecond decoder inference latency (measured on an iPhone 16 Pro).

Notably, the model predicts masks at 768x768 resolution and applies edge-preserving joint bilateral upsampling to scale them up to the image resolution, supporting output up to 4K. The feature is available now in Snapseed's "Adjust" tool and is expected to reach more editing tools in the coming months. The same underlying technology already powers AI image editing in the Gallery app on the new Chromebook Plus 14, and Google says it will continue integrating it into more creative tools.

(End of summary)


Original article:

Introducing interactive on-device segmentation in Snapseed
October 1, 2025
Ben Hahn and Florian Kübler, Software Engineers, Google Cloud
A novel mobile technology that facilitates real-time image segmentation, thereby improving the user experience for photo editing within Snapseed.
The key to elevating a good photo often lies in selective image adjustments: brightening a subject in the foreground, enhancing the sky, or making the color of a jacket pop. Yet, isolating specific elements with existing tools that offer subject, background, sky, or color-based selections has remained a frustrating and complex endeavor. This challenge has been particularly acute on mobile devices, where imprecise touch input and limited processing have made detailed selections and edits very difficult.
Now, we have made object-based image adjustments quick and easy. The new Object Brush in Snapseed on iOS, accessible in the "Adjust" tool, now lets you edit objects intuitively. It allows you to simply draw a stroke on the object you want to edit and then adjust how you want it to look, separate from the rest of the image. Give it a try as we roll this new capability out in the coming week!
Intuitive editing through interactive on-device segmentation
At its core, Object Brush is powered by our Interactive Segmenter, a powerful AI model that runs entirely on device. With a simple gesture — just a tap or tracing a quick line — you can choose an object or person in the frame. The model will then immediately detect and select the complete object or person, in less than 20ms. The model generates a mask for the object, which accurately matches its boundaries, whether it's a person, a pet, or the clouds in the sky. This real-time feedback lets you refine your selection on the fly, easily adding or subtracting areas until it's just right. This entire process is powered by MediaPipe and LiteRT’s GPU acceleration for a fast and seamless experience.
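For readers who want to experiment with this interaction pattern, MediaPipe ships a public interactive segmentation task that follows the same tap-to-mask flow. The sketch below uses that public task, not Snapseed's internal Interactive Segmenter; the model file, photo path, and tap coordinates are placeholders, and the class names should be verified against the installed MediaPipe version.

```python
# A minimal sketch using MediaPipe's public interactive segmentation task (Python),
# not Snapseed's internal Interactive Segmenter. Model path, image path, and tap
# coordinates are placeholders.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.tasks.python.components import containers

RegionOfInterest = vision.InteractiveSegmenterRegionOfInterest
NormalizedKeypoint = containers.keypoint.NormalizedKeypoint

options = vision.InteractiveSegmenterOptions(
    base_options=python.BaseOptions(model_asset_path="magic_touch.tflite"),
    output_category_mask=True,
)

with vision.InteractiveSegmenter.create_from_options(options) as segmenter:
    image = mp.Image.create_from_file("photo.jpg")
    # Simulate a tap at normalized coordinates (0.6, 0.4) on the object of interest.
    roi = RegionOfInterest(
        format=RegionOfInterest.Format.KEYPOINT,
        keypoint=NormalizedKeypoint(x=0.6, y=0.4),
    )
    result = segmenter.segment(image, roi)
    mask = result.category_mask.numpy_view()   # uint8 per-pixel category mask
    print(mask.shape, mask.dtype)
```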
This powerful fusion of a simple, intuitive user interface with an effective and efficient machine learning model makes advanced photo editing more accessible, enjoyable, and more precise than ever before, all running seamlessly on your own device.
Training the Interactive Segmenter model
The Interactive Segmenter model is designed to be a universally capable segmentation model, not limited to any specific class of objects or scenes. To avoid having to annotate large amounts of data to cover all areas, we chose to follow the Big Transfer approach and use a general pre-trained image encoder for pseudo-annotation to complement small amounts of manually annotated images.
Teacher for Interactive Segmenter
We started with a pre-trained and highly-generalizable model, fine-tuned for interactive segmentation. We took samples for 350+ different object categories and asked annotators to precisely annotate object masks with pixel-perfect quality. Through this process, we obtained ~30,000 high-quality image masks for these categories. While insufficient for direct training of a small mobile model, large pre-trained models can successfully be fine-tuned on this data to predict high accuracy masks. Using this dataset we trained an interactive segmentation model, which we call “Interactive Segmenter: Teacher”.
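As a rough, hypothetical illustration of this stage, the sketch below fine-tunes a pre-trained backbone with a small prompt-aware mask head on pixel-accurate masks. The backbone, head, loss, and hyperparameters are stand-ins chosen for the example; the blog does not disclose the teacher's architecture or training recipe.

```python
# Hypothetical sketch of the teacher fine-tuning stage: a large pre-trained encoder
# plus a small prompt-aware mask head, trained on pixel-accurate object masks.
# Architecture, loss, and hyperparameters are illustrative, not Google's recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

encoder = torchvision.models.resnet50(weights="IMAGENET1K_V2")
encoder = nn.Sequential(*list(encoder.children())[:-2])      # conv feature map, stride 32

mask_head = nn.Sequential(                                    # tiny decoder for the sketch
    nn.Conv2d(2048 + 2, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 1, kernel_size=1),
)

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(mask_head.parameters()), lr=1e-4)

def predict(images, prompts):
    """images: (B, 3, H, W); prompts: (B, 2, H, W) rasterized fg/bg scribble maps."""
    feats = encoder(images)
    p = F.interpolate(prompts, size=feats.shape[-2:], mode="bilinear")
    logits = mask_head(torch.cat([feats, p], dim=1))
    return F.interpolate(logits, size=images.shape[-2:], mode="bilinear")

def train_step(images, prompts, gt_masks):
    """gt_masks: (B, 1, H, W) float masks in {0, 1}."""
    loss = F.binary_cross_entropy_with_logits(predict(images, prompts), gt_masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```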
Interactive Segmenter: Teacher produces high-quality segmentation masks; however, its speed and size hinder its use in on-device scenarios. To overcome this challenge, we developed “Interactive Segmenter: Edge”, a specialized model tailored for on-device use cases by leveraging the knowledge distilled from the original Interactive Segmenter: Teacher model.
Distillation
Since the on-device model is significantly smaller, it has limited generalization capabilities, and the 30,000 annotated images we used for fine-tuning aren't sufficient to train a new model. At the same time, the small model size implies we won't see significant gains from pre-training on different domains or tasks.
For knowledge transfer from Interactive Segmenter: Teacher to Interactive Segmenter: Edge, we need millions of images and realistic prompts for a diverse range of object categories. So, we leveraged a large, weakly annotated dataset, which contains over 2 million images with masks across hundreds of different categories.
Prompt generation
The segmentation masks in the distillation dataset are not pixel-perfect, because they were generated through automated or semi-automated procedures, and are not ideal for training high-quality segmenters. Nevertheless, they are suitable for creating realistic prompts for interactive segmentation. In this process, the ground truth mask is produced on-the-fly by Interactive Segmenter: Teacher, which acts as a teacher model in a process known as knowledge distillation. Importantly, both the teacher as well as the student model use the same prompts during training, ensuring consistency across models.
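A minimal sketch of that distillation loop is shown below, assuming `teacher` and `student` are callables mapping (images, prompts) to mask logits, like `predict` in the previous sketch: the teacher's soft mask is computed on the fly and used as the target, and both models receive exactly the same prompts.

```python
# Hedged sketch of the distillation stage: the teacher produces the target mask on the
# fly for each (image, prompts) pair, and the student learns to match it. The loss
# choice is illustrative.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, optimizer, images, prompts):
    with torch.no_grad():
        target = torch.sigmoid(teacher(images, prompts))     # soft teacher mask
    student_logits = student(images, prompts)                # same prompts as the teacher
    loss = F.binary_cross_entropy_with_logits(student_logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```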
We attempt to simulate a user selecting objects in an image. We draw random scribbles within the (eroded) ground truth mask to get foreground prompts (i.e., what the user wants to select, shown in red in the image below) and random scribbles outside the ground truth mask to get background prompts (i.e., what the user explicitly does not want to select, shown in blue). We simulate tapping by drawing random points as well as random scribbles. Furthermore, to support lasso selection we also expose the model during training to box prompts around an object.
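The snippet below sketches that simulation with NumPy and SciPy: erode the weak mask, sample foreground prompts inside it, background prompts outside it, and derive a box prompt from the mask's bounding box. Point counts and erosion size are arbitrary illustrative choices, and isolated points stand in for drawn scribbles.

```python
# Illustrative prompt simulation: sample foreground/background prompts and a box prompt
# from a (weak) ground-truth object mask. Erosion size and point counts are arbitrary,
# and isolated points stand in for drawn scribbles.
import numpy as np
from scipy import ndimage

def simulate_prompts(mask, n_fg=3, n_bg=3, erosion=5, seed=0):
    """mask: (H, W) boolean object mask. Returns fg points, bg points, and a box."""
    rng = np.random.default_rng(seed)
    eroded = ndimage.binary_erosion(mask, iterations=erosion)
    inside = np.argwhere(eroded) if eroded.any() else np.argwhere(mask)
    outside = np.argwhere(~mask)

    fg = inside[rng.choice(len(inside), size=n_fg)]        # foreground prompts (red)
    bg = outside[rng.choice(len(outside), size=n_bg)]      # background prompts (blue)

    rows, cols = np.nonzero(mask)
    box = (cols.min(), rows.min(), cols.max(), rows.max())  # box prompt for lasso selection
    return fg, bg, box
```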
High quality vs. low latency
A central challenge was reconciling the conflicting demands of segmentation quality versus real-time, interactive latency. To reach the right balance, we decouple image and prompt understanding into distinct sub-models. First, a powerful, heavyweight image encoder is run once per image to extract a rich set of semantic features. This image encoder can be run as soon as the user’s intent to use interactive segmentation becomes apparent, thus effectively hiding the latency from the user. Second, a lightweight interactive encoder-decoder operates on these pre-computed features. This network takes the user's touch prompts and generates the final segmentation mask, executing well under our 20ms budget. This separation into two models allows Interactive Segmenter to harness the image understanding of a large model while delivering the instantaneous responsiveness of a small one.
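At inference time the split looks roughly like the following, using the TensorFlow Lite / LiteRT interpreter API. The .tflite file names, input shapes, and the decoder's (features, prompt) input order are assumptions made for illustration; Snapseed's actual models are not public.

```python
# Hedged sketch of the two-model split at inference time. File names, shapes, and the
# decoder's input order are assumptions; only the caching pattern is the point here.
import time
import numpy as np
import tensorflow as tf   # the TFLite interpreter API is shared with LiteRT

def run(interpreter, inputs):
    for detail, value in zip(interpreter.get_input_details(), inputs):
        interpreter.set_tensor(detail["index"], value)
    interpreter.invoke()
    return [interpreter.get_tensor(d["index"]) for d in interpreter.get_output_details()]

encoder = tf.lite.Interpreter(model_path="image_encoder.tflite")
decoder = tf.lite.Interpreter(model_path="prompt_decoder.tflite")
encoder.allocate_tensors()
decoder.allocate_tensors()

# Heavy step: run once when the user opens the tool, then cache the features.
image = np.zeros(encoder.get_input_details()[0]["shape"], dtype=np.float32)  # placeholder
features = run(encoder, [image])[0]

# Light step: run on every tap or stroke, within the interactive latency budget.
prompt = np.zeros(decoder.get_input_details()[1]["shape"], dtype=np.float32)
start = time.perf_counter()
mask = run(decoder, [features, prompt])[0]
print(f"decoder latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```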
The final student models (encoder + super decoder) are quantized to 8 bits and both run on LiteRT's GPU acceleration with decoder inference latencies of 7.4ms on an iPhone 16 Pro, enabling seamless and intuitive image editing.
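The blog does not say which quantization workflow was used beyond "8 bits"; the sketch below shows a generic post-training quantization pass with the TensorFlow Lite / LiteRT converter, with a hypothetical SavedModel path and random calibration tensors standing in for real data.

```python
# Generic post-training 8-bit quantization sketch with the TensorFlow Lite / LiteRT
# converter. The SavedModel path is hypothetical and the calibration tensors are random;
# a real pipeline would feed representative image/prompt data with the model's true shapes.
import numpy as np
import tensorflow as tf

def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 768, 768, 3).astype(np.float32)]   # placeholder shape

converter = tf.lite.TFLiteConverter.from_saved_model("interactive_segmenter_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
with open("interactive_segmenter_int8.tflite", "wb") as f:
    f.write(converter.convert())
```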
Image-size mask upsampling
To preserve the best image editing quality on high-resolution images, we need high-resolution segmentation masks. To achieve this, we train our segmentation model to predict a mask in 768x768 resolution and further upsample it to image resolution (capped at 4k to have it fit within a single GPU buffer). We use an efficient GPU implementation of the edge-preserving joint-bilateral upsampling method. To improve latency, we only apply upsampling once a user completes a gesture by lifting their finger.
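As a CPU reference for the idea (the production path described above is a GPU implementation), the sketch below upsamples a low-resolution mask to the guide image's resolution using joint bilateral weights: a spatial Gaussian over low-resolution neighbours and a range Gaussian computed on the full-resolution guide image. The sigma values and window radius are illustrative.

```python
# NumPy reference of edge-preserving joint bilateral upsampling. mask_lo is the
# low-res model output (e.g. 768x768), guide_hi the full-resolution image; sigma
# values and window radius are illustrative choices.
import numpy as np

def joint_bilateral_upsample(mask_lo, guide_hi, radius=2, sigma_spatial=1.0, sigma_range=0.1):
    """mask_lo: (h, w) in [0, 1]; guide_hi: (H, W) or (H, W, C) in [0, 1]."""
    H, W = guide_hi.shape[:2]
    h, w = mask_lo.shape
    guide = guide_hi if guide_hi.ndim == 3 else guide_hi[..., None]

    # Continuous low-res coordinates of every high-res pixel centre.
    cy = (np.arange(H)[:, None] + 0.5) * (h / H) - 0.5      # (H, 1)
    cx = (np.arange(W)[None, :] + 0.5) * (w / W) - 0.5      # (1, W)

    num = np.zeros((H, W))
    den = np.zeros((H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            qy = np.clip(np.round(cy) + dy, 0, h - 1).astype(int)   # low-res neighbour rows
            qx = np.clip(np.round(cx) + dx, 0, w - 1).astype(int)   # low-res neighbour cols
            # Spatial weight, measured in low-res pixel units.
            w_spatial = np.exp(-((qy - cy) ** 2 + (qx - cx) ** 2) / (2 * sigma_spatial ** 2))
            # Range weight from the high-res guide: compare each output pixel with the
            # guide sampled where this low-res neighbour falls in the high-res image.
            gy = np.clip(np.round((qy + 0.5) * (H / h) - 0.5).astype(int), 0, H - 1)
            gx = np.clip(np.round((qx + 0.5) * (W / w) - 0.5).astype(int), 0, W - 1)
            diff = guide - guide[gy, gx]
            w_range = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma_range ** 2))
            weight = w_spatial * w_range
            num += weight * mask_lo[qy, qx]
            den += weight
    return num / np.maximum(den, 1e-8)
```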
Conclusion
With the new Interactive Segmenter in Snapseed, image editing has become easier and more powerful than ever. Simple taps and strokes are translated into accurate selections, allowing users to turn their editing ideas into reality. Download Snapseed for iOS here and let your photos shine. Object Brush will be rolled out to more tools in Snapseed in the coming months. The underlying model powers a wide range of image editing and manipulation tasks and serves as a foundational technology for intuitive selective editing. It has also been shipped in the new Chromebook Plus 14 to power AI image editing in the Gallery app. Next, we plan to integrate it across more image and creative editing products at Google.
Acknowledgments
Special thanks to all members who worked on the tech with us: Valentin Bazarevsky, Daniel Fenner, Lutz Justen, Ronald Wotzlaw, Tai-Yu Daniel Pan, Jason Chang, Matthew Harries, Giles Ochs, Jonathan Horsman, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Karthik Raveendran, Matsvei Zhdanovich, Mogan Shieh, Chris Parsons, Jianing Wei, and Matthias Grundmann.
