
从巨型模型到移动端魔法:YouTube实时生成式AI特效背后的技术


内容来源:https://research.google/blog/from-massive-models-to-mobile-magic-the-tech-behind-youtube-real-time-generative-ai-effects/

内容总结:

谷歌研发团队通过知识蒸馏与MediaPipe移动端优化技术,成功将大型生成式AI模型压缩为可实时运行的轻量化版本,使YouTube Shorts实现超20种手机端实时AI特效。该技术采用师生模型架构:以Imagen等大模型作为"教师"生成高质量图像效果,再将其能力蒸馏至专为移动端设计的UNet结构"学生"模型。通过关键调优反演(PTI)技术解决人脸特征失真问题,并利用MediaPipe框架实现毫秒级图像处理,在Pixel 8 Pro和iPhone 13等设备上单帧模型推理耗时约6至10.6毫秒。自2023年应用以来,该技术已支持"万圣节僵尸面具""卡通化滤镜"等多款实时特效,未来还将集成Veo 3等新一代模型并进一步降低低端设备延迟。

中文翻译:

从巨型模型到移动端魔法:YouTube实时生成式AI特效背后的技术
2025年8月21日
Google Cloud软件工程师Andrey Vakunov与YouTube软件工程师Adam Svystun

本文将详解YouTube如何通过知识蒸馏技术与MediaPipe设备端优化方案,在移动设备上实现实时生成式AI特效,在突破算力限制的同时保留用户的面部特征。

快速导读
特效是YouTube Shorts趣味体验的核心要素,但要让特效真正富有魔力,就必须在创作者录制时于相机画面中实时生效。这带来一个挑战:如何把卡通风格转换等大型生成式AI模型的最新能力,搬到创作者的手机上实时运行?

我们构建的解决方案能将大模型能力蒸馏至专注于单一任务的轻量化模型。这种聚焦式设计创造出可直接在手机上逐帧处理视频的紧凑高效模型。通过该方法,我们已为YouTube Shorts创作者推出20余款实时特效。本文将详细解析数据策展、模型训练及设备端部署的全流程。

数据是基石
高质量数据是项目根基。我们首先使用经合规授权的图像构建人脸数据集,并精心筛选确保数据在性别、年龄和肤色(依据Monk肤色量表)维度保持多样性与均匀分布,从而打造适合所有人群的特效。
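作为数据均衡思路的说明,下面给出一个极简的Python草图:按属性组合分组,再将各组下采样到最小组的规模,从而得到近似均匀的联合分布。其中gender、age_bucket、monk_skin_tone等字段名为本文虚构的示意,并非谷歌实际数据管线的格式。

```python
import random
from collections import defaultdict

def balance_dataset(records, keys=("gender", "age_bucket", "monk_skin_tone"), seed=0):
    """按属性组合分组,并将每组下采样到最小组的规模,得到近似均匀的联合分布。
    records 为带标签字段的字典列表;字段名仅为示意,并非实际数据格式。"""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record)
    target = min(len(members) for members in groups.values())  # 以最小组为基准
    rng = random.Random(seed)
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))  # 每组等量采样
    rng.shuffle(balanced)
    return balanced
```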

师生传承模式
我们的核心技术是知识蒸馏,即采用"教师-学生"模型训练框架。"教师模型"是经过预训练的大型生成模型,擅长生成理想视觉效果,但速度远达不到实时要求。教师模型的类型因特效目标而异:最初我们采用在自建数据集上定制训练的StyleGAN2模型用于实时面部特效,并可搭配StyleCLIP等工具实现文本驱动的面部编辑;随着项目推进,我们升级至Google DeepMind的Imagen等更先进的生成模型,显著提升了图像保真度、艺术控制力与风格多样性。

"学生模型"是最终部署在用户设备上的轻量化网络。我们采用擅长图像转换任务的UNet架构,其编码器使用移动端优化的MobileNet主干网,解码器则集成MobileNet模块。

蒸馏教学:迭代式培养
为达到生产级效果,我们开发了稳健的训练方案,以克服纯合成数据蒸馏常见的伪影与高频细节丢失问题:利用真实世界数据生成"图像对"来训练学生模型,并支持更高效的超参数搜索。

训练小型学生模型的蒸馏过程包含两个关键步骤:先由教师模型在真实图像上生成配对的输入-输出样本,再用这些"图像对"训练轻量化的学生模型。
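下面用PyTorch给出上述两步蒸馏流程的训练循环草图。其中teacher_edit代表"教师模型生成目标效果图"的可调用接口,real_images_loader代表真实图像的数据加载器,二者均为本文假设的名称;损失函数也仅以L1为例,实际训练通常还会叠加感知损失等。

```python
import torch
import torch.nn.functional as F

def distill_student(student, teacher_edit, real_images_loader,
                    epochs=10, lr=1e-4, device="cpu"):
    """蒸馏训练示意:教师模型对真实图像生成目标效果图("图像对"),学生模型学习复现。
    teacher_edit为假设的可调用接口:输入一批真实图像,返回施加特效后的目标图像。"""
    student = student.to(device).train()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for real in real_images_loader:          # real: [B, 3, H, W],取值[0, 1]
            real = real.to(device)
            with torch.no_grad():
                target = teacher_edit(real)      # 步骤一:教师生成配对的目标图
            pred = student(real)                 # 步骤二:学生模型在真实输入上预测
            loss = F.l1_loss(pred, target)       # 实际训练通常还会叠加感知损失等
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```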

核心挑战:身份特征保留
图像"编辑"在潜在空间中进行——这是编码图像特征的压缩数值表示空间。原始像素到潜在空间的转换称为"反演"。面部特效生成面临的最大挑战是保持人物身份特征,因为传统方法会重构整个画面,易导致肤色、眼镜或服饰失真(即"反演问题")。

我们通过关键调优反演(Pivotal Tuning Inversion,PTI)技术解决该问题,流程如下(列表之后附有一段示意代码):

  1. 使用编码器将原图像反演为关键潜码(枢轴),据此生成的初始反演结果与原图近似,但存在细节差异
  2. 以该潜码为枢轴,通过PTI迭代流程微调生成器,使其精确重建原图,从而保留面部身份特征
  3. 使用StyleCLIP等技术预先计算的编辑方向向量,对潜码进行编辑
  4. 将编辑后的潜码输入微调后的生成器,输出最终图像
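下面是一段按上述四个步骤组织的PTI式反演与编辑示意代码。encoder、generator与edit_direction均为假设的StyleGAN风格接口和预计算编辑方向,并非某个现成库的真实API;损失项也做了简化,仅用于说明"先围绕枢轴潜码微调生成器、再沿编辑方向出图"的思路。

```python
import copy
import torch
import torch.nn.functional as F

def pivotal_tuning_edit(encoder, generator, image, edit_direction,
                        alpha=1.0, steps=300, lr=3e-4):
    """PTI式反演与编辑的示意流程。encoder/generator为假设的StyleGAN风格接口:
    encoder(image)返回潜码,generator(latent)返回图像;edit_direction为预计算的编辑方向。"""
    with torch.no_grad():
        w_pivot = encoder(image)                    # 步骤1:初始反演,得到枢轴潜码
    g_tuned = copy.deepcopy(generator)              # 步骤2:围绕枢轴潜码微调生成器
    opt = torch.optim.Adam(g_tuned.parameters(), lr=lr)
    for _ in range(steps):
        recon = g_tuned(w_pivot)
        loss = F.l1_loss(recon, image)              # 实际PTI通常还会加LPIPS等感知损失
        opt.zero_grad()
        loss.backward()
        opt.step()
    w_edit = w_pivot + alpha * edit_direction       # 步骤3:沿预设方向编辑潜码
    with torch.no_grad():
        return g_tuned(w_edit)                      # 步骤4:用微调后的生成器输出最终图像
```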

设备端部署:Google AI Edge的MediaPipe
模型训练完成后,还需将其集成到手机端的高效处理管道中。我们采用Google AI Edge开源的跨平台多模态ML框架MediaPipe构建设备端解决方案,处理流程如下(列表之后附有一段示意代码):

  1. MediaPipe面部网格模块检测视频流中的面部
  2. 根据面部关键点计算稳定且随头部旋转对齐的面部裁剪区域(确保学生模型输入的一致性)
  3. 将裁剪图像转换为张量输入轻量化学生模型
  4. 模型施加特效(如微笑或卡通风格)后,将处理图像实时映射回原始视频帧
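下面用MediaPipe的Python接口与TFLite解释器拼出一个单帧处理的示意版本:检测面部、裁剪、送入学生模型、再贴回原帧。其中student_effect.tflite为假设的模型文件名,裁剪逻辑也做了大幅简化(省略了稳定化与旋转对齐);真实管线由MediaPipe图在移动端GPU上完成,并非这段Python代码。

```python
import cv2
import numpy as np
import mediapipe as mp
import tensorflow as tf

# student_effect.tflite 为假设的文件名,代表已转换为TFLite格式的学生模型
interp = tf.lite.Interpreter(model_path="student_effect.tflite")
interp.allocate_tensors()
inp, out = interp.get_input_details()[0], interp.get_output_details()[0]
face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def apply_effect(frame_bgr, size=256):
    """单帧处理示意:检测面部 -> 裁剪 -> 学生模型推理 -> 贴回原帧。"""
    h, w = frame_bgr.shape[:2]
    result = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return frame_bgr                              # 未检测到面部则原样返回
    pts = result.multi_face_landmarks[0].landmark     # 归一化的面部关键点
    xs = [int(p.x * w) for p in pts]
    ys = [int(p.y * h) for p in pts]
    x0, x1 = max(min(xs), 0), min(max(xs), w)
    y0, y1 = max(min(ys), 0), min(max(ys), h)
    crop = cv2.cvtColor(cv2.resize(frame_bgr[y0:y1, x0:x1], (size, size)), cv2.COLOR_BGR2RGB)
    interp.set_tensor(inp["index"], crop.astype(np.float32)[None] / 255.0)
    interp.invoke()                                   # 学生模型推理(假设输入输出均为[0,1]的RGB)
    styled = (interp.get_tensor(out["index"])[0] * 255).clip(0, 255).astype(np.uint8)
    styled = cv2.cvtColor(styled, cv2.COLOR_RGB2BGR)
    frame_bgr[y0:y1, x0:x1] = cv2.resize(styled, (x1 - x0, y1 - y0))  # 贴回原始视频帧
    return frame_bgr
```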

为保障30FPS流畅体验,管道需在33毫秒内完成单帧处理。在Google Tensor G3(Pixel 8 Pro)和iPhone 13 GPU上的模型推理延迟分别为6毫秒与10.6毫秒。我们针对多移动设备进行深度优化,通过GPU加速确保用户体验流畅。
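帧预算可以用简单算术核对:30FPS对应每帧约33.3毫秒,模型推理占6至10.6毫秒,余量留给面部检测、裁剪与合成渲染。下面是一个粗略的预算核对小函数,仅作演示,真实评测应在目标设备的GPU上进行。

```python
import time

FRAME_BUDGET_MS = 1000 / 30   # 30FPS下每帧约33.3毫秒的预算

def check_frame_budget(process_frame, frame, budget_ms=FRAME_BUDGET_MS):
    """粗略测量单帧处理耗时并与预算比较;process_frame为任意单帧处理函数。"""
    start = time.perf_counter()
    process_frame(frame)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms, elapsed_ms <= budget_ms
```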

成果:移动创意新纪元
自2023年起,该技术已成为YouTube Shorts的核心支撑,成功推出表情驱动特效(如"永不眨眼")、万圣节主题面具(如"复活僵尸")及沉浸式全帧特效(如"卡通2.0"),极大拓展了视频创作者的表达空间。

通过弥合大型生成模型与移动硬件之间的技术鸿沟,我们正在重新定义设备端实时生成特效的技术边界。这仅是起点——我们正积极集成Veo 3等新一代模型,大幅降低入门设备延迟,让更多用户能在YouTube Shorts中享受尖端生成式AI技术。

致谢
感谢合作者:Sarah Xu, Maciej Pęśko, Paweł Andruszkiewicz, Jacob Rockwell, Ronny Votel, Robert (Guohui) Wang, Tingbo Hou, Karthik Raveendran, Jianing Wei, Matthias Grundmann, Omer Tov, Ariel Ephrat, Shiran Zada, Inbar Mosseri。

英文来源:

From massive models to mobile magic: The tech behind YouTube real-time generative AI effects
August 21, 2025
Andrey Vakunov, Software Engineer, Google Cloud, and Adam Svystun, Software Engineer, YouTube
We detail how YouTube delivers real-time generative AI effects on mobile devices by using knowledge distillation and on-device optimization with MediaPipe to overcome computational limitations while preserving user identity.
Quick links
Effects are a huge part of the fun on YouTube Shorts, but for them to feel magical, they need to work in real-time in the camera as the creator is recording. This presents a challenge: how do we apply the latest capabilities of large generative AI models, such as cartoon style transfer, on creators' phones?
Our solution is a pipeline that distills the capability of a large model into a much smaller one focused on a single task. This narrowing of scope creates a compact, efficient model that can run directly on a phone, processing video frame-by-frame. Using this method, we've launched over 20 real-time effects for YouTube creators on Shorts. In this post, we'll detail how we accomplish this: including data curation, training, and the on-device setup.
It all starts with data
The foundation of our work is high-quality data. We began by building a face dataset using properly licensed images. We meticulously filtered our datasets to ensure they were diverse and uniformly distributed across different genders, ages, and skin tones (as measured by the Monk Skin Tone Scale) to build effects that work well for everyone.
The teacher and the student
Our approach revolves around a concept called knowledge distillation, which uses a "teacher–student" model training method. We start with a "teacher" — a large, powerful, pre-trained generative model that is an expert at creating the desired visual effect but is far too slow for real-time use. The type of teacher model varies depending on the goal. Initially, we used a custom-trained StyleGAN2 model, which was trained on our curated dataset for real-time facial effects. This model could be paired with tools like StyleCLIP, which allowed it to manipulate facial features based on text descriptions. This provided a strong foundation. As our project advanced, we transitioned to more sophisticated generative models like Google DeepMind’s Imagen. This strategic shift significantly enhanced our capabilities, enabling higher-fidelity and more diverse imagery, greater artistic control, and a broader range of styles for our on-device generative AI effects.
The "student" is the model that ultimately runs on the user’s device. It needs to be small, fast, and efficient. We designed a student model with a UNet-based architecture, which is excellent for image-to-image tasks. It uses a MobileNet backbone as its encoder, a design known for its performance on mobile devices, paired with a decoder that utilizes MobileNet blocks.
Distillation: Iteratively teaching the student
To achieve production-ready effects, we developed a robust training methodology that addresses the limitations of synthetic data distillation, which often leads to artifacts and reduced high-frequency details. Our approach leverages real-world data to generate "image pairs" and train student models to enable a more efficient hyperparameter search.
The distillation process for training the smaller student model involves two key steps: the teacher model is first run over real-world images to produce paired input-output examples, and the compact student model is then trained on those image pairs.
