Collaborative image generation
Source: https://research.google/blog/a-collaborative-approach-to-image-generation/
Summary:
Google develops a new AI image-generation tool: human-AI collaboration achieves "creative alignment"
Guy Tennenholtz, Senior Research Scientist, and Craig Boutilier, Principal Scientist at Google Research, recently announced an innovation called PASTA (Preference Adaptive and Sequential Text-to-image Agent). Using reinforcement learning, the technique lets an AI continuously learn an individual's preferences over multiple turns of dialogue, significantly improving how well generated images match the user's creative intent.
Traditional text-to-image generation has a clear limitation: after entering a prompt, users get an image that approximates what they expected, yet simple prompt adjustments rarely realize the picture in their mind. By modeling the human decision process, PASTA upgrades the one-shot instruction into a multi-turn creative dialogue: the user selects the candidate image closest to their expectation, the AI refines subsequent proposals accordingly, and the two form a dynamic, collaborative creation loop.
To make this work, the team adopted a two-track training strategy: it first collected over 7,000 real user interaction sequences as a foundational dataset, then built a user simulator that generated more than 30,000 simulated interaction trajectories. The system's novel user model, built on pre-trained CLIP encoders, identifies user types with distinct aesthetic leanings (such as preferences for animals, landscapes, or abstract art), allowing the AI to adapt quickly to individual needs.
In testing, the researchers found that models trained solely on real data or solely on simulated data failed to beat the baseline, while the model trained on a mix of both performed best. In the final comparison, 85% of users preferred PASTA's generated images. The system proved especially adaptable with abstract concepts: given an open-ended instruction like "an image of love," it generated works ranging from tender portraits to geometric abstract art, depending on the user type.
This work signals that generative AI is evolving toward interactive, adaptive, and collaborative systems. Google has open-sourced the relevant datasets to support further academic study. The results show that by simulating human preferences and interaction behavior, AI can more precisely understand and realize users' creative intent.
English source:
A collaborative approach to image generation
October 2, 2025
Guy Tennenholtz, Senior Research Scientist, and Craig Boutilier, Principal Scientist, Google Research
We introduce PASTA, a reinforcement learning agent that refines text-to-image output over multiple turns of interaction with a user by learning their unique preferences. This process is made possible by a novel user simulation technique.
You have a perfect image in your mind. You enter a prompt, hit generate, and the result is close to what you were thinking, but not quite right. You try refining the prompt, adding more detail, but you can't seem to bridge the gap between your idea and the final image.
This is a common experience. While text-to-image (T2I) models are incredibly powerful, they often struggle to capture the nuance and specificity of an individual's unique creative intent given just a single prompt. What if we could turn image generation into a collaborative conversation?
In this post, we describe our research on the “Preference Adaptive and Sequential Text-to-image Agent” (PASTA), a reinforcement learning (RL) agent that collaborates with users to progressively refine T2I results. This approach eliminates the need for users to rely on trial-and-error prompt refinement to reach a desirable image. Through human evaluations, we created a novel dataset of sequential preferences, which we then used to compare PASTA with a baseline state-of-the-art model. The results demonstrated that PASTA, trained with our mix of real and simulated data, consistently produced images that users rated as more satisfying. We’ve also released our foundational dataset with a collection of over 7,000 human rater interactions with PASTA.
How PASTA works
To effectively train an AI agent to adapt to a user's individual preferences, a large, diverse set of interaction data is needed. However, gathering this data from real users is challenging due to several factors, including user privacy. To address this, we trained PASTA using a two-stage strategy that combines real human feedback with large-scale user simulation.
First, we collected a high-quality foundational dataset with over 7,000 raters' sequential interactions. These interactions included prompt expansions generated by a Gemini Flash large multimodal model and corresponding images generated by a Stable Diffusion XL (SDXL) T2I model. This initial seed of authentic preference data was then used to train a user simulator, designed to generate additional data that replicate real human choices and preferences.
At the heart of our method is a user model, comprising two key components: 1) a utility model that predicts the degree to which a user will like any set of images, and 2) a choice model that predicts which set of images they will select when presented with several sets. We constructed the user model using pre-trained CLIP encoders and added user-specific components. We trained the model using an expectation-maximization algorithm that allows us to simultaneously learn the specifics of user preferences while also discovering latent “user types,” that is, clusters of users with similar tastes (e.g., tendencies to prefer images with animals, scenic views, or abstract art).
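To make these two components concrete, here is a minimal PyTorch sketch of how a utility model and a choice model could fit together. It assumes images arrive as precomputed CLIP embeddings and that user types are discrete latent indices; the class names, dimensions, and architecture are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two-component user model, assuming precomputed
# CLIP image embeddings and discrete latent user types. Hypothetical
# architecture; not the authors' implementation.
import torch
import torch.nn as nn

class UtilityModel(nn.Module):
    """Scores how much a user of a given latent type likes an image."""
    def __init__(self, clip_dim=768, num_user_types=8, type_dim=32):
        super().__init__()
        self.type_embedding = nn.Embedding(num_user_types, type_dim)
        self.scorer = nn.Sequential(
            nn.Linear(clip_dim + type_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, image_emb, user_type):
        # image_emb: (batch, clip_dim); user_type: (batch,) int64 indices
        z = self.type_embedding(user_type)
        return self.scorer(torch.cat([image_emb, z], dim=-1)).squeeze(-1)

class ChoiceModel(nn.Module):
    """Predicts which image in a slate the user selects: softmax over utilities."""
    def __init__(self, utility: UtilityModel):
        super().__init__()
        self.utility = utility

    def forward(self, slate_embs, user_type):
        # slate_embs: (batch, slate_size, clip_dim)
        b, k, d = slate_embs.shape
        utils = self.utility(
            slate_embs.reshape(b * k, d),
            user_type.repeat_interleave(k),
        ).reshape(b, k)
        return torch.softmax(utils, dim=-1)  # per-slate choice probabilities
```

An EM-style training loop (not shown) would alternate between inferring a distribution over latent user types for each rater, given their observed choices, and refitting the utility and choice parameters under that distribution.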
The trained user simulator can provide feedback on generated images, express preferences, and make selections from sets of proposed images. This allows us to generate over 30,000 simulated interaction trajectories. Our approach does more than just create more data; it gives us a controlled environment in which to explore a vast range of user behaviors so we can train the PASTA agent to effectively collaborate with users.
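As an illustration, a single simulated trajectory could be rolled out roughly as follows; `agent`, `candidate_generator`, and `embed_image` are hypothetical stand-ins for the slate-selection policy, the LMM prompt expander, and a pipeline that renders an expansion with the T2I model and returns the image's CLIP embedding.

```python
# A hedged sketch of rolling out one simulated interaction trajectory.
# All interface names are illustrative assumptions.
import torch

def simulate_trajectory(agent, candidate_generator, choice_model, embed_image,
                        user_type, prompt, num_turns=5, slate_size=4):
    trajectory = []
    for _ in range(num_turns):
        expansions = candidate_generator(prompt)            # proposed expansions
        slate = agent.select_slate(expansions, slate_size)  # agent picks a slate
        # Render each expansion and embed the resulting image with CLIP.
        embs = torch.stack([embed_image(e) for e in slate])
        probs = choice_model(embs.unsqueeze(0), torch.tensor([user_type]))
        pick = torch.multinomial(probs.squeeze(0), 1).item()  # simulated choice
        trajectory.append((prompt, list(slate), pick))
        prompt = slate[pick]                                # condition next turn
    return trajectory
```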
With this robust, data-driven foundation, the PASTA agent is trained to effectively engage with arbitrary users to generate images that match their preferences. The agent itself is a value-based reinforcement learning model that learns to select the best "slate" of prompt expansions (i.e., elaborations of the current prompt used to generate subsequent images) to show the user at each turn. Its goal is to maximize the user's cumulative satisfaction over the entire interaction.
Once PASTA is trained and deployed, a user initiates the engagement with an initial prompt. PASTA first uses a candidate generator (a large multimodal model) to create a diverse set of potential prompt expansions. Then, a candidate selector (our trained RL agent) selects the optimal slate of four such expansions, which are used to generate corresponding images to present to the user. The user selects the image that is closest to their vision, which provides feedback that guides PASTA's next set of suggestions. This collaborative back-and-forth allows the model to learn the user's preferences on the fly, steering the creative process toward their ideal goal with each step.
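The sketch below captures that deployment loop under stated assumptions: `generate_expansions`, `q_value`, `generate_image`, and `get_user_pick` stand in for the LMM, the trained Q-network, SDXL, and the user interface, and scoring each expansion independently before taking the top four is a simplification of true slate-level selection.

```python
# A hedged sketch of the deployed interaction loop. Greedy top-k scoring
# is a simplification of slate selection; all callables are assumptions.
def pasta_session(initial_prompt, generate_expansions, q_value,
                  generate_image, get_user_pick, num_turns=5, slate_size=4):
    state, prompt = [initial_prompt], initial_prompt
    for turn in range(num_turns):
        candidates = generate_expansions(prompt)        # diverse expansions
        ranked = sorted(candidates,
                        key=lambda c: q_value(state, c), reverse=True)
        slate = ranked[:slate_size]                     # best slate of four
        images = [generate_image(c) for c in slate]
        pick = get_user_pick(images)                    # user feedback
        prompt = slate[pick]
        state = state + [prompt]                        # history as state
    return images[pick]
```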
Putting PASTA to the test
To evaluate our approach, we trained PASTA as a value-based reinforcement learning agent using implicit Q-learning (IQL). We specifically wanted to see how the use of different training data impacted performance. We created three versions of the agent: 1) trained only on the real volunteer-rater data, 2) trained only on the simulated data, and 3) trained on a combination of real and simulated datasets.
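For reference, the core IQL losses look roughly like the sketch below; the expectile parameter, discount factor, and network interfaces are illustrative defaults rather than the paper's settings.

```python
# A minimal sketch of implicit Q-learning (IQL) losses for offline training.
# Hyperparameters and network signatures are illustrative assumptions.
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 from IQL: |tau - 1(diff < 0)| * diff^2
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, v_net, target_q_net, batch, gamma=0.99, tau=0.7):
    s, a, r, s_next, done = batch  # a batch of offline transitions
    with torch.no_grad():
        q_target = target_q_net(s, a)
    # V regresses toward an upper expectile of Q over dataset actions.
    v_loss = expectile_loss(q_target - v_net(s), tau)
    with torch.no_grad():
        target = r + gamma * (1 - done) * v_net(s_next)
    q_loss = (q_net(s, a) - target).pow(2).mean()
    return q_loss, v_loss
```

The expectile regression lets the value function approximate a high quantile of the Q-values at each state without ever querying actions outside the offline dataset, which is what makes IQL well suited to training on fixed interaction logs like these.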
We then ran a series of human evaluations comparing these agents to a baseline model (i.e., base Gemini Flash and SDXL models with no additional training) across four metrics: accuracy over the Pick-a-Pic dataset, Spearman’s rank correlation, choice model accuracy, and cross-turn accuracy. Pick-a-Pic accuracy and Spearman's rank correlation assess the model's ability to predict user preferences and rankings on existing, large-scale, single-turn datasets. Choice model accuracy and cross-turn accuracy measure the model's ability to predict which image a user will choose at a given turn and whether the selected images are an improvement over the previous turn, respectively.
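Under assumed data layouts, three of these four metrics could be computed as in the sketch below; the paper's exact evaluation protocol may differ.

```python
# Hedged sketches of three evaluation metrics under assumed data layouts.
import numpy as np
from scipy.stats import spearmanr

def choice_model_accuracy(pred_choice_probs, user_choices):
    # pred_choice_probs: (n_turns, slate_size); user_choices: (n_turns,) indices
    return float(np.mean(pred_choice_probs.argmax(axis=1) == user_choices))

def cross_turn_accuracy(selected_utilities):
    # selected_utilities: one list per trajectory holding the (estimated)
    # utility of the chosen image at each turn; counts turn-over-turn gains.
    improvements = [u[t + 1] > u[t]
                    for u in selected_utilities
                    for t in range(len(u) - 1)]
    return float(np.mean(improvements))

def rank_correlation(model_scores, human_scores):
    # Spearman's rank correlation on single-turn preference scores.
    return spearmanr(model_scores, human_scores)[0]
```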
The results demonstrated that training PASTA on synthetic data alone didn't beat the baseline, and while the agent trained on real human data showed significant improvement, it also didn’t outperform the baseline. However, the agent trained on the combination of both real and simulated data offered the best performance, confirming that our user simulation successfully captures key dynamics of human interaction while providing the scale needed for robust RL training.
When we asked raters to directly compare the final images from our best-performing agent against the baseline, 85% preferred PASTA's generated images. The difference is especially striking with abstract prompts. Starting with a simple idea like "an image of love", PASTA adapted to different user types to create a wide variety of results, from tender portraits to abstract, geometric art.
What's next?
PASTA shows that the future of generative AI can be more interactive, preference-adaptive, and collaborative. The methods we developed, particularly the use of robust user simulators, can be applied to many other generative tasks to create AI that better aligns with and adapts to human users.
To help spur further research, we have open-sourced our sequential rater dataset and our simulated user data. We can't wait to see what the community builds with it.
Acknowledgements
The author list is: Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, and Craig Boutilier. Special thanks to Mark Simborg for his help crafting this blog post and Kimberly Schwede for creating the figures in this post.