Small models, big results: Achieving superior intent extraction through decomposition

Summary:
Google researchers propose a new intent-understanding framework: small models match much larger ones by decomposing the task into steps
On January 22, 2026, Google software engineers Danielle Cohen and Yoni Halpern published research introducing a method for efficient on-device user intent understanding based on small multimodal LLMs (MLLMs). By decomposing the complex task of understanding a user's interaction trajectory into two stages, the approach enables lightweight models to match the performance of far larger models, with lower cost, higher speed, and better privacy potential.
Understanding the intent behind a user's actions on a phone or other device interface is a core challenge in building genuinely intelligent assistants. Large multimodal models perform well on this task, but their reliance on cloud-server processing brings high latency, high cost, and potential privacy risks.
In the paper "Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition", published at EMNLP 2025, the team proposes a decomposed workflow. A small model first summarizes each interaction with a single screen element independently, producing a summary of the screen context, the user's action, and a speculation about the underlying intent; a second fine-tuned small model then distills the final user intent for the whole trajectory from the sequence of summaries.
Several techniques improve results: fine-tuning on publicly available automation datasets teaches the models to pick out key information; preprocessing the training labels to remove details not present in the summaries effectively reduces hallucination; and although the stage-one model produces intent speculations, deliberately ignoring them when extracting the final intent in stage two improves overall accuracy.
For evaluation, the study introduces the "Bi-Fact" method, which splits reference and predicted intents into indivisible "atomic facts" and compares them, yielding exact precision, recall, and F1 scores and allowing errors to be traced to the stage where they arise.
Experiments show that this decomposition strategy outperforms standard chain-of-thought prompting and end-to-end fine-tuning on both mobile-device and web interaction trajectories. Notably, the 8-billion-parameter Gemini 1.5 Flash model combined with this method matches the much larger Gemini 1.5 Pro while offering clear cost and speed advantages.
The work demonstrates that, with sensible task decomposition, small models can understand user intent efficiently and accurately on device, laying the groundwork for more privacy-preserving, responsive assistive features on mobile devices.
English source:
Small models, big results: Achieving superior intent extraction through decomposition
January 22, 2026
Danielle Cohen and Yoni Halpern, Software Engineers, Google
We present a novel approach to understanding user intents from UI interaction trajectories using small models, which achieves better results than significantly larger models.
As AI technologies advance, truly helpful agents will become capable of better anticipating user needs. For experiences on mobile devices to be truly helpful, the underlying models need to understand what the user is doing (or trying to do) when users interact with them. Once current and previous tasks are understood, the model has more context to predict potential next actions. For example, if a user previously searched for music festivals across Europe and is now looking for a flight to London, the agent could offer to find festivals in London on those specific dates.
Large multimodal LLMs are already quite good at understanding user intent from a user interface (UI) trajectory. But using LLMs for this task would typically require sending information to a server, which can be slow, costly, and carries the potential risk of exposing sensitive information.
Our recent paper “Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition”, presented at EMNLP 2025, addresses the question of how to use small multimodal LLMs (MLLMs) to understand sequences of user interactions on the web and on mobile devices all on device. By separating user intent understanding into two stages, first summarizing each screen separately and then extracting an intent from the sequence of generated summaries, we make the task more tractable for small models. We also formalize metrics for evaluation of model performance and show that our approach yields results comparable to much larger models, illustrating its potential for on-device applications. This work builds on previous work from our team on user intent understanding.
Details
We introduce a decomposed workflow for user intent understanding from user interactions. At inference time the model performs two main steps. In the first step each individual interaction on a single screen and UI element is summarized independently. Next, those summaries are used as a series of events to predict the general intent of the entire UI trajectory.
Individual screen summaries
At the first stage, every individual interaction is summarized by a small multimodal LLM.
Given a sliding window of three screens (previous, current, next), the model is asked the following questions:
- What is the relevant screen context? Give a short list of salient details on the current screen.
- What did the user just do? Provide a list of actions that the user took in this interaction.
- Speculate. What is the user trying to accomplish with this interaction?
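The sliding-window questioning above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `call_mllm` is a hypothetical stand-in for whatever small multimodal LLM API is used, and the prompt wording is paraphrased from the three questions.

```python
# Sketch of stage one: summarize each interaction using a
# (previous, current, next) sliding window of screens.

SCREEN_SUMMARY_PROMPT = """\
You are given three consecutive screens of a UI trajectory
(previous, current, next) and the action the user took.

1. What is the relevant screen context? Give a short list of
   salient details on the current screen.
2. What did the user just do? Provide a list of actions that
   the user took in this interaction.
3. Speculate. What is the user trying to accomplish with this
   interaction?
"""

def summarize_interactions(screens, call_mllm):
    """Summarize every interaction independently with a small MLLM.

    `call_mllm(prompt, window)` is a hypothetical model call that
    returns a text summary for one interaction.
    """
    summaries = []
    for i, current in enumerate(screens):
        window = [
            screens[i - 1] if i > 0 else None,               # previous screen
            current,                                          # current screen
            screens[i + 1] if i + 1 < len(screens) else None, # next screen
        ]
        summaries.append(call_mllm(SCREEN_SUMMARY_PROMPT, window))
    return summaries
```

The first and last interactions simply get `None` in place of the missing neighbor, so every summary is produced from the same prompt shape.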
Intent extraction from summaries
In this stage, a fine-tuned small model is used to extract a single-sentence intent from the screen summaries.
We find that the following techniques are helpful.
- Fine-tuning: Giving examples of what a “good” intent statement looks like helps the model focus on the important parts of the summaries and drop the non-useful ones. We use publicly available automation datasets for training data, since they have good examples that pair intent with sequences of actions.
- Label Preparation: Because the summaries may be missing information, if we train with the full intents, we inadvertently teach the model to fill in details that aren’t present (i.e., to hallucinate). To avoid this, we first remove any information that doesn’t appear in the summaries from the training intents (using a separate LLM call).
- Dropping speculations: Giving the model a specified place to output its speculations on what the user is trying to do helps create a more complete step summary in stage one, but can confuse the intent extractor in stage two. So we do not use the speculations during the second stage. While this may seem counterintuitive — asking for speculations in the first stage only to drop them in the second — we find this helps improve performance.
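As an illustration of the speculation-dropping step, here is a minimal sketch. It assumes stage-one summaries are plain text with a `Speculation:` heading, which is an invented format for illustration, not necessarily the one used in the paper.

```python
# Sketch: strip the speculation section from each stage-1 summary
# before handing the summaries to the stage-2 intent extractor.

def drop_speculation(summary: str) -> str:
    """Keep only the context and action parts of a step summary.

    Assumes the speculation is the final section, introduced by a
    literal "Speculation:" heading (an illustrative convention).
    """
    head, _sep, _speculation = summary.partition("Speculation:")
    return head.rstrip()

def build_stage2_input(summaries):
    """Concatenate speculation-free summaries as a numbered event list."""
    steps = [
        f"Step {i + 1}: {drop_speculation(s)}"
        for i, s in enumerate(summaries)
    ]
    return "\n".join(steps)
```

The speculations thus shape what stage one writes down, but never reach the stage-two extractor.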
Evaluation approach
We use the Bi-Fact approach to evaluate the quality of a predicted intent against a reference intent. With this approach, we use a separate LLM call to split the reference and predicted intents into details of the intent that cannot be broken down further, which we call “atomic facts”. For example, “a one-way flight” would be an atomic fact, while “a flight from London to Kigali” would be two. We then count the number of reference facts that are entailed by the predicted intent and the number of predicted facts that are entailed by the reference intent. This enables us to know the precision (how many of the predicted facts are correct) and recall (how many of the true facts we correctly predicted) of our method and to calculate the F1 score.
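The precision/recall/F1 arithmetic behind Bi-Fact can be sketched as follows. In the paper, both the splitting of intents into atomic facts and the entailment judgments are made by separate LLM calls; here both are abstracted as inputs, with `entails` a hypothetical judge function, so only the scoring itself is shown.

```python
# Sketch of the Bi-Fact scoring arithmetic over atomic facts.

def bi_fact_scores(reference_facts, predicted_facts, entails):
    """Compute precision, recall, and F1 over atomic facts.

    `entails(fact, other_facts)` is a hypothetical judge that returns
    True if `fact` is entailed by the other intent's facts (an LLM
    call in the paper; any callable here).
    """
    if not reference_facts or not predicted_facts:
        return 0.0, 0.0, 0.0
    # Precision: predicted facts that are supported by the reference.
    correct = sum(entails(f, reference_facts) for f in predicted_facts)
    # Recall: reference facts that are covered by the prediction.
    recalled = sum(entails(f, predicted_facts) for f in reference_facts)
    precision = correct / len(predicted_facts)
    recall = recalled / len(reference_facts)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With exact string matching as a stand-in judge, a reference of {"one-way flight", "from London", "to Kigali"} against a prediction of {"from London", "to Kigali", "in March"} gives precision, recall, and F1 of 2/3 each.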
Working with atomic facts also helps to track how the different stages of the decomposed approach contribute to errors. Below we show how we analyze the flow of facts through the system to track missed details and hallucinations at each stage.
Key results
The decomposed approach of summarizing each screen separately and then extracting an intent from the sequence of generated summaries is helpful when using small models. We compare it against standard approaches, including chain-of-thought prompting (CoT) and end-to-end fine-tuning (E2E), and find that it outperforms both. This result holds for both mobile-device and web trajectories, and for both Gemini and Qwen2 base models. We even find that applying the decomposed approach with the Gemini 1.5 Flash 8B model achieves results comparable to Gemini 1.5 Pro at a fraction of the cost and latency. See the paper for additional experiments.
Conclusions & future directions
We have shown that a decomposed approach to trajectory summarization can be helpful for intent understanding with small models. Ultimately, as models improve in performance and mobile devices acquire more processing power, we hope that on-device intent understanding can become a building block for many assistive features on mobile devices going forward.
Acknowledgements
Thank you to our paper coauthors: Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, and Anatoly Efros.