Speech-to-Retrieval (S2R): A new approach to voice search
Source: https://research.google/blog/speech-to-retrieval-s2r-a-new-approach-to-voice-search/
Summary:
Google researchers have announced that a new speech-to-retrieval technology is now powering Voice Search. The approach, called Speech-to-Retrieval (S2R), moves past a long-standing limitation of traditional voice search by retrieving information directly from the spoken query.
Traditional voice search uses a cascade modeling approach: speech is first transcribed into text, and the text is then used for retrieval. This design has two inherent weaknesses: small errors in the speech recognition stage can distort the intent of the query and skew the results, and collapsing audio into a single text string discards semantic context. For example, if a user asks about the painting "The Scream" and the system mishears "scream" as "screen", it returns irrelevant results about screen painting techniques instead.
The new Speech-to-Retrieval technology makes a fundamental change through a dual-encoder architecture: an audio encoder converts speech directly into a semantically rich vector representation, while a document encoder does the same for text, so that a spoken query's vector lies close to the vectors of relevant documents in the representation space. This design lets the system capture the user's intent directly from audio, skipping the error-prone transcription step.
Experiments on the Simple Voice Questions dataset, which covers 17 languages, show that the new system's mean reciprocal rank is significantly better than that of a traditional ASR-based cascade system and approaches the theoretical upper bound set by an idealized cascade model. The technology is already deployed in production Search, giving users in multiple languages more accurate results.
To help the field advance together, Google has open-sourced the Simple Voice Questions dataset and the Massive Sound Embedding Benchmark. Beyond a step change in the voice search experience, the work points toward more natural, more intelligent human-computer interaction.
English source:
Speech-to-Retrieval (S2R): A new approach to voice search
October 7, 2025
Ehsan Variani and Michael Riley, Research Scientists, Google Research
Voice Search is now powered by our new Speech-to-Retrieval engine, which gets answers straight from your spoken query without having to convert it to text first, resulting in a faster, more reliable search for everyone.
Voice-based web search has been around a long time and continues to be used by many people, with the underlying technology evolving rapidly to allow for expanded use cases. Google’s initial voice search solution used automatic speech recognition (ASR) to turn the voice input into a text query, and then searched for documents matching that text query. However, a challenge with this cascade modeling approach is that any slight errors in the speech recognition phase can significantly alter the meaning of the query, producing the wrong results.
For example, imagine someone does a voice-based web search for the famous painting, “The Scream”, by Edvard Munch. The search engine uses the typical approach of cascade modeling, first converting the voice query to text via ASR before passing the text to the search system. Ideally, the ASR transcribes the query perfectly. The search system then receives the correct text — “the Scream painting” — and provides relevant results, like the painting’s history, its meaning, and where it’s displayed. However, what if the ASR system mistakes the “m” of “scream” for an “n”? It misinterprets the query as “screen painting” and returns irrelevant results about screen painting techniques instead of details about Munch's masterpiece.
To prevent such errors in web search systems, what if the system could map directly from speech to the desired retrieval intent, bypassing the textual transcription entirely?
Enter Speech-to-Retrieval (S2R). At its core, S2R is a technology that directly interprets and retrieves information from a spoken query without the intermediate, and potentially flawed, step of having to create a perfect text transcript. It represents a fundamental architectural and philosophical shift in how machines process human speech. Where today's common voice search technologies are focused on the question, "What words were said?", S2R is designed to answer a more powerful question: "What information is being sought?" This post explores the substantial quality gap in current voice search experiences and demonstrates how the S2R model is poised to fill it. In addition, we are open-sourcing the Simple Voice Questions (SVQ) dataset, a collection of short audio questions recorded in 17 different languages and 26 locales, which we used to evaluate the performance potential of S2R. The SVQ dataset is part of the new Massive Sound Embedding Benchmark (MSEB).
Evaluating the potential of S2R
When a traditional ASR system converts audio into a single text string, it may lose contextual cues that could help disambiguate the meaning (i.e., information loss). If the system misinterprets the audio early on, that error is passed along to the search engine, which typically lacks the ability to correct it (i.e., error propagation). As a result, the final search result may not reflect the user's intent.
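To make these two failure modes concrete, a cascade pipeline can be sketched as two independent stages connected only by a single text string; the model objects and method names below are illustrative placeholders, not an actual API.

```python
def cascade_voice_search(audio, asr_model, retrieval_system):
    """Sketch of the cascade design: the only signal handed to retrieval is
    the 1-best transcript, so acoustic ambiguity is discarded (information
    loss) and any recognition mistake is inherited downstream (error
    propagation) by a retrieval stage that cannot correct it.
    """
    transcript = asr_model.transcribe(audio)    # e.g. "screen painting" instead of "scream painting"
    return retrieval_system.search(transcript)  # retrieval never sees the original audio evidence
```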
To investigate this relationship, we conducted an experiment designed to simulate an ideal ASR performance. We began by collecting a representative set of test queries reflecting typical voice search traffic. Crucially, these queries were then manually transcribed by human annotators, effectively creating a "perfect ASR" scenario where the transcription is the absolute truth.
We then established two distinct search systems for comparison (see chart below):
- Cascade ASR represents a typical real-world setup, where speech is converted to text by an automatic speech recognition (ASR) system, and that text is then fed to a retrieval system.
- Cascade groundtruth simulates a "perfect" cascade model by sending the flawless ground-truth text directly to the same retrieval system.
The retrieved documents from both systems (cascade ASR and cascade groundtruth) were then presented to human evaluators, or "raters", alongside the original true query. The evaluators were tasked with comparing the search results from both systems, providing a subjective assessment of their respective quality.
We use word error rate (WER) to measure ASR quality and mean reciprocal rank (MRR) to measure search performance. MRR evaluates any process that produces a list of possible responses to a query, ordered by probability of correctness: it is the average, across all queries, of the reciprocal of the rank of the first correct answer. The difference in MRR and WER between the real-world system and the groundtruth system reveals the potential performance gains across some of the most commonly used voice search languages in the SVQ dataset (shown below).
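Before turning to the comparison, it may help to pin down the MRR computation with a minimal sketch; the query and document IDs below are made up for illustration and are not drawn from the evaluation set.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average over queries of 1 / rank of the first correct answer.

    ranked_results: dict of query id -> list of document ids, ordered from
        most to least likely according to the system being evaluated.
    relevant: dict of query id -> set of correct document ids.
    A query whose correct answer never appears contributes 0 to the average.
    """
    total = 0.0
    for qid, docs in ranked_results.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Illustrative data: correct answer ranked 1st, 3rd, and not retrieved at all.
ranked = {"q1": ["d1", "d2"], "q2": ["d5", "d6", "d3"], "q3": ["d9"]}
gold = {"q1": {"d1"}, "q2": {"d3"}, "q3": {"d7"}}
print(mean_reciprocal_rank(ranked, gold))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```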
The results of this comparison lead to two critical observations. First, and as can be seen by comparing both charts above, we found that a lower WER does not reliably lead to a higher MRR across different languages. The relationship is complex, suggesting that the impact of transcription errors on downstream tasks is not fully captured by the WER metric. The specific nature of an error — not just its existence — appears to be a critical, language-dependent factor. Second, and more importantly, there’s a significant MRR difference between the two systems across all tested languages. This reveals a substantial performance gap between current cascade designs and what is theoretically possible with perfect speech recognition. This gap represents the clear potential for S2R models to fundamentally improve voice search quality.
The architecture of S2R: From sound to meaning
At the heart of our S2R model is a dual-encoder architecture. This design features two specialized neural networks that learn from vast amounts of data to understand the relationship between speech and information. An audio encoder processes the raw audio of a query, converting it into a rich vector representation that captures its semantic meaning. In parallel, a document encoder learns a similar vector representation for documents.
The key to this model is how it is trained. Using a large dataset of paired audio queries and relevant documents, the system learns to adjust the parameters of both encoders simultaneously.
The training objective ensures that the vector for an audio query is geometrically close to the vectors of its corresponding documents in the representation space. This architecture allows the model to learn something closer to the essential intent required for retrieval directly from the audio, bypassing the fragile intermediate step of transcribing every word, which is the principal weakness of the cascade design.
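The post does not give training details beyond this description, but the idea maps naturally onto a standard dual-encoder contrastive objective with in-batch negatives; the sketch below (encoder outputs, embedding size, temperature) is an assumed illustration, not the production model.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(audio_emb, doc_emb, temperature=0.05):
    """Pull each audio query embedding toward its paired document embedding
    and push it away from the other documents in the batch.

    audio_emb: [batch, dim] tensor from the audio encoder.
    doc_emb:   [batch, dim] tensor from the document encoder; row i is the
               document paired with audio query i.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = audio_emb @ doc_emb.T / temperature                        # pairwise cosine similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)  # matching pairs on the diagonal
    # Symmetric loss: audio -> document and document -> audio retrieval.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

Minimizing a loss of this form adjusts both encoders so that a query vector ends up geometrically close to the vectors of its relevant documents, which is exactly the property the retrieval step relies on.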
How the S2R model works
When a user speaks a query, the audio is streamed to the pre-trained audio encoder, which generates a query vector. This vector is then used to efficiently identify a highly relevant set of candidate results from our index through a complex search ranking process.
The animation above illustrates how S2R understands and answers a spoken query. It starts with a user's voice request for “The Scream painting”. An audio encoder translates the sound into a rich audio embedding, a vector that represents the deep meaning of the query. This embedding is then used to scan a massive index of documents, surfacing initial candidates with high similarity scores, like the Wikipedia page for “The Scream” (0.8) and the Munch Museum website (0.7).
But finding relevant documents is just the beginning. The crucial final step is orchestrated by the search ranking system. This powerful intelligence goes far beyond the initial scores, weaving them together with hundreds of other signals to deeply understand relevance and quality. It weighs all this information in a fraction of a second to choreograph the final ranking, ensuring the most helpful and trustworthy information is presented to the user.
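As a rough illustration of the first retrieval stage described above (the subsequent ranking system is far more involved and is not modeled here), a brute-force similarity scan over precomputed document embeddings could look like the following; `audio_encoder` and its `embed` method are hypothetical placeholders.

```python
import numpy as np

def retrieve_candidates(audio_encoder, spoken_query, doc_embeddings, doc_ids, top_k=10):
    """First-stage retrieval: embed the spoken query, then rank documents by
    cosine similarity. `doc_embeddings` is a [num_docs, dim] matrix produced
    offline by the document encoder; `doc_ids` labels its rows.
    """
    q = audio_encoder.embed(spoken_query)                        # hypothetical audio encoder call
    q = q / np.linalg.norm(q)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = docs @ q                                            # cosine similarity per document
    top = np.argsort(-scores)[:top_k]                            # e.g. Wikipedia 0.8, Munch Museum 0.7
    return [(doc_ids[i], float(scores[i])) for i in top]
```

At web scale the linear scan would be replaced by an approximate nearest-neighbor index, and the returned candidates would then be re-scored by the full ranking system together with its hundreds of other signals.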
Evaluating S2R
We evaluated the S2R system described above on the SVQ dataset:
The S2R model's performance (orange bar) shows two key results:
- It significantly outperforms the baseline cascade ASR model.
- Its performance approaches the upper bound established by the cascade ground truth model.
While promising, the remaining gap indicates that further research is required.
The new era for voice search is now live
The move to S2R-powered voice search isn’t a theoretical exercise; it’s a live reality. In a close collaboration between Google Research and Search, these advanced models are now serving users in multiple languages, delivering a significant leap in accuracy beyond conventional cascade systems.
To help propel the entire field forward, we are also open-sourcing the SVQ dataset as part of the Massive Sound Embedding Benchmark (MSEB). We believe shared resources and transparent evaluation accelerate progress. In that spirit, we invite the global research community to use this data, test new approaches on public benchmarks, and join the effort to build the next generation of truly intelligent voice interfaces.
Acknowledgements
The authors sincerely thank all who contributed to this project, whose critical input made it possible. We are especially grateful to our colleagues Hawi Abraham, Cyril Allauzen, Tom Bagby, Karthik Kumar Bandi, Stefan Buettcher, Dave Dopson, Lucy Hadden, Georg Heigold, Sanjit Jhala, Shankar Kumar, Ji Ma, Eyal Mizrachi, Pandu Nayak, Pew Putthividhya, Jungshik Shin, Sundeep Tirumalareddy and Trystan Upstill. We also wish to acknowledge those who helped prepare this post: Mark Simborg for his extensive editing, Kimberly Schwede for the wonderful illustrations, and Mickey Wurts for his valuable assistance.