
Speech-to-Retrieval (S2R): A new approach to voice search

Posted by qimuai · First-hand compilation



Source: https://research.google/blog/speech-to-retrieval-s2r-a-new-approach-to-voice-search/

Summary:

Google's research team recently announced that its new speech-based retrieval technology is now live in Voice Search. The approach, called Speech-to-Retrieval (S2R), breaks through a long-standing technical bottleneck in voice search by retrieving information directly from speech rather than from an intermediate transcript.

Traditional voice search uses cascade modeling: speech is first converted to text, and the text is then used to search. This design has two inherent flaws: small errors in the speech-recognition stage can distort the intent of the query and skew the results, and collapsing audio into a single text string discards disambiguating semantic context. For example, if a user asks by voice about the painting "The Scream" and the system transcribes "scream" as "screen", it returns irrelevant results about screen painting techniques.

The new Speech-to-Retrieval technology makes a fundamental break via a dual-encoder architecture: an audio encoder converts speech directly into a semantically rich vector representation, while a document encoder processes text the same way, so that the vector for a spoken query sits close to the vectors of its relevant documents in the shared representation space. This design lets the system grasp the user's intent directly, skipping the error-prone transcription step.

Experimental results show that on the Simple Voice Questions dataset, which covers 17 languages, the new system's mean reciprocal rank significantly outperforms a traditional ASR-based system and approaches the theoretical upper bound of an idealized cascade model. The technology is now deployed in production search, giving users across many languages more accurate results.

To advance the field as a whole, Google has open-sourced the Simple Voice Questions dataset and the large-scale Massive Sound Embedding Benchmark. Beyond a qualitative improvement in the voice search experience, the work marks a step toward more natural and more intelligent human-machine interaction.

Translated article:

Speech-to-Retrieval (S2R): A new approach to voice search
October 7, 2025
Ehsan Variani and Michael Riley, Research Scientists, Google Research

Voice Search is now powered by our new Speech-to-Retrieval engine, which retrieves answers straight from your spoken query without first converting it to text, resulting in a faster, more reliable search for everyone.

Voice-based web search has been around for a long time and continues to be used by many people, with the underlying technology evolving rapidly to support expanded use cases. Google's initial voice search solution used automatic speech recognition (ASR) to turn the voice input into a text query, then searched for documents matching that text. The weakness of this cascade modeling approach is that even small errors in the speech-recognition phase can completely change the meaning of the query and produce the wrong results.

For example, imagine a user searches by voice for Edvard Munch's famous painting "The Scream". A traditional system first converts the speech to text via ASR and then passes the text to search. If the ASR mistakes the "m" in "scream" for an "n", the system misreads the query as "screen painting" and returns irrelevant results about screen painting techniques instead of Munch's masterpiece.

To avoid such errors, what if the system could bypass transcription entirely and map speech directly to the desired retrieval intent?

Enter Speech-to-Retrieval. At its core, S2R directly interprets a spoken query and retrieves information, skipping the transcription step where errors can creep in. This is a fundamental architectural shift in how machines process human speech: where today's technology focuses on "what words were said", S2R is built to answer the more valuable question, "what information is being sought?" This post uses the Simple Voice Questions (SVQ) dataset to expose the quality gap in today's voice search and shows how the S2R model closes it. SVQ is a collection of short spoken questions in 17 languages across 26 locales, released as a core part of the open Massive Sound Embedding Benchmark (MSEB).

Evaluating the potential of S2R
When a traditional ASR system collapses audio into a single text string, it discards contextual cues that could disambiguate the meaning (information loss), and any early misrecognition is passed on to a search engine that typically cannot correct it (error propagation). The end result is search output that drifts away from the user's intent.

To measure the room for improvement, we designed an experiment simulating ideal ASR: we collected a representative sample of typical voice queries, had human annotators produce "perfect" transcriptions, and then built two search systems for comparison (shown in the chart below):

Human raters judged the quality of both systems' results, with word error rate (WER) measuring ASR quality and mean reciprocal rank (MRR) measuring retrieval effectiveness. Two key findings emerged, as the sketch below illustrates. First, a lower WER does not reliably translate into a higher MRR: what matters is the nature of an error, not merely its presence, and this varies by language. More importantly, across every tested language there is a significant MRR gap between the two systems, revealing the headroom between today's production systems and the theoretical optimum, and with it the potential of S2R models.
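To make the two metrics concrete, here is a minimal sketch in Python, assuming whitespace tokenization and hypothetical example values; it is illustrative only, not the evaluation code behind the study.

```python
# Word error rate (WER) for ASR quality and mean reciprocal rank (MRR)
# for retrieval quality. Example strings and ranks are hypothetical.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant hit."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# One wrong word ("screen" for "scream") is a small WER hit...
print(wer("the scream painting", "the screen painting"))  # ~0.33
# ...yet it can push the relevant document far down the ranking, which is
# why WER and MRR need not move in lockstep.
print(mrr([1, 1, 8]))  # ~0.71, versus 1.0 for perfect retrieval
```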

The S2R architecture: from sound to meaning
Our S2R model uses a dual-encoder architecture: an audio encoder converts raw speech into a rich vector representation of its meaning, while a document encoder represents text the same way. Trained on large volumes of paired speech queries and documents, the system learns to place the vector for an audio query close to the vectors of its relevant documents in the representation space. This lets the model capture the essential intent needed for retrieval directly from the audio, sidestepping the fragile word-for-word transcription of the cascade design; a minimal sketch follows.
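Below is a minimal sketch of such a dual encoder in PyTorch, with an in-batch softmax contrastive loss as one plausible training objective; the encoder internals and the loss are illustrative assumptions, not details disclosed in the post.

```python
import torch
import torch.nn.functional as F

class DualEncoder(torch.nn.Module):
    """Maps audio queries and text documents into one shared vector space."""

    def __init__(self, audio_encoder: torch.nn.Module, doc_encoder: torch.nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g., a speech transformer (assumed)
        self.doc_encoder = doc_encoder      # e.g., a text transformer (assumed)

    def forward(self, audio_batch, doc_batch):
        q = F.normalize(self.audio_encoder(audio_batch), dim=-1)  # query vectors
        d = F.normalize(self.doc_encoder(doc_batch), dim=-1)      # document vectors
        return q, d

def contrastive_loss(q, d, temperature=0.05):
    # Similarity of every query to every document in the batch. The diagonal
    # holds the true (audio, document) pairs; training pulls those together
    # and pushes mismatched pairs apart.
    logits = q @ d.T / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```

Because documents are encoded independently of queries, their vectors can be precomputed and indexed offline, which is what makes this kind of architecture practical for serving.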

How S2R works
When a user speaks a query, the pre-trained audio encoder generates a query vector in real time, and a sophisticated search-and-ranking pipeline rapidly pulls candidates from the index. Take the search for "The Scream": the system first converts the speech into an audio embedding that captures its underlying meaning, then scans the vast document index for high-similarity matches (say, 0.8 for a Wikipedia entry and 0.7 for the Munch Museum website). A ranking system then weighs these similarity scores together with hundreds of other signals, completing the final ordering within milliseconds to surface the most authoritative, reliable results. The sketch below traces this flow.
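The following hypothetical sketch traces that serving flow; the function shape, the index layout, and the similarity scores (0.8 and 0.7, echoing the illustration above) are assumptions, and the real ranking stage combines far more signals.

```python
import numpy as np

def retrieve(query_audio, audio_encoder, doc_titles, doc_vectors, k=2):
    """First-pass retrieval: embed the spoken query and score it against a
    precomputed, row-normalized document embedding matrix."""
    q = audio_encoder(query_audio)          # audio -> query embedding
    q = q / np.linalg.norm(q)
    scores = doc_vectors @ q                # cosine similarity to every document
    top = np.argsort(-scores)[:k]           # best-matching candidates
    return [(doc_titles[i], float(scores[i])) for i in top]

# Hypothetical outcome for the spoken query "the Scream painting":
#   [("Wikipedia: The Scream", 0.8), ("Munch Museum", 0.7)]
# A separate ranking system then re-scores these candidates with hundreds of
# additional signals before the final results are shown, within milliseconds.
```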

Validating S2R performance
Tests on the SVQ dataset (the orange bars in the original post's chart) show that the S2R model significantly outperforms the baseline cascade ASR system and approaches, without yet fully closing the gap to, the ground-truth cascade built on perfect transcriptions.

A new era of voice search has arrived
Voice search powered by S2R is not a theoretical construct; it is already live. In close collaboration with the Search team, Google Research now serves it to users in multiple languages, with accuracy that substantially surpasses the traditional cascade system. To move the field forward together, we have open-sourced the SVQ dataset as part of MSEB, and we invite researchers worldwide to use the data and join the public benchmark in building the next generation of truly intelligent voice interaction.

Acknowledgments
The authors are grateful to everyone who contributed to this project, with special thanks to colleagues including Harvey Abraham, Cyril Allauzen, and Tom Bagby for their technical support, to Mark Simborg for extensive help editing the manuscript, to Kimberly Schwede for the illustrations, and to Miki Watts for valuable assistance.

English source:

Speech-to-Retrieval (S2R): A new approach to voice search
October 7, 2025
Ehsan Variani and Michael Riley, Research Scientists, Google Research
Voice Search is now powered by our new Speech-to-Retrieval engine, which gets answers straight from your spoken query without having to convert it to text first, resulting in a faster, more reliable search for everyone.
Voice-based web search has been around a long time and continues to be used by many people, with the underlying technology evolving rapidly to allow for expanded use cases. Google’s initial voice search solution used automatic speech recognition (ASR) to turn the voice input into a text query, and then searched for documents matching that text query. However, a challenge with this cascade modeling approach is that any slight errors in the speech recognition phase can significantly alter the meaning of the query, producing the wrong results.
For example, imagine someone does a voice-based web search for the famous painting, “The Scream”, by Edvard Munch. The search engine uses the typical approach of cascade modeling, first converting the voice query to text via ASR before passing the text to the search system. Ideally, the ASR transcribes the query perfectly. The search system then receives the correct text — “the Scream painting” — and provides relevant results, like the painting’s history, its meaning, and where it’s displayed. However, what if the ASR system mistakes the “m” of “scream” for an “n”? It misinterprets the query as “screen painting” and returns irrelevant results about screen painting techniques instead of details about Munch's masterpiece.
To prevent such errors in web search systems, what if the system could map directly from speech to the desired retrieval intent, bypassing the textual transcription entirely?
Enter Speech-to-Retrieval (S2R). At its core, S2R is a technology that directly interprets and retrieves information from a spoken query without the intermediate, and potentially flawed, step of having to create a perfect text transcript. It represents a fundamental architectural and philosophical shift in how machines process human speech. Where today's common voice search technologies are focused on the question, "What words were said?", S2R is designed to answer a more powerful question: "What information is being sought?" This post explores the substantial quality gap in current voice search experiences and demonstrates how the S2R model is poised to fill it. In addition, we are open-sourcing the Simple Voice Questions (SVQ) dataset, a collection of short audio questions recorded in 17 different languages and 26 locales, which we used to evaluate the performance potential of S2R. The SVQ dataset is part of the new Massive Sound Embedding Benchmark (MSEB).
Evaluating the potential of S2R
When a traditional ASR system converts audio into a single text string, it may lose contextual cues that could help disambiguate the meaning (i.e., information loss). If the system misinterprets the audio early on, that error is passed along to the search engine, which typically lacks the ability to correct it (i.e., error propagation). As a result, the final search result may not reflect the user's intent.
To investigate this relationship, we conducted an experiment designed to simulate an ideal ASR performance. We began by collecting a representative set of test queries reflecting typical voice search traffic. Crucially, these queries were then manually transcribed by human annotators, effectively creating a "perfect ASR" scenario where the transcription is the absolute truth.
We then established two distinct search systems for comparison (see chart below):
