
Google's latest AI model can use a web browser the way a person does.

Posted by qimuai | First-hand compilation



Source: https://www.theverge.com/news/795463/google-computer-use-gemini-ai-model-agents

Summary:

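Google has released a preview of a new AI model, Gemini 2.5 Computer Use, that lets AI operate a browser interface the way a person does. Using visual understanding and reasoning, the model can carry out tasks such as filling out and submitting forms, clicking, and scrolling, and it can reach web data that is not available through an API.

The model can be used for UI testing and for practical tasks such as adding items to a shopping cart. It arrives amid intense competition: one day earlier, OpenAI announced new apps for ChatGPT at its developer conference, and Anthropic released a Claude model with "computer use" last year.

Unlike those competitors, Google's model works only inside a browser and does not yet control a full operating system. It currently supports 13 actions, including opening a web browser and dragging and dropping elements. Google says the model "outperforms leading alternatives on multiple web and mobile benchmarks." It remains a developer preview, available through Google AI Studio and Vertex AI, and a Browserbase demo shows it completing tasks such as playing 2048 or browsing Hacker News.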


Translation:

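Google is previewing a new Gemini AI model that navigates and interacts with the web through a browser, letting AI agents carry out tasks inside interfaces designed for people. The model, called Gemini 2.5 Computer Use, uses "visual understanding and reasoning capabilities" to analyze a user's request and complete a task, such as filling out and submitting a web form.

The new Gemini 2.5 Computer Use model can click, scroll, and type in a browser window the way a person does, which gives it access to data that is not available through an API. It can be used for UI testing or for navigating interfaces that were built for people and offer no API or other direct connection. Other versions of the model already power agentic features in AI Mode and Project Mariner, a research prototype in which AI agents carry out tasks on their own in a browser, such as adding items to a cart based on a list of ingredients.

Google's announcement comes one day after OpenAI revealed new apps for ChatGPT at its annual Dev Day. OpenAI also continues to focus on its ChatGPT Agent feature, which can complete complex tasks on a user's behalf. Anthropic, meanwhile, released a version of its Claude AI model with "computer use" last year.

Google posted demo videos showing its computer use tool in action and notes that they are sped up 3x. The company says the model "outperforms leading alternatives on multiple web and mobile benchmarks." Unlike ChatGPT Agent and Anthropic's computer use tool, Google's new model has access only to a browser. Google notes that it "is not yet optimized for desktop OS-level control" and currently supports 13 actions, including opening a web browser, typing text, and dragging and dropping elements.

Gemini 2.5 Computer Use is available to developers through Google AI Studio and Vertex AI. There is also a demo on Browserbase, where viewers can watch the model complete tasks such as "Play a game of 2048" or "Browse Hacker News for trending debates."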

English source:

Google’s latest AI model uses a web browser like you do
The new Gemini 2.5 Computer Use model can click, scroll, and type in a browser window to access data that’s not available via an API.
Google is previewing a new Gemini AI model designed to navigate and interact with the web via a browser, letting AI agents do things inside interfaces designed for use by people and not robots. The model, called Gemini 2.5 Computer Use, uses “visual understanding and reasoning capabilities” to analyze a user’s request and carry out a task, such as filling out and submitting a form.
It can be used for UI testing or for navigating interfaces that were made for people and don’t have an API or other direct connection available. Other versions of this model have been used for agentic features in AI Mode and Project Mariner, a research prototype that uses AI agents to carry out tasks on their own in a browser, like adding items to your cart based on a list of ingredients.
Google’s announcement comes just one day after OpenAI revealed new apps for ChatGPT as part of its annual Dev Day. OpenAI also continues to focus its attention on its ChatGPT Agent feature, which can complete complex tasks on your behalf. Meanwhile, Anthropic had already released a version of its Claude AI model with “computer use” last year.
Google posted some demo videos showing its computer use tool in action, and notes that they are sped up 3x.
Google says its computer use model “outperforms leading alternatives on multiple web and mobile benchmarks.” Unlike ChatGPT Agent and Anthropic’s computer use tool, Google’s new AI model only has access to a browser, not an entire computer environment. Google notes that the model “is not yet optimized for desktop OS-level control” and currently supports 13 actions, including opening a web browser, typing text, and dragging and dropping elements.
Gemini 2.5 Computer Use is available to developers through Google AI Studio and Vertex AI, but there’s also a demo on Browserbase, where you can watch as it completes tasks like “Play a game of 2048” or “Browse Hacker News for trending debates.”
