人机黑客马拉松内幕

qimuai 发布于 2025-9-10 08:01 阅读：62 一手编译

人机黑客马拉松内幕

内容来源：https://www.wired.com/story/san-francisco-hackathon-man-vs-machine/

内容总结：

上周六，旧金山举办了一场别开生面的编程马拉松比赛“人类对战机器”，旨在测试人工智能工具是否能真正提升编程效率与质量。活动由非营利AI研究机构METR联合主办，吸引了约37支队伍参加，参赛者被随机分为使用AI工具组和纯人工编程组。

比赛评审基于创意性、实用性、技术实现度和完成度四个维度。最终入围决赛的六支队伍中，三者使用AI、三者纯人工编程。冠军团队开发了一款基于AI的代码审查热力图工具，获得1.25万美元奖金；亚军则是一款纯人工开发的写作辅助工具。

值得注意的是，尽管AI组最终夺冠，但比赛过程竞争激烈。METR研究人员现场展示的实时评分表显示，两组得分始终紧咬。有参赛者坦言，无AI辅助的编程过程压力巨大，但也促使团队突破自我。

此次比赛结果进一步引发对AI编程实际效用的思考。METR近期研究曾指出，AI工具可能使经验丰富的开发者效率降低19%。而本次赛事中，既有资深程序员也有编程新手，且均需从头创建项目，为观察AI在不同场景下的作用提供了新视角。

尽管机器在此次对决中胜出，但人类团队展现的创造力与坚持同样令人印象深刻。正如参赛者所言：“人类与机器的结合才是最优解”。这场人机博弈不仅是一场技术竞赛，更成为审视AI与人类协作关系的生动样本。

中文翻译：

上周六，旧金山一个微风习习的午后，我置身于某个不起眼的共享办公空间，这里挤满了赤脚的程序员。百余位参与者涌入杜博斯三角区某栋办公楼，一场"人机对决"编程马拉松在此展开——使用AI编程工具的团队将与纯人类团队同台竞技（所有人在门口都被要求脱鞋）。这场名为"人类对战机器"的黑客松，旨在验证AI是否能真正提升编程效率与质量。

37支队伍被随机分为"人类组"与"AI辅助组"。事后有组织者透露，多名被分到人类组的选手直接弃赛。评审团将依据创意性、现实实用性、技术突破性和完成度四项标准打分，仅六支队伍能进入演示环节。冠军将获得12,500美元现金奖励及OpenAI与Anthropic的API额度，亚军奖金为2,500美元。

AI编程技术在硅谷始终争议不断。当人们对"工程师末日"的忧虑蔓延时，联合主办方非营利AI研究机构METR的新研究却发现：AI工具反而使经验丰富的开源开发者效率降低19%。本次黑客松正是该研究的深化——不同于此前针对资深程序员修改现有代码库的观察，此次活动中既有零基础新手参与，所有项目也均需从零构建。

METR技术团队成员乔尔·贝克尔指出，许多开发者效率研究采用提交请求数或代码行数等指标，但这些数据往往难以诠释："更多代码或请求未必意味着更好。同理，AI模型在基准测试中得分虽高，其实际应用能力仍存疑。"贝克尔押注机器阵营将获胜。

竞速时刻
活动Slack频道里，参赛者们抛出创意吸引队友：为钢琴家提供演奏反馈的AI工具、阅读追踪应用、邻里社交平台等。斯坦福大学AI伦理专业学生阿鲁希·阿加斯瓦尔虽从八年级开始编程，但近期专注研究AI社会影响。被随机分入人类组的她，决定构建评估AI模型谄媚倾向的框架（例如OpenAI的GPT-4o表现出的过度迎合特性）。

"人类组的创意可能更具深度，我希望评审不仅关注演示效果。"阿加斯瓦尔坦言。她最初坚信人类组能胜出，但开赛数小时后，她对能否在晚六点半截止前完成任务产生怀疑。

37岁的牙医背景创业者埃里克·庄则被分入机器组。"说实话，我庆幸能用AI工具。"他正在开发通过声纹面容识别自闭症的软件。面对关于数据偏见可能导致误诊的质疑，他承认存在风险："但结合声纹与微表情分析，有望提升早期诊断准确率。"

AGI的'塔可派对'
这个弥漫着有效利他主义气息的办公空间，处处彰显旧金山AI圈的特色。若非通过惊天诈骗案新闻知晓该运动，你或许会惊讶于其"最大化利用时间金钱资源做好事"的宗旨——活动次日这里便举办了"如何借助YouTube宣传少吃肉类"的讨论。

大楼四层贴满传单："AI 2027：AGI会接管塔可派对吗"的告示旁，贴着毫无上下文关联的"亲动物协作"海报。截止前半小时，程序员们啃着Ike's素食肉丸三明治冲刺项目。楼下评审陆续抵达：OpenAI应用AI团队的布莱恩·菲奥卡与夏马尔·希特什·阿纳德卡特、Anthropic应用AI团队的马里乌斯·布勒安德拉，以及联合主办方AI初创公司Factory的工程师瓦林·奈尔。

评审启动时，METR团队成员内特·拉什向我展示实时计分表：AI辅助组标绿，人类组标红。随着评审输入分数，各组名次持续浮动。"看出来了吗？"他问道。确实没有——开评半小时后，色块交织的表格仍未见明显胜者。这正是关键所在：出乎所有人意料，人机对决竟呈胶着状态。

决胜时刻
最终入围的六支队伍恰好平分秋色：三组人类三组机器。每场演示后，观众需举手投票猜测是否使用AI。首发的ViewSense工具旨在通过视频转文字帮助视障人士感知环境，其技术完成度令60%观众误判使用了AI（实际纯人工开发）。随后登场的是用摄像头实时追踪纸笔网站设计的无AI平台，以及通过AI生成钢琴演奏反馈的机器组项目。另一款标记代码变更热图的工具（安全漏洞标红/常规编辑标绿）则确实使用了AI。

我最青睐的是一款作家校对工具。"我们热爱阅读，认为AI时代不应与人类作家为敌，而应助其创作。"团队成员演示了能自动追踪人物特质与关系的系统，当出现前后矛盾（如角色关系设定冲突）时会发出提醒——这组未使用AI。

盘腿而坐的观众们以手拍地模拟鼓点烘托气氛。贝克尔邀请观众投票预测总冠军归属时，我举手支持AI组。最终AI辅助组胜出，约80%观众猜对结果。使用AI的代码审查热图工具夺得12,500美元头奖，人类组的写作工具屈居亚军。

大卫与歌利亚
亚军团队四位波兰初创创始人在我对面挤坐长椅。他们专程来旧金山体验AI热潮数月，截止前一小时几乎因无AI辅助的编程压力内讧退赛，最终在截止前惊险提交。"我们平时不争吵，但今天缺乏AI工具实在太煎熬。"成员瓦尔达坦言。坚持到底后他们反而庆幸："若不代表人类组出战，我们可能根本不会获奖。"

冠军团队由运营Stack Auth初创公司的康斯坦丁·沃文德与比拉尔·戈迪尔，以及AppFolio工程实习生加纳帕西组成。这对黑客松常客赛前并未预料夺冠，甚至坦言更想体验人类组"大卫迎战歌利亚"的刺激感。但他们清楚真相：人机协作才是王道。"人总想相信人类胜算，"沃文德告诉我，"但在这种赛制下，机器几乎总是赢家。"

（本文源自凯莉·罗宾逊的《模型行为》时事通讯，过往内容可在此处查阅）

英文来源：

On a breezy San Francisco afternoon last Saturday, I found myself at a nondescript coworking space filled with shoeless coders.
Just over a hundred visitors had crowded into an office building in the Duboce Triangle neighborhood for a showdown that would pit teams armed with AI coding tools against those made up of only humans (all were asked to ditch their shoes at the door). The hackathon was dubbed “Man vs. Machine,” and its goal was to test whether AI really does help people code faster—and better.
Roughly 37 groups were randomly assigned “human” or “AI-supported.” Later, an organizer told me several people dropped out after being placed on the human team. A panel of judges would rank projects based on four criteria: creativity, how useful it might be in the real world, technical impressiveness, and execution. Only six teams would make it to the demo. The winning team would earn a $12,500 cash prize and API credits from OpenAI and Anthropic. Second place would get $2,500.
AI coding has been somewhat of a lightning rod in Silicon Valley. While fears of an engineering apocalypse abound, a new study from METR—an AI research nonprofit that cohosted the hackathon—found that AI tools actually slowed experienced open source developers by 19 percent.
The weekend hackathon was meant to take METR’s research a step further. While the study looked at experienced coders working on existing codebases, at this event, some of the participants had very little coding experience and everyone would be proposing new projects.
Many studies on developer productivity use metrics like the number of pull requests or lines of code written, says Joel Becker, a member of the technical staff at METR. But these numbers can be hard to interpret. Writing more code or sending off more pull requests isn’t always better. Similarly, when we look at AI performance, even if a model scores 80 or 90 percent on a given benchmark, it’s not always clear what that means in terms of its practical abilities.
Becker bets the machine will win.
Crunch Time
In a Slack channel for the event, contestants pitched ideas to try to attract potential teammates: an AI tool for pianists to get performance feedback, an app to track what you’re reading, and a platform to help neighbors connect.
One contestant, Arushi Agastwar, is a student at Stanford studying AI ethics. She first started coding in eighth grade but has since taken a break to focus on evaluating AI’s impact on society. Agastwar was randomly selected to be on the human team, and she decided to build a framework that evaluates sycophancy (like the agreeableness that plagued OpenAI’s GPT-4o) in AI models.
“I have a feeling that some of the ideas that are going to be coming out from the ‘man’ teams are going to be really profound, and I'm hopeful that the demo aspect is not the only thing that the judges will be impressed by,” Agastwar tells me. Her initial bet was that a man team, i.e., one not using AI, would win. But several hours into the hackathon, she wasn’t so sure that she could complete the task by the 6:30 PM deadline.
Then there’s Eric Chong, a 37-year-old who has a background in dentistry and previously cofounded a startup that simplifies medical billing for dentists. He was placed on the “machine” team.
“I'm gonna be honest and say I'm extremely relieved to be on the machine team,” Chong says.
At the hackathon, Chong was building software that uses voice and face recognition to detect autism. Of course, my first question was: Wouldn’t there be a wealth of issues with this, like biased data leading to false positives?
“Short answer, yes,” Chong says. “I think that there are some false positives that may come out, but I think that with voice and with facial expression, I think we could actually improve the accuracy of early detection.”
The AGI ‘Tacover’
The coworking space, like many AI-related things in San Francisco, has ties to effective altruism.
If you’re not familiar with the movement through the bombshell fraud headlines, it seeks to maximize the good that can be done using participants’ time, money, and resources. The day after this event, the event space hosted a discussion about how to leverage YouTube “to communicate important ideas like why people should eat less meat.”
On the fourth floor of the building, flyers covered the walls—“AI 2027: Will AGI Tacover” shows a bulletin for a taco party that recently passed, another titled “Pro-Animal Coworking” provides no other context.
A half hour before the submission deadline, coders munched vegan meatball subs from Ike’s and rushed to finish up their projects. One floor down, the judges started to arrive: Brian Fioca and Shyamal Hitesh Anadkat from OpenAI’s Applied AI team, Marius Buleandra from Anthropic’s Applied AI team, and Varin Nair, an engineer from the AI startup Factory (which is also cohosting the event).
As the judging kicked off, a member of the METR team, Nate Rush, showed me an Excel table that tracked contestant scores, with AI-powered groups colored green and human projects colored red. Each group moved up and down the list as the judges entered their decisions. “Do you see it?” he asked me. No, I don’t—the mishmash of colors showed no clear winner even half an hour into the judging. That was his point. Much to everyone’s surprise, man versus machine was a close race.
Show Time
In the end, the finalists were evenly split: three from the “man” side and three from the “machine.” After each demo, the crowd was asked to raise their hands and guess whether the team had used AI.
First up was ViewSense, a tool designed to help visually impaired people navigate their surroundings by transcribing live videofeeds into text for a screen reader to read out loud. Given the short build time, it was technically impressive, and 60 percent of the room (by the emcee’s count) believed it used AI. It didn’t.
Next was a team that built a platform for designing websites with pen and paper, using a camera to track sketches in real time—no AI involved in the coding process. The pianist project advanced to the finals with a system that let users upload piano sessions for AI-generated feedback; it was on the machine side. Another team showcased a tool that generates heat maps of code changes: critical security issues show up in red, while routine edits appear in green. This one did use AI.
My favorite project was, of course, a proofreading tool for writers.
“We love reading books, and we think that the AI era that is coming is trying to fight with all the human writers, trying to take their jobs,” a member of the group explained. “And instead of fighting it with some blockages, we decided to make it easier for writers to write good books.”
They demo a system that, as you write, automatically tracks characters, traits, and relationships. If you contradict yourself—for example, saying two characters are best friends in one chapter but enemies in another—it flags the inconsistency. This team did not use AI.
The crowd, sitting cross-legged on the floor, did a brief drum roll. “Hands up if you think the overall winner was an AI-allowed team,” instructed Becker. My hand sprang up. In the end, AI took the top spot. By his count, 80 percent of the room had guessed correctly.
The $12,500 cash prize went to the code-review heat map, which used AI. The humans weren’t far behind—second place went to the writing tool.
David vs. Goliath
The second-place champions squeezed together on a bench across from me. Michał Warda, Dawid Kiełbasa, Marta Szczepaniak, and Paweł Sierant were a team of startup founders from Poland, visiting San Francisco for just a few months to experience the AI hype. They nearly dropped out an hour before the deadline, arguing under the pressure of coding without AI, but they pushed their demo minutes before submissions closed.
“We don't usually argue. But today was very tense, the lack of AI tooling. We've been programmers for a lot of time,” Warda tells me.
By the end, they were glad they’d stuck it out. “During the coding, yes, we wished we were on the machine team,” Szczepaniak says. Warda jumps in to add that if they weren’t fighting on behalf of humans, they might not have won a prize at all.
Then there were the winners: Konstantin Wohlwend, Aman Manvattira Ganapathy, and Bilal Godil. Wohlwend and Godil run a startup called Stack Auth. This is their third hackathon together. Ganapathy is an engineering intern at AppFolio.
They hadn’t expected to win, and they swore they’d rather have been on the human side for the thrill of playing David against Goliath. Still, they knew the truth: Man plus machine has the edge.
“You always want to believe in the man,” Wohlwend tells me. “But in this kind of format, the machine will almost always win.”
This is an edition of Kylie Robison’s Model Behavior newsletter. Read previous newsletters here.

连线杂志AI最前沿

文章目录

📚 推荐阅读

扫描二维码，在手机上阅读