A new AI coding challenge just revealed its first results – and they aren’t pretty

July 24, 2025


A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.

On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We’re glad we built a benchmark that is actually hard,” said Konwinski. “Benchmarks should be hard if they’re going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their biggest models. But that’s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that. It levels the playing field.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a test of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
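To make the timed entry idea concrete, here is a minimal sketch of the date filter it implies: issues are kept only if they were flagged after the submission deadline, so no submitted model could have trained on them. The `Issue` record, the `contamination_free` helper, the repository names and the 2025 year on the deadline are all illustrative assumptions, not the organizers' actual pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for a GitHub issue; not the organizers' schema."""
    repo: str
    number: int
    flagged_on: date


def contamination_free(issues, submission_deadline):
    """Keep only issues flagged strictly after the submission deadline."""
    return [i for i in issues if i.flagged_on > submission_deadline]


# Round one models were due March 12; the year is assumed from the article date.
DEADLINE = date(2025, 3, 12)

# Made-up candidate issues, one before, one on, and one after the deadline.
candidates = [
    Issue("example/repo-a", 101, date(2025, 2, 28)),
    Issue("example/repo-b", 202, date(2025, 3, 12)),
    Issue("example/repo-c", 303, date(2025, 4, 2)),
]

test_set = contamination_free(candidates, DEADLINE)
print([(i.repo, i.number) for i in test_set])  # [('example/repo-c', 303)]
```

The strict comparison excludes issues flagged on the deadline itself, which is one reading of "after that date"; a real pipeline would also need to decide how to treat issues that were edited or relabeled later.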

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch, “because we expect people to adapt to the dynamics of competing on this every few months.”


It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available, but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination free SWE-Bench, that’s the reality check for me.”
