AI Benchmarks for Code

First Benchmark for Legacy Code Comprehension Shows Specialized AI Approach Outperforms General-Purpose Models

LegacyCodeBench tests whether AI can understand COBOL well enough to document it accurately, not just generate plausible ...

Three questions, three answers: AI – the productivity boost for coding?

Large language models promise more efficiency in software development. But, despite all the promises, there are still a few ...

MemRL outperforms RAG on complex agent benchmarks without fine-tuning

MemRL separates stable reasoning from dynamic memory, giving AI agents continual learning abilities without model fine-tuning ...

Ars Technica

Anthropic says its new AI model “maintained focus” for 30 hours on multistep tasks

Claude is popular with some software developers thanks to Claude Code, and Anthropic is confident about the latest version of Sonnet’s coding capability: “Claude Sonnet 4.5 is the best coding model in ...

Inc

Google’s New Gemini 3 AI Crushed OpenAI and Anthropic in a Benchmark Test for Business Operations

Google has released Gemini 3, the latest in its line of advanced AI models. As most AI companies do when announcing a new flagship model, Google boasted that Gemini 3 is its most intelligent model yet ...

MSN, Opinion

AI agents arrived in 2025 -- here's what's next for 2026

AI agents have emerged from the lab, bringing promise and peril. A Carnegie Mellon University researcher explains what's ...

VentureBeat

Has this stealth startup finally cracked the code on enterprise AI agent reliability? Meet AUI's Apollo-1

For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, ...
