Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress. Codex, Claude Code, and Autoresearch recover only 9.3% of human progress, mostly tuning hyperparameters and ignoring algorithmic research.

NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access.

We found that:
1. Coding agents mostly spend compute on hyperparameter tuning, rarely attempting the algorithmic research that makes human records successful. In one instance, Codex spent 121 H100 hours adjusting two values in the training code: cooldown fraction and window size schedule parameters.
2. When coding agents consider algorithmic work, they rarely succeed. Instead, they either talk themselves out of the change or regress performance. For example, Autoresearch repeatedly considered reducing the number of value embeddings from 3 to 2, but avoided the change after deeming it risky without running any experiments.

We thank Larry Dial for helpful discussions, Keller Jordan for the original NanoGPT Speedrun, and all human contributors for their efforts in producing world records!

Blog: https://lnkd.in/gxAwwn-c
GitHub: https://lnkd.in/gy-X8vxH
Intology
Research Services
San Francisco, California
1,085 followers
Automating the process of discovery.
About us
Intology is a research lab building end-to-end automated research systems. Our Artificial Scientists have published discoveries at A* conferences and outperform human experts on AI research tasks.
- Website
- https://intology.ai/
- Industry
- Research Services
- Company size
- 2-10 employees
- Headquarters
- San Francisco, California
- Type
- Privately Held
- Founded
- 2025
Locations
- 251 Rhode Island St 104, San Francisco, California 94103, US
Updates
-
Come chat with the team in San Diego #neurips2025 https://luma.com/u79epzon
Introducing Locus: the first AI system to outperform human experts at AI R&D

Locus conducts research autonomously over multiple days and achieves superhuman results on RE-Bench given the same resources as humans, as well as SOTA performance on GPU kernel & ML engineering tasks.

> RE-Bench is a collection of several frontier AI research tasks that typically take human experts (e.g., top ML PhDs and frontier lab researchers) several days.

By scaling experimentation to far longer time horizons than previous systems, Locus represents a step change in AI scientist capabilities.

Locus excels at tackling open-ended problems. In areas like kernel engineering, Locus demonstrates a remarkable ability to explore vast solution spaces, achieving up to 100x speedups. This is essential to Locus’ ability to generate novel discoveries.

Additionally, Locus predictably scales performance with compute on challenging domains. We expect Locus to easily continue scaling to longer and harder problems.

Locus is still a very early iteration in our research program. We see a clear path forward in automating scientific discovery and imagine deploying Locus on week- or month-long runs to tackle the most difficult challenges in computational science.

We’d like to thank Modal and Mithril for being our compute partners.

We are a lean, talent-dense team based in SF, and are hiring. If our mission excites you, join us: https://lnkd.in/gDUGheS7

Blog: https://lnkd.in/gb6HGkvP
-
This Friday, we are hosting Alex Zhang from MIT CSAIL to discuss his work on Recursive Language Models. 🔬 AI4Science on alphaXiv 🗓 Friday October 31st 2025 · 11AM PT 🎙 Featuring Alex Zhang 💬 Casual Talk + Open Discussion Topic: The Recursive Language Models (RLM) paradigm, an inference-scaling approach that lets language models handle near-infinite contexts. Event invite: https://lnkd.in/gav8UsB7
-
This Friday, we are hosting Jeff Clune from The University of British Columbia & Google DeepMind to discuss recent efforts in open-ended agents. 🔬 AI4Science on alphaXiv 🗓 Friday September 19th 2025 · 10AM PT 🎙 Featuring Jeff Clune 💬 Casual Talk + Open Discussion Topic: Recent work including OMNI (Open-endedness via Models of human Notions of Interestingness), Video Pre-Training (VPT), Automatically Designing Agentic Systems (ADAS), the Darwin Gödel Machine, and The AI Scientist. Event invite: https://luma.com/jviy4bz5
-
This Friday, we are hosting Nicolas Ballas from Meta (FAIR) to discuss his recent work on V-JEPA 2. 🔬 AI4Science on alphaXiv 🗓 Friday August 22nd 2025 · 10AM PT 🎙 Featuring Nicolas Ballas 💬 Casual Talk + Open Discussion Topic: Video Joint Embedding Predictive Architecture 2 (V-JEPA 2), the first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments. Event invite: https://lu.ma/v5sy5iea
-
This Friday, we are hosting Peiyao Sheng from Sentient Labs to talk about her work on LiveCodeBench Pro. 🔬 AI4Science x AI Security on alphaXiv 🗓 Friday August 15th 2025 · 10AM PT 🎙 Featuring Peiyao Sheng (Sentient Labs) 💬 Casual Talk + Open Discussion Topic: LiveCodeBench Pro, a continuously updated benchmark showing that despite recent claims, frontier LLMs still significantly lag behind human Olympiad medalists in competitive programming. Event invite: https://lu.ma/v45b9ltc
-
This Friday, we are hosting James Zou from Stanford University to discuss his work on the Virtual Lab, recently published in Nature. 🔬 AI4Science on alphaXiv 🗓 Friday August 8th 2025 · 11AM PT 🎙 Featuring James Zou (Stanford University) 💬 Casual Talk + Open Discussion Topic: The Virtual Lab brings together large language models and human researchers to collaboratively solve complex scientific problems, exemplified by the AI-driven design of new SARS-CoV-2 nanobodies. Event invite: https://lnkd.in/gbJsKEe2
-
This Friday, we are hosting Reza Bayat from Mila - Quebec Artificial Intelligence Institute to discuss his recent work on Mixture-of-Recursions. 🔬 AI4Science on alphaXiv 🗓 Friday August 1st 2025 · 11AM PT 🎙 Featuring Reza Bayat (Mila - Quebec Artificial Intelligence Institute) 💬 Casual Talk + Open Discussion Topic: Mixture-of-Recursions, a breakthrough architecture that unifies parameter sharing with adaptive computation. By dynamically assigning different recursion depths to individual tokens, MoR achieves large-model quality with significantly fewer parameters and computational resources. Event invite: https://lu.ma/91nazpca
-
Next week, we are hosting Kexin Huang from Stanford University to discuss his recent work on Biomni, a general-purpose AI agent for biology. 🔬 AI4Science on alphaXiv 🗓 Friday July 25th 2025 · 10AM PT 🎙 Featuring Kexin Huang (Stanford University) 💬 Casual Talk + Open Discussion Topic: Biomni, a general-purpose AI agent designed to function as a virtual AI biologist capable of autonomously conducting research across diverse biomedical domains. Event invite: https://lnkd.in/gupKegAf
