DEV Community

# benchmarking

Posts

How I Built a 95K-Line Cognitive AI Pipeline That Takes an 8B Model to GPT-4 Territory

4 min read
Your model speed benchmark is measuring the wrong thing

3 min read
LLM Benchmark Rankings 2026: 15 Models Tested on 38 Real Coding Tasks

28 min read
Google Said It Had Native Function Calling. I Tested It.

3 min read
We Tested 10 Untested LLMs on Agent Coding — The Results Are In

3 min read
We Benchmarked SupportSage Against Traditional Supports: Here's the Data

3 min read
Why I spun my benchmark into its own repo (and why every dev tool with a benchmark should)

4 min read
KVQuant / BitForge: same model, smarter context, better answer

1 min read
Qwen sky proof: compressed memory made a tiny model behave better — with the receipts

1 min read
Why You Should Never Use std::unordered_set in Hot C++ Loops

2 min read
Gemini-3-Flash: My ai agent benchmark terminalbench Win & 3 Fixes

7 min read
The Last Pivot: Why Quality Gates Killed My Final KV-Cache Speedup

7 min read
184 MCP installs and a 93.9% adversarial signal GPT-4o can't replicate

4 min read
A 70ms Local NLI Judge Hits 0.596 Pearson r With Groq Llama 3.3 70B on DSPy Reward Scoring

5 min read
How to Benchmark LLM Inference Performance: TTFT, ITL, and Throughput Metrics

4 min read