LLMs & Models Articles
Browse 261 articles about LLMs & Models.
AI Benchmarks Are Broken: 5 Methodological Flaws in Time Horizon Metrics You Need to Understand
Correcting the fixed-slope assumption alone would push Meter's numbers up 35%. Five structural problems with how AI capability benchmarks are built and reported.
Beth Barnes on Meter's Time Horizons: The Error Bars Are 2x — Here's What the Benchmark Actually Tells You
Meter's co-founder admits error bars are 2x in either direction. Here's the honest breakdown of what time horizon benchmarks can and can't tell you.
GPQA: The Graduate-Level Benchmark Every Major AI Lab Uses — and Why Its Creator Says It Has Limits
David Rein built GPQA and now co-authors Hcast. He explains firsthand where graduate-level benchmarks can mislead capability estimates.
John Preskill's Quantum Paper Used an Open-Source LLM Optimizer — and It Made Algorithms 1,000x Better
Caltech's John Preskill co-authored a paper where AI did the heavy lifting — improving early quantum algorithms by 1,000x via OpenEvolve.
Omar Khattab's DSPy Follow-Up: How Auto-Optimizing Your Harness Put a Tiny Model at #1 on TerminalBench
DSPy's creator showed a Haiku-powered harness beat larger models on TerminalBench. The secret: 10M tokens of automated harness optimization.
SWE-Bench Score vs. Real Merge Rate: Why Your Agent's Benchmark Number Doesn't Match Production Reality
Agent solutions pass SWE-bench but merge at half the rate of human solutions. The gap between benchmark and production is wider than you think.
What Is the Verifiability Principle? Why AI Excels at Some Tasks and Fails at Others
AI models excel in domains where outputs can be verified, such as code and math. Learn why this creates jagged intelligence and what it means for automation.
What Is Software 3.0? How Prompting Replaced Programming
Software 3.0 is the era where prompts and context windows replace code. Learn what this means for how you build AI agents and automate workflows.
AlphaQubit: How Google DeepMind's AI System Solved the Error Correction Problem Blocking Fault-Tolerant Quantum Computers
AlphaQubit is an AI error decoder that identifies quantum computing errors with state-of-the-art accuracy — directly accelerating the 2029 cryptography threat.
Andrej Karpathy Said 'The Tokenizer Must Go' — DeepSeek's Vision Architecture Is Starting to Prove Him Right
Karpathy called pixels better inputs than text tokens after DeepSeek's OCR paper. Their new visual primitives model takes that idea further with 7,000x…
Big Tech Cloud Earnings Week: 5 Numbers That Prove AI Infrastructure Has Hit Escape Velocity
Google Cloud +63%, Azure +40%, AWS +28%. OpenAI's CFO called token demand 'a vertical wall.' Here's what the Q1 2026 numbers actually mean.
Claude Code /ultra review: 5 Things You Need to Know Before Running It ($5–$20 Per Run)
Ultra review spins up parallel reviewer agents but costs $5–$20 per run and requires a Claude account, not just an API key. What to know first.
DeepSeek's 'Thinking with Visual Primitives': 5 Technical Breakthroughs in the Paper That Briefly Disappeared
DeepSeek's vision paper was published then pulled. Here are 5 key technical details — including inline bounding-box tokens and a 7,000x compression ratio.
DeepSeek V4 Flash vs Claude Sonnet 4.6: Which Model Is Best for AI Agent Workflows?
Compare DeepSeek V4 Flash and Claude Sonnet 4.6 on cost, speed, and quality for agentic coding, automation, and multi-step workflows.
DeepSeek Vision's 7,000x Image Compression Pipeline: From 756px Input to 81 KV Cache Entries
DeepSeek's vision model compresses a 756x756 image through four stages down to 81 KV cache entries — a ~7,000x total compression ratio. Here's each step.
DeepSeek Vision Beats GPT-5.4 by 17 Points on Maze Navigation — The Topological Reasoning Benchmark Explained
On maze navigation, DeepSeek's vision model scores 67% vs. GPT-5.4's 50% — a 17-point gap driven by inline bounding-box spatial reasoning.
DeepSeek Vision vs. Claude Sonnet 4.6 vs. Gemini Flash 3: Which Vision Model Uses 10x Less KV Cache?
DeepSeek's vision model uses ~90 KV cache entries per image vs. ~870 for Sonnet 4.6 and ~1,000 for Gemini Flash 3. Here's what that means for cost.
How to Use Free Claude Code Alternatives: OpenRouter, NVIDIA NIM, and Ollama Setup Guide
Run Claude Code with DeepSeek, GLM, or Gemma models via OpenRouter, NVIDIA NIM, or Ollama to cut costs by up to 99% with the free-claude-code proxy.
Google AI Co-Clinician vs. GPT-5.4 with Search: Which Medical AI Do Physicians Actually Prefer?
In blind physician evaluations, Google's AI Co-clinician beat GPT-5.4 thinking with search 63% to 30%. Here's what drove the gap.
Google DeepMind's AI Co-Clinician: 4 Benchmark Results That Surprised Even the Evaluators
AI Co-clinician beat GPT-5.4 63% to 30%, hit zero critical errors in 97 of 98 queries, and matched physicians in 68 of 140 consultation dimensions.