Independent AI analysis
The most complete AI ranking of 2026, with 577 active LLMs compared across 13 official benchmarks (including GPQA, MMLU-Pro, AIME, HLE, LiveCodeBench, SciCode, IFBench, AA-LCR, Terminal-Bench and Tau²), covering reasoning, math and coding, plus speed, latency and per-token pricing metrics. Use this ranking to find the best AI models of 2026 by category.
Luis Fernando Roquette · SWEN · methodology described at the bottom of this page · last updated: Jun 01, 2026
Top 10 · AA Intelligence Index
Top 10 · Output tokens/second
Top 10 · USD / 1M tokens input
Ranking by Artificial Analysis composite score (0–100). Top 30 benchmark models.
Daily progression of the Intelligence Index for the top 8 models.
Ranking by composite coding benchmark score. Top 20 models.
Competitive programming problems. Top 20.
Ranking by composite math benchmark score. Top 20 models.
American Invitational Mathematics Examination. Top 20.
500 competitive math problems. Top 20.
Massive Multitask Language Understanding Pro. Top 20.
Graduate-level Physics, Chem, Bio questions. Top 20.
Hardest benchmark, focused on reasoning. Top 20.
Ranking by generation speed (tokens/s). Top 20 models.
Initial latency. Lower = better. Top 20.
Time to first answer token (TTFA). Includes reasoning chain. Lower = better. Top 20.
Context window: the maximum number of tokens the model can process. Top 15.
Top 25 cheapest models in USD/1M input tokens. Cost-efficiency benchmark.
Ranking by human preference in blind side-by-side comparisons.
Scientific code generation (physics, chemistry, biology). Top 20.
Adherence to complex and constrained instructions. Top 20.
Reasoning over long context (understanding and using info in 100K+ tokens). Top 20.
Agentic task execution in a real Linux terminal. Top 20.
Tool use in simulated environments (airline, retail, telecom). Top 20.
Subjective score (0–10) based on visual quality, physics, duration and cost. SWEN editorial review.
According to the AA Intelligence Index — a composite index aggregating GPQA Diamond, MMLU-Pro, AIME, HLE and LiveCodeBench — Claude Opus 4.8 (Fast) (Anthropic) leads the ranking in 2026 with a score of 61.4/100, followed by Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.4) and GPT-5.5 (60.2). The Intelligence Index is calculated by Artificial Analysis based on independent evaluations and reflects real technical capability in reasoning, math, science and coding. It differs from LMArena ELO, which measures human preference in open conversations. For tasks requiring deep reasoning, code or scientific analysis, models at the top of the Intelligence Index typically perform best. For everyday conversations and creativity, ELO is a more representative guide. See the updated ranking for real-time positions.
ELO comes from LMArena (Chatbot Arena), where real users compare responses from two anonymized models and pick the best one. It is a measure of subjective human preference — reflecting naturalness, usefulness and perceived quality in everyday conversations. A model with a high ELO may not be the most accurate on technical tasks, but it is what people prefer to use. The AA Intelligence Index, calculated by Artificial Analysis, is objective: it aggregates results from standardized benchmarks such as GPQA Diamond (PhD-level questions), MMLU-Pro (broad academic knowledge), AIME (olympiad math), HLE (frontier scientific knowledge) and LiveCodeBench (programming). The higher the score, the more technical capability the model demonstrated in controlled evaluations. Use ELO to choose a general conversational assistant; use the Intelligence Index to select models for technical or scientific pipelines.
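To make the idea of a preference rating concrete, here is a minimal sketch of the classic Elo update applied to one blind comparison between two models. It is an illustration only: the K-factor of 32 and the 400-point scale are the textbook defaults, and LMArena's actual rating pipeline is not described on this page and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; textbook default, not LMArena's value
) -> tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if the user picked A, 0.0 if they picked B, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; the user prefers A.
    print(update_ratings(1500.0, 1500.0, 1.0))
```

A win against an equally rated opponent moves the rating by half the K-factor, while a win against a much weaker opponent moves it very little. That is why a high rating reflects consistent preference across many comparisons rather than a single result.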
For coding, the most relevant benchmarks are LiveCodeBench — code challenges evaluated with real execution — and the AA Coding Index. In 2026, GPT-5.5 leads the coding ranking (59.1/100), with GPT-5.4 in second and Claude Opus 4.8 (Fast) in third. The ideal choice depends on context: for code generation via API, cost per token and context window matter as much as accuracy. For interactive IDE development (Cursor, VS Code), latency is critical. For multi-file projects, context windows above 100K tokens are required. See the full table to compare coding models by score, price and speed.
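The selection criteria above can be sketched as a simple filter over table rows. Everything in this example is hypothetical: the `ModelRow` fields, the `shortlist` helper and the three placeholder rows are invented for illustration and do not correspond to real models or to the ranking's data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRow:
    name: str
    coding_score: float       # composite coding score, 0-100
    context_tokens: int       # context window size in tokens
    input_usd_per_1m: float   # input price in USD per 1M tokens
    ttft_seconds: float       # time to first token, in seconds


# Placeholder rows with made-up values (not real models or measurements).
ROWS = [
    ModelRow("model-a", 59.0, 400_000, 5.00, 1.2),
    ModelRow("model-b", 55.0, 64_000, 0.50, 0.4),
    ModelRow("model-c", 48.0, 200_000, 1.00, 0.6),
]


def shortlist(
    rows: list[ModelRow],
    context_above: int = 0,
    max_ttft: Optional[float] = None,
    max_price: Optional[float] = None,
) -> list[ModelRow]:
    """Keep rows that satisfy every given constraint, best coding score first."""
    kept = [
        r for r in rows
        if r.context_tokens > context_above
        and (max_ttft is None or r.ttft_seconds <= max_ttft)
        and (max_price is None or r.input_usd_per_1m <= max_price)
    ]
    return sorted(kept, key=lambda r: r.coding_score, reverse=True)


if __name__ == "__main__":
    # Multi-file project: require a context window above 100K tokens.
    print([r.name for r in shortlist(ROWS, context_above=100_000)])
    # Interactive IDE use: cap latency instead.
    print([r.name for r in shortlist(ROWS, max_ttft=0.7)])
```

Filtering on the hard constraints first and sorting by score afterwards mirrors the advice in the text: the highest-scoring model is only the right choice if it also fits the latency, price and context limits of the use case.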
The SWEN ranking is updated automatically from three main sources. Artificial Analysis benchmark data (Intelligence Index, Coding Index, Math Index, inference speed) is synced every 6 hours via automated integration. API pricing (input and output per 1M tokens) is updated daily via OpenRouter, so provider price changes appear within about a day. LMArena ELO (Chatbot Arena) is synced weekly. The page revalidates its cache every 5 minutes via ISR (Incremental Static Regeneration): once a sync brings in a new model or a changed score, the ranking reflects it within 5 minutes without a manual rebuild. The last sync occurred on Jun 01, 2026.
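The refresh cadence described above can be summarized in a small scheduling sketch. The intervals come from the text; the source keys, function names and the idea of checking timestamps in one place are assumptions made for illustration, not a description of how SWEN's pipeline is actually built.

```python
from datetime import datetime, timedelta, timezone

# Intervals taken from the text; the dictionary keys are hypothetical labels.
REFRESH_INTERVALS = {
    "artificial_analysis": timedelta(hours=6),
    "openrouter_pricing": timedelta(days=1),
    "lmarena_elo": timedelta(weeks=1),
}

# Page cache lifetime (the 5-minute ISR revalidation window).
CACHE_TTL = timedelta(minutes=5)


def sources_due(last_synced: dict[str, datetime], now: datetime) -> list[str]:
    """Return the sources whose refresh interval has elapsed (or never synced)."""
    return sorted(
        name
        for name, interval in REFRESH_INTERVALS.items()
        if name not in last_synced or now - last_synced[name] >= interval
    )


def cache_is_stale(rendered_at: datetime, now: datetime) -> bool:
    """True once the cached page is older than the revalidation window."""
    return now - rendered_at >= CACHE_TTL


if __name__ == "__main__":
    now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
    last = {
        "artificial_analysis": now - timedelta(hours=7),
        "openrouter_pricing": now - timedelta(hours=2),
        "lmarena_elo": now - timedelta(days=8),
    }
    print(sources_due(last, now))
    print(cache_is_stale(now - timedelta(minutes=4), now))
```

Note that the two layers are independent: the cache window only bounds how quickly already-synced data reaches the page, so the worst-case delay for a benchmark change is roughly the 6-hour sync interval plus the 5-minute revalidation.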
Google's Gemini 3 family does not follow strictly sequential numbering. Google released Gemini 3 Flash, Gemini 3.1 Pro and 3.1 Flash Lite, and Gemini 3.5 Flash, without publishing an official “Gemini 3.2”. Each number denotes a distinct technical generation: 3.1 brought reasoning improvements; 3.5 expanded capability at an intermediate cost. Gemini 3.1 Pro costs $2.00/1M tokens with a 1-million-token context window, positioning itself as an alternative to GPT-4o and Claude 3.7. See the full Gemini 3 family comparison →
“Gemini Spark” is a name circulating online that Google has never officially launched as a product. The term appeared in APK teardowns linked to a possible ultra-lightweight version of Gemini for edge devices. Google's confirmed lightweight models are Gemini Nano (on-device, Pixel 8 Pro/Pixel 9) and Gemini Flash (via API, $0.075/1M tokens). Any prediction about “Gemini Spark” is speculation until official confirmation. Read what is known about Gemini Spark →
Artificial Analysis — provides Intelligence Index, Coding Index, Math Index and individual benchmarks (GPQA Diamond, MMLU-Pro, HLE, AIME, MATH-500, LiveCodeBench). Synced every 6h via automated cron.
LMArena — human preference ELO in blind side-by-side comparisons. Updated weekly.
OpenRouter — provider pricing in USD per 1M tokens. Updated daily.
Historical snapshots — daily score capture at 06:30 UTC to feed temporal evolution charts. Started on Jun 01, 2026.
Benchmarks are indicative — always test on your specific use case before deciding. Performance varies by inference provider (same model, different latency).