AI Ranking 2026 — independent analysis of 500+ models across 13 benchmarks

Independent AI analysis

AI Ranking 2026

The most complete AI ranking of 2026, with 577 active LLMs compared across 13 official benchmarks (including GPQA, MMLU-Pro, AIME, HLE, LiveCodeBench, SciCode, IFBench, AA-LCR, Terminal-Bench and Tau²). The benchmarks cover reasoning, math and coding, and the ranking also tracks speed, latency and per-token pricing. Use this ranking to find the best AI models of 2026 by category.

Luis Fernando Roquette · SWEN · methodology described at the bottom of this page · last updated: Jun 01, 2026

Source: Artificial Analysis

Use now

Most intelligent

Top 10 · AA Intelligence Index

Fastest

Top 10 · Output tokens/second

Cheapest

Top 10 · USD / 1M tokens input

Intelligence Index

Ranking by Artificial Analysis composite score (0–100). Top 30 benchmark models.


Intelligence over time

Daily progression of the Intelligence Index for the top 8 models.

[Chart: Intelligence Index over time · 11 points · 30-day window · 2026-05-22 to 2026-06-01 · y-axis 53 to 65. Every series is flat over the window.]
Values on 2026-06-01:
Claude Opus 4.8 (Fast) · Anthropic · 61.4 (series starts 2026-05-30)
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) · Anthropic · 61.4 (series starts 2026-05-29)
GPT-5.5 · OpenAI · 60.2
Claude Opus 4.7 · Anthropic · 57.3
Gemini 3.1 Pro Preview · Google · 57.2
GPT-5.4 · OpenAI · 56.8
Qwen3.7 Max · Alibaba · 56.6 (two series with the same name)

Coding

Math

AA Math Index

Ranking by composite math benchmark score. Top 20 models.

AIME 2025

American Invitational Mathematics Examination. Top 20.

MATH-500

500 competition-level math problems. Top 20.

Knowledge & reasoning

MMLU-Pro

Massive Multitask Language Understanding Pro. Top 20.

GPQA Diamond

Graduate-level physics, chemistry and biology questions. Top 20.

HLE — Humanity's Last Exam

Hardest benchmark, focused on reasoning. Top 20.

Performance

Output tokens/second

Ranking by generation speed (tokens/s). Top 20 models.

tokens/s

Time to First Token (TTFT)

Initial latency. Lower = better. Top 20.

ms

End-to-End Response Time

Time to first answer token (TTFA). Includes reasoning chain. Lower = better. Top 20.

seconds
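
The three performance metrics above can be related with a rough back-of-the-envelope model. The sketch below assumes that time to the first answer token is approximately TTFT plus the time spent generating reasoning tokens at the measured output speed. That relationship, the function names and the example numbers are illustrative assumptions, not the formula used by Artificial Analysis.

```python
def estimate_time_to_answer(ttft_ms: float, reasoning_tokens: int,
                            tokens_per_second: float) -> float:
    """Estimate seconds until the first answer token appears.

    Assumption: the model streams its reasoning tokens at the same
    speed as its final output, so TTFA ~= TTFT + reasoning time.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + reasoning_tokens / tokens_per_second


def estimate_total_time(ttft_ms: float, reasoning_tokens: int,
                        answer_tokens: int, tokens_per_second: float) -> float:
    """Estimate seconds until the full answer has been generated."""
    time_to_answer = estimate_time_to_answer(
        ttft_ms, reasoning_tokens, tokens_per_second
    )
    return time_to_answer + answer_tokens / tokens_per_second
```

With hypothetical inputs of 500 ms TTFT, 1,000 reasoning tokens and 100 tokens/s, the estimate is 0.5 s + 10 s = 10.5 s before the answer starts, which is why a fast but long-reasoning model can still feel slow.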

Context window

Tokens the model can process. Top 15.

tokens

Cost

Human preference

Advanced capabilities

SciCode

Scientific code generation (physics, chemistry, biology). Top 20.

IFBench — Instruction Following

Adherence to complex and constrained instructions. Top 20.

AA-LCR — Long Context Reasoning

Reasoning over long context (understanding and using info in 100K+ tokens). Top 20.

Terminal-Bench Hard

Agentic task execution in a real Linux terminal. Top 20.

Tau²-Bench — Tool Use

Tool use in simulated environments (airline, retail, telecom). Top 20.

Video models

Editorial quality — video models

Subjective score (0–10) based on visual quality, physics, duration and cost. SWEN editorial review.

Score /10
For LLMs we use objective benchmarks. For video there is no industry-standard index yet, so this score is our curated assessment.
Source: SWEN editorial review

Explore more

Frequently asked questions about the AI ranking

What is the most intelligent AI in the world in 2026?

According to the AA Intelligence Index, a composite index aggregating GPQA Diamond, MMLU-Pro, AIME, HLE and LiveCodeBench, Claude Opus 4.8 (Fast) (Anthropic) leads the 2026 ranking with a score of 61.4/100, tied with Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.4) and followed by GPT-5.5 (60.2). The Intelligence Index is calculated by Artificial Analysis from independent evaluations and reflects measured capability in reasoning, math, science and coding. It differs from LMArena ELO, which measures human preference in open conversations. For tasks requiring deep reasoning, code or scientific analysis, models at the top of the Intelligence Index typically perform best. For everyday conversations and creative work, ELO is a more representative guide. See the updated ranking for current positions.

What is the difference between ELO and Intelligence Index?

ELO comes from LMArena (Chatbot Arena), where real users compare responses from two anonymized models and pick the best one. It is a measure of subjective human preference — reflecting naturalness, usefulness and perceived quality in everyday conversations. A model with a high ELO may not be the most accurate on technical tasks, but it is what people prefer to use. The AA Intelligence Index, calculated by Artificial Analysis, is objective: it aggregates results from standardized benchmarks such as GPQA Diamond (PhD-level questions), MMLU-Pro (broad academic knowledge), AIME (olympiad math), HLE (frontier scientific knowledge) and LiveCodeBench (programming). The higher the score, the more technical capability the model demonstrated in controlled evaluations. Use ELO to choose a general conversational assistant; use the Intelligence Index to select models for technical or scientific pipelines.
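
To make the pairwise-preference idea concrete, here is a minimal sketch of the classic Elo formula: each rating gap implies an expected win probability, and ratings move toward the observed vote. This page does not describe how LMArena actually computes its scores, so treat the formula, the K-factor of 32 and the function names as textbook assumptions rather than LMArena's method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B (classic Elo)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> tuple[float, float]:
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k is the conventional K-factor (an assumption, not LMArena's value).
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Under this formula, two models with equal ratings are each expected to win half of the comparisons, and a 400-point gap corresponds to roughly a 10-to-1 preference for the higher-rated model.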

Which AI is best for coding in 2026?

For coding, the most relevant benchmarks are LiveCodeBench — code challenges evaluated with real execution — and the AA Coding Index. In 2026, GPT-5.5 leads the coding ranking (59.1/100), with GPT-5.4 in second and Claude Opus 4.8 (Adaptive Reasoning, Max Effort) in third. The ideal choice depends on context: for code generation via API, cost per token and context window matter as much as accuracy. For interactive IDE development (Cursor, VS Code), latency is critical. For multi-file projects, context windows above 100K tokens are required. See the full table to compare coding models by score, price and speed.

How often is the ranking updated?

The SWEN ranking is updated automatically and continuously from three main sources. Artificial Analysis benchmark data (Intelligence Index, Coding Index, Math Index, inference speed) is synced every 6 hours via automated integration. API pricing — input and output per 1M tokens — is updated daily via OpenRouter, reflecting provider changes in near real time. LMArena ELO (Chatbot Arena) is synced weekly. The page revalidates its cache every 5 minutes via ISR (Incremental Static Regeneration): when a new model enters or a score changes, the ranking updates within 5 minutes without a manual rebuild. The last sync occurred on Jun 01, 2026.
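
The cadences described above can be written down as a small schedule. The sketch below is illustrative only: the source names, the dictionary layout and both helper functions are hypothetical and do not come from SWEN's codebase. It shows how the stated intervals (6 hours, daily, weekly, plus a 5-minute cache revalidation) translate into a staleness check and an approximate upper bound on how old a displayed value can be.

```python
from datetime import datetime, timedelta, timezone

# Sync intervals as stated in the text; the keys are hypothetical names.
SYNC_INTERVALS = {
    "artificial_analysis": timedelta(hours=6),
    "openrouter_pricing": timedelta(days=1),
    "lmarena_elo": timedelta(weeks=1),
}
CACHE_REVALIDATE = timedelta(minutes=5)


def stale_sources(last_synced: dict[str, datetime], now: datetime) -> list[str]:
    """Return the sources whose last sync is at least one interval old."""
    return sorted(
        name
        for name, interval in SYNC_INTERVALS.items()
        if now - last_synced[name] >= interval
    )


def approx_max_delay(source: str) -> timedelta:
    """Rough upper bound between an upstream change and the page showing it.

    Assumes the page receives steady traffic, so a cache revalidation
    happens soon after each 5-minute window expires.
    """
    return SYNC_INTERVALS[source] + CACHE_REVALIDATE


if __name__ == "__main__":
    now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
    last = {
        "artificial_analysis": now - timedelta(hours=7),
        "openrouter_pricing": now - timedelta(hours=2),
        "lmarena_elo": now - timedelta(days=8),
    }
    print(stale_sources(last, now))
    print(approx_max_delay("artificial_analysis"))
```

The practical point is that the 5-minute revalidation bounds only the cache layer; a benchmark score can still lag its upstream source by up to about six hours.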

What is the difference between Gemini 3, 3.1 and 3.5?

Google's Gemini 3 family does not follow sequential linear numbering. Google released Gemini 3 Flash, Gemini 3.1 Pro/Flash Lite and Gemini 3.5 Flash, without publishing an official “Gemini 3.2”. Each number denotes a distinct technical generation: 3.1 brought reasoning improvements; 3.5 expanded capability at an intermediate cost. Gemini 3.1 Pro costs $2.00/1M tokens with a 1-million-token context window, positioning itself as an alternative to GPT-4o and Claude 3.7.

What is Google's Gemini Spark?

“Gemini Spark” is a name circulating online that Google has never officially launched as a product. The term appeared in APK teardowns linked to a possible ultra-lightweight version of Gemini for edge devices. Google's confirmed lightweight models are Gemini Nano (on-device, Pixel 8 Pro/Pixel 9) and Gemini Flash (via API, $0.075/1M tokens). Any prediction about “Gemini Spark” is speculation until official confirmation.

Methodology & sources

Artificial Analysis: Intelligence Index, Coding Index, Math Index and individual benchmarks (GPQA Diamond, MMLU-Pro, HLE, AIME, MATH-500, LiveCodeBench). Synced every 6h via automated cron.

LMArena: human preference ELO from blind side-by-side comparisons. Updated weekly.

OpenRouter: provider pricing in USD per 1M tokens. Updated daily.

Historical snapshots: daily score capture at 06:30 UTC to feed the temporal evolution charts. Started on Jun 01, 2026.
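
Because pricing is quoted in USD per 1M tokens, comparing models for a real workload means converting that unit into a per-request cost. The following is a minimal sketch of that arithmetic; the function name is hypothetical and the prices in the usage note are made-up illustration values, not figures from this ranking.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    """Cost of one request given prices in USD per 1M tokens.

    Input and output tokens are priced separately, as is common
    for LLM APIs.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000
```

For example, with hypothetical prices of $2.00 per 1M input tokens and $12.00 per 1M output tokens, a request with 100,000 input tokens and 2,000 output tokens costs (200,000 + 24,000) / 1,000,000 = $0.224. Long prompts usually dominate the bill even when the output price is several times higher.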

Benchmarks are indicative — always test on your specific use case before deciding. Performance varies by inference provider (same model, different latency).