Does SWEN run its own benchmarks?

No. SWEN aggregates data from specialized sources (LMArena, LiveBench, Artificial Analysis). The Intelligence Index and benchmarks are synced automatically every 6 hours via the Artificial Analysis API.

Is the data real-time?

Not quite. The Intelligence Index and benchmarks are synced automatically every 6 hours via the Artificial Analysis API, and new models are imported in the same window. Pricing and technical specs (context window, vision support) are enriched weekly via sync-model-metadata.

Can I use SWEN data?

Yes. The data is aggregated from public sources and properly attributed. For commercial use or API integration, please contact us. We plan to offer a public API soon.

How do I report incorrect data?

If you find outdated or incorrect data, send an email to contato@swen.ia.br with the model name and suggested correction. We verify and update within 24 hours.

AI Benchmark Methodology

Transparency is fundamental. This page documents how SWEN collects, processes and presents AI model benchmark data. Our sources are public, our process is automated and our benchmark data is refreshed at least daily.

Principles

Independence

SWEN has no commercial relationship with any AI provider. We receive no payment to position models. Rankings reflect exclusively the data from the sources listed below.

Transparency

All data sources are public and linked. Our sync code is documented. Anyone can verify the data against the original sources.

Updates

The Intelligence Index and benchmarks are synced automatically every 6 hours via the Artificial Analysis API. New models are imported in the same window. Pricing and technical specs are enriched weekly.

Where does the benchmark data come from?

SWEN aggregates data from 4 specialized sources, each contributing different evaluation dimensions:

1. LMArena (Chatbot Arena) — ELO Score

URL: lmarena.ai
What we collect: ELO score per model, ranking, vote count
Frequency: Daily
Source methodology: LMArena (formerly LMSYS Chatbot Arena) operates a human voting platform where users compare anonymous responses from two models and pick the better one. The ELO system, analogous to chess ratings, computes a relative rating for each model from millions of cumulative votes. It is widely regarded as one of the most reliable industry benchmarks because it reflects real human preference rather than synthetic metrics.
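
To make the rating idea concrete, here is a minimal sketch of the classic chess-style ELO update for a single head-to-head vote. The K-factor of 32 and the 400-point scale are conventional chess values assumed for illustration. LMArena's own computation may differ (for example, by fitting all votes jointly rather than updating one vote at a time), and SWEN only collects the published scores.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the ELO model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32):
    """Return the new (rating_a, rating_b) after one vote.

    k is an assumed K-factor; it controls how far a single vote moves a rating.
    """
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models: the winner gains exactly half of k.
print(update(1200, 1200, a_won=True))  # (1216.0, 1184.0)
```

An upset (a lower-rated model winning) moves both ratings further than an expected result, which is why the rating converges as votes accumulate.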

2. Artificial Analysis — Intelligence Index + Detailed Benchmarks

URL: artificialanalysis.ai
What we collect: Intelligence Index (composite score 0-100), Coding Index, Math Index, MMLU Pro, GPQA Diamond, MATH-500, AIME 2025, LiveCodeBench, SWE Bench Verified, speed (tokens/s), latency (TTFT)
Frequency: Daily via API v2
Source methodology: The Intelligence Index combines 10 different evaluations into a composite score. Artificial Analysis runs each model against standardized evaluation datasets and measures both quality (accuracy) and performance (speed, latency). Speed and latency data are measured on proprietary infrastructure under controlled conditions.

3. LiveBench — Contamination-Resistant Benchmarks

URL: livebench.ai
What we collect: Global Average, Reasoning, Coding, Math, Data Analysis, Language scores (0-100)
Frequency: Daily
Source methodology: LiveBench is a self-updating benchmark that generates new questions periodically, reducing the risk of contamination (when models memorize answers from the training dataset). Questions are categorized across 6 dimensions and automatically evaluated against verified answer keys.

4. OpenRouter — Pricing, Specs and Availability

URL: openrouter.ai
What we collect: Price per million tokens (input/output), context window, max output tokens, supported modalities (text, image, audio, video), tool calling support, reasoning capability, model description
Frequency: Weekly via public API (no authentication)
Source methodology: OpenRouter is an AI API aggregator offering unified access to 300+ models. Pricing data reflects values from the original providers (OpenAI, Anthropic, Google, etc.) with OpenRouter markup. Prices shown on SWEN are values reported by OpenRouter, not direct provider prices.
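
As an illustration of how per-token prices translate into the per-million figures shown on the site, the sketch below converts a sample record and estimates the cost of one request. The record shape (string prices per token under "pricing", with "prompt" and "completion" keys) is modelled on OpenRouter's public model list but should be treated as an assumption, and the model id and prices are made up.

```python
from decimal import Decimal

# Hypothetical sample record; field names and values are assumptions.
sample_model = {
    "id": "example/model",
    "context_length": 200000,
    "pricing": {"prompt": "0.000003", "completion": "0.000015"},
}


def per_million(price_per_token: str) -> Decimal:
    """Convert a per-token price string into a price per million tokens."""
    return Decimal(price_per_token) * 1_000_000


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: Decimal, output_per_m: Decimal) -> Decimal:
    """Estimated cost in USD of one request, given per-million prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


input_price = per_million(sample_model["pricing"]["prompt"])
output_price = per_million(sample_model["pricing"]["completion"])
cost = request_cost(1000, 500, input_price, output_price)
print(input_price, output_price, cost)
```

Decimal is used instead of float so that very small per-token prices do not pick up rounding noise when scaled.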

How often is data updated?

Automatic sync: Edge functions (Supabase) run daily, collecting data from all 4 sources via APIs and controlled web scraping.
Model matching: A fuzzy matching algorithm with 5 precision levels (exact, normalized, partial, alphanumeric, base) maps model names across different sources (e.g., "claude-opus-4-6" ↔ "Anthropic: Claude Opus 4.6").
Deduplication: Duplicate benchmarks (same model, same benchmark) are resolved by keeping the most recent score.
Validation: Scores outside expected ranges (ELO < 800 or > 2000, Intelligence Index < 0 or > 100) are automatically discarded.
Publishing: Validated data is served on the site via ISR (Incremental Static Regeneration) with a 1-hour cache.
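
The matching, deduplication and validation steps above can be sketched as follows. This is an illustrative sketch, not SWEN's actual sync code: the exact rules behind each of the five matching levels are not documented here, so the definitions of normalize, alnum and base, the row fields (model, benchmark, fetched_at) and the metric keys are all assumptions.

```python
import re

# Only these two ranges are stated above; the metric keys are hypothetical.
EXPECTED_RANGES = {"elo": (800, 2000), "intelligence_index": (0, 100)}


def normalize(name: str) -> str:
    """Lowercase, drop a leading 'Provider: ' prefix, unify separators to '-'."""
    name = name.lower().split(": ", 1)[-1]
    return re.sub(r"[\s._/]+", "-", name).strip("-")


def alnum(normalized: str) -> str:
    """Keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", normalized)


def base(normalized: str) -> str:
    """Drop trailing date stamps or variant tags (assumed suffix list)."""
    return re.sub(r"(-(\d{8}|preview|latest))+$", "", normalized)


def match_level(a: str, b: str):
    """Return the first (most precise) level at which two names match, or None."""
    na, nb = normalize(a), normalize(b)
    checks = [
        ("exact", a == b),
        ("normalized", na == nb),
        # Substring matching can over-match (e.g. "gpt-4" inside "gpt-4o").
        ("partial", bool(na) and bool(nb) and (na in nb or nb in na)),
        ("alphanumeric", alnum(na) == alnum(nb)),
        ("base", base(na) == base(nb)),
    ]
    for level, matched in checks:
        if matched:
            return level
    return None


def dedupe(rows):
    """Keep the most recent row per (model, benchmark).

    fetched_at is assumed to be an ISO-8601 string, so string comparison
    orders rows chronologically.
    """
    latest = {}
    for row in rows:
        key = (row["model"], row["benchmark"])
        if key not in latest or row["fetched_at"] > latest[key]["fetched_at"]:
            latest[key] = row
    return list(latest.values())


def is_valid(metric: str, score: float) -> bool:
    """Discard scores outside the expected range; unknown metrics pass through."""
    low, high = EXPECTED_RANGES.get(metric, (float("-inf"), float("inf")))
    return low <= score <= high


print(match_level("claude-opus-4-6", "Anthropic: Claude Opus 4.6"))  # normalized
```

Trying the levels from most to least precise means a loose rule is used only when every stricter rule has failed, which limits false matches between similarly named models.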

What are the data limitations?

English-centric benchmarks. Most benchmarks test models in English only. Performance in other languages may vary significantly and is not captured in the scores presented.
Approximate pricing. Prices come from OpenRouter and may differ from direct provider pricing. They include OpenRouter markup and may not reflect volume discounts or enterprise agreements.
Variable speed and latency. Performance metrics depend on infrastructure, region, time of day and load. Reported values are averages under controlled conditions.
Potential conflict. SWEN has no conflict of interest with AI providers. If any commercial relationship develops in the future, it will be explicitly disclosed on this page.

How do I report incorrect data?

If you find incorrect or outdated data, or have suggestions to improve our methodology, please contact us:

Email: contato@swen.ia.br
X/Twitter: @SwenAI

Verified corrections are applied within 24 hours. We especially appreciate contributions from researchers, developers and AI professionals.
