Launching Aug 29, 2026

Benchmarks

Measured freshness, correctness, and efficiency with reproducible methodology.

73%

Fewer incorrect API calls

More info per token

<1 day

Behind latest releases

2.3x

Faster task completion

Benchmark Dashboard

API Correctness Rate: 94.2%

Average task score (%) across all 1,000 tasks.

Freshness Lag (Days): 0.6

Average days behind latest release for target versions.
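As a rough illustration of how an average freshness lag could be computed, the sketch below averages the gap between each release and the moment its docs were indexed. The function name, the (released_at, indexed_at) pairing, and the sample timestamps are all assumptions made for illustration; the page does not describe the actual pipeline.

```python
from datetime import datetime
from statistics import mean

SECONDS_PER_DAY = 86_400

def freshness_lag_days(pairs):
    """Average lag in days over (released_at, indexed_at) datetime pairs.

    Hypothetical helper: the pairing of release time and index time is an
    assumption, not a documented part of the benchmark.
    """
    lags = [
        # Clamp at zero so docs indexed ahead of a release do not go negative.
        max((indexed_at - released_at).total_seconds(), 0) / SECONDS_PER_DAY
        for released_at, indexed_at in pairs
    ]
    return mean(lags)

# Made-up timestamps, for illustration only.
sample = [
    (datetime(2026, 1, 10, 0, 0), datetime(2026, 1, 10, 12, 0)),  # 0.5 days
    (datetime(2026, 1, 12, 0, 0), datetime(2026, 1, 13, 0, 0)),   # 1.0 days
]
lag = freshness_lag_days(sample)
```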

Token Efficiency: 4.1x

Chart: Base vs. Sota

Tokens per correct task (normalized vs baseline).
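One plausible reading of "tokens per correct task (normalized vs baseline)" is the baseline's tokens per correct task divided by the same quantity with SotaDocs, so that values above 1 mean fewer tokens spent per correct answer. The sketch below assumes that direction of normalization; the function names and the input numbers are hypothetical.

```python
def tokens_per_correct_task(total_tokens, correct_tasks):
    """Tokens spent divided by the number of tasks answered correctly."""
    if correct_tasks <= 0:
        raise ValueError("need at least one correct task")
    return total_tokens / correct_tasks

def token_efficiency(baseline_tokens, baseline_correct, sota_tokens, sota_correct):
    """Ratio of baseline cost to SotaDocs cost per correct task.

    Assumption: the baseline is the numerator, so higher is better.
    """
    baseline = tokens_per_correct_task(baseline_tokens, baseline_correct)
    with_docs = tokens_per_correct_task(sota_tokens, sota_correct)
    return baseline / with_docs

# Made-up totals: 2,000 vs 500 tokens per correct task.
ratio = token_efficiency(1_200_000, 600, 450_000, 900)
```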

Tool Calls per Task: 1.7

Chart: Without SotaDocs vs. With SotaDocs

Average MCP tool calls per completed task.

Blindspot Report Impact

Tasks involving changed APIs (model training cutoff vs current docs)

Without Blindspot Report

62%

Correct on stale-API tasks

With Blindspot Report

92%

Correct on stale-API tasks
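The two figures above can be read either as an absolute gain or as a relative drop in errors. The short calculation below works through both using only the 62% and 92% shown here; the variable names are illustrative. It gives a gain of 30 percentage points and an error rate that falls from 38% to 8%, roughly a 79% relative reduction.

```python
without_report = 0.62  # correct on stale-API tasks, without Blindspot Report
with_report = 0.92     # correct on stale-API tasks, with Blindspot Report

# Absolute improvement in percentage points.
point_gain = (with_report - without_report) * 100

# Relative reduction in the share of incorrect answers.
error_before = 1 - without_report
error_after = 1 - with_report
relative_error_reduction = (error_before - error_after) / error_before

print(f"{point_gain:.0f} percentage points")
print(f"{relative_error_reduction:.0%} fewer incorrect answers")
```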

Methodology

Dataset

1,000 API usage test cases
50 popular npm/PyPI packages
Multiple version ranges tested
Includes deprecated API scenarios
RLM-tagged multi-hop tasks included

Models Tested

Claude 3.5 Sonnet
GPT-4o
Gemini 1.5 Pro
Codex

Scoring

API signature correctness
Parameter type accuracy
Deprecated API detection
Version-appropriate responses
RLM budget adherence (when enabled)
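The page lists the scoring criteria but not how they are combined. As a minimal sketch, assuming each criterion is a pass/fail check with equal weight and that RLM budget adherence only counts when RLM is enabled, a per-task score could look like this. The `TaskResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskResult:
    signature_correct: bool
    param_types_correct: bool
    deprecated_api_flagged: bool
    version_appropriate: bool
    # None means RLM was not enabled for this task.
    rlm_budget_respected: Optional[bool] = None

def task_score(result: TaskResult) -> float:
    """Percentage of applicable checks passed (equal weights are an assumption)."""
    checks = [
        result.signature_correct,
        result.param_types_correct,
        result.deprecated_api_flagged,
        result.version_appropriate,
    ]
    if result.rlm_budget_respected is not None:
        checks.append(result.rlm_budget_respected)
    return 100 * sum(checks) / len(checks)

score_without_rlm = task_score(TaskResult(True, True, False, True))
score_with_rlm = task_score(TaskResult(True, True, False, True, rlm_budget_respected=True))
```

Averaging `task_score` over every task would then give a single headline percentage, under the same equal-weight assumption.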

Reproducibility

Test suite available on GitHub
Deterministic prompts
Version-pinned dependencies
Monthly rerun cadence

See it for yourself

Try the playground and compare results with your current setup.

Available Aug 29, 2026

Try Playground
