ProductJune 5, 202618 min readRepath Khan

Why We Built the Critique Inference API: Western Servers, NVIDIA-Powered Capacity, and Credits You Already Own

v5.2 ships OpenAI-compatible chat completions on the same credit pool as review and Builder — not because the big vendors finally gave us permission, but because agent teams kept asking for one wallet, one key, and infrastructure we could stress-test before we asked customers to trust it.

Critique Inference API

Western servers · NVIDIA capacity · your credits

critique.sh

v5.2

Ship — Inference API + usage dashboard

Launch models on /api/v1/models

Default context — DeepSeek V4 Flash

Western

Sweetener server region — no mirror to training pipelines

The problem was never “another model API”

Every week we talk to teams running Coding Agent sidecars, internal eval harnesses, and PR-adjacent copilots. They all describe the same friction: Critique credits on one dashboard, a separate inference vendor on another, finance asking why the “AI budget” is three spreadsheets, and security asking why prompts leave the jurisdiction they thought they had standardized on review.

The Coding Agent API solved the heavy path — clone, patch, optional draft PR. Plenty of workflows do not need a sandbox. They need token-in, token-out chat at 2 a.m. when your orchestrator already knows the repo context. That gap is what Inference API closes.

Why now — and why not “when the big firms say so”

There is a lazy story in enterprise AI: wait until a hyperscaler ships the “official” bundle, then bolt your product on. That story optimizes for slide decks, not for teams shipping this quarter.

Critique exists because review became the bottleneck when generation got cheap. Inference API exists because the same teams that hit that bottleneck are now running agents that call models hundreds of times per hour — and they want one trust boundary, one credit meter, and one acceptable-use story their security lead already read when they approved PR review.

The aim is simple: make hosted inference a first-class citizen of Critique, not a footnote on someone else’s pricing page. The plan is incremental — ship models we can stand behind, publish rate cards, expose usage dashboards, enforce per-user caps — then widen the catalog as we prove each lane under real agent traffic.

Sweetener servers, NVIDIA-powered capacity, stress-tested before launch

Marketing language for “cloud GPU” is cheap. What is not cheap is running agent-shaped traffic against a new surface until the p99 stops embarrassing you.

Inference API traffic routes through Western sweetener servers — our name for the capacity layer we use when we want low-latency, high-throughput completion work without parking customer payloads in jurisdictions our review customers did not sign up for. NVIDIA-powered clusters back the frontier lanes (including Nemotron 3 Ultra) where MoE width and KV pressure actually matter.

Before we listed models on GET /api/v1/models, we ran synthetic agent loops, streaming completion storms, and credit-metering soak tests against the same paths review uses for operational discipline. Not a demo. A gate. If a model lane could not survive that traffic profile, it did not ship.

Stress-tested

Streaming + non-streaming soak before public model IDs

NVIDIA

Powered capacity for Nemotron-class MoE inference

Western

Sweetener server routing — default for all launch models

Metered

Credits + usage dashboard — no mystery invoices

Privacy is the default, not the footnote

Critique does not use Inference API prompts or completions to train foundation models. Payloads are processed to return completions, metered for billing, and not retained for product analytics beyond what operations need to charge credits fairly.

We do not mirror customer traffic to non-Western training pipelines. We do not sell prompt data. The acceptable-use policy on /inference-api is short on purpose: if you would not run it in your production repo, do not run it through our keys.

Usage dashboards at /inference-dashboard (signed in) show model mix, token volume, API key attribution, and limit status — so finance and platform teams see the same numbers engineering sees, without opening a second vendor console.

First models on the API — what shipped in v5.2

DeepSeek V4 Flash (default)
284B MoE, 13B active, 1M-token context class. Default for quickstart examples because most sidecars want throughput and reasoning headroom without flagship burn. Western-hosted. Standard private tier: $0.15 / $0.30 per million input/output tokens, billed via Critique credits.
deepseek/deepseek-v4-flash
Tencent Hy3 Preview
205B MoE built for agentic workflows with configurable reasoning depth (disabled, low, high). 0.5 credits per PR review run on Critique. Inference API at 10% below market ($0.0567 / $0.189 per M vs $0.063 / $0.21). 262K context. Western sweetener servers; no logs; no training data retained.
tencent/hy3-preview

NVIDIA Nemotron 3 Ultra (nvidia/nemotron-3-ultra-550b-a55b) rounds out the launch trio: 550B MoE (55B active) for long-horizon agent orchestration. Intro pricing on review and API continues through 19 June 2026 UTC — 2 credits per PR review run (then 3 shelf), API tokens at 50% off market for the same window. After that, Nemotron API pricing returns to standard market rates. See the rate card on /inference-api#pricing.

Quickstart — same key as Builder

Replace crt_… with a key from Settings → Connections. baseURL is Critique, not a third-party marketplace.

curl https://critique.sh/api/v1/chat/completions \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v4-flash",
    "messages": [
      { "role": "user", "content": "Summarize this retry policy in three bullets." }
    ]
  }'

Limits, caps, and the ops story

Hosted inference without guardrails is just a faster way to drain a credit pool. Settings → Connections includes optional daily and monthly Inference API caps, request caps, credit reserve for review, and an enable/disable toggle — plus the full dashboard for charts and paginated activity logs.

That is deliberate product design: we would rather you throttle yourself than discover a runaway agent at month-end.

What comes next

More models will land on the same endpoints as we validate them on sweetener capacity. The bar does not change: Western default, published rate cards, stress tests, usage visibility, and honest privacy copy.

If you are already on Critique for review, try Inference API with the credits you have. If you are evaluating agent infrastructure, read the ship notes at /version for v5.2 and the dashboard docs linked from /inference-api.

DeepSeek V4 Flash only — after the infrastructure story
75% off when you opt in to training logs.Private-by-default stays the standard tier at $0.15 / $0.30 per million input/output tokens. If you explicitly opt in — account toggle in Settings → Connections or header X-Critique-DeepSeek-Training-Opt-In: true on a request — DeepSeek V4 Flash bills at **25% of list price (75% off)**: $0.0375 / $0.075 per M. Western-hosted either way. This deal applies to DeepSeek V4 Flash on Inference API only, not review runs or other models.
DeepSeek
75% off
DeepSeek V4 Flash
$0.0375 / $0.075 per M$0.15 / $0.30 private tier
Ends Opt-in only — Settings or request header

Primary sources

Critique Inference API

Docs, rate cards, API key gate

Inference usage dashboard

Signed in — charts + activity

Leemer Labs — Critique

Product arc + ecosystem

Leemer Labs — Founder

Background on the bet

Ship log

Operator release notes — v6.0 Platform API; v5.2 Inference API

Try Inference API on credits you already have.

Sign in, create a crt_ key with inference scopes, point your OpenAI SDK at https://critique.sh/api/v1, and call deepseek/deepseek-v4-flash today.

Open Inference API →

Compare Critique

Compare the main AI code review options.

If this article is part of a buying process, these pages compare Critique with the tools most teams evaluate for GitHub PR review.

Best AI code review tools AI code review pricing

← All essays Privacy & Terms

Ask about this essay

Nemotron-3-Super

Ask about the argument, the evidence, the structure, or how the post connects to Critique.

Not editorial advice · The essay above is the source of truth · Not saved to your account · OpenRouter privacy