Skip to content
Product18 min readRepath Khan

Why We Built the Critique Inference API: Western Servers, NVIDIA-Powered Capacity, and Credits You Already Own

v5.2 ships OpenAI-compatible chat completions on the same credit pool as review and Builder — not because the big vendors finally gave us permission, but because agent teams kept asking for one wallet, one key, and infrastructure we could stress-test before we asked customers to trust it.

Critique Inference API

Western servers · NVIDIA capacity · your credits

critique.sh

v5.2
Ship — Inference API + usage dashboard
0
Launch models on /api/v1/models
0M
Default context — DeepSeek V4 Flash
Western
Sweetener server region — no mirror to training pipelines

Every week we talk to teams running Coding Agent sidecars, internal eval harnesses, and PR-adjacent copilots. They all describe the same friction: Critique credits on one dashboard, a separate inference vendor on another, finance asking why the “AI budget” is three spreadsheets, and security asking why prompts leave the jurisdiction they thought they had standardized on review.

The Coding Agent API solved the heavy path — clone, patch, optional draft PR. Plenty of workflows do not need a sandbox. They need token-in, token-out chat at 2 a.m. when your orchestrator already knows the repo context. That gap is what Inference API closes.

There is a lazy story in enterprise AI: wait until a hyperscaler ships the “official” bundle, then bolt your product on. That story optimizes for slide decks, not for teams shipping this quarter.

Critique exists because review became the bottleneck when generation got cheap. Inference API exists because the same teams that hit that bottleneck are now running agents that call models hundreds of times per hour — and they want one trust boundary, one credit meter, and one acceptable-use story their security lead already read when they approved PR review.

The aim is simple: make hosted inference a first-class citizen of Critique, not a footnote on someone else’s pricing page. The plan is incremental — ship models we can stand behind, publish rate cards, expose usage dashboards, enforce per-user caps — then widen the catalog as we prove each lane under real agent traffic.

Marketing language for “cloud GPU” is cheap. What is not cheap is running agent-shaped traffic against a new surface until the p99 stops embarrassing you.

Inference API traffic routes through Western sweetener servers — our name for the capacity layer we use when we want low-latency, high-throughput completion work without parking customer payloads in jurisdictions our review customers did not sign up for. NVIDIA-powered clusters back the frontier lanes (including Nemotron 3 Ultra) where MoE width and KV pressure actually matter.

Before we listed models on `GET /api/v1/models`, we ran synthetic agent loops, streaming completion storms, and credit-metering soak tests against the same paths review uses for operational discipline. Not a demo. A gate. If a model lane could not survive that traffic profile, it did not ship.

Stress-tested
Streaming + non-streaming soak before public model IDs
NVIDIA
Powered capacity for Nemotron-class MoE inference
Western
Sweetener server routing — default for all launch models
Metered
Credits + usage dashboard — no mystery invoices

Critique does not use Inference API prompts or completions to train foundation models. Payloads are processed to return completions, metered for billing, and not retained for product analytics beyond what operations need to charge credits fairly.

We do not mirror customer traffic to non-Western training pipelines. We do not sell prompt data. The acceptable-use policy on [/inference-api](/inference-api) is short on purpose: if you would not run it in your production repo, do not run it through our keys.

Usage dashboards at [/inference-dashboard](/inference-dashboard) (signed in) show model mix, token volume, API key attribution, and limit status — so finance and platform teams see the same numbers engineering sees, without opening a second vendor console.

DeepSeek V4 Flash (default)

284B MoE, 13B active, 1M-token context class. Default for quickstart examples because most sidecars want throughput and reasoning headroom without flagship burn. Western-hosted. Standard private tier: $0.15 / $0.30 per million input/output tokens, billed via Critique credits.

deepseek/deepseek-v4-flash
Tencent Hy3 Preview

205B MoE built for agentic workflows with configurable reasoning depth (disabled, low, high). **0.5 credits** per PR review run on Critique. Inference API at **10% below market** ($0.0567 / $0.189 per M vs $0.063 / $0.21). 262K context. Western sweetener servers; no logs; no training data retained.

tencent/hy3-preview

NVIDIA Nemotron 3 Ultra (`nvidia/nemotron-3-ultra-550b-a55b`) rounds out the launch trio: 550B MoE (55B active) for long-horizon agent orchestration. Intro pricing on review and API continues through 19 June 2026 UTC — 2 credits per PR review run (then 3 shelf), API tokens at 50% off market for the same window. After that, Nemotron API pricing returns to standard market rates. See the rate card on [/inference-api#pricing](/inference-api#pricing).

Quickstart — same key as Builder

Replace crt_… with a key from Settings → Connections. baseURL is Critique, not a third-party marketplace.

curl https://critique.sh/api/v1/chat/completions \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v4-flash",
    "messages": [
      { "role": "user", "content": "Summarize this retry policy in three bullets." }
    ]
  }'

Hosted inference without guardrails is just a faster way to drain a credit pool. Settings → Connections includes optional daily and monthly Inference API caps, request caps, credit reserve for review, and an enable/disable toggle — plus the full dashboard for charts and paginated activity logs.

That is deliberate product design: we would rather you throttle yourself than discover a runaway agent at month-end.

More models will land on the same endpoints as we validate them on sweetener capacity. The bar does not change: Western default, published rate cards, stress tests, usage visibility, and honest privacy copy.

If you are already on Critique for review, try Inference API with the credits you have. If you are evaluating agent infrastructure, read the ship notes at [/version](/version) for v5.2 and the dashboard docs linked from [/inference-api](/inference-api).

DeepSeek V4 Flash only — after the infrastructure story

75% off when you opt in to training logs.

Private-by-default stays the standard tier at $0.15 / $0.30 per million input/output tokens. If you explicitly opt in — account toggle in Settings → Connections or header X-Critique-DeepSeek-Training-Opt-In: true on a request — DeepSeek V4 Flash bills at **25% of list price (75% off)**: $0.0375 / $0.075 per M. Western-hosted either way. This deal applies to DeepSeek V4 Flash on Inference API only, not review runs or other models.

75% off
DeepSeek V4 Flash
$0.0375 / $0.075 per M$0.15 / $0.30 private tier
Ends Opt-in only — Settings or request header

Try Inference API on credits you already have.

Sign in, create a crt_ key with inference scopes, point your OpenAI SDK at https://critique.sh/api/v1, and call deepseek/deepseek-v4-flash today.

Open Inference API →