NostrumAI delivers serverless AI inference at speeds that were impossible until now. One API. Every major model. No infrastructure. Pay per token. Start in seconds.
NostrumAI is a pure serverless inference provider. No clusters to manage, no GPUs to provision, no DevOps. Drop in our API and run the world’s best models at speeds no cloud provider can match.
14ms time-to-first-token at the p50. Our inference engine is built from the ground up for speed — not retrofitted onto cloud VMs. Every request is routed to the nearest available compute.
Zero cold starts. Zero idle cost. The platform scales from a single request to millions per second with no configuration changes. You write code; we run it.
Drop in your API key and swap the base URL. Works with any SDK or tool that speaks the OpenAI chat completions spec. Migration takes two lines of code.
Your application sends a standard chat completions request to api.nostrum.ai/v1.
Global router selects the lowest-latency inference node. Edge cache checked first.
Request executes on dedicated inference silicon. Streaming begins at 14ms p50.
Tokens stream back. Usage metered to the token. No standing costs, no minimums.
All models available at api.nostrum.ai/v1/chat/completions.
Switch models by changing a single parameter.
Context windows up to 128K. Streaming on every endpoint.
| Model ID | Parameters | Context | Speed | Latency (p50) | Status |
|---|---|---|---|---|---|
| nostrum-3-70b | 70B | 128K | 2.8M tok/s | 14 ms | GA |
| nostrum-3-8b | 8B | 128K | 5.1M tok/s | 8 ms | GA |
| nostrum-3.5-72b | 72B | 128K | 2.4M tok/s | 18 ms | GA |
| nostrum-vision-11b | 11B | 128K | 1.9M tok/s | 22 ms | New |
| nostrum-compound-beta | — | 128K | variable | — | Preview |
| nostrum-whisper-large | 1.5B | — | 189× RT | — | GA |
We benchmark every model against every major provider on every release. NostrumAI is consistently 10–18× faster than GPU cloud equivalents. Speed compounds: faster inference means faster iteration, faster products, and dramatically lower costs at scale.
NostrumAI’s inference stack is designed for enterprises and developers who treat latency as a first-class requirement. We operate the full stack — silicon to API surface.
Inference nodes in 12 regions across North America, Europe, and Asia-Pacific. Requests automatically route to the nearest available node. No configuration required.
Contractual SLA with automatic credits for downtime. Our systems run active-active across regions with transparent status at status.nostrum.ai.
No request logging. No model training on your data. Every inference call is stateless — your inputs and outputs are never stored. SOC 2 Type II certified.
Pay per million tokens. No seat licenses, no reserved capacity, no minimums. Start free, scale to billions of tokens without a procurement call.
NostrumAI was founded in 2020 by a team that spent years building production ML infrastructure at scale. We are a focused team — no enterprise bureaucracy, no distractions. Just inference.
Founded on the thesis that inference speed is the primary unlock for AI adoption. The first version of the inference engine is built in six weeks and deployed to early design partners.
The core serverless dispatch layer is built and validated. Zero-cold-start architecture prototyped. First external developers onboarded. API spec stabilised on the OpenAI-compatible format.
The unified model router ships, enabling developers to switch between models with a single parameter. Edge network expands to eight regions. Streaming latency drops below 20ms p50 for the first time.
NostrumAI crosses $50M in annual recurring revenue. Throughput exceeds 2.8 million tokens per second. Enterprise SLA programme launched. SOC 2 Type II certification achieved.
$50M ARR and compounding. Six production models across text, vision, and audio. 12 edge regions. Five years of inference-focused engineering — and still the fastest API on the market.
One API key. Every major model. 14ms latency. Free tier available — no credit card required.