NostrumAI — Serverless Inference
Questions? Partnerships? We respond within 24h. info@nostrum-ai.com
Serverless Inference · Now Live

The fastest
AI inference
on the planet.

NostrumAI delivers serverless AI inference at speeds that were impossible until now. One API. Every major model. No infrastructure. Pay per token. Start in seconds.

quickstart.py
# NostrumAI — serverless inference
import nostrum
client = nostrum.Nostrum(
api_key=“nai_••••••••••••••••”
)
# Run inference — any model, any scale
response = client.chat.completions.create(
model=“nostrum-3-70b”,
messages=[{
“role”: “user”,
“content”: “Explain neural scaling.”
}],
stream=True
)
print(response.usage.total_tokens_per_sec)
# → 2,847,291 tokens/sec ⚡
Output Speed
2.8M
tokens / second
Time to First Token
14ms
p50 latency
Uptime
99.9%
SLA guaranteed
Annual Revenue
$50M
USD · scaling
Founded
2020
Est. Singapore
The Platform

Inference, reinvented.

NostrumAI is a pure serverless inference provider. No clusters to manage, no GPUs to provision, no DevOps. Drop in our API and run the world’s best models at speeds no cloud provider can match.

01
Ultra-low latency inference

14ms time-to-first-token at the p50. Our inference engine is built from the ground up for speed — not retrofitted onto cloud VMs. Every request is routed to the nearest available compute.

02
Serverless at any scale

Zero cold starts. Zero idle cost. The platform scales from a single request to millions per second with no configuration changes. You write code; we run it.

03
OpenAI-compatible API

Drop in your API key and swap the base URL. Works with any SDK or tool that speaks the OpenAI chat completions spec. Migration takes two lines of code.

How it works

A request in four steps.

Step 01
Request

Your application sends a standard chat completions request to api.nostrum.ai/v1.

Step 02
Route

Global router selects the lowest-latency inference node. Edge cache checked first.

Step 03
Run

Request executes on dedicated inference silicon. Streaming begins at 14ms p50.

Step 04
Return

Tokens stream back. Usage metered to the token. No standing costs, no minimums.

Models

Every model.
One endpoint.

All models available at api.nostrum.ai/v1/chat/completions. Switch models by changing a single parameter. Context windows up to 128K. Streaming on every endpoint.

Model ID Parameters Context Speed Latency (p50) Status
nostrum-3-70b 70B 128K 2.8M tok/s 14 ms GA
nostrum-3-8b 8B 128K 5.1M tok/s 8 ms GA
nostrum-3.5-72b 72B 128K 2.4M tok/s 18 ms GA
nostrum-vision-11b 11B 128K 1.9M tok/s 22 ms New
nostrum-compound-beta 128K variable Preview
nostrum-whisper-large 1.5B 189× RT GA
Performance

Speed isn’t a feature.
It’s the product.

We benchmark every model against every major provider on every release. NostrumAI is consistently 10–18× faster than GPU cloud equivalents. Speed compounds: faster inference means faster iteration, faster products, and dramatically lower costs at scale.

Output tokens per second — nostrum-3-70b vs. equivalents
NostrumAI
2,847k/s
Provider A
510k/s
Provider B
340k/s
Provider C
255k/s
Provider D
170k/s
Infrastructure

Built for
production.

NostrumAI’s inference stack is designed for enterprises and developers who treat latency as a first-class requirement. We operate the full stack — silicon to API surface.

Global Edge Network

Inference nodes in 12 regions across North America, Europe, and Asia-Pacific. Requests automatically route to the nearest available node. No configuration required.

99.9% Uptime SLA

Contractual SLA with automatic credits for downtime. Our systems run active-active across regions with transparent status at status.nostrum.ai.

Zero Data Retention

No request logging. No model training on your data. Every inference call is stateless — your inputs and outputs are never stored. SOC 2 Type II certified.

Usage-based Pricing

Pay per million tokens. No seat licenses, no reserved capacity, no minimums. Start free, scale to billions of tokens without a procurement call.

Company

Built by
inference obsessives.

NostrumAI was founded in 2020 by a team that spent years building production ML infrastructure at scale. We are a focused team — no enterprise bureaucracy, no distractions. Just inference.

company.json
{
“name”: “NostrumAI”,
“founded”: 2020,
“revenue”: “$50M ARR”,
“focus”: “pure serverless inference”,
“hq”: “Singapore”,
“stage”: “scaling”
}
History

Five years of
focused building.

2020
Founding
NostrumAI is established.

Founded on the thesis that inference speed is the primary unlock for AI adoption. The first version of the inference engine is built in six weeks and deployed to early design partners.

2021
Architecture
Serverless inference engine v1 ships.

The core serverless dispatch layer is built and validated. Zero-cold-start architecture prototyped. First external developers onboarded. API spec stabilised on the OpenAI-compatible format.

2022
Platform
Multi-model routing goes live.

The unified model router ships, enabling developers to switch between models with a single parameter. Edge network expands to eight regions. Streaming latency drops below 20ms p50 for the first time.

2024
Scale
Revenue reaches $50M ARR.

NostrumAI crosses $50M in annual recurring revenue. Throughput exceeds 2.8 million tokens per second. Enterprise SLA programme launched. SOC 2 Type II certification achieved.

2026
Now
Established and accelerating.

$50M ARR and compounding. Six production models across text, vision, and audio. 12 edge regions. Five years of inference-focused engineering — and still the fastest API on the market.

// Get Started
Fast inference.
No infrastructure.
Start in seconds.

One API key. Every major model. 14ms latency. Free tier available — no credit card required.