NostrumAI — Serverless Inference

Questions? Partnerships? We respond within 24h. info@nostrum-ai.com

Serverless Inference · Now Live

The fastest
AI inference
on the planet.

NostrumAI delivers serverless AI inference at speeds that were impossible until now. One API. Every major model. No infrastructure. Pay per token. Start in seconds.

Start Building Free Explore Models

quickstart.py

# NostrumAI — serverless inference

import nostrum

client = nostrum.Nostrum(

api_key=“nai_••••••••••••••••”

)

# Run inference — any model, any scale

response = client.chat.completions.create(

model=“nostrum-3-70b”,

messages=[{

“role”: “user”,

“content”: “Explain neural scaling.”

}],

stream=True

)

print(response.usage.total_tokens_per_sec)

# → 2,847,291 tokens/sec ⚡

Output Speed

2.8M

tokens / second

Time to First Token

14ms

p50 latency

Uptime

99.9%

SLA guaranteed

Annual Revenue

$50M

USD · scaling

Founded

2020

Est. Singapore

The Platform

Inference, reinvented.

NostrumAI is a pure serverless inference provider. No clusters to manage, no GPUs to provision, no DevOps. Drop in our API and run the world’s best models at speeds no cloud provider can match.

⚡

Ultra-low latency inference

14ms time-to-first-token at the p50. Our inference engine is built from the ground up for speed — not retrofitted onto cloud VMs. Every request is routed to the nearest available compute.

∞

Serverless at any scale

Zero cold starts. Zero idle cost. The platform scales from a single request to millions per second with no configuration changes. You write code; we run it.

≡

OpenAI-compatible API

Drop in your API key and swap the base URL. Works with any SDK or tool that speaks the OpenAI chat completions spec. Migration takes two lines of code.

How it works

A request in four steps.

Step 01

Request

Your application sends a standard chat completions request to api.nostrum.ai/v1.

Step 02

Route

Global router selects the lowest-latency inference node. Edge cache checked first.

Step 03

Run

Request executes on dedicated inference silicon. Streaming begins at 14ms p50.

Step 04

Return

Tokens stream back. Usage metered to the token. No standing costs, no minimums.

Models

Every model.
One endpoint.

All models available at api.nostrum.ai/v1/chat/completions. Switch models by changing a single parameter. Context windows up to 128K. Streaming on every endpoint.

Model ID	Parameters	Context	Speed	Latency (p50)	Status
nostrum-3-70b	70B	128K	2.8M tok/s	14 ms	GA
nostrum-3-8b	8B	128K	5.1M tok/s	8 ms	GA
nostrum-3.5-72b	72B	128K	2.4M tok/s	18 ms	GA
nostrum-vision-11b	11B	128K	1.9M tok/s	22 ms	New
nostrum-compound-beta	—	128K	variable	—	Preview
nostrum-whisper-large	1.5B	—	189× RT	—	GA

Performance

Speed isn’t a feature.
It’s the product.

We benchmark every model against every major provider on every release. NostrumAI is consistently 10–18× faster than GPU cloud equivalents. Speed compounds: faster inference means faster iteration, faster products, and dramatically lower costs at scale.

Output tokens per second — nostrum-3-70b vs. equivalents

NostrumAI

2,847k/s

Provider A

510k/s

Provider B

340k/s

Provider C

255k/s

Provider D

170k/s

Infrastructure

Built for
production.

NostrumAI’s inference stack is designed for enterprises and developers who treat latency as a first-class requirement. We operate the full stack — silicon to API surface.

Global Edge Network

Inference nodes in 12 regions across North America, Europe, and Asia-Pacific. Requests automatically route to the nearest available node. No configuration required.

99.9% Uptime SLA

Contractual SLA with automatic credits for downtime. Our systems run active-active across regions with transparent status at status.nostrum.ai.

Zero Data Retention

No request logging. No model training on your data. Every inference call is stateless — your inputs and outputs are never stored. SOC 2 Type II certified.

Usage-based Pricing

Pay per million tokens. No seat licenses, no reserved capacity, no minimums. Start free, scale to billions of tokens without a procurement call.

Company

Built by
inference obsessives.

NostrumAI was founded in 2020 by a team that spent years building production ML infrastructure at scale. We are a focused team — no enterprise bureaucracy, no distractions. Just inference.

company.json

{

“name”: “NostrumAI”,

“founded”: 2020,

“revenue”: “$50M ARR”,

“focus”: “pure serverless inference”,

“hq”: “Singapore”,

“stage”: “scaling”

}

History

Five years of
focused building.

2020

Founding

NostrumAI is established.

Founded on the thesis that inference speed is the primary unlock for AI adoption. The first version of the inference engine is built in six weeks and deployed to early design partners.

2021

Architecture

Serverless inference engine v1 ships.

The core serverless dispatch layer is built and validated. Zero-cold-start architecture prototyped. First external developers onboarded. API spec stabilised on the OpenAI-compatible format.

2022

Platform

Multi-model routing goes live.

The unified model router ships, enabling developers to switch between models with a single parameter. Edge network expands to eight regions. Streaming latency drops below 20ms p50 for the first time.

2024

Scale

Revenue reaches $50M ARR.

NostrumAI crosses $50M in annual recurring revenue. Throughput exceeds 2.8 million tokens per second. Enterprise SLA programme launched. SOC 2 Type II certification achieved.

2026

Now

Established and accelerating.

$50M ARR and compounding. Six production models across text, vision, and audio. 12 edge regions. Five years of inference-focused engineering — and still the fastest API on the market.

// Get Started

Fast inference.
No infrastructure.
Start in seconds.

One API key. Every major model. 14ms latency. Free tier available — no credit card required.

Get API Key — Free Read the Docs

The fastestAI inferenceon the planet.

Inference, reinvented.

A request in four steps.

Every model.One endpoint.

Speed isn’t a feature.It’s the product.

Built forproduction.