LayerScale

The fastest inference engine for agents and streams.

Stateful inference for the next generation of AI.

Book a Demo Get Started

Accelerated by

33ms

Average query latency

Faster than vLLM and SGLang

O(1)

Query complexity

80x

Faster than Claude Opus 4.5

How It Works

Rethinking inference.
Process the stream, not the request.

LayerScale introduces a new computation model. Data is processed the moment it arrives, not when you query. By the time you ask a question, the answer is already there.

Data-Driven Processing

Data flows continuously into the cache. Processing happens in the background, amortized across your data stream. Queries become lightweight consumers of pre-computed state.

Constant Query Latency

Query latency depends on your question length, not your context size. Whether you have 100 data points or 10,000, query time stays the same.

Full Attention Preserved

Unlike sparse attention or state-space models that sacrifice model capacity for speed, LayerScale maintains full quadratic self-attention. No quality compromises.

Benchmarks

8x Faster Than Leading
Inference Engines

LayerScale

33ms

SGLang

219ms 7x

vLLM

223ms 7x

TensorRT-LLM

255ms 8x

llama.cpp

261ms 8x

GPT-5.2

926ms 28x

GPT-4o

1,124ms 34x

Opus 4.5

2,628ms 80x

Local inference Cloud API

Benchmarked on streaming OHLCV data (155 to 1,200 samples, ~2,600 to 19,000 tokens). Local engines use Meta-Llama-3.1-8B-Instruct on NVIDIA L40S GPU. Cloud APIs called via their respective endpoints.

Multi-Agent Tool Calling

Fastest Inference Engine
for Agentic AI

AI coding agents make dozens of tool calls per task: reading files, writing code, running tests. We benchmarked a 35-turn React app build from scratch against four production engines. LayerScale completes the full session in 42.5 seconds, 1.4x faster than vLLM and SGLang, by preserving computation across turns instead of reprocessing from scratch.

LayerScale SGLang TensorRT-LLM vLLM llama.cpp

Cumulative wall time for a 35-turn agentic coding session (React app from scratch), Llama-3.1-8B-Instruct on NVIDIA L40S GPU (10 iterations). Lower is faster. LayerScale completes in 42.5s vs 60.3s for vLLM, 62.1s for SGLang, 63.0s for llama.cpp, and 69.3s for TensorRT-LLM.

Architecture

Built for Streaming
from the Ground Up

LayerScale manages context through intelligent regions so each data pattern gets the right treatment. A lock-free ingestion pipeline ensures data pushes never block. O(1) tokenization via pre-computed lookup tables eliminates overhead on the critical path, and fast-path detection returns single-token classification answers immediately.

Conventional

Query

Full ReprocessO(n)

Response

LayerScale

Data Stream

Pre-computed Cachealways-on

Query

ResponseO(1)

Use Cases

Where LayerScale Excels

▵

Market Data

Real-time analysis over continuous OHLCV feeds with sub-50ms responses.

▸

Log Analysis

Stream logs continuously and query for anomalies without reprocessing.

◥

Sensor Monitoring

IoT data processed in real-time with instant analytical queries.

▶

Real-Time Analytics

Instant answers over accumulated context as data streams in.

Universal Compatibility

Run Any Model
on Any Hardware

Any open-weight transformer model, optimized and production-ready across all major accelerator platforms.

Open Models

Meta Llama 4

Google Gemma 4

DeepSeek DeepSeek V3.2

Mistral AI Mistral 4

Qwen Qwen 3

Microsoft Phi-4

NVIDIA Nemotron 3

+ More Any open-weight transformer

Hardware Platforms

NVIDIA CUDA RTX 20xx+, T4, L40S, A10, A100, H100, B200, B300

AMD ROCm Instinct MI250, MI300X, MI455X

Intel Arc Arc A-Series, Data Center GPU Max

Apple Metal M1, M2, M3, M4, M5 Pro / Max / Ultra

Developer Experience

Anthropic/OpenAI Compatible API

Drop-in replacement for existing workflows. Standard endpoints with streaming support, plus specialized session APIs for continuous data injection. Python and TypeScript client libraries available. Full API reference

Endpoints

POST /v1/chat/completions OpenAI-compatible
POST /v1/messages Anthropic-compatible
POST /v1/sessions/init Create session
POST /v1/sessions/{id}/stream/push Push data
POST /v1/sessions/{id}/generate O(1) query
GET /v1/sessions/{id}/stream/status Stream stats

Streaming endpoints also available via WebSockets, TCP sockets, and Server-Sent Events (SSE) for low-latency persistent connections.

              
              
              
              terminal
            

# Initialize a streaming session
curl -X POST https://api.layerscale.ai/v1/sessions/init \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a financial analyst..."}'

# Push streaming data (non-blocking)
curl -X POST https://api.layerscale.ai/v1/sessions/{id}/stream/push \
  -H "Content-Type: application/json" \
  -d '{"data": [{"o": 150.25, "h": 151.00, "l": 149.80, "c": 150.90, "v": 100000}]}'

# Query with constant latency
curl -X POST https://api.layerscale.ai/v1/sessions/{id}/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the current trend?"}'
            

Zero-Latency Queries

Flash Queries

Define your questions upfront and get pre-computed answers after every data update, streamed back via SSE with zero GPU work. Learn more →

Deployment

Run Anywhere

Docker Recommended

docker run --gpus all -p 8080:8080 \
  -e HF_TOKEN=$HF_TOKEN \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --license-key $LAYERSCALE_LICENSE_KEY

Don't have a key yet? Get a license key

Cloud API

# Use the hosted API directly
curl -X POST https://api.layerscale.ai/v1/sessions/init \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a financial analyst..."}'
            

GPU Support

NVIDIA T4, RTX 20xx+, L4, L40S, A10, A100, H100, B200, B300, MI455X. Supports CUDA, ROCm, and Metal.

Model Support

Any open-weight model. Llama, Mistral, Qwen, Gemma, and more work out of the box.

Platforms

Linux recommended for production. macOS supported for development.

FAQ

Frequently Asked Questions

What models does LayerScale support?

Virtually any open-weight transformer model. Popular models like Llama, Mistral, Qwen, and Gemma work out of the box, with automatic format conversion included.

How is this different from vLLM or TensorRT-LLM?

Those are request-driven frameworks optimized for serving many users with shared prefixes. LayerScale is data-driven, optimized for streaming scenarios where context changes continuously and queries need instant answers.

Does this work for chat applications?

LayerScale works fine for chat applications, but provides no benefit over conventional inference engines in that scenario. If you don't have continuous data ingestion, a conventional inference engine is probably a better fit.

Can I use this with my existing Anthropic or OpenAI code?

Yes. We support the Anthropic /v1/messages endpoint and the OpenAI /v1/chat/completions endpoint. Just point your existing client to the LayerScale API.

What's the memory overhead?

Each session maintains cache proportional to context length. We implement session pooling and LRU eviction for multi-session deployments. Bounded memory with configurable sliding windows keeps resource usage predictable.

The fastest inference engine for agents and streams.

Rethinking inference.Process the stream, not the request.

Data-Driven Processing

Constant Query Latency

Full Attention Preserved

8x Faster Than LeadingInference Engines

Fastest Inference Enginefor Agentic AI

Built for Streamingfrom the Ground Up

Where LayerScale Excels

Market Data

Log Analysis

Sensor Monitoring

Real-Time Analytics

Run Any Modelon Any Hardware

Anthropic/OpenAI Compatible API

Endpoints

Flash Queries

Run Anywhere

Docker Recommended

Cloud API

GPU Support

Model Support

Platforms

Frequently Asked Questions

Start Building withConstant-Time Queries

Rethinking inference.
Process the stream, not the request.

8x Faster Than Leading
Inference Engines

Fastest Inference Engine
for Agentic AI

Built for Streaming
from the Ground Up

Run Any Model
on Any Hardware

Start Building with
Constant-Time Queries