The fastest inference engine for agents and streams.
Stateful inference for the next generation of AI.
Rethinking inference.
Process the stream, not the request.
LayerScale introduces a new computation model. Data is processed the moment it arrives, not when you query. By the time you ask a question, the answer is already there.
Data-Driven Processing
Data flows continuously into the cache. Processing happens in the background, amortized across your data stream. Queries become lightweight consumers of pre-computed state.
Constant Query Latency
Query latency depends on your question length, not your context size. Whether you have 100 data points or 10,000, query time stays the same.
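The pattern behind these two properties can be shown with a toy sketch in plain Python (illustrative only, not LayerScale's actual cache): state is updated incrementally on each push, so a query just reads finished state instead of re-scanning history.

```python
# Toy illustration of stream-time processing: work happens on push,
# so query cost is independent of how much data has accumulated.
# A sketch of the general pattern, not LayerScale's implementation.

class StreamingSession:
    def __init__(self):
        # Pre-computed state, updated incrementally as data arrives.
        self.count = 0
        self.total = 0.0

    def push(self, bar):
        # O(1) per data point: processing amortized across the stream.
        self.count += 1
        self.total += bar["c"]

    def query_average(self):
        # O(1): the answer is already there; no reprocessing.
        return self.total / self.count

session = StreamingSession()
for close in [150.9, 151.4, 150.2, 152.0]:
    session.push({"c": close})

print(session.query_average())  # same cost for 4 bars or 4 million
```

The query does the same constant amount of work whether the session has seen 100 data points or 10,000, which is the property described above.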
Full Attention Preserved
Unlike sparse attention or state-space models that sacrifice model capacity for speed, LayerScale maintains full quadratic self-attention. No quality compromises.
8x Faster Than Leading Inference Engines
Benchmarked on streaming OHLCV data (155 to 1,200 samples, ~2,600 to 19,000 tokens). Local engines use Meta-Llama-3.1-8B-Instruct on NVIDIA L40S GPU. Cloud APIs called via their respective endpoints.
Fastest Inference Engine for Agentic AI
AI coding agents make dozens of tool calls per task: reading files, writing code, running tests. We benchmarked a 35-turn React app build from scratch against four production engines. LayerScale completes the full session in 42.5 seconds, 1.4x faster than vLLM and SGLang, by preserving computation across turns instead of reprocessing from scratch.
Cumulative wall time for a 35-turn agentic coding session (React app from scratch), Llama-3.1-8B-Instruct on NVIDIA L40S GPU (10 iterations). Lower is faster. LayerScale completes in 42.5s vs 60.3s for vLLM, 62.1s for SGLang, 63.0s for llama.cpp, and 69.3s for TensorRT-LLM.
Built for Streaming from the Ground Up
LayerScale manages context through intelligent regions so each data pattern gets the right treatment. A lock-free ingestion pipeline ensures data pushes never block. O(1) tokenization via pre-computed lookup tables eliminates overhead on the critical path, and fast-path detection returns single-token classification answers immediately.
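The O(1)-tokenization idea can be sketched as a table, built offline, that maps fixed-format field strings straight to token IDs, so the hot path is a single dictionary lookup. Everything below is invented for illustration; the real table would be built from the model's tokenizer.

```python
# Sketch of O(1) tokenization via a pre-computed lookup table.
# The "tokenizer" and token IDs here are stand-ins for illustration.

def fake_tokenize(s):
    # Stand-in for a real (comparatively slow) tokenizer call.
    return [ord(c) for c in s]

def build_table(values):
    # Offline: enumerate strings expected on the hot path
    # (e.g., two-decimal prices in a range) and tokenize them once.
    return {v: fake_tokenize(v) for v in values}

PRICE_TABLE = build_table(f"{p / 100:.2f}" for p in range(14000, 16000))

def tokenize_price(s):
    # Hot path: one dict lookup, zero tokenizer work.
    hit = PRICE_TABLE.get(s)
    return hit if hit is not None else fake_tokenize(s)  # slow fallback

print(tokenize_price("150.25") == fake_tokenize("150.25"))  # True
```

The trade is memory for latency: the table is paid for once, off the critical path, and every lookup afterward is constant time.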
Where LayerScale Excels
Market Data
Real-time analysis over continuous OHLCV feeds with sub-50ms responses.
Log Analysis
Stream logs continuously and query for anomalies without reprocessing.
Sensor Monitoring
IoT data processed in real-time with instant analytical queries.
Real-Time Analytics
Instant answers over accumulated context as data streams in.
Run Any Model on Any Hardware
Any open-weight transformer model, optimized and production-ready across all major accelerator platforms.
Anthropic/OpenAI Compatible API
Drop-in replacement for existing workflows. Standard endpoints with streaming support, plus specialized session APIs for continuous data injection. Python and TypeScript client libraries available. Full API reference
Endpoints
- POST /v1/chat/completions OpenAI-compatible
- POST /v1/messages Anthropic-compatible
- POST /v1/sessions/init Create session
- POST /v1/sessions/{id}/stream/push Push data
- POST /v1/sessions/{id}/generate O(1) query
- GET /v1/sessions/{id}/stream/status Stream stats
Streaming endpoints also available via WebSockets, TCP sockets, and Server-Sent Events (SSE) for low-latency persistent connections.
# Initialize a streaming session
curl -X POST https://api.layerscale.ai/v1/sessions/init \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a financial analyst..."}'

# Push streaming data (non-blocking)
curl -X POST https://api.layerscale.ai/v1/sessions/{id}/stream/push \
  -H "Content-Type: application/json" \
  -d '{"data": [{"o": 150.25, "h": 151.00, "l": 149.80, "c": 150.90, "v": 100000}]}'

# Query with constant latency
curl -X POST https://api.layerscale.ai/v1/sessions/{id}/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the current trend?"}'
Flash Queries
Define your questions upfront and get pre-computed answers after every data update, streamed back via SSE with zero GPU work. Learn more →
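The mechanic can be modeled with a small toy (names and structure are illustrative, not the Flash Queries API): questions are registered once, answers are refreshed on every data update, and reads become pure lookups.

```python
# Toy model of the Flash Queries idea: register questions upfront,
# refresh their answers on push, serve reads with zero recomputation.
# This sketch is illustrative; it is not LayerScale's API or engine.

class FlashQueries:
    def __init__(self):
        self.queries = {}   # name -> function over accumulated data
        self.answers = {}   # name -> latest pre-computed answer
        self.data = []

    def register(self, name, fn):
        self.queries[name] = fn

    def push(self, point):
        self.data.append(point)
        # Answers are refreshed at update time, not at read time.
        for name, fn in self.queries.items():
            self.answers[name] = fn(self.data)

    def read(self, name):
        # Read path: a lookup of the pre-computed answer.
        return self.answers[name]

fq = FlashQueries()
fq.register("max_close", lambda d: max(p["c"] for p in d))
fq.push({"c": 150.9})
fq.push({"c": 152.0})
print(fq.read("max_close"))  # 152.0
```

In the real product the refreshed answers are streamed back over SSE as described above; the toy only shows the update-on-push, lookup-on-read split.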
Run Anywhere
Docker Recommended
docker run --gpus all -p 8080:8080 \
  -e HF_TOKEN=$HF_TOKEN \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --license-key $LAYERSCALE_LICENSE_KEY
Don't have a key yet? Get a license key
Cloud API
# Use the hosted API directly
curl -X POST https://api.layerscale.ai/v1/sessions/init \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a financial analyst..."}'
GPU Support
NVIDIA T4, RTX 20xx+, L4, L40S, A10, A100, H100, B200, and B300; AMD MI455X. Supports CUDA, ROCm, and Metal.
Model Support
Any open-weight model. Llama, Mistral, Qwen, Gemma, and more work out of the box.
Platforms
Linux recommended for production. macOS supported for development.