Real-time data breaks every inference engine. We built LayerScale, an engine that stays fast no matter how much data you throw at it.
LayerScale introduces a new computation model. Data is processed the moment it arrives, not when you query. By the time you ask a question, the answer is already there.
Data flows continuously into the cache. Processing happens in the background, amortized across your data stream. Queries become lightweight consumers of pre-computed state.
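The pattern is easy to sketch in miniature (a toy illustration of the ingest-then-query model, not LayerScale's internals): a background consumer folds each arriving sample into pre-computed state, so a query only reads that state instead of reprocessing the stream.

```python
import queue
import threading

class StreamingState:
    """Toy ingest-then-query pattern: processing is amortized over the
    data stream, and queries read pre-computed state in O(1)."""

    def __init__(self):
        self._q = queue.Queue()
        self._lock = threading.Lock()
        self.count = 0
        self.last_close = None
        self.mean_close = 0.0
        threading.Thread(target=self._consume, daemon=True).start()

    def push(self, bar):
        # Non-blocking for the producer: just enqueue the sample.
        self._q.put(bar)

    def _consume(self):
        while True:
            bar = self._q.get()
            with self._lock:
                self.count += 1
                self.last_close = bar["c"]
                # Incremental mean: constant work per sample.
                self.mean_close += (bar["c"] - self.mean_close) / self.count
            self._q.task_done()

    def query(self):
        # Constant-time read, regardless of how much data has streamed in.
        with self._lock:
            return {"samples": self.count, "last": self.last_close,
                    "mean": round(self.mean_close, 2)}

state = StreamingState()
for c in [150.9, 151.4, 150.2]:
    state.push({"o": 0, "h": 0, "l": 0, "c": c, "v": 0})
state._q.join()  # in this demo, wait for the background worker to catch up
print(state.query())  # -> {'samples': 3, 'last': 150.2, 'mean': 150.83}
```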
Query latency depends on your question length, not your context size. Whether you have 100 data points or 10,000, query time stays the same.
Unlike sparse attention or state-space models that sacrifice model capacity for speed, LayerScale maintains full quadratic self-attention. No quality compromises.
Benchmarked on streaming OHLCV data (155 to 1,200 samples, ~2,600 to 19,000 tokens). Local engines use Meta-Llama-3.1-8B-Instruct on NVIDIA L40S GPU. Cloud APIs called via their respective endpoints.
AI agents call tools dozens of times per task. Every call forces conventional engines to re-process the full conversation from scratch. LayerScale persists prior computation across calls, so each agent's state stays resident on the GPU. The result: constant latency no matter how deep the conversation gets.
Per-call latency across 45 interleaved multi-agent tool calls (3 iterations × 3 agents × 5 turns), Llama-3.1-8B-Instruct on NVIDIA L40S GPU. LayerScale maintains constant 66ms latency while llama.cpp climbs from 89ms to 130ms as context grows.
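The shape of those two curves falls out of a simple cost model (an assumption for illustration, not the benchmark harness): without a persistent cache, each call re-processes the whole conversation, so per-call work grows with context; with state resident on the GPU, each call only pays for the new tokens.

```python
def tokens_processed(turns, tokens_per_turn, cache_persists):
    """Toy cost model: per-call token work with and without a
    persistent cache across tool calls."""
    per_call, context = [], 0
    for _ in range(turns):
        context += tokens_per_turn
        # Re-prefill everything, or only the newly arrived turn.
        per_call.append(tokens_per_turn if cache_persists else context)
    return per_call

no_cache = tokens_processed(5, 100, cache_persists=False)
cached = tokens_processed(5, 100, cache_persists=True)
print(no_cache)  # [100, 200, 300, 400, 500] -> work grows with depth
print(cached)    # [100, 100, 100, 100, 100] -> work stays constant
```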
LayerScale manages context through intelligent regions so each data pattern gets the right treatment. A lock-free ingestion pipeline ensures data pushes never block. O(1) tokenization via pre-computed lookup tables eliminates overhead on the critical path, and fast-path detection returns single-token classification answers immediately.
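The lookup-table idea can be sketched like so (a hypothetical miniature, with an invented vocabulary and labels, not the production tokenizer): every value the hot path can emit is tokenized once ahead of time, so classification resolves to a single dict lookup with no model call.

```python
# Sketch of O(1) tokenization via a pre-computed table: the one-time
# build cost is paid off the critical path, so the hot path is a lookup.

def build_table(vocab):
    # One-time cost: assign every known string a token id eagerly.
    return {s: i for i, s in enumerate(sorted(vocab))}

# Hypothetical vocabulary for illustration.
TABLE = build_table({"o", "h", "l", "c", "v", "up", "down", "flat"})

def classify(bar):
    # Fast path: a single-token answer resolved without invoking the model.
    if bar["c"] > bar["o"]:
        label = "up"
    elif bar["c"] < bar["o"]:
        label = "down"
    else:
        label = "flat"
    return TABLE[label]

print(classify({"o": 150.25, "c": 150.90}) == TABLE["up"])  # -> True
```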
Real-time analysis over continuous OHLCV feeds with sub-50ms responses.
Stream logs continuously and query for anomalies without reprocessing.
IoT data processed in real-time with instant analytical queries.
Instant answers over accumulated context as data streams in.
Any open-weight transformer model, optimized and production-ready across all major accelerator platforms.
Drop-in replacement for existing workflows. Standard endpoints with streaming support, plus specialized session APIs for continuous data injection. Python and TypeScript client libraries available.
Streaming endpoints also available via WebSockets, TCP sockets, and Server-Sent Events (SSE) for low-latency persistent connections.
# Initialize a streaming session
curl -X POST https://api.layerscale.ai/v1/sessions/init \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a financial analyst..."}'

# Push streaming data (non-blocking)
curl -X POST https://api.layerscale.ai/v1/sessions/{id}/stream/push \
  -H "Content-Type: application/json" \
  -d '{"data": [{"o": 150.25, "h": 151.00, "l": 149.80, "c": 150.90, "v": 100000}]}'

# Query with constant latency
curl -X POST https://api.layerscale.ai/v1/sessions/{id}/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the current trend?"}'
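The same flow from Python can be sketched as plain request builders (the endpoints mirror the curl examples above; the session id "abc123" and the `as_curl` helper are illustrative, and actual sending is left to your HTTP client of choice):

```python
import json

BASE = "https://api.layerscale.ai/v1"

def init_session(prompt):
    # Mirrors the curl call above; returns (method, url, body) to issue.
    return ("POST", f"{BASE}/sessions/init", {"prompt": prompt})

def push_bars(session_id, bars):
    return ("POST", f"{BASE}/sessions/{session_id}/stream/push", {"data": bars})

def generate(session_id, prompt):
    return ("POST", f"{BASE}/sessions/{session_id}/generate", {"prompt": prompt})

def as_curl(req):
    # Render a request as its equivalent curl command, handy for debugging.
    method, url, body = req
    return (f"curl -X {method} {url} "
            f"-H \"Content-Type: application/json\" -d '{json.dumps(body)}'")

print(as_curl(generate("abc123", "What is the current trend?")))
```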
Define your questions upfront and get pre-computed answers after every data update, streamed back via SSE with zero GPU work. Learn more →
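On the consumer side, those pre-computed answers arrive as standard Server-Sent Events. A minimal parser for that wire format (the `answer` event shape below is a hypothetical example, not a documented LayerScale payload):

```python
import json

def parse_sse(stream_text):
    """Minimal SSE parser: events are blank-line-separated blocks of
    'field: value' lines; we keep each event's name and data."""
    events = []
    name, data = None, []
    for line in stream_text.splitlines() + [""]:
        if line == "":
            if data:  # a blank line terminates the current event
                events.append((name or "message", "\n".join(data)))
            name, data = None, []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    return events

# Hypothetical event emitted after a data update.
sample = (
    "event: answer\n"
    'data: {"question": "trend?", "answer": "uptrend"}\n'
    "\n"
)
for name, payload in parse_sse(sample):
    print(name, json.loads(payload)["answer"])  # -> answer uptrend
```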
docker run --gpus all \
  -v /models:/models -p 8080:8080 \
  layerscale-server:latest \
  --model /models/your-model
# Use the hosted API directly
curl -X POST https://api.layerscale.ai/v1/sessions/init \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a financial analyst..."}'
NVIDIA T4, RTX 20xx+, L4, L40S, A10, A100, H100, B200, B300, MI455X. Supports CUDA, ROCm, and Metal.
Any open-weight model. Llama, Mistral, Qwen, Gemma, and more work out of the box.
Linux recommended for production. macOS supported for development.