Pluggable eviction policies with predictable performance characteristics. Unified builder API plus direct access for policy-specific operations. Optional metrics and benchmarks to validate trade-offs.
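To make the idea of pluggable eviction policies behind a unified builder more concrete, here is a minimal Python sketch. It is not this library's API: every name in it (`EvictionPolicy`, `LruPolicy`, `FifoPolicy`, `Cache`, `CacheBuilder`) is hypothetical and chosen only for illustration. It assumes a policy needs just four hooks (insert, access, remove, choose a victim), which may be simpler than what a real implementation requires.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional, Protocol


class EvictionPolicy(Protocol):
    """Hypothetical policy interface: the cache reports events, the policy picks victims."""

    def record_insert(self, key: Hashable) -> None: ...
    def record_access(self, key: Hashable) -> None: ...
    def record_remove(self, key: Hashable) -> None: ...
    def choose_victim(self) -> Hashable: ...


class FifoPolicy:
    """Evicts the oldest inserted key; accesses do not change the order."""

    def __init__(self) -> None:
        self._order: "OrderedDict[Hashable, None]" = OrderedDict()

    def record_insert(self, key: Hashable) -> None:
        self._order[key] = None

    def record_access(self, key: Hashable) -> None:
        pass  # FIFO ignores reads

    def record_remove(self, key: Hashable) -> None:
        self._order.pop(key, None)

    def choose_victim(self) -> Hashable:
        return next(iter(self._order))


class LruPolicy(FifoPolicy):
    """Evicts the least recently used key; an access moves the key to the back."""

    def record_access(self, key: Hashable) -> None:
        self._order.move_to_end(key)


class Cache:
    """Fixed-capacity cache that delegates every eviction decision to its policy."""

    def __init__(self, capacity: int, policy: EvictionPolicy) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.policy = policy  # direct access for policy-specific operations
        self._data: dict = {}

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self.policy.record_access(key)
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._data:
            # Overwrite in place and count it as an access.
            self._data[key] = value
            self.policy.record_access(key)
            return
        if len(self._data) >= self.capacity:
            victim = self.policy.choose_victim()
            self.policy.record_remove(victim)
            del self._data[victim]
        self._data[key] = value
        self.policy.record_insert(key)


class CacheBuilder:
    """Hypothetical unified builder: the same calls work for any policy."""

    def __init__(self) -> None:
        self._capacity = 128
        self._policy: EvictionPolicy = LruPolicy()

    def capacity(self, n: int) -> "CacheBuilder":
        self._capacity = n
        return self

    def policy(self, p: EvictionPolicy) -> "CacheBuilder":
        self._policy = p
        return self

    def build(self) -> Cache:
        return Cache(self._capacity, self._policy)


if __name__ == "__main__":
    lru = CacheBuilder().capacity(2).policy(LruPolicy()).build()
    lru.put("a", 1)
    lru.put("b", 2)
    lru.get("a")      # "a" becomes the most recently used key
    lru.put("c", 3)   # cache is full, so the policy evicts "b"
    print(lru.get("a"), lru.get("b"), lru.get("c"))
```

In this sketch, swapping `LruPolicy()` for `FifoPolicy()` changes which key is evicted without touching the cache or builder code, which is the trade-off surface that optional metrics and benchmarks would then measure.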
This repository collects papers on system-aware, serving-time, KV-centric optimization methods that improve system metrics without retraining or architecture modification (which we call this ...
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel ...
Abstract: Solid-state drives (SSDs) are widely deployed in systems where device lifetime and operational efficiency directly affect the system cost and sustainability. A key lever for optimization is ...
Abstract: The cloudification of fifth-generation (5G) networks enables third parties to deploy their applications (e.g., edge caching and edge computing) at the network edge. Many previous works have ...
At CES 2026, Nvidia announced that the Nvidia BlueField-4 data processor, part of the full-stack Nvidia BlueField platform, powers the Nvidia Inference Context Memory Storage Platform, a new class of ...
NVIDIA BlueField-4 powers NVIDIA Inference Context Memory Storage Platform, a new kind of AI-native storage infrastructure designed for gigascale inference, to accelerate and scale agentic AI. The new ...