
A forecasting-based perception layer for energy-aware resource management in LLM serving deployed on high-performance computing clusters

  • University of Deusto

Research output: Contribution to journal › Article › peer review

Abstract

The deployment of Large Language Models (LLMs) in multi-Graphics Processing Unit (GPU) environments faces significant challenges regarding energy consumption and load distribution. While most research focuses on optimizing inference throughput, there is a critical lack of frameworks bridging fine-grained telemetry with proactive, energy-aware load balancing. This paper presents a modular, forecasting-driven perception layer that leverages near-real-time GPU power telemetry to enable optimized workload allocation. Using fine-grained telemetry from an operational High-Performance Computing (HPC) cluster, we evaluate state-of-the-art time-series architectures, including Spiking Neural Networks (SNN), Recurrent Neural Networks (RNN), Transformers, and Structured State Space Models (SSSM). These models are assessed across operational horizons of 30 s for near-instantaneous balancing and 1 min for near-future system stability. Our results demonstrate that the Gated Recurrent Unit (GRU) achieves superior performance, with a Mean Absolute Error (MAE) of 7.97 W for the 30 s window and 9.7 W for the 1 min window. By establishing a validated forecasting backbone, this approach provides a plug-and-play forecasting component that can be integrated into Deep Reinforcement Learning (DRL) or heuristic schedulers, offering a scalable solution to improve the sustainability and efficiency of large-scale LLM serving.
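As a minimal illustration of the forecasting backbone described above, the sketch below implements a single GRU cell over a window of scalar power samples and the MAE metric used for evaluation. This is a hypothetical NumPy sketch, not the paper's implementation: the names `GRUCell`, `forecast_next`, and `mae`, the hidden size, and the untrained weights are all assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Single GRU cell stepped over scalar GPU power-telemetry samples."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        def w(*shape):
            return 0.1 * rng.standard_normal(shape)
        self.n_hid = n_hid
        self.Wz, self.Uz, self.bz = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)
        self.Wr, self.Ur, self.br = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)
        self.Wn, self.Un, self.bn = w(n_hid, n_in), w(n_hid, n_hid), np.zeros(n_hid)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)        # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)        # reset gate
        n = np.tanh(self.Wn @ x + self.Un @ (r * h) + self.bn)  # candidate state
        return (1.0 - z) * n + z * h                            # new hidden state

def forecast_next(cell, W_out, b_out, window):
    """Run the GRU over a telemetry window; emit one power forecast in watts."""
    h = np.zeros(cell.n_hid)
    for sample in window:
        h = cell.step(np.atleast_1d(sample), h)
    return float(W_out @ h + b_out)

def mae(y_true, y_pred):
    """Mean Absolute Error, the metric reported in the abstract."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Untrained example on six hypothetical 5 s power samples (watts).
cell = GRUCell(n_in=1, n_hid=8)
W_out, b_out = np.ones(8) / 8.0, 250.0
pred = forecast_next(cell, W_out, b_out, [240.0, 245.0, 250.0, 255.0, 252.0, 248.0])
```

In a deployed perception layer the cell and readout would be trained on historical telemetry; the point here is only the shape of the component: a window of recent power samples in, one scalar forecast out, scored by MAE against the observed reading.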

Original language: English
Article number: 101333
Journal: Sustainable Computing: Informatics and Systems
Volume: 50
DOI
Status: Published - Jun 2026
