As large language models (LLMs) continue to grow in size and complexity, the performance requirements for serving them quickly and cost-effectively grow as well. Delivering high LLM inference performance requires both an efficient parallel computing architecture and a flexible, highly optimized software stack. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set…