As large language models (LLMs) continue to grow in size and complexity, the performance requirements for serving them quickly and cost-effectively grow as well. Delivering high LLM inference performance requires both an efficient parallel computing architecture and a flexible, highly optimized software stack. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set…