Achieving Top Inference Performance with the NVIDIA H100 Tensor Core GPU and NVIDIA TensorRT-LLM

Best-in-class AI performance requires an efficient parallel computing architecture, a productive tool stack, and deeply optimized algorithms. NVIDIA released the open-source NVIDIA TensorRT-LLM, which includes the latest kernel optimizations for the NVIDIA Hopper architecture at the heart of the NVIDIA H100 Tensor Core GPU. These optimizations enable models like Llama 2 70B to execute using accelerated FP8 operations on H100 GPUs while maintaining inference accuracy.

At a recent launch event, AMD talked about the inference performance of the H100 GPU compared to that of its MI300X chip. The results shared did not use optimized software, and the H100, if benchmarked properly, is 2x faster.

The following is the measured performance of a single NVIDIA DGX H100 server with eight NVIDIA H100 GPUs on the Llama 2 70B model. It includes results for “Batch 1,” where inference requests are processed one at a time, as well as results using fixed response-time processing.

Figure 1. Llama 2 70B server inference performance in queries per second with 2,048 input tokens and 128 output tokens for “Batch 1” and various fixed response time settings

AMD’s implied claims for H100 are based on the configuration taken from AMD launch presentation footnote #MI300-38: vLLM v.02.2.2 inference software on an NVIDIA DGX H100 system, running a Llama 2 70B query with an input sequence length of 2,048 and an output sequence length of 128. AMD claimed the performance of a system with eight MI300X GPUs relative to DGX H100.

The NVIDIA data was measured on a DGX H100 with eight NVIDIA H100 Tensor Core GPUs, each with 80 GB of HBM3, using publicly available NVIDIA TensorRT-LLM: v0.5.0 for batch 1 and v0.6.1 for the latency threshold measurements. The workload details are the same as in footnote #MI300-38.

DGX H100 can process a single inference in 1.7 seconds using a batch size of one, that is, one inference request at a time. A batch size of one results in the fastest possible response time for serving a model. To optimize both response time and data center throughput, cloud services set a fixed response time for a particular service. This enables them to combine multiple inference requests into larger “batches” and increase the overall inferences per second of the server. Industry-standard benchmarks like MLPerf also measure performance with this fixed response time metric.

Small tradeoffs in response time can multiply the number of inference requests that a server can process in real time. Using a fixed 2.5-second response time budget, an 8-GPU DGX H100 server can process over five Llama 2 70B inferences per second, compared to less than one per second with a batch size of one.
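The relationship between batch size, response time, and throughput can be checked with simple arithmetic. The following Python sketch is illustrative only. It assumes that all requests in a batch complete together within the stated response time, and it assumes that the batch size of 14 used in the TensorRT-LLM build command later in this post corresponds to the 2.5-second budget. The function name `throughput_qps` is hypothetical.

```python
def throughput_qps(batch_size: int, latency_s: float) -> float:
    """Requests completed per second, assuming the whole batch
    finishes together after latency_s seconds (a simplification)."""
    if batch_size <= 0 or latency_s <= 0:
        raise ValueError("batch_size and latency_s must be positive")
    return batch_size / latency_s


# Batch size of one at 1.7 seconds per request (figure from this post).
batch1_qps = throughput_qps(1, 1.7)

# Assumed: 14 requests served together within the 2.5-second budget.
batched_qps = throughput_qps(14, 2.5)

print(f"batch 1: {batch1_qps:.2f} inferences/s")        # 0.59
print(f"batched: {batched_qps:.2f} inferences/s")       # 5.60
print(f"gain:    {batched_qps / batch1_qps:.1f}x")      # 9.5x
```

Under these assumptions, the result matches the figures quoted above: less than one inference per second at batch size one, and over five per second with the 2.5-second budget.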

AI is moving fast and the NVIDIA CUDA ecosystem enables us to optimize our stack quickly and continuously. We look forward to continuing to improve AI performance with every update of our software, so be sure to check out our performance pages and GitHub sites for the latest.

How to reproduce these AI inference results

The “DGX H100 AMD Footnote” result was measured by NVIDIA with vLLM, using the configuration that AMD provided in its footnotes and the benchmarking script that ships with vLLM, with the following command line:

$ python benchmarks/benchmark_latency.py --model "meta-llama/Llama-2-70b-hf" --input-len 2048 --output-len 128 --batch-size 1 -tp 8

The “MI300X 8-Chip System” result is inferred by applying AMD’s claimed speedup to the “DGX H100 AMD Footnote” result measured with vLLM.

The “DGX H100 Measured” result was measured by NVIDIA using publicly available versions of TensorRT-LLM from GitHub and the command lines outlined in the TensorRT-LLM benchmarking guide for Llama 2.

# Build TensorRT-optimized Llama-2-70b for H100 FP8 Tensor Cores
$ python examples/llama/build.py --remove_input_padding --enable_context_fmha --parallel_build --output_dir DTYPE.float16_TP.8_BS.14_ISL.2048_OSL.128 --dtype float16 --use_gpt_attention_plugin float16 --world_size 8 --tp_size 8 --pp_size 1 --max_batch_size 14 --max_input_len 2048 --max_output_len 128 --enable_fp8 --fp8_kv_cache --strongly_typed --n_head 64 --n_kv_head 8 --n_embd 8192 --inter_size 28672 --vocab_size 32000 --n_positions 4096 --hidden_act silu --ffn_dim_multiplier 1.3 --multiple_of 4096 --n_layer 80

# Benchmark Llama-70B
$ mpirun -n 8 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --model llama_70b --engine_dir DTYPE.float16_TP.8_BS.14_ISL.2048_OSL.128 --warm_up 1 --batch_size 14 --duration 0 --num_runs 5 --input_output_len "2048,1;2048,128"
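The benchmark command packs several (input length, output length) pairs into a single semicolon-separated string, and the engine directory name encodes the data type, tensor parallelism, batch size, and sequence lengths. As a rough illustration only, the following Python sketch parses that string and rebuilds the directory name. The helper names `parse_io_lens` and `engine_dir_name` are hypothetical and are not part of TensorRT-LLM, and the naming pattern is inferred from the commands in this post rather than from any documented convention.

```python
def parse_io_lens(spec: str) -> list[tuple[int, int]]:
    """Split 'in,out;in,out' into a list of (input_len, output_len) pairs."""
    pairs = []
    for item in spec.split(";"):
        input_len, output_len = item.split(",")
        pairs.append((int(input_len), int(output_len)))
    return pairs


def engine_dir_name(dtype: str, tp: int, batch_size: int,
                    input_len: int, output_len: int) -> str:
    """Rebuild the directory name pattern seen in the commands above
    (an inferred pattern, not an official one)."""
    return (f"DTYPE.{dtype}_TP.{tp}_BS.{batch_size}"
            f"_ISL.{input_len}_OSL.{output_len}")


print(parse_io_lens("2048,1;2048,128"))
# [(2048, 1), (2048, 128)]
print(engine_dir_name("float16", 8, 14, 2048, 128))
# DTYPE.float16_TP.8_BS.14_ISL.2048_OSL.128
```

Generating the directory name from the same values passed to the build step is one way to keep the build and benchmark commands pointing at the same engine when sweeping batch sizes.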