Achieving High Mixtral 8x7B Performance with NVIDIA H100 Tensor Core GPUs and TensorRT-LLM

As large language models (LLMs) continue to grow in size and complexity, so do the performance requirements for serving them quickly and cost-effectively. To deliver high LLM inference performance, an efficient parallel computing architecture and a flexible, highly optimized software stack are required. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set new LLM performance records on the industry-standard, peer-reviewed MLPerf Inference v4.0 benchmarks, demonstrating the capabilities of the NVIDIA full-stack inference platform.

Recently, LLMs based on a mixture-of-experts (MoE) architecture have emerged, offering potential advantages in model capacity, training cost, and first-token serving latency compared to LLMs employing dense architectures. For instance, the popular Mixtral 8x7B open-weights model developed by Mistral AI, which employs an MoE architecture, has shown impressive capabilities. In this post, we show how NVIDIA H100 Tensor Core GPUs, based on the NVIDIA Hopper GPU architecture, and TensorRT-LLM software deliver outstanding performance on Mixtral 8x7B.

When deploying LLMs at scale, it’s common for cloud services to set query response time targets and then seek to maximize the number of user queries that can be served in parallel within those constraints by grouping them into batches. TensorRT-LLM supports in-flight batching, which allows completed requests to be replaced with new requests during LLM serving, helping to improve performance.
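To picture how in-flight batching works, the following minimal Python sketch shows a scheduling loop that retires finished requests and admits waiting ones on every decoding step, rather than waiting for an entire batch to drain. This is a conceptual illustration only, not TensorRT-LLM’s actual scheduler; `generate_one_token`, `is_finished`, and the batch-size limit are hypothetical placeholders.

```python
from collections import deque

MAX_BATCH_SIZE = 8  # hypothetical serving limit

def serve(request_queue: deque, generate_one_token, is_finished):
    """Conceptual in-flight (continuous) batching loop.

    Each iteration generates one token for every active request,
    retires requests that have finished, and immediately admits
    new requests from the queue to keep the batch full.
    """
    active = []
    while request_queue or active:
        # Admit new requests into any free batch slots.
        while request_queue and len(active) < MAX_BATCH_SIZE:
            active.append(request_queue.popleft())

        # One decoding step for every request currently in flight.
        for req in active:
            req.tokens.append(generate_one_token(req))

        # Retire completed requests so their slots free up mid-serving,
        # instead of waiting for the slowest request in the batch.
        active = [req for req in active if not is_finished(req)]
```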

The process of selecting a response time budget requires a careful balancing of throughput and user interactivity, as increases in one translate into reductions in the other. Plots of throughput versus latency can be helpful tools in selecting the optimal deployment scenario. Often, there’s a steep part of the throughput versus latency curve, over which large improvements in throughput can be had for only small increases in response time. For production deployments, choosing latency targets within this window can yield great user experiences at relatively low deployment cost.
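As a toy illustration of reading such a curve, the sketch below picks the highest-throughput operating point that still meets a chosen latency budget. The data points are made up and simply stand in for measured (latency, throughput) pairs swept over batch size or concurrency.

```python
# Hypothetical measured operating points: (mean response latency in seconds,
# throughput in requests per second).
measurements = [
    (0.3, 12.0),
    (0.5, 24.0),   # steep region: large throughput gain for a small latency cost
    (0.7, 27.0),
    (1.2, 29.0),   # flat region: little gain for much higher latency
]

LATENCY_BUDGET_S = 0.6  # response time target for the deployment

# Choose the highest-throughput point that still meets the budget.
feasible = [(lat, tput) for lat, tput in measurements if lat <= LATENCY_BUDGET_S]
best_latency, best_throughput = max(feasible, key=lambda point: point[1])
print(f"Operate at ~{best_latency} s latency for {best_throughput} requests/s")
```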

H100 throughput vs. response latency

The chart below plots throughput, in terms of total requests processed per second, against the time to generate a response to each request, using two H100 SXM GPUs running TensorRT-LLM software in both FP16 and FP8 precision.

Figure 1. Mixtral 8x7B throughput vs. latency on two H100 GPUs running TensorRT-LLM using both FP16 precision and FP8 precision. 

Mixtral 8x7B results measured using Tensor Parallel=2, running TensorRT-LLM v0.10 and CUDA compilation tools release 12.4, v12.4.131, on two NVIDIA H100 SXM GPUs. Average input sequence length (ISL) = 573 and average output sequence length (OSL) = 50.

The NVIDIA Hopper architecture is equipped with fourth-generation Tensor Cores, which support FP8 data type at twice the peak computational rate compared to either FP16 or BF16. TensorRT-LLM software provides support for FP8 quantization, allowing you to convert model weights into FP8 and automatically use highly-tuned FP8 kernels.
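Conceptually, converting a weight tensor to FP8 (E4M3) involves choosing a per-tensor scale so that the values fit within FP8’s representable range (about ±448 for E4M3) and applying the same scale when dequantizing at compute time. The NumPy sketch below illustrates only that scaling idea; TensorRT-LLM performs FP8 quantization and calibration through its own tooling and runs the resulting matrix multiplies in tuned FP8 Tensor Core kernels.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_quantize_per_tensor(weights: np.ndarray):
    """Toy per-tensor FP8-style quantization: scale values into the E4M3 range.

    NumPy has no FP8 dtype, so the "quantized" values stay in float32 here
    purely to illustrate the scaling step.
    """
    scale = np.abs(weights).max() / E4M3_MAX
    quantized = np.clip(weights / scale, -E4M3_MAX, E4M3_MAX)
    return quantized, scale

def fp8_dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized * scale

weights = np.random.randn(4096, 4096).astype(np.float32)
q, scale = fp8_quantize_per_tensor(weights)
print(f"scale={scale:.6f}, quantized range=({q.min():.1f}, {q.max():.1f})")
```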

The performance benefits of FP8 are significant, allowing the H100 GPU to deliver nearly 50% more throughput within a response time limit of 0.5 seconds. With the additional performance provided by FP8, developers can increase throughput, and therefore decrease cost, while maintaining the same user experience. Or, for a given throughput, users can enjoy even faster response times, improving the user experience at about the same cost.

H100 throughput vs. mean time per output token

Next, we show the performance of H100 GPUs and TensorRT-LLM when running in streaming mode. In this use case, rather than waiting for the full inference request to complete and then reporting total latency, results are reported back as soon as each output token is produced. This makes it possible to record the time taken per output token rather than the time required for the entire request.
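A simple way to see how this metric is collected is to timestamp each token as it arrives from a streaming endpoint and average the gaps between consecutive tokens. The sketch below assumes a hypothetical `stream_tokens(prompt)` generator that yields output tokens as they are produced; it is not a specific TensorRT-LLM or Triton API.

```python
import time

def mean_time_per_output_token(stream_tokens, prompt: str) -> float:
    """Average time between consecutive streamed output tokens.

    `stream_tokens` is any generator-style client that yields tokens as soon
    as they are produced (a hypothetical placeholder here).
    """
    timestamps = []
    for _ in stream_tokens(prompt):
        timestamps.append(time.perf_counter())

    if len(timestamps) < 2:
        raise ValueError("need at least two tokens to measure a gap")

    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)

# For example, 0.016 s per output token corresponds to 1 / 0.016 ≈ 62 tokens/s per user.
```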

Figure 2. Mixtral 8x7B throughput vs. mean time per output token on two H100 GPUs running TensorRT-LLM using both FP16 precision and FP8 precision. 

Mixtral 8x7B results measured using Tensor Parallel=2 running TensorRT-LLM v0.10 and CUDA compilation tools, release 12.4, v12.4.131, using two NVIDIA H100 SXM GPUs. Average ISL = 573, and average OSL = 50. 

NVIDIA H100 GPUs and TensorRT-LLM software also deliver great performance in streaming mode, achieving high throughput even with very low average time per output token. At a mean time per output token of just 0.016 seconds, or more than 60 tokens per second flying across the screen for each user, a pair of H100 GPUs running TensorRT-LLM with FP8 precision achieves a high throughput of 38.4 requests per second. Once again, by using FP8, you can either improve responsiveness for a given deployment cost, or increase the number of users that can be served at a given level of responsiveness, reducing cost.

H100 throughput without latency constraints

Finally, to measure performance in latency-unconstrained scenarios, we provide inference throughput without latency constraints. While online scenarios matter most for real-time use cases, offline scenarios, like data labeling, sentiment analysis, or summarization, are a good measure of the peak achievable throughput of a platform. Below is a table of offline throughput at various batch sizes, using input and output sequence lengths of 128:

Figure 3. Mixtral 8x7B throughput at various batch sizes on two H100 SXM GPUs running TensorRT-LLM using FP16 and FP8 precision. ISL = 128, OSL = 128. 

Mixtral 8x7B results measured using Tensor Parallel=2 running TensorRT-LLM v0.10 and CUDA compilation tools, release 12.4, v12.4.131, using two NVIDIA H100 SXM GPUs. FP16 and FP8 precisions. ISL = 128, OSL = 128. 

As the batch size increases, the workload becomes increasingly compute-intensive, amplifying the benefits of the greater FP8 throughput of the Hopper architecture. Additionally, use of FP8 reduces memory footprint, allowing for even larger batches to be processed. At a batch size of 1,024, inference throughput reaches nearly 21,000 tokens/second with FP8. 
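For reference, offline throughput in tokens per second is just the total number of generated tokens divided by wall-clock time. The snippet below uses a hypothetical timing number, chosen only to be consistent with the batch-size-1,024 scenario above, to show the arithmetic.

```python
# Hypothetical offline run: one batch of 1,024 requests, 128 output tokens each.
batch_size = 1024
output_len = 128     # OSL
elapsed_s = 6.3      # hypothetical wall-clock time for the whole batch

total_output_tokens = batch_size * output_len
throughput_tok_s = total_output_tokens / elapsed_s
print(f"{throughput_tok_s:,.0f} output tokens/s")  # roughly 20,800 tokens/s here
```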

TensorRT-LLM is an open-source library for optimizing LLM inference. It provides the latest performance optimizations for the most popular LLMs, including Mixtral, through a simple Python API. These include general LLM optimizations such as optimized attention kernels, KV caching techniques, and FP8 or INT4 AWQ quantization without sacrificing accuracy. Mixtral deployed with TensorRT-LLM also benefits from techniques built specifically for MoEs, including expert parallelism (EP) and optimized expert kernels. For maximum GPU utilization and workload balancing, TensorRT-LLM additionally supports a hybrid of expert and tensor parallelism for the Mixtral MoE model. Mixtral with TensorRT-LLM can be hosted with NVIDIA Triton Inference Server software. To see a full list of supported models, features, and optimizations, or to learn more, refer to the TensorRT-LLM GitHub repository.
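As one way to get started, recent TensorRT-LLM releases expose a high-level Python LLM API that can build and run a tensor-parallel Mixtral engine directly from Hugging Face weights. The sketch below is based on that API, but exact import paths, parameter names, and defaults vary across TensorRT-LLM versions, so treat it as an assumption-laden starting point and consult the Mixtral example linked at the end of this post; expert-parallel and FP8 configurations are handled through the checkpoint-conversion and quantization tooling shown there.

```python
from tensorrt_llm import LLM, SamplingParams

# Build or load a Mixtral engine sharded across two GPUs with tensor parallelism.
# (Parameter names reflect recent TensorRT-LLM releases and may differ in older ones.)
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
)

sampling = SamplingParams(max_tokens=50, temperature=0.8)

outputs = llm.generate(
    ["Explain what a mixture-of-experts model is in one paragraph."],
    sampling,
)
for output in outputs:
    print(output.outputs[0].text)
```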

A Mixture of Experts (MoE) combines the outputs of specialized “experts,” or multiple models, into a single result. This combination of experts improves accuracy, generalization, and scalability.

Expert accuracy

Each expert is trained on a specific dataset and skill. Prompt tokens are dynamically routed, or gated, to specific experts. Training each expert on a specific dataset or subset of the data increases accuracy in that domain. For example, one expert can focus on code completion, another on mathematics, and yet another on grammar and language semantics. Experts can be made as specialized as needed.

Ensembled generalization

Combining expert knowledge helps improve generalization. Each expert can provide its own response to a prompt, contributing its own strengths and specialization. The MoE architecture weights the response of each expert based on its relevance to the prompt. Having specialized experts improves accuracy and fit, while weight-averaging the experts improves generalization. The experts’ outputs are weight-averaged to produce the final output.

Sparsity for scalability

Mixtral is a sparse MoE (SMoE) LLM. In an SMoE, only a subset of the experts is activated for each input. Activating only a few experts per token increases computational efficiency. TensorRT-LLM further reduces latency and increases throughput, as shown at the beginning of this post.

In summary, specialized experts improve accuracy, weight-averaging experts improves generalization, and sparsely selecting experts improves scalability.

To quote the Mixtral 8x7B paper:

“We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e., experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference.” 
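To make that routing concrete, below is a minimal NumPy sketch of a single sparse MoE layer with 8 feedforward experts and top-2 routing, as the quote describes. It is a toy illustration of the mechanism, not Mixtral’s actual implementation, and the layer dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS, TOP_K = 8, 2   # Mixtral-style: 8 experts, 2 active per token
D_MODEL, D_FF = 64, 256     # toy dimensions, not Mixtral's real sizes

# One tiny feedforward "expert" per slot, plus a router (gate) matrix.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens):  # tokens: (num_tokens, D_MODEL)
    logits = tokens @ router                        # routing scores per expert
    top2 = np.argsort(logits, axis=-1)[:, -TOP_K:]  # indices of the 2 chosen experts
    out = np.zeros_like(tokens)
    for t, x in enumerate(tokens):
        chosen = top2[t]
        weights = softmax(logits[t, chosen])        # renormalize over chosen experts
        for w, e in zip(weights, chosen):
            w_in, w_out = experts[e]
            out[t] += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # weighted expert output
    return out

print(moe_layer(rng.standard_normal((4, D_MODEL))).shape)  # (4, 64)
```

Because only 2 of the 8 expert feedforward blocks run per token, only a fraction of the model’s parameters are active for any given token, which is how Mixtral uses roughly 13B of its 47B parameters during inference.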

Key takeaways

NVIDIA Hopper GPUs running TensorRT-LLM deliver outstanding inference performance for the latest LLMs, including MoE models like Mixtral 8x7B. NVIDIA also continues to optimize its software stack, delivering both continuous performance gains and rapid support for the latest models, helping to minimize total cost of ownership and increase return on investment.

In addition to continuous software advances, NVIDIA is quickly innovating across silicon and systems, providing customers with even more performance. Products based on the groundbreaking NVIDIA Blackwell architecture will be available from partners later this year. GB200 NVL72, which combines 36 NVIDIA Grace CPUs with 72 NVIDIA Blackwell GPUs in a rack-scale architecture, will deliver large speedups for real-time 1.8 trillion parameter MoE LLM inference. 

Try Mixtral with NVIDIA TensorRT-LLM

Here’s a sample for trying Mixtral with TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral. It covers downloading the weights and building the engine, and it showcases TensorRT-LLM features including parallelism, normalization, quantization, and FP8 post-training quantization.

Acknowledgements

We would like to thank Bryce Long and Flora Tasse on the inference benchmarking team who contributed to this post.