Power Your AI Projects with New NVIDIA NIMs for Mistral and Mixtral Models

Large language models (LLMs) are growing in adoption across enterprise organizations, with many building them into their AI applications. Foundation models are powerful starting points, but they require work to make production-ready. NVIDIA NIM simplifies this process, enabling organizations to run AI models anywhere: across the data center, cloud, workstations, and PCs.

Designed for enterprises, NIMs provide a comprehensive suite of prebuilt, cloud-native microservices that integrate into existing infrastructure effortlessly. These microservices are maintained and continuously updated, providing out-of-the-box performance and ensuring access to the latest advancements in AI inference technology.

New NVIDIA NIMs for LLMs

The growth of foundation models is due to their ability to cater to various enterprise needs. However, no single model can fulfill all of an organization's requirements, and it's common for enterprises to use different foundation models across their use cases, based on specific data needs and AI application workflows.

Recognizing the diverse needs of enterprises, we’ve expanded our NIM offerings to include Mistral-7B, Mixtral-8x7B, and Mixtral-8x22B. Each of these foundation models excels at different tasks.

Figure 1. The new Mixtral 8x7B Instruct NIM, available from the NVIDIA API catalog

Mistral 7B NIM

The Mistral 7B Instruct model excels at text generation and language-understanding tasks and fits on a single GPU, making it a strong fit for applications such as language translation, content generation, and chatbots. When deploying the Mistral 7B NIM on NVIDIA H100 data center GPUs, developers can achieve up to 2.3x higher throughput (tokens per second) for content generation out of the box, compared to deploying the model without NIM.

Figure 2. Mistral 7B NIM shows improved throughput for content generation

Benchmark: input 500 tokens, output 2,000 tokens, on 1x H100. NIM ON (FP8): throughput 5,697 tokens/s, TTFT 0.6 s, ITL 26 ms. NIM OFF (FP16): throughput 2,529 tokens/s, TTFT 1.4 s, ITL 60 ms.
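
For a sense of what integration looks like, here’s a minimal sketch of calling the Mistral 7B NIM through its OpenAI-compatible chat completions API. The endpoint and model identifier below are illustrative assumptions; check the NVIDIA API catalog for the exact values and supply your own API key.

```python
# Minimal sketch: calling a Mistral 7B Instruct NIM through its
# OpenAI-compatible chat completions API. The base_url and model name
# are illustrative assumptions -- confirm both in the NVIDIA API catalog.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed API catalog endpoint
    api_key="YOUR_NVIDIA_API_KEY",                   # replace with your key
)

response = client.chat.completions.create(
    model="mistralai/mistral-7b-instruct-v0.2",      # assumed model identifier
    messages=[
        {"role": "user", "content": "Draft a short product description for a smart thermostat."}
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI-compatible, existing applications built on the OpenAI client can switch to a NIM endpoint by changing only the base URL and model name.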

Mixtral-8x7B and Mixtral-8x22B NIMs

The Mixtral-8x7B and Mixtral-8x22B models use a Mixture of Experts (MoE) architecture, providing fast and cost-effective inference. These models excel at tasks such as summarization, question answering, and code generation, and are ideal for applications demanding real-time responses.

NIM delivers out-of-the-box optimized performance for both models. When used for content generation, the Mixtral-8x7B NIM achieves up to 4.1x higher throughput on four H100 GPUs. The Mixtral-8x22B NIM achieves up to 2.9x higher throughput on eight H100 GPUs for content generation and translation use cases.

Figure 3. Mixtral 8x7B NIM shows improved throughput for content generation

Benchmark: input 500 tokens, output 2,000 tokens, 200 concurrent requests. NIM ON (FP8): throughput 9,410 tokens/s, TTFT 740 ms, ITL 21 ms. NIM OFF (FP16): throughput 2,300 tokens/s, TTFT 1,321 ms, ITL 86 ms.

Figure 4. Mixtral 8x22B NIM shows improved throughput for content generation and translation

Benchmark: input 1,000 tokens, output 1,000 tokens, 250 concurrent requests. NIM ON: throughput 6,070 tokens/s, TTFT 3 s, ITL 38 ms. NIM OFF: throughput 2,067 tokens/s, TTFT 5 s, ITL 116 ms.
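
These multipliers follow directly from the benchmark figures: 9,410 / 2,300 ≈ 4.1x for Mixtral-8x7B and 6,070 / 2,067 ≈ 2.9x for Mixtral-8x22B. Because low inter-token latency is what makes these models suitable for real-time applications, streaming is the natural way to consume them. A minimal sketch, again assuming the OpenAI-compatible interface and an illustrative model identifier:

```python
# Minimal streaming sketch against a Mixtral 8x7B Instruct NIM.
# Assumes the OpenAI-compatible interface that NIM exposes; the model
# identifier is illustrative -- confirm it in the NVIDIA API catalog.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
    api_key="YOUR_NVIDIA_API_KEY",
)

stream = client.chat.completions.create(
    model="mistralai/mixtral-8x7b-instruct-v0.1",    # assumed model identifier
    messages=[
        {"role": "user", "content": "Summarize the benefits of MoE models in two sentences."}
    ],
    max_tokens=200,
    stream=True,  # receive tokens as they are generated (this is where low ITL pays off)
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```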

Accelerating AI application deployments with NVIDIA NIM

Developers can use NIM to shorten the time it takes to build AI applications for production deployments, enhance AI inference efficiency, and reduce operational costs. With NIM, optimized AI models are containerized, giving developers the following benefits:

Performance and scale

These cloud-native microservices provide low-latency, high-throughput AI inference that scales easily, delivering up to 5x higher throughput with the Llama 3 70B NIM. NIMs also support fine-tuned models, so teams can achieve higher accuracy for their use cases without starting from scratch.

Ease of use

Accelerate time to market with streamlined integration into existing systems and optimized performance on NVIDIA-accelerated infrastructure. With APIs and tools designed for enterprise use, developers can maximize their AI capabilities.
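
As a concrete illustration of that integration story, a self-hosted NIM can be treated as an ordinary HTTP service. The sketch below assumes the container is reachable at localhost:8000 and serves the OpenAI-compatible /v1/chat/completions route; adjust the host, port, and model name to match your deployment.

```python
# Minimal sketch: treating a self-hosted NIM as a drop-in HTTP service.
# Assumes the microservice is reachable at localhost:8000 and serves the
# OpenAI-compatible /v1/chat/completions route; adjust host, port, and
# model name (illustrative here) to match your deployment.
import requests

payload = {
    "model": "mistralai/mistral-7b-instruct-v0.2",  # assumed model identifier
    "messages": [{"role": "user", "content": "Translate 'good morning' to French."}],
    "max_tokens": 64,
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```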

Security and manageability

Ensure robust control and security for your AI applications and data. Through NVIDIA AI Enterprise, NIM supports flexible, self-hosted deployments on any infrastructure, offering enterprise-grade software, rigorous validation, and direct access to NVIDIA AI experts. 

The future of AI inference: NVIDIA NIMs and beyond

NVIDIA NIM represents a major advancement in AI inference. As the need for AI-powered applications grows across various industries, deploying these applications efficiently becomes crucial. Enterprises looking to tap into the transformative power of AI can use NVIDIA NIM to easily incorporate prebuilt, cloud-native microservices into their existing systems. This enables them to speed up their product launches, keeping them ahead in innovation.

The future of AI inference goes beyond individual NVIDIA NIMs. As demand for advanced AI applications increases, linking multiple NVIDIA NIMs will be crucial. This network of microservices will enable smarter applications that can work together and adapt to various tasks, transforming how we use technology. To deploy NIM inference microservices on your infrastructure, check out A Simple Guide to Deploying Generative AI with NVIDIA NIM.
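
Before routing production traffic to a freshly deployed NIM, it’s worth gating on a readiness check. The sketch below polls a /v1/health/ready route, which is an assumption based on NIM’s documented health checks; verify the exact path in the deployment guide linked above.

```python
# Minimal readiness-gate sketch for a self-hosted NIM. The
# /v1/health/ready route is an assumption based on NIM's documented
# health checks -- verify the exact path in the deployment guide.
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8000", timeout_s: int = 300) -> bool:
    """Poll the NIM health endpoint until the model server reports ready."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/v1/health/ready", timeout=5)
            if r.status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # the container may still be starting up
        time.sleep(5)
    return False

if wait_until_ready():
    print("NIM is ready to serve requests")
```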

NVIDIA regularly releases new NIMs providing organizations with the most powerful AI models to power their enterprise applications. Visit the API catalog for the latest NVIDIA NIM for LLMs, vision, retrieval, 3D, and digital biology models.