Since the introduction and subsequent wide adoption of Large Language Models (LLMs), data has been the lifeblood of businesses building accurate and safe AI systems. A company's data represents its cumulative knowledge and can be leveraged in many ways, from customization (Supervised Fine-Tuning, Parameter-Efficient Fine-Tuning, continued pre-training, and more) to training brand-new domain-specific Small Language Models (SLMs). Yet data, while one of the most critical pieces of a modern AI pipeline, has traditionally been a costly constraint on developing innovative LLMs and SLMs: between paying human annotators and sourcing large volumes of domain-specific data, generating high-quality data remains a difficult task.
Through a process called Synthetic Data Generation (SDG), defined in more detail later in this blog, businesses can augment their existing data stores by leveraging LLMs to create customized, high-quality data in large volumes.
NVIDIA is announcing a new suite of models built specifically for SDG: the Nemotron-4 340B family of models, including a state-of-the-art Reward model and an Instruct model, all released under a permissive license that enables businesses and developers alike to use the model outputs to build incredible models.
NVIDIA Open Model License
With the release of the Nemotron-4 340B family of models – which includes a Base, Instruct, and Reward model – NVIDIA is introducing the NVIDIA Open Model License, a permissive license that allows distribution, modification, and use of the Nemotron-4 340B models and their outputs for personal, research, and commercial use, without attribution requirements.
Introducing Nemotron-4 340B Reward Model
Nemotron-4 340B Reward Model is a state-of-the-art multidimensional Reward model. The model takes a prompt and a candidate response as input, and returns a list of floating-point scores, one for each of the five attributes in the HelpSteer2 dataset, listed below.
The model has been evaluated using Reward Bench and achieves benchmark-topping performance despite being trained on only 10K human-annotated response pairs.
Figure 1. Reward Bench results from the HelpSteer2 paper.
Given a prompt, a Reward model provides a score for a response according to human preference. In other words, it can stand in for human judgment on a given prompt and therefore replace a large amount of human annotation. The newly released Nemotron-4 340B Reward leads Reward Bench with an overall score of 92.0. Notably, Nemotron-4 340B Reward has its most significant lead in Chat-Hard, beating the next best alternative by nearly seven percentage points. Chat-Hard is a subset of the test data that evaluates "a reward model's abilities to understand trick questions and subtly different instruction responses." (RewardBench paper)
HelpSteer2 Dataset
With the release of Nemotron-4 340B Reward, we also introduced HelpSteer2, a permissively licensed (CC-BY-4.0) dataset of ten thousand response pairs. Each prompt in the dataset has two responses, each human-annotated on a Likert-5 scale (0–4, with higher meaning better) for five attributes:
Helpfulness: Overall helpfulness of the response to the prompt.
Correctness: Inclusion of all pertinent facts without errors.
Coherence: Consistency and clarity of expression.
Complexity: Intellectual depth required to write a response (i.e., whether the response can be written by anyone with basic language competency or requires deep domain expertise).
Verbosity: Amount of detail included in the response, relative to what is asked for in the prompt.
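To make the annotation scheme concrete, the sketch below models a HelpSteer2-style record as a plain Python dictionary and checks that each attribute score sits on the 0–4 Likert-5 scale. The field names mirror the five attributes above; the example prompt, response, and scores are invented for illustration and are not drawn from the dataset.

```python
# The five HelpSteer2 attributes, each scored 0-4 (higher is better).
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")

def validate_record(record: dict) -> bool:
    """Return True if every HelpSteer2 attribute is present and in [0, 4]."""
    return all(
        isinstance(record.get(attr), int) and 0 <= record[attr] <= 4
        for attr in ATTRIBUTES
    )

# Invented example record, shaped like a HelpSteer2 annotation.
example = {
    "prompt": "Explain what a reward model does in one sentence.",
    "response": "A reward model scores candidate responses so training "
                "can prefer the ones humans would rate highly.",
    "helpfulness": 4,
    "correctness": 4,
    "coherence": 4,
    "complexity": 1,
    "verbosity": 2,
}

print(validate_record(example))  # True
```

Keeping complexity and verbosity as separate scored attributes is what later lets a pipeline distinguish a genuinely helpful response from one that is merely long.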
The dataset is focused on conversational data, including multi-turn conversations in the English language.
More details on the dataset are available in the HelpSteer2 dataset paper.
SteerLM Reward Model Training
Figure 2. The Nemotron-4 340B Reward model was created by aligning the base model with the HelpSteer2 dataset using NeMo Aligner, a toolkit for efficient model alignment.
The Nemotron-4 340B Reward model was trained on top of the Nemotron-4 340B Base model with an additional linear layer that converts the final-layer representation of the end-of-response token into five scalar values, each corresponding to a HelpSteer2 attribute. This approach is referred to as SteerLM Reward Model training. More detailed information on the training process can be found in the HelpSteer2 paper.
Unlike binary preference-based methods, the SteerLM Reward Model training process allows the model to provide more expressive feedback on which responses are considered good and why. Whereas binary-trained reward models might sometimes conflate a long response with a good response, SteerLM Reward Model training explicitly teaches the model to disambiguate verbosity as a scored attribute.
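As a rough illustration of the idea (not NVIDIA's actual implementation), the snippet below adds a linear head that maps the hidden state of the end-of-response token to five scalars, one per HelpSteer2 attribute. The toy hidden size, random weights, and use of NumPy are all assumptions made purely for the sketch.

```python
import numpy as np

ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")
HIDDEN_SIZE = 16  # toy size; the real model's hidden dimension is far larger

rng = np.random.default_rng(0)

# Linear head: projects the final-layer representation of the
# end-of-response token down to one scalar per attribute.
W = rng.standard_normal((len(ATTRIBUTES), HIDDEN_SIZE)) * 0.02
b = np.zeros(len(ATTRIBUTES))

def reward_head(eor_hidden_state: np.ndarray) -> dict:
    """Map a (HIDDEN_SIZE,) hidden state to five named attribute scores."""
    scores = W @ eor_hidden_state + b
    return dict(zip(ATTRIBUTES, scores.tolist()))

# Random vector standing in for the base model's end-of-response state.
hidden = rng.standard_normal(HIDDEN_SIZE)
scores = reward_head(hidden)
print(sorted(scores))  # the five attribute names
```

During training, the five outputs would be regressed against the human Likert labels, which is what gives the model a separate, explicit signal for verbosity rather than a single preference bit.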
A Primer on Synthetic Data Generation
Before we illustrate how developers can utilize the Nemotron-4 340B family of models for Synthetic Data Generation (SDG), we first provide a primer. SDG refers to the process of creating datasets for a variety of model customizations, from Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT), including Low-Rank Adaptation (LoRA), to model alignment (using methods like RLAIF, DPO, and more). Use cases for SDG are not limited to model alignment, either; they span a wide range of applications, from retrieval to evaluation dataset curation to recommender systems. For this blog post, we focus on model alignment as the primary use case for the Nemotron-4 340B family of models. Alignment training is a rapidly growing subdiscipline in the Generative AI domain and can be implemented in several different ways. Out of the existing methods, we discuss a specific implementation of an SDG pipeline as outlined below.
Critically, robust SDG methods go beyond just generating response data; they also include verification and checks to ensure data quality remains high. LLM accuracy is often determined by the quality, rather than the quantity, of the training data, making the "quality filtering" step crucial in SDG recipes.
A Synthetic Data Generation Flow
Figure 3. A Synthetic Data Generation pipeline composed of two high-level steps: 1) generating synthetic responses using the Nemotron-4 340B Instruct model; 2) ranking the synthetic responses using the Nemotron-4 340B Reward model and filtering them to retain only high-quality samples.
In general terms, SDG splits into two primary steps, outlined below.
Synthetic Response Generation
Synthetic response data can be generated by giving Nemotron-4 340B Instruct domain-specific input queries. This allows the model to generate responses aligned with the input query, in a format similar to those used in the Instruction Tuning with GPT-4 paper. These responses can be generated with a zero-shot, few-shot, or chain-of-thought style prompt, depending on the desired response format. Multiple responses to each query can also be generated for filtering in the next step, if required.
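The generation step can be sketched as below. The `generate` callable stands in for an actual call to Nemotron-4 340B Instruct (for example, through a hosted inference endpoint); here it is stubbed out with a fake function so the control flow is self-contained and runnable.

```python
from typing import Callable, Dict, List

def generate_responses(
    queries: List[str],
    generate: Callable[[str], str],
    n_per_query: int = 4,
) -> Dict[str, List[str]]:
    """Generate several candidate responses per domain-specific query,
    so the reward model can later keep only the best ones."""
    return {q: [generate(q) for _ in range(n_per_query)] for q in queries}

# Stub standing in for a real Nemotron-4 340B Instruct call.
def fake_instruct_model(query: str) -> str:
    return f"Synthetic answer to: {query}"

batch = generate_responses(["What is SDG?"], fake_instruct_model, n_per_query=3)
print(len(batch["What is SDG?"]))  # 3
```

Generating several candidates per query (best-of-N) is what gives the reward-model filtering step in the next section something to choose between.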
NOTE: Nemotron-4 340B Instruct model can also be used to generate domain-specific queries initially – thereby alleviating the need for a dataset of pre-established queries. However, this use case is not covered in the tutorial material.
Reward Model Verification
Because Nemotron-4 340B Reward scores multiple attributes, synthetic responses can be ranked by the most desired HelpSteer2 attributes so that only the highest-scoring responses are kept. This emulates human evaluation of response quality and adds a layer of quality monitoring to SDG pipelines.
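Using the reward model's five attribute scores, the filtering step can be sketched as keeping the best of N candidate responses under a weighted combination of attributes. The weights below are purely illustrative (favoring helpfulness and correctness, lightly penalizing verbosity) and are not the weighting used in the HelpSteer2 paper.

```python
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")

# Illustrative weights only: favor helpfulness and correctness,
# lightly penalize verbosity. Not the paper's weighting.
WEIGHTS = {"helpfulness": 1.0, "correctness": 1.0, "coherence": 0.5,
           "complexity": 0.0, "verbosity": -0.25}

def overall_score(attr_scores: dict) -> float:
    """Collapse the five attribute scores into one scalar for ranking."""
    return sum(WEIGHTS[a] * attr_scores[a] for a in ATTRIBUTES)

def keep_best(candidates: list) -> dict:
    """candidates: list of {'response': str, 'scores': {attr: float}}.
    Return the candidate with the highest weighted reward."""
    return max(candidates, key=lambda c: overall_score(c["scores"]))

cands = [
    {"response": "long but inaccurate",
     "scores": {"helpfulness": 1.0, "correctness": 0.5, "coherence": 3.0,
                "complexity": 1.0, "verbosity": 3.5}},
    {"response": "accurate and concise",
     "scores": {"helpfulness": 3.8, "correctness": 3.9, "coherence": 3.7,
                "complexity": 2.0, "verbosity": 1.5}},
]
print(keep_best(cands)["response"])  # accurate and concise
```

The same scoring function can also be used as a simple threshold filter, discarding every candidate whose overall score falls below a chosen cutoff rather than keeping exactly one per query.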
Case Study:
NVIDIA researchers were able to demonstrate the effectiveness of SDG in the HelpSteer2 paper. A total of 100K rows of conversational synthetic data (referenced as “Daring Anteater” or “DA” in the benchmarks below) were created through the above pipeline. Using this dataset, the NVIDIA research team was able to align Llama 3 70B (base model) to match or exceed Llama 3 70B Instruct on a number of standard benchmarks. This was achieved despite using only 1% of the human-annotated data that the Llama 3 70B Instruct model was trained with.
Figure 4. Results from the HelpSteer2 paper for Llama 3 70B.
The results showcase the effectiveness of SDG and how tools like Nemotron-4 340B Reward and Nemotron-4 340B Instruct can add value to businesses' data pipelines today.
It is important to note that many SDG pipelines exist and that this is still an active area of research. Nemotron-4 340B Instruct was itself trained with a variation of the SDG pipeline illustrated in Figure 3, with 98% of its alignment training data being synthetically generated (learn more in the technical report). We encourage developers to evaluate and develop different pipelines and share best practices, as we continue to refine our own SDG methodologies.
Conclusion
Data serves as the backbone of LLMs. Recognizing Synthetic Data Generation as the next frontier of improving Gen AI applications for enterprises, NVIDIA offers the Nemotron-4 340B family of models and an SDG pipeline to enable developers and enterprises alike to turbo-charge a wide range of synthetic data use cases, with a permissive license and some of the highest-quality, openly available instruct and reward models.
Instructions for how to deploy the models are available on their respective model cards, with NeMo Framework instructions available for Nemotron-4 340B Base and Nemotron-4 340B Instruct, and NeMo Aligner instructions available for Nemotron-4 340B Reward.
In the coming weeks, we’ll be releasing Nemotron-4 340B NIMs for optimized inference on NVIDIA GPUs, as well as a technical walkthrough including tutorials on how to create the above SDG pipeline.
Try out Nemotron-4 340B Instruct through the preview inference API available here.