Accelerating Vector Search: RAPIDS cuVS IVF-PQ Part 2, Performance Tuning

In the first part of the series, we presented an overview of the IVF-PQ algorithm and explained how it builds on top of the IVF-Flat algorithm, using the Product Quantization (PQ) technique to compress the index and support larger datasets.

In this part two of the IVF-PQ post, we cover the practical aspects of tuning IVF-PQ performance. It's worth noting again that IVF-PQ uses lossy compression, which affects the quality of the results (measured in terms of recall). We explore the algorithm's hyper-parameters, which control the memory footprint and search performance, and show how to tune them for a wide range of desired recall levels. We also cover the refinement (reranking) technique, which can be an important step for achieving high recall with PQ compression, especially when working with billion-scale datasets.

All of the index build and search parameters covered in this blog post are documented in cuVS's API reference documentation (C, C++, Python, and Rust). The cuVS IVF-PQ tutorial notebook and example projects are also helpful for getting started.

Tuning parameters for index building

Since IVF-PQ belongs to the same family of IVF methods as IVF-Flat, some parameters are shared, such as the coarse-level indexing and search hyper-parameters. Please see our prior blog post on IVF-Flat for an introduction and guide to tuning these parameters. While these parameters have the same names, there are a few details specific to IVF-PQ that are worth considering when setting them. Aside from these shared parameters, IVF-PQ adds a few additional parameters that control the compression.

n_lists

The n_lists parameter has the same meaning as in IVF-Flat: the number of partitions (inverted lists) into which the input dataset is clustered. During search, the algorithm scans through some of these lists. Both the number of lists probed and their sizes affect the performance.

Figure 1 presents the QPS/recall curves for the DEEP-100M dataset for a wide range of partitions (n_lists). Each curve represents the number of queries per second (QPS) and the recall values with varying search parameters. In this case, we're varying the number of probed lists in the range between 10 and 5000. Typically, one decides on the desired recall level in advance and looks for the hyper-parameter combination that yields the highest QPS. Often, there's no single parameter combination that gives the best QPS for all recall levels. QPS/recall curves are a good way to visualize the trade-off and select an appropriate configuration for the desired recall range.

Figure 1. Effects of n_lists on QPS/recall trade-off.

Depending on the desired recall, the models with 10K and 20K clusters might be most desirable, as they have the best throughput for a wide range of recall values. The n_lists = 5K curve has significantly lower throughput than the other settings across the whole recall range, and the 200K and 500K curves also have very low throughput compared to the others.

The above experiment suggests that n_lists values in the range of 10K to 50K are likely to yield good performance across recall levels, though the best setting for this parameter still depends on the dataset.
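As a rough illustration of such a sweep, here is a minimal sketch using the cuVS Python API (cuvs.neighbors.ivf_pq). The random dataset, the n_lists values, and the n_probes choice are placeholders for illustration only; a real sweep would use your own data, vary n_probes as well, and measure recall against exact (brute-force) ground truth.

import cupy as cp
from cuvs.neighbors import ivf_pq

# Placeholder data; substitute your own vectors (e.g., DEEP-100M).
dataset = cp.random.random((1_000_000, 96)).astype(cp.float32)
queries = cp.random.random((10_000, 96)).astype(cp.float32)

for n_lists in (1_000, 4_000, 16_000):  # the 10K-50K guidance above targets ~100M vectors
    # Build an index with the chosen number of coarse clusters (inverted lists).
    index = ivf_pq.build(ivf_pq.IndexParams(n_lists=n_lists), dataset)

    # Probe a fixed fraction of the lists; a full sweep would vary n_probes too.
    search_params = ivf_pq.SearchParams(n_probes=max(1, n_lists // 100))
    distances, neighbors = ivf_pq.search(search_params, index, queries, 10)
    # Time this call and compare neighbors against brute-force results to get
    # one QPS/recall point per configuration.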

pq_dim, pq_bits

Compression is mainly controlled with the pq_dim parameter. A good technique for tuning it is to start with one quarter of the number of features in your dataset and increase it in steps.

An interesting fact about the IVF-PQ search is that its core component, the fine search step, doesn't depend on the original dimensionality of the data. On the one hand, it scans through the quantized, encoded database vectors, whose representation depends only on pq_bits and pq_dim. On the other hand, the search vectors are transformed to a pq_dim·pq_len representation using a relatively cheap matrix multiplication (GEMM) operation early in the search pipeline. This allows us to experiment freely with the pq_dim parameter range. It's also possible to set pq_dim larger than dim, although this likely won't improve the recall.

Figure 2. Effects of pq_dim on QPS/recall trade-off. Note the logarithmic scale of the vertical axis.

Figure 2 demonstrates the effects of setting a range of pq_dim values for the IVF-PQ model on a 10 million vector subsample of the DEEP-1B dataset. This dataset has only 96 dimensions, and when pq_dim is set larger than the number of dimensions, a random transformation is performed to increase the feature space. Note that this does not make much sense in practice, except for benchmarking the QPS of the model, but in this experiment we've opted to keep using the DEEP dataset for consistency with the rest of this blog post.

Figure 2 illustrates significant drops in QPS, which happen for several reasons and cannot be explained by the increased amount of work alone. Let's take a closer look at the curves as pq_dim increases.

First, changing pq_dim from 128 to 256 or 384 effectively doubles or triples the amount of compute work and the size of required shared memory per CUDA block. This likely leads to lower GPU occupancy and, as a result, the QPS drops by a factor of three or more. 

Second, increasing pq_dim to 512 or more likely forces the look-up table (LUT) to be placed in global memory, because otherwise there is not enough shared memory per CUDA block to launch the kernel. This leads to another 4x slowdown.

The pq_bits parameter is the number of bits used in each individual PQ code. It controls the codebook size (2^pq_bits), or the number of possible values each code can have. IVF-PQ supports codebook sizes from 16 to 256, which means pq_bits can be in the range of [4, 8].

pq_bits also has an effect on the compression. For example, an index with pq_bits = 4 is half the size of an index with pq_bits = 8. Even more strongly, pq_bits affects the size of the LUT, which is proportional to 2^pq_bits, and it has a drastic effect on recall. For more details about the lut_size formula, see the Product quantization section in the first part of this post.

A few things to consider about pq_bits:

It is required that (pq_dim * pq_bits) is a multiple of 8. In general, keeping pq_dim a power of two improves data alignment and, thus, search performance.

Using pq_dim * pq_bits >= 128 and making (pq_dim * pq_bits) a multiple of 32 maximizes GPU memory bandwidth utilization.

In general, a good starting point is setting pq_bits = 8 and decreasing from there, as in the build sketch after this list.

The recall loss due to smaller pq_bits can be compensated by performing a refinement step (more on this in the Refinement section).

For extremely high-dimensional data (more than a few hundred dimensions) and large pq_dim, a lower pq_bits setting can yield a drastic search speedup because the LUT can be made small enough to fit in shared memory.

A small pq_bits value can reduce shared memory bank conflicts, which can improve the QPS.

Alternatively, as we will see later in this blog, setting the search parameter lut_dtype to a reduced-precision floating-point type (fp8) may also be enough to keep the LUT in shared memory.
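To make the build-time knobs concrete, here is a minimal, hedged sketch of building an index with explicit pq_dim and pq_bits values, again assuming the cuVS Python API (cuvs.neighbors.ivf_pq) and reusing the placeholder dataset from the earlier n_lists sketch; the exact parameter names are listed in the API reference.

from cuvs.neighbors import ivf_pq

build_params = ivf_pq.IndexParams(
    n_lists=4_000,             # placeholder value sized for the 1M-vector placeholder dataset
    pq_dim=24,                 # roughly dim / 4 as a starting point for 96-dim data
    pq_bits=8,                 # 8 is a good default; smaller values shrink the index and the LUT
    codebook_kind="subspace",  # per-subspace codebooks (see the next section)
)
index = ivf_pq.build(build_params, dataset)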

codebook_kind

The codebook_kind parameter determines how the codebooks for the second-level quantizer are constructed. The second-level quantizers are trained either for each subspace or for each cluster:

subspace creates pq_dim second-level quantizers, one for each slice of the data along the feature space (columns).

cluster creates n_lists second-level quantizers, one for each first-level cluster.

In both settings, the centroids are found using k-means clustering, interpreting the data as having pq_len dimensions. There is no definitive way to determine in advance which of the two options will yield better performance for a particular use case, but the following guidelines may help (see also the sketch after this list):

A per-cluster codebook tends to take longer to train, as n_lists is usually much higher than pq_dim (more codebooks to train).

Search with a per-cluster codebook usually has better utilization of the GPU's shared memory than with a per-subspace codebook. This may result in a faster search when the LUT is large and occupies a significant part of the GPU shared memory.

In practice, the per-subspace codebook tends to result in slightly higher recall.
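If you want to compare the two options on your own data, a hedged sketch (assuming the "subspace" and "cluster" value names from the cuVS Python API reference) could look like this:

from cuvs.neighbors import ivf_pq

# Build two otherwise-identical indexes that differ only in the codebook type,
# then compare their build time, recall, and QPS on your own queries.
common = dict(n_lists=4_000, pq_dim=24, pq_bits=8)
index_subspace = ivf_pq.build(ivf_pq.IndexParams(codebook_kind="subspace", **common), dataset)
index_cluster = ivf_pq.build(ivf_pq.IndexParams(codebook_kind="cluster", **common), dataset)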

kmeans_n_iters, kmeans_trainset_fraction

These parameters are passed to the k-means algorithm in both IVF-Flat and IVF-PQ to cluster the data during index creation. In addition, kmeans_n_iters is passed down to the same k-means algorithm when constructing the codebooks. In practice, we have found the kmeans_n_iters parameter rarely needs to be adjusted.
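If they do need adjusting, for example to cap the k-means training cost on a very large dataset, a hedged sketch (with illustrative values only) might look like this:

from cuvs.neighbors import ivf_pq

build_params = ivf_pq.IndexParams(
    n_lists=50_000,
    kmeans_n_iters=20,             # number of k-means iterations used during clustering
    kmeans_trainset_fraction=0.1,  # train the clustering on a ~10% subsample of the data
)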

To learn more about the available index build parameters for IVF-PQ, you can refer to the API reference documentation, which includes the C, C++, Python, and Rust APIs.

Tuning parameters for search

Refer to our previous blog post on IVF-Flat for information on the n_probes parameter. The n_probes / n_lists ratio determines what fraction of the dataset is searched, and therefore it has a strong effect on search accuracy and throughput. n_probes is the most important search parameter, but IVF-PQ provides a few more knobs to adjust the internal workings of the algorithm.
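As a minimal sketch of how an n_probes sweep might be measured (assuming the cuVS Python API, an index and queries built as in the earlier sketches, and a precomputed ground_truth array of exact nearest-neighbor indices):

import time
import cupy as cp
from cuvs.neighbors import ivf_pq

k = 10
# ground_truth: (n_queries, k) array of exact nearest-neighbor ids, computed once
# with an exact (brute-force) search; assumed to be available here.
for n_probes in (10, 50, 200, 1000):
    params = ivf_pq.SearchParams(n_probes=n_probes)
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    distances, neighbors = ivf_pq.search(params, index, queries, k)
    cp.cuda.Device().synchronize()
    qps = queries.shape[0] / (time.perf_counter() - start)

    # recall@k: the fraction of exact neighbors recovered by the ANN search
    found = cp.asarray(neighbors)
    gt = cp.asarray(ground_truth)
    recall = float(cp.mean(cp.any(found[:, :, None] == gt[:, None, :], axis=1)))
    print(f"n_probes={n_probes:5d}  QPS={qps:10.0f}  recall@{k}={recall:.3f}")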

internal_distance_dtype

The internal_distance_dtype parameter controls the representation of the distance or similarity during the search. By default, this is the 32-bit float (numpy.float32), but changing it to 16-bit float (numpy.float16) can save memory bandwidth where appropriate. For example, float16 can be useful when the dataset datatype is already low precision (such as 8-bit int), though it can help with 32-bit float datasets too.

lut_dtype 

lut_dtype is the datatype used to store the look-up table (LUT). The PQ algorithm stores data in the product quantizer encoded format, which needs to be decoded during the second-phase (in-cluster) search. Thus, the algorithm constructs a separate LUT for each cluster. Constructing the look-up tables can be costly, and the tables can be rather large. By default, the individual elements in the table are stored as 32-bit floats but you can change this to 16-bit or 8-bit to reduce the table size.

Ideally, the LUT should fit in the shared memory of the GPU; however, this is not the case for datasets with very large dimensionality. The logic of deciding whether this table should remain in the shared or global memory of the GPU is somewhat complicated, but you can see the outcome when gradually changing pq_dim and observing a sudden drop in QPS after a certain point (Figure 2). When the LUT is placed in the shared memory, the search speed tends to be 2-5x faster compared to a similar search with the LUT in the global memory.
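As a rough back-of-the-envelope check, assuming (as sketched in the first part of this post) that the LUT holds on the order of pq_dim * 2^pq_bits entries of lut_dtype, you can estimate whether the table is likely to fit in shared memory:

def approx_lut_bytes(pq_dim: int, pq_bits: int, dtype_bytes: int) -> int:
    """Rough per-query LUT size estimate: pq_dim slices, 2**pq_bits codes each."""
    return pq_dim * (2 ** pq_bits) * dtype_bytes

# DEEP-like setup: pq_dim = 96, pq_bits = 8
print(approx_lut_bytes(96, 8, 4))  # 98304 bytes (~96 KiB) with fp32 entries
print(approx_lut_bytes(96, 8, 1))  # 24576 bytes (~24 KiB) with fp8 entries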

An additional benefit of changing the lut_dtype instead of, say, build parameters like pq_dim, is that you can reduce the LUT size by a factor of 2x or 4x without having to rebuild the index. A few additional things to note (see also the sketch after this list):

It is not possible (and does not make sense) to set the lut_dtype to a more precise type than the internal_distance_dtype, as the former is still converted to the latter for computing the distance.

Smaller is not always faster. In some cases, setting lut_dtype = internal_distance_dtype yields better performance than setting a smaller lut_dtype because it saves a few cycles on data conversion in a performance-critical loop.
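Here is a hedged sketch of trying a few internal_distance_dtype/lut_dtype combinations that mirror the XX/YY labels of Figure 3, assuming the cuVS Python SearchParams accepts numpy dtypes for these fields (with np.uint8 requesting the 8-bit fp8 LUT storage) as listed in the API reference:

import numpy as np
from cuvs.neighbors import ivf_pq

k = 10
configs = {
    "32/32": ivf_pq.SearchParams(n_probes=200, internal_distance_dtype=np.float32,
                                 lut_dtype=np.float32),
    "16/16": ivf_pq.SearchParams(n_probes=200, internal_distance_dtype=np.float16,
                                 lut_dtype=np.float16),
    "32/8":  ivf_pq.SearchParams(n_probes=200, internal_distance_dtype=np.float32,
                                 lut_dtype=np.uint8),
}
for label, params in configs.items():
    distances, neighbors = ivf_pq.search(params, index, queries, k)
    # Measure QPS and recall for each label as in the n_probes sweep above.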

Figure 3. Effects of search parameters on QPS/recall trade-off on NVIDIA H100 GPU with DEEP-100M dataset.

Figure 3 demonstrates the trade-offs for different combinations of the internal search data types. The XX/YY labels denote the bit-precision of the used internal_distance_dtype/lut_dtype pair. Depending on the GPU, the selected dataset, and the batch size, you may see different results.

With the DEEP-100M dataset, pq_dim = dim = 96, pq_bits = 8, and the batch size of 10 queries at a time, the 8-bit lut_dtype appears to come at a significant cost to recall. Selecting a 16-bit type for one or both parameters does not seem to reduce the recall in this case, but does improve the QPS.

Improving recall with refinement

Tweaking the search and indexing parameters is not always enough: the recall can remain lower than required because of the PQ compression. You could try using IVF-Flat as an alternative, but it becomes problematic for billion-scale datasets because IVF-Flat offers no index compression. Refinement offers another promising alternative.

Refinement is a separate operation that can be performed after the ANN search, and can be applied after IVF-PQ to recover the recall lost to compression. The refinement operation recomputes the exact distances for the already selected candidates and selects a subset of them. This operation is often referred to as reranking by other vector search libraries, and it is usually performed with a number of candidates that is larger than the k neighbors ultimately requested by the user.

Here is a small pseudocode snippet to illustrate how the refinement follows IVF-PQ search:

# Over-sample: ask IVF-PQ for 2*k candidates, then re-rank them with exact distances.
candidate_dists, candidates = ivf_pq.search(search_params, index, queries, k * 2)
neighbor_dists, neighbors = refine(dataset, queries, candidates, k)

For an example of the refinement API, refer to our example projects, the cuVS API reference docs, or the tutorial IVF-PQ notebook. 

Although not required, we usually set the number of neighbor candidates initially queried to an integer multiple of the desired number of neighbors. This multiple is referred to as the refine_ratio throughout cuVS's ANN APIs. The Search performance section in the first part of this post shows how searching with x2 refinement improves the recall by a large margin (from 0.85 to 0.95).
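Putting it together, here is a minimal sketch of a search followed by refinement with refine_ratio = 2, assuming the refine function is importable from cuvs.neighbors as used in the pseudocode above (check the API reference for the exact import path and signature):

import numpy as np
from cuvs.neighbors import ivf_pq, refine  # import path assumed; see the API reference

k = 10
refine_ratio = 2  # over-sample by 2x, then re-rank down to k neighbors

search_params = ivf_pq.SearchParams(n_probes=200, lut_dtype=np.float16)
candidate_dists, candidates = ivf_pq.search(search_params, index, queries,
                                            k * refine_ratio)

# Re-rank the over-sampled candidates with exact distances against the original
# (uncompressed) dataset. If dataset, queries, and candidates are all host (numpy)
# arrays, the refinement runs on the CPU; device arrays keep it on the GPU.
neighbor_dists, neighbors = refine(dataset, queries, candidates, k)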

Figure 4 shows the internal_distance_dtype/lut_dtype experiment from Figure 3 compensated with the refinement. We use ratios 2 and 4 to see the change in recall and QPS. The XX/YY-xZ labels denote the bit-precision of the used internal_distance_dtype/lut_dtype pair and the level of the refinement.

Figure 4. Effects of search parameters on QPS/recall trade-off, compensated with the refinement operation.

We applied the refinement operation after the search with ratios x1 (no refinement), x2, and x4, and dropped some of the search configurations to declutter the plot. Note that refinement requires access to the source dataset, so this should be considered when calculating the memory footprint. We keep the index on the GPU and the dataset on the host, so the refinement operation is performed on the CPU, although cuVS supports both modes.

Figure 4 shows that refinement compensates for the errors arising from the PQ compression. Ratio x2 achieves 0.99 recall, while increasing to x4 boosts the maximum recall even further at the cost of QPS in the lower-recall range. Judging by the left tails of the curves, the refinement in this case comes with an ~25% cost to the QPS.

Summary

This Accelerating Vector Search series on inverted-file indexes covers two cuVS algorithms: IVF-Flat and IVF-PQ. The IVF-Flat algorithm represents an intuitive transition from exact KNN to a flexible ANN. IVF-PQ extends IVF-Flat with PQ compression, which further speeds up the search, making it possible to process billion-scale datasets with limited GPU memory.

We presented many techniques for fine-tuning parameters for index building and search. Knowing when to trade accuracy for performance, data practitioners can achieve the best results efficiently.

The RAPIDS cuVS library provides a range of vector search algorithms to accelerate a wide variety of use cases, from exact search to low-accuracy, high-QPS ANN methods, covering million-scale to billion-scale problem sizes and possibly beyond.

To practice tuning IVF-PQ parameters for your dataset, check out our IVF-PQ notebook on GitHub. To further explore the provided APIs, see the cuVS documentation.