Parquet writers provide encoding and compression options that are turned off by default. Enabling these options may provide better lossless compression for your data, but understanding which options to use for your specific use case is critical to making sure they perform as intended.
In this post, we explore which encoding and compression options work best for your string data. String data is ubiquitous in data science and is used to represent small pieces of information such as names, addresses, and data labels as well as large pieces of information such as DNA sequences, JSON objects, and complete documents.
First, let's briefly define each step.
The encoding step reorganizes your data to reduce its size in bytes while preserving access to each data point.
The compression step further reduces the total size in bytes, but each compression block must be decompressed before the data is accessible again.
In the Parquet format, there are two delta encodings designed to optimize the storage of string data. To help analyze each option, we constructed an engineering study using libcudf and cudf.pandas on string data assembled from public sources, comparing the effectiveness of Parquet's encoding and compression methods by file size, read time, and write time.
What are RAPIDS libcudf and cudf.pandas?
In the RAPIDS suite of open-source accelerated data science libraries, libcudf is the CUDA C++ library for columnar data processing. RAPIDS libcudf is based on the Apache Arrow memory format and supports GPU-accelerated readers, writers, relational algebra functions, and column transformations.
In this post, we use the parquet_io C++ example to demonstrate the libcudf API and assess encoding and compression methods. For read/write throughput, we use the Python read_parquet function to show zero code change performance results with pandas and RAPIDS cudf.pandas, an open-source GPU-accelerated dataframe library that accelerates existing pandas code by up to 150x.
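As a quick illustration of the zero code change workflow, the same pandas script can run on the CPU or, unchanged, on the GPU under cudf.pandas (the file name below is hypothetical):

```python
# read_strings.py -- unchanged pandas code; "strings.parquet" is a hypothetical file.
import pandas as pd

df = pd.read_parquet("strings.parquet")
print(df.dtypes)
print(df.head())

# CPU:                      python read_strings.py
# GPU, zero code changes:   python -m cudf.pandas read_strings.py
# (or run %load_ext cudf.pandas in a Jupyter notebook before importing pandas)
```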
Kaggle string data and benchmarking
String data is complex, and the effectiveness of encoding and compression is data-dependent.
Keeping this in mind, we decided early on to compile a dataset of string columns for comparing encoding and compression methods. At first, we explored a few kinds of data generators, but decisions in the data generator about string compressibility, cardinality, and lengths dominated the file size results.
As an alternative to data generators, we assembled a dataset of 149 string columns based on public datasets with 4.6 GB total file size and 12B total character count. We compared file size, read time, and write time for each of the encoding and compression options.
We ran the comparisons in this post with the Parquet reader/writer stack in both RAPIDS libcudf 24.08 and pandas 2.2.2 using parquet-cpp-arrow 16.1.0. We found <1% difference in encoded size between libcudf and arrow-cpp, and a 3-8% increase in file size when using the ZSTD implementation in nvCOMP 3.0.6 compared to libzstd 1.4.8+dfsg-3build1.
Overall, we find that the conclusions about encoding and compression choices in this post hold true for both CPU and GPU Parquet writers.
String encodings in Parquet
In Parquet, string data is represented using the byte array physical type. For more information about encoding layouts, see Encodings.
The raw byte array representation includes both single-byte and multi-byte characters. Of the several encoding methods available for byte array data in Parquet, most writers default to RLE_DICTIONARY encoding for string data.
Dictionary encoding uses a dictionary page to map string values to integers, and then writes data pages that use this dictionary. If the dictionary page grows too large (usually >1 MiB), then the writer falls back to PLAIN encoding, where 32-bit sizes are interleaved with raw byte arrays.
In the Parquet V2 specification, two new encodings were added for the byte array physical type that can be used to encode string data (a write-time example follows this list):
DELTA_LENGTH_BYTE_ARRAY (DLBA): Provides a simple improvement over PLAIN encoding, where the integer sizes data is grouped together and encoded using DELTA_BINARY_PACKED, and the byte array data is concatenated into one buffer. The goal of DLBA encoding is to help the compression stage achieve better compression ratios by grouping similar data.
DELTA_BYTE_ARRAY (DBA): Stores the prefix length of the previous string plus the suffix byte array. DBA encoding works well on sorted or semi-sorted data where many string values share a partial prefix with the previous row.
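Both delta encodings can be requested explicitly at write time. Here is a minimal sketch using pyarrow, which is one way to exercise these options (column names, values, and the output path are illustrative; the study itself used the libcudf parquet_io example):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative data: a semi-sorted ID column (a good DBA candidate) and a
# short free-text column (a DLBA candidate).
table = pa.table({
    "id": ["user-000001", "user-000002", "user-000003"],
    "note": ["short text", "another note", "yet another note"],
})

# Per-column encodings require disabling dictionary encoding.
pq.write_table(
    table,
    "strings_delta.parquet",
    use_dictionary=False,
    column_encoding={"id": "DELTA_BYTE_ARRAY", "note": "DELTA_LENGTH_BYTE_ARRAY"},
    compression="ZSTD",
)
```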
Total file size by encoding and compression
Figure 1. Total file size for 149 string columns by encoding method and compression method for the RAPIDS libcudf Parquet writer
By default, most Parquet writers use dictionary encoding and SNAPPY compression for string columns. For the 149 string columns in the dataset, the default settings yield a 4.6 GB total file size. For compression, we find that ZSTD outperforms SNAPPY and SNAPPY outperforms NONE.
For this set of string columns, compression has a larger impact on file size than encoding, with PLAIN-SNAPPY and PLAIN-ZSTD outperforming the uncompressed conditions.
The best single setting for this dataset is default-ZSTD, and a further 2.9% reduction in total file size is available by choosing delta encoding for the files where it provides a benefit. If you filter to only the string columns with an average string length of <50 characters, choosing the best encoding with ZSTD shows a 3.8% reduction (1.32 GB to 1.27 GB) in total file size relative to default-ZSTD.
When to choose delta encoding for strings
The default dictionary encoding method works well for data with low cardinality and short string lengths, but data with high cardinality or long string lengths generally reaches smaller file sizes with delta encoding.
For arrow-cpp, parquet-java, and cudf >=24.06, dictionary encoding uses a 1-MiB dictionary page size limit. If the distinct values in a row group fit within this size, dictionary encoding is likely to yield the smallest file sizes. However, high-cardinality data is less likely to fit within the limit, and long strings reduce the number of distinct values that can fit.
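As a rough back-of-the-envelope check (a hypothetical heuristic, not part of the study), you can estimate whether a column's distinct values are likely to fit under the 1-MiB dictionary page limit:

```python
import pandas as pd

def likely_fits_dictionary(col: pd.Series, page_limit_bytes: int = 1 << 20) -> bool:
    """Rough estimate of whether a string column's distinct values fit the
    ~1 MiB dictionary page limit; ignores row-group boundaries and page overhead."""
    distinct = col.dropna().drop_duplicates()
    approx_bytes = sum(len(s.encode("utf-8")) for s in distinct) + 4 * len(distinct)
    return approx_bytes <= page_limit_bytes
```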
Figure 2. Optimal encoding method for string data when using ZSTD compression
In Figure 2, each point represents a string column and is plotted by its distinct value count and average characters per string. Dotted ellipses highlight approximate clusters.
For string columns where delta encoding yields the smallest file size, we found the file size reduction to be most pronounced when there are <50 average chars/string. The character counts include multi-byte characters.
The main benefit of DLBA encoding is that it concentrates all the size data into a single buffer, and short string columns have a higher fraction of size data. DBA encoding also shows its largest file size reductions for string columns with <50 average chars/string, with one example 80% smaller than the dictionary-encoded default.
Looking through the examples, DBA encoding provided the largest file size reductions for columns with sorted or semi-sorted values such as ascending IDs, formatted timestamps, and timestamp fragments.
Figure 3. File size reduction for using delta encoding compared to dictionary encoding with plain fallback
In Figure 3, each point represents a file for which delta encoding plus ZSTD compression gave the smallest file size. The dotted line is a quadratic line of best fit to the data.
Reader and writer performance
In addition to data on file sizes, we also collected data on file write time and read time, where the GPU-accelerated cudf.pandas library showed substantially higher throughput than pandas.
Rather than comparing C++ with Python script runtimes, we measured Parquet file processing throughput using the same Python script with pandas and zero code change cudf.pandas. The string dataset included 149 files, with 12B total characters of string data, and 2.7 GB on-disk total file size when using optimal encoding and compression methods. File read and file write timing was collected using a Samsung 1.9TB PCI-E Gen4 NVMe SSD data source, Intel Xeon Platinum 8480CL CPU, and NVIDIA H100 80GB HBM3 GPU. The processing time was measured as the time to execute the read or write step and was summed across all files, and the OS cache was cleared before reading each file.
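The timing loop itself is simple. The sketch below is a simplified, hypothetical version of it (paths are illustrative, throughput is computed from on-disk file size, and clearing the OS cache between reads is handled outside the script):

```python
# bench_read.py -- simplified read-throughput sketch.
import glob
import os
import time

import pandas as pd

total_bytes = 0
total_seconds = 0.0

for path in glob.glob("string_columns/*.parquet"):
    start = time.perf_counter()
    pd.read_parquet(path)
    total_seconds += time.perf_counter() - start
    total_bytes += os.path.getsize(path)

print(f"Read throughput: {total_bytes / total_seconds / 1e6:.0f} MB/s")

# CPU baseline:                python bench_read.py
# GPU with zero code changes:  python -m cudf.pandas bench_read.py
# GPU with an RMM pool:        CUDF_PANDAS_RMM_MODE="pool" python -m cudf.pandas bench_read.py
```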
Pandas using the pyarrow Parquet engine showed 22 MB/s read throughput and 27 MB/s write throughput.
With the default CUDA memory resource in 24.06, cudf.pandas showed 390 MB/s read throughput and 200 MB/s write throughput.
When using cudf.pandas with an RMM pool (by setting the environment variable CUDF_PANDAS_RMM_MODE="pool"), we observed 552 MB/s read throughput and 263 MB/s write throughput.
Figure 4. Parquet file processing throughput in MB/s for the string dataset
Guide to encoding and compressing string data
Based on the comparisons in this post, here are our recommended encoding and compression settings when working with string data.
For encoding, the default dictionary encoding for strings in Parquet works well with string data that has fewer than ~100K distinct values per column.
When there are more than ~100K distinct values, delta and delta length encodings generally yield the smallest file sizes. Delta and delta length encodings provide the largest benefits (10-30% smaller files) for short strings (<30 characters/string).
For compression, ZSTD yields smaller file sizes than Snappy and uncompressed options regardless of encoding method and is an excellent choice.
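These rules of thumb can be condensed into a small helper, sketched below with hypothetical names and with thresholds taken from this study (they may not transfer to every dataset):

```python
import pandas as pd

def suggest_string_settings(col: pd.Series, sorted_like: bool = False) -> dict:
    """Suggest Parquet writer settings for one string column.

    sorted_like: set True when values are sorted or semi-sorted
    (ascending IDs, formatted timestamps, and so on), where DBA shines.
    """
    distinct = col.nunique(dropna=True)
    if distinct > 100_000:
        # High cardinality: the dictionary page tends to overflow, so delta encodings win.
        encoding = "DELTA_BYTE_ARRAY" if sorted_like else "DELTA_LENGTH_BYTE_ARRAY"
    else:
        encoding = "RLE_DICTIONARY"  # the writer default
    return {"encoding": encoding, "compression": "ZSTD"}
```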
We also measured file read and write times, finding a 17-25x Parquet read speedup for GPU-accelerated cudf.pandas compared to default pandas.
Conclusion
If you are looking for the fastest way to try out GPU-accelerated Parquet readers and writers, see RAPIDS cudf.pandas and RAPIDS cuDF for accelerated data science on Google Colab.
RAPIDS libcudf provides flexible, GPU-accelerated tools for reading and writing columnar data in formats such as Parquet, ORC, JSON, and CSV. Build and run a few examples to get started with RAPIDS libcudf. If you’re already using cuDF, you can build and run the new C++ parquet_io example by visiting /rapidsai/cudf/tree/HEAD/cpp/examples/parquet_io on GitHub.
For more information about CUDA-accelerated dataframes, see the cuDF documentation and the /rapidsai/cudf GitHub repo. For easier testing and deployment, RAPIDS Docker containers are also available for releases and nightly builds. To join the discussion, see the RAPIDS Slack workspace.
Acknowledgments
We owe huge thanks to Ed Seidl from Lawrence Livermore National Laboratory for contributing V2 header support, delta encoders and decoders, and a host of critical features to RAPIDS libcudf.