Encoding and Compression Guide for Parquet String Data Using RAPIDS

Table of Contents

  1. Total file size by encoding and compression
  2. When to choose delta encoding for strings
  3. Conclusion

Total file size by encoding and compression

By default, most Parquet writers use dictionary encoding and SNAPPY compression for string columns. The best single setting for this dataset is default-ZSTD, and a further 2.9% reduction in total file size is available by choosing delta encoding for the files where it provides a benefit. The default dictionary encoding works well for data with low cardinality and short string lengths. For string columns where delta encoding yields the smallest file size, the reduction is most pronounced at fewer than 50 average characters per string. Delta byte array (DBA) encoding likewise shows its largest file size reductions for string columns with fewer than 50 average characters per string.
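
As a hedged illustration of how these combinations can be compared, the sketch below writes a single string column with a few encoding and compression settings and prints the resulting file sizes. It uses pyarrow for per-column encoding control (writing DELTA_BYTE_ARRAY requires a recent pyarrow release); the column name, data, and file names are placeholders, and recent cuDF releases expose similar compression and column encoding options on to_parquet.

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder string column; substitute your own dataset here.
    table = pa.table({"city": ["San Jose", "Austin", "Santa Clara"] * 100_000})

    configs = {
        "dictionary-snappy.parquet": dict(compression="SNAPPY"),   # common default
        "dictionary-zstd.parquet":   dict(compression="ZSTD"),     # default-ZSTD
        "delta-zstd.parquet":        dict(compression="ZSTD",
                                          use_dictionary=False,
                                          column_encoding={"city": "DELTA_BYTE_ARRAY"}),
    }

    # Write the same table with each setting and compare on-disk sizes.
    for name, kwargs in configs.items():
        pq.write_table(table, name, **kwargs)
        print(f"{name}: {os.path.getsize(name):,} bytes")

On a low-cardinality placeholder column like this one, dictionary encoding typically wins; the delta variant is worth measuring on high-cardinality or long-string columns.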

When to choose delta encoding for strings

Default dictionary encoding for strings in Parquet works well with string data that has fewer than roughly 100K distinct values per column. The distinct values in each row group must also fit within the dictionary page size limit (often 1 MiB by default) for dictionary encoding to yield the smallest file sizes. Delta encoding is effective for high-cardinality columns or long strings, producing smaller files than dictionary encoding in those cases.
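
The sketch below, a rough heuristic rather than an exact rule, applies this guidance by computing the distinct-value count and average characters per string for each string column. The input file name, thresholds, and column selection are illustrative assumptions; the code runs with standard pandas or under cudf.pandas for GPU acceleration.

    import pandas as pd  # or run under cudf.pandas to accelerate on the GPU

    df = pd.read_parquet("strings.parquet")  # placeholder input file

    # Suggest an encoding per string column based on cardinality and string length.
    for col in df.select_dtypes(include="object").columns:
        n_distinct = df[col].nunique()
        avg_chars = df[col].str.len().mean()
        if n_distinct < 100_000 and avg_chars < 50:
            suggestion = "dictionary (default)"
        else:
            suggestion = "delta (DELTA_LENGTH_BYTE_ARRAY or DELTA_BYTE_ARRAY)"
        print(f"{col}: {n_distinct:,} distinct, {avg_chars:.1f} avg chars -> {suggestion}")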

Conclusion

Choosing the optimal encoding and compression for string data in Parquet can lead to significant reductions in file size, with delta encoding and ZSTD compression yielding the smallest files for high-cardinality or long-string columns. GPU-accelerated frameworks such as RAPIDS cudf.pandas and cuDF can also provide substantial speedups in Parquet read and write times compared to traditional pandas implementations.
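
As a minimal sketch, assuming a Jupyter environment with RAPIDS installed, existing pandas Parquet code can be accelerated by loading the cudf.pandas extension before importing pandas; the file names are placeholders.

    %load_ext cudf.pandas

    import pandas as pd

    # Reads and writes dispatch to cuDF on the GPU when possible,
    # falling back to CPU pandas otherwise.
    df = pd.read_parquet("strings.parquet")                    # placeholder input
    df.to_parquet("strings-zstd.parquet", compression="zstd")

Outside a notebook, the same script can be run unchanged with python -m cudf.pandas script.py.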