gzip

gzip is a lossless compression algorithm. Decompressed data is bit-for-bit identical to the original; nothing is lost in the round trip. It’s the workhorse compression filter for HDF5 files, available in essentially every HDF5 installation, which makes gzip-compressed files portable to any reader.

In h5py, gzip compression is selected per-dataset via the compression keyword:

hdf.create_dataset('big', data=matrix,
                   compression='gzip', compression_opts=7)

The compression_opts keyword sets the gzip compression level: an integer from 0 (no compression, just store) to 9 (maximum compression, slowest). The default is 4. Higher levels produce smaller files but take longer to write and read. Level 7 is a reasonable middle ground when compression matters.

gzip gives a good compression ratio for text and for structured numerical data: sensor readings that vary slowly, images with large uniform regions, anything with repetition or local similarity. For random floating-point data the ratio is poor. Random bytes have no redundancy for the algorithm to exploit, and the compressed file ends up almost as large as the uncompressed one. Compression helps when the data has structure, and most real engineering data does.

The two alternative HDF5 compression filters are lzf (also lossless, much faster, less compressed) and szip (also lossless, tuned for correlated scientific data, patent-encumbered, not always available). gzip is the safe default: slower than the alternatives but readable everywhere.

gzip is based on the DEFLATE algorithm, which combines LZ77 dictionary compression with Huffman coding. The same algorithm underlies the PNG image format and the ZIP archive format.

Idriss Rami — Notes

Explorer

gzip

Graph View

Backlinks