MiGz for Compression and Decompression


Introduction

Compressing and decompressing files with GZip normally uses a single thread. For large files, this can bottleneck dependent tasks like data processing, data analysis, and machine learning. Although there are several alternatives supporting multithreaded compression, such as pigz (command-line tool) and ParallelGZip (Java library), no GZip utility or Java library (for any compression format) supports multithreaded decompression.

At LinkedIn, we routinely work with data ranging in size from several gigabytes to many terabytes; compression alleviates the problem of relatively slow network or disk I/O by reducing the number of bytes that must be transferred. However, the bottleneck often then becomes the CPU as it performs the single-threaded compression and decompression.

Consequently, we’ve developed MiGz, a multithreaded, GZip-compatible compression and decompression utility available as both a Java library and cross-platform command-line tools, which we are now pleased to release as an open source project. MiGz satisfies our three key design goals:

  • Platform-independent Java library and command-line tools: MiGz has no native dependencies; rather, it leverages the DEFLATE algorithm implemented natively by the JVM to achieve native code compression/decompression performance.

  • Ubiquitous compatibility across platforms and languages: Data compressed by MiGz using multiple threads can be read (on a single thread) by any other GZip decompressor, including those available in Python, Linux, macOS, .NET, etc.

  • Fast and effective multithreaded compression and decompression.

GZip compatibility

MiGz uses the GZip format, which has widespread support and offers both high speed and a high compression ratio. MiGz-compressed files are entirely valid GZip files: they can be decompressed on a single thread by any GZip utility or library, or in parallel by the MiGz decompressor. Note, however, that GZip files not created by MiGz cannot be decompressed in parallel.
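
To illustrate why a file made of several independently compressed pieces remains readable by ordinary tools, here is a minimal sketch using only the JDK's java.util.zip classes (not MiGz itself). The class and method names are made up for this example. It concatenates two GZip members and reads them back with a plain GZIPInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: standard JDK streams, not the MiGz implementation.
final class MultiMemberGzipDemo {
  /** Compresses each part as its own GZip member and concatenates the results. */
  static byte[] compressAsMembers(String... parts) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      for (String part : parts) {
        // Each GZIPOutputStream writes one complete member: header, DEFLATE data, footer.
        GZIPOutputStream gz = new GZIPOutputStream(out);
        gz.write(part.getBytes(StandardCharsets.UTF_8));
        gz.finish(); // end this member without closing the shared output
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Decompresses all members in sequence, as any GZip reader would. */
  static String decompress(byte[] gzipBytes) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] gz = compressAsMembers("first block, ", "second block");
    System.out.println(decompress(gz)); // prints "first block, second block"
  }
}
```

The reader never needs to know how many members the file contains; it simply continues into the next member when one ends.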

Using MiGz

To use MiGz in your Java application, obtain the code from our GitHub page or get the precompiled JAR via these Maven coordinates: “com.linkedin.migz:migz:1.0.0”.

Then, to compress and decompress data, use com.linkedin.migz.MiGzOutputStream and com.linkedin.migz.MiGzInputStream just as you would java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream, respectively.

If you’d prefer to use the MiGz command-line tools, the utilities mzip and munzip are simple to use: mzip compresses and munzip decompresses, each reading data from stdin and writing the result to stdout. Note that they are not drop-in replacements for the Linux gzip/gunzip utilities.

How MiGz works

The GZip file format allows data to be compressed in multiple blocks (“members”), each with its own header and footer metadata. MiGz takes advantage of this by partitioning the original data into fixed-size sections (the final section may be smaller) and compressing these independently on multiple threads. The compressed blocks are then written in the proper order to the underlying stream. When decompressing, MiGz likewise decompresses the (compressed) blocks on multiple threads, then uses the SequentialQueue class in the concurrentli library to stitch the decompressed data back together into a coherent, properly-ordered stream for the caller.
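
The compression side of this scheme can be sketched with the JDK alone. The code below is an illustration under stated assumptions, not the MiGz source: the class name, the blockSize and threads parameters, and the use of an ExecutorService with Futures (rather than concurrentli's SequentialQueue) are all choices made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of block-parallel GZip compression; not the MiGz implementation.
final class ParallelMemberCompressor {
  /** Compresses data[off, off + len) into one self-contained GZip member. */
  static byte[] gzipMember(byte[] data, int off, int len) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(data, off, len);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  /** Splits data into fixed-size blocks, compresses them concurrently, and writes them in order. */
  static byte[] compress(byte[] data, int blockSize, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<byte[]>> members = new ArrayList<>();
      int off = 0;
      do { // do-while so that empty input still yields one (empty) member
        final int start = off;
        final int len = Math.min(blockSize, data.length - off);
        members.add(pool.submit(() -> gzipMember(data, start, len)));
        off += blockSize;
      } while (off < data.length);

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      for (Future<byte[]> member : members) {
        byte[] bytes = member.get(); // waiting in submission order preserves block order
        out.write(bytes, 0, bytes.length);
      }
      return out.toByteArray();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  /** Single-threaded decompression with the standard JDK reader. */
  static byte[] decompress(byte[] gzipBytes) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because each block is compressed without reference to its neighbors, the compression ratio can be slightly lower than single-stream GZip, and larger blocks reduce that cost at the expense of less parallelism. This sketch also buffers everything in memory, which a streaming implementation would avoid.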

GZip does not normally write the compressed size of each block in its header, so finding the position of the next block requires decompressing the current one, precluding multithreaded decompression. Fortunately, GZip supports additional, custom fields known as EXTRA fields. When writing a compressed file, MiGz adds an EXTRA field with the compressed size of the block; this field will be ignored by other GZip decompressors, but MiGz uses it to determine the location of the next block without having to decompress the current block. By reading a block, handing it to another thread for decompression, reading the next block and repeating, MiGz is able to decompress the file in parallel.
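
The following sketch shows the idea of recording the compressed size in an EXTRA field, using the header layout from the GZip specification (RFC 1952). It is a simplified illustration, not MiGz's actual on-disk layout: the subfield ID bytes, the 4-byte size encoding, and all class and method names are assumptions made for this example, and memberOffsets only understands members produced by writeMember.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch; field IDs and layout are assumptions, not necessarily what MiGz writes.
final class ExtraFieldSketch {
  static final byte SI1 = 'M', SI2 = 'Z';       // assumed subfield ID for this example
  static final int HEADER_LEN = 10 + 2 + 4 + 4; // fixed header + XLEN + subfield header + size
  static final int FOOTER_LEN = 8;              // CRC32 + ISIZE

  /** Writes one GZip member whose EXTRA field records the compressed payload size. */
  static byte[] writeMember(byte[] data) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw DEFLATE
    deflater.setInput(data);
    deflater.finish();
    ByteArrayOutputStream body = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      body.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    CRC32 crc = new CRC32();
    crc.update(data);

    ByteBuffer out = ByteBuffer.allocate(HEADER_LEN + body.size() + FOOTER_LEN)
        .order(ByteOrder.LITTLE_ENDIAN);
    out.put((byte) 0x1f).put((byte) 0x8b).put((byte) 8).put((byte) 4); // magic, CM, FLG=FEXTRA
    out.putInt(0).put((byte) 0).put((byte) 0xff);                      // MTIME, XFL, OS=unknown
    out.putShort((short) 8);                                           // XLEN: one 8-byte subfield
    out.put(SI1).put(SI2).putShort((short) 4).putInt(body.size());     // ID, LEN, compressed size
    out.put(body.toByteArray());
    out.putInt((int) crc.getValue()).putInt(data.length);              // footer
    return out.array();
  }

  /** Finds the start of every member by reading sizes from headers, without inflating anything. */
  static List<Integer> memberOffsets(byte[] file) {
    ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
    List<Integer> offsets = new ArrayList<>();
    int pos = 0;
    while (pos < file.length) {
      offsets.add(pos);
      int compressedSize = in.getInt(pos + 16); // size sits after the 16 preceding header bytes
      pos += HEADER_LEN + compressedSize + FOOTER_LEN;
    }
    return offsets;
  }

  /** Standard single-threaded decompression; the JDK reader skips the EXTRA field. */
  static byte[] gunzip(byte[] file) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(file))) {
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With the offsets known up front, each member's byte range could be handed to a separate thread for decompression, while a reader unaware of the subfield still decodes the same bytes sequentially.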


