
.NET Reference Guide


Compressing a Very Large File

Last updated Mar 14, 2003.

Although the DeflateStream and GZipStream classes are very convenient to use, they can't handle more than four gigabytes of data in a single stream. If you want to compress a larger file, you have to do it in chunks.

Even if you're compressing a file that's smaller than four gigabytes, you may want to store it as a series of smaller sub files. One good example is an application log archive. The application log rotates every hour, say, but you want to store an entire day's logs in a single archive file. You could use an archive builder (like WinZip or a Windows compressed folder), but often it's enough to store the individual files sequentially as smaller chunks in one larger file.
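Here's a minimal sketch of that idea, assuming hypothetical hourly log file names; each file becomes one length-prefixed compressed chunk. A real archive header would also record each file's name and timestamp, but this shows the shape of the thing:

using System;
using System.IO;
using System.IO.Compression;

class LogArchiver
{
  // Append each log file to the archive as one chunk:
  // [Int32 compressed size][compressed bytes], with a zero size at the end.
  static void CombineLogs(string[] logFiles, string archiveName)
  {
    using (Stream strmOut = File.Create(archiveName))
    using (BinaryWriter writer = new BinaryWriter(strmOut))
    {
      foreach (string logFile in logFiles)
      {
        byte[] compressed = CompressFile(logFile);
        writer.Write(compressed.Length);
        writer.Write(compressed);
      }
      writer.Write((int)0); // end-of-file marker
    }
  }

  // Compress an entire file into a byte array.
  static byte[] CompressFile(string filename)
  {
    using (MemoryStream ms = new MemoryStream())
    {
      using (DeflateStream ds = new DeflateStream(ms, CompressionMode.Compress, true))
      using (Stream strmIn = File.OpenRead(filename))
      {
        byte[] buffer = new byte[64 * 1024];
        int bytesRead;
        while ((bytesRead = strmIn.Read(buffer, 0, buffer.Length)) != 0)
          ds.Write(buffer, 0, bytesRead);
      }
      return ms.ToArray();
    }
  }

  static void Main()
  {
    // Hypothetical hourly log names: log00.txt through log23.txt.
    string[] logs = new string[24];
    for (int i = 0; i < 24; i++)
      logs[i] = string.Format("log{0:D2}.txt", i);
    CombineLogs(logs, "daily.zzz");
  }
}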

In any case, you'll end up with a single file that consists of multiple chunks. Each chunk contains a header of some sort, followed by the compressed data. The block header can be arbitrarily complex, but at minimum it must contain the count of bytes that follow. The simplest file of multiple chunks would contain:

Number of block bytes
Block bytes
Number of block bytes
Block bytes
etc., etc.
end of file marker
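With that layout, a reader can walk the file using nothing but the length prefixes. Here's a minimal sketch that lists the chunk sizes in a hypothetical archive file without decompressing anything:

using System;
using System.IO;

class ChunkLister
{
  static void Main()
  {
    // "archive.zzz" is a hypothetical file in the chunk format above.
    using (BinaryReader reader = new BinaryReader(File.OpenRead("archive.zzz")))
    {
      int chunkNumber = 0;
      int size;
      // A block size of zero is the end-of-file marker.
      while ((size = reader.ReadInt32()) != 0)
      {
        Console.WriteLine("Chunk {0}: {1:N0} compressed bytes", ++chunkNumber, size);
        // Seek past the chunk body without reading it.
        reader.BaseStream.Seek(size, SeekOrigin.Current);
      }
    }
  }
}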

Writing a file of chunks

Let's play again with our friend Shakespeare's complete works. The file is only a little over five megabytes, but it will serve as an illustration. We're going to compress it, not in a single chunk, but in multiple 64-kilobyte (uncompressed) chunks. That should show the technique without taking too much time.

Compressing the file in chunks isn't hard, really, but it's a little more complicated than compressing it all through a single stream. Here's the basic algorithm:

Open input file
Open output file
While not end of input
  Read up to 64K bytes into input buffer
  Compress from input buffer to output buffer
  Write resulting compressed size to output file
  Write compressed buffer to output file
End While
Write end-of-file marker

That pseudo code results in this main program, which references three methods yet to be written: ReadInput, Compress, and Write.

using System;
using System.IO;
using System.IO.Compression;

class scomp
{
  const string inputFilename = "shaks12.txt";
  const string outputFilename = "outfile.zzz";

  const int bufferSize = 64 * 1024;

  static void Main(string[] args)
  {
    DateTime startTime = DateTime.Now;
    using (Stream strmIn = File.OpenRead(inputFilename))
    {
      using (Stream strmOut = File.Create(outputFilename))
      {
        using (BinaryWriter writer = new BinaryWriter(strmOut))
        {
          // The compressor writes each chunk into this fixed-size buffer
          // through a MemoryStream. Note that truly incompressible data can
          // expand slightly, so a production version should make the output
          // buffer a little larger than the input buffer.
          byte[] outputBuffer = new byte[bufferSize];
          using (MemoryStream ms = new MemoryStream(outputBuffer, true))
          {
            byte[] inputBuffer = new byte[bufferSize];
            int bytesRead;
            while ((bytesRead = ReadInput(inputBuffer, strmIn)) != 0)
            {
              int compressedSize = Compress(inputBuffer, bytesRead, ms);
              Write(outputBuffer, compressedSize, writer);
              // Rewind so the next chunk overwrites this one.
              ms.Position = 0;
            }
          }
          // A block size of zero marks the end of the file.
          writer.Write((int)0);
        }
      }
    }
    TimeSpan totalElapsed = DateTime.Now - startTime;
    Console.WriteLine("Completed in {0:N4} seconds", totalElapsed.TotalSeconds);
    Console.Write("Press Enter:");
    Console.ReadLine();
  }

Note that I created a BinaryWriter to write the output stream. If I were just writing blocks of bytes, the BinaryWriter wouldn't be necessary. But the program also writes the compressed size (a four-byte little-endian integer) before each block.

I've created separate methods for reading the uncompressed file, compressing a block, and writing the compressed block to the output file. The primary reason for doing so is to facilitate profiling. You'll see in a later section that there is ample room for optimization. The three processing methods, which complete the class, are shown here.

  // Read up to buff.Length bytes.
  // Returns the number of bytes read.
  static int ReadInput(byte[] buff, Stream s)
  {
    DateTime startTime = DateTime.Now;
    int bytesRead = s.Read(buff, 0, buff.Length);
    Console.WriteLine("Read {0:N0} bytes: {1:N0} ms", bytesRead, (DateTime.Now - startTime).TotalMilliseconds);
    return bytesRead;
  }

  // Compress the first nBytes from buff to the passed stream.
  // Returns the number of compressed bytes written to the stream.
  static int Compress(byte[] buff, int nBytes, Stream s)
  {
    DateTime startTime = DateTime.Now;
    // The third constructor argument (true) leaves the underlying stream
    // open when the DeflateStream is disposed, so the MemoryStream can be
    // reused for the next chunk. Disposing the DeflateStream is what
    // flushes the last of the compressed bytes to the stream.
    using (DeflateStream ds = new DeflateStream(s, CompressionMode.Compress, true))
    {
      ds.Write(buff, 0, nBytes);
    }
    s.Flush();
    int compressedSize = (int)s.Position;
    Console.WriteLine("Compress {0:N0} bytes: {1:N0} ms", nBytes, (DateTime.Now - startTime).TotalMilliseconds);
    return compressedSize;
  }

  // Write the block size, then nBytes from buff, to the passed BinaryWriter.
  static void Write(byte[] buff, int nBytes, BinaryWriter writer)
  {
    DateTime startTime = DateTime.Now;
    writer.Write(nBytes);
    writer.Write(buff, 0, nBytes);
    Console.WriteLine("Write {0:N0} bytes: {1:N0} ms", nBytes, (DateTime.Now - startTime).TotalMilliseconds);
  }
}

Here's the partial output from running the program on my 2.4 GHz Core 2 Quad:

Read 65,536 bytes: 0 ms
Compress 65,536 bytes: 0 ms
Write 28,544 bytes: 0 ms
Read 65,536 bytes: 0 ms
Compress 65,536 bytes: 0 ms
Write 28,657 bytes: 0 ms
Read 65,536 bytes: 0 ms
Compress 65,536 bytes: 0 ms
Write 28,633 bytes: 16 ms
...
...
Read 12,095 bytes: 0 ms
Compress 12,095 bytes: 0 ms
Write 6,157 bytes: 0 ms
Read 0 bytes: 0 ms
Completed in 0.6864 seconds
Press Enter:

The timings aren't very helpful here because the buffers are so small and the machine so fast that everything happens too quickly to measure. We'll come back to that later, when working with a larger file.

Uncompressing the file

Before I go any further with the compressor, let's make sure that I can reconstitute the file: read the compressed file, decompress the data, and write a duplicate of the original file.

The strategy for decompressing is exactly the reverse of compressing:

Open input file
Open output file
Do
  Read the compressed block size
  If block size != 0 Then
    Read the compressed bytes
    Decompress
    Write decompressed bytes
  End If
Until block size = 0

Once the compression program is written, the decompression program is almost trivial:

using System;
using System.IO;
using System.IO.Compression;

class sdecomp
{
  const string inputFilename = "outfile.zzz";
  const string outputFilename = "decomp.txt";

  const int bufferSize = 64 * 1024;
  static void Main(string[] args)
  {
    DateTime startTime = DateTime.Now;
    using (Stream strmIn = File.OpenRead(inputFilename))
    {
      using (BinaryReader reader = new BinaryReader(strmIn))
      {
        using (Stream strmOut = File.Create(outputFilename))
        {
          byte[] outputBuffer = new byte[bufferSize];
          byte[] inputBuffer = new byte[bufferSize];
          using (MemoryStream ms = new MemoryStream(inputBuffer, false))
          {
            int compressedSize = 0;
            while ((compressedSize = ReadInput(inputBuffer, reader)) != 0)
            {
              ms.Position = 0;
              int uncompressedSize = Uncompress(ms, outputBuffer);
              Write(outputBuffer, uncompressedSize, strmOut);
            }
          }
        }
      }
    }
    TimeSpan totalElapsed = DateTime.Now - startTime;
    Console.WriteLine("Completed in {0:N4} seconds", totalElapsed.TotalSeconds);
    Console.Write("Press Enter:");
    Console.ReadLine();
  }

  // Read the next block into the buffer.
  // Returns the size of the compressed block.
  private static int ReadInput(byte[] buff, BinaryReader reader)
  {
    DateTime startTime = DateTime.Now;
    int compressedSize = reader.ReadInt32();
    // Read can return fewer bytes than requested, so loop until the
    // entire compressed block is in the buffer.
    int totalRead = 0;
    while (totalRead < compressedSize)
    {
      int bytesRead = reader.Read(buff, totalRead, compressedSize - totalRead);
      if (bytesRead == 0)
        throw new EndOfStreamException("Unexpected end of compressed file.");
      totalRead += bytesRead;
    }
    Console.WriteLine("Read {0:N0} bytes: {1:N0} ms", 
      compressedSize, (DateTime.Now - startTime).TotalMilliseconds);
    return compressedSize;
  }

  // Uncompress one block from the passed stream into outputBuffer.
  // Returns the number of uncompressed bytes written to the output buffer.
  private static int Uncompress(Stream s, byte[] outputBuffer)
  {
    DateTime startTime = DateTime.Now;
    int uncompressedSize = 0;
    using (DeflateStream ds = new DeflateStream(s, CompressionMode.Decompress, true))
    {
      // DeflateStream.Read may return fewer bytes than requested, so keep
      // reading until the compressed stream is exhausted.
      int bytesRead;
      while ((bytesRead = ds.Read(outputBuffer, uncompressedSize,
        outputBuffer.Length - uncompressedSize)) != 0)
      {
        uncompressedSize += bytesRead;
      }
    }
    Console.WriteLine("Uncompress {0:N0} bytes: {1:N0} ms", uncompressedSize, 
      (DateTime.Now - startTime).TotalMilliseconds);
    return uncompressedSize;
  }

  // Write nBytes from buff to the stream.
  private static void Write(byte[] buff, int nBytes, Stream strmOut)
  {
    DateTime startTime = DateTime.Now;
    strmOut.Write(buff, 0, nBytes);
    Console.WriteLine("Write {0:N0} bytes: {1:N0} ms", nBytes,
      (DateTime.Now - startTime).TotalMilliseconds);
  }
}

You should run that program against the output from the compression program and compare the decompressed output with the original input. If the files aren't identical, something's wrong with one or both of the programs.
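A simple byte-by-byte comparison is enough for that check. Here's a minimal sketch, using the file names from the two programs above:

using System;
using System.IO;

class FileCompare
{
  static void Main()
  {
    using (Stream s1 = File.OpenRead("shaks12.txt"))
    using (Stream s2 = File.OpenRead("decomp.txt"))
    {
      if (s1.Length != s2.Length)
      {
        Console.WriteLine("Files differ in length.");
        return;
      }
      long position = 0;
      int b1;
      do
      {
        b1 = s1.ReadByte();
        int b2 = s2.ReadByte();
        if (b1 != b2)
        {
          Console.WriteLine("Files differ at offset {0:N0}.", position);
          return;
        }
        ++position;
      } while (b1 != -1);  // ReadByte returns -1 at end of file
      Console.WriteLine("Files are identical.");
    }
  }
}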

Once I confirmed that both programs worked as expected, I increased the buffer size from 64 kilobytes to 64 megabytes, which is a more reasonable chunk size when working with multi-gigabyte files. 64 megabytes is large enough that it doesn't require too many blocks (16 per gigabyte), but small enough that a single block can be read from the disk very quickly. In addition, the compressor (DeflateStream) can compress a 64-megabyte block in just a few seconds. It's a very convenient size.
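Because both programs take their chunk size from the same constant, that's a one-line change in each:

const int bufferSize = 64 * 1024 * 1024;  // 64 MB chunks

Keep in mind the caveat noted in the compressor's comments: with a fixed-size output buffer, a truly incompressible chunk could overflow it, so a production version should give the output buffer a little slack.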

These simple large-file compressor and uncompressor programs are surprisingly useful. They lack some niceties, like error detection and correction, and they don't have any kind of "archive" capability like the Zip format does, but I've found them very handy for compressing the large XML log files that we have to hang on to for a while.