This library contains many methods of data compression. They can be included in a program to provide fast, embedded compression of data files, for example. The library also provides features such as 16- and 32-bit CRC computation, as well as hash coding techniques.
The following compression techniques are provided:
At the moment, two of these techniques can compress data blocks instead of files: Lzw and Lzss. The drawback is, naturally, a memory cost. Their use is therefore restricted to programs that already handle large amounts of memory (say, more than 512Kb of data), and to selective compression within files (for example, compressing only a given block of data in a file, which can be useful too).
All file compression techniques have a similar interface. They take three arguments:
All file decompression techniques have a similar interface. They take two arguments:
The compression and decompression functions return a value of 0 on success, and a value of -1 on failure. Error messages are displayed if it's a standard error (I/O or memory allocation, for example). If compression is successful, the gain is written into the float value (between -1.0 and 1.0).
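As an illustration, here is a minimal sketch of a file compression call. The function name LzwFileEncode and its exact argument types are assumptions made for this example; only the 0/-1 return convention and the gain value come from the interface described above.

#include <stdio.h>
#include "crunch.h"

/* Hypothetical sketch: LzwFileEncode and its argument order are
   assumed; the documented parts are the return value (0 on success,
   -1 on failure) and the gain (between -1.0 and 1.0). */
int CompressFile(const char *src, const char *dst)
{
    FILE *in, *out;
    float gain;

    in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    if (LzwFileEncode(in, out, &gain) == -1) {
        /* A message has already been displayed for standard errors */
        fclose(in);
        fclose(out);
        return -1;
    }
    printf("Gain: %.1f%%\n", gain * 100.0);

    fclose(in);
    fclose(out);
    return 0;
}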
To identify a file, you can use the IdentifyMethod function. This function takes only one argument: a file descriptor. It returns the number of the method. Defines for the standard methods are in the crunch.h file. The following defines are made at the moment:
To check a file, you can use the CheckFile function. This function takes only one argument: a file descriptor. It checks whether a compressed file is correct, by verifying the header's CRC, the signature, and the length of the compressed file. If the file is correct, the method index is returned. If it is not, a negative number is returned, indicating the kind of error found. The following errors are reported:
There are two functions in the library that provide a simple way to compute a CRC value (on 16 or 32 bits): the CRC16 and CRC32 functions. They take two arguments:
There are two functions in the library that provide hash values for memory zones. They also take two arguments (see the CRC functions). The first hash function was taken from the "ANSI C" book by B. Kernighan and D. Ritchie. It returns a value between 0 and 100 (so you should expect a lot of collisions, as it is not very efficient) and is called SimpleHash. The other one is based on the Weinberger hash function and returns an unsigned long. It is not so far from a CRC computation (though it cannot detect errors). The simplest way to use this hash value is to take it modulo your array size. Here's an example.
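In this sketch, the name WeinbergerHash and its (pointer, length) argument list are assumptions, following the two-argument convention of the CRC functions; the real name may differ.

#include <string.h>
#include "crunch.h"

#define TABLE_SIZE 211  /* size of our array; a prime spreads keys a bit better */

/* Hypothetical sketch: WeinbergerHash is an assumed name for the
   Weinberger-based function, taking a memory zone and its length. */
unsigned long BucketFor(const char *key)
{
    unsigned long h;

    h = WeinbergerHash((byte *)key, strlen(key));
    return h % TABLE_SIZE;  /* fold the unsigned long into the array size */
}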
This part of the library provides compression/decompression of data blocks. This allows "on the fly" compression of data blocks, before sending them to a file or a socket, for example. The interface is a bit tricky, because of the number of parameters, as you will see.
Currently, only Lzw and Lzss provide block compression, because, unlike Huffman for example, they don't need a header.
All block compression/decompression functions share a generic interface. They take five parameters:
A context is a container for the variables used by the compression function. It contains data such as the dictionary, the number of bytes written, an index table, and so on. It should be allocated at the beginning of the compression sequence and freed at the end of it, because it is a big structure (it weighs 53Kb for Lzw and 78Kb for Lzss; this does not take into account the dynamic memory allocation of Lzw). Before the beginning of an encoding sequence, it must be initialized (by calling either LzwInitContext or LzssInitContext). The structure can be used for more than one call to the compression function, but this means that the same must be done when uncompressing. Otherwise, the dictionary won't be the same, and you will get really strange results (not the original data, anyway).
You have to keep track, for each compressed block, of the original block size (before compression). This is needed when you uncompress it, though you can pass a larger value for the output block size than the original (the function will only uncompress what was compressed).
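For decompression, the sequence is symmetric. In the sketch below, the name LzwBlockDecode and its parameter order are assumptions, simply mirroring LzwBlockEncode; the point to note is that the original size you kept for the block tells the function how big the output buffer is.

#include <stdlib.h>
#include "crunch.h"

/* Hypothetical sketch: LzwBlockDecode is an assumed name mirroring
   LzwBlockEncode; orig_size is the size recorded at compression time. */
byte *UncompressData(byte *compressed, dword comp_size, dword orig_size)
{
    LzwContext context;
    byte *data;
    dword out_size = orig_size;  /* may also be larger than the original size */

    data = malloc(orig_size);
    if (data == NULL)
        return NULL;

    /* The context must be initialized exactly as it was for encoding */
    LzwInitContext(&context);

    if (LzwBlockDecode(&context, compressed, comp_size,
                       data, &out_size) == -1) {
        free(data);
        return NULL;
    }
    return data;
}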
Let's take an example of what to do when you want to compress data. Suppose you have a huge structure, which we will call Foo, that you want to compress in memory. The following sequence can be used:
byte *CompressData(Foo *f, dword *final_size)
{
    LzwContext context;
    byte *compressed;

    /* Allocate the block that will contain compressed data */
    compressed = Malloc(sizeof(Foo));

    /* Init the context and size */
    LzwInitContext(&context);
    (*final_size) = sizeof(Foo);

    /* Compression */
    if (LzwBlockEncode(&context, (byte *)f, sizeof(Foo),
                       compressed, final_size) == -1) {
        /* Cannot compress: we free the allocated zone and return */
        free(compressed);
        return NULL;
    }

    /* Successful compression: we realloc the block */
    compressed = realloc(compressed, (*final_size));

    /* and we return a pointer to compressed data. */
    return compressed;
}

Now, as you can see, we need to allocate a block as big as the original structure to store the compression result. This means a heavy memory cost, especially if the Foo structure is big. To avoid this, it is possible to compress the structure by chunks of, let's say, 64Kb. This implies defining a Chunk structure, which will be chained and will contain the compressed data and the length of each compressed chunk; this reduces the memory overhead. And the context can conveniently be reused from one chunk to the next.
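To make this concrete, here is a sketch of such a chunked scheme. The Chunk structure, the CHUNK_SIZE value and the CompressByChunks helper are illustrative assumptions, not part of the library; only LzwContext, LzwInitContext and LzwBlockEncode come from it.

#include <stdlib.h>
#include "crunch.h"

#define CHUNK_SIZE (64 * 1024)  /* illustrative chunk size: 64Kb */

/* Hypothetical chained-chunk structure (not part of the library) */
typedef struct Chunk {
    byte *data;           /* compressed bytes for this chunk */
    dword comp_size;      /* size after compression */
    dword orig_size;      /* size before compression, kept for decoding */
    struct Chunk *next;
} Chunk;

Chunk *CompressByChunks(byte *src, dword total_size)
{
    LzwContext context;
    Chunk *head = NULL, **tail = &head;
    dword offset;

    /* One context for the whole sequence; the decoder will have to
       process the chunks in the same order with its own context. */
    LzwInitContext(&context);

    for (offset = 0; offset < total_size; offset += CHUNK_SIZE) {
        dword in_size = total_size - offset;
        Chunk *c;

        if (in_size > CHUNK_SIZE)
            in_size = CHUNK_SIZE;

        c = malloc(sizeof(Chunk));
        if (c == NULL)
            return head;  /* simplistic error handling for the sketch */

        c->data = malloc(in_size);  /* at most as big as the original chunk */
        c->orig_size = in_size;
        c->comp_size = in_size;
        c->next = NULL;

        if (c->data == NULL ||
            LzwBlockEncode(&context, src + offset, in_size,
                           c->data, &c->comp_size) == -1) {
            free(c->data);
            free(c);
            return head;
        }

        *tail = c;
        tail = &c->next;
    }
    return head;
}

Each chunk keeps its original size, so the matching decoder can be fed the right output size, chunk by chunk, with a context initialized the same way.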