It's no secret to enterprise data storage managers that data compression technologies can help reduce their data footprint. But how much do you know about actually achieving the types of compression ratios that will provide significant data-reduction benefits to your environment? In this SearchStorage.com interview, W. Curtis Preston, independent analyst and executive editor in TechTarget's Storage Media Group, discusses the basics of data compression – one of the most popular search terms on SearchStorage.com this year.
Listen to the MP3 or read the transcript below to learn about the data compression technology market, how compression differs from data deduplication and more.
Table of contents:
1. Defining data compression
2. How is compression accomplished?
3. The (lack of) evolution in compression technology
4. Good and bad data for compression
5. Data deduplication vs. compression
6. Defining a compression ratio
The best definition is the simplest definition -- a variety of technologies that take a look at a file or stream of data and try to find redundant parts of that data and remove those redundancies. That wasn't too simple, but the definition needs to be somewhat complex in order to differentiate compression from some other technologies.
Basically, compression tools try to find common pieces of data that they can get rid of, shrink, remove or substitute with smaller patterns. The more of those they can find, the more they can compress. A perfect example of how that works is a file you create for the purpose of making an Oracle database. Once you've done that, if you compress the file you'll get crazy compression ratios because the file consists entirely of zeroes. The compressor basically says, "I've got 7,000 zeroes," and it only takes a couple of bytes to say that. The more redundant data the tool can find in the file, the better it's going to compress the file.
Compression works in two places. One area is with tools like WinZip or the compress command in Unix, which you run against a file to turn it into a file.zip or file.Z. Compression also runs in a backup stream and in a tape drive. In that case, it looks at the data as it comes in, tries to find sets of data that are next to each other and similar, and tries to get rid of the redundancy.
In the case of tape drive compression, it's very important to note that because this is done in hardware and at line speed, it not only compresses the data but actually makes the tape drive faster: compressing the data means less data needs to be physically written to tape.
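The all-zeroes example can be tried with a minimal sketch like the one below. It uses Python's standard `zlib` module as a stand-in for whatever tool you actually run; the 7,000-byte buffer is only an illustration of an empty, preallocated file, not a real Oracle data file.

```python
import zlib

# 7,000 zero bytes, standing in for a freshly created, still-empty data file.
zeros = bytes(7000)
compressed = zlib.compress(zeros)

# The original must be fully recoverable from the compressed form.
assert zlib.decompress(compressed) == zeros

print(len(zeros), "bytes in,", len(compressed), "bytes out")
print(f"ratio: {len(zeros) / len(compressed):.0f}:1")
```

Real files are rarely this redundant, so a ratio like this says more about the data than about the tool.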
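The speed-up can be estimated with simple arithmetic, sketched below. The function name and the 120 MB/s native rate are hypothetical, and the sketch assumes the compression hardware keeps up at line speed and the host can supply data fast enough.

```python
def effective_rate_mb_s(native_rate_mb_s: float, compression_ratio: float) -> float:
    """Logical data accepted per second while the drive writes at its native rate."""
    return native_rate_mb_s * compression_ratio

# Hypothetical drive: 120 MB/s native, with data that compresses at 1.5:1.
print(effective_rate_mb_s(120.0, 1.5))  # 180.0
```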
In the 15 years or so that I've been dealing with compression it really hasn't changed that much. There have been other algorithms and other commands that have come out and tried to do a better job, like gzip in the open source world. But essentially the compress command in Unix and the tape drives that are compressing today, for the most part, aren't that much different than they have been for 15 or 20 years. They may be a little bit faster and may compress the data a little bit more, but generally they're the same.
There are types of data that are good for compression and some that aren't. Generally there are data formats that are pre-compressed; TIFF comes to mind. There are a number of image file types that are pre-compressed. So if the data is already compressed, there's not much of a point in running compression against it.
There is a concept in the compression community that if you have a file that's already compressed by one compression algorithm and is stored on the file system as a .zip file, and you then back that file up to another device that has compression, the file can actually get bigger. I've never tested to see how true that actually is, but I don't think that's something people need to worry terribly about. There are some people that worry too much that they're actually going to make their files bigger.
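This is easy to check on a small scale. The sketch below is an assumption-laden test, not a measurement of any real backup device: it uses Python's `zlib` for both passes, and seeded pseudo-random bytes stand in for an already-compressed file because they have almost no redundancy left to remove.

```python
import random
import zlib

# Seeded pseudo-random bytes stand in for data that is already compressed.
rng = random.Random(0)
raw = bytes(rng.getrandbits(8) for _ in range(100_000))

once = zlib.compress(raw)    # first pass over incompressible data
twice = zlib.compress(once)  # second pass over the first pass's output

print("original:        ", len(raw))
print("compressed once: ", len(once))
print("compressed twice:", len(twice))
```

Each pass adds a small amount of framing overhead, so the growth is typically a tiny fraction of the file size, which fits the view that it isn't worth worrying about.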
I would say dedupe is compression over time. Compression is looking at a single instance of data to find data that is the same as other parts of the data that can be replaced with pointers. Deduplication is doing that but also comparing it to similar data that we've seen before. So it's not only looking in the file for things it can get rid of, but it's looking in that file for parts of the file or backup stream that have been seen by the dedupe device before.
There are some bloggers out there that have simply said that dedupe is compression. In the beginning, for a definition of compression, I could have said anything that makes data smaller. I don't like that definition. Compression is a very specific technology that works on a specific instance of data. Dedupe works on data over time. The more data, the more repetitive data that you send to a dedupe device, the better your dedupe ratio will actually get over time.
Your dedupe ratio when you first start using a dedupe device is actually very disappointing: 4:1 or 5:1. But as you add data to it over time, the dedupe ratio can go up to 10:1, 20:1 or even higher depending on your data. But with compression, you're not comparing today's files to yesterday's files, so you only get so much compression.
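The way a dedupe ratio climbs over time can be illustrated with a toy model. Everything in the sketch below is hypothetical: the `ToyDedupeStore` class, the fixed 4 KB chunk size and the SHA-256 chunk index are simplifications chosen for illustration, and real dedupe products use their own chunking and indexing methods. Because the sample data is random, the first backup here dedupes at 1:1 rather than the 4:1 or 5:1 a real environment might see.

```python
import hashlib
import random

CHUNK = 4096  # hypothetical fixed chunk size


class ToyDedupeStore:
    def __init__(self):
        self.chunks = {}        # SHA-256 digest -> chunk bytes, stored once
        self.logical_bytes = 0  # everything ever sent to the store

    def backup(self, data: bytes) -> None:
        self.logical_bytes += len(data)
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            # Only chunks the store has never seen before take up new space.
            self.chunks.setdefault(hashlib.sha256(chunk).digest(), chunk)

    def ratio(self) -> float:
        stored = sum(len(c) for c in self.chunks.values())
        return self.logical_bytes / stored


rng = random.Random(0)
data = bytearray(rng.getrandbits(8) for _ in range(100 * CHUNK))

store = ToyDedupeStore()
ratios = []
for day in range(1, 11):
    # Each day, overwrite one chunk's worth of data, then back up everything.
    start = (day - 1) * CHUNK
    data[start:start + CHUNK] = bytes(rng.getrandbits(8) for _ in range(CHUNK))
    store.backup(bytes(data))
    ratios.append(store.ratio())
    print(f"day {day:2d}: {ratios[-1]:.1f}:1")
```

A compressor run on each day's backup separately would never see the previous days' data, so it has nothing comparable to gain from repetition across backups.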
It's also important to note that almost all the dedupe systems of which I'm aware do both. They use traditional compression algorithms and they use dedupe technology to compare today's data to previous data. Generally, dedupe is done first and then compression, because dedupe works at a macro level and compression works at a micro level.
A compression ratio is pretty simple. A 2:1 ratio means that you took a 10 MB file and it now only takes up 5 MB of space. That is the general number that is thrown out there by most people that are using the Lempel-Ziv compression algorithm, which is based on stuff that's been around for eons. Generally, very few people get that. The number that I stick with when I'm estimating is 1.5:1. I hear claims of higher than that -- I've seen 3:1 in a customer's environment -- but rarely do I see 2:1 on a regular basis. I generally see 1.5:1, which means we took 6 MB and it now takes up 4 MB.
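The ratio arithmetic can be written out as a short sketch. The function names and the 300 GB planning figure are hypothetical and used only for illustration; the 10 MB, 5 MB, 6 MB and 4 MB values come from the examples above.

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Ratio of original size to compressed size, in the same units."""
    return original_size / compressed_size


def estimated_size(original_size: float, ratio: float) -> float:
    """Space the data is expected to take after compression at a given ratio."""
    return original_size / ratio


def format_ratio(ratio: float) -> str:
    return f"{ratio:g}:1"


print(format_ratio(compression_ratio(10, 5)))  # 2:1
print(format_ratio(compression_ratio(6, 4)))   # 1.5:1

# Hypothetical planning estimate: 300 GB of data at the conservative 1.5:1.
print(estimated_size(300, 1.5))  # 200.0
```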
This was first published in December 2009