Special Report

Ocarina ECOsystem deconstructs before compression, deduplication for primary storage data reduction

Ocarina Networks Inc. claims that its primary data storage-targeted ECOsystem appliance produces data reduction of up to 85% on Microsoft Office documents, PDFs and virtual machine files and 40% or more on images.

But the competition contends the savings come at a performance price that some users may be unwilling to pay.

Ocarina says judging the performance impact is a complex matter and depends, to some degree, on how a customer uses the product. The appliance offloads CPU work from the filer and performs data deduplication and compression on a post-process basis, according to Mike Davis, senior director of marketing at Ocarina.

"The main impact to consider is not CPU overhead but marginal delay for decompressing files," Davis wrote in an email. "This does take 'milliseconds,' which matters for transactional applications but not for human reads and Web services where we are generally not noticeable."

The ECO in ECOsystem stands for extract, correlate and optimize. In the first step, the software identifies the file type and then extracts, or decompresses, the file to get to the zeros and ones that represent its richest expression. A compound document (such as a PDF with an embedded image file) may require multiple levels of recursive decompression.
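
The extract step can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not Ocarina's code: the function name extract, the max_depth limit and the two container formats handled (gzip and zip, detected by their leading magic bytes) are choices made for this example.

```python
import gzip
import io
import zipfile
import zlib


def extract(data: bytes, depth: int = 0, max_depth: int = 8) -> list:
    """Recursively unwrap known containers and return the raw leaf payloads.

    Hypothetical helper: a real product would recognize far more formats.
    """
    if depth >= max_depth:
        return [data]
    # gzip streams begin with the magic bytes 1f 8b
    if data[:2] == b"\x1f\x8b":
        try:
            return extract(gzip.decompress(data), depth + 1, max_depth)
        except (OSError, EOFError, zlib.error):
            return [data]  # looked like gzip but was not; keep as-is
    # zip archives begin with "PK\x03\x04"
    if data[:4] == b"PK\x03\x04":
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                leaves = []
                for name in zf.namelist():
                    # each member may itself be a compressed container
                    leaves.extend(extract(zf.read(name), depth + 1, max_depth))
                return leaves
        except zipfile.BadZipFile:
            return [data]
    return [data]
```

Each level of nesting costs one recursive call, which is the "multiple levels of recursive decompression" a compound document may need.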

"Really, this extraction is the thing that makes us different than everybody else," said Carter George, vice president of products at Ocarina. He claimed duplicates are often obscured because many file types have already been compressed. By decompressing files, duplicates are exposed.

While unraveling the files, ECOsystem attempts to identify natural object boundaries, such as a section of text, a graphic or a photo. For instance, it might take the unique hash of the whole photo, rather than looking for duplicate 4 KB chunks at the block level.

In the correlation (or data deduplication) step, the system removes the duplicates and directs pointers to the matching parts.
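
A minimal Python sketch shows the idea of hashing whole objects and keeping pointers to a single stored copy. The ObjectStore class, its method names and the use of SHA-256 are assumptions made for illustration; the article does not say which hash or data structures ECOsystem uses.

```python
import hashlib


class ObjectStore:
    """Toy object-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.objects = {}  # digest -> bytes, each unique object stored once
        self.files = {}    # file name -> list of digests acting as pointers

    def put(self, name, parts):
        """Store a file given as a list of natural objects (text, photo, ...)."""
        pointers = []
        for part in parts:
            digest = hashlib.sha256(part).hexdigest()
            # keep the bytes only if this object has not been seen before
            self.objects.setdefault(digest, part)
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name):
        """Rebuild a file's objects by following its pointers."""
        return [self.objects[d] for d in self.files[name]]

    def stored_bytes(self):
        return sum(len(obj) for obj in self.objects.values())
```

If two documents embed the same photo, the photo's bytes are kept once and both files point at that copy.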

"By keeping those things together as natural objects, we get to the compression stage and you've already taken out the dupes," George said . "You can still get more space savings by applying compressors to the things that are left."

ECOsystem has approximately 125 compressors for the optimization step. Some are standard compressors based on the pioneering work of Abraham Lempel and Jacob Ziv. Others are proprietary compressors developed by Ocarina's research team of mathematicians for specific file types, such as seismic or genomic data.

"The more you know about what kinds of patterns are going to show up in a file, the more specialized the compressor you can build," George said. "There's whole classes of data where you will get zero to 10% data reduction with dedupe, but with a good compressor, you can get 50%, 60%, 70%, 80% reduction."

Customers who want maximum performance might opt to turn on deduplication and turn off compression. That works well for VMware VMDK files and might be beneficial for other primary storage scenarios, according to George.

But George estimated that 80% of online or near-line storage is not especially hot, or active, so customers might elect to use fast generic compression. A third option is available for archived colder storage, using the data-specific compressors. The system looks inside each object to figure out the data type and pick the best-fit compressor.
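
The three options George describes can be sketched in Python as a simple tiering function. Everything named here is an assumption for illustration: the tier labels "hot", "warm" and "cold", the function names, and the standard-library compressors standing in for Ocarina's proprietary, data-specific ones.

```python
import bz2
import lzma
import zlib


def sniff_type(data: bytes) -> str:
    """Guess the data type from leading magic bytes (only two types shown)."""
    if data[:4] == b"%PDF":
        return "pdf"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "generic"


# Stand-ins only: real data-specific compressors would be purpose-built.
SPECIALIZED = {"pdf": lzma.compress, "png": bz2.compress}


def choose_and_compress(data: bytes, tier: str):
    """Return (label, payload) according to how active the data is."""
    if tier == "hot":
        return "none", data                          # compression turned off
    if tier == "warm":
        return "zlib-fast", zlib.compress(data, 1)   # fast generic compression
    kind = sniff_type(data)                          # cold: look inside the object
    compressor = SPECIALIZED.get(kind)
    if compressor is None:
        return "zlib-best", zlib.compress(data, 9)
    return kind, compressor(data)
```

The design point is the trade-off George outlines: the colder the data, the more CPU time it is reasonable to spend on picking and running a better-fitting compressor.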

"You get sets of knobs and dials to pick what you want," George said. He advises customers to consider applying policies, such as only deduping files that are older than 10 days or haven't changed in 30 days, to minimize the performance impact.

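Age-based policies of the kind George mentions reduce to a timestamp comparison. The Python sketch below is hypothetical: the Policy class and its field names are invented for this example and do not reflect ECOsystem's actual configuration interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Policy:
    """Hypothetical policy: every threshold that is set must be exceeded."""
    min_age_days: Optional[int] = None    # e.g. 10: file must be older than this
    min_idle_days: Optional[int] = None   # e.g. 30: file must be unchanged this long

    def allows(self, created: datetime, modified: datetime, now: datetime) -> bool:
        if self.min_age_days is not None:
            if now - created <= timedelta(days=self.min_age_days):
                return False  # file is too new
        if self.min_idle_days is not None:
            if now - modified <= timedelta(days=self.min_idle_days):
                return False  # file changed too recently
        return True
```

A file created 30 days ago and edited last week would pass Policy(min_age_days=10) but not Policy(min_idle_days=30), so recently touched data is left alone.
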
Like some other primary storage offerings, Ocarina's product works on a post-process basis, waiting for a file to land on disk before deduplicating it. But unlike the others, ECOsystem employs what George calls a "sliding window," or variable-block approach, to compare the zeros and ones on the block to find duplicates.
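
The article does not detail how the "sliding window" works. One common way to get variable-size blocks is content-defined chunking with a rolling hash, sketched below in Python. This is an assumption about the general technique, not a description of Ocarina's algorithm, and the function name and all parameter values are illustrative.

```python
def chunk(data: bytes, window: int = 16, mask: int = 0x3F,
          min_size: int = 32, max_size: int = 1024) -> list:
    """Split data where a rolling hash of the last `window` bytes hits a pattern.

    Assumes min_size >= window, so every cut decision sees a full window.
    """
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:
            # slide the window: drop the oldest byte's contribution
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i - start + 1
        # cut where the content says so, or when the chunk grows too large
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because cut points depend on the bytes themselves rather than on fixed offsets, inserting a byte near the start of a file tends to change only the nearby chunks, whereas fixed-size blocks would all shift and stop matching.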

ECOsystem works only with network-attached storage (NAS) filers today, but George said one partner issued a request for block storage. Other future plans for the product include an embeddable edition for NAS vendors, a direct-attached storage (DAS) option and a port for Windows servers.

Although George claims ECOsystem is 100% for the primary enterprise data storage market, not all of its customers choose to use the product that way. Saker Klippsten, head of engineering at Zoic Studios Inc. in Culver City, Calif., said his company uses the technology for secondary storage of reusable assets, such as stock film footage.

Klippsten said Zoic has realized 40% to 65% data reduction with Ocarina's ECOsystem and has no plans to use it for primary storage. "It takes time to decompress and read the files, and we want to access them in real-time," he said.

This was first published in December 2009