Special Report: Data Deduplication

Data deduplication technology review

06 Oct 2008 | Antony Adshead, UK bureau chief, SearchStorage.co.UK

What is data deduplication?

Data deduplication reduces the amount of data that needs to be physically stored by eliminating redundant information and replacing subsequent iterations of it with a pointer to the original.

Data deduplication products inspect data down to block- and bit-level and, after the initial occurrence, only the changed data they find is saved. The rest is discarded and replaced with a pointer to the previously saved information. Block- and bit-level deduplication methods are able to achieve compression ratios of 20x to 60x, or even higher, under the right conditions.
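
As a rough illustration of the block-level approach described above, the following Python sketch splits data into fixed-size blocks, stores each distinct block once and records every occurrence as a pointer. The 4-byte block size, the function names and the use of list positions as pointers are assumptions made for the example, not details of any product.

```python
def dedupe_blocks(data: bytes, block_size: int = 4):
    """Store each distinct block once; describe the data as a list of pointers."""
    store = []       # unique blocks, in order of first occurrence
    positions = {}   # block -> its index in store
    pointers = []    # one store index per block of the input
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if block not in positions:
            positions[block] = len(store)
            store.append(block)
        pointers.append(positions[block])
    return store, pointers


def rebuild(store, pointers) -> bytes:
    """Reassemble the original data by following the pointers."""
    return b"".join(store[i] for i in pointers)


data = b"AAAABBBBAAAAAAAACCCCBBBB"
store, pointers = dedupe_blocks(data)
print(store)     # [b'AAAA', b'BBBB', b'CCCC']
print(pointers)  # [0, 1, 0, 0, 2, 1]
```

Six blocks of input are held as three stored blocks plus six small pointers, and `rebuild(store, pointers)` returns the original bytes.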

There is also file-level deduplication, called single-instance storage. In file-level deduplication, if two files are identical, one copy of the file is kept while subsequent iterations are not stored. File-level deduplication is not as efficient as block- and bit-level deduplication because even a single changed bit results in a new copy of the whole file being stored. For the purposes of this Special Report, data deduplication is defined as operating at block and bit level.
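
The gap between file-level and block-level deduplication can be sketched in a few lines of Python. The example below is a simplified model, assuming a 4,096-byte block size and SHA-1 fingerprints; the function names and the test files are invented for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the example


def file_level_stored(files):
    """Bytes stored when only whole, identical files are collapsed."""
    seen = {}
    for content in files:
        seen.setdefault(hashlib.sha1(content).digest(), len(content))
    return sum(seen.values())


def block_level_stored(files):
    """Bytes stored when identical blocks are collapsed across all files."""
    seen = {}
    for content in files:
        for offset in range(0, len(content), BLOCK_SIZE):
            block = content[offset:offset + BLOCK_SIZE]
            seen.setdefault(hashlib.sha1(block).digest(), len(block))
    return sum(seen.values())


# A 64 KiB file made of 16 distinct blocks, an exact copy of it,
# and a third version that differs by a single bit.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(16))
edited = bytearray(original)
edited[0] ^= 1
edited = bytes(edited)

files = [original, original, edited]
print(file_level_stored(files))   # 131072
print(block_level_stored(files))  # 69632
```

File-level deduplication removes the exact copy but stores the edited file in full, while the block-level pass stores only the one block that changed.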

What practical benefits does data deduplication have?

Data deduplication's killer app is in backup. It demands too much processor power to be used in primary storage applications.

Data deduplication reduces the amount of data that has to be stored. This means that less media has to be bought and it takes longer to fill up disk and tape. Data can be backed up more quickly to disk, which means shorter backup windows and quicker restores. A reduction in the amount of space taken up in disk systems, VTLs for example, means longer retention periods are possible, bringing quicker restores to end users direct from disk and reducing dependence on tape and its management. Less data also means less bandwidth taken up, which means data deduplication can also speed up remote backup, replication and disaster recovery processes.

What deduplication ratios can be achieved?

Deduplication ratios vary greatly, according to the type of data being processed and over what period. Data that contains lots of repeated information, such as databases or email, will bring the highest levels of deduplication, with ratios in excess of 30x, 40x or 50x possible in the right circumstances. By the same token, data that contains lots of unique information, such as image files or financial ticker tape, will not contain a great deal of redundancy that can be eliminated.
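
A deduplication ratio is simply the amount of data sent to the system divided by the amount physically stored. The Python sketch below measures it for two invented workloads: 30 nightly full backups in which one block in 100 changes each night, and 30 backups of entirely unique data. The change rate, the block size and the random test data are assumptions chosen for illustration, not measurements.

```python
import hashlib
import os

BLOCK = 4096  # assumed block size


def dedupe_ratio(streams):
    """Return logical bytes received divided by unique bytes stored."""
    logical = 0
    stored = 0
    seen = set()
    for stream in streams:
        for offset in range(0, len(stream), BLOCK):
            block = stream[offset:offset + BLOCK]
            logical += len(block)
            digest = hashlib.sha1(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return logical / stored


# Workload 1: 30 nightly fulls of a 100-block data set, one block changing per night.
current = bytearray(os.urandom(100 * BLOCK))
backups = []
for night in range(30):
    current[night * BLOCK:(night + 1) * BLOCK] = os.urandom(BLOCK)
    backups.append(bytes(current))

# Workload 2: 30 backups with no repeated content at all.
unique_backups = [os.urandom(100 * BLOCK) for _ in range(30)]

repetitive = dedupe_ratio(backups)
unique_data = dedupe_ratio(unique_backups)
print(round(repetitive, 1))  # 23.3
print(unique_data)           # 1.0
```

The first workload stores 129 blocks for 3,000 received, hence roughly 23x. The second stores everything it receives, so there is no reduction.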

What are the advantages of hardware-based deduplication versus software dedupe?

Purpose-built deduplication appliances relieve the processing burden associated with software-based data deduplication products. The hardware-based deduplication offerings can also incorporate deduplication into other types of data protection hardware, such as backup appliances, VTLs and NAS.

While software-based deduplication usually eliminates redundancy in data at its source, hardware-based deduplication emphasises data reduction at the storage subsystem. For this reason, hardware-based deduplication may not bring the bandwidth savings that might be gained by deduplicating at source, but compression levels are generally better.

Hardware-based data deduplication brings high performance, scalability and relatively nondisruptive deployment. It is best suited to enterprise-class deployments rather than SME or remote office applications.

Software-based deduplication is typically less expensive to deploy than dedicated hardware and should require no significant changes to the physical network. But software-based deduplication can be more disruptive to install and more difficult to maintain. Lightweight agents are sometimes required on each host system to be backed up, allowing it to communicate with a backup server running the same software. The software will need updating as new versions become available or as each host's operating environment changes over time. Deduplication at the source is also processing-intensive so the host backup server must be configured for the job.

How does inline differ from post-process?

Data deduplication can be carried out inline or post-process. Inline (or in-band) data deduplication removes redundant data as it is being written to media. The advantage of the inline method is that data passes through only once, being taken in and digested simultaneously. But because data is processed as it is written, throughput can slow, and the additional processing may extend the backup window.

Post-process (or out-of-band) data deduplication is carried out after data has been written to disk. This method does not affect the backup window and sidesteps CPU processing that might create a bottleneck between the backup server and the storage. Post-process deduplication uses more disk space during the data deduplication process because data is ingested then deduplicated. Disk contention is another possible issue with disk performance potentially affected as users attempt to access storage during the deduplication process.
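
The disk space trade-off between the two methods can be modelled roughly as follows. This Python sketch is a simplification that assumes fixed-size blocks and SHA-1 fingerprints, ignores timing altogether and only tracks how much capacity each approach needs; the function names are invented for the example.

```python
import hashlib


def inline_ingest(blocks):
    """Deduplicate each block before it is written; return (stored, peak) bytes."""
    index = set()
    stored = peak = 0
    for block in blocks:
        digest = hashlib.sha1(block).digest()  # hashing sits in the write path
        if digest not in index:
            index.add(digest)
            stored += len(block)
            peak = max(peak, stored)
    return stored, peak


def post_process_ingest(blocks):
    """Write everything first, deduplicate afterwards; return (stored, peak) bytes."""
    landing = list(blocks)                    # full backup lands on disk untouched
    peak = sum(len(b) for b in landing)       # capacity needed before dedupe runs
    unique = {hashlib.sha1(b).digest(): len(b) for b in landing}
    stored = sum(unique.values())
    return stored, peak


blocks = [b"A" * 4096, b"B" * 4096] * 50    # 100 blocks, only 2 distinct
print(inline_ingest(blocks))        # (8192, 8192)
print(post_process_ingest(blocks))  # (8192, 409600)
```

Both end up storing the same two blocks, but the post-process path needs room for the whole backup before deduplication begins, while the inline path pays for its hashing during the write.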

It is recommended that you not only test the different deduplication methods to determine how they work in your environment, but also test them against backups of differing size, data types and numbers of streams.

How do deduplication products eliminate redundant data?

Deduplication systems use a variety of methods to eliminate redundant data by inspecting data down to bit level and determining whether it has been stored before.

Hash-based algorithms

Hash-based methods of redundancy elimination process each piece of data using a hash algorithm, such as SHA-1 or MD5. The algorithm generates a number for each piece of data that is, for practical purposes, unique, and this is compared to an index of existing hash numbers. If the hash number already exists in the index, the data need not be stored again. Otherwise, the new hash number is added to the index and the data is stored.

  • SHA-1 was originally devised to create cryptographic signatures for security applications. SHA-1 creates a 160-bit value that is statistically unique for each piece of data.
  • MD5 is a 128-bit hash that was also designed for cryptographic purposes.
  • Hash collisions occur when two different chunks produce the same hash. The chances of this are very slim indeed, but SHA-1 is considered the more secure of the two algorithms.
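
The index lookup described above might look something like this in Python, using the SHA-1 implementation in the standard `hashlib` module. The `HashIndexStore` class and its in-memory dictionary are assumptions for illustration; a real product would keep its index on disk and at far larger scale.

```python
import hashlib


class HashIndexStore:
    """Minimal hash-indexed chunk store (in-memory, for illustration only)."""

    def __init__(self):
        self.index = {}  # SHA-1 hex digest -> stored chunk

    def put(self, chunk: bytes):
        """Store the chunk unless its hash is already indexed.

        Returns (digest, is_new).
        """
        digest = hashlib.sha1(chunk).hexdigest()
        is_new = digest not in self.index
        if is_new:
            self.index[digest] = chunk
        return digest, is_new

    def get(self, digest: str) -> bytes:
        return self.index[digest]


store = HashIndexStore()
first = store.put(b"quarterly report")
second = store.put(b"quarterly report")   # same content, same hash
third = store.put(b"annual report")

print(first[1], second[1], third[1])      # True False True
print(len(store.index))                   # 2
print(hashlib.sha1().digest_size * 8)     # 160
```

Note that this sketch trusts the hash completely: if two different chunks ever produced the same digest, the second would be treated as a duplicate and lost.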

Bit-level comparison

The best way to compare two chunks of data is to perform a bit-level comparison on the two blocks. The cost involved in doing this is the I/O required to read and compare them.
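
A bit-level comparison is straightforward to write; the expense lies in having to read both copies in full. The Python sketch below compares two binary streams buffer by buffer. The function name and buffer size are assumptions, and it expects file-like objects that return full buffers until end of data, as regular files and `io.BytesIO` do.

```python
import io


def streams_identical(a, b, buf_size=64 * 1024):
    """Compare two binary streams byte for byte.

    Stops at the first difference; identical streams are read to the end.
    """
    while True:
        chunk_a = a.read(buf_size)
        chunk_b = b.read(buf_size)
        if chunk_a != chunk_b:
            return False          # contents or lengths differ
        if not chunk_a:
            return True           # both streams ended together


block = b"\x00" * 200_000
changed = block[:-1] + b"\x01"    # differs only in the final byte

print(streams_identical(io.BytesIO(block), io.BytesIO(block)))    # True
print(streams_identical(io.BytesIO(block), io.BytesIO(changed)))  # False
```

In the second case the difference sits in the last byte, so almost all of both streams has to be read before the answer is known.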

Custom methods

Some vendors use custom methods to identify duplicate data, such as their own hash algorithm combined with other methods. For instance, Diligent and Sepaton use a custom method to identify redundancy and follow that with bit-level comparison.
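
The general idea of pairing a quick fingerprint with a confirming comparison can be sketched as below. This is not how Diligent or Sepaton implement their products, whose methods are proprietary; the example simply assumes a CRC-32 checksum (from Python's `zlib` module) to find candidate matches and a byte-for-byte comparison to confirm them.

```python
import zlib


class VerifiedStore:
    """Cheap checksum to find candidates, full comparison to confirm."""

    def __init__(self):
        self.candidates = {}   # CRC-32 value -> chunks stored with that checksum
        self.comparisons = 0   # number of full comparisons performed

    def put(self, chunk: bytes) -> bool:
        """Return True if the chunk was stored, False if it was a duplicate."""
        bucket = self.candidates.setdefault(zlib.crc32(chunk), [])
        for stored in bucket:
            self.comparisons += 1
            if stored == chunk:      # byte-for-byte confirmation
                return False
        bucket.append(chunk)
        return True


store = VerifiedStore()
print(store.put(b"x" * 100))   # True
print(store.put(b"x" * 100))   # False
print(store.put(b"y" * 100))   # True
```

Because every match is confirmed against the stored data, a checksum collision costs an extra comparison rather than a lost chunk.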

What is the difference between source deduplication and target deduplication?

Data can be deduplicated at the target or source. Deduplicating at the target means you can use your current backup software and the backup system operates as usual. The target identifies and eliminates redundant data sent by the backup system.

Deduplication at the source involves installing backup client software from the deduplication vendor. The client communicates with a backup server running the same software and if the client and server agree that data has already been stored it is not sent, saving disk space and network bandwidth.
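
The exchange between client and server can be pictured with a short Python sketch. The `BackupServer` class, its `missing` and `upload` methods and the sample data are all hypothetical; the model counts only the chunk data sent and ignores the small cost of exchanging hashes.

```python
import hashlib


class BackupServer:
    """Hypothetical backup server holding chunks keyed by SHA-1 digest."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        """Tell the client which of its digests the server has not stored."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, chunk):
        self.chunks[digest] = chunk


def backup_from_source(chunks, server):
    """Send only chunks the server lacks; return chunk bytes sent."""
    by_digest = {hashlib.sha1(c).hexdigest(): c for c in chunks}
    sent = 0
    for digest in server.missing(list(by_digest)):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


server = BackupServer()
monday = [b"os-image" * 512, b"mailbox-v1" * 400]
tuesday = [b"os-image" * 512, b"mailbox-v2" * 400]

monday_sent = backup_from_source(monday, server)
tuesday_sent = backup_from_source(tuesday, server)
print(monday_sent)   # 8096
print(tuesday_sent)  # 4000
```

On the second night the unchanged chunk never crosses the network, which is where the bandwidth saving comes from.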

How does a deduplication device record the existence of redundant data?

Once a deduplication device has identified a redundant piece of data, it has to decide how to record its existence. There are two ways it can do so.

  1. Reverse referencing, which creates a pointer to the original instance of the data when additional identical pieces of data occur.
  2. Forward referencing, which writes the latest version of the piece of data to the system, then makes the previous occurrence a pointer to the most recent.

Some vendors argue that restore times differ between the two methods. For example, Sepaton claims its forward referencing method provides quicker restores.
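
One way to picture the two referencing schemes is the simplified Python model below. It is not any vendor's implementation: the slot list, the `("data", ...)` and `("ref", ...)` tuples and the hop counter are assumptions introduced to show where the data ends up in each case.

```python
def write_reverse(slots, latest, chunk):
    """Reverse referencing: a repeat becomes a pointer to the original copy."""
    if chunk in latest:
        slots.append(("ref", latest[chunk]))
    else:
        latest[chunk] = len(slots)
        slots.append(("data", chunk))


def write_forward(slots, latest, chunk):
    """Forward referencing: the newest copy holds the data and the
    previous copy is turned into a pointer to it."""
    new_index = len(slots)
    slots.append(("data", chunk))
    if chunk in latest:
        slots[latest[chunk]] = ("ref", new_index)
    latest[chunk] = new_index


def read(slots, index):
    """Follow pointers until data is reached; return (data, hops)."""
    kind, value = slots[index]
    hops = 0
    while kind == "ref":
        kind, value = slots[value]
        hops += 1
    return value, hops


reverse_slots, forward_slots = [], []
reverse_latest, forward_latest = {}, {}
for _ in range(3):                       # the same chunk backed up three times
    write_reverse(reverse_slots, reverse_latest, b"X")
    write_forward(forward_slots, forward_latest, b"X")

print(read(reverse_slots, 2))   # (b'X', 1)  newest copy is a pointer
print(read(forward_slots, 2))   # (b'X', 0)  newest copy is the data itself
print(read(forward_slots, 0))   # (b'X', 2)  oldest copy follows a chain
```

In this model the most recent backup is read directly under forward referencing, at the cost of rewriting older entries as pointers on every new backup.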

How does encryption affect data deduplication?

Deduplication works by eliminating redundant files, blocks or bits, and encryption turns data into a data stream that is random by its nature. Therefore, if you encrypt data first -- that is, effectively randomise it and remove similar patterns -- it may be impossible to deduplicate it. So you may find that data should be deduplicated first and then encrypted.
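
The effect is easy to demonstrate. The Python sketch below uses a toy cipher, an XOR against a keystream derived with SHA-256 under a fresh random nonce, purely as a stand-in for real encryption; it is an assumption made so the example needs only the standard library and it must not be used to protect data.

```python
import hashlib
import os


def toy_encrypt(block: bytes, key: bytes) -> bytes:
    """Illustrative stand-in for a real cipher. Not secure."""
    nonce = os.urandom(16)                      # fresh randomness per block
    stream = b""
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return nonce + bytes(p ^ s for p, s in zip(block, stream))


def unique_count(blocks):
    """Number of distinct blocks a hash-based deduplicator would store."""
    return len({hashlib.sha1(b).digest() for b in blocks})


key = b"example-key"
plain = [b"same payload" * 100] * 10            # ten identical blocks

encrypt_then_dedupe = unique_count([toy_encrypt(b, key) for b in plain])
dedupe_then_encrypt = unique_count(plain)       # dedupe first, encrypt what is left

print(encrypt_then_dedupe)  # 10
print(dedupe_then_encrypt)  # 1
```

Once encrypted, the ten identical blocks no longer look alike, so nothing can be eliminated. Deduplicating first leaves a single block to encrypt and store.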

Data deduplication product review

Vendor                     | H/w or s/w? | VTL, NAS etc?          | Algorithm used?         | Inline or post-process? | Source or target?
---------------------------+-------------+------------------------+-------------------------+-------------------------+------------------
Copan                      | H/w         | VTL and NAS            | SHA-1                   | Post-process            | Target
Data Domain                | H/w         | VTL and NAS            | SHA-1                   | Inline                  | Target
Dell/Equallogic            | See ExaGrid | -                      | -                       | -                       | -
EMC                        | H/w         | VTL, NAS, SAN attached | SHA-1 and MD5           | Post-process            | Target
EMC/Avamar                 | S/w         | -                      | SHA-1 and MD5           | Inline                  | Source
ExaGrid                    | H/w         | NAS                    | -                       | Post-process            | Target
FalconStor                 | Both        | VTL and NAS            | SHA-1 with optional MD5 | Post-process            | Target
Fujitsu                    | See Avamar  | -                      | -                       | -                       | -
HP                         | H/w         | VTL                    | SHA-1                   | Inline                  | Target
Hitachi Data Systems (HDS) | See Diligent and ExaGrid | -         | -                       | -                       | -
IBM/Diligent               | S/w         | VTL                    | Custom                  | Inline                  | Target
NetApp                     | S/w (in OS) | NAS/SAN                | Custom                  | Both                    | Both
Overland Storage           | H/w         | VTL                    | Custom                  | Inline                  | Target
Pillar Data Systems        | See Data Domain, Diligent, FalconStor, Symantec | - | -        | -                       | -
Quantum/ADIC               | Both        | VTL and NAS            | MD5                     | Both                    | Target
Sepaton                    | S/w         | VTL                    | Custom                  | Post-process            | Target
Spectra Logic              | See FalconStor | -                   | -                       | -                       | -
Sun/StorageTek             | See FalconStor | -                   | -                       | -                       | -
Symantec                   | S/w         | -                      | SHA-1                   | Inline                  | Source









All Rights Reserved, Copyright 2008 - 2010, TechTarget