Data is the lifeblood of any business. The problem is what to do with all of it. According to IDC, data in the enterprise doubles every 18 months, straining storage systems to the point of collapse. The blame for this bloat often falls on compliance regulations that mandate the retention of gobs of messages and documents. More significant, though, is that there's no expiration date on business value. Analyzing data dating back years allows users to discover trends, create forecasts, predict customer behavior, and more.
Surely there must be a way to reduce the immense storage footprint of all this data without sacrificing useful information. And there is, thanks to a technology known as data deduplication.
Every network contains masses of duplicate data, from multiple backup sets to thousands of copies of the employee handbook to identical file attachments sitting on the same e-mail server. The basic idea of data deduplication is to locate duplicate copies of the same file and eliminate all but one. Each duplicate is replaced by a simple placeholder pointing to the remaining original. When users request the file, the placeholder directs them to the original and they never know the difference.
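As a rough illustration of that file-level approach, the sketch below (not any product's actual implementation) hashes every file under a directory, keeps the first copy of each unique file, and overwrites later copies with a tiny text placeholder naming the original. The directory path and placeholder format are invented for the example; a real system would use a filesystem link or a stub understood by the storage layer.

```python
import hashlib
from pathlib import Path

def dedupe_directory(root):
    """Keep the first copy of each unique file; replace later copies
    with a small placeholder that points back to the original."""
    seen = {}  # content hash -> path of the original copy
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            # Duplicate: swap it for a stub that names the original.
            path.write_text(f"DEDUP PLACEHOLDER -> {seen[digest]}\n")
        else:
            seen[digest] = str(path)
    return seen

dedupe_directory("/srv/shared")  # hypothetical share full of duplicate attachments
```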
Interest in the technology is driven by the potential reduction in required storage capacity and by the win-win of providing better services (e.g., disk-based recovery) while reducing costs. “Deduplication improves storage efficiency by finding identical blocks of data and replacing them with references to a single shared block. The same block of data can belong to several different files or LUNs, or it can appear repeatedly within the same file. The average UNIX or Windows disk volume contains thousands or even millions of duplicate data objects. As data is created, distributed, backed up, and archived, duplicate data objects are stored unabated across all storage tiers. The end result is inefficient utilization of data storage resources,” says Martyn Molnar, NetApp Regional Sales Director for the Middle East.
Deduplication takes several forms, from simple file-to-file detection to more advanced methods that look inside files at the block or byte level. Basically, dedupe software works by analyzing a chunk of data, be it a block, a series of bits, or the entire file, and running it through a hashing algorithm to produce an identifier that is checked against an index of chunks already seen. If the hash is already in the index, that chunk is a duplicate and doesn't need to be stored again; only a reference is kept. If not, the chunk is stored and its hash is added to the index.
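A minimal sketch of that hash-and-index loop, assuming fixed-size chunks and a simple in-memory Python dictionary as the index (commercial products use more sophisticated chunking and persistent indexes):

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real products may also use variable-size chunking

def store(data, index):
    """Split data into chunks; store each unique chunk once and return
    the list of hashes needed to reassemble the original data."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:        # new chunk: store it and record its hash
            index[digest] = chunk
        recipe.append(digest)          # duplicate or not, the file keeps only a reference
    return recipe

def restore(recipe, index):
    """Reassemble the original data from its chunk references."""
    return b"".join(index[digest] for digest in recipe)

index = {}                             # the dedupe index: hash -> unique chunk
recipe = store(b"A" * 8192 + b"B" * 4096, index)
assert restore(recipe, index) == b"A" * 8192 + b"B" * 4096
print(len(index), "unique chunks for", len(recipe), "references")  # 2 unique chunks, 3 references
```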
Data deduplication isn't just for data stored in a file or mail system. The benefits in backup situations, especially with regard to disaster recovery, are massive. On a daily basis, the percentage of changed data is relatively small. When transferring a backup set to another datacenter over the WAN, there's no need to move the same bytes each and every night. Use deduplication and you vastly reduce the backup size. WAN bandwidth usage goes down and disaster recovery ability goes up.
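To make the savings concrete, here is a rough sketch, not any vendor's replication protocol, of how a nightly backup transfer could be reduced to just the chunks the remote site has never seen; the backup data and chunk size are invented for the example.

```python
import hashlib
import os

CHUNK_SIZE = 4096

def chunk_hashes(backup):
    """Map each fixed-size chunk of tonight's backup to its hash."""
    return {
        hashlib.sha256(backup[o:o + CHUNK_SIZE]).hexdigest(): backup[o:o + CHUNK_SIZE]
        for o in range(0, len(backup), CHUNK_SIZE)
    }

def replicate(tonight, remote_index):
    """Send only the chunks the remote site has never seen before."""
    new_chunks = {h: c for h, c in chunk_hashes(tonight).items() if h not in remote_index}
    remote_index.update(new_chunks)                  # stand-in for the WAN transfer
    return sum(len(c) for c in new_chunks.values())  # bytes that actually cross the wire

remote_index = {}
night1 = os.urandom(1_000_000)          # stand-in for a full backup set
night2 = night1 + os.urandom(8_192)     # next night: mostly unchanged data
print(replicate(night1, remote_index))  # roughly 1,000,000 bytes move the first night
print(replicate(night2, remote_index))  # only a few KB of new chunks move the second night
```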
More and more backup products are incorporating data deduplication, and deduplication appliances have been maturing over the past few years. File system deduplication is on its way too. When it comes to solving real-world IT problems, few technologies have a greater impact than data deduplication.