Data Deduplication Efficiency

Every day, companies generate or acquire more data, often at alarming rates. This data must be stored on disk or, eventually, tape. Reliable data storage is not cheap, however, especially once ancillary costs like electricity, cooling, maintenance, and floor space are factored in. This is why data deduplication is gaining in popularity: it reduces the amount of data that must be stored by ensuring that only a single instance of any given data is saved.

For example, imagine that a PowerPoint presentation is distributed to the ten members of a workgroup. Each member saves a copy of the presentation, so ten copies of the same file are stored, which is hardly efficient. Data deduplication comes to the rescue by replacing nine of those copies with pointers to the one unique file. When a user accesses the file, the pointer transparently opens that single copy; the user still sees what appears to be a file on a local laptop or desktop, while the enterprise greatly stretches its storage resources. In addition to reducing the need for storage capacity, data deduplication can improve recovery time objectives (RTOs) and lessen the need for tape backups.
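To make the pointer idea concrete, here is a minimal Python sketch of file-level deduplication, assuming a toy in-memory store that keys content by its SHA-256 hash (names like FileStore are hypothetical, purely for illustration):

    import hashlib

    class FileStore:
        """Toy file-level deduplicating store: one physical copy per unique file."""

        def __init__(self):
            self.blobs = {}    # content hash -> file bytes, stored once
            self.catalog = {}  # file name -> content hash (the "pointer")

        def save(self, name, data):
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:   # first copy stores the bytes...
                self.blobs[digest] = data
            self.catalog[name] = digest    # ...every copy stores only a pointer

        def open(self, name):
            return self.blobs[self.catalog[name]]

    store = FileStore()
    deck = b"...presentation bytes..."
    for member in range(10):               # ten workgroup members save the deck
        store.save(f"user{member}/plan.pptx", deck)
    print(len(store.blobs))                # 1: only one physical copy exists

Ten saves cost the storage of one file plus ten small pointers, which is the whole trick.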

There are several flavors of data deduplication. File-level data deduplication eliminates entire duplicate files, as in the PowerPoint example above. Block-level data deduplication, on the other hand, is much more granular: it identifies and preserves only the blocks of a file that are unique and discards the redundant ones. When a file is updated, only the changed blocks are saved. By saving unique blocks rather than entire unique files, block-level data deduplication is considerably more space-efficient than file-level data deduplication.
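Block-level deduplication can be sketched the same way. The toy below assumes fixed 4 KB blocks keyed by SHA-256; real products often use variable-size, content-defined chunking instead:

    import hashlib

    BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only

    class BlockStore:
        """Toy block-level deduplicating store: one copy per unique block."""

        def __init__(self):
            self.blocks = {}   # block hash -> block bytes, stored once
            self.files = {}    # file name -> ordered list of block hashes

        def save(self, name, data):
            recipe = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(digest, block)  # keep only unseen blocks
                recipe.append(digest)
            self.files[name] = recipe

        def open(self, name):
            return b"".join(self.blocks[d] for d in self.files[name])

With this layout, re-saving an updated file stores only the blocks that actually changed; every unchanged block deduplicates against the copy already in the store, including blocks shared with other files.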

Data deduplication strategies also differ in where they are deployed. Source data deduplication is performed at the source, on primary storage, before the data is sent to the backup system. This approach reduces the bandwidth needed to perform backups, but it can raise interoperability issues with existing systems and applications, and it consumes CPU cycles on production hosts, which could impact performance elsewhere.
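A hedged sketch of where the bandwidth saving comes from: the source hashes its blocks locally and ships only those the target does not already hold. The target_has set and send_block callback here are hypothetical stand-ins for a real backup protocol's hash-lookup and transfer calls:

    import hashlib

    BLOCK_SIZE = 4096  # assumed fixed block size for illustration

    def source_side_backup(data, target_has, send_block):
        """Hash blocks on the source and ship only those the backup
        target does not already hold; returns the number of blocks sent."""
        sent = 0
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in target_has:   # ask before sending: saves bandwidth
                send_block(digest, block)
                target_has.add(digest)
                sent += 1
        return sent

    seen = set()
    source_side_backup(b"x" * 10_000, seen, lambda d, b: None)  # ships 2 unique blocks
    source_side_backup(b"x" * 10_000, seen, lambda d, b: None)  # ships 0: all known

Note that all the hashing and lookups run on the production host, which is exactly where the extra CPU cost mentioned above comes from.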

The other strategy is target data deduplication, which occurs in the backup system, such as on a RAID storage array. Target data deduplication is simpler to deploy and comes in two modes. Post-process data deduplication occurs after the data has been stored, so the target must temporarily hold the full, undeduplicated stream and consequently requires greater storage capacity. In-line data deduplication occurs before the data is written and, as a result, requires less storage capacity, at the cost of extra processing while the backup is running.
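The capacity difference between the two modes can be shown schematically. Assuming a deduplicating store like the toy BlockStore sketched earlier, the contrast is simply when deduplication happens relative to the write:

    def backup_inline(store, name, data):
        # In-line: blocks are hashed and deduplicated before they are
        # written, so only unique blocks ever consume backup capacity.
        store.save(name, data)

    def backup_post_process(landing_area, store, name, data):
        # Post-process: the full raw stream lands first, so peak
        # capacity equals the undeduplicated size...
        landing_area[name] = data
        # ...and a later pass deduplicates it and frees the raw copy.
        store.save(name, landing_area.pop(name))

Post-process can ingest at full speed and deduplicate later, at the price of a larger landing area; in-line avoids the landing area but does the hashing work during the backup itself.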

Data deduplication can do nothing to stave off the torrents of data that must be saved, but it can make storing them more cost-effective. A robust RAID array with in-line target data deduplication can reduce the quantity of data that is stored with minimal impact on other systems, making for greater storage efficiency.