Today I wrapped up several weeks of travel, which included the Charlotte VMUG conference, NetApp’s Foresight engineering event, and a number of customer and technical partner meetings. During these travels, a small number of individuals used the term ‘data deduplication’ to mean any technology that reduces the amount of storage capacity required to store a data object (file, LUN, VM, etc.).
When this situation arose, I would ask these individuals to share their understanding of data deduplication. It became clear that they did not have a solid grasp of the differences between data compression, deduplication, and single instance storage, nor of the benefits and considerations of each technology.
One could summarize by saying these technologies are simply different means to the same end: reducing the storage footprint. However, this oversimplified answer does not help a data center architect determine where each technology delivers the greatest benefit. Addressing that last point is the goal of this post.
Single Instance Storage
I believe the easiest place to start is with ‘Single Instance Storage’, which is sometimes referred to as ‘File Level Deduplication’. ‘Single Instance Storage’ refers to the ability of a file system (or other data storage container) to identify two or more identical files, retain the multiple external references to the file, and store only a single copy on disk.
In the example below we have two copies of the same Word document, a scenario commonly found in user home directories. With ‘Single Instance Storage’ one can reduce the storage footprint to that of a single copy.
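To make the mechanism concrete, below is a minimal sketch in Python of how a file-level scheme like ‘Single Instance Storage’ might work: each file’s full contents are hashed, identical content is stored only once, and every additional copy keeps just a reference to that single stored instance. The class and helper names are purely illustrative and not any vendor’s implementation.

```python
# A minimal sketch of file-level deduplication ("single instance storage").
# Files whose full contents hash identically are stored once; every other
# occurrence keeps only a reference to that single stored copy.
import hashlib

class SingleInstanceStore:
    def __init__(self):
        self._blobs = {}      # content hash -> file bytes (stored once)
        self._catalog = {}    # file path -> content hash (external reference)

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Only the first copy of identical content consumes capacity.
        self._blobs.setdefault(digest, data)
        self._catalog[path] = digest

    def read(self, path):
        return self._blobs[self._catalog[path]]

    def bytes_stored(self):
        return sum(len(d) for d in self._blobs.values())

store = SingleInstanceStore()
doc = b"Q3 budget spreadsheet contents..." * 100
store.write("/home/alice/budget.xlsx", doc)
store.write("/home/bob/budget.xlsx", doc)   # identical file, no extra capacity
print(store.bytes_stored(), "bytes on disk for two logical files")
```

Note that a single changed byte produces a different hash, which is exactly why this approach only helps with truly identical files.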
Many of us have used or accessed technologies that leverage ‘Single Instance Storage’, as it has been the primary storage savings technology in Microsoft Exchange Server 5.5, 2000, 2003, and 2007. If you are familiar with Exchange Server, you probably recall that the ability of ‘Single Instance Storage’ to reduce file redundancy is limited to the content within an Exchange database (or mailstore). In other words, multiple copies of a file may exist across databases, but each individual database will maintain only a single copy.
Are you aware that Exchange Server 2010 has discontinued support for ‘Single Instance Storage’? It seems Microsoft has left it to the storage vendors to provide capacity savings. If you want to learn about some completely wild storage savings we can obtain with Exchange Server 2010 deployments, make sure you check out the demos we will introduce at VMworld 2010.
In summary, ‘Single Instance Storage’ is a great start to reducing redundancy within NAS file servers. However, because it can only address identical files, its usefulness in other use cases is limited.
Data Deduplication
‘Data Deduplication’ is best described as block-level, or sub-file-level, deduplication: the ability to reduce the redundancy between two or more files which are not identical. Historically, the storage and backup industries have used the term ‘Data Deduplication’ specifically to mean the reduction of data at the sub-file level. I’m sure many of you use technologies which include ‘Data Deduplication’, such as systems from NetApp, Data Domain, or Sun Microsystems.
In the example below we have two virtual machines, each running the same guest operating system, yet each a unique object in its security realm and storing a dissimilar data set. This example represents common deployments of VMware, KVM, Hyper-V, etc. With ‘Data Deduplication’ one can reduce the storage footprint across multiple dissimilar objects which share common data constructs. The more common the data, the greater the storage savings.
With ‘Data Deduplication’, data is stored in the same format as if it were not deduplicated, with the exception that multiple files share storage blocks. This design allows the storage system to serve data without any additional processing prior to transferring it to the requesting host.
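For illustration, here is a minimal, hypothetical sketch of block-level deduplication in Python: files are split into fixed-size blocks, each block is hashed, and identical blocks are stored only once while remaining in their native format. The block size and names are assumptions made for the example, not a description of any particular array’s on-disk layout.

```python
# A minimal sketch of block-level (sub-file) deduplication. Files are split
# into fixed-size blocks; identical blocks are stored once and shared, so two
# files that are merely similar (e.g. two VMs with the same guest OS) still
# save space. The block size and hashing scheme are illustrative assumptions.
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this example

class DedupStore:
    def __init__(self):
        self._blocks = {}   # block hash -> block bytes (stored once)
        self._files = {}    # file name -> ordered list of block hashes

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self._blocks.setdefault(digest, block)   # shared if already present
            hashes.append(digest)
        self._files[name] = hashes

    def read(self, name):
        # No transformation is needed on read: blocks are stored in their
        # native format, they are simply shared between files.
        return b"".join(self._blocks[h] for h in self._files[name])

    def bytes_stored(self):
        return sum(len(b) for b in self._blocks.values())

store = DedupStore()
guest_os = b"common guest OS image " * 2000
vm1 = guest_os + b"application data for VM 1" * 50
vm2 = guest_os + b"different data for VM 2" * 50
store.write("vm1.vmdk", vm1)
store.write("vm2.vmdk", vm2)
print(len(vm1) + len(vm2), "logical bytes,", store.bytes_stored(), "bytes on disk")
```

Because the blocks are stored as-is, a read is simply a lookup and reassembly of the referenced blocks; no transformation of the data is required before returning it to the host.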
In summary, ‘Data Deduplication’ is an advanced form of ‘Single Instance Storage’. It exceeds the storage savings provided by ‘Single Instance Storage’ by deduplicating both identical and dissimilar data sets.
Speaking specifically to NetApp arrays, we provide ‘Data Deduplication’ for primary, backup, and archival data sets. We support ‘dedupe’ for both SAN and NAS data sets, and our data replication software, SnapMirror, as well as our storage controller cache and Flash Cache expansion modules (formerly called PAM), are also ‘dedupe’-enabled. For more on the value of ‘dedupe’-enabled controller cache, see my series on Transparent Storage Cache Sharing (TSCS).
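To illustrate why a ‘dedupe’-aware cache matters, here is a small, conceptual sketch: if the cache is keyed on the deduplicated block rather than on each file’s logical offset, a single cached copy can satisfy reads issued against every file that references that block. This is only an illustration of the idea, not a description of NetApp’s Flash Cache or TSCS internals.

```python
# A conceptual, hypothetical sketch of a deduplication-aware read cache:
# the cache is keyed on the physical (deduplicated) block, so one cached
# copy can serve reads issued against many different files. Illustration
# only; not NetApp's Flash Cache/TSCS implementation.

class DedupAwareCache:
    def __init__(self, fetch_from_disk):
        self._fetch = fetch_from_disk   # callable: block id -> block bytes
        self._cache = {}                # block id -> cached block
        self.disk_reads = 0

    def read(self, block_id):
        if block_id not in self._cache:
            self.disk_reads += 1
            self._cache[block_id] = self._fetch(block_id)
        return self._cache[block_id]

# Two VMs whose virtual disks reference the same deduplicated guest OS block.
disk = {"blk-guest-os": b"shared guest OS data"}
cache = DedupAwareCache(lambda block_id: disk[block_id])

vm1_read = cache.read("blk-guest-os")   # miss: one disk read
vm2_read = cache.read("blk-guest-os")   # hit: served from the single cached copy
print(cache.disk_reads)                  # -> 1
```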
Soon we will be publishing a technical report which validates that, with deduplicated disk and cache, an array can serve data sets with greater performance, and with less hardware, than can be achieved with a Traditional Legacy Storage Array. Sound too good to be true? Ask NetApp customers; they’re all over Twitter, and I’m sure some would be more than happy to share their experiences with NetApp Data Deduplication.
Data Compression
Probably the most mature technology of the bunch is ‘Data Compression’. I’m sure we are all familiar with it; we use it every day when transferring files (à la WinZip), and some of you may even have dabbled with NTFS compression on your Windows systems.
In the example below we have two virtual machines, each running the same guest operating system, yet each a unique object in its security realm and storing a dissimilar data set. This example represents common deployments of VMware, KVM, Hyper-V, etc. With ‘Data Compression’ the data comprising the VMs is rewritten into a dense format on the array. There is no requirement for the data to be common between any objects.
Because compressed data is not stored in a format directly accessible by the requesting host, it falls to the storage controller to decompress the data before serving it. This process adds latency to storage operations.
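As a rough illustration of that read-path difference, here is a minimal Python sketch using zlib: the capacity savings are realized at write time, but every read must first pay the decompression cost before data can be served to the host. The workload and names are made up purely for demonstration.

```python
# A minimal sketch of data compression in the read path. Unlike deduplication,
# compressed data is not stored in a host-readable format, so every read must
# decompress before the data can be served, which adds latency.
import zlib

class CompressedStore:
    def __init__(self):
        self._objects = {}   # name -> compressed bytes

    def write(self, name, data):
        self._objects[name] = zlib.compress(data)

    def read(self, name):
        # The decompression step here is the "performance tax": it must run
        # on every read before data can be returned to the requesting host.
        return zlib.decompress(self._objects[name])

    def bytes_stored(self):
        return sum(len(c) for c in self._objects.values())

store = CompressedStore()
vm = b"guest OS and application data, highly compressible " * 5000
store.write("vm1.vmdk", vm)
assert store.read("vm1.vmdk") == vm
print(len(vm), "logical bytes,", store.bytes_stored(), "bytes on disk")
```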
Recently a competitor stated (via Twitter) that an EMC CLARiiON array incurs roughly a 10% performance penalty when serving data with ‘Data Compression’ enabled. Personally, I believe that 10% penalty is what is observed under a light I/O load; I wonder what the penalty is when the array is under a moderately heavy load.
Many of you may be surprised to know that NetApp arrays provide both ‘Data Deduplication’ and ‘Data Compression’. I’ll share more on the latter in my next post; however, relevant to this discussion, I can share that while we see performance increases with ‘Data Deduplication’, ‘Data Compression’ does add a performance tax to the storage system.
Note that these technologies are not mutually exclusive, so compressed data sets also gain the advantage of TSCS, which helps offset the performance tax.
In summary, ‘Data Compression’ is a stalwart among storage savings technologies and can provide savings unavailable with ‘Single Instance Storage’ or ‘Data Deduplication’. Because of its performance tax, however, one should restrict its usage to data archives and NAS file services.
Wrapping Up This Post
Storage savings technologies are all the rage in the storage and backup industries. While every vendor has its own set of capabilities, it is in the best interest of any architect, administrator, or manager of data center operations to have a clear understanding of which technology will benefit which data sets before enabling these technologies. Saving storage while impeding the performance of a production environment is a sure-fire way to end up updating one’s resume.
Suffice it to say these technologies are here, and they are reshaping our data centers. I hope this post helps you better understand what your storage vendor means when he or she says they offer ‘deduplication’.