In many ways, data deduplication and compression are a lot like salt and pepper. Both seasonings enhance the taste of food, each has a distinct flavor, and each is used in varying quantities depending on the dish being prepared; most of the time, however, food tastes better when the two are used together. Similarly, dedupe and compression both reduce the storage capacity consumed by data, yet they are rarely found together in most storage arrays.
Maybe you’re wondering why I wrote this post. Surely everyone knows about dedupe and compression. Heck, these technologies seem to be the premier feature of every modern storage platform, right?
I’m not so sure about that. I’m consistently surprised by how often these technologies are incorrectly referenced in blog posts or vendor marketing collateral. While similar in purpose, the two provide data reduction for dissimilar data sets. It is critical to understand how each technology operates and which application types benefit from its use, but most importantly, how the combination of the two can provide unmatched storage savings across the broadest set of use cases.
A Word on Thin Provisioning
Surely at this point in the post someone is asking, ‘What about Thin Provisioning? It reduces storage capacity.’
That’s simply incorrect: thin provisioning is not a data reduction technology. It is a provisioning model that allows one to consume storage on demand by eliminating the preallocation of storage capacity. It increases the utilization of storage media, but it does not reduce the amount of data written to that media. I’ll cover thin provisioning in my next post.
Data Deduplication (A Primer in Layman’s Terms)
Data deduplication (aka dedupe) provides storage savings by eliminating redundant blocks of data. Data compression, by contrast, provides savings by eliminating redundancy at the binary level within a block; more on that in the next section.
Dedupe reduces storage capacity only when there is redundancy in the data set. In other words, the data set must contain multiple identical files, or files that share a portion of data identical to content found in other files.
Examples of where one will find file redundancy include home directories and cloud file sharing applications like Citrix ShareFile and VMware Horizon. Block redundancy is rampant in data sets like test & development, QA, virtual machines, and virtual desktops; just think of how many copies of operating system and application binaries exist in these virtualized environments.
Tech Tip: The smaller the storage block size, the greater the ability to identify and dedupe data. For example, misaligned VMs can be deduped with a 512-byte block size but cannot be deduped with a 4 KB block size.
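To make that concrete, here’s a minimal Python sketch of fixed-block dedupe. It’s a toy illustration (not how any array actually implements dedupe): each block is fingerprinted with SHA-256, each unique block is stored once, and the same misaligned data dedupes at 512 bytes but not at 4 KB.

```python
import hashlib
import random

def dedupe(data: bytes, block_size: int):
    """Store each unique fixed-size block once; return (store, recipe)."""
    store = {}    # fingerprint -> block contents, written to flash once
    recipe = []   # ordered fingerprints needed to rebuild the data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # skip blocks we have already seen
        recipe.append(fp)
    return store, recipe

# Two copies of the same 4 KB "OS image", the second shifted by 512 bytes --
# the kind of misalignment a guest VM's partition offset can introduce.
random.seed(0)
os_image = bytes(random.randrange(256) for _ in range(4096))
data = os_image + b"\x00" * 512 + os_image

for bs in (512, 4096):
    store, recipe = dedupe(data, bs)
    saved = 1 - len(store) / len(recipe)
    print(f"{bs:>4}-byte blocks: {len(recipe)} blocks written, "
          f"{len(store)} stored ({saved:.0%} saved)")
```

With 512-byte blocks, the shifted copy still lands on block boundaries and dedupes away; with 4 KB blocks, every block is unique and nothing is saved.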
Data Compression
Data compression provides storage savings by eliminating redundancy at the binary level within a block of data. Unlike dedupe, compression is not concerned with whether a second copy of the same block exists; it simply stores the most efficient block on flash. Compression algorithms “deflate” data as it is written and “inflate” it as it is read, storing it in a format denser than its native form. Common examples of file-level compression in our day-to-day lives include MP3 audio and JPG image files.
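For the curious, here is what that round trip looks like using Python’s zlib module (a DEFLATE implementation) as a stand-in; this is a sketch of the concept, not the algorithm any particular array uses:

```python
import zlib

# A block with binary-level redundancy, e.g. repetitive application data.
block = b"status=OK;retries=0;latency_ms=12;" * 120

compressed = zlib.compress(block)        # "deflate" as data is written
restored = zlib.decompress(compressed)   # "inflate" as data is read back
assert restored == block                 # lossless round trip

print(f"{len(block)} bytes -> {len(compressed)} bytes on flash")
```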
Compression at the application layer, as in a SQL or Oracle database, is somewhat of a balancing act: faster compression and decompression speeds usually come at the expense of smaller space savings. To cite a lesser-known example, Hadoop commonly offers the following five compression formats (a sketch of the speed-versus-savings tradeoff follows the list):
- DEFLATE
- gzip
- bzip2
- LZO
- Snappy
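The sketch below uses Python’s zlib at levels 1 and 9 as a stand-in for the fast (Snappy, LZO) versus dense (bzip2, gzip) ends of that spectrum; the exact numbers will vary by machine and data, but the tradeoff holds:

```python
import time
import zlib

# A compressible, log-like payload standing in for application data.
payload = b"".join(b"ts=%08d level=INFO msg=heartbeat ok\n" % i
                   for i in range(50_000))

for level in (1, 9):   # 1 = fastest, 9 = densest
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(out)
    print(f"level {level}: {ratio:4.1f}:1 reduction in {elapsed * 1000:.0f} ms")
```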
Tech Tip: Data sets that are already compressed by an application can often be compressed further on a storage array. This is possible because most admins favor optimal application performance over maximum storage savings, selecting faster, lighter compression formats that leave additional redundancy for the array to remove.
Double Your Savings
From our inception, Pure Storage has focused on enabling the mass adoption of all-flash storage. By implementing multiple data reduction technologies, Pure Storage is able to significantly reduce the capacity consumed by data and, in turn, reduce the effective price of flash, making flash affordable for all workloads. The savings are universal.
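To see why the two technologies multiply rather than merely add, here’s a hypothetical back-of-the-envelope sketch (the `reduce_data` helper and the workload are invented for illustration): dedupe collapses duplicate blocks first, then compression shrinks each surviving unique block.

```python
import hashlib
import zlib

def reduce_data(data: bytes, block_size: int = 4096) -> int:
    """Dedupe, then compress what survives; return bytes actually stored."""
    unique = {}
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        unique.setdefault(hashlib.sha256(block).hexdigest(), block)
    # Only the unique blocks left after dedupe get compressed and stored.
    return sum(len(zlib.compress(b)) for b in unique.values())

# Ten cloned "VMs", each holding the same compressible configuration block.
vm_block = (b"[service]\nthreads=8\ncache=on\n" * 150)[:4096]
fleet = vm_block * 10

stored = reduce_data(fleet)
print(f"{len(fleet)} logical bytes -> {stored} stored "
      f"({len(fleet) / stored:.0f}:1 total reduction)")
```

Dedupe alone would deliver 10:1 here; compression alone would deliver whatever the block’s internal redundancy allows; together, the reductions compound.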
The chart below shows the actual storage savings across the entire Pure Storage customer base. Notice how data deduplication and compression combine to double the storage savings we deliver to our customers. No tricks. No thin provisioning. No limits.
Storage vendors who provide only a single data reduction technology are inevitably limited in the applications for which they can affordably provide flash. Customers are learning to avoid such limited platforms in favor of more universal architectures, like Pure Storage, that address a larger number of applications and solutions.