Today I wrapped up several weeks of travel, which included the Charlotte VMUG conference, NetApp’s Foresight engineering event, and a number of customer and technical partner meetings. During these travels, a small number of individuals used the term ‘data deduplication’ to mean any technology that reduces the amount of storage capacity required to store a data object (file, LUN, VM, etc.).
When this situation arose, I would ask these individuals to share their understanding of data deduplication. It became clear that they did not have a solid grasp of the differences between data compression, deduplication, and single instance storage, nor of the benefits and considerations of each technology.
One could summarize by saying these technologies are simply different means to the same end: reducing the storage footprint. However, this oversimplified answer does not help a data center architect determine where each technology delivers the greatest benefit. Addressing that last point is the goal of this post.
Single Instance Storage
I believe the easiest place to start is with ‘Single Instance Storage’, which is sometimes referred to as ‘File Level Deduplication’. ‘Single Instance Storage’ refers to the ability of a file system (or other data storage container) to identify two or more identical files, retain the multiple external references to the file, and store only a single copy on disk.
In the example below we have two copies of the same Word document, a scenario commonly found in user home directories. With ‘Single Instance Storage’ one can reduce the storage footprint to that of a single copy.
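To make the mechanism concrete, below is a minimal sketch in Python of how a file-level scheme like ‘Single Instance Storage’ might work: each file’s full contents are hashed, identical content is stored only once, and every additional copy keeps just a reference to that single stored instance. The class and helper names are purely illustrative and not any vendor’s implementation.

```python
# A minimal sketch of file-level deduplication ("single instance storage").
# Files whose full contents hash identically are stored once; every other
# occurrence keeps only a reference to that single stored copy.
import hashlib

class SingleInstanceStore:
    def __init__(self):
        self._blobs = {}      # content hash -> file bytes (stored once)
        self._catalog = {}    # file path -> content hash (external reference)

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Only the first copy of identical content consumes capacity.
        self._blobs.setdefault(digest, data)
        self._catalog[path] = digest

    def read(self, path):
        return self._blobs[self._catalog[path]]

    def bytes_stored(self):
        return sum(len(d) for d in self._blobs.values())

store = SingleInstanceStore()
doc = b"Q3 budget spreadsheet contents..." * 100
store.write("/home/alice/budget.xlsx", doc)
store.write("/home/bob/budget.xlsx", doc)   # identical file, no extra capacity
print(store.bytes_stored(), "bytes on disk for two logical files")
```

Note that a single changed byte produces a different hash, which is exactly why this approach only helps with truly identical files.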
Many of us have used or accessed technologies that leverage ‘Single Instance Storage’, as it has been the primary storage savings technology in Microsoft Exchange Server 5.5, 2000, 2003, and 2007. If you are familiar with Exchange Server, you probably recall that the ability of ‘Single Instance Storage’ to reduce file redundancy is limited to the content within an Exchange database (or mailstore). In other words, multiple copies of a file may exist across databases, but each individual database will maintain only a single copy.
Are you aware that Exchange Server 2010 has discontinued support for ‘Single Instance Storage’? It seems Microsoft has left it to the storage vendors to provide capacity savings. If you want to learn about some completely wild storage savings we can obtain with Exchange Server 2010 deployments, make sure you check out the demos we will introduce at VMworld 2010.
In summary, ‘Single Instance Storage’ is a great start to reducing redundancy within NAS file servers. However, because it can only address identical files, its usefulness in other use cases is limited.
Data Deduplication
‘Data Deduplication’ is best described as block-level, or sub-file-level, deduplication: the ability to reduce the redundancy between two or more files which are not identical. Historically, the storage and backup industries have used the term ‘Data Deduplication’ specifically to mean the reduction of data at the sub-file level. I’m sure many of you use technologies which include ‘Data Deduplication’, such as systems from NetApp, Data Domain, or Sun Microsystems.
In the example below we have two virtual machines, each running the same guest operating system, yet each a unique object in its security realm and storing a dissimilar data set. This example represents common deployments of VMware, KVM, Hyper-V, etc. With ‘Data Deduplication’ one can reduce the storage footprint across multiple dissimilar objects which share common data constructs. The more common the data, the greater the storage savings.
With ‘Data Deduplication’, data is stored in the same format as if it were not deduplicated, with the exception that multiple files share storage blocks. This design allows the storage system to serve data without any additional processing prior to transferring it to the requesting host.
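For illustration, here is a minimal, hypothetical sketch of block-level deduplication in Python: files are split into fixed-size blocks, each block is hashed, and identical blocks are stored only once while remaining in their native format. The block size and names are assumptions made for the example, not a description of any particular array’s on-disk layout.

```python
# A minimal sketch of block-level (sub-file) deduplication. Files are split
# into fixed-size blocks; identical blocks are stored once and shared, so two
# files that are merely similar (e.g. two VMs with the same guest OS) still
# save space. The block size and hashing scheme are illustrative assumptions.
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this example

class DedupStore:
    def __init__(self):
        self._blocks = {}   # block hash -> block bytes (stored once)
        self._files = {}    # file name -> ordered list of block hashes

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self._blocks.setdefault(digest, block)   # shared if already present
            hashes.append(digest)
        self._files[name] = hashes

    def read(self, name):
        # No transformation is needed on read: blocks are stored in their
        # native format, they are simply shared between files.
        return b"".join(self._blocks[h] for h in self._files[name])

    def bytes_stored(self):
        return sum(len(b) for b in self._blocks.values())

store = DedupStore()
guest_os = b"common guest OS image " * 2000
vm1 = guest_os + b"application data for VM 1" * 50
vm2 = guest_os + b"different data for VM 2" * 50
store.write("vm1.vmdk", vm1)
store.write("vm2.vmdk", vm2)
print(len(vm1) + len(vm2), "logical bytes,", store.bytes_stored(), "bytes on disk")
```

Because the blocks are stored as-is, a read is simply a lookup and reassembly of the referenced blocks; no transformation of the data is required before returning it to the host.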
In summary, ‘Data Deduplication’ is an advanced form of ‘Single Instance Storage’. It exceeds the storage savings provided by ‘Single Instance Storage’ by deduplicating both identical and dissimilar data sets.
Speaking specifically to NetApp arrays, we provide ‘Data Deduplication’ for primary, backup, and archival data sets. We support ‘dedupe’ for both SAN and NAS data sets, and our data replication software, SnapMirror, as well as our storage controller cache and Flash Cache expansion modules (formerly called PAM), are also ‘dedupe’-enabled. For more on the value of ‘dedupe’-enabled controller cache, see my series on Transparent Storage Cache Sharing (TSCS).
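To illustrate why a ‘dedupe’-aware cache matters, here is a small, conceptual sketch: if the cache is keyed on the deduplicated block rather than on each file’s logical offset, a single cached copy can satisfy reads issued against every file that references that block. This is only an illustration of the idea, not a description of NetApp’s Flash Cache or TSCS internals.

```python
# A conceptual, hypothetical sketch of a deduplication-aware read cache:
# the cache is keyed on the physical (deduplicated) block, so one cached
# copy can serve reads issued against many different files. Illustration
# only; not NetApp's Flash Cache/TSCS implementation.

class DedupAwareCache:
    def __init__(self, fetch_from_disk):
        self._fetch = fetch_from_disk   # callable: block id -> block bytes
        self._cache = {}                # block id -> cached block
        self.disk_reads = 0

    def read(self, block_id):
        if block_id not in self._cache:
            self.disk_reads += 1
            self._cache[block_id] = self._fetch(block_id)
        return self._cache[block_id]

# Two VMs whose virtual disks reference the same deduplicated guest OS block.
disk = {"blk-guest-os": b"shared guest OS data"}
cache = DedupAwareCache(lambda block_id: disk[block_id])

vm1_read = cache.read("blk-guest-os")   # miss: one disk read
vm2_read = cache.read("blk-guest-os")   # hit: served from the single cached copy
print(cache.disk_reads)                  # -> 1
```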
Soon we will be publishing a technical report which validates that, with deduplicated disk and cache, an array can serve data sets with greater performance, and with less hardware, than can be achieved with a Traditional Legacy Storage Array. Sound too good to be true? Ask NetApp customers; they’re all over Twitter, and I’m sure some would be more than happy to share their experiences with NetApp Data Deduplication.
Data Compression
Probably the most mature technology of the bunch is ‘Data Compression’. I’m sure we are all familiar with it; we use it every day when transferring files (à la WinZip), and some of you may even have dabbled with NTFS compression on your Windows systems.
In the example below we have two virtual machines, each running the same guest operating system, yet each a unique object in its security realm and storing a dissimilar data set. This example represents common deployments of VMware, KVM, Hyper-V, etc. With ‘Data Compression’ the data comprising the VMs is rewritten into a dense format on the array. There is no requirement for the data to be common between any objects.
Because compressed data is not stored in a format directly accessible by the requesting host, it falls to the storage controller to decompress the data before serving it. This process adds latency to storage operations.
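As a rough illustration of that read-path difference, here is a minimal Python sketch using zlib: the capacity savings are realized at write time, but every read must first pay the decompression cost before data can be served to the host. The workload and names are made up purely for demonstration.

```python
# A minimal sketch of data compression in the read path. Unlike deduplication,
# compressed data is not stored in a host-readable format, so every read must
# decompress before the data can be served, which adds latency.
import zlib

class CompressedStore:
    def __init__(self):
        self._objects = {}   # name -> compressed bytes

    def write(self, name, data):
        self._objects[name] = zlib.compress(data)

    def read(self, name):
        # The decompression step here is the "performance tax": it must run
        # on every read before data can be returned to the requesting host.
        return zlib.decompress(self._objects[name])

    def bytes_stored(self):
        return sum(len(c) for c in self._objects.values())

store = CompressedStore()
vm = b"guest OS and application data, highly compressible " * 5000
store.write("vm1.vmdk", vm)
assert store.read("vm1.vmdk") == vm
print(len(vm), "logical bytes,", store.bytes_stored(), "bytes on disk")
```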
Recently a competitor stated (via Twitter) that an EMC CLARiiON array incurs roughly a 10% performance penalty when serving data with ‘Data Compression’ enabled. Personally, I believe that 10% penalty is what is observed under a light I/O load; I wonder what the penalty is when the array is under a moderately heavy load.
Many of you may be surprised to know that NetApp arrays provide both ‘Data Deduplication’ and ‘Data Compression’. I’ll share more on the latter in my next post; however, relevant to this discussion, I can share that while we see performance increases with ‘Data Deduplication’, ‘Data Compression’ does add a performance tax to the storage system.
Note that these technologies are not mutually exclusive, so compressed data sets also gain the advantage of TSCS, which helps offset the performance tax.
In summary, ‘Data Compression’ is a stalwart among storage savings technologies and can provide savings unavailable with ‘Single Instance Storage’ or ‘Data Deduplication’. Because of its performance tax, however, one should restrict its usage to data archives and NAS file services.
Wrapping Up This Post
Storage savings technologies are all the rage in the storage and backup industries. While every vendor has its own set of capabilities, it is in the best interest of any architect, administrator, or manager of data center operations to have a clear understanding of which technology will benefit which data sets before enabling these technologies. Saving storage while impeding the performance of a production environment is a sure-fire way to end up updating one’s resume.
Suffice it to say these technologies are here, and they are reshaping our data centers. I hope this post helps you better understand what your storage vendor means when he or she says they offer ‘deduplication’.