Many of you have been asking when NetApp and VMware would update TR-3697 (the multiprotocol performance comparison paper covering VI3) to include vSphere. It is my pleasure to share that as of this morning we have jointly released a new report, TR-3808.
The paper compares the performance of FC, iSCSI, and NFS connected storage along with the gains made in protocol optimization with vSphere as compared to VI3.
What’s Covered in the Report?
TR-3808 compares the scaling of NetApp FC, iSCSI, and NFS connected shared datastores. NetApp and VMware agreed to test general purpose shared datastores, as they comprise more than 80% of all datastores currently deployed.
The concept of a general purpose datastore is the 80% side of the 80:20 rule regarding storage with VMware. I tend to discuss the rule during my customer briefings and public presentations. The remaining 20% of the rule is what I refer to as High Performance Datasets. Both of these constructs were discussed in the ‘Multivendor Post to help our mutual NFS customers using VMware‘, which was co-authored by Chad Sakac.
Maybe I’ll dedicate a post to the 80:20 rule in the near future…
The Test Bed
In the testing we deployed an 8 node ESX cluster connected to a FAS3170 over 1 GbE, 10 GbE, and 4 Gb FC links. We used the (what many consider industry standard) Iometer benchmark to generate our workload, which consisted of 4K or 8K request sizes, with a 75% read / 25% write mix, and a 100% random access pattern. The testing was run at various loads based on the number of running virtual machines, starting with 32, then 96, and finally 160 (executing 128, 384, and 640 total outstanding I/Os, respectively). Each VM ran an Iometer dynamo configured to generate four outstanding I/Os.
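The load-point arithmetic above can be sketched in a few lines — this is just an illustration of how the report's totals fall out of the per-VM configuration, not anything from the report itself:

```python
# Each VM runs an Iometer dynamo configured for 4 outstanding I/Os;
# the load points scale by adding VMs, so total outstanding I/Os
# is simply VMs x per-VM queue depth.
OUTSTANDING_IOS_PER_VM = 4

for vm_count in (32, 96, 160):
    total_oio = vm_count * OUTSTANDING_IOS_PER_VM
    print(f"{vm_count} VMs -> {total_oio} total outstanding I/Os")
```

Running this prints the 128, 384, and 640 figures quoted above.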
The Performance Results
As you can see there is very little difference between the performance results of any storage protocol when running VMware on NetApp. This first graph is one of many showing I/O throughput.
Here’s another throughput test run with 10 GbE for NFS and iSCSI…
In these results I’d like to point out that our FC performance results tend to be much higher than what is witnessed when running FC connected datastores on storage arrays that implement per-LUN I/O queues. This is a holdover from legacy storage array architectures that is still present in most current array platforms in the storage industry.
For more on per-LUN I/O queues see my post ‘vSphere Introduces the Plug-n-Play SAN’
Here are some results measuring CPU utilization. These are a bit scary at first because the numbers are relative. So, if the CPU utilization of a workload over FC was 10% of the ESX CPU, and the same workload over NFS used 14.5% of the ESX CPU, then the relative difference is that NFS uses 145% of the CPU required by FC.
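To make the relative-utilization math above concrete, here is a tiny sketch — the 10% and 14.5% figures are the illustrative numbers from the paragraph, not measured data from the report:

```python
def relative_cpu(protocol_util, baseline_util):
    """Express one protocol's ESX CPU cost relative to a baseline (here, FC)."""
    return protocol_util / baseline_util * 100

fc_util = 10.0    # % of ESX CPU under FC (illustrative)
nfs_util = 14.5   # % of ESX CPU under NFS (illustrative)
print(f"NFS uses {relative_cpu(nfs_util, fc_util):.0f}% of the CPU required by FC")
```

So a chart value of 145% does not mean NFS burned 145% of a CPU — it means 1.45x the (small) absolute cost of the FC baseline.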
The last chart I’ll share covers I/O latency, and again the differences are negligible.
The Performance and Efficiency Gains with vSphere
Overall, our tests showed that vSphere delivered performance comparable to ESX 3.5 with greater protocol efficiency across all test configurations. While total performance (as measured in IOPS) was relatively flat between the releases, we found the ESX CPU utilization was significantly reduced.
In our tests vSphere consumed approximately 6% to 23% less ESX host CPU resources when using either FC or NFS, depending on the load. With the iSCSI initiator, we found that vSphere consumed approximately 35% to 43% less ESX host CPU resources compared to ESX 3.5.
What’s not covered in the Report?
Obviously what is not covered in the current report is testing the performance of High Performance Datasets. These are I/O intensive applications such as databases, email servers, and the like. These types of datasets typically don’t share datastores, don’t share physical spindles, and may write data at a block (or page) size greater than the 4KB and 8KB sizes we see in shared datastores.
Larger block sizes, combined with the lack of shared LUN queues and spindles, produce some very interesting results. Unfortunately, we weren’t able to reach a consensus on the dataset and access profile to test in time to publish TR-3808 this month.
While TR-3808 doesn’t address this type of working set, there is a VMware report comparing FC, iSCSI, and NFS performance with a 16,000 seat Exchange 2007 configuration in vSphere.
Interestingly enough, every NetApp storage protocol outperformed the same test conducted earlier on an EMC Clariion array with Fibre Channel connectivity to the ESX hosts. I should point out that some of the performance gain in the NetApp config is a direct result of the enhancements in vSphere and would also apply to EMC.
Wrapping this Post Up
I share the opinion of many of our customers when I state that running VMware on a natively multiprotocol storage array provides the best means to scale and manage a virtual datacenter, as every storage protocol provides a unique value relative to its use case. For some examples see the 80:20 rule or some of the storage related details in ‘VCE-101 Thin Provisioning Part 2 – Going Beyond’
The engineering teams at VMware and NetApp have ensured that customers will always receive the highest performance available. We are nearing the end of another set of tests that will provide additional data regarding performance with storage saving technologies such as data deduplication, thin provisioning, and our performance acceleration module (PAM). You’ll see it here as soon as it’s live (trust me, this one is my pet project).
I think the takeaway for me is that Jumbo Frames for iSCSI aren’t necessarily an improvement. Although CPU overhead decreases, the throughput does as well.
Looking at some of my shared VMFS volumes, I see read I/O sizes averaging between 15-33KB (spikes to well over 100KB), and write I/O sizes averaging 8-12KB (spikes to 20-30KB). Pretty low usage volumes too (aggregate IOPS averaging about 200 for these three volumes).
I don’t put any data intensive things on VMFS; that stuff either resides on RDM, or on NFS (which the guest VMs access themselves over their virtual NICs). Everything is fiber attached, and configured for boot from SAN.
I was just wondering where you got your 4-8k numbers from for shared VMFS volumes.
Myself I’d be most curious to see performance of 10GbE iSCSI offload cards like Chelsio.
My array is one giant aggregate(NetApp term though this isn’t a NetApp array), everything from DB, to VM, to NFS file stores, shares the same spindles.
Did you change the default queue depth settings on the FC HBAs? Just curious — with so many outstanding requests I think increasing the queue depth would benefit performance. I set each of my systems (mostly QLogic HBAs) to 64 (the default is 16 or 32, I forget). Our NAS cluster uses Emulex and its queue depth is set at around 1000.
Another interesting test would be to gauge performance running software iSCSI inside of the guest VM vs. at the hypervisor level, at least for throughput/latency; of course you couldn’t do that for VMFS volumes. Running iSCSI inside the guest was pretty common back in ESX 3.x, when the iSCSI software initiator on ESX was so horribly slow.
ok I see you guys set it to 128, buried at the end 🙂 great!
Don Mann says
This whitepaper is designed to simulate real-world environments. On page 32 you see the details of the environment, which show that the 8 nodes in the test are configured as 4 clusters of 2 nodes each.
It is also not clear whether the 20 VMs per LUN are spread across the 2 nodes in the cluster.
I would suggest that real world environments would be more like a 4 or 8 node cluster (given you have 8 nodes), and to test protocol performance we should add VMs to the datastore to test scalability. I wouldn’t mind if there were individual tests showing VMs on a single host, but the majority of the tests should be on 4-8 node clusters with DRS enabled — or with manual distribution of the VMs.
Cluster has datastore1 with 20 VMs — 5 VMs in datastore1 per host.
The value of NFS for my customers has been larger datastores with more VMs.
Would anyone take 8 ESX nodes and split them into 4 clusters of 2 hosts each?
Vaughn Stewart says
All thanks for the feedback and dialog.
@Duncan – The actual throughput in MB/s is pretty high (although the report only displays relative numbers). There may be some value in reducing CPU utilization from the perspective of the storage array, which serves the aggregated workload from all hosts. Saving a few points of CPU per ESX/ESXi host may not be a big deal, but it may result in double digit CPU reductions on the array.
@Nate – regarding the block size choices of 4KB & 8KB: most VMs are configured with the default block size of their file system. Windows formats file systems greater than 2 GB with a 4KB block size. So while one can increase the block size, doing so is not the norm, nor is it reflected in our Autosupport data.
8KB is a common (or the default) page size used by a number of applications that store data inside of files, such as SQL Server, Oracle Database, and others.
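Why does the request size matter so much for a protocol comparison? The same IOPS figure carries very different bandwidth depending on block size. A quick sketch with assumed round numbers (the 10,000 IOPS figure is purely illustrative, not from TR-3808):

```python
def throughput_mb_s(iops, block_bytes):
    """Bandwidth implied by a given IOPS rate at a given request size."""
    return iops * block_bytes / 1_000_000

# Doubling the request size doubles the bandwidth at the same IOPS,
# which is why High Performance Datasets with larger blocks stress
# links and queues very differently than 4K/8K shared datastores.
for block in (4096, 8192):
    print(f"{block} B requests: 10,000 IOPS = {throughput_mb_s(10_000, block):.1f} MB/s")
```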
@Don – You raise some valid points. The test bed does not exactly reflect a production environment in the areas of cluster configuration, how the datastores are shared, and the VM to datastore density. With this said, these sacrifices were made so that the performance engineering teams at NetApp and VMware could take an academic approach to measuring the performance of the storage protocols. They felt that this was the correct way to accomplish this goal. When you consider their view, I believe one can see how this makes sense.
If it helps, I believe we are planning an update to this report to include FCoE. I will share your thoughts with the team to see if they are open to modifying their test plans. I’d also share that we do have some other work in process that I believe will address your concerns, so please be patient.
Hari Kannan says
2 questions –
1. From the charts, it seems that 10GbE doesn’t deliver any better throughput, latency, or CPU utilization compared to Gigabit – is this a correct summary? If so, why is that, and why would a customer pay the premium?
2. It was also interesting to note that the tests were run on the 54xx series, not Nehalem – I would be interested in finding out whether there would be any differences (from a protocol comparison perspective, not ESX 3.5 vs. vSphere) if the tests were run on Nehalem.
Vaughn Stewart says
@Hari – Data travels at the same speed over 1GbE and 10GbE provided the bandwidth has not reached capacity. These tests spread the I/O over multiple links in the cluster, and as such there’s plenty of bandwidth available.
10GbE becomes critical in a few areas. First, in reducing the total number of cables connected to an ESX/ESXi host. Second, when considering I/O intensive applications, where without 10GbE one has to add more storage network links to the hosts.
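The “plenty of bandwidth available” point can be quantified with a back-of-the-envelope calculation — the numbers here are assumed raw line rates, ignoring Ethernet/TCP framing overhead, and are not measurements from the report:

```python
def iops_to_saturate(link_gbps, block_bytes):
    """Roughly how many IOPS of a given request size fill a link (raw line rate)."""
    link_bytes_per_s = link_gbps * 1e9 / 8   # bits/s -> bytes/s, overhead ignored
    return link_bytes_per_s / block_bytes

# Small-block random I/O takes a lot of IOPS to fill even a single 1GbE
# link, so spreading the load across multiple links leaves headroom --
# consistent with 1GbE and 10GbE posting similar results in these tests.
for gbps in (1, 10):
    print(f"{gbps} GbE: ~{iops_to_saturate(gbps, 8192):,.0f} IOPS of 8KB I/O to fill the pipe")
```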
I would not expect an I/O performance gain from a Nehalem chipset; however, I would expect the CPU utilization to be reduced. The latter has happened historically over the past decade.
Is 1GbE and 10GbE no different if the hosts are native, as opposed to going through a hypervisor? I suspect the upper bound is set by ESX/vSphere’s I/O engine, which simply doesn’t scale.
Ken Schroeder says
Is there any data for the impact of using Jumbo Frames on the Filer as far as reduction of CPU Utilization?
Ajf 4 says
I’m an SE working on a project with a NetApp partner, and we are building a new project to consolidate 200 servers using a FAS box and new HP blades. The customer wishes to continue using the iSCSI or NFS protocols, and we are trying to figure out the best way to leverage the embedded 10GbE that the new HP blades provide. Are there any TRs or WPs that we could use as a reference?
Michael Blake says
I see you used 2 Intel 82575EB Ethernet Controllers for your iSCSI and NFS tests (one for storage and one for VM
I don’t suppose you enabled VMDq on those NICs, as those specific chipsets only support 4 queues per port. Given that you used 8 servers x 4 ports, that would give you 32 queues, which would only be sufficient for your tests with 32 VMs.
Have you run any performance tests, or could you make an educated guess at what effect enabling VMDq would have had on