A “Multivendor Post” to help our mutual iSCSI customers using VMware


Today’s post is one you don’t often find in the blogosphere: it’s a collaborative effort initiated by Chad Sakac (EMC), with contributions from Andy Banta (VMware), Vaughn Stewart (NetApp), Eric Schott (Dell/EqualLogic), Adam Carter (HP/LeftHand), David Black (EMC), and various other folks at each of the companies.
Together, our companies make up the large majority of the iSCSI market, all make great iSCSI targets, and we (as individuals and companies) all want our customers to have iSCSI success.

I have to say, I see this one often – customers struggling to get high throughput out of iSCSI targets on ESX.   Sometimes they are OK with that, but often I hear this comment: "…My internal SAS controller can drive 4-5x the throughput of an iSCSI LUN…"

Can you get high throughput with iSCSI over GbE on ESX?   The answer is YES.  But there are some complications, and some configuration steps that are not immediately apparent.    You need to understand some iSCSI fundamentals, some Link Aggregation fundamentals, and some ESX internals – none of which are immediately obvious…

We could start this conversation by playing a trump card – 10GbE – but we’ll save that topic for another discussion.   Today 10GbE is relatively expensive per port and relatively rare, and the vast majority of iSCSI and NFS deployments are on GbE.  10GbE is supported by VMware today (see the VMware HCL), and all of the vendors here either have, or have announced, 10GbE support.

10GbE can support the ideal number of cables from an ESX host – two.   This reduction in port count can simplify configurations, reduce the need for link aggregation, provide ample bandwidth, and even unify FC using FCoE on the same fabric for customers with existing FC investments.   We all expect to see rapid adoption of 10GbE as prices continue to drop.  Chad has blogged on 10GbE and VMware.

This post is about trying to help people maximize iSCSI on GbE, so we’ll leave 10GbE for a followup.

If you are serious about iSCSI in your production environment, it’s valuable to do a bit of learning, and it’s important to do a little engineering during design. 

iSCSI is easy to connect and begin using, but like many technologies that excel in their simplicity, the default options and parameters may not be robust enough to provide an iSCSI infrastructure that can support your business.

This post starts with sections called “Understanding” which walk through protocol details and ESX software initiator internals.  You can skip them if you want to jump straight to the configuration options, but a bit of learning goes a long way toward understanding the why behind the how.


Understanding your Ethernet Infrastructure

• Do you have a “bet the business” Ethernet infrastructure?  Don’t think of iSCSI (or NFS datastore) use here as “it’s just on my LAN”, but as “this is the storage infrastructure that is supporting my entire critical VMware infrastructure”.   IP storage needs the same sort of design thinking that is applied to FC infrastructure.   Here are some things to think about:

1. Are you separating your storage and network traffic on different ports?   Could you use VLANs for this?  Sure.   But is that “bet the business” thinking?  Do you want a temporarily busy LAN to swamp your storage (and vice-versa) for the sake of a few NICs and switch ports?   If you’re using 10GbE, sure – but GbE?
2. Think about Flow Control (it should be set to receive on switches and transmit on iSCSI targets).
3. Enable spanning tree protocol with either RSTP or portfast enabled.
4. Filter / restrict bridge protocol data units on storage network ports.
5. If you want to squeeze out the last bit, configure jumbo frames (always end-to-end – otherwise you will get fragmented gobbledygook).
6. Use Cat6 cables rather than Cat5/5e.  Yes, Cat5e can work – but remember – this is “bet the business”, right?   Are you sure you don’t want to buy that $10 cable?
7. You’ll see later that things like cross-stack Etherchannel trunking can be handy in some configurations.
8. Each Ethernet switch also varies in its internal architecture – for mission-critical, network-intensive Ethernet purposes (like VMware datastores on iSCSI or NFS), the amount of port buffering and other internals matter – it’s a good idea to know what you are using.

• How many workloads (guests) are you running?   Both individually and in aggregate, are they typically random or streaming?   Random I/O workloads put very little throughput stress on the SAN network; conversely, sequential, large-block I/O workloads place a heavier load.   Likewise, be careful running single-stream I/O tests if your environment is multi-stream / multi-server – these types of tests are so abstract that they provide zero data relevant to the shared infrastructure you are building.

• In general, don’t view “a single big LUN” as a good test – all arrays have internal threads handling I/Os, and so does the ESX host itself (for VMFS and for NFS datastores).   In general, in aggregate, more threads are better than fewer.   You increase threading on the host by issuing more operations against that single LUN (or file system).   Every vendor’s internals are slightly different, but in general, more internal array objects are better than fewer – as there are more threads.

• Not an “Ethernet” thing, but while we’re on the subject of performance and not skimping: there’s no magic in the brown spinny things – you need enough array spindles to support the I/O workload.   Often there are not enough drives in total, or a specific sub-group of drives is under-configured.   Every vendor does this differently (aggregates / RAID groups / pools), but all have some sort of “disk grouping” out of which LUNs (and file systems in some cases) get their collective IOPs.
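As a rough illustration of the spindle-count point, here’s a minimal back-of-the-envelope sketch; the per-drive IOPS figures and RAID write penalties are common rules of thumb we’ve assumed for illustration, not vendor-specific numbers – use your storage vendor’s sizing tools for a real design.

```python
# Rough spindle-count estimate for a given workload.
# The per-drive IOPS and RAID penalties below are rule-of-thumb assumptions.
DRIVE_IOPS = {"15k_fc": 180, "10k_fc": 140, "7.2k_sata": 80}
RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def spindles_needed(total_iops, read_pct, drive="15k_fc", raid="raid5"):
    """Approximate number of spindles needed to service the workload."""
    reads = total_iops * read_pct
    writes = total_iops * (1 - read_pct)
    backend_iops = reads + writes * RAID_WRITE_PENALTY[raid]  # back-end I/Os after RAID penalty
    return int(-(-backend_iops // DRIVE_IOPS[drive]))          # ceiling division

# Example: 5000 host IOPS, 70% read, 15K drives in RAID5
print(spindles_needed(5000, 0.70, "15k_fc", "raid5"))  # -> 53 spindles
```

The point isn’t the exact numbers – it’s that the drive group backing your iSCSI LUNs needs to be sized for the workload, regardless of how fast the network path is.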


Understanding: iSCSI Fundamentals

We need to begin with some prerequisite nomenclature to establish a starting point.  If you really want the “secret decoder ring”, then start here.

This diagram is chicken scratch, but it gets the point across.   The red numbers are explained below.

[Diagram 1: iSCSI nomenclature]
1. iSCSI initiator = an iSCSI client.  It serves the same purpose as an HBA, sending SCSI commands and encapsulating them in IP packets.  It can operate in the hypervisor (in this case the ESX software initiator or a hardware initiator) and/or in the guests (for example, the Microsoft iSCSI initiator).
2. iSCSI target = an iSCSI server, usually on an array of some type.   Arrays vary in how they implement this.   Some have one (the array itself), some have many, some map them to physical interfaces, and some make each LUN an iSCSI target.
3. iSCSI initiator port = the end-point of an iSCSI session; this is not a TCP port.  After all the handshaking, the iSCSI initiator device creates and maintains a list of iSCSI initiator ports.   Think of the iSCSI initiator port as the “on ramp” for data.
4. iSCSI network portal = an IP address used by an iSCSI initiator, or an IP address and TCP port used by an iSCSI target.  Network portals can be grouped into portal groups (see Multiple Connections per Session).
5. iSCSI connection = a TCP connection; it carries control information, SCSI commands, and the data being read or written.
6. iSCSI session = one or more TCP connections that form an initiator-target session.
7. Multiple Connections per Session (MC/S) = iSCSI can have multiple connections within a single session (see above).
8. MPIO = multipathing, used very generally as a term – but it exists ABOVE the whole iSCSI layer (which in turn sits on top of the network layer), in the hypervisor and/or in the guests.   For example, when you configure ESX storage multipathing, that’s MPIO.  MPIO is the de facto load-balancing and availability model for iSCSI.
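To make the relationships concrete, here’s a tiny conceptual sketch in Python (purely illustrative – the class names and fields are ours, not part of any real iSCSI stack): a session belongs to one initiator/target pair and owns one or more TCP connections, and MPIO sits above the sessions.

```python
from dataclasses import dataclass, field
from typing import List

# Conceptual model only – shows how the iSCSI terms above relate,
# not how any real initiator is implemented.

@dataclass
class Connection:            # item 5: one TCP connection
    src_ip: str
    dst_ip: str
    dst_port: int = 3260     # the standard iSCSI TCP port

@dataclass
class Session:               # item 6: an initiator-target session
    initiator_iqn: str
    target_iqn: str
    connections: List[Connection] = field(default_factory=list)  # len > 1 = MC/S (item 7)

@dataclass
class MPIOPath:              # item 8: what the multipathing layer sees
    session: Session         # each session is an "on ramp" / path for MPIO

# The ESX 3.x software initiator: exactly one session per target,
# with exactly one connection inside that session (more on this below).
esx3_path = MPIOPath(Session(
    initiator_iqn="iqn.1998-01.com.vmware:esx1",
    target_iqn="iqn.1992-04.com.example:array.target0",
    connections=[Connection("10.0.0.10", "10.0.0.100")],
))
print(esx3_path)
```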


Understanding: Link Aggregation Fundamentals

The next core bit of technology to understand is Link Aggregation.   The group spent a fair amount of time going around on this as we were writing this post.   Many people jump to it as an “obvious” mechanism to provide greater aggregate bandwidth than a single GbE link can provide.

The core thing to understand (and the bulk of our conversation – thank you Eric and David) is that 802.3ad/LACP aggregates physical links, but the mechanisms used to determine whether a given flow of information follows one link or another are critical.

Personally, I found this doc very clarifying.  You’ll note several key things in this doc:

• All frames associated with a given “conversation” are transmitted on the same link to prevent mis-ordering of frames.   So what is a “conversation”?   A “conversation” is the TCP connection.

• The link selection for a conversation is usually done by hashing the MAC addresses or IP addresses.

• There is a mechanism to “move a conversation” from one link to another (for load balancing), but the conversation stops on the first link before moving to the second.

• Link Aggregation achieves high utilization across multiple links when carrying many conversations, and is less efficient with a small number of conversations (and provides no improved bandwidth with just one).   While Link Aggregation is good, it’s not as efficient as a single faster link.
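To see why a single iSCSI session can never spread across aggregated links, here’s a minimal sketch of the kind of hash-based link selection a switch or host might use. This is a deliberate simplification – real switches use vendor-specific hash algorithms (typically over MAC addresses, IP addresses, or both) – but the behavior it illustrates is the same:

```python
# Simplified illustration of 802.3ad-style link selection.
# Real switches use vendor-specific hashes; the point is only that the
# same source/destination pair always lands on the same physical link.

def select_link(src_ip: str, dst_ip: str, num_links: int) -> int:
    """Pick a link index by hashing the last octet of each IP (illustrative only)."""
    src_last = int(src_ip.split(".")[-1])
    dst_last = int(dst_ip.split(".")[-1])
    return (src_last ^ dst_last) % num_links

# One ESX host talking to one iSCSI target: every frame of that
# "conversation" hashes to the same link, however many links exist.
print(select_link("10.0.0.10", "10.0.0.100", 4))   # same index...
print(select_link("10.0.0.10", "10.0.0.100", 4))   # ...every time

# Multiple conversations (different target portals) can spread out:
for portal in ("10.0.0.100", "10.0.0.101", "10.0.0.102"):
    print(portal, "-> link", select_link("10.0.0.10", portal, 4))
```

With only one conversation (one TCP connection) there is exactly one hash result, and therefore exactly one link – which is the crux of the ESX 3.x discussion below.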

It’s notable that Link Aggregation and MPIO are very different.   Link Aggregation applies between two network devices only.    Link Aggregation can load balance efficiently when there are many TCP connections, but it is not particularly efficient or predictable when there are only a few.

Conversely, MPIO applies to an end-to-end iSCSI session – a full path from the initiator to the target.  It can load balance efficiently even with a small number of TCP sessions.   While Link Aggregation can be applied to iSCSI (as will be discussed below), MPIO is generally the design point for iSCSI multipathing.


Understanding: iSCSI implementation in ESX 3.x

The key to understanding the issue is that the ESX 3.x software initiator only supports a single iSCSI session with a single TCP connection for each iSCSI target. 

[Diagram 2: multiple iSCSI sessions (“purple pipes”) from initiator to target, with MPIO driving I/O down multiple active paths]

Making this visual: in the diagram above you can see that, in iSCSI generally, you can have multiple “purple pipes” (iSCSI sessions), each with one or more “orange pipes” (iSCSI connections), to any iSCSI target, and use MPIO with multiple active paths to drive I/O down all of them.

You can also have multiple “orange pipes” (iSCSI connections) in each “purple pipe” (a single iSCSI session) – Multiple Connections per Session, which effectively multipaths below the MPIO stack – as shown in the diagram below.

[Diagram 3: Multiple Connections per Session – several iSCSI connections (“orange pipes”) inside a single iSCSI session]

But in the case of the ESX software iSCSI initiator, you can only have one “orange pipe” for each “purple pipe” (green boxes marked 2), and only one “purple pipe” for every iSCSI target.    The end of the “purple pipe” is the iSCSI initiator port – and these are the “on ramps” for MPIO.

So, no matter what MPIO setup you have in ESX, it doesn’t matter how many paths show up in the storage multipathing GUI for a single iSCSI target, because there’s only one iSCSI initiator port and only one TCP connection per iSCSI target.   The alternate path only gets established after the primary active path becomes unreachable.   This is shown in the diagram below.

[Diagram 4: the ESX 3.x software initiator – a single session with a single connection per iSCSI target; the alternate path is standby only]

VMware can’t be accused of being unclear about this.   Straight from the iSCSI SAN Configuration Guide: “ESX Server‐based iSCSI initiators establish only one connection to each target.  This means storage systems with a single target containing multiple LUNs have all LUN traffic on that one connection.”   Yet in general, in my experience, this is relatively unknown.

This usually means customers find that for a single iSCSI target (and however many LUNs may be behind that target – one or more), they can’t drive more than 120-160MBps.
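As a sanity check on that range, here’s a quick back-of-the-envelope calculation (our own arithmetic, not from any vendor spec) of what one GbE link can carry per direction once Ethernet, IP and TCP overhead is accounted for (iSCSI PDU headers are negligible for large I/Os):

```python
# Approximate usable payload rate of one GbE link, per direction.
LINK_BPS = 1_000_000_000                      # 1 Gb/s
PREAMBLE, ETH_HDR, FCS, IFG = 8, 14, 4, 12    # per-frame Ethernet overhead (bytes)
IP_HDR, TCP_HDR = 20, 20

def payload_mbps(mtu: int) -> float:
    """Approximate TCP payload throughput in MB/s for a given MTU."""
    wire_bytes = PREAMBLE + ETH_HDR + mtu + FCS + IFG   # bytes on the wire per frame
    payload = mtu - IP_HDR - TCP_HDR                    # TCP payload carried per frame
    return (LINK_BPS / 8) * (payload / wire_bytes) / 1_000_000

print(round(payload_mbps(1500), 1))   # ~118.7 MB/s with standard frames
print(round(payload_mbps(9000), 1))   # ~123.9 MB/s with jumbo frames
```

Roughly 120 MB/s per direction is the ceiling; because the link is full duplex, a mix of reads and writes can push the combined total somewhat higher, which is consistent with the 120-160MBps range quoted above.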

This shouldn’t make anyone conclude that iSCSI is not a good choice or that 160MBps is a show-stopper.  For perspective, I was with a VERY big customer (more than 4000 VMs) two weeks ago, and their comment was that for their case (admittedly light I/O use from each VM) this was working well.  Requirements differ for every customer.

Now, this behavior will be changing in the next major VMware release.  Among other improvements, the iSCSI initiator will be able to use multiple iSCSI sessions (and hence multiple TCP connections).   Looking at our diagram, this corresponds to “multiple purple pipes” for a single target.   It won’t support MC/S, or “multiple orange pipes per purple pipe” – but in general this is not a big deal (large-scale use of MC/S has shown only marginally higher efficiency than MPIO, and only in very high-end 10GbE configurations).

Multiple iSCSI sessions will mean multiple “on-ramps” for MPIO (and multiple “conversations” for Link Aggregation).   The next version also brings core multipathing improvements via the vStorage initiative (improving all block storage): NMP round robin, ALUA support, and EMC PowerPath for VMware – which, in the spirit of this post, EMC is making as heterogeneous as we can.

Together, multiple iSCSI sessions per iSCSI target and improved multipathing mean aggregate throughput for a single iSCSI target above that 160MBps mark in the next VMware release – something people are already playing with now.

Obviously we’ll do a follow-up post.


(Strongly) Recommended Reading

1.  I would recommend reading a series of posts that the inimitable Scott Lowe has done on ESX networking, and start at his recap.

2.  Also – I would recommend reading the vendor documentation on this carefully.

VMware:
VMware: iSCSI SAN Configuration Guide

EMC:

Celerra VMware ESX Server Using EMC Celerra Storage Systems – Solutions Guide

CLARiiON:  VMware ESX Server Using EMC CLARiiON Storage Systems  – Solutions Guide

DMX: VMware ESX Server Using EMC Symmetrix Storage Systems – Solutions Guide

NetApp:
NetApp & VMware Virtual Infrastructure 3: Storage Best Practices (according to Vaughn, this is the most popular NetApp TR)

HP/LeftHand:
LeftHand Networks VI3 field guide for SAN/iQ 8 SANs

Dell/EqualLogic:
Network Performance Guidelines
VMware Virtual Infrastructure 3.x Considerations, Configuration and Operation Using an EqualLogic PS Series SAN


How do you get high iSCSI throughput in ESX 3.x?

As discussed earlier, the ESX 3.x software initiator really only works over a single TCP connection for each target – so all traffic to a single iSCSI target will use a single logical interface.  Without extra design measures, this limits the I/O available to each iSCSI target to roughly 120-160MBps of read and write access.

This design does not limit the total I/O bandwidth available to an ESX host configured with multiple GbE links for iSCSI traffic (or, more generally, VMkernel traffic) connecting to multiple datastores across multiple iSCSI targets – but it does limit a single iSCSI target unless you take extra steps.

Here are the questions that customers usually ask themselves:

Question 1:  How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure?   How do I scale that up?

Question 2:  If I have a single LUN that needs really high bandwidth – more than 160MBps and I can’t wait for the next major ESX version, how do I do that?

Question 3:  Do I use the Software Initiator or the Hardware Initiator?

Question 4:  Do I use Link Aggregation and if so, how?

Here are the answers you seek

Question 1:  How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure?   How do I scale that up?

Answer 1:   Keep it simple.   Use the ESX iSCSI software initiator.   Use multiple iSCSI targets.  Use MPIO at the ESX layer.  Add Ethernet links and iSCSI targets to increase overall throughput.   Expect no more than ~160MBps for a single iSCSI target.

Remember, an iSCSI session runs from initiator to target.  If you use multiple iSCSI targets, with multiple IP addresses, you will use all the available links in aggregate, and the storage traffic as a whole will load balance relatively well.  But any individual target will be limited to a maximum of a single GbE connection’s worth of bandwidth.

Also remember that this limit applies to all the LUNs behind that target – so distribute your LUNs among the targets accordingly.

The ESX initiator uses the same core method to get a list of targets from any iSCSI array (static configuration, or dynamic discovery using the iSCSI SendTargets request) and then a list of LUNs behind each target (the SCSI REPORT LUNS command).

So, to place your LUNs appropriately to balance the workload:

• On an EMC CLARiiON, each physical interface is seen by an ESX host as a separate target, so balance the LUNs behind your multiple iSCSI targets (physical ports).

• On a Dell/EqualLogic array, since every LUN is a target, balancing is automatic and you don’t have to do this.

• On an HP/LeftHand array, since every LUN is a target, balancing is automatic and you don’t have to do this.

• On a NetApp array, each interface is seen by an ESX host as a separate target, so balance your LUNs behind the targets.

• On an EMC Celerra array, you configure iSCSI targets and assign them to any virtual or physical network interface, then balance the LUNs behind the targets.
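As a trivial illustration of the “balance the LUNs behind your targets” advice (for the arrays above where targets map to interfaces), here’s a sketch that spreads LUNs round-robin across targets. The names are made up, and the real placement is done in your array’s management tools, not in code:

```python
from collections import defaultdict

# Hypothetical LUNs and iSCSI targets (e.g. one target per array interface).
luns = ["lun0", "lun1", "lun2", "lun3", "lun4", "lun5"]
targets = ["iqn.example:array1.portA", "iqn.example:array1.portB"]

placement = defaultdict(list)
for i, lun in enumerate(luns):
    placement[targets[i % len(targets)]].append(lun)   # simple round-robin spread

for target, assigned in placement.items():
    print(target, "->", assigned)

# Each target (and therefore each GbE path) now carries roughly half the LUNs,
# so the ~160MBps-per-target ceiling applies to half the workload, not all of it.
```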

Select your active paths in the VMware ESX multipathing dialog to balance the I/O across the paths to your targets and LUNs.   Also note that it can take up to 60 seconds for the standby path to become active, as the session needs to be established and the MPIO failover needs to occur – as noted in the VMware iSCSI SAN Configuration Guide.   There are some good tips there (and in the vendor best-practice docs) about extending guest timeouts to withstand the delay without a fatal I/O error in the guest.

[Screenshot: the VMware ESX multipathing dialog used to select active paths]
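As an example of the guest-timeout tuning mentioned above: on a Linux guest the SCSI disk timeout can be raised via sysfs so the guest rides out a path failover without a fatal I/O error. This is a minimal sketch assuming a 60-second value – check your storage vendor’s best-practice guide for the exact recommendation (Windows guests use the Disk TimeoutValue registry setting instead):

```python
import glob

# Raise the SCSI command timeout on every disk in a Linux guest so a
# ~60 second iSCSI path failover doesn't surface as a fatal I/O error.
# The recommended value varies by storage vendor – check their docs.
TIMEOUT_SECONDS = 60

for path in glob.glob("/sys/block/sd*/device/timeout"):
    with open(path, "w") as f:                 # requires root inside the guest
        f.write(str(TIMEOUT_SECONDS))
    print(path, "->", TIMEOUT_SECONDS, "seconds")
```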

Question 2:  If I have a single LUN that needs really high bandwidth – more than 160MBps and I can’t wait for the next major ESX version, how do I do that?

Answer 2:  Use an iSCSI software initiator in the guest along with either MPIO or MC/S.

This model allows the guest operating systems to be “directly” on the SAN and to manage their own LUNs.  Assign multiple vNICs to the VM, and map those to different pNICs.   Many of the software initiators in this space are very robust (like the Microsoft iSCSI initiator), and they provide their own guest-based multipathing and load balancing via MPIO (or MC/S) based on the number of NICs allocated to the VM.

As we worked on this post, all the vendors involved agreed – we’re surprised that it isn’t more popular.  People have been doing it for a long time, and it works, even through VMotion operations where some packets are lost (TCP retransmits them – iSCSI is ok with occasional loss, but constant losses slow TCP down – something to look at if you’re seeing poor iSCSI throughput).   

It has a big downside, though – you need to manually configure the storage inside each guest, which doesn’t scale particularly well from a configuration standpoint.  So most customers stick with the “keep it simple” method in Answer 1, and selectively use this approach for LUNs needing high throughput.

There are other bonuses too:

• This also allows host SAN tools to operate seamlessly – in both physical and virtual environments – with integration into databases, email systems, backup systems, etc.

• I suppose “philosophically” there’s something a little dirty about “penetrating the virtualization abstraction layer”, and yeah – I get why that philosophy exists.   But hey, we’re not really philosophers, right?   We’re IT professionals, and this works well 🙂

• It also allows the use of a different vSwitch and different physical network ports than the VMkernel, allowing for more iSCSI load distribution and separation of VM data traffic from VM boot traffic.

• It is notable that with this option SRM is not supported (SRM depends on LUNs presented to ESX, not to guests).

• It also has a couple of handy side effects:

• Dynamic and automated LUN surfacing to the VM itself (i.e. you don’t need to do anything in VirtualCenter for the guest to use the storage) – useful in certain database test/dev use cases.

• You can use it for VMs that require a SCSI-3 device (think Windows 2008 cluster quorum disks – though those are not officially supported by VMware even as of VI3.5 update 3)

Question 3:  Do I use the Software Initiator or the Hardware Initiator?

Answer 3: In general, use the Software Initiator except where iSCSI boot is specifically required.

Using the hardware initiator bypasses the ESX software initiator entirely.  Like the ESX software initiator, hardware iSCSI initiators use the ESX MPIO storage stack for multipathing – but they don’t have the single-connection-per-target limit.

But since you still have all the normal caveats of the ESX NMP software (an active/passive model with static, manual load balancing), this won’t increase the throughput for a single iSCSI target.

In general, across all the contributors from each company, our personal preference is to use the software initiator.   Why?   It’s simple, and since it’s used very widely, it’s very well tested and very robust.  It also has a clear 10GbE support path.


Question 4:  Do I use Link Aggregation and if so, how?

Answer 4: There are some reasons to use Link Aggregation, but increasing throughput to a single iSCSI target isn’t one of them in ESX 3.x.

What about Link Aggregation – shouldn’t that resolve the issue of not being able to drive more than a single GbE link’s worth of traffic to each iSCSI target?   In a word – NO.  A TCP connection will have the same IP addresses and MAC addresses for the duration of the connection, and therefore the same hash result.    This means that regardless of your link aggregation setup, in ESX 3.x, the network traffic from an ESX host to a single iSCSI target will always follow a single link.

So, why discuss it here?   While this post focuses on iSCSI, in some cases customers are using both NFS and iSCSI datastores.   In the NFS datastore case, MPIO mechanisms are not an option, and load balancing and HA are all about Link Aggregation.    So in that case, the iSCSI solution needs to work alongside the existing Link Aggregation.

Now, Link Aggregation can be used completely as an alternative to MPIO from the iSCSI initiator to the target.   That said, it is notably more complex than the MPIO mechanism, requiring more configuration, and isn’t better in any material way.

If you’ve configured Link Aggregation to support NFS datastores, it’s easier to leave the existing Link Aggregation from the ESX host to the switch in place, and then simply layer many iSCSI targets and MPIO on top (i.e. “just do Answer 1 on top of the Link Aggregation”).

To keep this post concise and focused on iSCSI, the multi-vendor team here decided to cut some of the NFS/iSCSI hybrid use case and configuration details, and leave those for a subsequent EMC Celerra/NetApp FAS post.

In closing

I would suggest that anyone considering iSCSI with VMware should feel confident that their deployments can provide high performance and high availability.  You would be joining many, many customers enjoying the benefits of VMware and advanced storage that leverages Ethernet.

To make your deployment a success, understand the “one link max per iSCSI target” ESX 3.x iSCSI initiator behavior.  Set your expectations accordingly, and if you have to, use the guest iSCSI initiator method for LUNs needing higher bandwidth than a single link can provide.   

Most of all ensure that you follow the best practices of your storage vendor and VMware.



