Storage Architectures with LDVM: Faster, Less ComPlex


While attending the VMware TechSummit this week, someone asked why I thought storage was relevant in light of server virtualization, the virtual datacenter, cloud deployments, the software mainframe, etc…

OK – this individual didn't actually say "software mainframe," but I thought I'd inject it into their synonym roll call. Mmmm… Synonym Rolls. Sounds yummy! (OK – running on 4 hours of sleep may be impacting my sense of humor.)

In response to this question I replied, “There is one thing I can guarantee, which is by the time I finish this statement your company will have generated and stored more data than when I began.”

Consider the similarities with, and the diametrically opposed aspects of, the adoption rate of server virtualization and the data growth rate in your data center.

  • Server virtualization separates the compute layer from the physical layer, providing a means to enable application mobility
  • Data growth is rampant, commonly 50-100% year over year. A single copy is the minimum, redundant copies aren't free, and replication requires more bandwidth each year. Together these factors work to keep data sedentary
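To put 50-100% annual growth in perspective, here is a quick compound-growth calculation. It is only arithmetic on the growth rates quoted above. The 100 TB starting point is a hypothetical figure for illustration, not a number from any customer.

```python
import math


def project_capacity(current_tb: float, annual_growth: float, years: int) -> float:
    """Capacity after `years` of compound growth at `annual_growth` (0.5 means 50%)."""
    return current_tb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Years for data to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    start_tb = 100  # hypothetical starting footprint
    for rate in (0.5, 1.0):
        print(
            f"{rate:.0%} growth: doubles every {doubling_time_years(rate):.2f} years; "
            f"{start_tb} TB becomes {project_capacity(start_tb, rate, 3):.1f} TB in 3 years"
        )
```

At 50% growth, data doubles roughly every 1.7 years. At 100%, it doubles every year. Within three years, that hypothetical 100 TB becomes roughly 338 TB or 800 TB, respectively.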

So why do I, like most of the engineers at NetApp, find the storage aspect of data center virtualization interesting? Simple: we are developing ways to make application mobility a reality by breaking the shackles that restrict data to an individual datacenter.

Case in Point: Long Distance VMotion

I trust you are familiar with VMotion, Storage VMotion, and Long Distance VMotion. To level set, LDVM is the ability to complete either type of migration between two geographically separated data centers. There are numerous reasons why one would want to migrate a running VM from one data center to another, including but not limited to:

  • Data center maintenance without downtime
  • Disaster avoidance
  • Data center migration, consolidation, or expansion
  • Workload balancing

All of these use cases present the following challenges for most companies:

  1. Time required to copy data
  2. Cost of bandwidth to replicate data
  3. Cost to store multiple copies of data

Time

The largest challenge with migrating a running VM from one datacenter to another is the time it takes to complete the data copy portion of the migration. Quite frankly, the common VM is too large to be sent over the network in an acceptable amount of time for use with LDVM. VMs tend to range from tens to hundreds of GBs in size, and objects this large can't traverse most networks in the blink of an eye.
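A back-of-the-envelope sketch shows why. This Python function estimates raw copy time from VM size and link speed. It uses decimal units (1 GB = 8 Gb). The `efficiency` parameter is a hypothetical knob for protocol overhead and link contention, not something measured in our testing.

```python
def transfer_time_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds to push size_gb (decimal gigabytes) across a link_gbps link.

    efficiency (0, 1] is a hypothetical allowance for protocol overhead and
    contention; 1.0 means the copy gets the full line rate.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    # GB * 8 = gigabits; gigabits / (gigabits per second) = seconds
    return (size_gb * 8) / (link_gbps * efficiency)


if __name__ == "__main__":
    for size_gb in (50, 100, 500):  # "tens to hundreds of GBs"
        for link_gbps in (1, 10):
            minutes = transfer_time_seconds(size_gb, link_gbps) / 60
            print(f"{size_gb:>4} GB over {link_gbps:>2} Gbps: {minutes:6.1f} min")
```

Even at full line rate, a 100 GB VM needs 800 seconds (over 13 minutes) on a 1 Gbps link. That is before any overhead, and well beyond what a live migration can tolerate.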

Bandwidth

One method to solve the time issue is to increase the amount of bandwidth between the two datacenters. While this is an easy way to address the time issue, it introduces a new challenge: escalated OpEx. From what I understand, high-speed bandwidth isn't exactly inexpensive.

Storage

One can eliminate some of the bandwidth requirement by implementing a storage plex, aka a mirrored data set. In this architecture, data resides in both the production and the remote datacenter simultaneously. The goal of such an architecture is to eliminate the 'on demand' requirement to copy tens to hundreds of GBs during an LDVM migration.

With this design, one need only migrate the data that has yet to be copied between the two locations. However, there is an ongoing requirement to send every write between both data centers. This design also typically requires additional storage hardware.
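To make the trade-off concrete, here is a toy model of a mirrored volume. It assumes a synchronous mirror in which a write completes only once both sites hold it. The `Site` and `MirroredVolume` classes are hypothetical illustrations, not any vendor's API. The model shows that WAN traffic tracks the write rate rather than the data set size, and that both sites carry a full copy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    blocks: dict[int, bytes] = field(default_factory=dict)


@dataclass
class MirroredVolume:
    """Toy synchronous mirror (hypothetical): every write lands at both sites."""

    primary: Site
    remote: Site
    wan_bytes_sent: int = 0

    def write(self, block_no: int, data: bytes) -> None:
        self.primary.blocks[block_no] = data
        # Every write crosses the WAN, whether or not a migration ever happens.
        self.remote.blocks[block_no] = data
        self.wan_bytes_sent += len(data)


if __name__ == "__main__":
    vol = MirroredVolume(Site("A"), Site("B"))
    for i in range(1000):
        vol.write(i % 10, b"x" * 4096)  # overwrite the same 10 blocks repeatedly
    print(vol.wan_bytes_sent, len(vol.remote.blocks))
```

Even though only ten 4KB blocks ever exist, 1,000 overwrites push about 4 MB across the WAN. The remote site also has to hold a full copy of the volume.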

It seems this model replaces an OpEx issue with a CapEx issue due to the increased storage footprint.

To recap: we can solve LDVM time-related issues with increased bandwidth, and we can solve the increased bandwidth issue by increasing the storage footprint. Sounds to me like robbing Peter to pay Paul. If it sounds that way to you too, then welcome to my world!

Long Distance VMotion as designed by Cisco, NetApp, and VMware

The engineering teams at Cisco, NetApp, and VMware committed to tackling the issues limiting the adoption of LDVM, and I'd like to share with you what we have developed. Our joint solution supports Long Distance VMotion between data centers up to 400 km apart (roughly 250 miles).
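For a sense of what 400 km means on the wire, this sketch estimates round-trip propagation delay only. It assumes the common rule of thumb of about 5 microseconds per km in optical fiber (roughly 200,000 km/s). It also assumes the fiber path equals the straight-line distance and ignores switching and routing delays, so real-world RTTs will be higher.

```python
FIBER_US_PER_KM = 5.0  # assumption: ~200,000 km/s signal speed in optical fiber


def fiber_rtt_ms(distance_km: float) -> float:
    """Propagation-only round-trip time in milliseconds.

    Ignores switch/router latency and assumes the fiber path equals the
    straight-line distance, so real RTTs will be higher.
    """
    return 2 * distance_km * FIBER_US_PER_KM / 1000.0


def km_to_miles(km: float) -> float:
    return km / 1.609344


if __name__ == "__main__":
    print(f"400 km ~ {km_to_miles(400):.0f} miles, propagation RTT ~ {fiber_rtt_ms(400):.1f} ms")
```

At 400 km, that works out to roughly 4 ms of round-trip propagation delay before any equipment latency is added.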

The enabling technologies in our solution work so well, it appears they were designed for this purpose. Through Cisco's Data Center Interconnect (DCI) and Overlay Transport Virtualization (OTV), customers are able to deploy a layer 2 network that spans multiple geographically dispersed data centers. This allows non-disruptive migration of virtual machines via VMotion without having to handle network address changes in the guest, application, name services, clients, etc.

Putting this all together is a NetApp technology native to the FAS array known as FlexCache. FlexCache allows immediate access to VMs in data center A from data center B and vice versa (as it can be used bi-directionally between datasets on two arrays).

When an LDVM migration begins, only the data required to operate the VM in its running state is transferred from the source data center. After the required data is sent, the remaining data is 'trickled' to the remote array as network bandwidth allows.

Think of this model as 'storage migration on demand': the virtual disk is disassembled and the hot data is transferred first. Writes from the migrated VM are stored locally, and as the remaining bits are received from the source, the virtual disk file is 'reassembled' by the array.

I use the word reassembled here figuratively, not literally. With our pointer-based technologies like snapshot backups, file clones, data deduplication, and now caching, NetApp can provide individual access to multiple objects built from a shared common pool of globally unique 4KB blocks of data.
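The 'storage migration on demand' behavior described above can be modeled in a few lines of Python. This is a conceptual sketch, not FlexCache's implementation. The class and method names are hypothetical, and `fetch_from_source` stands in for whatever transport pulls a block from the source array. Reads of missing blocks are fetched immediately, writes land locally, and a `trickle` call copies remaining blocks within a per-call budget that represents spare bandwidth.

```python
from __future__ import annotations

from typing import Callable


class OnDemandVirtualDisk:
    """Hypothetical model of a virtual disk filling in at the destination array."""

    def __init__(self, num_blocks: int, fetch_from_source: Callable[[int], bytes]):
        self.num_blocks = num_blocks
        self.fetch = fetch_from_source
        self.local: dict[int, bytes] = {}
        self.fetched_on_demand = 0

    def read(self, block_no: int) -> bytes:
        if block_no not in self.local:
            # Hot data: the running VM needs it now, so pull it immediately.
            self.local[block_no] = self.fetch(block_no)
            self.fetched_on_demand += 1
        return self.local[block_no]

    def write(self, block_no: int, data: bytes) -> None:
        # Writes from the migrated VM stay local and are never fetched from the source.
        self.local[block_no] = data

    def trickle(self, budget_blocks: int) -> int:
        """Copy up to budget_blocks missing blocks in the background; return the count."""
        copied = 0
        for block_no in range(self.num_blocks):
            if copied >= budget_blocks:
                break
            if block_no not in self.local:
                self.local[block_no] = self.fetch(block_no)
                copied += 1
        return copied

    def complete(self) -> bool:
        return len(self.local) == self.num_blocks


if __name__ == "__main__":
    source = {i: bytes([i % 256]) * 4096 for i in range(100)}
    disk = OnDemandVirtualDisk(100, source.__getitem__)
    disk.read(7)                  # VM touches a hot block right after migrating
    disk.write(8, b"\0" * 4096)   # new write stays at the destination
    rounds = 0
    while not disk.complete():
        disk.trickle(budget_blocks=25)
        rounds += 1
    print(rounds, disk.fetched_on_demand)
```

The budget-per-call design is one simple way to express "as bandwidth allows." A real array would more likely throttle by throughput than by block count.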

This solution gets more elegant when more than a single VM is migrated, as NetApp storage platforms replicate only the 4KB blocks that are unique within a datastore, even when that data is duplicated within a single VM or across many VMs. This design significantly reduces bandwidth requirements.
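The idea of replicating only unique 4KB blocks can be sketched in Python. This is an illustration of block-level deduplication in general, not NetApp's algorithm. Using SHA-256 fingerprints as block identifiers is an assumption made for the sketch, and the function names are hypothetical. Each VM image becomes a list of fingerprints pointing into a shared block store, and only blocks the remote store lacks are sent.

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4096  # 4KB blocks


def split_blocks(data: bytes) -> list[bytes]:
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def replicate(
    vm_images: dict[str, bytes], remote_store: dict[str, bytes]
) -> tuple[dict[str, list[str]], int]:
    """Send only blocks whose fingerprint the remote store does not already hold.

    Returns per-VM block maps (lists of fingerprints) and bytes sent over the WAN.
    """
    block_maps: dict[str, list[str]] = {}
    bytes_sent = 0
    for vm, image in vm_images.items():
        fingerprints = []
        for block in split_blocks(image):
            fp = hashlib.sha256(block).hexdigest()
            if fp not in remote_store:
                remote_store[fp] = block  # unique block: crosses the WAN once
                bytes_sent += len(block)
            fingerprints.append(fp)       # duplicates become pointers only
        block_maps[vm] = fingerprints
    return block_maps, bytes_sent


def rebuild(block_map: list[str], remote_store: dict[str, bytes]) -> bytes:
    """Reassemble a VM image from its pointers into the shared block store."""
    return b"".join(remote_store[fp] for fp in block_map)


if __name__ == "__main__":
    base = b"".join(bytes([i]) * BLOCK_SIZE for i in range(50))  # 50 distinct blocks
    vms = {"vm1": base, "vm2": base[:-BLOCK_SIZE] + b"\xff" * BLOCK_SIZE}
    store: dict[str, bytes] = {}
    maps, sent = replicate(vms, store)
    print(f"logical {2 * len(base)} bytes, sent {sent} bytes")
```

In this example, two nearly identical VMs total 100 logical blocks, but only 51 blocks cross the wire. The second VM contributes just its one changed block.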

Speaking of data deduplication, did I ever mention we do that really well for VMFS, NFS, & RDM footprints? I only bring up dedupe in this post because the ability to reduce one's footprint by 60-70% (as we see with most virtualized servers) can actually make deploying storage at multiple sites affordable.
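The affordability argument is simple arithmetic. This sketch computes the physical capacity needed to hold a data set at one or more sites, given a dedupe savings ratio. The 100 TB logical size in the usage example is hypothetical. The 60% and 70% ratios are the figures quoted above.

```python
def physical_tb(logical_tb: float, savings: float, sites: int = 1) -> float:
    """Physical capacity needed when dedupe removes `savings` (0.6 = 60%) of the footprint."""
    if not 0 <= savings < 1 or sites < 1:
        raise ValueError("savings must be in [0, 1) and sites >= 1")
    return logical_tb * (1 - savings) * sites


if __name__ == "__main__":
    logical = 100.0  # hypothetical logical footprint in TB
    for savings in (0.0, 0.6, 0.7):
        print(f"{savings:.0%} savings, 2 sites: {physical_tb(logical, savings, sites=2):.1f} TB")
```

Under these assumptions, two full copies of 100 TB drop from 200 TB of physical storage to 80 TB at 60% savings, or 60 TB at 70%. Two deduplicated sites then cost less raw capacity than one site without dedupe.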

Sound Too Good To Be True?

This architecture was tested as part of the Cisco Validated Design process, and I'd like to share one of the completed tests, which is highlighted in our Application Mobility whitepaper. The goal of this particular test was to measure the time for the Long Distance VMotion migration to complete and the application performance in terms of operations per minute (OPM).

Microsoft SQL Server Test Bed:

  • Reinitialize Microsoft SQL Server by rebooting both the VMware ESX server on which it resides and the target VMware ESX server, to reset the statistics data
  • Start the Dell DVD Store client on a virtual machine that has IP connectivity to both VMware ESX servers
  • Run the Dell DVD Store client and wait 30 minutes for the client to attain a steady state; note the operations per minute (OPM) on that VMware ESX server
  • Migrate the system to the corresponding target
  • Wait 30 minutes for the client to attain steady state; note the OPMs for that VMware ESX server
  • Perform 18 more migrations with a 10-minute wait between each migration
  • Collect test statistics to evaluate the total elapsed time

I think you'll be surprised by the results… All migrations completed successfully, without disruption to the application or clients. At our maximum supported distance of 400 km, the worst-case scenario if you will, the systems tested incurred a 2.5-minute migration time and a performance degradation of a mere 3%.
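For clarity, "performance degradation" here is read as the relative drop in OPM from the pre-migration steady state to the post-migration steady state. This helper shows that calculation. The raw OPM values behind the 3% figure are in the whitepaper, not in this post, so any numbers you plug in are your own.

```python
def degradation_pct(baseline_opm: float, after_opm: float) -> float:
    """Relative drop in operations per minute, as a percentage of the baseline."""
    if baseline_opm <= 0:
        raise ValueError("baseline_opm must be positive")
    return (baseline_opm - after_opm) / baseline_opm * 100.0
```

A result of 3.0 means the application sustained 97% of its pre-migration OPM.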

Not too shabby, considering that not all of the data was located at the remote site at the time of migration.

One Detail I Should Call Out

I haven't yet mentioned that this solution currently supports only VMs on NFS datastores. The reason is that our FlexCache technology addresses file-level objects (which, on NFS, is what VMs are). Because VMotion is designed to migrate VMs and not datastores, we feel FlexCache is an ideal fit for LDVM.

As I've shared many times in the past, the primary difference between VMFS and NFS is that NFS allows any array-based storage virtualization to be accessed directly by the hypervisor, and that is exactly the case with this solution.

Wrapping It Up

The ability to migrate live VMs across data centers is becoming a reality. VMware has decoupled the application from the server, Cisco has enabled a single network to span multiple facilities, and NetApp is reducing the cost to store and replicate the data.

I hope this post has shared one of many examples of why I'm so nuts about integrating and advancing storage with VMware and VMware-based solutions. Trust me when I say this architecture is only the beginning. We are hard at work developing the next generation of our Cisco Validated Long Distance VMotion architecture. As we progress, I'll be sure to share the news with you.

If you’re interested in learning more, please check out our Application Mobility whitepaper.

I'm hungry, where are the Synonym Rolls?


1 Comment

  1. Dear VS Guy
    This week the Reagan Conference Center in D.C. hosts ‘Meritalk 1100’. Known previously as ‘Meritalk 932’, the count enumerates the number of US Govt data centers. Or a best guess.
    Preventing further proliferation, and maybe reverting back towards 932 or less, is of urgent national importance.
    Can LDVM from NetApp/Cisco/VMware enable this today?
