Welcome to the first session in a series of blog posts entitled ‘Storage 101 – Virtualization Changes Everything’.
Data deduplication was originally introduced in Data ONTAP 7.2 in July of 2006 as a technology that enhanced the NetApp Disk-to-Disk (D2D) backup offerings. Coincidentally, VMware released VI3 around the same time, in June of 2006, and shortly thereafter customers began using these two technologies together to virtualize their servers and storage footprint.
For today’s post we’re going to dive into the realm of storage efficiency and discuss data deduplication and its impact on production storage capacity and storage array cache, specifically within a virtual data center.
The Format of this Series
As I stated when I decided to kick off these posts, I want to share these topics with a wide audience. To do so, each session will begin with a high-level overview of the technology and its value, followed by the technical details.
It’s no secret that one of the top two challenges with server virtualization is the cost associated with storage. The act of virtualizing and consolidating servers from direct attached storage onto shared storage arrays results in a dramatic increase in the capacity requirements of a much more expensive form of storage media.
Eliminating this hurdle will allow customers to virtualize more of their infrastructure.
By enabling data deduplication, customers typically reduce their production storage requirements for VMware by 50% – 70%. These cost savings can literally finance further virtualization projects such as implementing DR with Site Recovery Manager or virtualizing a desktop environment.
Data deduplication has a pervasive effect within an infrastructure. By reducing the production footprint, less storage is required for backup, DR, and archival purposes. In addition, dedupe reduces WAN traffic when replicating a data set: a block already stored on the destination is not resent just because a second VM contains the same block.
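For readers who like to see an idea in code, the replication point can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not NetApp's replication protocol: the `replicate` function, the dictionary standing in for the destination array, and the use of SHA-256 as a block fingerprint are all hypothetical.

```python
import hashlib


def replicate(source_blocks, destination):
    """Send only blocks whose fingerprint the destination does not already hold.

    source_blocks: iterable of bytes objects (one per block).
    destination:   dict mapping fingerprint -> block, updated in place.
    Returns the number of blocks that had to cross the wire.
    """
    sent = 0
    for block in source_blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in destination:
            destination[fingerprint] = block  # block is new to the destination
            sent += 1
    return sent


# Two hypothetical VMs that share their OS and application blocks.
vm1 = [b"os", b"app", b"vm1-data"]
vm2 = [b"os", b"app", b"vm2-data"]

dr_site = {}
first = replicate(vm1, dr_site)   # 3 blocks sent
second = replicate(vm2, dr_site)  # 1 block sent, the shared blocks are skipped
```

The second VM costs one block of bandwidth rather than three, because the destination already holds everything the two VMs have in common.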
NetApp has extended this technology to also include the storage array cache. With Intelligent Caching, deduplicated data sets can actually outperform non-deduplicated data sets, because the effective capacity of the cache in the storage controller increases: a single cached block can serve reads from every VM that shares it. There’s more on this in the technical details section below.
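A toy simulation shows why sharing blocks helps the cache. This is a sketch under my own assumptions, not a model of the actual controller cache: the `LRUCache` class, the `boot_storm` scenario, and all of the sizes are hypothetical. The only idea taken from the text is that the cache is keyed by physical block, so VMs that share a deduplicated block also share its cache entry.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache that only counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, physical_id):
        if physical_id in self.entries:
            self.entries.move_to_end(physical_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.entries[physical_id] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict least recently used


def boot_storm(num_vms, os_blocks, cache_size, deduplicated):
    """Each VM reads the same OS blocks once; return (hits, misses)."""
    cache = LRUCache(cache_size)
    for vm in range(num_vms):
        for block in range(os_blocks):
            # With dedupe every VM points at the same physical block;
            # without it each VM has its own private copy.
            physical_id = block if deduplicated else (vm, block)
            cache.read(physical_id)
    return cache.hits, cache.misses


with_dedupe = boot_storm(10, 100, 100, deduplicated=True)      # (900, 100)
without_dedupe = boot_storm(10, 100, 100, deduplicated=False)  # (0, 1000)
```

With the same 100-block cache, the deduplicated case reads each block from disk once and serves the other nine VMs from cache, while the non-deduplicated case misses on every read.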
Data deduplication and Intelligent Caching are the cornerstones of NetApp’s storage efficiency technologies. Together they will reduce your storage costs, your bandwidth requirements for replication, and your DR and backup archive media requirements, while increasing array performance. Please don’t take my word for it, Google it! (I’d suggest skipping any vendor site and reading blog posts and reader comments.) Alternatively, take us up on our 50% storage savings guarantee. If we can’t deliver the savings, we’ll provide the additional storage you require at no cost.
Let’s Get into the Technical Details
NetApp storage arrays and enterprise class arrays virtualized by NetApp vSeries controllers address storage quite a bit differently than traditional (or legacy) storage array architectures. The difference is NetApp abstracts the data being served from the physical storage or disk drives.
This abstraction begins by grouping the physical disks (and their available IOPS) into aggregates, which are composed of one or more RAID groups. An aggregate is divided into one or more Flexible Volumes. The FlexVol is the layer where WAFL is implemented, and it is WAFL that enables pointer-based functionality such as Snapshots; replication; LUN, volume, and file clones; and others, including dedupe.
I believe the best place to start is by briefly explaining a NetApp Snapshot.
With all storage arrays data is stored in blocks; however, with WAFL the block layer is abstracted from the physical blocks, thus allowing ONTAP to manipulate the access and presentation of a data object. Creating a snapshot is one of these manipulations. A snapshot preserves the state of data at a point in time by creating an inode map that is identical to the production inode map and by locking the blocks associated with the snapshot.
When the production data is modified, it does not affect the snapshot. New data is written to free space, deleted data remains on disk because it is part of the snapshot, and the production inode map is updated.
This concept is very similar to the database concept of a view.
The storage array stores two versions of the data, the current and previous state, but does not duplicate any of the data that is in common between these two versions.
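The snapshot behavior described above can be sketched in a few lines of Python. This is a toy model under my own assumptions, not how WAFL is implemented: the `Volume` class, its method names, and the snapshot name `hourly.0` are hypothetical, and plain dictionaries stand in for the inode maps.

```python
class Volume:
    """Toy volume: one block store plus a pointer map per view of the data."""

    def __init__(self):
        self.blocks = {}      # physical block id -> bytes
        self.active = {}      # logical block number -> physical block id
        self.snapshots = {}   # snapshot name -> frozen copy of a pointer map
        self._next_id = 0

    def write(self, lbn, data):
        # New data always lands in a fresh physical block; nothing is
        # overwritten in place, so older pointer maps stay valid.
        pid = self._next_id
        self._next_id += 1
        self.blocks[pid] = data
        self.active[lbn] = pid

    def delete(self, lbn):
        # Only the active map forgets the block; a snapshot may still hold it.
        self.active.pop(lbn, None)

    def snapshot(self, name):
        # Copy the pointers, not the data.
        self.snapshots[name] = dict(self.active)

    def read(self, lbn, snapshot=None):
        pointer_map = self.active if snapshot is None else self.snapshots[snapshot]
        return self.blocks[pointer_map[lbn]]

    def blocks_in_use(self):
        # A physical block is held as long as any map references it.
        used = set(self.active.values())
        for pointer_map in self.snapshots.values():
            used.update(pointer_map.values())
        return used


vol = Volume()
vol.write(0, b"boot")
vol.write(1, b"data-v1")
vol.snapshot("hourly.0")
vol.write(1, b"data-v2")  # modify after the snapshot
vol.delete(0)             # delete after the snapshot
```

After these calls the production view returns `data-v2` for block 1 and no longer contains block 0, while the snapshot still returns `data-v1` and `boot`. Three physical blocks are in use, not four, because nothing that the two views have in common was copied.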
Dedupe is Similar to Snapshots in Reverse
Take a typical VMware environment where you have consolidated a number of VMs into a datastore. Any 4KB block of data that is stored in more than one location, within a single VM or across all of the VMs in the datastore, can be reduced to a single 4KB instance.
In a manner very analogous to the Snapshot model, dedupe allows multiple VMs to share the underlying storage blocks and consume only the storage required by their unique blocks.
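Here is a minimal Python sketch of that block sharing. It is an illustration under my own assumptions rather than the Data ONTAP implementation: the `DedupeStore` class and its fields are hypothetical, SHA-256 is simply a convenient fingerprint, and the dedupe happens as data is written only because that keeps the example short.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, as in the text


class DedupeStore:
    """Toy datastore in which identical 4KB blocks are stored once."""

    def __init__(self):
        self.by_fingerprint = {}  # fingerprint -> physical block id
        self.blocks = {}          # physical block id -> bytes
        self.refcount = {}        # physical block id -> number of references
        self.maps = {}            # VM name -> list of physical block ids

    def write_vm(self, vm, image):
        pointers = []
        for offset in range(0, len(image), BLOCK_SIZE):
            block = image[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).digest()
            pid = self.by_fingerprint.get(fingerprint)
            # Share an existing block only if the bytes really match.
            if pid is None or self.blocks[pid] != block:
                pid = len(self.blocks)
                self.blocks[pid] = block
                self.by_fingerprint[fingerprint] = pid
                self.refcount[pid] = 0
            self.refcount[pid] += 1
            pointers.append(pid)
        self.maps[vm] = pointers

    def read_vm(self, vm):
        # Each VM still sees its complete image through its own pointer map.
        return b"".join(self.blocks[pid] for pid in self.maps[vm])


os_block = b"A" * BLOCK_SIZE    # stands in for a shared OS block
free_block = bytes(BLOCK_SIZE)  # zero-filled free space
image1 = os_block + free_block + b"1" * BLOCK_SIZE
image2 = os_block + free_block + b"2" * BLOCK_SIZE

store = DedupeStore()
store.write_vm("vm1", image1)
store.write_vm("vm2", image2)
```

The two VMs present six logical blocks between them, but the store holds only four physical blocks: the OS block and the free-space block are each stored once and referenced twice.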
Some examples of where redundant data is within your VMware environment:
Redundancy within a VM:
• In Windows, DLLs are stored in multiple locations, including their operational location, the windows\system32\dllcache folder, any hotfix or service pack uninstall folder, and redundantly within various application folders.
Redundancy within a datastore:
• There are multiple VMs running the same OS, hotfixes, service packs, and enterprise management applications such as antivirus and SNMP monitoring tools.
• There can be multiple instances of the same application being run on multiple VMs.
• It is common to have different applications, running on different VMs, that were created with the same programming language, such as C++, and that call the same DLLs in their operation.
• Within every VMDK there is free (allocated but unused) space. All of these 4KB NTFS and EXT3 free blocks are identical.
• Every VM contains data that has been deleted but still resides in the guest OS (GOS) file system and VMDK (this is where undelete tools operate).
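To get a feel for how these sources of redundancy add up, the following Python sketch counts unique 4KB blocks across a set of images. It is a rough estimator under my own assumptions: the `dedupe_savings` function and the synthetic images are hypothetical, and real savings depend entirely on the data in your datastore.

```python
import hashlib

BLOCK_SIZE = 4096


def dedupe_savings(images):
    """Return (logical_blocks, unique_blocks, fraction_saved) for byte strings."""
    logical = 0
    unique = set()
    for image in images:
        for offset in range(0, len(image), BLOCK_SIZE):
            logical += 1
            unique.add(hashlib.sha256(image[offset:offset + BLOCK_SIZE]).digest())
    saved = 1 - len(unique) / logical if logical else 0.0
    return logical, len(unique), saved


def blk(value):
    """Build one synthetic 4KB block filled with a single byte value."""
    return bytes([value]) * BLOCK_SIZE


os_blocks = blk(1) + blk(2) + blk(3)  # same OS files in both VMs
free_space = blk(0) * 4               # zero-filled free blocks
vm_a = os_blocks + blk(10) + free_space
vm_b = os_blocks + blk(11) + free_space

logical, unique, saved = dedupe_savings([vm_a, vm_b])
```

For these two synthetic images the function finds 16 logical blocks but only 6 unique ones: three OS blocks, one free-space block, and one private block per VM.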