vStorage APIs for Array Integration, commonly known as VAAI, is a set of operations (primitives) aimed at offloading certain CPU- and network-intensive host operations to the storage array.
Atomic Test and Set (ATS)
The first primitive described is the Atomic Test and Set operation, sometimes also referred to as hardware locking. To understand it, we first need to understand how SCSI locking works. In the earlier SCSI standards (SPC-2), exclusive access to a device is granted by a SCSI RESERVE command. If a host issues a SCSI RESERVE command to a device and it succeeds, that host can be assumed to be the owner of the device. All other hosts that need access to the device must wait until the first host issues a SCSI RELEASE command. While a host holds an exclusive lock on a device, most commands issued from any other host, including READ and WRITE, result in a reservation conflict.
While the RESERVE and RELEASE commands together ensured data integrity, they led to performance problems, among other issues. For example, if two different hosts need to write to two different locations on disk, only one of the hosts (the lock holder) can complete its operation, while the other has to wait until the lock on the device is released.
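The effect of device-wide reservations can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class and method names (ReservedDevice, reserve, release, write) are hypothetical and do not correspond to any real SCSI library, and real devices report conflicts through SCSI status codes rather than exceptions.

```python
class ReservationConflict(Exception):
    """Raised when a host touches a device reserved by another host."""


class ReservedDevice:
    """Toy model of a LUN with SPC-2 style RESERVE/RELEASE (hypothetical API)."""

    def __init__(self, num_blocks):
        self.blocks = [b"\x00"] * num_blocks
        self.owner = None  # host currently holding the reservation, if any

    def _check(self, host):
        # Any command from a non-owner is rejected while a reservation exists.
        if self.owner is not None and self.owner != host:
            raise ReservationConflict(f"{host} blocked by {self.owner}")

    def reserve(self, host):
        self._check(host)
        self.owner = host

    def release(self, host):
        if self.owner == host:
            self.owner = None

    def write(self, host, lba, data):
        self._check(host)
        self.blocks[lba] = data


dev = ReservedDevice(num_blocks=8)
dev.reserve("host-a")
dev.write("host-a", 0, b"A")

try:
    dev.write("host-b", 7, b"B")  # a different LBA, but still rejected
    blocked = False
except ReservationConflict:
    blocked = True

dev.release("host-a")
dev.write("host-b", 7, b"B")      # succeeds once the reservation is gone
```

Note that host-b is blocked even though it writes to a completely different block: the reservation covers the whole device, which is the source of the performance problem.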
Later SCSI standards (SPC-3/SPC-4) improved the reservation model with what is commonly called Persistent Reservations. Briefly, this model extends exclusive access to allow for shared locking, so reads and writes from multiple hosts at the same time are now possible.
VMware ESXi does not use the persistent reservation model but the Atomic Test and Set operation. To take a lock, a SCSI COMPARE AND WRITE command is sent to the device. The COMPARE AND WRITE command specifies a location (LBA) on disk at which to write the data associated with the command (the write data). Usually this is the size of one logical block on disk. The command also carries a second set of data, which can be called the compare data. To execute a COMPARE AND WRITE command, the device first reads the logical block(s) at the specified LBA (the read data) and then does a byte-by-byte comparison of the read data with the compare data in the command. If the read data and the compare data are identical, the device writes the write data to the specified logical block. If they are not identical, the host is notified of the failure.
A host can now use the COMPARE AND WRITE command to achieve locking at the hardware level. A COMPARE AND WRITE success can be considered a successful lock, and a failure indicates that some other host has acquired the lock. This can be used for locking at the datastore level and even at the file system extent level. The performance benefits of such a locking mechanism over the RESERVE/RELEASE model are tremendous.
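The compare-then-write behaviour and the lock built on top of it can be sketched as follows. This is a minimal single-threaded model, not VMware's implementation: the Device class, the try_lock and unlock helpers, and the lock record layout (host name padded to one block) are all assumptions made for illustration. A real device performs the compare and the write as one atomic operation.

```python
BLOCK_SIZE = 512
FREE = bytes(BLOCK_SIZE)  # assumed "unlocked" pattern: an all-zero block


class Device:
    """Toy block device supporting a COMPARE AND WRITE style operation."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def compare_and_write(self, lba, compare_data, write_data):
        """Write only if the block currently equals compare_data."""
        if self.blocks[lba] != compare_data:
            return False  # miscompare: the host is told the command failed
        self.blocks[lba] = write_data
        return True


def lock_record(host_id):
    # Hypothetical lock record: the host name padded to one block.
    return host_id.encode().ljust(BLOCK_SIZE, b"\x00")


def try_lock(device, lba, host_id):
    # Succeeds only if the lock block is still in the FREE state.
    return device.compare_and_write(lba, FREE, lock_record(host_id))


def unlock(device, lba, host_id):
    # Succeeds only if this host's record is still in the lock block.
    return device.compare_and_write(lba, lock_record(host_id), FREE)


dev = Device(num_blocks=4)
got_a = try_lock(dev, 0, "host-a")      # host-a takes the lock at LBA 0
got_b = try_lock(dev, 0, "host-b")      # host-b loses the race for LBA 0
other_b = try_lock(dev, 1, "host-b")    # a different lock block is unaffected
released = unlock(dev, 0, "host-a")
retry_b = try_lock(dev, 0, "host-b")    # now host-b can take LBA 0
```

Because each lock lives in its own block, two hosts contending for different locks never interfere with each other, unlike a device-wide reservation.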
Block Zeroing / WRITE SAME
Prior to VAAI, when an ESXi host needed to fill locations on disk with zeros, it had to send the zero data in a WRITE command. For example, to zero out a 1 MB region on disk, a WRITE command is issued with the logical block address (LBA) and with zero data as the write data. Depending on the size of the region to be zeroed out, one or more WRITE commands had to be sent. But what if the array or disk itself could be instructed to perform this operation?
The SCSI command set includes a command called WRITE SAME (to be precise, WRITE SAME(16) is the one issued). Among other parameters, this command takes the logical block address and the transfer length. Along with the command, the host also sends a single block of data. When the device receives the command, it determines the number of blocks for the write operation from the transfer length and, starting from the logical block specified in the command, writes the single block of data to that number of blocks.
The benefits of block zeroing are quite obvious. Zero-filling the same 1 MB of data can be achieved with a single WRITE SAME(16) command and only 512 bytes of zero data sent to the device, reducing host CPU cycles and network bandwidth utilization.
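The arithmetic behind this saving can be worked through with a small sketch. The write_same function below is a hypothetical stand-in for the device-side behaviour, assuming 512-byte logical blocks as in the text; it is not a real SCSI API.

```python
BLOCK_SIZE = 512  # bytes per logical block, as assumed in the text


def write_same(blocks, lba, num_blocks, pattern):
    """Device-side sketch: replicate one block of data over a block range."""
    if len(pattern) != BLOCK_SIZE:
        raise ValueError("pattern must be exactly one logical block")
    for i in range(lba, lba + num_blocks):
        blocks[i] = pattern


region_bytes = 1024 * 1024                  # the 1 MB region to zero out
num_blocks = region_bytes // BLOCK_SIZE     # blocks covered by the command

disk = [b"\xff" * BLOCK_SIZE for _ in range(4096)]
write_same(disk, 0, num_blocks, bytes(BLOCK_SIZE))

payload_plain_write = region_bytes          # zero data sent with plain WRITEs
payload_write_same = BLOCK_SIZE             # zero data sent with WRITE SAME
reduction = payload_plain_write // payload_write_same
```

With these numbers the command covers 2048 blocks, and the data payload shrinks by the same factor of 2048, from 1,048,576 bytes to 512 bytes.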
Full Clone / XCOPY
Cloning and Storage vMotion are quite common operations in a virtual environment. Before VAAI, a clone involved the host reading data from the source VM datastore and writing that data to the target VM datastore. In many cases the source and destination datastores for the clone operation are within the same LUN or within the same array, so it makes sense to instruct the array to do the copy operation itself.
With VAAI this is achieved by the SCSI Third Party Copy (EXTENDED COPY) command. With the EXTENDED COPY command, an ESXi host gives the array the source logical block and the number of blocks to read, and the destination logical block and the number of blocks to write. With 512-byte sectors, the number of blocks to read and to write will be the same. As with the block zeroing primitive, host CPU cycles and network bandwidth utilization are significantly reduced.
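The difference between a host-driven copy and an offloaded copy can be sketched like this. Both functions are hypothetical illustrations operating on an in-memory list that stands in for the array; real EXTENDED COPY commands carry descriptors rather than simple arguments.

```python
BLOCK_SIZE = 512


def host_copy(array_blocks, src_lba, dst_lba, num_blocks):
    """Pre-VAAI style copy: every block travels to the host and back."""
    bytes_over_wire = 0
    for i in range(num_blocks):
        data = array_blocks[src_lba + i]      # READ: array -> host
        bytes_over_wire += len(data)
        array_blocks[dst_lba + i] = data      # WRITE: host -> array
        bytes_over_wire += len(data)
    return bytes_over_wire


def extended_copy(array_blocks, src_lba, dst_lba, num_blocks):
    """Offloaded copy: the array moves the data; the host sends only the request."""
    array_blocks[dst_lba:dst_lba + num_blocks] = \
        array_blocks[src_lba:src_lba + num_blocks]
    return 0  # no block data crosses the host/array link in this model


array = [bytes([i]) * BLOCK_SIZE for i in range(16)]
host_bytes = host_copy(array, src_lba=0, dst_lba=8, num_blocks=4)
offload_bytes = extended_copy(array, src_lba=0, dst_lba=12, num_blocks=4)
```

In this model the host-driven copy of four blocks moves each block twice (4 x 512 x 2 bytes), while the offloaded copy moves no block data through the host at all.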
Unmap / Block Discard
With the increasing number of thin-provisioned disks and SSDs, the unmap primitive becomes significant. A thin-provisioned disk only uses physical space when needed. That is, when a write is received for a new location on disk, the thin-provisioned disk allocates space on the physical disk to satisfy the write request. This allows for over-provisioning, i.e. presenting a disk with a capacity greater than what is physically available. Because of over-provisioning, such disks depend heavily on the host telling them whether a particular location on disk is still in use or not.
To explain this further, consider a file system created on a thin-provisioned disk. Now let's suppose you create a file and write about 1 GB of data to it. To satisfy the 1 GB of writes, the thin-provisioned disk allocates about 1 GB of physical space and writes the file data there. At a later point in time you delete this file. Usually a filesystem will delete the inode information for the file and any block allocation associated with it from within its own metadata. However, there is a problem here for the thin-provisioned disk. Since it has no knowledge that the 1 GB of file data has been deleted, the physical resources associated with that data are still considered to be in use. The SCSI standard has a command called UNMAP (the same is also possible through the WRITE SAME(16) command) for this purpose. Since the filesystem is aware that the 1 GB of data is no longer in use, it can inform the thin-provisioned disk by sending an UNMAP command (or a series of UNMAP commands) specifying which locations on disk are no longer in use. On receipt of this command, a thin-provisioned disk can release (a.k.a. discard) the physical resources associated with these locations.
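The file deletion scenario can be sketched with a toy thin-provisioned disk that allocates physical blocks only on first write. The ThinDisk class and its methods are hypothetical names for illustration, and the block counts are scaled down from the 1 GB example.

```python
class ThinDisk:
    """Toy thin-provisioned disk: physical blocks are allocated on first write."""

    def __init__(self):
        self.mapped = {}  # LBA -> data, one entry per allocated physical block

    def write(self, lba, data):
        self.mapped[lba] = data  # allocates a physical block if the LBA is new

    def unmap(self, lba, num_blocks):
        # Release the physical blocks backing the given LBA range.
        for i in range(lba, lba + num_blocks):
            self.mapped.pop(i, None)

    def used_blocks(self):
        return len(self.mapped)


disk = ThinDisk()

# The filesystem writes a 10-block file at LBAs 100..109.
for lba in range(100, 110):
    disk.write(lba, b"file data")
used_after_write = disk.used_blocks()

# The file is deleted: only filesystem metadata changes, the disk is not told,
# so it still considers all 10 physical blocks to be in use.
used_after_delete = disk.used_blocks()

# The filesystem now reports the freed range with UNMAP.
disk.unmap(100, 10)
used_after_unmap = disk.used_blocks()
```

Only after the unmap call does the disk's view of used space drop back to zero, which is exactly the information gap the UNMAP command closes.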
Thin Provisioning Stun
With a thin-provisioned disk, over-provisioning is an issue. When there are no longer any physical disk resources available to satisfy a write request, the disk is in an out-of-space condition. Remember that with a thin-provisioned disk one can present a 1 TB LUN backed by, say, only 100 GB of physical storage, so this can happen more frequently than one might imagine. When it happens, a thin-provisioned disk can inform an ESXi host of the out-of-space condition. This is done at the SCSI protocol level, where the thin-provisioned disk returns a particular error code (in technical terms, a CHECK CONDITION with a particular ASC and ASCQ).
At this point the VM can be paused until more physical resources are added to the thin-provisioned disk. This pausing is called TP-Stun in VAAI terms.
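The out-of-space and pause sequence can be sketched as follows. This is a simplified model under stated assumptions: ThinLun, VM, and OutOfSpace are hypothetical names, the exception stands in for the CHECK CONDITION returned by a real device, and the sizes are scaled down to a 10-block LUN backed by a single physical block.

```python
class OutOfSpace(Exception):
    """Stands in for the CHECK CONDITION a real device would return."""


class ThinLun:
    """Toy over-provisioned LUN: logical size exceeds physical capacity."""

    def __init__(self, logical_blocks, physical_blocks):
        self.logical_blocks = logical_blocks
        self.physical_blocks = physical_blocks
        self.mapped = {}

    def write(self, lba, data):
        if not 0 <= lba < self.logical_blocks:
            raise IndexError("LBA out of range")
        if lba not in self.mapped and len(self.mapped) >= self.physical_blocks:
            raise OutOfSpace("no physical blocks left")
        self.mapped[lba] = data

    def add_capacity(self, extra_blocks):
        self.physical_blocks += extra_blocks


class VM:
    """Toy VM that is paused, not failed, when its write hits out-of-space."""

    def __init__(self, lun):
        self.lun = lun
        self.state = "running"
        self.pending = None  # the write that triggered the pause

    def write(self, lba, data):
        try:
            self.lun.write(lba, data)
        except OutOfSpace:
            self.state = "paused"
            self.pending = (lba, data)

    def resume(self):
        if self.pending is not None:
            self.lun.write(*self.pending)  # retry the write that was held back
            self.pending = None
        self.state = "running"


lun = ThinLun(logical_blocks=10, physical_blocks=1)
vm = VM(lun)
vm.write(0, b"x")                 # fits in the single physical block
vm.write(1, b"y")                 # no space left: the VM is paused
state_when_full = vm.state

lun.add_capacity(1)               # the administrator adds physical storage
vm.resume()
state_after_resume = vm.state
```

The key design point is that the pending write is held and retried rather than failed, so the guest never sees an I/O error once capacity has been added.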
VAAI on NFS
With ESXi 5.x the VAAI primitives were extended to NFS datastores. VAAI over NFS requires vendor-provided plugins. The three primitives available with NFS datastores are:
- File Cloning
- Space Reservation
- Extended Statistics
The file cloning primitive is similar to the XCOPY/Full Clone primitive. There is, however, a significant difference. With block-level cloning, ESXi has to tell the array which blocks need to be cloned, because the array itself has no knowledge of which blocks must be copied to complete a VM clone. With NFS file cloning, since a VMDK is essentially a file, a clone is a single operation. The other difference, which is in favour of block-level cloning, is that with Storage vMotion a file clone can only be performed with the VM powered off. With storage array based datastores, full cloning is possible even when a VM is powered on.
Space reservation is used for reserving space on the datastore, similar to the eager zeroed or lazy zeroed options with storage array based datastores. However, it should be noted that, unlike with the block zeroing primitive, when data needs to be zeroed it still has to be sent over the wire.
The extended statistics primitive is used to extract more information about the datastore, such as how much physical storage space is used up, which is useful in available capacity calculations.
Benefits of VAAI
The performance benefits of VAAI are significant and cannot be ignored. VM cloning and Storage vMotion operations use a fraction of the host CPU and network resources. Many storage vendors have introduced VAAI support in their offerings. There are also open source implementations which now have VAAI support.