VDR Backup Failures

I’ve spent the last month dealing with VDR backup failures. Sometime around vSphere 4.1, VDR started failing on my domain controllers, then my SQL servers, then all Windows servers in general. The Linux servers and any other VMs backing up without application quiescing never skipped a beat. Here’s the dreaded error message from the VDR appliance:

Failed to create snapshot for specops-cmd, error -3960 ( cannot quiesce virtual machine)

or from the vCenter server itself:

Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.

DAMNED LIES! That wasn’t the real problem.

After a whole lot of testing, about 12 emails, 2 WebEx sessions, and 5 phone calls to support, I narrowed the problem down to several separate causes. Any one of the following will trigger this error message on a Server 2008 R2 VM backing up with the VMware Tools VSS driver.

  1. Independent Disks
  2. iSCSI Connections with MS iSCSI
  3. Running an older version of VMware Tools. Note that vCenter itself will tell you the tools are ‘OK’ unless you’re a major build behind. My systems were running b257589, and most of the VDR issues were resolved when I did an in-place upgrade to b299420.
  4. “Missing” drives in Disk Management. I had one system that kept adding ‘missing’ drives. The giveaway was that on every boot I’d get the message “VMware Customization in Progress”. Apparently the customization step never cleared sysprep out of the boot list. See the fix for that at my wiki here: vSphere on JP Wiki. Search for ‘customization’.
  5. There is a problem with the way VMware Tools calls for a VSS snapshot on systems running Active Directory (the NTDS writer). The writer will show “non-retryable error” after a backup attempt. This is a known issue slated to be fixed in the next release of VMware Tools. For now, create a text file called vmbackup.conf in “C:\programdata\vmware\vmware tools” containing one line: “NTDS”, without the quotes. This disables the NTDS writer for the backup, but it’s better than setting Disk.EnableUUID=false because the rest of your system will still be application-quiesced.
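
If you have more than a couple of domain controllers to patch up, the workaround in item 5 can be scripted. What follows is only a rough sketch in Python, not anything VMware ships. It assumes you've captured the text of `vssadmin list writers` from the guest, and that the output uses the usual "Writer name:" / "Last error:" layout. The function names, the sample output, and the one-writer-per-line format for multiple entries are my own assumptions; only the file name, the folder, and the single "NTDS" line come from the steps above.

```python
import re
import tempfile
from pathlib import Path

# Folder named in the post; adjust if your VMware Tools data lives elsewhere.
DEFAULT_DIR = Path(r"C:\ProgramData\VMware\VMware Tools")

# Illustrative sample only (not captured from a real system); it mimics the
# layout that `vssadmin list writers` typically prints.
SAMPLE = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'NTDS'
   State: [11] Failed
   Last error: Non-retryable error
"""


def failed_writers(vssadmin_output):
    """Return the names of writers whose 'Last error' is not 'No error'."""
    failed = []
    current = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name: '(.+)'", line)
        if match:
            current = match.group(1)
        elif line.startswith("Last error:") and current is not None:
            error = line.split(":", 1)[1].strip()
            if error.lower() != "no error":
                failed.append(current)
            current = None
    return failed


def write_vmbackup_conf(directory, writers):
    """Write vmbackup.conf listing the given writer names.

    Assumption: one writer name per line. The post only confirms the
    single-line "NTDS" case.
    """
    conf = Path(directory) / "vmbackup.conf"
    conf.write_text("\n".join(writers), encoding="ascii")
    return conf


if __name__ == "__main__":
    # Demo against a temp folder so nothing on the real system is touched.
    with tempfile.TemporaryDirectory() as tmp:
        bad = failed_writers(SAMPLE)
        path = write_vmbackup_conf(tmp, [w for w in bad if w == "NTDS"])
        print(bad, "->", path.name)
```

On a real guest you would pass `DEFAULT_DIR` instead of the temp folder and run it from an elevated prompt. I only write the file when NTDS is the writer that failed, since that is the one case this workaround is meant for.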

I hope this helps someone avoid the work of systematically ruling out the other 20+ suspected causes :).