Currently I am working for one of our clients, a service provider that offers a self-service vRA portal. Most of the active tenants deploy regular Windows-based Virtual Machines, while other tenants deploy Linux-based Virtual Machines. This last group also tends to deploy Docker/Kubernetes containers and applications. It is not always visible up front what a specific tenant is deploying or running inside its Virtual Machines. This client uses Veeam as its backup solution, and as you read on you will see that backups of some tenants' Virtual Machines can unexpectedly fail for what at first looks like a strange reason. This is how it went...
While looking through the status of our tenants' backup jobs, I suddenly saw a strange message. After a couple of months with no errors or warnings at all, some of the Virtual Machines came back with a "failed" status inside the jobs.
To be specific, it is the following error message:
10/10/2017 9:47:27 AM :: Failed to create VM snapshot. Error: CreateSnapshot failed, vmRef vm-27395, timeout 1800000, snName VEEAM BACKUP TEMPORARY SNAPSHOT, snDescription Please do not delete this snapshot. It is being used by Veeam Backup., memory False, quiesce True
10/10/2017 9:47:40 AM :: Error: An error occurred while saving the snapshot: Failed to quiesce the virtual machine.
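If you want to scan a larger set of job logs for this same failure, the two lines above can be picked apart with a small script. This is only a sketch: the function name `parse_line` is made up for illustration, the line layout is assumed from the two lines shown here, and the date is assumed to be month/day/year.

```python
import re
from datetime import datetime

# Layout assumed from the two log lines above: "<timestamp> :: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) :: (?P<msg>.*)$"
)
VMREF_RE = re.compile(r"vmRef (?P<vmref>vm-\d+)")


def parse_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    msg = match.group("msg")
    vm = VMREF_RE.search(msg)
    return {
        # Assumption: dates are month/day/year
        "timestamp": datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p"),
        "vm_ref": vm.group("vmref") if vm else None,
        "quiesce_requested": "quiesce True" in msg,
        "quiesce_failed": "Failed to quiesce the virtual machine" in msg,
    }
```

Any line where `quiesce_failed` is true points at the quiescing problem described in the rest of this post.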
At first glance, this error message suggests that something is wrong with VMware Tools running inside the guest OS of the Virtual Machine. Further investigation showed the following:
I contacted the tenant to ask whether they were seeing anything unusual on the Virtual Machine. Even after a reboot of the Virtual Machine, the same error came back when the Veeam job ran again.
Most of the time, one of two things is wrong:
1.) There could be a problem with VMware Tools.
Normally a reboot or an upgrade of VMware Tools does the trick.
2.) The tenant is running Docker/Kubernetes on the Virtual Machines.
In that particular case, there might be a problem with the FIFREEZE/FITHAW ioctls within the guest OS of the Virtual Machine (CentOS in this case).
VMware Tools uses these calls to quiesce the guest file systems, which the Veeam backup (snapshot) requires. In most cases the guest OS is one that is not Docker/Kubernetes aware and is therefore not supported, probably with a kernel at version 2.6.32-24 or lower.
Contact with the tenant confirmed that this was indeed the case.
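To check a guest against the kernel versions mentioned in this post, something like the following could be run inside the Virtual Machine. It is a minimal sketch: the helper names are hypothetical, the threshold is simply the 2.6.35-22 figure from this post, and distribution kernels often backport fixes, so the version number alone is only a hint.

```python
import re

# Threshold taken from this post: kernels from 2.6.35-22 onward are expected to work
FIXED_KERNEL = (2, 6, 35, 22)


def kernel_tuple(release):
    """Turn a release string such as '2.6.32-24-generic' into (2, 6, 32, 24)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing build number (e.g. plain '4.4.0') is treated as 0
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def predates_fix(release):
    """True when the kernel is older than the threshold above."""
    return kernel_tuple(release) < FIXED_KERNEL
```

On the guest itself, `predates_fix(platform.release())` (after `import platform`) gives the answer for the running kernel.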
So, how can we solve this problem, so that the tenant has a reliable backup again?
1.) Upgrade the kernel to 2.6.35-22 or higher.
That could take a while and is, in my opinion, quite intrusive from a tenant's perspective.
2.) Open the tools.conf file (in the /etc/vmware-tools directory) with a text editor. (If this file does not exist, just create it.)
Append the following lines:
[vmbackup]
enableSyncDriver = false
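If the same change has to be made on several Virtual Machines, the edit can be scripted. The sketch below is one possible approach, not a tested rollout tool: the function name is made up, it assumes the setting belongs under a [vmbackup] section (which is where VMware's documentation places it), and it assumes any existing tools.conf is plain INI that Python's configparser can read.

```python
import configparser
from pathlib import Path


def disable_sync_driver(conf_path):
    """Set enableSyncDriver = false under [vmbackup], creating the file if needed."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the option name's capitalisation as written
    path = Path(conf_path)
    if path.exists():
        parser.read(path)  # note: comments in an existing file are lost on rewrite
    if not parser.has_section("vmbackup"):
        parser.add_section("vmbackup")
    parser.set("vmbackup", "enableSyncDriver", "false")
    with path.open("w") as handle:
        parser.write(handle)
```

Calling `disable_sync_driver("/etc/vmware-tools/tools.conf")` as root would apply it on a guest; take a copy of the file first if it already has content.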
A quick look at the status of the job showed me that this was indeed the quickest and most reliable fix. The job ended successfully.