Over the weekend that followed the customer contacted me to say that VeeamOne was reporting issues on all four hosts.
Alarm: ESX(i) host storage warning
Time: 14/10/2017 09:50:11
Details: Fired by event: esx.problem.visorfs.ramdisk.full
Event description: The ramdisk 'root' is full. As a result, the file /scratch/log/vpxa.log could not be written.
I used this VMware KB article to trouble shoot.
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2001550
I used vdf -h to see the partition utilization and could see straight away that the Root partition was indeed full.
If the root partition becomes full the host can make vMotion fail, make the host slow and unresponsive or even worse PSOD!
Next I looked in the VMKernel log.
[root@esxi1:~] tail /var/log/vmkwarning.log
2017-10-14T08:51:20.082Z cpu31:65816)WARNING: VisorFSRam: 353: Cannot extend visorfs file /scratch/log/equallogic/ehcmd.log because its ramdisk (root) is full.
VMKernel was telling me that the EqualLogic MEM log couldn't be written, given the MEM module was the last thing to be updated I suspected this was the issue.
[root@esxi1:/scratch/log/equallogic] ls -lah
total 24400
d-wxrw--wt 1 root root 512 Oct 13 10:43 .
drwxr-xr-x 1 root root 512 Oct 14 08:42 ..
-rw-rw-rw- 1 root root 23.8M Oct 14 10:06 ehcmd.log
an 'ls' showed that the log was 24MBs in size, given the root RAMdisk is only 32MBs this is a problem.
I used WinSCP to copy the file off the ESXi host for further examination. To free up some space and get us out of the immediate issue I removed the log using the command below.
rm /scratch/log/equallogic/ehcmd.log
I then used vdf -h to check the free space which was back down to a normal value.
Sadly after 24hrs the problem was back!
There are two options here, first is to limit the EqualLogic log size from the default of 50MBs to something more like 15MBs, the second option is to move the /scratch area.
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033696
I created a folder on VMFS6-1 called .locker-esxi1 and used a SSH session to configure the path using the commands below, note the hosts need a reboot to action the changes:
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/VMFS6-1/.locker-esxi1
vim-cmd hostsvc/advopt/view ScratchConfig.ConfiguredScratchLocation
Fortunately the customer has VeeamOne, otherwise I'm not sure how this problem would have been picked up as by default vCenter doesn't alert on this.