Cannot start VM with vGPU – insufficient amount of graphics resources
One of my customers uses vGPU with a VMware Horizon infrastructure. We configured a desktop pool of 32 VMs with vGPU, but one of them, W10-214-9-VDI, would not start even though only 31 of the 32 VMs were running. There should therefore have been enough free resources on the graphics card for one more VM, yet every attempt to power it on ended with this error message:
The amount of graphics resource available in the parent resource pool is insufficient for the operation. Failed to start the virtual machine. Module DevicePowerOn power on failed. Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'grid_m10-2q'. No graphics device is available for vGPU 'grid_m10-2q'.
When I looked at the graphics card settings on the ESX node, I saw that one GPU had only three of its slots in use and one free. So why would the VM not boot when a slot was free?
I connected to the ESX node via SSH and ran the nvidia-smi command to look for "ghost virtual machines" on these GPUs. Interestingly, nvidia-smi reported 32 running VMs, although only 31 were actually running. That pointed to some sort of stuck process. After a while, I discovered that the process for W10-214-23-VDI was listed twice, so I filtered the output for it with the command: nvidia-smi | grep W10-214-23-VDI
I then terminated the stuck process with the "kill -9 PID" command, where PID is its process ID. That freed the GPU slot, and the problematic VM in VMware booted.
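The manual search for a VM name that appears twice in the nvidia-smi output could also be scripted. The following is only a minimal sketch in Python, not something I ran on the host. It assumes that each row of the nvidia-smi process table contains the GPU index, a PID, a process type (C, G or C+G), the VM name and a memory figure ending in MiB. The exact columns differ between driver versions, so the regular expression may need adjusting. The function name find_duplicate_vms is hypothetical.

```python
import re
from collections import defaultdict

# Assumed row layout of the nvidia-smi process table (varies by driver version):
#   | <gpu> [other columns] <pid> <type> <process name> <memory>MiB |
ROW = re.compile(
    r"^\|\s+(\d+)\s+(?:\S+\s+)*?(\d+)\s+(?:C\+G|C|G)\s+(.+?)\s+\d+MiB\s+\|\s*$"
)


def find_duplicate_vms(smi_output):
    """Return {vm_name: [pid, ...]} for every name listed more than once."""
    pids_by_name = defaultdict(list)
    for line in smi_output.splitlines():
        match = ROW.match(line.strip())
        if match:
            pid = int(match.group(2))
            name = match.group(3)
            pids_by_name[name].append(pid)
    # Keep only the names that occur in more than one process row.
    return {name: pids for name, pids in pids_by_name.items() if len(pids) > 1}
```

Feeding the text printed by nvidia-smi into find_duplicate_vms returns the suspicious VM names together with their PIDs. A name that shows up twice is only a hint: compare the result with the VMs that are really powered on before killing anything.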