NVIDIA GRID™

The following guide outlines how to install the required NVIDIA GRID™ drivers to use vGPUs with a HyperCloud cluster.

End User install instructions

  1. To use vGPU, first acquire the GRID driver package, then copy the entire GRID .zip file onto the HyperCloud dashboard using scp. Once copied, run nvidia-grid-install <GRID-zip-file> on the dashboard; for example, nvidia-grid-install /var/cores/NVIDIA-GRID-Linux-KVM-535.154.02-535.154.05-538.15.zip.

  2. Run hypercloud-reboot-cluster-live -compute to reboot all the compute nodes. When a vGPU-capable node starts, the GRID kernel drivers are loaded automatically and three vGPUs are configured at 8 GB RAM each, with the well-known UUIDs 00000000-0000-0000-0001-000000000001, 00000000-0000-0000-0001-000000000002, and 00000000-0000-0000-0001-000000000003; this is the only GPU configuration option within HyperCloud. By policy, the last number grouping in the UUID indicates the vGPU ID (numbering starts at 1), and the second-to-last grouping indicates the GPU (numbering also starts at 1); e.g., the second-to-last grouping will be 0002 for vGPUs created on the second GPU.

    Note

    On HyperCloud nodes that do not contain a vGPU-capable GPU, nothing will occur; the GRID kernel drivers will not be loaded.
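The well-known UUID layout described in step 2 can be generated with a small helper rather than typed by hand. This is a sketch only; the vgpu_mdev_uuid function name is illustrative and not part of HyperCloud, and it simply assumes the zero-padded decimal groupings shown above.

```shell
#!/bin/sh
# Build the well-known mdev UUID for a given GPU and vGPU index
# (both 1-based), following the pattern described above:
#   00000000-0000-0000-<GPU as 4 digits>-<vGPU as 12 digits>
vgpu_mdev_uuid() {
    gpu="$1"
    vgpu="$2"
    printf '00000000-0000-0000-%04d-%012d\n' "$gpu" "$vgpu"
}

# First vGPU on the first GPU:
vgpu_mdev_uuid 1 1   # 00000000-0000-0000-0001-000000000001
# Third vGPU on the second GPU:
vgpu_mdev_uuid 2 3   # 00000000-0000-0000-0002-000000000003
```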

  3. Create a VM template in the HyperCloud GUI to use a vGPU.

    • From the menu on the left, click "Templates", then "VMs", then the "+" button, and select Create.
    • Then, complete the template creation as per the documentation.
    • Before finalizing the template, select the Tags tab and under "Raw Data", paste the following in the DATA field:

      <devices>
          <hostdev mode='subsystem' type='mdev' model='vfio-pci' >
              <source>
                  <address uuid='00000000-0000-0000-0001-000000000001'/>
              </source>
          </hostdev>
      </devices>
      
    • Modify the UUID to match the vGPU that a VM created from this template will use.

    • Click the green Create to finish creating the VM Template.

      Warning

      The UUID cannot be changed from the UI for an instantiated VM. It can only be changed in the template. A template will need to be created for each vGPU usage configuration; furthermore, a UUID cannot be duplicated across multiple instances.
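Since a separate template is needed for each vGPU UUID, the Raw Data snippet can be generated rather than hand-edited for every template. The following is a minimal sketch; the emit_vgpu_hostdev function name is illustrative, and the XML it emits is simply the snippet shown in step 3 with the UUID parameterized.

```shell
#!/bin/sh
# Emit the Raw Data <devices> block for a given mdev UUID,
# matching the snippet shown in step 3 above.
emit_vgpu_hostdev() {
    uuid="$1"
    cat <<EOF
<devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci' >
        <source>
            <address uuid='${uuid}'/>
        </source>
    </hostdev>
</devices>
EOF
}

# Raw Data for the third vGPU on the first GPU:
emit_vgpu_hostdev 00000000-0000-0000-0001-000000000003
```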

OS and driver

  1. Create a VM running Ubuntu Linux version 22.04 with a vGPU attached to it, as described in the instructions above.

    Note

    Only Ubuntu and RHEL are supported by NVIDIA.

  2. From inside the Ubuntu VM, run:

    apt update
    
  3. Download the NVIDIA GRID driver package from the NVIDIA NDA website.

  4. Run the following:

    Example

    The following shows an internal SoftIron repository for the drivers. Replace the URL as applicable.

    wget https://git.softiron.com/jenkins/cloud/nvidia_grid/NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
    mkdir nv
    cd nv
    unzip ../NVIDIA-GRID-Linux-KVM-535.161.08-538.46.zip
    cd Guest_Drivers/
    chmod 0644 ./nvidia-linux-grid-535_535.161.08_amd64.deb
    apt install ./nvidia-linux-grid-535_535.161.08_amd64.deb
    

    apt may print a warning about running as root; this warning can safely be ignored.

  5. Run nvidia-smi to confirm that the driver has loaded.

    nvidia-smi
    
  6. Once the VM has been instantiated from the recently created vGPU template, use lspci to confirm that the vGPU has been passed through to the VM.

    Example

    root@ubuntu-vm:~# lspci -d 10de:
    00:05.0 VGA compatible controller: NVIDIA Corporation GA102GL [A10] (rev a1)
    
  7. If the GRID driver has not already been installed (step 4), it can be installed onto the VM by following NVIDIA's documentation.

    • On Ubuntu, extract the GRID zip file and install the .deb found in Guest_Drivers; similarly, on RHEL, install the .rpm from the same directory.

Once the installation is complete, the command nvidia-smi will show the vGPU:

Make note of the supported CUDA version in the top right corner.

root@ubuntu-vgpu:~/gpu-burn# nvidia-smi
Thu Apr 11 13:29:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10-8Q                  On  | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI       PID   Type   Process name                             GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Of note

The vGPU shows 8192 MiB of RAM.

End User uninstall instructions

Run the following command from the Dashboard:

nvidia-grid-uninstall

CUDA Install

  • Download CUDA from NVIDIA at https://developer.nvidia.com/cuda-downloads?target_os=Linux.
    • Select the OS and make sure to select "deb (network)".
  • Run the "Base Install" instructions from the web page, but note that the last line installs a specific CUDA version. For HyperCloud 2.3.x this must be changed to match the version shown by the nvidia-smi command (in this case, 12.2); the last line of the instructions therefore becomes (instead of 12-4):
sudo apt-get -y install cuda-toolkit-12-2
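The version-matching rule above can also be scripted, so the toolkit package name always tracks whatever CUDA version nvidia-smi reports. This is a sketch under the assumption that NVIDIA's cuda-toolkit-X-Y package naming holds; the cuda_toolkit_pkg helper name is illustrative.

```shell
#!/bin/sh
# Map the CUDA version shown in the nvidia-smi banner (e.g. "12.2")
# to the matching apt package name (e.g. "cuda-toolkit-12-2").
cuda_toolkit_pkg() {
    echo "cuda-toolkit-$(echo "$1" | tr . -)"
}

cuda_toolkit_pkg 12.2   # prints: cuda-toolkit-12-2
# On the VM, the result feeds straight into the install line:
#   sudo apt-get -y install "$(cuda_toolkit_pkg 12.2)"
```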

GPU-Burn install

  • Download and build GPU-burn

    git clone https://github.com/wilicc/gpu-burn.git
    cd gpu-burn
    make COMPUTE=86
    
  • Run GPU-burn with:

    ./gpu-burn 600
    

    600 is the number of seconds to run the program.

Info

For a licensed GPU on HyperCloud 2.3 on HC41XXX family nodes, the benchmark numbers should be around 13900 Gflop/s (as reported by gpu-burn).

License setup

The NVIDIA vGPU driver requires a license from NVIDIA in order to operate. Unlicensed vGPUs will run at full speed for 20 minutes, then throttle to a lower speed.

See: https://docs.nvidia.com/grid/13.0/grid-licensing-user-guide/index.html.

  • Go to: https://ui.licensing.nvidia.com/ to set up a license server and retrieve the license.

    Info

    Setting up the license server is outside the scope of this document and requires an NVIDIA NDA account. Note that the new method of licensing does not involve setting up a local server and installing Tomcat. If you are looking at documents which describe this, they are out of date and will not work.

  • The easiest way to handle licensing is to set up a license server that runs on NVIDIA's cloud, which NVIDIA has made straightforward to do. This requires the VM to have access to the internet. If that is not possible, NVIDIA also allows a local license server to be run on your network (though it is not the old-style Tomcat server); setting one up is likewise outside the scope of this document.

  • After the license server has been set up, licenses will need to be assigned. When assigning licenses, the server must be Stopped. The license type required for the current HyperCloud vGPU configuration is RTX Virtual Workstation; no other license types will work.

  • Select the license server, click the green Actions button at the top right, and select "Download Configuration Token". This provides a .tok file, which the VM's GRID driver uses to acquire a license from NVIDIA's server.

  • Copy this file onto the VM in the /etc/nvidia/ClientConfigToken/ directory, then run:

    systemctl restart nvidia-gridd.service
    
  • Wait for ~10 seconds then run:

    systemctl status nvidia-gridd.service
    

The session should resemble the following:

root@ubuntu-vgpu:~/gpu-burn# systemctl restart nvidia-gridd.service
root@ubuntu-vgpu:~/gpu-burn# sleep 10
root@ubuntu-vgpu:~/gpu-burn# systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
    Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
    Active: active (running) since Thu 2024-04-11 14:00:58 UTC; 9s ago
    Process: 28237 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
   Main PID: 28238 (nvidia-gridd)
    Tasks: 4 (limit: 19140)
    Memory: 1.4M
        CPU: 213ms
    CGroup: /system.slice/nvidia-gridd.service
            └─28238 /usr/bin/nvidia-gridd

Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Starting NVIDIA Grid Daemon...
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Started (28238)
Apr 11 14:00:58 ubuntu-vgpu systemd[1]: Started NVIDIA Grid Daemon.
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: vGPU Software package (0)
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Ignore service provider and node-locked licensing
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: NLS initialized
Apr 11 14:00:58 ubuntu-vgpu nvidia-gridd[28238]: Acquiring license. (Info: api.cls.licensing.nvidia.>
Apr 11 14:01:00 ubuntu-vgpu nvidia-gridd[28238]: License acquired successfully. (Info: api.cls.licen>

  • Ensure the output relays a successful acquisition; the status can then be further verified with nvidia-smi:
root@ubuntu-vgpu:~/gpu-burn# nvidia-smi -q |grep License
    vGPU Software Licensed Product
    License Status                  : Licensed (Expiry: 2024-4-12 14:1:0 GMT)

The NVIDIA license server's website will also display the license as in use.
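For scripted checks (e.g. in monitoring), the licensed state can be tested from the nvidia-smi -q output shown above. This is a minimal sketch; the licensed_from_query function name is illustrative, and the grep pattern assumes the "License Status ... : Licensed" line format shown in the example output.

```shell
#!/bin/sh
# Succeeds (exit 0) when the query output piped in on stdin
# reports a licensed vGPU, as in the nvidia-smi -q example above.
licensed_from_query() {
    grep -q 'License Status[[:space:]]*: Licensed'
}

# Usage on a live VM:
#   nvidia-smi -q | licensed_from_query && echo "vGPU licensed"
```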

  • Run the gpu-burn command again to verify that the benchmark is as expected (~13900 for an NVIDIA A10 GPU on HC41XXX family nodes).