Appendix G - Setting Up and Testing a GPGPU¶
Requirements for GPGPU Testing¶
- SUT prepared for testing as described in this document
- NVIDIA GPGPU(s) installed in the SUT. At this time, only NVIDIA GPGPUs are supported for Certification Testing.
- Internet connection. The SUT must be able to reach the Internet in order to download a significant number of packages from the NVIDIA repositories.
- The checkbox-provider-gpgpu package installed. Type sudo apt install checkbox-provider-gpgpu after deploying the node. This package is installed from the Certification PPA, which should be enabled when you deployed the node or installed Checkbox manually.
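If you are not sure whether the Certification PPA is enabled, a quick apt-cache query (a standard apt tool, shown here only as a suggestion) will report which archive the package would be installed from:

apt-cache policy checkbox-provider-gpgpu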
Setting Up a GPGPU for Testing¶
New test cases have been added to test that NVIDIA GPGPUs work with Ubuntu. With this addition, GPGPUs can be certified on any Ubuntu LTS release or point release starting with Ubuntu 18.04 LTS using the 4.15 kernel.
The tool to set up the GPGPU environment for testing is included in the checkbox-provider-certification-server package and is installed any time the Server Certification suite is installed on a SUT for testing.
To set up the GPGPU, you simply need to run the following:
sudo gpu-setup.sh
This will add the NVIDIA repository and GPG key to the Ubuntu installation on the SUT, update the Apt cache, and install the CUDA Toolkit and the appropriate NVIDIA drivers for the GPGPUs installed in the SUT. It will also download the source for a tool called gpu-burn, an open-source stress test for NVIDIA GPGPUs. The script will then compile the gpu-burn tool and exit.
Once the script is complete, you must reboot the SUT to ensure the correct NVIDIA driver is loaded.
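After the reboot, it is worth confirming that the NVIDIA driver actually loaded before continuing. A minimal check using standard tools (the exact output will vary with your driver version and hardware):

lsmod | grep ^nvidia
nvidia-smi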
GPGPUs that use NVLink¶
Some NVIDIA GPUs use NVLink for inter-device communication. NVLink is a high-bandwidth, energy-efficient interconnect technology developed by NVIDIA, aimed at replacing the traditional PCIe method of data transfer between the CPU and GPU or between multiple GPUs. Server configurations that use NVLink to connect multiple GPUs require extra configuration before testing can be performed. Failure to configure NVLink on systems where it is in use will cause the GPU tests to fail.
You must configure NVLink before launching tests. The following steps are provided as a guideline and as a general reference to the steps necessary to configure NVLink. These steps are not guaranteed to work in all cases, as they depend somewhat on specific driver versions, tool versions, and so on, which can change over time. It is expected that you understand how to configure your own hardware prior to testing.
Documentation and downloads for NVIDIA’s Data Center GPU Manager (DCGM) can be found at https://developer.nvidia.com/dcgm/.
The following steps should be performed after running gpu-setup.sh and rebooting the machine to ensure that the correct NVIDIA driver has been loaded and the GPUs are accessible.
Determine which driver version you are using:
# modinfo nvidia | grep -i ^version
version: 525.105.17
You’re looking for the major version, in this example, 525.
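If you prefer, nvidia-smi can report the same information once the driver is loaded; --query-gpu=driver_version is a standard nvidia-smi query:

# nvidia-smi --query-gpu=driver_version --format=csv,noheader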
Install the datacenter-gpu-manager, fabricmanager, and libnvidia-nscq packages appropriate for your driver version:
# sudo apt install nvidia-fabricmanager-525 libnvidia-nscq-525 datacenter-gpu-manager
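If you are unsure which versioned packages exist for your driver, searching the archive first can help (standard apt-cache usage, with the package names used above):

# apt-cache search nvidia-fabricmanager
# apt-cache search libnvidia-nscq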
Start the fabricmanager service:
# sudo systemctl start nvidia-fabricmanager.service
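To confirm the service started cleanly, and optionally to have it start on every boot, the usual systemctl commands apply:

# sudo systemctl status nvidia-fabricmanager.service
# sudo systemctl enable nvidia-fabricmanager.service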
Start the persistence daemon:
# sudo service nvidia-persistenced start
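Once the daemon is running, nvidia-smi should report persistence mode as Enabled for each GPU; a quick way to check:

# nvidia-smi -q | grep -i "Persistence Mode"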
Start nv-hostengine:
# sudo nv-hostengine
Set up a group:
# dcgmi group -c GPU_Group
# dcgmi group -l
The output will show you the GPU groups and the ID number for each.
Discover GPUs:
# dcgmi discovery -l
The output will show you the GPUs on the machine and the ID number for each.
Add the GPUs to the group, using the group ID and GPU IDs reported by the commands above:
# dcgmi group -g 2 -a 0,1,2,3
# dcgmi group -g 2 -i
Set up health monitoring:
# dcgmi health -g 2 -s mpi
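With the watches set, you can ask DCGM for the group’s current health; the -c flag performs a check against the watches configured above:

# dcgmi health -g 2 -c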
Run the diag to check:
# dcgmi diag -g 2 -r 1
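The -r flag selects the diagnostic level: 1 is the short test shown above, while levels 2 and 3 run progressively longer suites if you want a more thorough check before testing:

# dcgmi diag -g 2 -r 2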
At this point, NVLink should be configured and ready to go. You can also test this by quickly running one of the NVIDIA sample tests, such as the one found in /usr/local/cuda-10.2/samples/1_Utilities/p2pBandwidthLatencyTest (adjust the path for your installed CUDA version), which is provided by the cuda package.
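Depending on the toolkit version, the sample may need to be compiled before it can be run. A minimal sketch, assuming the samples ship with their usual Makefile layout in that location:

# cd /usr/local/cuda-10.2/samples/1_Utilities/p2pBandwidthLatencyTest
# make
# ./p2pBandwidthLatencyTest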
Alternatively, you can cd into /opt/gpu-burn and run a quick test with gpu-burn like so:
# ./gpu-burn 10
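You can also inspect the NVLink state directly with nvidia-smi: the nvlink subcommand reports the status of each link, and topo -m prints the interconnect topology matrix, which should show NVLink (NVx) connections between the GPUs:

# nvidia-smi nvlink --status
# nvidia-smi topo -m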
Testing the GPGPU(s)¶
To test the GPGPU, you only need to run the test-gpgpu command as a normal user, much in the same manner as you run any of the certify-* or test-* commands provided by the canonical-certification-server package.
Running test-gpgpu will execute gpu-burn for approximately 30 minutes to 1 hour against all discovered GPGPUs in the SUT in parallel. Once testing is complete, the tool will upload results to the SUT’s Hardware Entry on the Certification Portal. You do not need to create a separate certificate request for GPGPU test results; simply add a note to the certificate created from the main test results with a link to the GPGPU submission, and the certification team will review them together.
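While test-gpgpu is running, it can be helpful to watch GPU utilization and temperatures from a second terminal; plain nvidia-smi refreshed with watch is enough for this (the 5-second interval is an arbitrary choice):

watch -n 5 nvidia-smi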