Appendix G - Setting Up and Testing a GPGPU
===========================================

Requirements for GPGPU testing
------------------------------

- SUT prepared for testing as described in this document
- nVidia GPGPU(s) installed in SUT

  - At this time, only nVidia GPGPUs are supported for Certification
    Testing.

- Internet connection

  - The SUT must be able to talk to the Internet in order to download a
    significant number of packages from the nVidia repositories.

- Installation of the ``checkbox-provider-gpgpu`` package -- type
  ``sudo apt install checkbox-provider-gpgpu`` after deploying the node.
  This package is installed from the Certification PPA, which should have
  been enabled when you deployed the node or installed Checkbox manually.

Setting Up a GPGPU for Testing
------------------------------

New test cases have been added to verify that nVidia GPGPUs work with
Ubuntu. With this addition, GPGPUs can be certified on any Ubuntu LTS
Release or Point Release starting with Ubuntu 18.04 LTS using the 4.15
kernel.

The tool to set up the GPGPU environment for testing is included in the
``checkbox-provider-certification-server`` package and is installed any
time the Server Certification suite is installed on a SUT for testing.
To set up the GPGPU, you simply need to run the following::

    sudo gpu-setup.sh

This will add the nVidia repository and GPG key to the Ubuntu
installation on the SUT, update the Apt cache, and install the CUDA
Toolkit and the appropriate nVidia drivers for the GPGPUs installed in
the SUT. It will also download the source for a tool called
``gpu-burn``, an open-source stress test for nVidia GPGPUs. The script
will then compile the ``gpu-burn`` tool and exit.

Once the script is complete, you must reboot the SUT to ensure the
correct nVidia driver is loaded.

GPGPUs that use NVLink
----------------------

Some NVIDIA GPUs use NVLink for inter-device communication.
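The reboot matters because the kernel may still have an older nVidia module loaded even after the new driver packages are installed. The sketch below is a hypothetical post-setup sanity check, not part of ``gpu-setup.sh``: the version strings are illustrative, and on a real SUT you would read the running module version from ``/proc/driver/nvidia/version`` and the installed one from ``modinfo nvidia``.

```shell
#!/bin/sh
# Hypothetical check: compare the driver version the kernel currently
# has loaded against the newly installed one.  Values are illustrative;
# on a real SUT, capture the installed version with something like:
#   installed=$(modinfo nvidia | awk '/^version:/ { print $2 }')
loaded="520.61.05"
installed="525.105.17"

if [ "$loaded" != "$installed" ]; then
    echo "reboot required ($loaded loaded, $installed installed)"
else
    echo "driver up to date ($loaded)"
fi
```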
NVLink is a high-bandwidth, energy-efficient interconnect technology
developed by NVIDIA, aimed at replacing the traditional PCIe method of
data transfer between the CPU and GPU or between multiple GPUs. Server
configurations that use NVLink to connect multiple GPUs require extra
configuration before testing can be performed. Failure to configure
NVLink on systems where it is in use will result in the GPU tests
failing to run successfully. **You must configure NVLink before
launching tests.**

The following steps are provided as a guideline and a general reference
for configuring NVLink. They are not guaranteed to work in all cases, as
they depend somewhat on specific driver versions, tool versions, and so
on, which can change over time. It is expected that you understand how
to configure your own hardware prior to testing. Documentation and
downloads for nVidia's Data Center GPU Manager can be found at
https://developer.nvidia.com/dcgm/.

Perform the following steps after running ``gpu-setup.sh`` and rebooting
the machine, to ensure that the correct NVIDIA driver has been loaded
and the GPUs are accessible.

#. Determine which driver version you are using::

      # modinfo nvidia | grep -i ^version
      version: 525.105.17

   You're looking for the major version; in this example, 525.

#. Install the ``datacenter-gpu-manager``, ``fabricmanager`` and
   ``libnvidia-nscq`` packages appropriate for your driver version::

      # sudo apt install nvidia-fabricmanager-525 libnvidia-nscq-525 datacenter-gpu-manager

#. Start the fabric manager service::

      # sudo systemctl start nvidia-fabricmanager.service

#. Start the persistence daemon::

      # sudo service nvidia-persistenced start

#. Start nv-hostengine::

      # sudo nv-hostengine

#. Set up a group::

      # dcgmi group -c GPU_Group
      # dcgmi group -l

   The output will show you the GPU groups and the ID number for each.
#. Discover GPUs::

      # dcgmi discovery -l

   The output will show you the GPUs on the machine and the ID number
   for each.

#. Add the GPUs to the group::

      # dcgmi group -g 2 -a 0,1,2,3
      # dcgmi group -g 2 -i

#. Set up health monitoring::

      # dcgmi health -g 2 -s mpi

#. Run ``diag`` to check::

      # dcgmi diag -g 2 -r 1

At this point, NVLink should be configured and ready to go. You can
verify this by quickly running one of the nVidia sample tests, such as
the one found in
``/usr/local/cuda-10.2/samples/1_Utilities/p2pBandwidthLatencyTest``,
which is provided by the ``cuda`` package. Alternatively, you can cd
into ``/opt/gpu-burn`` and run a quick test with ``gpu-burn`` like so::

   # ./gpu-burn 10

Testing the GPGPU(s)
--------------------

To test the GPGPU, you only need to run the ``test-gpgpu`` command as a
normal user, much in the same manner as you run any of the ``certify-*``
or ``test-*`` commands provided by the ``canonical-certification-server``
package.

Running ``test-gpgpu`` will execute ``gpu-burn`` for approximately 30
minutes to 1 hour against all discovered GPGPUs in the SUT in parallel.
Once testing is complete, the tool will upload the results to the SUT's
Hardware Entry on the Certification Portal. You do not need to create a
separate certificate request for GPGPU test results; simply add a note
to the certificate created from the main test results with a link to the
GPGPU submission, and the certification team will review them together.
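The ``dcgmi`` group-management steps in the NVLink section lend themselves to scripting. The sketch below is a hypothetical helper that joins GPU IDs into the comma-separated list that ``dcgmi group -g <group> -a`` expects. The ID list is mocked here: on a real system the IDs come from ``dcgmi discovery -l``, whose exact output layout can vary between DCGM versions, so the parsing step is left to you.

```shell
#!/bin/sh
# Mocked GPU IDs; on a real system these would be parsed from the
# output of:  dcgmi discovery -l
gpu_ids="0
1
2
3"

# Join the IDs with commas, e.g. for:  dcgmi group -g 2 -a 0,1,2,3
gpu_list=$(printf '%s\n' "$gpu_ids" | paste -sd, -)
echo "$gpu_list"    # prints 0,1,2,3
```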
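``gpu-burn`` takes its run time in seconds as an argument (``./gpu-burn 10`` above runs for ten seconds, while ``test-gpgpu`` drives it for roughly 30 to 60 minutes). The sketch below is a hypothetical manual wrapper: the duration arithmetic is plain shell, the actual ``gpu-burn`` invocation is shown only in a comment, and the mocked summary check assumes failing GPUs are flagged with the word ``FAULTY``, which may differ between ``gpu-burn`` versions.

```shell
#!/bin/sh
# Convert a run time in minutes to the seconds argument gpu-burn takes.
minutes=30
seconds=$((minutes * 60))

# On a real SUT you would run:
#   cd /opt/gpu-burn && ./gpu-burn "$seconds"
# The per-GPU summary below is mocked for illustration.
burn_output="GPU 0: OK
GPU 1: OK"

if printf '%s\n' "$burn_output" | grep -q 'FAULTY'; then
    echo "gpu-burn reported faults"
else
    echo "all GPUs passed a ${seconds}s run"
fi
```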