Appendix G - Setting Up and Testing a GPGPU
===========================================

Requirements for GPGPU testing
------------------------------

- SUT prepared for testing as described in this document
- nVidia GPGPU(s) installed in SUT

  - At this time, only nVidia GPGPUs are supported for Certification
    Testing.

- Internet connection

  - The SUT must be able to talk to the Internet in order to download a
    significant number of packages from the nVidia repositories.

- Installation of the ``checkbox-provider-gpgpu`` package -- type
  ``sudo apt install checkbox-provider-gpgpu`` after deploying the node.
  This package is installed from the Certification PPA, which should have
  been enabled when you deployed the node or installed Checkbox manually.

Setting Up a GPGPU for Testing
------------------------------

New test cases have been added to verify that nVidia GPGPUs work with
Ubuntu. With this addition, GPGPUs can be certified on any Ubuntu LTS
Release or Point Release starting with Ubuntu 18.04 LTS using the 4.15
kernel.

The tool to set up the GPGPU environment for testing is included in the
``checkbox-provider-certification-server`` package and is installed any
time the Server Certification suite is installed on a SUT for testing.
To set up the GPGPU, you simply need to run the following::

    sudo gpu-setup.sh

This will add the nVidia repository and GPG key to the Ubuntu
installation on the SUT, update the Apt cache, and install the CUDA
Toolkit and the appropriate nVidia drivers for the GPGPUs installed in
the SUT. It will also download the source for a tool called
``gpu-burn``, an open-source stress test for nVidia GPGPUs. The script
will then compile the ``gpu-burn`` tool and exit.

Once the script is complete, you must reboot the SUT to ensure the
correct nVidia driver is loaded.

GPGPUs that use NVLink
----------------------

Some NVIDIA GPUs use NVLink for inter-device communication.
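The reboot matters because the kernel may still have an older nVidia module loaded even after the new driver packages are installed. The sketch below is a hypothetical post-setup sanity check, not part of ``gpu-setup.sh``: the version strings are illustrative, and on a real SUT you would read the running module version from ``/proc/driver/nvidia/version`` and the installed one from ``modinfo nvidia``.

```shell
#!/bin/sh
# Hypothetical check: compare the driver version the kernel currently
# has loaded against the newly installed one.  Values are illustrative;
# on a real SUT, capture the installed version with something like:
#   installed=$(modinfo nvidia | awk '/^version:/ { print $2 }')
loaded="520.61.05"
installed="525.105.17"

if [ "$loaded" != "$installed" ]; then
    echo "reboot required ($loaded loaded, $installed installed)"
else
    echo "driver up to date ($loaded)"
fi
```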
NVLink is a high-bandwidth, energy-efficient interconnect technology
developed by NVIDIA, aimed at replacing the traditional PCIe method of
data transfer between the CPU and GPU or between multiple GPUs. Server
configurations that use NVLink to connect multiple GPUs require extra
configuration before testing can be performed. Failure to configure
NVLink on systems where it is in use will result in the GPU tests
failing to run successfully. **You must configure NVLink before
launching tests.**

The following steps are provided as a guideline and a general reference
for configuring NVLink. They are not guaranteed to work in all cases, as
they depend somewhat on specific driver versions, tool versions, and so
on, which can change over time. It is expected that you understand how
to configure your own hardware prior to testing. Documentation and
downloads for nVidia's Data Center GPU Manager can be found at
https://developer.nvidia.com/dcgm/.

Perform the following steps after running ``gpu-setup.sh`` and rebooting
the machine, to ensure that the correct NVIDIA driver has been loaded
and the GPUs are accessible.

#. Determine which driver version you are using::

      # modinfo nvidia | grep -i ^version
      version: 525.105.17

   You're looking for the major version; in this example, 525.

#. Install the ``datacenter-gpu-manager``, ``fabricmanager`` and
   ``libnvidia-nscq`` packages appropriate for your driver version::

      # sudo apt install nvidia-fabricmanager-525 libnvidia-nscq-525 datacenter-gpu-manager

#. Start the fabric manager service::

      # sudo systemctl start nvidia-fabricmanager.service

#. Start the persistence daemon::

      # sudo service nvidia-persistenced start

#. Start nv-hostengine::

      # sudo nv-hostengine

#. Set up a group::

      # dcgmi group -c GPU_Group
      # dcgmi group -l

   The output will show you the GPU groups and the ID number for each.
#. Discover GPUs::

      # dcgmi discovery -l

   The output will show you the GPUs on the machine and the ID number
   for each.

#. Add the GPUs to the group::

      # dcgmi group -g 2 -a 0,1,2,3
      # dcgmi group -g 2 -i

#. Set up health monitoring::

      # dcgmi health -g 2 -s mpi

#. Run ``diag`` to check::

      # dcgmi diag -g 2 -r 1

At this point, NVLink should be configured and ready to go. You can
verify this by quickly running one of the nVidia sample tests, such as
the one found in
``/usr/local/cuda-10.2/samples/1_Utilities/p2pBandwidthLatencyTest``,
which is provided by the ``cuda`` package. Alternatively, you can cd
into ``/opt/gpu-burn`` and run a quick test with ``gpu-burn`` like so::

   # ./gpu-burn 10

Testing the GPGPU(s)
--------------------

To test the GPGPU, you only need to run the ``test-gpgpu`` command as a
normal user, much in the same manner as you run any of the ``certify-*``
or ``test-*`` commands provided by the ``canonical-certification-server``
package.

Running ``test-gpgpu`` will execute ``gpu-burn`` for approximately 30
minutes to 1 hour against all discovered GPGPUs in the SUT in parallel.
Once testing is complete, the tool will upload the results to the SUT's
Hardware Entry on the Certification Portal. You do not need to create a
separate certificate request for GPGPU test results; simply add a note
to the certificate created from the main test results with a link to the
GPGPU submission, and the certification team will review them together.
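The ``dcgmi`` group-management steps in the NVLink section lend themselves to scripting. The sketch below is a hypothetical helper that joins GPU IDs into the comma-separated list that ``dcgmi group -g <group> -a`` expects. The ID list is mocked here: on a real system the IDs come from ``dcgmi discovery -l``, whose exact output layout can vary between DCGM versions, so the parsing step is left to you.

```shell
#!/bin/sh
# Mocked GPU IDs; on a real system these would be parsed from the
# output of:  dcgmi discovery -l
gpu_ids="0
1
2
3"

# Join the IDs with commas, e.g. for:  dcgmi group -g 2 -a 0,1,2,3
gpu_list=$(printf '%s\n' "$gpu_ids" | paste -sd, -)
echo "$gpu_list"    # prints 0,1,2,3
```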
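``gpu-burn`` takes its run time in seconds as an argument (``./gpu-burn 10`` above runs for ten seconds, while ``test-gpgpu`` drives it for roughly 30 to 60 minutes). The sketch below is a hypothetical manual wrapper: the duration arithmetic is plain shell, the actual ``gpu-burn`` invocation is shown only in a comment, and the mocked summary check assumes failing GPUs are flagged with the word ``FAULTY``, which may differ between ``gpu-burn`` versions.

```shell
#!/bin/sh
# Convert a run time in minutes to the seconds argument gpu-burn takes.
minutes=30
seconds=$((minutes * 60))

# On a real SUT you would run:
#   cd /opt/gpu-burn && ./gpu-burn "$seconds"
# The per-GPU summary below is mocked for illustration.
burn_output="GPU 0: OK
GPU 1: OK"

if printf '%s\n' "$burn_output" | grep -q 'FAULTY'; then
    echo "gpu-burn reported faults"
else
    echo "all GPUs passed a ${seconds}s run"
fi
```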