Machines are high-performance computing instances for scaling AI applications.
For non-ML-in-a-Box or terminal/SSH-only (headless) machines, configuring NVLink makes data transfer faster, improves scalability, reduces latency, and optimizes resource utilization, all of which are important for high-performance computing tasks.
ML-in-a-Box immediately provides the data science stack needed for high-performance computing tasks and avoids manual configuration that could lead to issues on the machine. We strongly recommend that you run ML-in-a-Box instead of configuring NVLink on non-ML-in-a-Box or headless machines. If you want, you can disable desktop streaming from ML-in-a-Box using the following command:
sudo systemctl set-default multi-user.target
Then, reboot your system.
sudo reboot
The NVIDIA CUDA Compiler Driver (NVCC) and the NVIDIA System Management Interface (NVSMI) are essential for an NVLink connection. Before configuring NVLink, identify the GPUs on the machine and check whether these tools are installed.
Identifying the GPUs on the machine and checking whether they have the proper drivers loaded verifies that those GPUs are compatible with an NVLink connection. Review the details of all the PCI buses and devices on the machine with the lspci command. The following output shows a list of PCI devices, including multiple NVIDIA A100 GPUs, on a Paperspace machine.
paperspace@pstc7tvpw3ml:~$ lspci
...
00:05.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:07.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:08.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:09.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0a.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0b.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0c.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
...
Alternatively, you can use lspci | grep NVIDIA, which specifically identifies NVIDIA GPUs, and lspci -v, which provides verbose information about each PCI device.
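As a quick sketch of how the grep filter behaves, the snippet below pipes a sample lspci listing (one line copied from the output above, one non-GPU line added for illustration) through grep. On a real machine you would pipe lspci directly instead of using sample text.

```shell
# Filter a sample lspci listing for NVIDIA devices and count the matches.
# The sample text is illustrative; on a live machine, run: lspci | grep -c NVIDIA
lspci_sample='00:05.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0d.0 Ethernet controller: Example non-GPU device'
printf '%s\n' "$lspci_sample" | grep -c NVIDIA
```

On an 8-GPU machine like the one above, the live count would be 8.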
The NVIDIA CUDA Toolkit includes NVCC, which compiles CUDA code into executable programs. Check the current NVCC version on the machine with the nvcc --version command. Identifying the version running on the machine is important because it verifies whether the version is compatible with an NVLink connection.
paperspace@pstc7tvpw3ml:~$ nvcc --version
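If you want to pull just the release number out of the version output, a small sketch like the following works. The sample line mimics the format nvcc --version prints (the exact version string is illustrative, not from a real machine).

```shell
# Extract the CUDA release number from sample `nvcc --version` output.
# The sample line is illustrative; pipe the real command's output instead:
#   nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p'
nvcc_sample='Cuda compilation tools, release 12.4, V12.4.131'
printf '%s\n' "$nvcc_sample" | sed -n 's/.*release \([0-9.]*\),.*/\1/p'
```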
If NVCC is available on the machine, then you can skip to verifying the NVIDIA CUDA Drivers installation.
If NVCC is not found on the machine, either the NVIDIA CUDA Toolkit isn’t installed or the toolkit isn’t on the machine’s PATH. Follow the NVIDIA CUDA Toolkit installation instructions later in this guide to install the toolkit.
If the NVIDIA CUDA Toolkit isn’t on the machine’s PATH, run the following commands to add the toolkit to the PATH.
export PATH=/etc/alternatives/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/etc/alternatives/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
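You can confirm the exports worked with a quick check. This sketch prints the resolved path of nvcc if the shell can now find it, or a note if it still can’t.

```shell
# Check whether nvcc is resolvable from the current PATH.
# Prints the full path if found, or a note otherwise.
command -v nvcc || echo 'nvcc is still not on the PATH'
```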
To permanently add the toolkit to the machine’s PATH, open a terminal and navigate to the machine’s home directory. List all the files by running the ls -a command. Then, use a text editor of your choice, for example nano .profile, to add the commands to the .profile file. Save and exit the .profile file, and apply the change by either restarting the terminal or sourcing the file with the source ~/.profile command.
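If you prefer to append the exports without opening an editor, the following sketch does it from the command line. It writes to a temporary scratch file for illustration; on a real machine you would target ~/.profile instead.

```shell
# Append the CUDA PATH exports to a profile file.
# Using a temporary scratch file here for illustration; on a real machine,
# set profile="$HOME/.profile" instead.
profile=$(mktemp)
cat >> "$profile" <<'EOF'
export PATH=/etc/alternatives/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/etc/alternatives/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
# Confirm both lines landed in the file
grep -c '/etc/alternatives/cuda' "$profile"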
The NVIDIA CUDA Drivers include NVSMI, which monitors and manages NVIDIA GPU devices by providing access to GPU settings and configuration details, GPU performance, and real-time status. It also shows how the GPUs are interconnected, which is either through PCIe or NVLink.
Use the nvidia-smi command in your terminal to see whether the GPUs on the machine are compatible with an NVLink connection and whether the hardware is up to date and functioning correctly.
paperspace@pstc7tvpm3ml:~$ nvidia-smi
If NVSMI is available on the machine, then you can skip to installing the Fabric Manager. If NVSMI isn’t found on the machine, install the NVIDIA CUDA Drivers, which include NVSMI.
After identifying the GPUs and verifying that NVCC and NVSMI are installed, ensure that the machine is up to date. Switch to the root user with the sudo su - command. Then, update the machine’s system software by executing the apt-get update && apt-get upgrade -y command.
Installing the NVIDIA CUDA Toolkit allows you to use NVCC and other CUDA tools for developing and running CUDA applications. To find the correct NVIDIA CUDA Toolkit for the drivers on the machine, see NVIDIA’s CUDA Toolkit and driver compatibility table. For example, if the machine uses Ubuntu 22.04, download NVIDIA’s Ubuntu 22.04 CUDA Toolkit repository pinning file:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
Move the pinning file to the APT preferences directory, which handles package priorities.
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
Download NVIDIA’s Ubuntu 22.04 CUDA repository Debian (.deb) package.
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
Install the CUDA repository package, which contains the NVIDIA CUDA Toolkit and adds the repository to the APT system.
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
Copy the repository’s GPG key to the machine’s keyring directory. This is necessary for authenticating packages from the repository securely.
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
Update the machine’s package lists to incorporate the new repository:
sudo apt-get update
Lastly, install the NVIDIA CUDA Toolkit 12.4 from the repository you added earlier. This command ensures that the toolkit is installed correctly with all the required configurations and priorities set.
sudo apt-get -y install cuda-toolkit-12-4
Install the CUDA drivers on the machine with the cuda-drivers package, which ensures that the GPUs on the machine have the proper drivers to use an NVLink connection.
sudo apt-get install -y cuda-drivers
NVIDIA Fabric Manager manages fabric resources such as NVLink and is important for setups involving complex GPU interconnects, such as configuring and allocating NVLink connections. Install the Fabric Manager package for the 550 driver branch on the machine.
sudo apt-get install cuda-drivers-fabricmanager-550 -y
Start the NVIDIA Fabric Manager service, which is required to manage and optimize NVLink fabric resources.
sudo systemctl start nvidia-fabricmanager
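You can confirm the service started with a quick state check. This is a sketch: the unit name matches the package installed above, and the fallback covers shells where systemctl can’t report a state.

```shell
# Query the Fabric Manager service state via systemd.
# Prints "active" once the service is running; "unknown" is a fallback for
# environments where systemctl is unavailable or can't report a state.
state=$(systemctl is-active nvidia-fabricmanager 2>/dev/null) || true
echo "Fabric Manager state: ${state:-unknown}"
```

To have the service start automatically at boot, you can also run sudo systemctl enable nvidia-fabricmanager.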
After configuring NVLink, verify the connections between GPUs and whether NVLink is active and functioning properly.
View how all the GPUs are connected and whether they are working as expected using the nvidia-smi command. The command also shows connectivity information, such as whether NVLink is enabled and connecting the GPUs.
Review the topology of the machine to see how the GPUs are interconnected by running the nvidia-smi topo -m command.
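In the topology matrix, an entry like NV12 between two GPUs indicates a bond of 12 NVLinks connecting them, while entries such as PIX or PHB indicate PCIe paths. The sketch below checks a sample fragment of the matrix for NVLink entries (the sample text is illustrative of the format, not live output).

```shell
# Count NVLink entries in a sample `nvidia-smi topo -m` fragment.
# "X" marks a GPU's own row/column intersection; "NV12" marks an NVLink bond.
# On a live machine: nvidia-smi topo -m | grep -c 'NV[0-9]'
topo_sample='GPU0     X    NV12
GPU1    NV12    X'
printf '%s\n' "$topo_sample" | grep -c 'NV12'
```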
Check the status of each NVLink connection for each GPU. Run the nvidia-smi nvlink --status command to see information about each NVLink, including its utilization and active or inactive status.
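To spot problem links quickly, you can grep the status output for inactive entries. The sample lines below are illustrative of the per-link format; on a real machine you would pipe the live command output instead.

```shell
# Count inactive links in sample `nvidia-smi nvlink --status` output.
# Sample lines are illustrative; on a live machine:
#   nvidia-smi nvlink --status | grep -ci 'inactive'
nvlink_sample='Link 0: 25 GB/s
Link 1: <inactive>'
printf '%s\n' "$nvlink_sample" | grep -ci 'inactive'
```

A count of zero means every reported link is carrying traffic as expected.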
The NVLink connection isn’t configured correctly if NVLink is disabled or inactive. To troubleshoot, check that all the installed tools are up to date, and test the GPU and NVLink connection. If these steps don’t fix the issue, then reboot the machine with sudo reboot. If you need further assistance, contact support.
After verifying the basic installation and functionality of the machine’s CUDA environment, test the environment further with CUDA samples. CUDA samples are a collection of examples made by NVIDIA that are used to configure and test CUDA Toolkit features such as NVLink. Clone the CUDA Samples repository.
git clone https://github.com/NVIDIA/cuda-samples
Navigate to the deviceQuery sample, which is a utility that provides information about the CUDA devices on the machine. It is used to verify that the system recognizes the GPUs and to display their capabilities, such as NVLink.
cd cuda-samples/Samples/1_Utilities/deviceQuery
Compile the deviceQuery sample with the Makefile from the sample directory; the Makefile has the instructions for compiling deviceQuery. Alternatively, you can run nvcc -o deviceQuery deviceQuery.cu to compile the deviceQuery sample without the Makefile.
Lastly, use the ./deviceQuery command to execute the compiled deviceQuery program, which shows details about NVLink support. This validates that the CUDA environment is set up correctly and that NVLink is supported and used to connect the GPUs within the machine. If NVLink doesn’t appear in the deviceQuery output, then there is an issue with the hardware setup or the driver/software configuration.