
GPU & CUDA Errors

This guide covers common NVIDIA GPU errors and their solutions for deep learning workloads. Most issues fall into these categories:

  • NUMA node affinity problems
  • GPU detection and availability
  • CUDA compatibility issues

Typical symptoms of the NUMA node affinity problem:

  • Error message: successful NUMA node read from SysFS had negative value (-1)
  • Performance degradation in multi-GPU systems

The numa_node value in sysfs reverts to -1 on every system reboot, which triggers the warning above and can lead to suboptimal host-memory placement in multi-GPU systems, so the fix below reapplies the setting automatically at boot.
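
As a quick check, you can read the sysfs entry directly; a value of -1 confirms the problem. A minimal Python sketch, assuming the example PCI ID 0000:0b:00.0 used later in this guide (substitute your own):

from pathlib import Path

# Replace with your own PCI ID (found via `lspci -D | grep NVIDIA`).
PCI_ID = "0000:0b:00.0"

numa_node = int(Path(f"/sys/bus/pci/devices/{PCI_ID}/numa_node").read_text().strip())
print(f"{PCI_ID}: numa_node = {numa_node}")
if numa_node == -1:
    print("NUMA affinity is not set - apply the fix below.")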

Fix NUMA node affinity
# 1) Identify the PCI ID (with domain) of your GPU
# For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Add a crontab for root
sudo crontab -e
# Add the following line.
# It ensures that NUMA affinity is set to 0 for the GPU device on every reboot.
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

# Keep in mind that this is only a "shallow" fix, as the NVIDIA driver is unaware of it.
# Your PCI ID will differ, so replace <PCI_ID> with your own value.
# For example, with PCI ID 0000:0b:00.0:
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/0000:0b:00.0/numa_node")

# 3) Verify the fix
nvidia-smi topo -m

Discussion on StackOverflow
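
If the machine has several GPUs, a short Python sketch can enumerate every NVIDIA PCI device (vendor ID 0x10de) and report its current numa_node value, which is a convenient way to confirm that the crontab entry took effect after a reboot. This assumes the standard Linux sysfs layout used above:

from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA devices

# Walk all PCI devices and report the NUMA node of each NVIDIA device.
for device in sorted(Path("/sys/bus/pci/devices").iterdir()):
    if (device / "vendor").read_text().strip() != NVIDIA_VENDOR_ID:
        continue
    numa_node = (device / "numa_node").read_text().strip()
    status = "OK" if numa_node != "-1" else "not set - fix needed"
    print(f"{device.name}: numa_node = {numa_node} ({status})")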


Check GPU availability in PyTorch
import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
# Expected output on a working setup: CUDA available: True

# Get GPU name
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU count: {torch.cuda.device_count()}")
# Example output: GPU: NVIDIA GeForce RTX 4090

If torch.cuda.is_available() returns False, the usual causes are:

  1. Driver issues - See Driver Installation
  2. CUDA version mismatch - Verify PyTorch/TensorFlow CUDA compatibility (see the sketch below)
  3. Environment problems - Check Environment Setup
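
For item 2, a quick way to spot a CUDA version mismatch is to print the CUDA and cuDNN versions the installed PyTorch build was compiled against and compare them with the driver-supported CUDA version shown in the nvidia-smi header. A minimal sketch:

import torch

# Versions the installed PyTorch wheel was built against.
print(f"PyTorch version: {torch.__version__}")
print(f"Built with CUDA: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")

if torch.cuda.is_available():
    # Compute capability of the first GPU, e.g. (8, 9) for an RTX 4090.
    print(f"Compute capability: {torch.cuda.get_device_capability(0)}")

If torch.version.cuda is newer than the maximum CUDA version the driver supports, upgrade the driver or install a PyTorch build targeting an older CUDA release.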