GPU & CUDA Errors
Overview
This guide covers common NVIDIA GPU errors and their solutions for deep learning workloads. Most issues fall into these categories:
- NUMA node affinity problems
- GPU detection and availability
- CUDA compatibility issues
NUMA Node Error
Symptoms
- Error message: successful NUMA node read from SysFS had negative value (-1)
- Performance degradation in multi-GPU systems
Root Cause
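The kernel exposes the NUMA node of each PCI device through sysfs, and a value of -1 means that no node is assigned to the GPU. A value written by hand does not persist, so the setting resets to -1 on every system reboot, which can cause memory allocation issues.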
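To see what the kernel currently reports, the value can be read directly from sysfs. The following is a minimal sketch, assuming a Linux system with sysfs mounted at /sys. The function name read_gpu_numa_nodes and its sysfs_root parameter are hypothetical and introduced here only for illustration. Devices are matched by NVIDIA's PCI vendor ID (0x10de), so the result may also include non-GPU functions of the card, such as its audio controller.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_id: numa_node} for every NVIDIA PCI device under sysfs_root.

    Hypothetical helper: sysfs_root is a parameter so the function can be
    pointed at a different directory (for example, in tests).
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # not Linux, or sysfs is not mounted here
    for dev in sorted(root.iterdir()):
        vendor_file = dev / "vendor"
        numa_file = dev / "numa_node"
        if not (vendor_file.is_file() and numa_file.is_file()):
            continue
        if vendor_file.read_text().strip().lower() != NVIDIA_VENDOR_ID:
            continue
        result[dev.name] = int(numa_file.read_text().strip())
    return result


if __name__ == "__main__":
    for pci_id, node in read_gpu_numa_nodes().items():
        status = "no NUMA node assigned" if node < 0 else "ok"
        print(f"{pci_id}: numa_node={node} ({status})")
```

A negative value for a device corresponds to the condition described in the error message above.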
Solution
Fix NUMA node affinity
# 1) Identify the PCI ID (with domain) of your GPU
#    For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA

# 2) Add a crontab entry for root
sudo crontab -e

# Add the following line, replacing <PCI_ID> with the ID of your own GPU.
# It sets the NUMA affinity of the GPU device to 0 on every reboot.
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

# Example for a GPU with the PCI ID 0000:0b:00.0:
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/0000:0b:00.0/numa_node")

# Keep in mind that this is only a "shallow" fix: the NVIDIA driver is unaware of it,
# so nvidia-smi may not reflect the change.

# Verify the fix
nvidia-smi topo -m
Discussion on StackOverflow
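A mistyped PCI ID (for instance a dot where a colon belongs) makes the crontab entry write to a path that does not exist. As a minimal sketch, the entry can be generated from a validated ID. The helper name numa_cron_line and the regular expression are assumptions made for illustration; the pattern expects the domain:bus:device.function form that lspci -D prints.

```python
import re

# Full PCI address as printed by `lspci -D`: domain:bus:device.function (hex).
PCI_ID_PATTERN = re.compile(
    r"^[0-9a-fA-F]{4,}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$"
)


def numa_cron_line(pci_id, node=0):
    """Build the @reboot crontab entry for one GPU (hypothetical helper)."""
    if not PCI_ID_PATTERN.match(pci_id):
        raise ValueError(f"not a full PCI ID (domain:bus:device.function): {pci_id!r}")
    if node < 0:
        raise ValueError("node must be a non-negative NUMA node number")
    path = f"/sys/bus/pci/devices/{pci_id}/numa_node"
    return f'@reboot (echo {node} | tee -a "{path}")'


if __name__ == "__main__":
    # Example ID taken from this guide; replace it with your own.
    print(numa_cron_line("0000:0b:00.0"))
```

The printed line can then be pasted into root's crontab as described in step 2.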
GPU Detection Issues
Verify GPU Availability
Check GPU availability in PyTorch
import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
# Should print: CUDA available: True

# Get GPU name and count
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU count: {torch.cuda.device_count()}")
# Example output: GPU: NVIDIA GeForce RTX 4090

Check GPU in TensorFlow
import tensorflow as tf

# List physical GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs available: {len(gpus)}")
for gpu in gpus:
    print(f"  - {gpu}")

Common Causes of GPU Not Detected
- Driver issues - See Driver Installation
- CUDA version mismatch - Verify PyTorch/TensorFlow CUDA compatibility
- Environment problems - Check Environment Setup
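The causes listed above can be narrowed down by collecting a few facts in one place. The following is a minimal sketch, assuming PyTorch is the framework in use. The function name collect_gpu_diagnostics and the dictionary keys are hypothetical. It reports the CUDA version PyTorch was built against (torch.version.cuda, which is None for CPU-only builds) and the CUDA_VISIBLE_DEVICES environment variable, which hides GPUs from CUDA applications when set to an empty string.

```python
import os


def collect_gpu_diagnostics():
    """Gather basic facts that help explain why a GPU is not detected.

    Hypothetical helper: the key names are chosen for illustration only.
    """
    info = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        info["torch_installed"] = False
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version PyTorch was compiled against; None for CPU-only builds.
    info["torch_built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If torch_built_with_cuda is None, a CPU-only build is installed. If it has a value but cuda_available is False, the driver or the environment is the more likely cause.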
Related Resources
- Driver Installation Guide - Install or update NVIDIA drivers
- GPU Memory Management - Handle OOM errors
- Multi-GPU Setup - Configure multiple GPUs