GPU & CUDA Errors

Overview

This guide covers common NVIDIA GPU errors and their solutions for deep learning workloads. Most issues fall into these categories:

NUMA node affinity problems
GPU detection and availability
CUDA compatibility issues

NUMA Node Error

Symptoms

Error message: successful NUMA node read from SysFS had negative value (-1)
Performance degradation in multi-GPU systems

Root Cause

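The kernel exposes each GPU's NUMA node in /sys/bus/pci/devices/<PCI_ID>/numa_node. This value resets to -1 (no node assigned) on every system reboot, so a manual correction does not persist and memory allocation can suffer as a result.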

Solution

Fix NUMA node affinity

# 1) Identify the PCI ID (with domain) of your GPU
# For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Add a crontab for root
sudo crontab -e
# Add the following line, replacing <PCI_ID> with the ID found in step 1.
# This sets the NUMA affinity of the GPU device to 0 on every reboot.
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

# Keep in mind that this is only a "shallow" fix, as the NVIDIA driver is unaware of it.
# Your PCI ID will differ from machine to machine, so use your own.
# For example, with PCI ID 0000:0b:00.0 the line becomes:
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/0000:0b:00.0/numa_node")

# Verify the fix
nvidia-smi topo -m
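
The value written by the crontab entry can also be read back directly from sysfs. The following is a minimal sketch, not part of the original fix: the helper name read_numa_node and its sysfs_root parameter are hypothetical, and the PCI ID is the example used above, so replace it with your own.

```python
from pathlib import Path
from typing import Optional


def read_numa_node(pci_id: str, sysfs_root: str = "/sys/bus/pci/devices") -> Optional[int]:
    """Return the NUMA node recorded for a PCI device, or None if it cannot be read.

    Hypothetical helper: `sysfs_root` is a parameter only so the function
    can be pointed at a different directory for testing.
    """
    node_file = Path(sysfs_root) / pci_id / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Device directory missing, file unreadable, or content not an integer.
        return None


if __name__ == "__main__":
    pci_id = "0000:0b:00.0"  # example ID from this guide; replace with your own
    node = read_numa_node(pci_id)
    if node is None:
        print(f"Could not read numa_node for {pci_id}")
    elif node < 0:
        print(f"{pci_id}: NUMA node is {node}, the fix has not been applied")
    else:
        print(f"{pci_id}: NUMA node is {node}")
```

A negative result after a reboot suggests the crontab entry did not run or points at the wrong PCI ID.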

Discussion on StackOverflow

GPU Detection Issues

Check GPU availability in PyTorch

import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
# Expected output on a working setup: CUDA available: True

# Get GPU name
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU count: {torch.cuda.device_count()}")
# Example output: GPU: NVIDIA GeForce RTX 4090

Check GPU in TensorFlow

import tensorflow as tf

# List physical GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs available: {len(gpus)}")
for gpu in gpus:
    print(f"  - {gpu}")

Common Causes of GPU Not Detected

Driver issues - See Driver Installation
CUDA version mismatch - Verify PyTorch/TensorFlow CUDA compatibility
Environment problems - Check Environment Setup
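
Two of these causes can be narrowed down from Python before reinstalling anything. The sketch below is an illustration under stated assumptions rather than a complete diagnostic: the function name gpu_visibility_report and its dictionary keys are hypothetical. It inspects the CUDA_VISIBLE_DEVICES environment variable, which hides all GPUs when set to an empty string or -1, and, if PyTorch is installed, reports the CUDA version the PyTorch build was compiled against via torch.version.cuda.

```python
import importlib
import os


def gpu_visibility_report(env=None):
    """Collect basic facts that explain why a GPU might not be detected.

    Hypothetical helper: `env` defaults to os.environ and is a parameter
    only so the function can be exercised with a custom mapping.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    report = {
        "cuda_visible_devices": visible,
        # An empty string or "-1" hides every GPU from CUDA applications.
        "gpus_hidden_by_env": visible is not None and visible.strip() in ("", "-1"),
        "torch_version": None,
        "torch_cuda_build": None,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed; leave the torch fields as None.
        return report
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for builds without CUDA support (e.g. CPU-only).
    report["torch_cuda_build"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If torch_cuda_build is None while torch_version is set, the installed PyTorch build has no CUDA support, which points to a package problem rather than a driver problem.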

Driver Installation Guide - Install or update NVIDIA drivers
GPU Memory Management - Handle OOM errors
Multi-GPU Setup - Configure multiple GPUs