GPU Monitoring Scripts
Overview
Effective GPU monitoring helps you:
- Track GPU utilization and memory usage
- Detect performance bottlenecks
- Monitor temperature and prevent thermal throttling
- Log metrics for analysis
- Set up alerts for issues
Quick Monitoring Commands
Basic nvidia-smi Usage

# Basic GPU status
nvidia-smi
# Simplified output
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
# Watch mode (updates every 2 seconds)
watch -n 2 nvidia-smi

# Query specific GPU
nvidia-smi -i 0
# Query multiple GPUs
nvidia-smi -i 0,1,2
# All GPUs summary
nvidia-smi --query-gpu=gpu_name,memory.total,driver_version --format=csv

# Show running processes on GPUs
nvidia-smi pmon
# Detailed process information
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

Continuous Monitoring Scripts

GPU Usage Logger

gpu_logger.sh - Log GPU metrics to file
#!/bin/bash
# Log GPU metrics continuously
LOG_FILE="gpu_metrics_$(date +%Y%m%d_%H%M%S).csv"
INTERVAL=5 # seconds
# Create header
echo "timestamp,gpu_id,gpu_name,temp_c,utilization_%,memory_used_mb,memory_total_mb,power_w" > "$LOG_FILE"
# Log continuously
while true; do
nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw \
--format=csv,noheader,nounits >> "$LOG_FILE"
sleep $INTERVAL
done

Usage:
chmod +x gpu_logger.sh
./gpu_logger.sh &  # Run in background
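For post-analysis of the logged CSV, a short script can compute per-GPU baselines. A minimal sketch, assuming pandas is installed and the input file was produced by gpu_logger.sh above (the script name is illustrative):

#!/usr/bin/env python3
# analyze_gpu_log.py (illustrative) - summarize a CSV written by gpu_logger.sh
# Usage: python3 analyze_gpu_log.py gpu_metrics_YYYYMMDD_HHMMSS.csv
import sys
import pandas as pd  # assumption: pandas is installed

df = pd.read_csv(sys.argv[1], skipinitialspace=True)

# Per-GPU mean/max of utilization, temperature, and power draw
print(df.groupby("gpu_id")[["utilization_%", "temp_c", "power_w"]].agg(["mean", "max"]))

# Peak memory usage per GPU (MB)
print(df.groupby("gpu_id")["memory_used_mb"].max())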
Real-Time GPU Dashboard

gpu_dashboard.sh - Colorful terminal dashboard
#!/bin/bash
# Real-time GPU monitoring dashboard
while true; do
clear
echo "=== GPU Monitoring Dashboard ==="
echo "Time: $(date '+%Y-%m-%d %H:%M:%S')"
echo ""
# GPU utilization
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
--format=csv,noheader | \
while IFS=',' read -r idx name util mem_used mem_total temp power; do
echo "GPU $idx: $name"
echo " Utilization: $util"
echo " Memory: $mem_used / $mem_total"
echo " Temperature: $temp"
echo " Power: $power"
echo ""
done
# Running processes
echo "=== Active Processes ==="
nvidia-smi pmon -c 1 | grep -v "#"
sleep 2
done

Python Monitoring Scripts

Simple GPU Monitor

gpu_monitor.py - Python monitoring
#!/usr/bin/env python3
import pynvml
import time
from datetime import datetime
# Initialize NVIDIA Management Library
pynvml.nvmlInit()
def monitor_gpus(interval=2):
"""Monitor all GPUs continuously"""
device_count = pynvml.nvmlDeviceGetCount()
try:
while True:
print(f"\n{'='*60}")
print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"{'='*60}")
for i in range(device_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
# Get device info
name = pynvml.nvmlDeviceGetName(handle)
# Get utilization
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
# Get memory info
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
# Get temperature
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
# Get power
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # Convert to W
print(f"\nGPU {i}: {name}")
print(f" Utilization: {util.gpu}%")
print(f" Memory: {mem.used / 1024**3:.2f} GB / {mem.total / 1024**3:.2f} GB ({mem.used/mem.total*100:.1f}%)")
print(f" Temperature: {temp}°C")
print(f" Power: {power:.2f} W")
time.sleep(interval)
except KeyboardInterrupt:
print("\nMonitoring stopped.")
finally:
pynvml.nvmlShutdown()
if __name__ == "__main__":
    # Install: pip install nvidia-ml-py (provides the pynvml module)
    monitor_gpus(interval=2)

Simple monitoring with gpustat
#!/usr/bin/env python3
import gpustat
import time
def monitor():
"""Simple GPU monitoring with gpustat"""
try:
while True:
stats = gpustat.GPUStatCollection.new_query()
print("\033[2J\033[H") # Clear screen
print(stats)
time.sleep(2)
except KeyboardInterrupt:
print("\nStopped.")
if __name__ == "__main__":
# Install: pip install gpustat
    monitor()

GPU Usage Alerts

gpu_alert.py - Email/Slack alerts
#!/usr/bin/env python3
import pynvml
import time
import smtplib
from email.mime.text import MIMEText
pynvml.nvmlInit()
# Alert thresholds
TEMP_THRESHOLD = 80 # °C
MEMORY_THRESHOLD = 90 # %
CHECK_INTERVAL = 60 # seconds
def send_alert(message):
"""Send email alert (configure SMTP settings)"""
    sender = "gpu-monitor@example.com"    # placeholder sender address
    receiver = "admin@example.com"        # placeholder recipient address
msg = MIMEText(message)
msg['Subject'] = 'GPU Alert'
msg['From'] = sender
msg['To'] = receiver
# Configure your SMTP server
# smtp = smtplib.SMTP('smtp.gmail.com', 587)
# smtp.starttls()
# smtp.login(sender, 'your_password')
# smtp.send_message(msg)
# smtp.quit()
print(f"ALERT: {message}")
def check_gpus():
"""Check GPU health and send alerts"""
device_count = pynvml.nvmlDeviceGetCount()
for i in range(device_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
name = pynvml.nvmlDeviceGetName(handle)
# Check temperature
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
if temp >= TEMP_THRESHOLD:
send_alert(f"GPU {i} ({name}) temperature {temp}°C exceeds threshold {TEMP_THRESHOLD}°C")
# Check memory
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
mem_percent = (mem.used / mem.total) * 100
if mem_percent >= MEMORY_THRESHOLD:
send_alert(f"GPU {i} ({name}) memory usage {mem_percent:.1f}% exceeds threshold {MEMORY_THRESHOLD}%")
if __name__ == "__main__":
try:
while True:
check_gpus()
time.sleep(CHECK_INTERVAL)
except KeyboardInterrupt:
print("\nAlert monitoring stopped.")
finally:
        pynvml.nvmlShutdown()
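The script above only sends email; for the Slack half of the title, an incoming-webhook variant of send_alert() is a small addition. A sketch assuming the requests package and a webhook URL created in your Slack workspace (the URL below is a placeholder):

import requests  # assumption: pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_slack_alert(message):
    """Post an alert to a Slack channel via an incoming webhook."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": f"GPU Alert: {message}"}, timeout=10)
    resp.raise_for_status()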
Advanced Monitoring Tools

nvtop - Interactive GPU Monitor
# Install nvtop (like htop for GPUs)
sudo apt install nvtop
# Run
nvtop

Features:
- Real-time GPU usage graphs
- Process monitoring
- Multi-GPU support
- Color-coded interface
gpustat - Clean CLI Output
# Install
pip install gpustat
# Basic usage
gpustat
# Watch mode
gpustat --watch
# With color
gpustat --color
# JSON output for scripting
gpustat --json
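The JSON output is easy to consume from other scripts. A minimal sketch, assuming gpustat is installed and on PATH and its JSON keys follow the usual schema:

#!/usr/bin/env python3
# Read `gpustat --json` from another script
import json
import subprocess

raw = subprocess.run(["gpustat", "--json"], capture_output=True, text=True, check=True).stdout
for gpu in json.loads(raw)["gpus"]:
    print(f"GPU {gpu['index']}: {gpu['utilization.gpu']}% util, "
          f"{gpu['memory.used']}/{gpu['memory.total']} MB")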
Prometheus + Grafana Setup

DCGM exporter for Prometheus

# Run NVIDIA's DCGM exporter (serves Prometheus metrics on port 9400 by default)
docker run -d --gpus all \
  -p 9400:9400 \
  nvidia/dcgm-exporter:latest
# Access metrics at http://localhost:9400/metrics

Then configure Prometheus to scrape the exporter (see the sketch after this list) and build Grafana panels for:
- GPU utilization over time
- Memory usage trends
- Temperature monitoring
- Power consumption
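Before building dashboards, it is worth checking that Prometheus is actually scraping the exporter. A sketch using Prometheus's HTTP query API, assuming Prometheus listens on localhost:9090 and the exporter publishes the DCGM_FI_DEV_GPU_UTIL gauge:

#!/usr/bin/env python3
# Query Prometheus for the GPU-utilization gauge exported by dcgm-exporter
import requests  # assumption: pip install requests

resp = requests.get("http://localhost:9090/api/v1/query",
                    params={"query": "DCGM_FI_DEV_GPU_UTIL"}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    print(f"GPU {result['metric'].get('gpu', '?')}: {result['value'][1]}% utilization")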
Training-Specific Monitoring
Log GPU Usage During Training

Monitor during PyTorch training
import torch
import pynvml
pynvml.nvmlInit()
def log_gpu_usage(epoch, batch_idx):
"""Log GPU usage during training"""
if batch_idx % 100 == 0: # Log every 100 batches
for i in range(torch.cuda.device_count()):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"Epoch {epoch}, Batch {batch_idx}, GPU {i}: "
f"Util {util.gpu}%, Mem {mem.used/1024**3:.1f}GB/{mem.total/1024**3:.1f}GB")
# In training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Your training code
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        # Log GPU usage
        log_gpu_usage(epoch, batch_idx)

Monitor during TensorFlow training
import tensorflow as tf
import pynvml
pynvml.nvmlInit()
class GPUMonitorCallback(tf.keras.callbacks.Callback):
def __init__(self, log_freq=100):
super().__init__()
self.log_freq = log_freq
self.batch_count = 0
def on_batch_end(self, batch, logs=None):
self.batch_count += 1
if self.batch_count % self.log_freq == 0:
for i in range(pynvml.nvmlDeviceGetCount()):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"Batch {self.batch_count}, GPU {i}: "
f"Util {util.gpu}%, Mem {mem.used/1024**3:.1f}GB")
# Use in training
model.fit(
train_dataset,
epochs=10,
callbacks=[GPUMonitorCallback(log_freq=100)]
)

Troubleshooting GPU Issues
Check for GPU Throttling

check_throttling.sh
#!/bin/bash
# Check if GPUs are throttling due to temperature or power
nvidia-smi --query-gpu=index,name,clocks_throttle_reasons.active,temperature.gpu,power.draw \
--format=csv
# Throttle reason codes (bitmask; see the decoder sketch after this block):
# 0x0000000000000000 - Not throttled
# 0x0000000000000001 - GPU idle
# 0x0000000000000002 - Applications clocks setting
# 0x0000000000000004 - SW power cap
# 0x0000000000000008 - HW slowdown
# 0x0000000000000010 - Sync boost
# 0x0000000000000020 - SW thermal slowdown
# 0x0000000000000040 - HW thermal slowdown
# 0x0000000000000080 - HW power brake slowdown
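The active value is a bitmask, so several reasons can be set at once. A sketch that decodes it per GPU, assuming the pynvml bindings expose nvmlDeviceGetCurrentClocksThrottleReasons:

#!/usr/bin/env python3
# Decode the clocks-throttle-reasons bitmask reported by NVML
import pynvml

REASONS = {
    0x0000000000000001: "GPU idle",
    0x0000000000000002: "Applications clocks setting",
    0x0000000000000004: "SW power cap",
    0x0000000000000008: "HW slowdown",
    0x0000000000000010: "Sync boost",
    0x0000000000000020: "SW thermal slowdown",
    0x0000000000000040: "HW thermal slowdown",
    0x0000000000000080: "HW power brake slowdown",
}

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        active = [name for bit, name in REASONS.items() if mask & bit]
        print(f"GPU {i}: {', '.join(active) if active else 'not throttled'}")
finally:
    pynvml.nvmlShutdown()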
Find GPU Memory Leaks

find_memory_leak.py
import torch
import gc
def find_tensors():
    """Find all tensors in memory"""
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                print(type(obj), obj.size())
        except Exception:
            pass
# Call periodically during training to detect leaks
find_tensors()
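A complementary check is to watch PyTorch's own allocator counters between steps; a steady upward trend under a constant workload usually points to a leak:

import torch

def log_cuda_memory(step, device=0):
    """Print PyTorch CUDA allocator stats for one device."""
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    reserved = torch.cuda.memory_reserved(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"step {step}: allocated {allocated:.2f} GB, "
          f"reserved {reserved:.2f} GB, peak {peak:.2f} GB")

# Call inside the training loop, e.g. every N steps:
# log_cuda_memory(step)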
Best Practices

- Monitor Continuously - Run monitoring during long training runs
- Log Metrics - Save GPU logs for post-analysis
- Set Alerts - Get notified of temperature/memory issues
- Check Baselines - Know your GPU’s normal behavior
- Use Multiple Tools - Combine nvidia-smi, nvtop, and custom scripts
Related Resources
- GPU Memory Management - Optimize memory usage
- Multi-GPU Setup - Monitor multiple GPUs
- Training Utilities - Training management scripts
- HPC Monitoring - Cluster-wide GPU monitoring