
Monitoring & Maintenance

Your deep learning workstation is a significant investment. Proper monitoring and maintenance:

  • Prevents hardware failures - Catch issues before they cause damage
  • Maintains performance - Avoid gradual degradation
  • Saves money - Early detection prevents costly repairs
  • Ensures reliability - No interrupted training runs
What to monitor on the GPU (a query snippet covering these fields follows these lists):

  • Temperature - Prevent thermal throttling and damage
  • Fan speed - Ensure adequate cooling
  • Power draw - Detect anomalies
  • Memory errors - ECC errors (on professional GPUs)
  • Clock speeds - Check for throttling

What to monitor on the system:

  • CPU temperature - Prevent throttling
  • RAM usage - Detect memory leaks
  • Disk usage - Avoid running out of space
  • Network - For multi-node training
  • PSU health - Power delivery issues

What to monitor during training:

  • GPU utilization - Is your GPU fully utilized?
  • Training speed - Samples/second throughput
  • Data loading time - Identify bottlenecks
  • Loss curves - Training progress
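
All of the GPU fields above are exposed through nvidia-smi's query interface. A minimal sketch (the ECC counter only reports on GPUs with ECC memory, and exact field names can vary slightly between driver versions):

# One-shot CSV dump of the key GPU health fields
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw,clocks.sm,clocks.mem,utilization.gpu,ecc.errors.corrected.volatile.total \
  --format=csv

# List every field your driver supports
nvidia-smi --help-query-gpu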

nvidia-smi - Built-in NVIDIA tool

# One-time check
nvidia-smi

# Continuous monitoring
watch -n 1 nvidia-smi

# Log key metrics to a CSV file every 10 seconds
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
  --format=csv --loop=10 > gpu_log.csv

nvtop - Better visualization

# Install
sudo apt install nvtop

# Run
nvtop

htop - CPU and memory

sudo apt install htop
htop

Prometheus + Grafana - Industry standard

  • Collect metrics over time
  • Beautiful dashboards
  • Alerting on issues
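
A common way to get GPU metrics into Prometheus is NVIDIA's dcgm-exporter. A rough sketch, assuming Docker with the NVIDIA container toolkit is installed (the image tag below is a placeholder; check NVIDIA's registry for a current one):

# Run the DCGM exporter; it serves GPU metrics on port 9400
docker run -d --gpus all -p 9400:9400 --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest

# Sanity-check that metrics are being exposed
curl -s localhost:9400/metrics | head

# Minimal Prometheus scrape config pointing at the exporter
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets: ['localhost:9400']
EOF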

TensorBoard - For ML metrics

  • Built into TensorFlow
  • PyTorch integration available
  • Visualize training progress
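
To view TensorBoard from another machine on your network, a minimal sketch (./runs is the default log directory written by PyTorch's SummaryWriter; point --logdir at wherever your training script logs):

pip install tensorboard
tensorboard --logdir ./runs --host 0.0.0.0 --port 6006
# Then browse to http://workstation-ip:6006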

Weights & Biases - Cloud-based

  • Experiment tracking
  • Hardware monitoring
  • Team collaboration
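
Getting started is a pip install plus a login; once wandb is initialized in a training script, it records GPU utilization and temperature alongside your metrics:

pip install wandb
wandb login   # paste the API key from your wandb.ai account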

Maintenance schedule:

Daily:

  • GPU temperature check
  • Disk space monitoring
  • Training job status

Weekly:

  • Review temperature logs
  • Check for driver updates
  • Clean browser cache/temp files
  • Review system logs for errors (see the helper snippet after this schedule)

Monthly:

  • Physical dust cleaning
  • Check thermal paste (if temps rising)
  • Verify all fans working
  • Update system packages
  • Review power consumption

Every 6-12 months:

  • Deep clean (blow out dust thoroughly)
  • Check all cable connections
  • Update firmware (motherboard, GPU)
  • Benchmark and compare to baseline
  • Review and optimize storage
  • Replace thermal paste
  • Check PSU health
  • Review warranty status
  • Plan upgrades if needed
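
Some of the weekly software checks can be scripted. A small helper sketch, assuming a systemd-based distro with journalctl and apt:

# Recent system errors from the journal (last 7 days)
journalctl -p err --since "7 days ago" | tail -n 50

# Any pending NVIDIA driver/package updates?
apt list --upgradable 2>/dev/null | grep -i nvidia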

Ideal GPU temperatures:

  • Idle: Below 50°C
  • Training: 60-75°C
  • Max acceptable: 80°C

Action required:

  • 80-85°C: Check cooling, clean dust
  • Above 85°C: Stop training, investigate immediately
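
If temperatures sit in the 80°C+ range, confirm whether the driver is actually throttling. A quick check (throttle-reason field names can vary between driver versions):

# Current temperature plus software/hardware thermal slowdown flags
nvidia-smi --query-gpu=temperature.gpu,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_thermal_slowdown \
  --format=csv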

Normal power draw under load:

  • RTX 4090: 350-450W
  • RTX 4080: 250-320W
  • A100: 250-400W

Concerning:

  • Constant max power (possible inefficiency)
  • Fluctuating wildly (unstable workload)
  • Lower than expected (throttling)
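
To compare current draw against the board's configured limit:

nvidia-smi --query-gpu=power.draw,power.limit --format=csv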

Healthy fan curve:

  • Below 60°C: 30-40% fan speed
  • 60-70°C: 50-60% fan speed
  • 70-80°C: 70-80% fan speed
  • Above 80°C: 90-100% fan speed

Concerning:

  • Fans at 100% constantly (cooling issue)
  • Fans not spinning up (fan failure or curve issue)
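
To watch the fan curve in action while a job runs (refreshes every 5 seconds):

nvidia-smi --query-gpu=fan.speed,temperature.gpu --format=csv --loop=5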

Enable SSH server:

sudo apt install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh

Monitor remotely:

# SSH into machine
ssh user@workstation-ip

# Check GPUs
nvidia-smi

# Check system
htop
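
You can also grab a one-off reading without opening an interactive session:

# Run a single remote check and print the result locally
ssh user@workstation-ip 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu --format=csv'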

Netdata - Real-time monitoring

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Access at http://workstation-ip:19999

Glances - Web UI for system stats

pip install 'glances[web]'   # quotes keep the shell from globbing the brackets
glances -w  # Access at http://workstation-ip:61208

Termux (Android) - SSH from phone
Blink (iOS) - SSH client

Set up alerts:

# Send email on high temperature (run periodically via cron)
# Assumes a single GPU and a working `mail` command
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
if [ "$temp" -gt 85 ]; then
    echo "GPU temp high: ${temp}°C" | mail -s "Alert" [email protected]
fi

Signs of dust buildup:

  • Rising temperatures over time
  • Louder fans
  • More frequent thermal throttling

Cleaning:

  • Use compressed air
  • Hold fans while blowing (prevent spin damage)
  • Clean monthly if in dusty environment

When to replace thermal paste:

  • Every 12-18 months
  • If temps rising 10°C+ from baseline
  • After moving system

For GPUs:

  • Voids warranty usually
  • Only if out of warranty
  • Use quality paste (Thermal Grizzly, Noctua)

Script to monitor and alert:

#!/bin/bash
# save as gpu_temp_alert.sh

# Assumes a single GPU (GPU 0) and a working `mail` command (e.g. from mailutils)
THRESHOLD=85
EMAIL="[email protected]"   # replace with your address

while true; do
    temp=$(nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits)

    if [ "$temp" -gt "$THRESHOLD" ]; then
        echo "GPU temperature critical: ${temp}°C" | mail -s "GPU Alert" "$EMAIL"
        # Optional: shut the machine down to protect the hardware
        # sudo shutdown -h now
    fi

    sleep 60
done

Run on startup:

# Add to crontab
crontab -e

# Add this line (make the script executable first with chmod +x):
@reboot /path/to/gpu_temp_alert.sh &

Monitor disk space as well:

# Check disk usage
df -h

# Alert if >90% full
EMAIL="[email protected]"   # replace with your address
usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 90 ]; then
    echo "Disk usage critical: ${usage}%" | mail -s "Disk Alert" "$EMAIL"
fi

Causes of high GPU temperatures:

  1. Dust buildup
  2. Poor airflow
  3. Ambient temperature high
  4. Thermal paste dried out
  5. Fan failure

Solutions:

  1. Clean system
  2. Improve case airflow
  3. Underclock GPU slightly (see the power-limit example after this list)
  4. Replace thermal paste
  5. Replace fans
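
For the underclocking step, the simplest lever is usually the power limit rather than manual clock offsets. A sketch (the 300 W value is only an example; check your card's allowed range first):

# Show the card's min/default/max power limits
sudo nvidia-smi -q -d POWER | grep -i 'power limit'

# Cap board power at 300 W (example value; resets on reboot)
sudo nvidia-smi -pl 300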

Symptoms of degraded performance:

  • Training slower than before
  • Lower GPU utilization
  • Longer iteration times

Causes:

  1. Thermal throttling
  2. Driver issues
  3. Background processes
  4. Storage degradation

Diagnostics:

# Check throttling
nvidia-smi -q -d PERFORMANCE

# Monitor clocks
nvidia-smi -q -d CLOCK

# Check for background processes
htop

Potential causes of crashes or random reboots:

  1. Power delivery issues (PSU)
  2. Overheating
  3. RAM errors
  4. Driver bugs
  5. Overclocking instability

Solutions:

  1. Test with different PSU
  2. Monitor temps closely
  3. Run memtest86+ (or the quick in-OS check after this list)
  4. Update/downgrade drivers
  5. Reset to stock clocks
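
memtest86+ itself boots from a USB stick; for a quick in-OS spot check first, the memtester package is an alternative (it only tests memory it can allocate, so it is less thorough). A sketch:

sudo apt install memtester
# Test 4 GB of RAM for one pass (adjust the size to what's free)
sudo memtester 4096M 1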

Critical data to backup:

  • Model checkpoints
  • Training configs
  • Custom code
  • Processed datasets (if expensive to regenerate)

Backup strategies:

# Local backup to external drive
rsync -av --progress /path/to/models /mnt/external/backup/

# Cloud backup (rclone to Google Drive, S3, etc.)
rclone sync /path/to/models remote:backup/

# Automated daily backup
# Add to crontab
0 2 * * * rsync -av /path/to/models /mnt/external/backup/

Recommended setup:

  1. Basic: nvidia-smi + htop
  2. Intermediate: nvtop + Netdata web dashboard
  3. Advanced: Prometheus + Grafana + TensorBoard

Key takeaways:

  • Monitor GPU health with nvidia-smi and nvtop
  • Set up temperature alerts (see scripts above)
  • Enable remote access via SSH
  • Configure web dashboards for long-term tracking