Monitoring & Maintenance
Why Monitoring Matters
Your deep learning workstation is a significant investment. Proper monitoring and maintenance:
- Prevents hardware failures - Catch issues before they cause damage
- Maintains performance - Avoid gradual degradation
- Saves money - Early detection prevents costly repairs
- Ensures reliability - No interrupted training runs
What to Monitor
GPU Health
- Temperature - Prevent thermal throttling and damage
- Fan speed - Ensure adequate cooling
- Power draw - Detect anomalies
- Memory errors - ECC errors (on professional GPUs)
- Clock speeds - Check for throttling
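All of these except the ECC counters can be pulled with a single nvidia-smi query; a minimal sketch (field availability varies slightly by GPU and driver):
# Core GPU health fields, refreshed every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw,clocks.sm,clocks.mem \
    --format=csv --loop=5
# ECC error counters (professional GPUs only)
nvidia-smi -q -d ECC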
System Health
- CPU temperature - Prevent throttling
- RAM usage - Detect memory leaks
- Disk usage - Avoid running out of space
- Network - For multi-node training
- PSU health - Power delivery issues
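Most of these can be checked from the shell with standard tools; sensors comes from the lm-sensors package, and PSU health is generally not visible in software unless the PSU has its own monitoring interface. A quick sketch:
# CPU and motherboard temperatures (sudo apt install lm-sensors; run sensors-detect once)
sensors
# RAM usage
free -h
# Disk usage per filesystem
df -h
# Per-interface network counters
ip -s link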
Training Metrics
- GPU utilization - Is your GPU fully utilized?
- Training speed - Samples/second throughput
- Data loading time - Identify bottlenecks
- Loss curves - Training progress
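Training speed and loss curves come from your training framework (TensorBoard, W&B, or plain logs); GPU utilization and memory can be watched from the shell while a job runs, and sustained utilization well below ~90% usually points to a data-loading bottleneck. A minimal sketch:
# Per-GPU utilization and memory, one row every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
    --format=csv --loop=5
# Compact rolling view (u = utilization, m = memory) on recent drivers
nvidia-smi dmon -s um -d 5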
Monitoring Tools
Real-Time Monitoring
nvidia-smi - Built-in NVIDIA tool
# One-time check
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
# Log to file
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
--format=csv --loop=10 > gpu_log.csv
nvtop - Better visualization
# Install
sudo apt install nvtop
# Run
nvtop
htop - CPU and memory
sudo apt install htop
htop
Long-Term Monitoring
Prometheus + Grafana - Industry standard
- Collect metrics over time
- Beautiful dashboards
- Alerting on issues
TensorBoard - For ML metrics
- Built into TensorFlow
- PyTorch integration available
- Visualize training progress
Weights & Biases - Cloud-based
- Experiment tracking
- Hardware monitoring
- Team collaboration
Maintenance Schedule
Daily (Automated)
- GPU temperature check
- Disk space monitoring
- Training job status
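One way to automate these checks is a small script run from cron; the path and log file below are placeholders, not a standard location.
#!/bin/bash
# Hypothetical /usr/local/bin/daily_check.sh, scheduled with:
#   0 8 * * * /usr/local/bin/daily_check.sh >> /var/log/daily_check.log 2>&1
date
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader   # GPU temperatures
df -h /                                                              # disk space
# Training job status depends on your setup (e.g. tmux ls or your scheduler's queue command)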
Weekly
- Review temperature logs
- Check for driver updates
- Clean browser cache/temp files
- Review system logs for errors
Monthly
- Physical dust cleaning
- Check thermal paste (if temps rising)
- Verify all fans working
- Update system packages
- Review power consumption
Quarterly
- Deep clean (blow out dust thoroughly)
- Check all cable connections
- Update firmware (motherboard, GPU)
- Benchmark and compare to baseline
- Review and optimize storage
Yearly
- Replace thermal paste
- Check PSU health
- Review warranty status
- Plan upgrades if needed
Critical Thresholds
GPU Temperature
Ideal:
- Idle: Below 50°C
- Training: 60-75°C
- Max acceptable: 80°C
Action required:
- 80-85°C: Check cooling, clean dust
- Above 85°C: Stop training, investigate immediately
GPU Power Draw
Normal:
- RTX 4090: 350-450W
- RTX 4080: 250-320W
- A100: 250-400W
Concerning:
- Constant max power (possible inefficiency)
- Fluctuating wildly (unstable workload)
- Lower than expected (throttling)
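To compare actual draw against the configured board limit, a standard query works on most recent NVIDIA cards:
# Power draw vs. configured limit, refreshed every 5 seconds
nvidia-smi --query-gpu=power.draw,power.limit --format=csv --loop=5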
Fan Speed
Healthy curve:
- Below 60°C: 30-40% fan speed
- 60-70°C: 50-60% fan speed
- 70-80°C: 70-80% fan speed
- Above 80°C: 90-100% fan speed
Concerning:
- Fans at 100% constantly (cooling issue)
- Fans not spinning up (fan failure or curve issue)
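Fan speed can be logged alongside temperature with the same query interface (some datacenter/blower cards report [N/A] for fan speed):
# Fan speed (%) and temperature, refreshed every 5 seconds
nvidia-smi --query-gpu=fan.speed,temperature.gpu --format=csv --loop=5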
Remote Monitoring
SSH Access
Enable SSH server:
sudo apt install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh
Monitor remotely:
# SSH into machine
ssh user@workstation-ip
# Check GPUs
nvidia-smi
# Check system
htop
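If the workstation is reachable over the network, switch SSH to key-based login early; a minimal sketch with standard OpenSSH tooling:
# On your laptop: create a key and copy it to the workstation
ssh-keygen -t ed25519
ssh-copy-id user@workstation-ip
# Then, optionally, set "PasswordAuthentication no" in /etc/ssh/sshd_config
# on the workstation and run: sudo systemctl restart ssh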
Web Dashboards
Netdata - Real-time monitoring
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
# Access at http://workstation-ip:19999
Glances - Web UI for system stats
pip install 'glances[web]'
glances -w  # Access at http://workstation-ip:61208
Mobile Monitoring
Termux (Android) - SSH from your phone
Blink (iOS) - SSH client
Set up alerts:
# Send email on high temperature
# (Configure with cron)
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n 1)  # first GPU
if [ "$temp" -gt 85 ]; then
  echo "GPU temp high: ${temp}°C" | mail -s "Alert" [email protected]
fi
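Note that the mail command above only works with a configured MTA; on Ubuntu one common option is the mailutils package (plus an SMTP relay such as msmtp or postfix for external delivery).
# Provides the mail command used in the alert snippets
sudo apt install mailutils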
Preventive Maintenance
Dust Management
Signs of dust buildup:
- Rising temperatures over time
- Louder fans
- More frequent thermal throttling
Cleaning:
- Use compressed air
- Hold fans while blowing (prevent spin damage)
- Clean monthly if in dusty environment
Thermal Paste Replacement
When to replace:
- Every 12-18 months
- If temps rising 10°C+ from baseline
- After moving system
For GPUs:
- Usually voids the warranty
- Only do this if the card is out of warranty
- Use quality paste (Thermal Grizzly, Noctua)
Alerts & Automation
Temperature Alerts
Script to monitor and alert:
#!/bin/bash
# save as gpu_temp_alert.sh
THRESHOLD=85
EMAIL="[email protected]"
while true; do
  # Hottest GPU in the system (handles multi-GPU machines)
  temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -n 1)
  if [ "$temp" -gt "$THRESHOLD" ]; then
    echo "GPU temperature critical: ${temp}°C" | mail -s "GPU Alert" "$EMAIL"
    # Optional: shut down to protect the hardware
    # sudo shutdown -h now
  fi
  sleep 60
done
Run on startup:
# Add to crontab
crontab -e
# Add line:
@reboot /path/to/gpu_temp_alert.sh &
Disk Space Alerts
# Check disk usage
df -h
# Alert if >90% full
EMAIL="[email protected]"
usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 90 ]; then
  echo "Disk usage critical: ${usage}%" | mail -s "Disk Alert" "$EMAIL"
fi
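When the alert fires, du and sort are usually enough to find what is eating space; on training boxes stale checkpoints are the most common culprit (the paths and extension below are placeholders).
# Largest top-level directories
sudo du -h --max-depth=1 / 2>/dev/null | sort -h | tail -n 20
# List (not delete) checkpoints older than 30 days before cleaning up
find /path/to/checkpoints -name '*.ckpt' -mtime +30 -ls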
Troubleshooting
High Temperatures
Causes:
- Dust buildup
- Poor airflow
- Ambient temperature high
- Thermal paste dried out
- Fan failure
Solutions:
- Clean system
- Improve case airflow
- Underclock or power-limit the GPU slightly (see the sketch after this list)
- Replace thermal paste
- Replace fans
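For the underclocking option, the simplest lever is usually the board power limit via nvidia-smi; the 300 W figure below is only an example and must fall within the range your card reports.
# Show the card's allowed power range
nvidia-smi -q -d POWER
# Cap board power (example value; resets on reboot)
sudo nvidia-smi -pl 300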
Performance Degradation
Symptoms:
- Training slower than before
- Lower GPU utilization
- Longer iteration times
Causes:
- Thermal throttling
- Driver issues
- Background processes
- Storage degradation
Diagnostics:
# Check throttling
nvidia-smi -q -d PERFORMANCE
# Monitor clocks
nvidia-smi -q -d CLOCK
# Check for background processes
htop
Crashes During Training
Potential causes:
- Power delivery issues (PSU)
- Overheating
- RAM errors
- Driver bugs
- Overclocking instability
Solutions:
- Test with different PSU
- Monitor temps closely
- Run memtest86+
- Update/downgrade drivers
- Reset to stock clocks
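After a crash, the kernel log is often the fastest way to separate a GPU fault (NVIDIA Xid errors) from power or RAM problems:
# NVIDIA Xid errors reported by the driver
sudo dmesg | grep -i xid
# Kernel-level errors from the current boot
journalctl -k -p err -b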
Data Backup
Critical data to backup:
- Model checkpoints
- Training configs
- Custom code
- Processed datasets (if expensive to regenerate)
Backup strategies:
# Local backup to external drive
rsync -av --progress /path/to/models /mnt/external/backup/
# Cloud backup (rclone to Google Drive, S3, etc.)
rclone sync /path/to/models remote:backup/
# Automated daily backup
# Add to crontab
0 2 * * * rsync -av /path/to/models /mnt/external/backup/
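Before trusting a backup job, rsync's dry-run flag shows what a sync would transfer without writing anything:
# Dry run: report what would be copied, make no changes
rsync -avn --progress /path/to/models /mnt/external/backup/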
Monitoring Stack Setup
Recommended setup:
- Basic: nvidia-smi + htop
- Intermediate: nvtop + Netdata web dashboard
- Advanced: Prometheus + Grafana + TensorBoard
Next Steps
- Monitor GPU health with nvidia-smi and nvtop
- Set up temperature alerts (see scripts above)
- Enable remote access via SSH
- Configure web dashboards for long-term tracking