Monitoring & Maintenance
Why Monitoring Matters
Your deep learning workstation is a significant investment. Proper monitoring and maintenance:
- Prevents hardware failures - Catch issues before they cause damage
- Maintains performance - Avoid gradual degradation
- Saves money - Early detection prevents costly repairs
- Ensures reliability - No interrupted training runs
What to Monitor
GPU Health
- Temperature - Prevent thermal throttling and damage
- Fan speed - Ensure adequate cooling
- Power draw - Detect anomalies
- Memory errors - ECC errors (on professional GPUs)
- Clock speeds - Check for throttling
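All of these except the ECC counters can be pulled with a single nvidia-smi query; a minimal sketch (field availability varies slightly by GPU and driver):
# Core GPU health fields, refreshed every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw,clocks.sm,clocks.mem \
    --format=csv --loop=5
# ECC error counters (professional GPUs only)
nvidia-smi -q -d ECC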
System Health
- CPU temperature - Prevent throttling
- RAM usage - Detect memory leaks
- Disk usage - Avoid running out of space
- Network - For multi-node training
- PSU health - Power delivery issues
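Most of these can be checked from the shell with standard tools; sensors comes from the lm-sensors package, and PSU health is generally not visible in software unless the PSU has its own monitoring interface. A quick sketch:
# CPU and motherboard temperatures (sudo apt install lm-sensors; run sensors-detect once)
sensors
# RAM usage
free -h
# Disk usage per filesystem
df -h
# Per-interface network counters
ip -s link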
Training Metrics
- GPU utilization - Is your GPU fully utilized?
- Training speed - Samples/second throughput
- Data loading time - Identify bottlenecks
- Loss curves - Training progress
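Training speed and loss curves come from your training framework (TensorBoard, W&B, or plain logs); GPU utilization and memory can be watched from the shell while a job runs, and sustained utilization well below ~90% usually points to a data-loading bottleneck. A minimal sketch:
# Per-GPU utilization and memory, one row every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
    --format=csv --loop=5
# Compact rolling view (u = utilization, m = memory) on recent drivers
nvidia-smi dmon -s um -d 5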
Monitoring Tools
Real-Time Monitoring
nvidia-smi - Built-in NVIDIA tool
# One-time check
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
# Log to file
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
--format=csv --loop=10 > gpu_log.csv
nvtop - Better visualization
# Install
sudo apt install nvtop
# Run
nvtop
htop - CPU and memory
sudo apt install htop
htop
Long-Term Monitoring
Prometheus + Grafana - Industry standard
- Collect metrics over time
- Beautiful dashboards
- Alerting on issues
TensorBoard - For ML metrics
- Built into TensorFlow
- PyTorch integration available
- Visualize training progress
Weights & Biases - Cloud-based
- Experiment tracking
- Hardware monitoring
- Team collaboration
Maintenance Schedule
Daily (Automated)
- GPU temperature check
- Disk space monitoring
- Training job status
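One way to automate these checks is a small script run from cron; the path and log file below are placeholders, not a standard location.
#!/bin/bash
# Hypothetical /usr/local/bin/daily_check.sh, scheduled with:
#   0 8 * * * /usr/local/bin/daily_check.sh >> /var/log/daily_check.log 2>&1
date
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader   # GPU temperatures
df -h /                                                              # disk space
# Training job status depends on your setup (e.g. tmux ls or your scheduler's queue command)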
Weekly
- Review temperature logs
- Check for driver updates
- Clean browser cache/temp files
- Review system logs for errors
Monthly
- Physical dust cleaning
- Check thermal paste (if temps rising)
- Verify all fans working
- Update system packages
- Review power consumption
Quarterly
- Deep clean (blow out dust thoroughly)
- Check all cable connections
- Update firmware (motherboard, GPU)
- Benchmark and compare to baseline
- Review and optimize storage
Yearly
- Replace thermal paste
- Check PSU health
- Review warranty status
- Plan upgrades if needed
Critical Thresholds
GPU Temperature
Ideal:
- Idle: Below 50°C
- Training: 60-75°C
- Max acceptable: 80°C
Action required:
- 80-85°C: Check cooling, clean dust
- Above 85°C: Stop training, investigate immediately
GPU Power Draw
Normal:
- RTX 4090: 350-450W
- RTX 4080: 250-320W
- A100: 250-400W
Concerning:
- Constant max power (possible inefficiency)
- Fluctuating wildly (unstable workload)
- Lower than expected (throttling)
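To compare actual draw against the configured board limit, a standard query works on most recent NVIDIA cards:
# Power draw vs. configured limit, refreshed every 5 seconds
nvidia-smi --query-gpu=power.draw,power.limit --format=csv --loop=5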
Fan Speed
Healthy curve:
- Below 60°C: 30-40% fan speed
- 60-70°C: 50-60% fan speed
- 70-80°C: 70-80% fan speed
- Above 80°C: 90-100% fan speed
Concerning:
- Fans at 100% constantly (cooling issue)
- Fans not spinning up (fan failure or curve issue)
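Fan speed can be logged alongside temperature with the same query interface (some datacenter/blower cards report [N/A] for fan speed):
# Fan speed (%) and temperature, refreshed every 5 seconds
nvidia-smi --query-gpu=fan.speed,temperature.gpu --format=csv --loop=5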
Remote Monitoring
SSH Access
Enable SSH server:
sudo apt install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh
Monitor remotely:
# SSH into machine
ssh user@workstation-ip
# Check GPUs
nvidia-smi
# Check system
htop
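If the workstation is reachable over the network, switch SSH to key-based login early; a minimal sketch with standard OpenSSH tooling:
# On your laptop: create a key and copy it to the workstation
ssh-keygen -t ed25519
ssh-copy-id user@workstation-ip
# Then, optionally, set "PasswordAuthentication no" in /etc/ssh/sshd_config
# on the workstation and run: sudo systemctl restart ssh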
Web Dashboards
Netdata - Real-time monitoring
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
# Access at http://workstation-ip:19999
Glances - Web UI for system stats
pip install 'glances[web]'
glances -w  # Access at http://workstation-ip:61208
Mobile Monitoring
Termux (Android) - SSH from your phone
Blink (iOS) - SSH client
Set up alerts:
# Send email on high temperature
# (Configure with cron)
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n 1)  # first GPU
if [ "$temp" -gt 85 ]; then
  echo "GPU temp high: ${temp}°C" | mail -s "Alert" [email protected]
fi
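Note that the mail command above only works with a configured MTA; on Ubuntu one common option is the mailutils package (plus an SMTP relay such as msmtp or postfix for external delivery).
# Provides the mail command used in the alert snippets
sudo apt install mailutils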
Preventive Maintenance
Dust Management
Signs of dust buildup:
- Rising temperatures over time
- Louder fans
- More frequent thermal throttling
Cleaning:
- Use compressed air
- Hold fans while blowing (prevent spin damage)
- Clean monthly if in dusty environment
Thermal Paste Replacement
When to replace:
- Every 12-18 months
- If temps rising 10°C+ from baseline
- After moving system
For GPUs:
- Usually voids the warranty
- Only do this if the card is out of warranty
- Use quality paste (Thermal Grizzly, Noctua)
Alerts & Automation
Temperature Alerts
Script to monitor and alert:
#!/bin/bash
# save as gpu_temp_alert.sh
THRESHOLD=85
EMAIL="[email protected]"
while true; do
  # Hottest GPU in the system (handles multi-GPU machines)
  temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -n 1)
  if [ "$temp" -gt "$THRESHOLD" ]; then
    echo "GPU temperature critical: ${temp}°C" | mail -s "GPU Alert" "$EMAIL"
    # Optional: shut down to protect the hardware
    # sudo shutdown -h now
  fi
  sleep 60
done
Run on startup:
# Add to crontab
crontab -e
# Add line:
@reboot /path/to/gpu_temp_alert.sh &
Disk Space Alerts
# Check disk usage
df -h
# Alert if >90% full
EMAIL="[email protected]"
usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 90 ]; then
  echo "Disk usage critical: ${usage}%" | mail -s "Disk Alert" "$EMAIL"
fi
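When the alert fires, du and sort are usually enough to find what is eating space; on training boxes stale checkpoints are the most common culprit (the paths and extension below are placeholders).
# Largest top-level directories
sudo du -h --max-depth=1 / 2>/dev/null | sort -h | tail -n 20
# List (not delete) checkpoints older than 30 days before cleaning up
find /path/to/checkpoints -name '*.ckpt' -mtime +30 -ls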
Troubleshooting
High Temperatures
Causes:
- Dust buildup
- Poor airflow
- Ambient temperature high
- Thermal paste dried out
- Fan failure
Solutions:
- Clean system
- Improve case airflow
- Underclock or power-limit the GPU slightly (see the sketch after this list)
- Replace thermal paste
- Replace fans
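For the underclocking option, the simplest lever is usually the board power limit via nvidia-smi; the 300 W figure below is only an example and must fall within the range your card reports.
# Show the card's allowed power range
nvidia-smi -q -d POWER
# Cap board power (example value; resets on reboot)
sudo nvidia-smi -pl 300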
Performance Degradation
Symptoms:
- Training slower than before
- Lower GPU utilization
- Longer iteration times
Causes:
- Thermal throttling
- Driver issues
- Background processes
- Storage degradation
Diagnostics:
# Check throttling
nvidia-smi -q -d PERFORMANCE
# Monitor clocks
nvidia-smi -q -d CLOCK
# Check for background processes
htop
Crashes During Training
Potential causes:
- Power delivery issues (PSU)
- Overheating
- RAM errors
- Driver bugs
- Overclocking instability
Solutions:
- Test with different PSU
- Monitor temps closely
- Run memtest86+
- Update/downgrade drivers
- Reset to stock clocks
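After a crash, the kernel log is often the fastest way to separate a GPU fault (NVIDIA Xid errors) from power or RAM problems:
# NVIDIA Xid errors reported by the driver
sudo dmesg | grep -i xid
# Kernel-level errors from the current boot
journalctl -k -p err -b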
Data Backup
Critical data to backup:
- Model checkpoints
- Training configs
- Custom code
- Processed datasets (if expensive to regenerate)
Backup strategies:
# Local backup to external drive
rsync -av --progress /path/to/models /mnt/external/backup/
# Cloud backup (rclone to Google Drive, S3, etc.)
rclone sync /path/to/models remote:backup/
# Automated daily backup
# Add to crontab
0 2 * * * rsync -av /path/to/models /mnt/external/backup/
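Before trusting a backup job, rsync's dry-run flag shows what a sync would transfer without writing anything:
# Dry run: report what would be copied, make no changes
rsync -avn --progress /path/to/models /mnt/external/backup/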
Monitoring Stack Setup
Recommended setup:
- Basic: nvidia-smi + htop
- Intermediate: nvtop + Netdata web dashboard
- Advanced: Prometheus + Grafana + TensorBoard
Next Steps
- Monitor GPU health with nvidia-smi and nvtop
- Set up temperature alerts (see scripts above)
- Enable remote access via SSH
- Configure web dashboards for long-term tracking