GWDG HPC Resources
Overview
The GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen) is a joint data and IT service center for the University of Göttingen and the Max Planck Society. It provides:
- High-performance computing (HPC) clusters
- GPU resources for deep learning
- Training and courses
- Research collaboration
Getting Started
Access Requirements
- Account Registration - Apply through your institution
- SSH Key Setup - Required for cluster access
- Course Completion - Recommended for efficient usage
Cluster Access
Login to GWDG cluster
# Using SSH key authentication
ssh -i $HOME/.ssh/id_rsa_nhr -l username glogin.hlrn.de
# Example with specific key and user
ssh -i $HOME/.ssh/id_ed25519 u10000@glogin9.hlrn.de
Replace:
- id_ed25519 with your key filename
- u10000 with your actual username
- glogin9.hlrn.de with your assigned login node
~/.ssh/config - Simplify login
# Add to ~/.ssh/config for easier access
Host gwdg
HostName glogin9.hlrn.de
User u10000
IdentityFile ~/.ssh/id_ed25519
ServerAliveInterval 60
ServerAliveCountMax 3
# Now just use: ssh gwdg
Generate SSH key for GWDG
# Generate ed25519 key (recommended)
ssh-keygen -t ed25519 -f ~/.ssh/id_gwdg -C "[email protected]"
# Or RSA key
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_gwdg
# Copy public key to GWDG
# (Submit through GWDG portal or email)
cat ~/.ssh/id_gwdg.pub
GWDG Training Courses
Deep Learning with GPUs
Comprehensive course on GPU computing for deep learning on HPC clusters.
Topics Covered:
- GPU architecture and CUDA basics
- Deep learning frameworks (PyTorch, TensorFlow)
- Batch job submission with GPUs
- Multi-GPU training strategies (see the sketch below)
- Performance optimization
Deep Learning with GPUs Course
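As a rough sketch of what a multi-GPU batch submission covered in the course might look like (the GPU count, resource sizes, and the torchrun launcher are assumptions, not GWDG-specific settings):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --job-name=multi_gpu_training
#SBATCH --output=logs/%x_%j.out
# Load modules and activate the Python environment
module load python/3.10 cuda/12.1
source $HOME/venvs/ml/bin/activate
# Launch one process per GPU on a single node (torchrun ships with recent PyTorch)
torchrun --standalone --nproc_per_node=4 train.py --epochs 100 --batch-size 64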
Scientific Computing on Clusters
Practical introduction to HPC cluster usage.
Topics Covered:
- Cluster architecture
- SLURM job scheduler
- Resource allocation
- Module system
- Data management
Scientific Computing Cluster Course
Common GWDG Workflows
Submit GPU Job
gpu_job.sh - Example SLURM GPU job
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=my_training
#SBATCH --output=logs/%x_%j.out
# Load modules
module load python/3.10
module load cuda/12.1
# Activate environment
source $HOME/venvs/ml/bin/activate
# Run training
python train.py --epochs 100 --batch-size 64
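Submit the script with sbatch and confirm that it entered the queue:
# Submit the batch job
sbatch gpu_job.sh
# Check that it is queued or running
squeue -u $USER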
Interactive GPU Session
# Request interactive session with GPU
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 --time=2:00:00 --pty bash
# Check GPU availability
nvidia-smi
# Run interactive work
python
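Before starting longer interactive work, it can help to confirm that the framework actually sees the allocated GPU (a minimal check, assuming PyTorch is installed in the active environment):
# Verify PyTorch can see the GPU from inside the interactive session
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"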
Check Job Status
# Your jobs
squeue -u $USER
# Specific job details
scontrol show job JOBID
# Job history
sacct -u $USER --format=JobID,JobName,State,Elapsed,MaxRSS
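To stop a job that is queued or running, use scancel (standard SLURM):
# Cancel a single job
scancel JOBID
# Cancel all of your jobs
scancel -u $USER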
Data Management
Transfer Data to GWDG
Sync data to cluster
# Upload dataset
rsync -avhP --progress \
/local/datasets/imagenet \
u10000@glogin9.hlrn.de:/work/datasets/
# Download results
rsync -avhP --progress \
u10000@glogin9.hlrn.de:/work/experiments/checkpoints \
./local_backups/
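For large transfers it can be worth previewing what rsync would copy before moving any data; --dry-run is a standard rsync flag (hostname as in the example above):
# Preview the transfer without copying anything
rsync -avhP --dry-run \
/local/datasets/imagenet \
u10000@glogin9.hlrn.de:/work/datasets/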
Copy files with scp
# Upload file
scp -i ~/.ssh/id_gwdg large_dataset.tar.gz \
u10000@glogin9.hlrn.de:/work/datasets/
# Download file
scp -i ~/.ssh/id_gwdg \
u10000@glogin9.hlrn.de:/work/results/model.pth \
./local/
Interactive file transfer
# Connect with sftp
sftp -i ~/.ssh/id_gwdg u10000@glogin9.hlrn.de
# SFTP commands
put local_file.txt # Upload
get remote_file.txt # Download
ls # List remote files
lcd /local/path # Change local directory
Storage Locations
# Home directory (limited space)
$HOME
# Work directory (larger quota)
/work/$USER
# Scratch for temporary files
/scratch/$USER
# Check quotas
quota -s
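A minimal sketch of how the scratch space is typically used inside a job, assuming your input data lives under /work/$USER (paths, dataset name, and the --data-dir flag are illustrative):
# Stage data into scratch, work on it, then clean up
JOB_TMP=/scratch/$USER/job_$SLURM_JOB_ID
mkdir -p $JOB_TMP
cp -r /work/$USER/datasets/imagenet $JOB_TMP/
python train.py --data-dir $JOB_TMP/imagenet  # --data-dir is an assumed option of train.py
rm -rf $JOB_TMP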
Module System
# List available modules
module avail
# Search for specific module
module avail cuda
# Load modules
module load python/3.10 cuda/12.1 cudnn/8.9
# List loaded modules
module list
# Unload module
module unload cuda
# Purge all modules
module purge
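The GPU job script above activates a virtual environment at $HOME/venvs/ml; a minimal sketch of creating it once on a login node, on top of the loaded modules (package choices are assumptions):
# Create the environment using the module-provided Python
module load python/3.10 cuda/12.1
python -m venv $HOME/venvs/ml
source $HOME/venvs/ml/bin/activate
pip install --upgrade pip
pip install torch  # install whichever frameworks your jobs actually need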
Best Practices
- Test Locally First - Debug on small datasets before cluster submission
- Use Batch Jobs - Don’t run long jobs on login nodes
- Monitor Resources - Use seff JOBID to check efficiency
- Clean Up - Remove old data from scratch regularly
- Checkpointing - Save progress frequently for long jobs
Troubleshooting
Job Won’t Start
# Check queue
squeue -p gpu
# Check job priority
sprio -j JOBID
# Explain why job is pending
squeue -j JOBID --start
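It can also help to check whether the partition’s nodes are available at all; sinfo is standard SLURM:
# Show node states in the gpu partition
sinfo -p gpu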
Out of Memory Errors
# Check actual memory usage
seff JOBID
# Request more memory in job script
#SBATCH --mem=64G
Connection Issues
# Test connection
ssh -v u10000@glogin9.hlrn.de
# Check key permissions
chmod 600 ~/.ssh/id_gwdg
chmod 644 ~/.ssh/id_gwdg.pub
Additional Resources
- GWDG Official Documentation
- SLURM Guide - General HPC usage
- Multi-GPU Training - Distributed training
- Backup & Sync - Data transfer strategies