GWDG HPC Resources

The GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen) is a joint data and IT service center for the University of Göttingen and the Max Planck Society. It provides:

  • High-performance computing (HPC) clusters
  • GPU resources for deep learning
  • Training and courses
  • Research collaboration

Getting Started:

  1. Account Registration - Apply through your institution
  2. SSH Key Setup - Required for cluster access (see the key generation sketch below)
  3. Course Completion - Recommended for efficient usage
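
A minimal sketch of the key setup in step 2 (the filename id_ed25519_gwdg and the comment string are only examples):

# Generate an ed25519 key pair on your local machine
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_gwdg -C "gwdg-hpc"

# The public key is what you submit during account registration;
# the private key never leaves your machine
cat ~/.ssh/id_ed25519_gwdg.pub
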
Login to GWDG cluster
# Using SSH key authentication
ssh -i $HOME/.ssh/id_rsa_nhr -l username glogin.hlrn.de

# Example with specific key and user
ssh -i $HOME/.ssh/id_ed25519 [email protected]

Replace:

  • id_ed25519 with your key filename
  • u10000 with your actual username
  • glogin9.hlrn.de with your assigned login node
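
To avoid retyping the key and host, the connection details can go into ~/.ssh/config (a sketch; the Host alias gwdg is arbitrary):

# ~/.ssh/config
Host gwdg
    HostName glogin9.hlrn.de
    User u10000
    IdentityFile ~/.ssh/id_ed25519

After that, ssh gwdg is enough to connect.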

Deep Learning with GPUs Course

Comprehensive course on GPU computing for deep learning on HPC clusters.

Topics Covered:

  • GPU architecture and CUDA basics
  • Deep learning frameworks (PyTorch, TensorFlow)
  • Batch job submission with GPUs
  • Multi-GPU training strategies (sketched below)
  • Performance optimization

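As a rough sketch of the batch-submission and multi-GPU topics above (partition, module and script names follow the examples elsewhere on this page; torchrun assumes PyTorch is installed in the environment):

#!/bin/bash
#SBATCH --partition=gpu
# Four GPUs on one node
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00

module load python/3.10 cuda/12.1
source $HOME/venvs/ml/bin/activate

# torchrun launches one training process per GPU (data-parallel training)
torchrun --standalone --nproc_per_node=4 train.py --epochs 100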

Scientific Computing Cluster Course

Practical introduction to HPC cluster usage.

Topics Covered:

  • Cluster architecture
  • SLURM job scheduler
  • Resource allocation
  • Module system
  • Data management


gpu_job.sh - Example SLURM GPU job
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=my_training
# %x = job name, %j = job ID; create logs/ before submitting, SLURM will not create it
#SBATCH --output=logs/%x_%j.out

# Load modules
module load python/3.10
module load cuda/12.1

# Activate environment
source $HOME/venvs/ml/bin/activate

# Run training
python train.py --epochs 100 --batch-size 64
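
To submit the script (the mkdir matches the logs/ path used by --output above):

# Create the output directory once, then submit
mkdir -p logs
sbatch gpu_job.sh

# Check that the job is queued or running
squeue -u $USER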

Interactive GPU session
# Request interactive session with GPU
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 --time=2:00:00 --pty bash

# Check GPU availability
nvidia-smi

# Run interactive work
python
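
Inside the session, a quick sanity check that the allocated GPU is visible (assuming PyTorch is installed in the activated environment):

# One-liner GPU check from the shell
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"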

Monitor jobs
# Your jobs
squeue -u $USER

# Specific job details
scontrol show job JOBID

# Job history
sacct -u $USER --format=JobID,JobName,State,Elapsed,MaxRSS
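
The default squeue columns are terse; a custom format string (a sketch, adjust the widths to taste) adds state, runtime, and the reason a job is waiting:

# Wider queue listing for your own jobs
squeue -u $USER -o "%.12i %.20j %.8T %.10M %.20R"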

Sync data to cluster
# Upload dataset
rsync -avhP \
  /local/datasets/imagenet \
  [email protected]:/work/datasets/

# Download results
rsync -avhP \
  [email protected]:/work/experiments/checkpoints \
  ./local_backups/
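
For large datasets it can help to preview a transfer first; -n (--dry-run) lists what would be copied without writing anything:

# Preview the upload without transferring data
rsync -avhn /local/datasets/imagenet [email protected]:/work/datasets/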

Storage locations
# Home directory (limited space)
$HOME

# Work directory (larger quota)
/work/$USER

# Scratch for temporary files
/scratch/$USER

# Check quotas
quota -s
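
A rough sketch for tracking usage and spotting stale scratch data (the 30-day threshold is only an example):

# Per-directory usage in your work area
du -sh /work/$USER/*

# Scratch files untouched for more than 30 days (cleanup candidates)
find /scratch/$USER -type f -mtime +30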

Module system
# List available modules
module avail

# Search for specific module
module avail cuda

# Load modules
module load python/3.10 cuda/12.1 cudnn/8.9

# List loaded modules
module list

# Unload module
module unload cuda

# Purge all modules
module purge
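
To see what a particular module would change before loading it:

# Show the paths and environment variables a module sets
module show cuda/12.1

# One-line description
module whatis cuda/12.1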

  1. Test Locally First - Debug on small datasets before cluster submission
  2. Use Batch Jobs - Don’t run long jobs on login nodes
  3. Monitor Resources - Use seff JOBID to check efficiency
  4. Clean Up - Remove old data from scratch regularly
  5. Checkpointing - Save progress frequently for long jobs (see the requeue sketch below)
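
A sketch of a checkpoint-and-requeue pattern for jobs that may hit the time limit (the --resume-from flag of train.py is hypothetical; the training script must actually write and load checkpoints for this to work):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
# Send SIGUSR1 to this batch script 5 minutes before the time limit
#SBATCH --signal=B:USR1@300
#SBATCH --requeue

# When the warning signal arrives, put the job back in the queue;
# on restart the training resumes from its latest checkpoint
trap 'echo "Time limit near, requeueing"; scontrol requeue "$SLURM_JOB_ID"' USR1

module load python/3.10 cuda/12.1
source $HOME/venvs/ml/bin/activate

# Run in the background and wait, so the trap can fire while training runs
python train.py --epochs 100 --resume-from checkpoints/latest.pt &
wait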

Job stuck in the queue
# Check queue
squeue -p gpu

# Check job priority
sprio -j JOBID

# Show expected start time and reason the job is pending
squeue -j JOBID --start
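
Whether a pending job can start also depends on how busy the partition is; sinfo summarises node states per partition:

# Node availability in the GPU partition
sinfo -p gpu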

Out-of-memory errors
# Check actual memory usage
seff JOBID

# Request more memory in job script
#SBATCH --mem=64G

SSH connection problems
# Test connection
ssh -v [email protected]

# Check key and directory permissions (ssh ignores private keys that are too open)
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_gwdg
chmod 644 ~/.ssh/id_gwdg.pub