GWDG HPC Resources
Overview
The GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen) is a joint data and IT service center for the University of Göttingen and the Max Planck Society. It provides:
- High-performance computing (HPC) clusters
- GPU resources for deep learning
- Training and courses
- Research collaboration
Getting Started
Access Requirements
- Account Registration - Apply through your institution
- SSH Key Setup - Required for cluster access
- Course Completion - Recommended for efficient usage
Cluster Access
Login to GWDG cluster
# Using SSH key authentication
ssh -i $HOME/.ssh/id_rsa_nhr -l username glogin.hlrn.de
# Example with specific key and user
ssh -i $HOME/.ssh/id_ed25519 u10000@glogin9.hlrn.de
Replace:
- id_ed25519 with your key filename
- u10000 with your actual username
- glogin9.hlrn.de with your assigned login node
~/.ssh/config - Simplify login
# Add to ~/.ssh/config for easier access
Host gwdg
HostName glogin9.hlrn.de
User u10000
IdentityFile ~/.ssh/id_ed25519
ServerAliveInterval 60
ServerAliveCountMax 3
# Now just use: ssh gwdg
Generate SSH key for GWDG
# Generate ed25519 key (recommended)
ssh-keygen -t ed25519 -f ~/.ssh/id_gwdg -C "[email protected]"
# Or RSA key
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_gwdg
# Copy public key to GWDG
# (Submit through GWDG portal or email)
cat ~/.ssh/id_gwdg.pub
GWDG Training Courses
Deep Learning with GPUs
Comprehensive course on GPU computing for deep learning on HPC clusters.
Topics Covered:
- GPU architecture and CUDA basics
- Deep learning frameworks (PyTorch, TensorFlow)
- Batch job submission with GPUs
- Multi-GPU training strategies (see the sketch below)
- Performance optimization
Deep Learning with GPUs Course
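As a rough sketch of what a multi-GPU batch submission covered in the course might look like (the GPU count, resource sizes, and the torchrun launcher are assumptions, not GWDG-specific settings):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --job-name=multi_gpu_training
#SBATCH --output=logs/%x_%j.out
# Load modules and activate the Python environment
module load python/3.10 cuda/12.1
source $HOME/venvs/ml/bin/activate
# Launch one process per GPU on a single node (torchrun ships with recent PyTorch)
torchrun --standalone --nproc_per_node=4 train.py --epochs 100 --batch-size 64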
Scientific Computing on Clusters
Practical introduction to HPC cluster usage.
Topics Covered:
- Cluster architecture
- SLURM job scheduler
- Resource allocation
- Module system
- Data management
Scientific Computing Cluster Course
Common GWDG Workflows
Submit GPU Job
gpu_job.sh - Example SLURM GPU job
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=my_training
#SBATCH --output=logs/%x_%j.out
# Load modules
module load python/3.10
module load cuda/12.1
# Activate environment
source $HOME/venvs/ml/bin/activate
# Run training
python train.py --epochs 100 --batch-size 64
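Submit the script with sbatch and confirm that it entered the queue:
# Submit the batch job
sbatch gpu_job.sh
# Check that it is queued or running
squeue -u $USER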
Interactive GPU Session
# Request interactive session with GPU
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 --time=2:00:00 --pty bash
# Check GPU availability
nvidia-smi
# Run interactive work
python
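Before starting longer interactive work, it can help to confirm that the framework actually sees the allocated GPU (a minimal check, assuming PyTorch is installed in the active environment):
# Verify PyTorch can see the GPU from inside the interactive session
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"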
Check Job Status
# Your jobs
squeue -u $USER
# Specific job details
scontrol show job JOBID
# Job history
sacct -u $USER --format=JobID,JobName,State,Elapsed,MaxRSS
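To stop a job that is queued or running, use scancel (standard SLURM):
# Cancel a single job
scancel JOBID
# Cancel all of your jobs
scancel -u $USER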
Data Management
Transfer Data to GWDG
Sync data to cluster
# Upload dataset
rsync -avhP --progress \
/local/datasets/imagenet \
u10000@glogin9.hlrn.de:/work/datasets/
# Download results
rsync -avhP --progress \
u10000@glogin9.hlrn.de:/work/experiments/checkpoints \
./local_backups/
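For large transfers it can be worth previewing what rsync would copy before moving any data; --dry-run is a standard rsync flag (hostname as in the example above):
# Preview the transfer without copying anything
rsync -avhP --dry-run \
/local/datasets/imagenet \
u10000@glogin9.hlrn.de:/work/datasets/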
Copy files with scp
# Upload file
scp -i ~/.ssh/id_gwdg large_dataset.tar.gz \
u10000@glogin9.hlrn.de:/work/datasets/
# Download file
scp -i ~/.ssh/id_gwdg \
u10000@glogin9.hlrn.de:/work/results/model.pth \
./local/
Interactive file transfer
# Connect with sftp
sftp -i ~/.ssh/id_gwdg u10000@glogin9.hlrn.de
# SFTP commands
put local_file.txt # Upload
get remote_file.txt # Download
ls # List remote files
lcd /local/path # Change local directory
Storage Locations
# Home directory (limited space)
$HOME
# Work directory (larger quota)
/work/$USER
# Scratch for temporary files
/scratch/$USER
# Check quotas
quota -s
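A minimal sketch of how the scratch space is typically used inside a job, assuming your input data lives under /work/$USER (paths, dataset name, and the --data-dir flag are illustrative):
# Stage data into scratch, work on it, then clean up
JOB_TMP=/scratch/$USER/job_$SLURM_JOB_ID
mkdir -p $JOB_TMP
cp -r /work/$USER/datasets/imagenet $JOB_TMP/
python train.py --data-dir $JOB_TMP/imagenet  # --data-dir is an assumed option of train.py
rm -rf $JOB_TMP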
Module System
# List available modules
module avail
# Search for specific module
module avail cuda
# Load modules
module load python/3.10 cuda/12.1 cudnn/8.9
# List loaded modules
module list
# Unload module
module unload cuda
# Purge all modules
module purge
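The GPU job script above activates a virtual environment at $HOME/venvs/ml; a minimal sketch of creating it once on a login node, on top of the loaded modules (package choices are assumptions):
# Create the environment using the module-provided Python
module load python/3.10 cuda/12.1
python -m venv $HOME/venvs/ml
source $HOME/venvs/ml/bin/activate
pip install --upgrade pip
pip install torch  # install whichever frameworks your jobs actually need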
Best Practices
- Test Locally First - Debug on small datasets before cluster submission
- Use Batch Jobs - Don’t run long jobs on login nodes
- Monitor Resources - Use seff JOBID to check efficiency
- Clean Up - Remove old data from scratch regularly
- Checkpointing - Save progress frequently for long jobs
Troubleshooting
Job Won’t Start
# Check queue
squeue -p gpu
# Check job priority
sprio -j JOBID
# Explain why job is pending
squeue -j JOBID --start
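It can also help to check whether the partition’s nodes are available at all; sinfo is standard SLURM:
# Show node states in the gpu partition
sinfo -p gpu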
Out of Memory Errors
# Check actual memory usage
seff JOBID
# Request more memory in job script
#SBATCH --mem=64G
Connection Issues
# Test connection
ssh -v u10000@glogin9.hlrn.de
# Check key permissions
chmod 600 ~/.ssh/id_gwdg
chmod 644 ~/.ssh/id_gwdg.pub
Additional Resources
- GWDG Official Documentation
- SLURM Guide - General HPC usage
- Multi-GPU Training - Distributed training
- Backup & Sync - Data transfer strategies