Multi-GPU Training Setup
Multi-GPU Training Overview
Training on multiple GPUs can dramatically reduce training time, but requires proper setup and an understanding of the different parallelism strategies.
Parallelism Strategies
Data Parallelism (DP)
Most common approach
- Same model replicated on each GPU
- Different data batches per GPU
- Gradients synchronized after backward pass
- Linear scaling (ideally)
Best for: Most training scenarios, models that fit on single GPU
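For reference, a minimal single-process data-parallel sketch with torch.nn.DataParallel (MyModel and train_loader are placeholders for your own code):
import torch
import torch.nn as nn

model = MyModel()                           # placeholder model
if torch.cuda.device_count() > 1:
    # Replicate the model on every visible GPU; each input batch is split across them
    model = nn.DataParallel(model)
model = model.cuda()

for data, target in train_loader:           # placeholder DataLoader
    data, target = data.cuda(), target.cuda()
    output = model(data)                    # scatter, parallel forward, gather on GPU 0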
Distributed Data Parallel (DDP)
Recommended over DP
- More efficient than Data Parallel
- Better multi-node support
- Process per GPU
- NCCL backend for communication
Best for: Any multi-GPU training
Model Parallelism
For very large models
- Model split across GPUs
- Different layers on different GPUs
- Sequential execution
- Communication overhead
Best for: Models too large for single GPU
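A minimal sketch of manual model parallelism, splitting a toy two-block network across two GPUs (the layer sizes are illustrative):
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0, second half on GPU 1
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations are copied between GPUs on every forward pass (communication overhead)
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 1024))   # output (and any loss) lives on cuda:1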
Pipeline Parallelism
Advanced
- Model split into stages
- Micro-batching
- Overlapped computation
- Complex to implement
Best for: Very large models with many layers
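Conceptually, pipeline parallelism adds micro-batching on top of a split like the TwoGPUModel above; a naive GPipe-style sketch (data, target, criterion, and optimizer are placeholders, and real pipeline engines such as DeepSpeed handle the stage scheduling):
micro_batches = data.chunk(4)                 # split the batch into 4 micro-batches
micro_targets = target.chunk(4)

optimizer.zero_grad()
for mb, tgt in zip(micro_batches, micro_targets):
    h = model.part1(mb.to("cuda:0"))          # stage 1 on GPU 0
    out = model.part2(h.to("cuda:1"))         # stage 2 on GPU 1
    loss = criterion(out, tgt.to("cuda:1")) / len(micro_batches)
    loss.backward()                           # gradients accumulate across micro-batches
optimizer.step()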
Hardware Requirements
GPU Selection
Critical: Use identical GPUs
✓ Good: 4x RTX 4090
✓ Good: 2x A100 80GB
✗ Bad: 2x RTX 4090 + 1x RTX 4080
✗ Bad: 1x A100 + 1x A6000
Why identical GPUs?
- Mixed GPUs limited by slowest card
- Memory differences cause imbalance
- Driver compatibility issues
PCIe Configuration
Check PCIe lanes:
# See PCIe generation and lanes
lspci -vv | grep -i "lnkcap\|lnksta"
# Ideal for multi-GPU:
# 2 GPUs: x16/x16 or x16/x8
# 3 GPUs: x16/x8/x8
# 4 GPUs: x8/x8/x8/x8
PCIe Gen 4 x8 vs Gen 3 x16:
- Gen 4 x8: ~16 GB/s
- Gen 3 x16: ~16 GB/s
- Both adequate for most training
NVLink (Professional GPUs)
If available (A100, H100, etc.):
# Check NVLink status
nvidia-smi nvlink --status
# Should show connected topology
NVLink benefits:
- 10x faster GPU-to-GPU vs PCIe
- Enables larger effective batch sizes
- Better scaling efficiency
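To see what the interconnect (PCIe or NVLink) actually delivers, you can time a large device-to-device copy; a rough sketch, assuming at least two GPUs are visible:
import time
import torch

x = torch.empty(1024**3 // 4, dtype=torch.float32, device="cuda:0")  # ~1 GiB tensor
_ = x.to("cuda:1")                  # warm up the peer-to-peer path
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

start = time.time()
for _ in range(10):
    _ = x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.time() - start

gb_copied = 10 * x.numel() * x.element_size() / 1e9
print(f"GPU0 -> GPU1: {gb_copied / elapsed:.1f} GB/s")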
System RAM
Rule of thumb: 16GB RAM per GPU
2 GPUs: 32GB minimum
4 GPUs: 64GB minimum
8 GPUs: 128GB minimum
Setup Guide
PyTorch DistributedDataParallel Setup
Step 1: Basic Training Script
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    """Initialize distributed training"""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # Initialize process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    """Clean up distributed training"""
    dist.destroy_process_group()

def train(rank, world_size):
    """Training function for each GPU"""
    # Setup
    setup(rank, world_size)
    # Create model and move to GPU
    # (YourModel, train_dataset, batch_size, num_epochs are placeholders for your own code)
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])
    # Create distributed sampler
    train_sampler = DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        sampler=train_sampler,
        num_workers=4,
        pin_memory=True
    )
    criterion = torch.nn.CrossEntropyLoss()  # assuming a classification loss
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Important for shuffling
        for batch in train_loader:
            data, target = batch
            data, target = data.to(rank), target.to(rank)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
    cleanup()

def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
Step 2: Launch Training
# Direct launch (works with the mp.spawn script above)
python train_script.py
# Launch with torchrun (one process per GPU)
torchrun --nproc_per_node=4 train_script.py
# Multi-node
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=12355 train_script.py
Note: torchrun spawns the worker processes itself and sets RANK, LOCAL_RANK, and WORLD_SIZE, so a script launched with torchrun should read those variables rather than call mp.spawn, as in the sketch below.
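A sketch of a torchrun-friendly entry point (the model/DDP setup is assumed to be the same as in the script above):
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    # RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are also set by torchrun,
    # so init_process_group can read everything from the environment
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)
    # ... build the model, DDP wrapper, and training loop as above,
    # using local_rank as the device ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()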
TensorFlow MirroredStrategy
Step 1: Setup Strategy
import tensorflow as tf

# Create strategy
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# Build and compile the model within the strategy scope
# (create_model, x_train, y_train, batch_size are placeholders for your own code)
with strategy.scope():
    model = create_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

# Create distributed dataset
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.batch(batch_size * strategy.num_replicas_in_sync)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

# Train
model.fit(train_dataset, epochs=10)
Step 2: Multi-Node (MultiWorkerMirroredStrategy)
import json
import os

# Configure cluster
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ['192.168.1.1:12345', '192.168.1.2:12345']
    },
    'task': {'type': 'worker', 'index': 0}  # Change per node
})

# Create strategy
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = create_model()
    # ... rest same as before
Optimization Tips
Batch Size Scaling
Linear scaling rule:
# Single GPU
batch_size_per_gpu = 32
# 4 GPUs
total_batch_size = 32 * 4  # = 128
Adjust learning rate:
# Linear scaling (most common)
lr_single_gpu = 1e-3
lr_multi_gpu = 1e-3 * num_gpus
# Square root scaling (sometimes better)
lr_multi_gpu = 1e-3 * math.sqrt(num_gpus)  # requires: import math
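Scaled-up learning rates are usually combined with a short warmup; a minimal sketch (warmup_epochs, num_epochs, and optimizer are placeholders):
base_lr = 1e-3
num_gpus = 4
target_lr = base_lr * num_gpus          # linear scaling rule
warmup_epochs = 5                       # assumed warmup length

for epoch in range(num_epochs):
    if epoch < warmup_epochs:
        lr = target_lr * (epoch + 1) / warmup_epochs   # ramp up linearly
    else:
        lr = target_lr
    for group in optimizer.param_groups:
        group["lr"] = lr
    # ... run one training epoch ...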
Gradient Accumulation
For even larger effective batch sizes:
accumulation_steps = 4  # Accumulate gradients over 4 batches
effective_batch = batch_size * num_gpus * accumulation_steps

for i, batch in enumerate(train_loader):
    data, target = batch
    data, target = data.to(rank), target.to(rank)
    output = model(data)
    loss = criterion(output, target) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Communication Optimization
Gradient bucketing:
# PyTorch DDP automatically buckets gradients
model = DDP(
    model,
    device_ids=[rank],
    bucket_cap_mb=25,               # Default, tune if needed
    find_unused_parameters=False    # Set True only if necessary
)
NCCL tuning:
# Environment variables for better performance
export NCCL_DEBUG=INFO # For debugging
export NCCL_IB_DISABLE=0 # Enable InfiniBand if available
export NCCL_SOCKET_IFNAME=eth0   # Specify network interface
Monitoring Multi-GPU Training
GPU Utilization
# Watch all GPUs
watch -n 1 nvidia-smi
# Or use nvtop for better visualization
nvtop
What to look for:
- All GPUs at similar utilization (80-100%)
- Balanced memory usage across GPUs
- Minimal GPU-Util fluctuation
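To log the same numbers programmatically (for example once per epoch from rank 0), the NVML bindings can be queried; a sketch assuming the pynvml package is installed:
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu          # percent
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: util {util}% | mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()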
Check Scaling Efficiency
Ideal scaling:
1 GPU: 1x speed (baseline)
2 GPUs: 2x speed (100% efficiency)
4 GPUs: 4x speed (100% efficiency)
Real-world scaling:
1 GPU: 1x speed
2 GPUs: 1.9x speed (95% efficiency) ✓ Good
4 GPUs: 3.5x speed (88% efficiency) ✓ Acceptable
8 GPUs: 6.5x speed (81% efficiency) ✓ Acceptable
Measure it:
import time
# Single GPU
start = time.time()
train_one_epoch()
single_gpu_time = time.time() - start
# Multi GPU
start = time.time()
train_one_epoch() # With DDP
multi_gpu_time = time.time() - start
speedup = single_gpu_time / multi_gpu_time
efficiency = speedup / num_gpus * 100
print(f"Speedup: {speedup:.2f}x")
print(f"Efficiency: {efficiency:.1f}%")Common Issues
GPUs Not Detected
# Check all GPUs visible
python -c "import torch; print(torch.cuda.device_count())"
# Should match:
nvidia-smi -L
If mismatch:
# Check CUDA_VISIBLE_DEVICES
echo $CUDA_VISIBLE_DEVICES
# Clear it if set incorrectly
unset CUDA_VISIBLE_DEVICES
Unbalanced GPU Load
Symptoms:
- GPU 0 at 100%, others at 60%
- Different memory usage across GPUs
Causes:
- Data loading bottleneck
- Uneven data distribution
- Model on GPU 0 during evaluation
Solutions:
# Use a distributed sampler
train_sampler = DistributedSampler(dataset, shuffle=True)

# Increase num_workers
DataLoader(dataset, num_workers=8)  # Higher than the default

# During evaluation, keep each rank's data on its own GPU
model.eval()
with torch.no_grad():
    # Ensure data goes to the correct GPU
    data = data.to(rank)
Out of Memory with Multi-GPU
Surprising but common:
Cause: the effective batch grows with GPU count, and DDP itself adds per-GPU overhead (gradient buckets and communication buffers), so a per-GPU batch size that fit on one GPU may no longer fit
Solution:
# Reduce per-GPU batch size
batch_size = 16 # Instead of 32
# Or use gradient accumulation
# (See above)
Slow Multi-GPU Training
Check:
- PCIe bandwidth: nvidia-smi topo -m
- CPU bottleneck: htop during training
- Data loading: profile with the PyTorch Profiler (see the sketch below)
- Network (multi-node): check with iperf3
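For the data-loading check, a short profiling pass over a few steps shows where time goes; a sketch with torch.profiler (model, train_loader, criterion, and optimizer are placeholders):
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, (data, target) in enumerate(train_loader):
        if i >= 10:
            break
        data, target = data.cuda(), target.cuda()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# A large share of CPU time relative to CUDA time often points to a data-loading bottleneck
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))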
Advanced: Model Parallelism
For models too large for single GPU:
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Mixed-precision and auto-wrap policies used below
mp_policy = MixedPrecision(param_dtype=torch.bfloat16)
auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

# Wrap model with FSDP
model = FSDP(
    model,
    device_id=rank,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=mp_policy,
)
# Train normally; the model is automatically sharded across GPUs

Alternative: DeepSpeed ZeRO
import deepspeed
# DeepSpeed config
ds_config = {
    "train_batch_size": 128,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # ZeRO stage 2 or 3
    }
}

# Initialize
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Train
for batch in train_loader:
    # Assumes the model's forward() computes and returns the loss for the batch
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()
Multi-Node Training
When you need it:
- More than 8 GPUs
- Model doesn’t fit on single node
- Distributed across cluster
See: HPC Integration for SLURM-based multi-node training
Benchmarking Script
import torch
import time
from torch.utils.data import DataLoader, TensorDataset

def benchmark_multi_gpu(model, batch_size, num_iterations=100):
    """Benchmark multi-GPU training speed"""
    # Dummy data (1000 ImageNet-sized samples)
    dummy_data = torch.randn(1000, 3, 224, 224)
    dummy_labels = torch.randint(0, 1000, (1000,))
    dataset = TensorDataset(dummy_data, dummy_labels)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    # Warmup
    for i, (data, target) in enumerate(loader):
        if i >= 10:
            break
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

    # Benchmark
    torch.cuda.synchronize()
    start = time.time()
    iterations = 0
    for i, (data, target) in enumerate(loader):
        if i >= num_iterations:
            break
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        iterations += 1
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # Use the number of iterations actually run (the loader may yield
    # fewer batches than num_iterations)
    throughput = (iterations * batch_size) / elapsed
    print(f"Throughput: {throughput:.2f} images/sec")
    print(f"Time per iteration: {elapsed/iterations*1000:.2f} ms")

# Usage
# benchmark_multi_gpu(model, batch_size=32)
Next Steps
- Monitor GPU Health - Keep your GPUs healthy
- HPC Cluster Integration - Multi-node training
- Training Optimization - Maximize performance