Batch Size Optimization
The Batch Size Myth
Understanding Batch Size Impact
Batch size affects multiple aspects of your training:
Training Speed
- Too Small: Underutilizes the GPU, lowering throughput per epoch
- Too Large: Can exceed GPU memory, forcing a smaller model or reducing throughput
- Optimal: Balances GPU utilization with memory efficiency (see the throughput sketch below)
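To see where these trade-offs land on your hardware, time a handful of training steps at each candidate batch size. A minimal PyTorch sketch, assuming a CUDA device and that `model`, `dataloader`, `criterion`, and `optimizer` are whatever you are already training with:

```python
import time
import torch

def measure_throughput(model, dataloader, criterion, optimizer,
                       device="cuda", warmup=5, steps=20):
    """Rough training throughput (samples/second) at the dataloader's batch size."""
    model.to(device).train()
    data_iter = iter(dataloader)  # dataloader must yield at least warmup + steps batches

    def train_step():
        data, target = next(data_iter)
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        return data.size(0)

    for _ in range(warmup):        # warm-up so CUDA initialization does not skew the timing
        train_step()
    torch.cuda.synchronize()

    start, samples = time.perf_counter(), 0
    for _ in range(steps):
        samples += train_step()
    torch.cuda.synchronize()       # wait for queued GPU work before stopping the clock
    return samples / (time.perf_counter() - start)
```

Compare samples/second rather than seconds/iteration across batch sizes, since larger batches naturally take longer per step.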
Model Performance
- Larger batches can lead to worse generalization (sharp minima)
- Smaller batches provide more frequent weight updates
- Batch size affects the effective learning rate (see the scaling sketch below)
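When you change the batch size substantially, the learning rate usually needs to move with it. A common heuristic is the linear scaling rule: multiply the learning rate by the same factor as the batch size, typically together with a short warmup. A minimal sketch (the numbers are illustrative):

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: scale the learning rate by the same factor as the batch size."""
    return base_lr * (new_batch_size / base_batch_size)

# Example: a recipe tuned at lr=0.1 with batch size 256,
# moved to batch size 1024, would start from lr=0.4 (plus warmup).
print(scaled_lr(0.1, 256, 1024))  # 0.4
```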
Optimization Checklist
1. Monitor GPU Utilization

```bash
# Watch real-time GPU usage
nvidia-smi dmon -s u

# Or use nvtop for detailed metrics
nvtop
```

Keep GPU utilization high: monitor usage and adjust the batch size so it fills GPU memory without excessive overhead.
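You can also log memory headroom from inside the training script instead of a separate terminal. A small sketch using PyTorch's built-in CUDA memory counters (device 0 assumed):

```python
import torch

def log_gpu_memory(tag=""):
    """Print current and peak GPU memory allocated by PyTorch tensors on device 0."""
    allocated = torch.cuda.memory_allocated() / 1024**2   # MB currently allocated
    peak = torch.cuda.max_memory_allocated() / 1024**2    # MB peak since start / last reset
    total = torch.cuda.get_device_properties(0).total_memory / 1024**2
    print(f"{tag} allocated={allocated:.0f}MB peak={peak:.0f}MB total={total:.0f}MB")

# Call every N steps (or once per epoch) to see how close a batch size gets to the limit
log_gpu_memory("after step:")
```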
2. Check for Data Loading Bottlenecks
Check whether bottlenecks exist elsewhere in your training pipeline, such as data loading. Efficient data loading is essential to keep the GPU consistently fed with data.
Signs of a data bottleneck:
- GPU utilization drops between batches
- CPU usage is high during training
- Slow iteration times despite small batch size
Solutions:
- Increase num_workers in DataLoader
- Use faster storage (SSD vs HDD)
- Preload data to RAM if possible
- Use data prefetching (a DataLoader example follows this list)
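Several of these fixes come down to DataLoader settings. The values below are illustrative starting points, and `train_dataset` stands in for your own Dataset:

```python
from torch.utils.data import DataLoader

# Illustrative values only; tune num_workers to your CPU core count
# and prefetch_factor to how far ahead each worker should load.
train_loader = DataLoader(
    train_dataset,            # your own torch Dataset
    batch_size=64,
    shuffle=True,
    num_workers=8,            # parallel worker processes for loading/augmentation
    pin_memory=True,          # page-locked host memory speeds up CPU-to-GPU copies
    prefetch_factor=4,        # batches preloaded per worker (requires num_workers > 0)
    persistent_workers=True,  # keep workers alive between epochs
)
```

A reasonable starting point for num_workers is the number of physical CPU cores available to the job; too many workers can oversubscribe the CPU and slow loading down.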
3. Enable Mixed-Precision Training
Consider using mixed-precision training if you are not doing so already; it speeds up computation and reduces memory usage, which leaves room for larger batches.
PyTorch:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in mixed precision
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscale gradients, then take the optimizer step
    scaler.update()                       # adjust the scale factor for the next iteration
```

TensorFlow:

```python
import tensorflow as tf

# Enable mixed precision
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

# Build and compile model
model = create_model()
optimizer = tf.keras.optimizers.Adam()

# TensorFlow handles loss scaling automatically
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train normally - mixed precision is handled automatically
model.fit(train_dataset, epochs=10)
```

4. Gradient Accumulation
If you need larger effective batch sizes but hit memory limits:

PyTorch:

```python
accumulation_steps = 4

optimizer.zero_grad()
for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # Normalize so the accumulated update matches one large batch
    loss.backward()                   # Gradients accumulate until the next zero_grad()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

TensorFlow:

```python
import tensorflow as tf

accumulation_steps = 4
optimizer = tf.keras.optimizers.Adam()

# Accumulate gradients
gradient_accumulation = [tf.zeros_like(var) for var in model.trainable_variables]

for i, (data, target) in enumerate(train_dataset):
    with tf.GradientTape() as tape:
        output = model(data, training=True)
        loss = loss_fn(target, output)
        loss = loss / accumulation_steps  # Normalize

    # Accumulate gradients
    gradients = tape.gradient(loss, model.trainable_variables)
    gradient_accumulation = [acc + grad for acc, grad in zip(gradient_accumulation, gradients)]

    if (i + 1) % accumulation_steps == 0:
        # Apply accumulated gradients
        optimizer.apply_gradients(zip(gradient_accumulation, model.trainable_variables))
        # Reset accumulation
        gradient_accumulation = [tf.zeros_like(var) for var in model.trainable_variables]
```

Finding Your Optimal Batch Size
- Start with a power of 2 (e.g., 16, 32, 64)
- Increase gradually until you hit ~80-90% GPU memory usage (see the probing sketch after this list)
- Monitor training speed (samples/second)
- Test model performance with different batch sizes
- Adjust learning rate proportionally if changing batch size significantly
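One way to automate the memory-limit search is to probe doubling batch sizes with synthetic inputs until an out-of-memory error occurs. A minimal PyTorch sketch; `input_shape`, `num_classes`, `model`, and `criterion` are placeholders for your own setup:

```python
import torch

def find_max_batch_size(model, criterion, input_shape, num_classes,
                        start=16, limit=4096, device="cuda"):
    """Double the batch size until a forward/backward pass no longer fits in GPU memory."""
    model.to(device)
    batch_size, best = start, None
    while batch_size <= limit:
        try:
            data = torch.randn(batch_size, *input_shape, device=device)
            target = torch.randint(0, num_classes, (batch_size,), device=device)
            loss = criterion(model(data), target)
            loss.backward()
            model.zero_grad(set_to_none=True)
            best = batch_size        # this size fit; try a larger one
            batch_size *= 2
        except RuntimeError as e:    # CUDA out-of-memory surfaces as a RuntimeError
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                break
            raise
    return best                      # largest batch size that fit
```

Treat the result as a ceiling, not a target: leave headroom for activation spikes, validation passes, and framework overhead.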
Real-World Example
A common issue with YOLOv8 and similar models:
Read more on GitHub
Key Takeaways
- Profile before optimizing - measure actual GPU utilization
- Data loading bottlenecks are often the real problem
- Mixed-precision training can double your effective batch size
- Bigger batches ≠ better models (consider generalization)
- Always test multiple configurations for your specific use case