Learning Rate Optimization

The learning rate is arguably the most important hyperparameter in deep learning:

  • Too high: the model diverges, produces NaN losses, or oscillates wildly
  • Too low: training is painfully slow, or the model gets stuck in poor local minima
  • Just right: the model converges quickly to a good solution

The most reliable way to find a good learning rate is a learning rate range test: sweep the LR exponentially from a very small to a very large value while recording the loss at each iteration.

import torch
import matplotlib.pyplot as plt

def find_lr(model, train_loader, optimizer, criterion,
            start_lr=1e-7, end_lr=10, num_iter=100):
    """
    Perform learning rate range test
    """
    lrs = []
    losses = []

    # Multiplicative factor so the LR grows exponentially from start_lr to end_lr
    lr_mult = (end_lr / start_lr) ** (1 / num_iter)
    lr = start_lr
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    best_loss = float('inf')
    batch_iter = iter(train_loader)
    device = next(model.parameters()).device

    for i in range(num_iter):
        try:
            data, target = next(batch_iter)
        except StopIteration:
            batch_iter = iter(train_loader)
            data, target = next(batch_iter)

        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)

        # Stop if loss explodes
        if loss.item() > 4 * best_loss or torch.isnan(loss):
            break

        if loss.item() < best_loss:
            best_loss = loss.item()

        lrs.append(lr)
        losses.append(loss.item())

        loss.backward()
        optimizer.step()

        # Increase the learning rate for the next iteration
        lr *= lr_mult
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

    # Plot results
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Range Test')
    plt.grid(True)
    plt.savefig('lr_range_test.png')

    return lrs, losses

# Usage
# lrs, losses = find_lr(model, train_loader, optimizer, criterion)

How to interpret the resulting plot:

  1. Look for the steepest downward slope in the loss curve
  2. Pick a learning rate from the middle of that slope
  3. A good pick is usually about 10x smaller than the LR at which the loss starts to increase (the sketch below automates this)
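
As a rough illustration, you can also pick a candidate automatically from the lrs and losses returned above by smoothing the curve and taking the point of steepest descent. The helper below (suggest_lr is a hypothetical name, not part of any library) is a minimal sketch assuming numpy is available:

import numpy as np

def suggest_lr(lrs, losses, skip_start=5, skip_end=5):
    """Hypothetical helper: pick the LR at the steepest downward slope."""
    lrs = np.asarray(lrs, dtype=float)[skip_start:-skip_end]
    losses = np.asarray(losses, dtype=float)[skip_start:-skip_end]
    # Light smoothing to reduce batch-to-batch noise
    smoothed = np.convolve(losses, np.ones(5) / 5, mode='same')
    # Slope of the loss with respect to log10(lr); most negative = steepest descent
    slopes = np.gradient(smoothed, np.log10(lrs))
    return float(lrs[np.argmin(slopes)])

# lrs, losses = find_lr(model, train_loader, optimizer, criterion)
# print(f"Suggested LR: {suggest_lr(lrs, losses):.2e}")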

Cosine annealing smoothly decreases the learning rate along a cosine curve:

from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train(model, train_loader, optimizer)
    scheduler.step()

Pros:

  • Smooth decay prevents sudden performance drops
  • Works well for most architectures
  • Can help escape local minima when combined with warm restarts (sketched below)
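
The warm-restart variant lives in the same module (CosineAnnealingWarmRestarts); a minimal sketch with illustrative T_0/T_mult values:

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,      # Epochs until the first restart
    T_mult=2,    # Each subsequent cycle is twice as long
    eta_min=1e-6
)

for epoch in range(100):
    train(model, train_loader, optimizer)  # Same placeholder train() as above
    scheduler.step()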

OneCycleLR increases and then decreases the learning rate over a single cycle spanning the whole run:

from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=100,
    steps_per_epoch=len(train_loader)
)

for epoch in range(100):
    for batch in train_loader:
        train_step(batch)
        scheduler.step()  # Call after each batch!

Best for:

  • Fast convergence (often beats other schedules)
  • Limited training time/budget
  • When you know the total number of training iterations upfront (see the sketch below)
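
If you track iterations rather than epochs, OneCycleLR can also take the step count directly; a sketch where total_steps is an assumed, already-known value and pct_start controls the fraction of steps spent in the increasing phase:

scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=total_steps,  # Alternative to epochs + steps_per_epoch
    pct_start=0.3             # 30% of steps ramp up, the rest anneal down
)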

ReduceLROnPlateau decreases the LR when a monitored metric stops improving:

from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.5,      # Reduce by half
    patience=5,      # Wait 5 epochs
    min_lr=1e-6
)

for epoch in range(100):
    train(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    scheduler.step(val_loss)  # Pass validation loss

Best for:

  • Unknown optimal training length
  • When validation loss is your primary metric
  • Conservative training approaches
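
Reductions happen silently, so it helps to log the current LR alongside the validation loss. A small sketch reusing the loop above, reading the LR back from the optimizer's param_groups:

for epoch in range(100):
    train(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    scheduler.step(val_loss)
    current_lr = optimizer.param_groups[0]['lr']  # Reflects any reduction just applied
    print(f"epoch {epoch}: lr={current_lr:.2e}, val_loss={val_loss:.4f}")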

AdamW is a safe default optimizer; a reasonable starting configuration:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,           # Good default
    weight_decay=0.01, # Regularization
    betas=(0.9, 0.999)
)

Typical LR ranges:

  • Transformers: 1e-4 to 5e-4
  • CNNs: 1e-3 to 3e-3
  • Small models: 1e-3 to 1e-2

For SGD with momentum, a typical starting configuration:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,            # Usually 10-100x higher than Adam
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True      # Often helps
)

Typical LR ranges:

  • ResNets: 0.1 (with decay)
  • Small CNNs: 0.01 to 0.1

Warmup gradually increases the learning rate at the start of training before handing off to the main schedule:

import math

def get_lr_with_warmup(current_step, warmup_steps, max_lr, total_steps):
    """Calculate learning rate with warmup and cosine decay"""
    if current_step < warmup_steps:
        # Linear warmup
        return max_lr * current_step / warmup_steps
    else:
        # Cosine decay
        progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
        return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Usage in training loop
total_steps, warmup_steps = 100000, 1000
for step in range(total_steps):
    lr = get_lr_with_warmup(step, warmup_steps=warmup_steps,
                            max_lr=1e-3, total_steps=total_steps)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # ... run one training step (forward, backward, optimizer.step()) ...

Why warmup helps:

  • Prevents early training instability
  • Allows batch normalization statistics to stabilize
  • Essential for large batch training
  • Recommended warmup: 1-5% of total training steps
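
The same warmup-plus-cosine schedule can also be wrapped in PyTorch's LambdaLR instead of assigning lr by hand; a minimal sketch, assuming the optimizer's base lr is the desired maximum:

import math
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps = 1000, 100000

def warmup_cosine(step):
    # Returns a multiplier on the optimizer's base lr (here, the max lr)
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr here acts as max_lr
scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

for step in range(total_steps):
    # ... forward, backward, optimizer.step() ...
    scheduler.step()  # Call once per optimization step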

If the loss explodes or becomes NaN, check for:

  1. Learning rate too high → reduce it by 10x
  2. Gradient explosion → add gradient clipping (see below)
  3. Bad initialization → use a proper init scheme (Xavier/He)

# Add gradient clipping between loss.backward() and optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
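
For context, here is where the clipping call sits inside a single training step (a minimal sketch using the same model/optimizer/criterion placeholder names as above):

optimizer.zero_grad()
loss = criterion(model(data), target)
loss.backward()
# Clip after computing gradients, before applying them
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()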

If training converges too slowly, try:

  1. Increasing the learning rate (use the LR range test)
  2. Switching from SGD to AdamW
  3. Adding learning rate warmup
  4. Using the OneCycleLR scheduler

If the model overfits (training loss keeps falling while validation loss rises), try:

  1. Reducing the learning rate
  2. Adding or increasing weight decay
  3. Using a learning rate decay schedule
  4. Adding dropout or other regularization

A common AdamW + cosine annealing recipe to start from:

from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.05,
    betas=(0.9, 0.999)
)

scheduler = CosineAnnealingLR(
    optimizer,
    T_max=epochs,    # Total number of training epochs
    eta_min=1e-6
)

# Combine with a few warmup epochs at the start (see the warmup section above)
warmup_epochs = 5
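
One way to wire the warmup in, assuming a recent PyTorch where LinearLR and SequentialLR are available (a sketch that replaces the plain cosine scheduler above and reuses the epochs placeholder):

from torch.optim.lr_scheduler import LinearLR, SequentialLR

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_epochs])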

A classic SGD + step decay recipe (ResNet-style training):

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4
)

# Step decay: multiply the LR by 0.1 every 30 epochs (0.1 → 0.01 → 0.001 → ...)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,
    gamma=0.1
)
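
If you prefer explicit milestones instead of a fixed interval, MultiStepLR is a drop-in alternative (a sketch with illustrative milestones):

scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],  # Epochs at which to decay
    gamma=0.1
)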

When fine-tuning a pretrained model, use different learning rates for different layer groups:

optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # New layers
], weight_decay=0.01)

Key takeaways:

  • Always run a learning rate range test for new models/datasets
  • Start with proven configurations for your architecture type
  • Use warmup for the first 1-5% of training
  • OneCycleLR often converges fastest
  • AdamW is a safe default optimizer
  • Monitor training curves; they tell you quickly when the LR is wrong
  • When fine-tuning, use learning rates 10-100x lower than when training from scratch