Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

This research examines how momentum and mini-batch gradient noise constrain the sharpness of the solutions SGD reaches when training runs near its instability boundary. Use that understanding to choose momentum and batch size with training stability and generalization in mind, not just speed.

Intermediate · 15 min · 5 steps

The play

Understand SGD Dynamics
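Recognize that Stochastic Gradient Descent (SGD) tends to operate near an instability boundary (the "edge of stochastic stability" in the title), where a somewhat larger step size or sharper curvature would make training diverge. Where that boundary sits influences how sharp a solution training settles on and, in turn, how stable the model is and how well it generalizes.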
Vary Momentum Parameter
Experiment with different momentum values (e.g., 0.9, 0.95, 0.99) in your optimizer configuration. Observe how these changes impact training stability, convergence speed, and the final model's performance on validation data.
Adjust Mini-Batch Size
Test various mini-batch sizes (e.g., 32, 64, 128, 256) during training. Batch size does not change the loss landscape itself; it changes how noisy each gradient estimate is. Analyze how that noise level affects training dynamics and the model's ability to generalize to unseen data.
Monitor Stability Proxies
Measuring sharpness directly (commonly taken as the largest eigenvalue of the loss Hessian) is expensive, so monitor proxy metrics such as training loss variance, gradient norms, or validation loss stability over epochs. These can indicate how close training is to the instability boundary.
Tune for Robustness
Apply these insights during hyperparameter tuning. Prioritize combinations of momentum and batch size that yield more stable training, reduced variance, and improved generalization, rather than solely focusing on the fastest convergence.
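
The monitoring step above can be sketched in plain Python. This is a minimal illustration, not a method from the paper: `StabilityMonitor`, its `window` size, and its `spike_factor` threshold are hypothetical names and values chosen for the example. It tracks the rolling variance of the training loss and counts loss spikes, two of the proxies mentioned in step 4.

```python
from collections import deque
from statistics import mean, pvariance


class StabilityMonitor:
    """Rolling statistics over per-step training loss (hypothetical helper)."""

    def __init__(self, window=50, spike_factor=2.0):
        self.window = window
        self.spike_factor = spike_factor  # assumed threshold; tune per task
        self.losses = deque(maxlen=window)
        self.spikes = 0

    def update(self, loss):
        """Record one loss value; return True if it counts as a spike."""
        loss = float(loss)
        # Compare against the mean of the window before adding the new value,
        # and only once the window is full.
        is_spike = (
            len(self.losses) == self.window
            and loss > self.spike_factor * mean(self.losses)
        )
        self.losses.append(loss)
        self.spikes += int(is_spike)
        return is_spike

    def variance(self):
        """Population variance of the losses currently in the window."""
        return pvariance(self.losses) if len(self.losses) > 1 else 0.0


# Example with made-up loss values
monitor = StabilityMonitor(window=5)
for step_loss in [1.0, 0.9, 0.8, 0.8, 0.7, 2.5, 0.7]:
    if monitor.update(step_loss):
        print(f"loss spike at value {step_loss}")
print(f"rolling loss variance: {monitor.variance():.4f}")
```

In a Keras training loop, `update` could be called from a callback's `on_train_batch_end` with `logs["loss"]`. Note that Keras may report a running average over the epoch there instead of the raw per-batch loss, which would smooth out spikes.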

Starter code

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# 1. Define a simple model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])

# 2. Configure SGD optimizer with momentum
# Experiment with different momentum values (e.g., 0.9, 0.95, 0.99)
sgd_optimizer = optimizers.SGD(learning_rate=0.01, momentum=0.9)

# 3. Compile the model
model.compile(optimizer=sgd_optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# 4. Load a sample dataset (e.g., MNIST)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

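# 5. Train the model with a specified mini-batch size
# Experiment with different batch sizes (e.g., 32, 64, 128, 256)
batch_size = 64
# get_config() returns plain Python values; optimizer.momentum is not a
# tensor in every Keras version, so calling .numpy() on it can fail
momentum = sgd_optimizer.get_config()["momentum"]
print(f"Training model with momentum={momentum} and batch_size={batch_size}")
history = model.fit(x_train, y_train, epochs=1, batch_size=batch_size,
                    validation_data=(x_test, y_test), verbose=0)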

print(f"Validation Accuracy: {history.history['val_accuracy'][-1]:.4f}")
print("To continue experimenting, change momentum and batch_size in steps 2 and 5.")
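
Changing the two values by hand gets tedious once you want to compare every combination. A small sweep helper is one way to organize that, sketched here under assumptions: `sweep`, `best_config`, and the `train_and_eval` callable are hypothetical names, and `train_and_eval` is something you would write yourself by wrapping steps 1 to 5 of the starter code in a function that builds a fresh model and returns a single score.

```python
from itertools import product


def sweep(train_and_eval,
          momenta=(0.9, 0.95, 0.99),
          batch_sizes=(32, 64, 128, 256)):
    """Call train_and_eval for every (momentum, batch_size) pair.

    train_and_eval is a user-supplied (hypothetical) callable that takes
    momentum and batch_size, trains a fresh model, and returns a score
    where higher is better.
    """
    results = {}
    for momentum, batch_size in product(momenta, batch_sizes):
        results[(momentum, batch_size)] = train_and_eval(momentum, batch_size)
    return results


def best_config(results):
    """Return the (momentum, batch_size) key with the highest score."""
    return max(results, key=results.get)
```

The score does not have to be validation accuracy. To follow step 5 of the play, `train_and_eval` could return a value that also penalizes unstable runs, for example validation accuracy minus a weight times the variance of the training loss.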

Source

Paper: arxiv.org