Hyperparameters

Hyperparameters are variables that control different aspects of training. Three common hyperparameters are:

  • Learning rate
  • Batch size
  • Epochs

Learning rate

  • Learning rate is a floating point number you set that influences how quickly the model converges.
  • If the learning rate is too low, the model can take a long time to converge.
  • However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimize the loss.
  • The goal is to pick a learning rate that’s neither too high nor too low so that the model converges quickly.
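
To make the three cases concrete, here is a minimal sketch of gradient descent on a toy loss, L(w) = w**2, whose minimum is at w = 0. The loss, the starting weight, the step count, and the three learning rates are all assumptions chosen for illustration; they are not from the notes above.

```python
def gradient_descent(learning_rate, steps=50, w_start=5.0):
    """Minimize the toy loss L(w) = w**2 and return the final weight."""
    w = w_start
    for _ in range(steps):
        gradient = 2.0 * w                 # dL/dw for L(w) = w**2
        w = w - learning_rate * gradient   # standard gradient descent update
    return w

# Hypothetical learning rates: too low, reasonable, too high.
results = {lr: gradient_descent(lr) for lr in (0.001, 0.3, 1.0)}
for lr, w in results.items():
    print(f"learning rate {lr}: final w = {w:.4f}, loss = {w ** 2:.4f}")
```

With the lowest rate, the weight has barely moved after 50 steps. With the middle rate, it lands very close to the minimum. With the highest rate, every step overshoots the minimum by the same amount, so the weight keeps flipping sign and never settles.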

Batch size

  • Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias.
  • You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias.
  • However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn’t practical.

Stochastic gradient descent (SGD):

  • Stochastic gradient descent uses only a single example (a batch size of one) per iteration.
  • Given enough iterations, SGD works but is very noisy.
  • “Noise” refers to variations during training that cause the loss to increase rather than decrease during an iteration.
  • The term “stochastic” indicates that the one example comprising each batch is chosen at random.
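
The noise described above can be observed directly. The following sketch runs SGD with a batch size of one on a small synthetic linear dataset and counts the iterations in which the loss over the whole dataset went up. The dataset, the linear model, the learning rate, and the step count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption): labels follow y = 3x + 2 plus a little noise.
features = rng.uniform(-1.0, 1.0, size=200)
labels = 3.0 * features + 2.0 + rng.normal(0.0, 0.1, size=200)

def full_loss(weight, bias):
    """Mean squared error over the whole dataset."""
    return float(np.mean((weight * features + bias - labels) ** 2))

weight, bias, learning_rate = 0.0, 0.0, 0.1
losses = [full_loss(weight, bias)]
for _ in range(500):
    i = rng.integers(len(features))   # one random example: a batch size of one
    error = weight * features[i] + bias - labels[i]
    weight -= learning_rate * 2.0 * error * features[i]
    bias -= learning_rate * 2.0 * error
    losses.append(full_loss(weight, bias))

# Count iterations where the dataset-wide loss increased instead of decreasing.
noisy_steps = sum(later > earlier for earlier, later in zip(losses, losses[1:]))
print(f"loss went up in {noisy_steps} of {len(losses) - 1} iterations")
print(f"first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}")
```

The loss still trends downward overall, but individual iterations can raise it because each update is based on a single example.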

Mini-batch stochastic gradient descent (mini-batch SGD):

  • Mini-batch stochastic gradient descent is a compromise between full-batch and SGD.
  • For a dataset of N data points, the batch size can be any number greater than 1 and less than N.
  • The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.
  • Determining the number of examples for each batch depends on the dataset and the available compute resources.
  • In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent.
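
The procedure above can be sketched with NumPy for a one-feature linear model. The function name `train_linear_model`, the synthetic data, and the choice to shuffle the dataset once per epoch and slice it into batches are assumptions for illustration; shuffling and slicing is one common way to pick each batch at random, not the only one.

```python
import numpy as np

def train_linear_model(features, labels, batch_size, epochs, learning_rate, seed=0):
    """Fit y = weight * x + bias with mini-batch SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    n = len(features)
    weight, bias, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        order = rng.permutation(n)            # shuffle so batches are random
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x, y = features[idx], labels[idx]
            error = weight * x + bias - y
            # Average the per-example gradients over the batch.
            grad_w = 2.0 * np.mean(error * x)
            grad_b = 2.0 * np.mean(error)
            # Update the weight and bias once per batch (one iteration).
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
            updates += 1
    return weight, bias, updates

# Synthetic data (an assumption): labels follow y = 3x + 2 plus a little noise.
data_rng = np.random.default_rng(42)
features = data_rng.uniform(-1.0, 1.0, size=1000)
labels = 3.0 * features + 2.0 + data_rng.normal(0.0, 0.1, size=1000)

weight, bias, updates = train_linear_model(
    features, labels, batch_size=100, epochs=20, learning_rate=0.1
)
print(f"weight = {weight:.2f}, bias = {bias:.2f}, updates = {updates}")
```

In this sketch, setting `batch_size=1` gives SGD and setting `batch_size=len(features)` gives full-batch gradient descent, so the same loop covers all three cases.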

Epochs

  • During training, an epoch means that the model has processed every example in the training set once.
  • For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.
  • Training typically requires many epochs. That is, the system needs to process every example in the training set multiple times.
  • The number of epochs is a hyperparameter you set before the model begins training.
  • In many cases, you’ll need to experiment with how many epochs it takes for the model to converge.
  • In general, more epochs produce a better model, but also take more time to train.
  • How often the model updates the weights and bias during an epoch depends on the batch type, as described in the following sections.

Full batch

  • The weights and bias update after the model looks at all the examples in the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20 times, once per epoch.

Stochastic gradient descent

  • The weights and bias update after the model looks at a single example from the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20,000 times.

Mini-batch stochastic gradient descent

  • The weights and bias update after the model looks at the examples in each batch. For instance, if a dataset contains 1,000 examples, the batch size is 100, and the model trains for 20 epochs, the model updates the weights and bias 200 times.
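
The three update counts above follow from the same arithmetic: iterations per epoch times the number of epochs. The helper below is a small sketch of that calculation. The name `count_updates` is hypothetical, and rounding up for a final partial batch is an assumption that does not affect these examples because 1,000 divides evenly.

```python
import math

def count_updates(num_examples, batch_size, epochs):
    """Return how many times the weights and bias are updated during training."""
    # Assumption: a final, smaller batch still triggers an update, so round up.
    iterations_per_epoch = math.ceil(num_examples / batch_size)
    return iterations_per_epoch * epochs

full_batch = count_updates(1000, batch_size=1000, epochs=20)
sgd = count_updates(1000, batch_size=1, epochs=20)
mini_batch = count_updates(1000, batch_size=100, epochs=20)
print(full_batch, sgd, mini_batch)  # 20 20000 200
```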