In a previous post, we discussed how AutoML helps you avoid the overwhelming process of tuning all the hyper-parameters that shape a neural network's behavior. However, even with AutoML you can choose which hyper-parameters are the most relevant to explore. So the question is: which hyper-parameter should we explore first? Our answer is the learning rate.
The learning rate is a critical hyper-parameter because deep neural networks are typically trained with the stochastic gradient descent optimization algorithm (or a variant of it). It controls how far the weights move towards the optimum each time they are updated after computing the loss and its gradients. Selecting the right learning rate can make an important difference: a learning rate that is too small can get stuck or reach the optimum only after too many epochs, while one that is too large can overshoot the optimum and make the training process very unstable.
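The effect described above can be seen in a minimal sketch of the gradient-descent update. Here we minimize the toy function f(w) = w², whose gradient is 2w, so the whole example is an illustration rather than a real training loop:

```python
def descend(lr, steps=20, w=1.0):
    """Run `steps` gradient-descent updates on f(w) = w**2 (gradient 2*w)."""
    for _ in range(steps):
        w = w - lr * 2 * w  # the standard update: w <- w - lr * gradient
    return w

print(descend(0.01))  # too small: after 20 steps w is still far from the optimum at 0
print(descend(0.1))   # reasonable: w gets close to 0
print(descend(1.1))   # too large: each step overshoots 0 and |w| grows, so training diverges
```

The same intuition carries over to real networks, where the loss surface is far more complex but the update rule is essentially the same.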
Other questions that arise are whether the same learning rate should be used at different moments of training, and whether different learning rates should be used in different parts of the architecture when doing transfer learning. The answer to both questions is that varying the learning rate across moments and layers can boost performance. Techniques such as the Cyclical Learning Rate, or using a smaller learning rate in the bottom (pretrained) layers, have proven to work quite well.
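The per-layer idea can be sketched in plain Python. The layer names below are hypothetical, and this stands in for what deep-learning frameworks expose as per-parameter-group learning rates; the point is simply that pretrained bottom layers get a smaller step than the freshly initialized head:

```python
def build_param_groups(base_lr=1e-3):
    """Assign a smaller learning rate to the (hypothetical) bottom layers."""
    return [
        {"layers": ["conv1", "conv2"], "lr": base_lr / 10},  # pretrained layers: move gently
        {"layers": ["head"], "lr": base_lr},                 # new head: move at full speed
    ]

def sgd_step(params, grads, groups):
    """Apply one SGD update, using each group's own learning rate."""
    for group in groups:
        for name in group["layers"]:
            params[name] = params[name] - group["lr"] * grads[name]
    return params

params = {"conv1": 1.0, "conv2": 1.0, "head": 1.0}
grads = {"conv1": 1.0, "conv2": 1.0, "head": 1.0}
params = sgd_step(params, grads, build_param_groups())
print(params)  # the head moves ten times further than conv1/conv2
```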
How does the learning rate change its value while using the Cyclical Learning Rate technique?
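In the standard triangular variant, the learning rate climbs linearly from a lower bound to an upper bound over a fixed number of iterations, then falls back, and the cycle repeats. A minimal sketch of that schedule (the bounds and step size here are illustrative values):

```python
import math

def triangular_clr(step, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate: oscillates between base_lr and
    max_lr, completing one full cycle every 2 * step_size iterations."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)  # position within the cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

print(triangular_clr(0))     # start of the cycle: base_lr
print(triangular_clr(2000))  # peak of the cycle: max_lr
print(triangular_clr(4000))  # back down to base_lr, and the cycle restarts
```

The periodic increases let training escape shallow regions of the loss surface, while the decreases let it settle into better minima.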
In some of our notebooks, we cover in depth different techniques for efficiently adapting the learning rate across training steps and layers. They are all working examples, ready for you to modify.
If you find this repository interesting, please help us with a star and share it with your colleagues.