Efficient Transfer Learning: Fine-tuning or Top-Tuning?

Fine-tuning a deep neural network is often more art than science or engineering. What is the best learning rate? How many layers should we freeze? Should we use different learning rates for different layers? Most of the time, these questions are answered by trial and error (and some AutoML). Would it be better to take the pre-trained weights as they come, use the network as a feature extractor, and train a “traditional” model such as a Support Vector Machine or a Random Forest on top of it? In other words, is top-tuning better than fine-tuning, besides being more straightforward?

Freezing all layers: Top-tuning as a more efficient approach to transfer learning (Source: Pixabay)

In [1], the authors ran several experiments to shed some light on the differences between fine-tuning and top-tuning, focusing on image classification. By performing hundreds of training processes involving 32 target datasets and 99 different settings, they showed that top-tuning provides accuracy comparable to fine-tuning in most experiments, with a training time between one and two orders of magnitude smaller. These results suggest that top-tuning is a practical alternative to fine-tuning on small and medium-sized datasets (as is usually the case with agricultural datasets), especially when training efficiency is critical. Finally, their results show that the benefit of fine-tuning is not highly dependent on the pre-trained model architecture; on the other hand, the choice of an appropriate pre-training dataset significantly impacts accuracy, especially in the case of top-tuning. We highly recommend reading it!

In our GitHub repo, we have implemented an example of top-tuning using:

  • Feature Extraction: EfficientNetB0.
  • Top Classifier: Random Forests.

We used four different Eden datasets as a benchmark, and the resulting accuracy was high. Moreover, training took much less time than fine-tuning would have required, and hardly any hyper-parameter tuning was needed (we just used the default hyper-parameters provided by scikit-learn). Since we used timm to instantiate the pre-trained EfficientNetB0 feature extractor, the notebook can easily be extended to many other architectures.
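To make the idea concrete, here is a minimal sketch of what a top-tuning pipeline of this kind might look like. It is not the code from our notebook: the function names (`build_feature_extractor`, `extract_features`, `top_tune`), the 75/25 split, and the assumption that images arrive as an already normalized float tensor are all illustrative choices. The only library calls it relies on are `timm.create_model` with `num_classes=0` (which drops the classification head so the model returns pooled features) and scikit-learn's `RandomForestClassifier` with its default hyper-parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def build_feature_extractor(model_name="efficientnet_b0", pretrained=True):
    """Create a frozen timm backbone that outputs pooled feature vectors."""
    import timm  # imported lazily so the classifier part runs without timm/torch

    # num_classes=0 removes the classification head, so the forward pass
    # returns pooled features instead of class logits.
    model = timm.create_model(model_name, pretrained=pretrained, num_classes=0)
    model.eval()
    for param in model.parameters():
        param.requires_grad = False  # top-tuning: the backbone is never updated
    return model


def extract_features(model, images, batch_size=32):
    """Run images (assumed: normalized float tensor, N x 3 x H x W) through the backbone."""
    import torch

    chunks = []
    with torch.no_grad():
        for start in range(0, len(images), batch_size):
            batch = images[start:start + batch_size]
            chunks.append(model(batch).cpu().numpy())
    return np.concatenate(chunks, axis=0)


def top_tune(features, labels, seed=0):
    """Fit a Random Forest on precomputed features and report held-out accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    # Apart from the seed, all hyper-parameters are scikit-learn defaults.
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(x_train, y_train)
    return clf, clf.score(x_test, y_test)
```

With these helpers, a run would look like `model = build_feature_extractor()`, then `features = extract_features(model, images)`, then `clf, accuracy = top_tune(features, labels)`. Trying another backbone only means passing a different timm model name to `build_feature_extractor`; the Random Forest step stays the same because it only ever sees the feature matrix.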

In our repository, you can find a plethora of notebooks implementing different deep learning techniques. These notebooks are fully functional and can be readily modified to suit individual needs.
If you find these notebooks useful, we would greatly appreciate your support in the form of a star and by sharing them with your colleagues.

[1] Alfano, P.D., Pastore, V.P., Rosasco, L., & Odone, F. (2022). Fine-tuning or top-tuning? Transfer learning with pretrained features and fast kernel methods. arXiv:2209.07932.