Introduction to knowledge distillation

Imagine that you have trained several models to find the most suitable one for your application, and the one that obtained the best performance is also the largest and, additionally, the slowest at inference time. This problem is rather common: very large models fit the training objectives but often fail to meet latency, performance, and throughput benchmarks at inference time under real-world settings. So, is there any way to reduce the size and the inference time without compromising performance too much? Let’s talk about knowledge distillation.

Learning by mimicking is a normal behavior in nature. (Source: Pixabay)

Deploying large deep neural network models is a challenge, especially on edge devices with limited memory and computational power. To overcome this challenge, knowledge distillation arose as a form of model compression. The method transfers knowledge from a large model to a smaller model that can be practically deployed under real-world constraints. It was first successfully demonstrated by Buciluǎ et al. (2006) [1] and then popularized by Hinton et al. (2015) [2]. The aim is to carry out this transfer without a significant loss in performance. For current state-of-the-art neural networks, knowledge distillation offers a way to compress models with millions of weights and tens of processing layers.

In knowledge distillation, a small “student” model learns to mimic a large “teacher” model to obtain similar performance. (Source: arXiv)

Since several types of knowledge can be transferred, there are also various distillation pipelines. For instance, response-based knowledge focuses on the final output layer of the teacher model. The hypothesis is that the student will learn to mimic the predictions of the teacher. As you can see in this notebook and this post, the key ingredients for implementing a response-based knowledge distillation pipeline with TensorFlow are:

  • An offline-trained teacher model: The complex teacher network is first trained offline using the complete dataset.
  • A student model to train: It is necessary to create a connection between the student and the teacher. This connection can be established between the intermediate outputs or the final predictions. For instance, a distillation loss function, combined with a temperature, can be applied to the difference between the softened student predictions and the softened teacher labels. As this loss is minimized during training, the student becomes better at making the same predictions as the teacher.
  • A student loss function on the difference between the student predictions and the ground truth.
  • An alpha factor to weight the relative contributions of the student and distillation losses.
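
To make these ingredients concrete, here is a minimal, framework-independent sketch in plain Python rather than TensorFlow. It is not the code from the notebook or post mentioned above. The function names, the example logits, and the default values (alpha of 0.1, temperature of 3.0) are illustrative assumptions. The sketch also assumes KL divergence as the distillation loss, the weighting alpha * student loss + (1 - alpha) * distillation loss, and the temperature-squared scaling suggested by Hinton et al. [2]; other pipelines may make different choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Student loss: negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_objective(student_logits, teacher_logits, label,
                           alpha=0.1, temperature=3.0):
    """Combine the student loss and the distillation loss for one example.

    alpha and temperature are hypothetical defaults chosen for illustration.
    """
    # Student loss: ordinary (temperature 1) predictions vs. the ground truth.
    student_loss = cross_entropy(softmax(student_logits), label)

    # Distillation loss: softened student predictions vs. softened teacher labels.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2

    # The alpha factor weighs the two losses against each other.
    return alpha * student_loss + (1 - alpha) * distill_loss


# Made-up logits for a three-class problem whose true class is 0.
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [2.5, 1.0, 0.2]
loss = distillation_objective(student_logits, teacher_logits, label=0)
print(f"combined loss: {loss:.4f}")
```

In a real pipeline the teacher logits would come from the frozen, offline-trained teacher, and the combined loss would be averaged over a batch and minimized with respect to the student weights only.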

On the Eden Library, you can find notebooks implementing different deep learning techniques, such as knowledge distillation. They are all working examples ready for your modifications. Check them out by clicking on the link below. If you find this repo interesting, please help us with a star and share it with your colleagues.

[1] Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[2] Hinton, G. E., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv, abs/1503.02531.