Imagine that you have trained several models to find the most suitable one for your application and the one that obtained the best performance is also the largest. And, additionally, the slowest one at inference time. This problem is rather common. Very large models that fit the training objectives, but often fail to meet latency, […]