Active Learning in Machine Learning: A brief intro

Any team that wants to bring AI methods into its pipeline to tackle more complex tasks will sooner or later need plenty of relevant data.
However, what constitutes plenty? Are 100 samples enough? 1000? 10000? Where does that stop? The lower limit should be obvious from the model’s behavior during training, but what about an upper limit? 

An initial upper limit could be set from a technical perspective, given storage and cost concerns: every one of these samples requires annotation, which is often the most labor-intensive step in an AI project. Here, a few questions arise:
How can we avoid high annotation costs?
Can we automate this laborious process?
Is all of this data equally valuable?

Active Learning (AL) is a methodology built on such questions and on the assumption that not all data points are equally useful for training a given model. What if we let the model decide which data samples it considers useful for its training? This would increase automation, speed, and reproducibility while reducing labor costs.

AL is an iterative process that starts with training a given model on a very small human-annotated subset of the dataset. Then, this undertrained model is used to run inference on the rest of the dataset’s samples. However, the metric of interest is not the predictions themselves, but their confidence: the whole premise of AL is that predictions of higher confidence refer to features the model has already learned to deal with, whereas predictions of lower confidence highlight features the model does not yet know how to handle.
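To make the idea of prediction confidence concrete, here is a minimal sketch of two common ways to score it from a model's class probabilities. The text above does not say which measure is used, so both the least-confidence and margin scores, as well as the function names and example values, are assumptions chosen for illustration.

```python
from typing import List, Sequence


def least_confidence(probabilities: Sequence[float]) -> float:
    """Uncertainty as 1 minus the probability of the top class."""
    return 1.0 - max(probabilities)


def margin(probabilities: Sequence[float]) -> float:
    """Gap between the two most probable classes (small gap = uncertain)."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return top - second


# Hypothetical softmax outputs for three unlabeled samples.
predictions: List[List[float]] = [
    [0.98, 0.01, 0.01],  # the model is confident
    [0.40, 0.35, 0.25],  # the model hesitates between two classes
    [0.34, 0.33, 0.33],  # the model has almost no preference
]

uncertainty_scores = [least_confidence(p) for p in predictions]
margin_scores = [margin(p) for p in predictions]
```

A high least-confidence score or a small margin both point at samples the model is unsure about; which measure works better depends on the task.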
From the inference results, we select a batch of data samples with low prediction confidence and forward it to human annotators. The newly annotated batch is added to the previously annotated subset, and the combined set is used to train a new model. The loop of training, inferring, and annotating is repeated until the model shows no further improvement in performance, or until the annotation costs exceed a given budget. The output of this process is a smaller dataset that maximizes the model’s training capacity under the set technical limitations.
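The loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not Eden Library's actual pipeline: the toy one-dimensional threshold model, the `annotate` oracle, and parameters such as `seed_size`, `batch_size`, `budget`, and `min_gain` are all hypothetical stand-ins for a real model, real annotators, and real limits.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], Sequence[float]]  # sample -> class probabilities
Labeled = List[Tuple[float, int]]


def active_learning_loop(
    pool: List[float],
    train: Callable[[Labeled], Model],
    annotate: Callable[[float], int],
    evaluate: Callable[[Model], float],
    seed_size: int = 4,
    batch_size: int = 4,
    budget: int = 40,
    min_gain: float = 1e-3,
) -> Tuple[Model, Labeled]:
    unlabeled = list(pool)
    random.Random(0).shuffle(unlabeled)
    # Start from a very small human-annotated subset.
    labeled = [(x, annotate(x)) for x in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]
    model = train(labeled)
    best_score = evaluate(model)

    # Stop when the pool is empty or the annotation budget would be exceeded.
    while unlabeled and len(labeled) + batch_size <= budget:
        # Rank the remaining samples by confidence, lowest first.
        unlabeled.sort(key=lambda x: max(model(x)))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled += [(x, annotate(x)) for x in batch]
        model = train(labeled)
        score = evaluate(model)
        if score - best_score < min_gain:  # no further improvement
            break
        best_score = score
    return model, labeled


# --- Toy stand-ins (assumptions for illustration only) ---
def annotate(x: float) -> int:
    """Plays the human annotator: the true class is 1 when x >= 0.5."""
    return int(x >= 0.5)


def train(labeled: Labeled) -> Model:
    """Fits a threshold halfway between the two classes seen so far."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    threshold = ((max(zeros) if zeros else 0.0) + (min(ones) if ones else 1.0)) / 2

    def model(x: float) -> Sequence[float]:
        p1 = 1.0 / (1.0 + math.exp(-20.0 * (x - threshold)))
        return [1.0 - p1, p1]

    return model


def evaluate(model: Model) -> float:
    """Accuracy on a fixed grid of held-out points."""
    grid = [i / 100 for i in range(101)]
    hits = sum(int(model(x)[1] >= 0.5) == annotate(x) for x in grid)
    return hits / len(grid)


pool = [i / 199 for i in range(200)]
model, labeled = active_learning_loop(pool, train, annotate, evaluate)
print(f"annotated {len(labeled)} of {len(pool)} samples, "
      f"accuracy {evaluate(model):.2f}")
```

In this sketch the two stopping conditions from the text appear as the `budget` check in the loop condition and the `min_gain` check after each retraining. In practice the improvement check would usually be measured on a separate validation set.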

Eden Library has fully adopted this methodology in its AI pipelines. Given the quantity of data we receive from the fields and process on a weekly basis, we needed a method to select the data samples that are beneficial to the models we are developing. This reduces storage costs and, more significantly, annotation costs, streamlining our whole process.