Active learning in A.I. models for biomedical segmentation
Link to our work on the arXiv: https://arxiv.org/abs/2312.10361
This work has been published in the journal Computers in Biology and Medicine.
In its simplest incarnation, active learning in the language of A.I. refers to a neural network (or, more generally, an A.I. model) being trained in each iteration on a small subset of the training dataset rather than on the entire dataset, with weights updated as usual, say by a gradient-descent-based minimization of some loss function. The sampling principle constitutes the major non-trivial part of active learning, and the theory of active learning concerns the criterion on which this sampling principle should be based. Given some training dataset, active learning seeks an answer to the question: what is an optimal subset of data for the model to be trained on, such that subsequent iterations eventually enable early convergence of the model without the use of the full training dataset? How do we choose this optimal subset?
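To make this loop concrete, here is a minimal pool-based sketch in Python. The names (select_fn, train_step, random_select) and the stand-in data are illustrative assumptions for this page, not code from our paper; the random-selection baseline simply shows where a smarter sampling principle would plug in.

```python
# A minimal pool-based sketch of the loop described above; select_fn, train_step
# and the dummy data are illustrative assumptions, not code from our paper.
import numpy as np

def active_learning_loop(pool_images, pool_labels, select_fn, train_step,
                         rounds=5, batch_size=8):
    """Each round: rank the unlabelled pool, pick a small subset, update the model on it."""
    labelled = np.array([], dtype=int)
    unlabelled = np.arange(len(pool_images))
    for _ in range(rounds):
        # The sampling principle lives in select_fn (uncertainty, representativeness, ...).
        picked = unlabelled[select_fn(pool_images[unlabelled], batch_size)]
        labelled = np.concatenate([labelled, picked])
        unlabelled = np.setdiff1d(unlabelled, picked)
        # Ordinary (e.g. gradient-descent-based) update, but only on the selected subset.
        train_step(pool_images[labelled], pool_labels[labelled])
    return labelled

# Random selection is the usual baseline that smarter sampling principles are compared against.
def random_select(pool, k):
    return np.random.choice(len(pool), size=k, replace=False)

# Toy usage with stand-in data and a no-op training step.
X = np.random.rand(100, 16, 16)
y = np.random.randint(0, 2, size=100)
chosen = active_learning_loop(X, y, random_select, lambda xb, yb: None)
```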
Active learning has a natural counterpart in how our own human intelligence learns. We are constantly surrounded by an evolving, complex environment, and although we can attempt to grasp a huge variety of its details, our limited attention span and our finite ability to focus and assimilate knowledge constrain us to select a small sector of the observed external environment as our brain's training material for comprehending that environment. When we study a textbook for an exam, we tend not to devote equal attention to every single word, but pick out the more representative or more difficult concepts on which to concentrate our learning. In a similar vein, active learning articulates how an artificially intelligent system may benefit from a selective query approach when it attempts to learn from data. In the machine learning literature, active learning has quite a bit of history stretching back to the 90s (e.g. see this seminal paper by Cohn et al.) and continues to be a pertinent topic of modern research, with overlapping aspects in continual learning and reinforcement learning.
In a recent work, we applied a variety of active learning techniques to the segmentation of biomedical images, taking our datasets to be the prostate and cardiac images in the publicly accessible Medical Segmentation Decathlon. In the literature, there are two major classes of sampling principles: (i) uncertainty-based sampling and (ii) representativeness sampling. In uncertainty-based sampling, the small set of images picked in each iteration are those for which the model has the highest uncertainty, the rationale being to expose the model earlier to the cases associated with steeper learning curves. In this work, we used Shannon entropy to quantify model uncertainty. In representativeness sampling, the idea is to pick the images that are most representative of the underlying data distribution at each iteration, so that model learning is supported by a more balanced exposure to the data distribution.
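As a rough illustration of the uncertainty branch, the snippet below scores each image by the mean per-pixel Shannon entropy of its softmax output and picks the most uncertain ones. The (images, classes, height, width) layout and the mean-over-pixels aggregation are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def segmentation_entropy(prob_maps, eps=1e-12):
    """Mean per-pixel Shannon entropy of softmax probability maps.

    prob_maps: array of shape (num_images, num_classes, height, width).
    Returns one uncertainty score per image.
    """
    pixel_entropy = -np.sum(prob_maps * np.log(prob_maps + eps), axis=1)  # (N, H, W)
    return pixel_entropy.mean(axis=(1, 2))

def uncertainty_select(prob_maps, k):
    """Pick the k images with the highest mean entropy, i.e. the most uncertain ones."""
    return np.argsort(segmentation_entropy(prob_maps))[-k:]

# Toy usage: random "softmax" maps for 20 images, 3 classes, 32x32 pixels.
logits = np.random.rand(20, 3, 32, 32)
probs = logits / logits.sum(axis=1, keepdims=True)
most_uncertain = uncertainty_select(probs, k=5)
```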
A typical approach is to use the feature vectors defined within the neural network as representative objects for the images. Since these feature vectors are generically very high-dimensional, a natural simplification is to apply some dimension-reduction method to them before using them to portray the underlying data distribution. One common approach is principal component analysis, in which we keep only the linear combinations of the original feature vector components that maximally account for the variance in the data distribution. A more recent technique is Uniform Manifold Approximation and Projection (UMAP), an algebraic-topology-based method that performs dimensionality reduction while preserving the topological structure of the higher-dimensional data.
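As an example of what this reduction step might look like in practice, the sketch below uses the scikit-learn and umap-learn packages on a random stand-in for feature vectors extracted from a network layer; nothing here is tied to our actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP

features = np.random.rand(200, 512)   # stand-in for (num_images, feature_dim) network features

# PCA: keep the linear combinations of components that explain the most variance.
pca_embedding = PCA(n_components=2).fit_transform(features)

# UMAP: nonlinear reduction aiming to preserve the neighbourhood/topological structure
# of the high-dimensional point cloud.
umap_embedding = UMAP(n_components=2, random_state=42).fit_transform(features)

print(pca_embedding.shape, umap_embedding.shape)   # both (200, 2)
```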
We noted that UMAP was found in a work by Yan et al. to be remarkably effective in generating a low-dimensional representation of multi-contrast MRI images that facilitates parcellation of thalamic nuclei. In our work, we proposed a novel active learning method that we call Entropy-UMAP, in which uncertainty sampling is performed first, followed by UMAP-based representativeness sampling. We found that, when applied to segmentation of the prostate and cardiac MRI datasets of the Medical Segmentation Decathlon, this new sampling principle outperforms the random baseline by a larger margin than any of the 10 other active learning methods we examined.
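To give a flavour of how a two-stage Entropy-UMAP selection could be assembled, here is a hedged sketch: an entropy pre-filter followed by a representativeness pick in UMAP space. The k-means-based representativeness step, the size of the pre-filter pool and all parameter values are illustrative assumptions, not a faithful reproduction of the implementation in our paper.

```python
import numpy as np
from umap import UMAP
from sklearn.cluster import KMeans

def entropy_umap_select(prob_maps, features, k, pool_factor=3, eps=1e-12):
    """Entropy pre-filter, then pick representative samples in UMAP space (illustrative only)."""
    # Stage 1: uncertainty -- keep the pool_factor * k most uncertain candidates.
    pixel_entropy = -np.sum(prob_maps * np.log(prob_maps + eps), axis=1)
    scores = pixel_entropy.mean(axis=(1, 2))
    candidates = np.argsort(scores)[-pool_factor * k:]

    # Stage 2: embed the candidates' feature vectors with UMAP.
    embedding = UMAP(n_components=2, random_state=0).fit_transform(features[candidates])

    # Stage 3: representativeness -- take the candidate closest to each k-means centre
    # (an illustrative stand-in for a representativeness criterion).
    centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding).cluster_centers_
    picked = [candidates[np.argmin(np.linalg.norm(embedding - c, axis=1))] for c in centres]
    return np.unique(picked)
```

The snippet is only meant to convey the ordering of the two stages (uncertainty first, then UMAP-based representativeness); the details of each stage in our experiments may differ.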
We hope that our work will inspire more in-depth exploration of UMAP in active learning strategies, especially in biomedical segmentation!