
In everyday experience, the human brain accompanies each predictive thought with a sense of uncertainty: we can characterize our inferences with some psychological measure of confidence. A conventional artificial neural network, on the other hand, typically yields outputs without any quantification of their associated uncertainties. A neural network that can express its own uncertainty enables a statistically more robust use of its predictive capability; for instance, it can allow us to recognize input data that does not fall within the training data distribution. In the following, we list a few historically seminal works that laid the foundation for uncertainty quantification in machine learning.

(i) ‘Probabilistic Reasoning in Intelligent Systems: networks of plausible inference’ by J. Pearl (1988).

This is an influential text that laid a number of crucial foundational ideas for uncertainty quantification in artificial intelligence.

(ii) ‘Bagging Predictors’ by L. Breiman (1996).

This seminal paper introduced the concept of bagging (bootstrap aggregation), which remains a popular technique in ensemble-based methods. The main idea is to train a number of models on different subsets of the training dataset, and to use this collection of models for prediction and for quantifying uncertainties.

(iii) ‘Gaussian Processes for Machine Learning’ by Rasmussen and Williams (2006).

This book furnishes a comprehensive introduction to Gaussian processes, a methodology for uncertainty quantification of model predictions. Put simply, a Gaussian process is a distribution over functions (instead of vectors).

(iv) ‘Bayesian Learning for Neural Networks’ by R. Neal (1996).

Neal’s thesis furnishes a solid exposition and formulation of Bayesian statistics-based neural networks.

(v) ‘Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning’ by Y. Gal and Z. Ghahramani (2016).

This paper by Gal and Ghahramani was crucial in popularizing the Monte-Carlo Dropout method in uncertainty quantification.

(vi) ‘Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles’ by Lakshminarayanan, Pritzel and Blundell (2016).

This paper by Lakshminarayanan et al. presented a theoretical basis and practical implementation for Deep Ensembles-based uncertainty quantification. In many empirical comparisons, this method has been found to outperform other available methods of uncertainty quantification.

While there are many proposals for uncertainty quantification frameworks in A.I., a few broad classes of frameworks have been found to apply well across many different contexts.

(I) **Bayesian Neural Networks**: these are essentially probabilistic neural networks which furnish distributions for the model’s predictions using Bayes’ theorem. Below, we briefly sketch the main idea and working principle.

Let \(W\) denote the set of weights of the neural network and \( \mathcal{D} \) the training dataset. According to Bayes’ theorem, the conditional probability

$$
P\left( W | \mathcal{D} \right) = \frac{P\left( \mathcal{D} | W \right) P(W) }{P\left(\mathcal{D}\right)},
$$

where \( P\left( \mathcal{D} | W \right) \) is the likelihood of the training data given weights \(W\), \( P(W) \) is the prior distribution for the weights (e.g. some normal distribution), and \( P\left(\mathcal{D}\right) \) is the marginal distribution that can be obtained by the integral

$$
P\left(\mathcal{D} \right) = \int dw'\, P \left( \mathcal{D} | w'\right) P(w').
$$

This integral cannot be computed analytically in general, and must be approximated as part of the model’s algorithm. Once we know the posterior distribution \( P \left( W | \mathcal{D} \right) \), we can then compute the model’s (probabilistic) output \( \hat{y}(x) \),

$$
P\left( \hat{y}(x) | \mathcal{D} \right) = \int_W P \left( \hat{y}(x) | W \right) P\left( W | \mathcal{D} \right) dW,
$$

where we have denoted the input by \( x \), and \( P \left( \hat{y}(x) | W \right) \) denotes the model’s probabilistic output. For example, for a Bayesian neural network with two output heads giving the mean \( \mu \) and variance \( \sigma^2 \) of a Gaussian distribution, \( P \left( \hat{y}(x) | W \right) \sim \mathcal{N}\left( \mu, \sigma^2 \right) \). Two general classes of algorithms for approximating the posterior \( P \left( W | \mathcal{D} \right) \) are

- sampling methods: the idea is to generate a finite set of weight values whose empirical distribution matches that of the posterior (e.g. Markov chain Monte Carlo);
- variational methods: the idea is to approximate the posterior with a tractable function \( P_{approx} (W|\mathcal{D}) \), optimized using information-theoretic arguments (e.g. minimizing a Kullback–Leibler divergence).
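Before tackling a full neural network, it is instructive to see the Bayes-theorem workflow above in the one setting where the posterior \( P(W|\mathcal{D}) \) is available in closed form: Bayesian linear regression with a Gaussian prior and Gaussian likelihood. The sketch below is illustrative only; the toy dataset and the hyperparameter values (`alpha`, `beta`) are our own hypothetical choices.

```python
import numpy as np

# Toy 1-D dataset: y = 2x + noise (hypothetical data for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.3, size=50)

# Design matrix with a bias column: phi(x) = [1, x]
Phi = np.hstack([np.ones((50, 1)), X])

# Gaussian prior P(W) = N(0, alpha^-1 I); likelihood noise variance beta^-1
alpha, beta = 1.0, 1.0 / 0.3**2

# Closed-form posterior P(W|D) = N(m, S) for this conjugate Gaussian model
S_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)
m = beta * S @ Phi.T @ y

# Posterior predictive at a new input x*:
#   mean  = m^T phi(x*),  variance = beta^-1 + phi(x*)^T S phi(x*)
phi_star = np.array([1.0, 0.5])
pred_mean = m @ phi_star
pred_var = 1.0 / beta + phi_star @ S @ phi_star

print(f"posterior mean weights: {m}")
print(f"prediction at x*=0.5: {pred_mean:.2f} +/- {np.sqrt(pred_var):.2f}")
```

The predictive variance splits into an irreducible noise term \( \beta^{-1} \) and a term from the posterior spread over weights; for a genuine neural network this integral is intractable, which is exactly why the sampling and variational methods above are needed.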

(II) **Monte-Carlo Dropout**: This is an uncertainty quantification method that relies on placing stochastic dropout units in the neural network during training, and then generating a set of different predictions by randomly switching some of them off during model inference after training. Monte-Carlo Dropout originally arose in the context of Bayesian neural networks as a means of approximating

$$
P\left( \hat{y}(x) | \mathcal{D} \right) = \int_W P \left( \hat{y}(x) | W \right) P\left( W | \mathcal{D} \right) dW \approx \frac{1}{N} \sum^N_{t = 1} P\left( \hat{y}(x) | \tilde{W}_t \right),
$$

where the sampled weights \( \tilde{W}_t, t = 1, 2, \ldots, N \) are drawn from an approximate posterior distribution, i.e.

$$
\tilde{W}_t \sim P_{approx} \left( \tilde{W} | \mathcal{D} \right).
$$

The integral is thus approximated by a finite sum, each term of which is obtained by a forward inference pass of the neural network. Each pass involves randomly switching off neurons in the network, as first explained in the work of Gal and Ghahramani (2016). Although this method was first inspired by Bayesian neural networks, **in modern implementations, we can use any neural network with stochastic dropout units placed within its architecture. The mean and standard deviation associated with the uncertainties are then obtained from multiple forward passes** (typically 30–100 forward passes, as suggested in the work of Gal and Ghahramani).
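The inference-time procedure can be sketched in a few lines of numpy. The two-layer network below uses random placeholder weights purely for illustration (in practice they would come from training with dropout); the dropout rate and the number of passes are hypothetical choices within the range suggested above.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny fixed two-layer network; weights are random placeholders
# standing in for a network already trained with dropout.
W1 = rng.normal(size=(1, 64))
W2 = rng.normal(size=(64, 1))

def forward_with_dropout(x, p=0.2):
    """One stochastic forward pass: dropout stays ON at inference time."""
    h = np.tanh(x @ W1)
    mask = rng.random(h.shape) > p   # randomly switch units off
    h = h * mask / (1.0 - p)         # inverted-dropout rescaling
    return (h @ W2).item()

# Monte-Carlo Dropout: T stochastic forward passes on the same input
x = np.array([[0.7]])
T = 50
samples = np.array([forward_with_dropout(x) for _ in range(T)])

mc_mean = samples.mean()       # predictive mean
mc_std = samples.std(ddof=1)   # predictive uncertainty
print(f"prediction: {mc_mean:.3f} +/- {mc_std:.3f}")
```

Each call to `forward_with_dropout` plays the role of one weight sample \( \tilde{W}_t \); the empirical mean and standard deviation over the `T` passes give the prediction and its uncertainty.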

(III) **Deep Ensembles:** This class of uncertainty quantification methods is relatively simple to implement, the main idea being to use an assembly of models to estimate the mean and standard deviation directly. Typically, what is meant by the ‘Deep Ensemble’ approach is to **use models identical in structure but differing only in the random initial values of their weights.** There are many other variants of selection principles for the ensemble of models, including training each model on a different subset of the entire dataset (bagging/bootstrap aggregation), random forest-style approaches, etc. In a typical implementation of the Deep Ensemble method with different initial weights, we take the number of models to be at least five. The main burden of this method lies in the computational cost of training numerous models for the same task; the exception is the Monte-Carlo Dropout method, which can be considered a special type of Deep Ensemble methodology. As mentioned earlier, each model in the MC Dropout ensemble is realised simply by switching on/off distinct subsets of dropout units in the neural network instead of being trained from scratch.
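A minimal sketch of the different-random-initialization recipe, using small numpy MLPs trained by plain gradient descent: the toy dataset, network size, learning rate, and step count are all our own hypothetical choices, and only the initialization seed differs between members.

```python
import numpy as np

# Toy regression data: y = sin(3x) + noise (illustrative only)
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))

def train_member(seed, steps=2000, lr=0.05, hidden=16):
    """Train one small MLP; only the weight initialization differs per member."""
    r = np.random.default_rng(seed)
    W1 = r.normal(0, 0.5, size=(1, hidden)); b1 = np.zeros(hidden)
    W2 = r.normal(0, 0.5, size=(hidden, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        pred = h @ W2 + b2
        err = pred - y                      # MSE gradient, up to a constant
        gW2 = h.T @ err / len(X); gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h**2)      # backprop through tanh
        gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def predict(params, x):
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

# Deep Ensemble: >= 5 members, same data, different random initializations
ensemble = [train_member(seed) for seed in range(5)]
x_test = np.array([[0.3]])
preds = np.array([predict(p, x_test).item() for p in ensemble])

print(f"ensemble mean: {preds.mean():.3f}, std: {preds.std(ddof=1):.3f}")
```

The spread of the member predictions plays the role of the predictive uncertainty; where the members disagree (typically away from the training data), the reported standard deviation grows.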

The seminal work by Lakshminarayanan et al. establishes the foundation for Deep Ensembles, which have often been found to be superior in performance, in both regression and classification tasks, compared to other uncertainty quantification frameworks. One theoretical reason behind their effectiveness was examined in a subsequent paper by Lakshminarayanan and collaborators, where the authors found that Deep Ensemble methods tend to reach more diverse points in the *loss landscape* of the neural network – the space of weights and biases that the model navigates during training.

(IV) **Deep Evidential Learning**:

This is a relatively new proposal for uncertainty quantification in which we assert the existence of another multivariate probability distribution on top of the one in the conventional probabilistic neural network. This ‘second-order’ probability distribution can be interpreted as the prior distribution for the parameters of the original probabilistic neural network. For example, if the original model yields the mean and standard deviation of a Gaussian regression output, one specifies a prior distribution for the mean and standard deviation parameters, and lets the parameters of this prior distribution be the model’s outputs. Such a framework was first proposed in a seminal paper by Sensoy et al. and later adapted to regression-type problems by Amini et al.

The term ‘evidential’ refers to the second-order prior distribution, whose parameters are now the direct outputs of the neural network. In the classification context of Sensoy et al., the authors alluded to the Dempster–Shafer theory of evidence to define a generic notion of ‘uncertainty’. In the regression context proposed by Amini et al., no reference to Dempster–Shafer theory was made; the notion of a prior distribution was sufficient to define uncertainties. This framework naturally enables separate computation of aleatoric and epistemic uncertainties, defined as

$$
U_{alea} = \mathbb{E}_{\mathcal{P}} [\text{Var}_{model}], \qquad U_{epis} = \text{Var}_{\mathcal{P}} [\mathbb{E}_{model}],
$$

where the evidential prior distribution is denoted by \( \mathcal{P} \), and the mean and variance enclosed in brackets refer to the parameters of the first-order probability distribution. Aleatoric uncertainties are those that arise from the inherent randomness or noisiness of the data, whereas epistemic uncertainties are those related to the model’s suitability and generalizability, and to how well aligned the testing input is with the training data distribution.
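In the regression formulation of Amini et al., the evidential prior \( \mathcal{P} \) is a Normal-Inverse-Gamma distribution with parameters \( (\gamma, \nu, \alpha, \beta) \) output by the network, and the two expectations above reduce to closed-form expressions. The sketch below evaluates them for a few hypothetical parameter values (the numbers themselves are invented for illustration).

```python
import numpy as np

def nig_uncertainties(gamma, nu, alpha, beta):
    """Closed-form moments of the Normal-Inverse-Gamma evidential prior
    (regression formulation of Amini et al.):
      prediction          E[mu]                = gamma
      aleatoric  U_alea = E_P[Var_model]       = beta / (alpha - 1)
      epistemic  U_epis = Var_P[E_model]       = beta / (nu * (alpha - 1))
    Requires alpha > 1 for these expectations to exist."""
    assert np.all(alpha > 1), "alpha must exceed 1"
    pred = gamma
    u_alea = beta / (alpha - 1)
    u_epis = beta / (nu * (alpha - 1))
    return pred, u_alea, u_epis

# Hypothetical evidential outputs for three pixels/voxels
gamma = np.array([1.2, 0.4, 2.0])
nu    = np.array([5.0, 0.5, 10.0])
alpha = np.array([3.0, 1.5, 4.0])
beta  = np.array([0.8, 0.8, 0.3])

pred, u_alea, u_epis = nig_uncertainties(gamma, nu, alpha, beta)
print("aleatoric:", u_alea)   # [0.4  1.6  0.1]
print("epistemic:", u_epis)   # [0.08 3.2  0.01]
```

Note that the epistemic term carries an extra factor of \( 1/\nu \): as the ‘virtual evidence count’ \( \nu \) grows, epistemic uncertainty shrinks while aleatoric uncertainty need not, which is exactly the separation the framework is designed to provide.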

Deep Evidential Learning is the theme of a few of our recent and current research projects at the Gryphon Center. The picture below shows an example application of Deep Evidential Learning adapted for the task of radiotherapy dose prediction. In this context, the input data is a set of CT images delineating various organs and radiation target areas for a patient with head-and-neck cancer undergoing photon radiotherapy. The leftmost diagram shows the model’s prediction error heatmap whereas the center and rightmost diagrams are the aleatoric and epistemic uncertainty heatmaps respectively.

These diagrams were generated in a project we recently completed on using Deep Evidential Learning for radiotherapy dose prediction (see our arXiv preprint). Uncertainty heatmaps equip the user to interpret the model’s predictions with an estimate of their reliability, and can be very useful in discovering potential regions of ignorance in the model’s predictive landscape. Uncertainty quantification is a crucial element of machine learning research, enabling one to harness the prowess of A.I. with statistical robustness and enhanced interpretability.
