Likelihood and information criteria such as the Akaike Information Criterion (AIC) are commonly used in statistical modeling to compare models and select the one that best fits the data. However, computing the likelihood and AIC for artificial neural network (ANN) models is not as straightforward as it is for traditional statistical models.
In ANNs, the model parameters are optimized to minimize a cost function, such as mean squared error or cross-entropy loss, using an optimization algorithm such as gradient descent. Unlike traditional statistical models, where the likelihood follows directly from an assumed probability density function, an ANN does not define a likelihood explicitly: one must first give the network's output a probabilistic interpretation.
However, several techniques can be used to estimate the likelihood and information criteria for ANN models:
Maximum Likelihood Estimation (MLE)
MLE is a commonly used technique in statistics that estimates the parameters of a probability distribution by maximizing the likelihood of the observed data. In ANNs, MLE can be applied by assuming that the network's output parameterizes a known probability distribution, such as a Gaussian (for regression) or a Bernoulli (for binary classification).
To use MLE, one computes the log-likelihood of the observed data given the model parameters by evaluating the assumed probability density (or mass) function of the network's output at the observed data points. In fact, minimizing mean squared error is equivalent to maximizing a Gaussian log-likelihood, and minimizing cross-entropy is equivalent to maximizing a Bernoulli (or categorical) log-likelihood; evaluating the density only becomes computationally expensive when the assumed output distribution itself is complex.
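As a minimal sketch, assume the network's outputs are the means of a Gaussian with a shared variance, and plug the variance's maximum-likelihood estimate (the mean squared residual) back into the density; y_true and y_pred below are hypothetical placeholders for the observed targets and the network's predictions:

```python
import numpy as np

def gaussian_log_likelihood(y_true, y_pred):
    """Log-likelihood of the targets under y ~ N(y_pred, sigma^2),
    with sigma^2 set to its MLE, the mean squared residual."""
    n = len(y_true)
    sigma2 = np.mean((y_true - y_pred) ** 2)  # MLE of the shared variance
    # At the variance MLE, log L = -n/2 * (log(2*pi*sigma2) + 1)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
```

Because sigma2 here is just the mean squared error, maximizing this log-likelihood is equivalent to minimizing MSE, which is why MSE training can be read as Gaussian maximum likelihood.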
Information criteria
Information criteria such as AIC and the Bayesian Information Criterion (BIC) are commonly used in statistics to compare models based on their goodness of fit and complexity. In ANNs, they can be used to compare different network architectures or to select the best model from a candidate set.
To compute an information criterion for an ANN model, one needs the likelihood of the model given the observed data and the number of parameters in the model. One practical way to estimate the likelihood is with a hold-out (or cross-validation) scheme: split the data into training and validation sets, fit the model on the training portion, and evaluate the log-likelihood on the validation set.
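For a binary classifier whose sigmoid output is read as a Bernoulli probability, the held-out log-likelihood is simply the negative of the summed binary cross-entropy. A sketch, where y_val and p_val are hypothetical validation labels and predicted probabilities:

```python
import numpy as np

def bernoulli_log_likelihood(y_val, p_val, eps=1e-12):
    """Log-likelihood of binary labels under y ~ Bernoulli(p_val).
    Probabilities are clipped to avoid log(0) on saturated outputs."""
    p = np.clip(p_val, eps, 1.0 - eps)
    return float(np.sum(y_val * np.log(p) + (1 - y_val) * np.log(1 - p)))
```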
Once the log-likelihood is available, the AIC or BIC is obtained by adding a penalty term that grows with the number of parameters: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n is the number of observations, and L is the maximized likelihood. The penalty discourages overfitting and favors a simpler model that is more likely to generalize well to new data.
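Putting the pieces together, the criteria themselves are one-liners. The sketch below assumes k is taken to be the count of trainable weights and biases, which is a common but debatable choice for ANNs, since regularization reduces the effective number of parameters:

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: BIC = k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * log_likelihood
```

Lower values are better: when comparing candidate architectures, compute each model's log-likelihood on the same data and prefer the model with the smallest AIC or BIC.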
In summary, computing the likelihood and information criteria for ANN models is not as straightforward as it is for traditional statistical models. However, by attaching an explicit output distribution to the network, maximum likelihood estimation can supply a log-likelihood, and information criteria such as AIC and BIC can then be used to compare models.