Abstract
Typically, measuring the generalization ability of a neural network relies on the well-known method of cross-validation, which statistically estimates the classification error of a network architecture and thus assesses its generalization ability. However, for a number of reasons, cross-validation does not constitute an efficient and unbiased estimator of generalization and cannot be used to assess the generalization of a neural network after training. In this paper, we introduce a new method for evaluating generalization based on a deterministic approach revealing and exploiting the network’s domain of validity. This is the area of the input space containing all the points for which a class-specific network output provides values higher than a certainty threshold. The proposed approach is a set membership technique which defines the network’s domain of validity by inverting its output activity on the input space. For a trained neural network, the result of this inversion is a set of hyper-boxes which constitute a reliable and \(\varepsilon\)-accurate computation of the domain of validity. Suitably defined metrics on the volume of the domain of validity provide a deterministic estimate of the generalization ability of the trained network, unaffected by the random test set selection inherent in cross-validation. The effectiveness of the proposed generalization measures is demonstrated on illustrative examples with artificial and real datasets, using shallow feed-forward neural networks such as multi-layer perceptrons.
Abbreviations
- HPD: Highest posterior density
- INTLAB: INTerval LABoratory
- IA: Interval analysis
- MLP: Multi-layer perceptron
- OTS: Off training set
- PDF: Probability density function
- SCS: Set computations with subpavings
- SIVIA: Set inversion via interval analysis
References
Adam SP, Karras DA, Magoulas GD, Vrahatis MN (2015) Reliable estimation of a neural network’s domain of validity through interval analysis based inversion. In: 2015 international joint conference on neural networks (IJCNN), pp 1–8. https://doi.org/10.1109/IJCNN.2015.7280794
Adam SP, Likas AC, Vrahatis MN (2017) Interval analysis based neural network inversion: a means for evaluating generalization. In: Boracchi G, Iliadis L, Jayne C, Likas A (eds) Engineering applications of neural networks. Springer International Publishing, Berlin, pp 314–326
Adam SP, Magoulas GD, Karras DA, Vrahatis MN (2016) Bounding the search space for global optimization of neural networks learning error: an interval analysis approach. J Mach Learn Res 17(169):1–40. http://jmlr.org/papers/v17/14-350.html
Bishop CM (1996) Neural networks for pattern recognition. Oxford University Press, Oxford
Courrieu P (1994) Three algorithms for estimating the domain of validity of feedforward neural networks. Neural Netw 7(1):169–174
Eberhart R, Dobbins R (1991) Designing neural network explanation facilities using genetic algorithms. In: 1991 IEEE international joint conference on neural networks, vol 2, pp 1758–1763
Hampshire II JB, Pearlmutter BA (1991) Equivalence proofs for multilayer perceptron classifiers and the Bayesian discriminant function. In: Proceedings of the 1990 connectionist models summer school, vol 1, pp 159–172
Hassoun MH (1995) Fundamentals of artificial neural networks. MIT Press, Cambridge
Haykin S (1999) Neural networks a comprehensive foundation, 2nd edn. Prentice-Hall, Upper Saddle River, NJ
Hernández-Espinosa C, Fernández-Redondo M, Ortiz-Gómez M (2003) Inversion of a neural network via interval arithmetic for rule extraction. In: Kaynak O, Alpaydin E, Oja E, Xu L (eds) Artificial neural networks and neural information processing – ICANN/ICONIP 2003. Lecture notes in computer science, vol 2714. Springer, Berlin, Heidelberg, pp 670–677
Jaulin L, Kieffer M, Didrit O, Walter E (2001) Applied interval analysis with examples in parameter and state estimation, robust control and robotics. Springer, London
Jaulin L, Walter E (1993) Set inversion via interval analysis for nonlinear bounded-error estimation. Automatica 29(4):1053–1064
Jensen C, Reed R, Marks R, El-Sharkawi M, Jung JB, Miyamoto R, Anderson G, Eggen C (1999) Inversion of feedforward neural networks: algorithms and applications. Proc IEEE 87(9):1536–1549
Kamimura R (2017) Mutual information maximization for improving and interpreting multi-layered neural networks. In: 2017 IEEE symposium series on computational intelligence (SSCI), pp 1–7
Karystinos GN, Pados DA (2000) On overfitting, generalization, and randomly expanded training sets. IEEE Trans Neural Netw 11(5):1050–1057
Kearfott RB (1996) Interval computations: introduction, uses, and resources. Euromath Bull 2(1):95–112
Kiefer J, Wolfowitz J (1952) Stochastic estimation of the maximum of a regression function. Ann Math Stat 23:462–466
Kindermann J, Linden A (1990) Inversion of neural networks by gradient descent. Parallel Comput 14(3):277–286
Likas A (2001) Probability density estimation using artificial neural networks. Comput Phys Commun 135(2):167–175
Liu Y (1995) Unbiased estimate of generalization error and model selection in neural network. Neural Netw 8(2):215–219
Lu BL, Kita H, Nishikawa Y (1999) Inverting feedforward neural networks using linear and nonlinear programming. IEEE Trans Neural Netw 10(6):1271–1290
Novak R, Bahri Y, Abolafia DA, Pennington J, Sohl-Dickstein J (2018) Sensitivity and generalization in neural networks: an empirical study. In: International conference on learning representations. https://openreview.net/forum?id=HJC2SzZCW
Reed R, Marks R (1995) An evolutionary algorithm for function inversion and boundary marking. In: IEEE international conference on evolutionary computation, 1995, vol 2, pp 794–797
Richard M, Lippmann R (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput 3(4):461–483. https://doi.org/10.1162/neco.1991.3.4.461
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
Rump SM (1999) INTLAB - INTerval LABoratory. In: Csendes T (ed) Developments in reliable computing. Kluwer Academic, Dordrecht, Netherlands, pp 77–104
Saad EW, Wunsch DC II (2007) Neural network explanation using inversion. Neural Netw 20(1):78–93
Theodoridis S, Pikrakis A, Koutroumbas K, Kavouras D (2010) Introduction to pattern recognition: a MATLAB approach. Academic Press, Burlington, MA 01803, USA
Thrun SB (1993) Extracting provably correct rules from artificial neural networks. Technical Report IAI–TR–93–5, Institut fur Informatik III, Bonn, Germany
Tornil-Sin S, Puig V, Escobet T (2010) Set computations with subpavings in MATLAB: the SCS toolbox. In: 2010 IEEE international symposium on computer-aided control system design (CACSD), pp 1403–1408
Wolpert DH (1990) A mathematical theory of generalization: part I. Complex Syst 4(2):151–200
Wolpert DH (1990) A mathematical theory of generalization: part II. Complex Syst 4(2):201–249
Wolpert DH (1992) On the connection between in-sample testing and generalization error. Complex Syst 6(1):47–94
Wolpert DH (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8(7):1391–1420. https://doi.org/10.1162/neco.1996.8.7.1391
Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390. https://doi.org/10.1162/neco.1996.8.7.1341
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable suggestions and comments on an earlier version of the manuscript, which helped to significantly improve the paper.
Ethics declarations
Conflict of interest:
The authors declare that they have no conflict of interest.
Appendices
Appendix A
In order to illustrate the impact of the \(\beta\)-cut on the domain of validity, let us first consider the two-dimensional classification dataset with two classes forming nine groups shown in Fig. 6a. A \(2-10-1\) MLP, using logistic sigmoid activation functions, has been trained on this dataset, and the contour plot of its output is shown in Fig. 6b. In this figure, the white regions (output greater than \(1-\beta\)) correspond to patterns classified by the MLP network into class 1 (red points), while the black regions (output lower than \(\beta\)) correspond to patterns classified into class 2 (blue points). The gray-level zone depicts the ambiguity of classification for patterns near the class boundaries, where the MLP output takes values in the interval \([\beta ,1-\beta ]\).
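The three-way decision rule described above can be sketched as follows. This is a minimal illustration of the \(\beta\)-cut on a single sigmoid output, with a hypothetical helper name and \(\beta = 0.1\) as in the example; it is not taken from the paper's implementation.

```python
def beta_cut_decision(y, beta=0.1):
    """Three-way beta-cut decision for a single sigmoid output y in [0, 1].

    Outputs above 1 - beta are assigned to class 1 (white regions),
    outputs below beta to class 2 (black regions), and everything in
    [beta, 1 - beta] is left ambiguous (gray zone near the boundary).
    """
    if y >= 1.0 - beta:
        return 1          # confidently class 1
    if y <= beta:
        return 2          # confidently class 2
    return 0              # ambiguous gray zone

decisions = [beta_cut_decision(y) for y in (0.95, 0.03, 0.5)]
# decisions == [1, 2, 0]
```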
The impact of this \(\beta\)-cut classification decision is better depicted in Figs. 7 and 8. For each one of these Figures, the red colored area corresponds to a specific domain of validity defined for some specific interval \([1-\beta ,1]\) of the network output, for the MLP trained on the above two-dimensional problem. Each area is determined using SIVIA to invert the MLP output interval \([1-\beta ,1]\) for class 1 in the input space.
The value of \(\beta\) clearly extends or restricts the input space area classified by the MLP into class 1. This can be verified by simple observation of Fig. 7a, b, while for higher-dimensional problems it can be confirmed by comparing the volumes of the respective domains of validity. This shows the importance of choosing the right value for \(\beta\), which here needs to be 0.1 if one wants to take the right classification decision for a significant part of the input space. As shown in Fig. 8a, b, the appropriate value of \(\beta\) also depends on the number of training epochs and the error threshold chosen to train the MLP.
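The inversion step itself can be sketched with the SIVIA recursion of Jaulin and Walter. The following is a toy illustration only: it inverts the output interval \([1-\beta, 1]\) for a single-neuron "network" with hypothetical weights, for which an exact interval extension is easy because the sigmoid is monotone. The paper's actual computation uses INTLAB/SCS on a full MLP; everything below (weights, \(\varepsilon\), the target interval) is an assumption for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def output_range(box, w=(3.0, -2.0), b=0.5):
    """Exact interval extension of y = sigmoid(w . x + b) over an
    axis-aligned box (toy stand-in for interval evaluation of an MLP)."""
    lo = b + sum(wi * (bx[0] if wi >= 0 else bx[1]) for wi, bx in zip(w, box))
    hi = b + sum(wi * (bx[1] if wi >= 0 else bx[0]) for wi, bx in zip(w, box))
    return sigmoid(lo), sigmoid(hi)  # sigmoid is increasing

def sivia(box, target=(0.9, 1.0), eps=0.05, inside=None, boundary=None):
    """SIVIA: bisect until each box is proved inside the inverted set,
    proved outside, or smaller than eps (undecided boundary box)."""
    inside = [] if inside is None else inside
    boundary = [] if boundary is None else boundary
    lo, hi = output_range(box)
    if lo >= target[0] and hi <= target[1]:
        inside.append(box)                      # whole box maps into target
    elif hi < target[0] or lo > target[1]:
        pass                                    # whole box maps outside: discard
    elif max(b[1] - b[0] for b in box) < eps:
        boundary.append(box)                    # undecided, below resolution
    else:
        k = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
        mid = 0.5 * (box[k][0] + box[k][1])
        left = list(box); left[k] = (box[k][0], mid)
        right = list(box); right[k] = (mid, box[k][1])
        sivia(left, target, eps, inside, boundary)
        sivia(right, target, eps, inside, boundary)
    return inside, boundary

def volume(boxes):
    return sum(math.prod(hi - lo for lo, hi in b) for b in boxes)

# Invert the output interval [1 - beta, 1] with beta = 0.1 over [-2, 2]^2.
inner, border = sivia([(-2.0, 2.0), (-2.0, 2.0)])
print(volume(inner), volume(inner) + volume(border))
```

The two printed volumes bracket the true volume of the domain of validity from below and above, which is the \(\varepsilon\)-accurate guarantee exploited by the proposed metrics.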
Appendix B
In the extreme case of a pattern producing more than one valid output (i.e., it is assigned to more than one class), the current implementation computing the domain of validity considers this pattern misclassified. Strictly speaking, the pattern is correctly classified with respect to its proper class, while for every other class it is a misclassified pattern. A previous approach for determining the domain of validity considered such a pattern unclassified. However, in terms of the proposed metrics, both approaches compute the same result, given that unclassified and misclassified patterns have the same status for the computed metrics.
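This counting convention can be sketched as follows. The helper name, the toy per-class outputs, and the accuracy-style metric are hypothetical; the sketch only illustrates that a pattern valid for more than one class is not credited as correct, matching the "misclassified" treatment described above.

```python
def count_valid_correct(outputs, labels, beta=0.1):
    """Fraction of patterns with exactly one valid output, the correct one.

    A pattern whose outputs exceed 1 - beta for several classes is not
    credited, which is equivalent (for this metric) to calling it
    misclassified or unclassified.
    """
    correct = 0
    for out, label in zip(outputs, labels):
        valid = [c for c, y in enumerate(out) if y >= 1.0 - beta]
        if valid == [label]:   # exactly one valid output, and it is the right class
            correct += 1
    return correct / len(outputs)

acc = count_valid_correct([(0.95, 0.02), (0.93, 0.92), (0.10, 0.97)], [0, 0, 1])
# acc == 2/3: the second pattern activates both classes and is not credited
```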
In many cases a training algorithm results in either under-trained or over-trained networks. Under-training arises for many reasons (insufficient training, small-sized training data, inappropriate network architecture, etc.). In consequence, as shown in Fig. 9b, the domain of validity either covers only small regions of the input space or incorrectly classifies a large region. For instance, the validity domain of a 2–4–2 MLP, shown in Fig. 9a, exhibits a fairly regular coverage of the input space, while the 2–2–2 MLP, shown in Fig. 9b, manages to cover only a narrow strip of the input space. In general, it can be stated that the validity domain of an under-trained network is composed of a small number of large regions with regularly shaped boundaries.
Besides under-training, another issue affecting generalization is network over-training. Typically, an over-trained network fails to correctly classify unseen patterns, as it has learned the training data “exactly” and hence is not able to generalize well. In this case, the decision boundaries computed by the network delimit the regions of the input space as tightly as possible, and the network fails to interpolate even among close neighboring groups. The domain of validity then consists of smaller regions, and so its volume diminishes. An indicative example of such a validity domain is given in Fig. 9c. As a result, we may state that for a network that fits its training data well, the lower the volume of its domain of validity, the poorer the generalization achieved, due to over-training.
Another unfortunate consequence of over-training is that MLPs, especially those with a high number of nodes in the hidden layer(s), tend to fit outliers, noisy input patterns, and patterns with noisy class labels, see Fig. 10. In these cases, the network has the flexibility to form decision boundaries that discriminate the outlying or mislabeled patterns. In doing so, the network defines isolated regions, such as isles or lobes, in the input space, which delimit not only these very patterns but also important parts of the input space for which there is no information about the class or classes they belong to. In general, it can be stated that the validity domain of an over-trained network contains regions of small size with irregularly shaped boundaries.
Hence, the previous cases constitute different aspects of over-training that need to be taken into account when considering the volume of the domain of validity as a metric of the network’s generalization performance.
Cite this article
Adam, S.P., Likas, A.C. & Vrahatis, M.N. Evaluating generalization through interval-based neural network inversion. Neural Comput & Applic 31, 9241–9260 (2019). https://doi.org/10.1007/s00521-019-04129-5