Abstract
Convolutional neural networks (CNNs) have become a useful tool for a wide range of applications such as text classification. However, CNNs are not always sufficiently accurate to be useful in certain applications. The selection of activation functions within CNN architecture can affect the efficacy of the CNN. However, there is limited research regarding which activation functions are best for CNN text classification. This study tested sixteen activation functions across three text classification datasets and six CNN structures, to determine the effects of activation function on accuracy, iterations to convergence, and Positive Confidence Difference (PCD). PCD is a novel metric introduced to compare how activation functions affected a network’s classification confidence. Tables were presented to compare the performance of the activation functions across the different CNN architectures and datasets. Top performing activation functions across the different tests included the symmetrical multi-state activation function, sigmoid, penalised hyperbolic tangent, and generalised swish. An activation function’s PCD was the most consistent evaluation metric during activation function assessment, implying a close relationship between activation functions and network confidence that has yet to be explored.
1 Introduction
The nature of the internet has led to a growing abundance of textual data. Researchers have used neural network approaches to process large datasets that would be difficult to analyse manually [1,2,3]. Neural networks use a structured arrangement of functions to analyse certain regions of data and enable recognition or prediction tasks. Convolutional Neural Networks (CNNs) are a subset of neural networks that can be used for text classification tasks. While Recurrent Neural Networks, Capsule Neural Networks and Transformers are all popular models in text classification, CNNs remain useful for detecting patterns and features such as key phrases [1]. CNNs have been successfully applied to several problems involving computer vision [4,5,6], speech recognition [7,8,9], or natural language processing [10,11,12]. However, some CNNs have yielded incorrect classification with a high confidence [13,14,15], and can be prone to overfitting. To overcome these issues, a variety of settings within a typical CNN architecture have been proposed [16,17,18]. These settings include the number of hidden layers, the number, size and stride of filters, the activation functions, and loss functions.
It is not always possible to determine the best CNN approach for a specific application a-priori. In particular, activation functions have been shown to have a noticeable effect on the accuracy of a neural network [19,20,21,22]. Many novel activation functions have been created to provide improved results for specific applications [23,24,25]. However, an activation function that suits one application is not guaranteed to suit another [26]. Even theoretically similar activation functions can perform differently in practice [22]. There are several papers comparing the performance of different activation functions on image, audio and other non-text-based datasets [27,28,29]. However, findings in these papers may not necessarily apply to text classification.
There are several theories as to why certain features of different activation functions may be advantageous. In particular, [30] theorise that bounded functions have greater numerical stability than unbounded functions. Furthermore, [29] claim that symmetric activation functions lead to improved suppression of adversarial samples, reducing high confidence false predictions. In the context of rectifier-based activations [31] purport that non-monotonic functions are particularly strong, as they introduce negative activation values and non-zero derivative values corresponding to negative inputs.
Assessment of the efficacy of activation functions on text classification has been undertaken, with varied results [22, 32, 33]. However, there has been no thorough exploration of the effect of activation functions across both text classification datasets and CNN structures. It is difficult to predict the efficacy of various activation functions based on their performance in a proximal context [34]. Therefore, there are no generally accepted criteria for selecting optimal activation functions across text classification tasks. In this paper, sixteen activation functions are assessed across three different text classification datasets and six different CNN structures to determine the best activation functions.
2 Background
CNN architectures for text classification tasks can include convolutional layers, activation function layers, max pooling layers, batch normalisation layers, and fully connected layers. Convolutional layers are generally used to extract features in the raw data, and thus, often appear early in the network architecture. Activation function layers apply non-linear transformations to outputs from convolutional and fully connected layers. Max pooling layers effectively condense the output of convolutional layers and can reduce computational burden. Batch normalisation layers can be added to re-distribute the output values of a layer. This normalisation can help prevent overfitting but may negatively affect a network’s performance [35]. Fully connected layers add information from the previous layers using simple linear functions. The output layer of a text classification CNN is commonly a fully connected layer, with softmax activation functions (Eq. 1). Softmax activations allow for an output of dependent probabilities, corresponding to the likelihood that each class matches the input.
$$\text{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j=1}^{K} e^{x_{j}}}\quad(1)$$

where \(x\) is a vector of the previous layer's outputs, \(K\) is the vector length, \(j\) is the index of a vector element, and \(i\) is the index of the corresponding softmax output element.
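For illustration, a minimal sketch of such an architecture in PyTorch is given below. The layer sizes, filter count, and class number are placeholder assumptions rather than the configurations used in this study.

```python
import torch
import torch.nn as nn

class ShallowTextCNN(nn.Module):
    """Illustrative text classification CNN: a convolutional layer over word
    embeddings, an activation layer, max pooling, and a softmax output layer."""

    def __init__(self, embed_dim=50, num_filters=32, kernel_size=3,
                 num_classes=4, activation=nn.ReLU()):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.act = activation                    # hidden activation function under study
        self.pool = nn.AdaptiveMaxPool1d(1)      # condenses each feature map to one value
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, x):                        # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                    # Conv1d expects (batch, channels, seq_len)
        x = self.pool(self.act(self.conv(x))).squeeze(-1)
        return torch.softmax(self.fc(x), dim=1)  # dependent class probabilities (Eq. 1)
```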
The stochastic gradient descent parameter identification method is classically applied to train CNNs. Stochastic gradient descent iterates from an initial network parametrisation using the loss function (\(L\)) derivative (\(\nabla L\)), the current trainable parameter set (\(w_{old}\)), and a predefined learning rate (\(\eta\)) to obtain the new trainable parameter set (\(w_{new}\)) (Eq. 2).

$$w_{new} = w_{old} - \eta \nabla L(w_{old})\quad(2)$$

The loss function calculates how close the network outputs (\(\hat{y}\)) are to the desired outputs (\(y\)). Cross entropy loss or its multi-class counterpart (Eq. 3, within which \(N\) is the number of outputs) are common loss functions for CNNs with softmax outputs.

$$L(y,\hat{y}) = -\sum_{i=1}^{N} y_{i}\log(\hat{y}_{i})\quad(3)$$

Network training is considered successful if the convergence criterion is met prior to the maximum number of training iterations.
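As a concrete reference, a NumPy sketch of Eqs. 1–3 for a single example is shown below; a batched implementation follows the same pattern.

```python
import numpy as np

def softmax(x):
    """Eq. 1: dependent output probabilities from a vector of layer outputs."""
    e = np.exp(x - x.max())              # subtract the max for numerical stability
    return e / e.sum()

def multiclass_cross_entropy(y, y_hat):
    """Eq. 3: loss between desired outputs y and network outputs y_hat."""
    return -np.sum(y * np.log(y_hat + 1e-12))

def sgd_step(w_old, grad_L, eta):
    """Eq. 2: stochastic gradient descent update of the trainable parameters."""
    return w_old - eta * grad_L
```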
Since words must be converted to numbers for use as CNN inputs, word embeddings are a common pre-processing step in text classification problems. Word embeddings are vectors of numbers that correspond to words. They can be pre-trained on large volumes of text to maintain aspects of a word's context. CNNs for text classification are typically shallower than CNNs for other applications [35], and by utilising certain word embeddings, acceptable results can be achieved with a single convolutional layer [36].
3 Methods
This research considered how activation functions perform in CNN text classification tasks with different architectures and datasets. Sixteen diverse activation functions were tested (Table 1). This is not an exhaustive list of potential functions. However, the list represents the specific or approximate form of most popular types. Several rectifier-based, piecewise, continuous, bounded, unbounded, symmetric, monotonic and/or non-monotonic functions were included. An attempt was made to include a 1-D radial basis function (RBF), but the deeper networks using the 1-D RBF would not converge for the datasets tested. Thus, the function was abandoned.
Trainable activation functions such as 'maxout' were excluded from analysis. Trainable activation functions have their own parameters that must be trained alongside a network's usual parameters. While trainable activation functions often perform well [37], they can be mathematically equivalent to a network with a different structure and non-trainable activation functions [34]. A network with trainable activation functions is also more difficult to train [34]. When effort was taken to determine whether marginal performance gains from maxout functions were due to the increased number of trainable parameters or to the activation function itself, accuracy was found to correlate with memory usage [33]. Thus, conducting a direct but equitable comparison with trainable activation functions is difficult.
The activation functions were grouped into five categories, based on the function features. The rectifier-based functions are all unbounded. Rectifier-based functions can be continuous and piecewise (Fig. 1a) or non-monotonic (Fig. 1b). Of the non-rectifier-based functions, some were tanh-based (Fig. 2a), some were sigmoid-based (Fig. 2b), and some were neither (Fig. 2c). All non-rectifier-based functions were monotonic, all but G-Swish were bounded, and all but HardTanh and PTanh were continuous.
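For reference, a sketch of a few of the tested functions in their commonly published forms is given below; the exact parameter values (for example the PTanh penalty factor) are assumptions rather than values taken from Table 1.

```python
import numpy as np

def relu(x):                       # rectifier-based, piecewise, unbounded, monotonic
    return np.maximum(0.0, x)

def swish(x, beta=1.0):            # rectifier-based, non-monotonic, unbounded
    return x / (1.0 + np.exp(-beta * x))

def sigmoid(x):                    # sigmoid-based, bounded, monotonic
    return 1.0 / (1.0 + np.exp(-x))

def penalised_tanh(x, a=0.25):     # tanh-based, piecewise; a = 0.25 is a common choice
    return np.where(x > 0.0, np.tanh(x), a * np.tanh(x))
```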
As CNNs can have variable performance across different datasets [51, 52], three common text classification datasets were chosen to diversify the testing. The TREC-6 [53], IMDb [54], and AG News [55] datasets were used as they include a range of text lengths, data sizes and class numbers. The TREC-6 dataset is a 6-class question classification dataset where questions are labelled as relating to either an abbreviation, description and abstract concept, entity, human being, location, or numeric value. The IMDb dataset consists of 50,000 movie reviews that have been classed as positive or negative reviews. The AG News dataset consists of titles and descriptions of news articles that can be categorised as world, sports, business, or science-and-technology. These datasets are often used for text classification method testing and are already split into allotted testing and training datasets.
Text was split with a plain English tokenizer into a list of words for each instance of text data. Datasets had a varying number of words per text instance, so a maximum number of words was chosen to decrease computational load. When the number of words exceeded the maximum, the excess words were ignored. For some datasets, only a portion of the allotted training data was used, and the remainder was added to the testing data. Table 2 summarises the dataset information. Words were converted to word embeddings using pre-trained 50-dimensional GloVe embeddings, trained on 6 billion words from Wikipedia and Gigaword [56].
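A sketch of this pre-processing step is shown below, assuming the GloVe vectors have been parsed into a Python dictionary (e.g. from the publicly available glove.6B.50d.txt file); the whitespace tokenizer and the maximum word count of 100 are simplifying assumptions rather than the exact settings of Table 2.

```python
import numpy as np

def embed_text(text, glove, max_words=100, dim=50):
    """Tokenise a text instance, truncate to max_words, and map each word to its
    GloVe vector; out-of-vocabulary words fall back to a zero vector."""
    tokens = text.lower().split()[:max_words]             # excess words are ignored
    vectors = [glove.get(tok, np.zeros(dim)) for tok in tokens]
    return np.stack(vectors) if vectors else np.zeros((1, dim))
```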
The datasets were classified using six different network structures. Shallow networks (Table 3) and deeper networks (Table 4) were explored. Small, medium and large CNNs for each depth were designed with approximately 5,000, 25,000, and 125,000 trainable parameters, respectively. For consistency, softmax activations (Eq. 1) were used in the output layers. The hidden activation layers exclusively contained the activation function being tested. To equitably test activation functions known to be subject to the vanishing gradient problem (such as tanh and sigmoid), a batch normalization layer was included in the deeper network structure across all activation function tests.
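As an illustration of how a batch normalisation layer could sit alongside the activation layer under test in the deeper structures, a hypothetical convolutional block is sketched below; the exact layer ordering and sizes in Table 4 may differ.

```python
import torch.nn as nn

def deep_conv_block(in_channels, out_channels, activation, kernel_size=3):
    """One assumed block of a deeper structure: convolution, batch normalisation,
    then the activation function being tested, followed by max pooling."""
    return nn.Sequential(
        nn.Conv1d(in_channels, out_channels, kernel_size),
        nn.BatchNorm1d(out_channels),
        activation,
        nn.MaxPool1d(kernel_size=2),
    )
```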
To train a particular network, training data was split into batches of 100. Batches of data were padded with zeros to be of equal length. Training was undertaken with stochastic gradient descent (Eq. 2) and the learning rates noted in Table 5. Multi-class cross entropy loss was applied (Eq. 3). Convergence was reached when the average loss changed by less than a given tolerance across five iterations of all training data. The tolerance was 0.01 for all tests. While it is common to split a dataset to include a validation set to test for overfitting and stop training before convergence, early stopping was not utilised in this study. The number of iterations to convergence was a metric of interest, and early stopping would invalidate those results. Furthermore, some activation functions may have been more prone to overfitting, and using a validation set would have obscured that phenomenon.
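A sketch of this training procedure is given below, assuming a PyTorch model that outputs softmax probabilities (as in the earlier sketch) and a DataLoader yielding padded batches; the maximum epoch count is a placeholder, and the exact convergence bookkeeping used in the study may differ.

```python
import torch
import torch.nn as nn

def train_to_convergence(model, loader, lr, tol=0.01, window=5, max_epochs=500):
    """Stochastic gradient descent training that stops when the average loss
    changes by less than tol across window passes over the training data."""
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()                      # cross entropy, given softmax outputs
    history = []
    for epoch in range(max_epochs):
        epoch_loss, n_batches = 0.0, 0
        for x, y in loader:                     # batches of 100, zero-padded upstream
            optimiser.zero_grad()
            loss = loss_fn(torch.log(model(x) + 1e-12), y)
            loss.backward()
            optimiser.step()
            epoch_loss += loss.item()
            n_batches += 1
        history.append(epoch_loss / n_batches)
        if len(history) > window and abs(history[-1] - history[-1 - window]) < tol:
            return epoch + 1                    # iterations to convergence
    return max_epochs
```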
The class with the highest associated softmax probability was chosen as the network output. Because of random elements in neural networks, such as weight initialisation, five successfully converged network training runs were recorded for each activation function of interest. In cases where a trial failed, it was repeated. A trial was classified as failed when the average loss became infinite, or when the network converged but did not appear to have fit the training data well. For example, if a converged network yielded unusually low accuracy on the training data (< 70%), it was counted as a failed trial. The number of failed trials required to reach five successful trials was recorded. If no successful convergences were achieved after 15 attempts, no more attempts were made for that test. The testing procedure is shown in Fig. 2.
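This trial protocol could be sketched as below, where `train_fn` is a hypothetical callable that performs one full training run and returns its iteration count, final average loss, and training accuracy.

```python
import math

def run_trials(train_fn, needed=5, max_attempts=15, min_train_acc=0.70):
    """Repeat training until five successful convergences are recorded; a trial
    fails if the loss diverges or the converged network fits the training data
    poorly (training accuracy below 70%)."""
    successes, failures = [], 0
    for _ in range(max_attempts):
        iterations, final_loss, train_acc = train_fn()
        if not math.isfinite(final_loss) or train_acc < min_train_acc:
            failures += 1
            continue
        successes.append(iterations)
        if len(successes) == needed:
            break
    return successes, failures
```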
Since high confidence misclassification is a key CNN problem, a metric is needed to compare the activation functions' effect on the level of confidence assigned to misclassifications. Such a metric would need to summarise the network's tendency for high confidence misclassification with brevity, for ease of comparison, and suit a range of datasets. In the literature, Free-Response Receiver Operating Characteristic (FROC) curves have been used for similar purposes [57,58,59]. Contour plots of confidence scores [13] and total variation distance [60] have also been used. However, these existing metrics appeared unsuitable for comparing multiple candidate activation functions. While the prediction probability from a softmax layer does not directly correspond to confidence in isolation, the prediction probability tends to be lower for misclassified data than for correctly classified data [61]. Therefore, a larger difference between the probabilities of correctly classified and misclassified samples could indicate a network more adept at recognising misclassification. In light of this contrast in probability, a novel metric, the Positive Confidence Difference (PCD), was defined. PCD is calculated as the difference between the mean probability of true positive predictions (\(P_{TP}\)) and false positive predictions (\(P_{FP}\)) (Eq. 4).

$$PCD = P_{TP} - P_{FP}\quad(4)$$
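A sketch of the PCD calculation is given below, assuming an array of softmax outputs and the corresponding true labels. A higher PCD indicates a larger separation between the confidence of true positive and false positive predictions.

```python
import numpy as np

def positive_confidence_difference(probs, labels):
    """PCD (Eq. 4): mean predicted probability of correctly classified samples
    minus the mean predicted probability of misclassified samples."""
    predictions = probs.argmax(axis=1)
    confidence = probs.max(axis=1)       # probability assigned to the predicted class
    correct = predictions == labels
    return confidence[correct].mean() - confidence[~correct].mean()
```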
The number of iterations, overall accuracy and PCD for each activation function in each CNN architecture were averaged across five trials. High accuracy, low iterations and high PCDs would imply positive performance. To enable broad recommendations, the activation functions that yielded either preferable accuracy, iteration, or PCD performance for a particular dataset, and network structure combination, were recorded.
4 Results
All activation functions were successfully implemented in a CNN structure, and convergence was typically achieved. Every activation function succeeded in at least one application, which provides a degree of confidence in the implementation. Table 6 shows the mean accuracy across the three datasets and six structures considered. Tables 7 and 8 show the number of iterations to convergence and the PCD, respectively. Table 9 shows the incidence of failed training. The HardTanh activation function failed to converge in 15 attempts in the Deep Large network analysis of the IMDb data. Thus, this point is omitted from Tables 6, 7, 8 and 9.
5 Discussion
The results show the Symmetric Multi-State Activation Function (or symMSAF) was often one of the most accurate functions (Table 6) and had no failed trials (Table 9). It produced high PCD values (Table 8) but was often slow to converge (Table 7). The results imply symMSAF is a good choice of activation function for precise, albeit slower, text classification CNNs. The results for symMSAF were consistent with the sigmoid activation function. Thus, in contrast to other applications [27,28,29], it seems that the sigmoidal activation functions are effective in the text-classification tasks in this analysis.
Penalized Tanh (PTanh) performed well in all three metrics, converged in all cases and was rarely a poor performer. If symMSAF is deemed too slow for a particular application, PTanh represents the best compromise of speed to convergence and good accuracy. A similar conclusion about PTanh was reached by [22] in their study. Generalized Swish (G-Swish) was similar and performed particularly well in convergence speed. However, G-Swish occasionally required re-trialling and yielded low PCD values. The ReLU networks converged quickly, but often had poor accuracy and PCD (Table 8). Furthermore, the dying ReLU problem (where nodes iterate to zero) led to several failed trials. HardTanh, Threshold and Cloglogm were not robust activation functions for the text classification tasks tested. All three activation functions had many failed trials, with HardTanh encountering problems on a third of the tests, and completely failing to converge in one network structure and dataset combination.
In some cases, the activation functions showed particular efficacy in certain test case groupings. For example, RLReLU was the most accurate activation function for all Deep Small network tests (Table 6), but did not perform particularly well for any other test. The PCD metric seemed most dependent on activation function and least dependent on the dataset and network structure, implying a close relationship between activation function and network confidence. While CNN accuracy was affected by activation function, it seemed to be dominated by network structure (Table 6). Furthermore, the shallow networks seemed consistently more accurate than the deeper network structures. It is possible that the addition of the batch normalisation layer had an adverse effect on the accuracy [35].
Implementing each CNN over five trials with differing initial values allowed for more confidence in the results. It also allowed an analysis of the brittleness of the CNN approaches. The mean range in accuracy within an activation function and architecture for the AG dataset was only 0.7% (Table 6). The TREC and IMDb datasets yielded higher ranges (3.8% and 1.4%, respectively). While these values do not imply a particularly impactful level of brittleness, the ranges observed can absorb some of the inter-activation function distinctions observed in accuracy. However, in most datasets and architectures, the distinctions in accuracy were 2–3× greater than the ranges observed. The inter-architecture distinctions in accuracy were also larger than the inter-activation function distinctions. Average ranges observed for iterations across AG, TREC, and IMDb were 14.8, 16.1 and 20.0, respectively. These ranges do imply a level of brittleness. However, since the convergence criterion was based on change in loss, changes in initial conditions may be expected to lead to significantly different convergence paths, and thus iterations. PCD had ranges of 2%, 5.1% and 0.9% for AG, TREC and IMDb, respectively. These values were typically much smaller than the distinctions across activation functions.
To maintain consistency, each dataset and network structure combination kept the same training parameters across activation functions. In particular, layer features and structure were consistent within each depth, and the number of trainable parameters was proximal across depths. This consistency allowed for an equitable comparison between activation functions within each combination. However, it meant that none of the activation functions or CNN structures were optimised for the specific task. To enable reasonable convergence, learning rate was adapted across CNN structures (Table 5), but was consistent across activation functions. Thus, the conclusions of this paper are more meaningful in terms of comparing activation functions than CNN structure.
Adjusting the learning rate for specific activation functions could have improved their convergence success. However, adjusting learning rates for each specific activation function and task would have somewhat invalidated comparison of convergence speed. Prioritising test consistency also meant CNN accuracies did not attain peak values observed in the literature. Consistency seemed more important than peak performance to achieve equipoise in a comparison.
The novel (to the authors' knowledge) PCD metric offered insight into the propensity of a CNN to make incorrect predictions with high confidence. Ultimately, the PCD is similar to the FROC curve in that it allows a critical analysis of classification characteristics. However, it differs in several important ways. Specifically, it offers a simple, interpretable value that is cognisant of the strength of misclassification rather than the rate of misclassification. Thus, the PCD metric could be used to determine which networks have a higher chance of successfully coupling with strategies intended to reduce the rate of false positives. This could be useful in applications where false positives are particularly undesirable and data is sufficiently abundant to allow rejection of ambiguous outcomes. Hendrycks and Gimpel [61] showed that softmax prediction probabilities were consistently lower for incorrect classifications and could thus be used to help identify misclassified samples. Table 8 showed convincing trends in the PCD performance of certain activation functions. The GELU, Mish, Swish and ReLU activation functions yielded false positives with relatively strong prediction probabilities (Table 8) and should thus be used with caution in cases where false positives are particularly deleterious. Furthermore, almost none of the rectifier-based functions produced competitive PCD values. SymMSAF, Sigmoid and PTanh had relatively low confidence in false positives and could thus suit output adaptation to mitigate false positive risk.
In image classification CNNs, rectifier-based activations often perform well [27, 33, 43]. However, Table 6 indicated that rectifier-based activation functions did not exhibit notably improved accuracy in text classification, although they did exhibit reasonably favourable convergence speed (Table 7). Although both image data and text data can be fed into a CNN as a matrix of numbers, the underlying nature of the data leads to systematic differences in optimal activation functions and CNN structure. This may be due to the different manifestations of patterns across data types, the less scalable nature of word embeddings, and the orientation of features occurring differently across text and image classification.
6 Conclusions and recommendations
While previous studies noted that rectifier-based activation functions performed well in CNNs for image classification, this study found that the same did not hold for text classification. Text classification CNN efficacy was affected by activation function selection. Cloglogm, Threshold, HardTanh, and G-Swish were the least robust functions, with the most failed trials, and are not recommended for text classification. Accuracy was most affected by network structure, but activation functions also impacted classification performance. To obtain the best accuracy in most CNN text classification tasks, a shallow network structure with a large number of trainable parameters is recommended. Among the activation functions tested, the symMSAF activation function showed the most preferable performance. In particular, symMSAF maintained high accuracy and the greatest distinction between the prediction probabilities of true and false positives. Overall, the bounded activation functions, and in particular the sigmoid-based functions, performed better than the unbounded activation functions, especially the rectifier-based functions.
Data availability
Data used in this study were downloaded from publicly available sources.
References
Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, Gao J (2020) Deep learning based text classification: a comprehensive review. arXiv:2004.03705
Zhang Q, Wang Y, Gong Y, Huang X (2016) Automatic keyphrase extraction using recurrent neural networks. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas: Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1080. https://www.aclweb.org/anthology/D16-1080
Kim J, Jang S, Park E, Choi S (2020) Text classification using capsules. Neurocomputing 376:214–221. https://doi.org/10.1016/j.neucom.2019.10.033
Dey S, Singh AK, Prasad DK, McDonald-Maier KD (2019) SoCodeCNN: program source code for visual CNN classification using computer vision methodology. IEEE Access 7:157158–157172. https://doi.org/10.1109/ACCESS.2019.2949483
Kamal R, Tiwari S, Kolhe S, Deshpande MV (2021) A design approach for identifying, diagnosing and controlling soybean diseases using cnn based computer vision of the leaves for optimizing the production. IOP Conf Series Mater Sci Eng 1099(1):12037. https://doi.org/10.1088/1757-899X/1099/1/012037
Havaei M et al (2017) Brain tumor segmentation with deep neural networks. Med Image Anal 35:18–31. https://doi.org/10.1016/j.media.2016.05.004
Anvarjon T, Mustaqeem, Kwon S (2020) Deep-net: a lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors 20(18):5212. https://doi.org/10.3390/s20185212
Singhal S et al (2019) Multi-level region-of-interest CNNs for end to end speech recognition. J Ambient Intell Humaniz Comput 10(11):4615–4624. https://doi.org/10.1007/s12652-018-1146-z
Qian Y et al (2016) Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Transact Audio Speech Lang Process 24(12):2263–2276. https://doi.org/10.1109/TASLP.2016.2602884
Baker H, Hallowell MR, Tixier AJP (2020) Automatically learning construction injury precursors from text. Automat Construct 118:103145. https://doi.org/10.1016/j.autcon.2020.103145
Abid F, Alam M, Yasir M, Li C (2019) Sentiment analysis through recurrent variants latterly on convolutional neural network of Twitter. Futur Gener Comput Syst 95:292–308. https://doi.org/10.1016/j.future.2018.12.018
Yin W, Kann K, Yu M, Schütze H (2017) Comparative study of CNN and RNN for natural language processing.
Nguyen A, Yosinski J, Clune J (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. Comput Vision Pattern Recogn (CVPR)
Xia G, Ha S, Azevedo T, Maji P (2021) An underexplored dilemma between confidence and calibration in quantized neural networks. https://arxiv.org/abs/2111.08163
Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On Calibration of Modern Neural Networks. In: Presented at the Proceedings of the 34th International Conference on Machine Learning. https://proceedings.mlr.press/v70/guo17a.html.
Melotti G, Premebida C, Bird JJ, Faria DR, Gonçalves N (2020) Probabilistic object classification using CNN ML-MAP layers. https://arxiv.org/abs/2005.14565.
Wang J, Ai J, Lu M, Liu J, Wu Z (2023) Predicting neural network confidence using high-level feature distance. Inform Softw Technol 159:107214. https://doi.org/10.1016/j.infsof.2023.107214
Hong S, Yoon J, Park B, Choi MK (2023) Rethinking soft label in label distribution learning perspective. https://arxiv.org/abs/2301.13444.
Obla S, Gong X, Aloufi A, Hu P, Takabi D (2020) Effective activation functions for homomorphic evaluation of deep neural networks. IEEE Access 8:153098–153112. https://doi.org/10.1109/ACCESS.2020.3017436
Mohammed MA, Naji TAH, Abduljabbar HM (2019) The effect of the activation functions on the classification accuracy of satellite image by artificial neural network. Energy procedia 157:164–170. https://doi.org/10.1016/j.egypro.2018.11.177
Ramachandran P, Zoph B, Le QV (2017) Searching for activation functions. In: CoRR
Eger S, Youssef P, Gurevych I (2018) Is it time to swish? Comparing deep learning activation functions across NLP tasks.
Samatin Njikam AN, Zhao H (2016) A novel activation function for multilayer feed-forward neural networks. Appl Intell 45(1):75–82. https://doi.org/10.1007/s10489-015-0744-0
Shridhar K et al (2019) ProbAct: a probabilistic activation function for deep neural networks. arXiv:1905.10761
Chieng HH, Wahid N, Pauline O, Perla SRK (2018) Flatten-T swish: a thresholded ReLU-Swish-like activation function for deep learning. Int J Adv Int Inform 4(2):76–86. https://doi.org/10.26555/ijain.v4i2.249
Nanni L, Lumini A, Ghidoni S, Maguolo G (2020) Stochastic selection of activation layers for convolutional neural networks. Sensors 20(6):1626. https://doi.org/10.3390/s20061626
Wang Y, Li Y, Song Y, Rong X (2020) The influence of the activation function in a convolution neural network model of facial expression recognition. Appl Sci 10(5):1897. https://doi.org/10.3390/app10051897
Favorskaya MN, Andreev VV (2019) The study of activation functions in deep learning for pedestrian detection and tracking. Int Archives Photogramm Remote Sens Spatial Inform Sci https://doi.org/10.5194/isprs-archives-XLII-2-W12-53-2019
Zhao Q Griffin LD (2016) Suppressing the unusual: towards robust CNNs using symmetric activation functions. arXiv:1603.05145
Liew SS, Khalil-Hani M, Bakhteri R (2016) Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems. Neurocomputing 216:718–734. https://doi.org/10.1016/j.neucom.2016.08.037
Zhu M, Min W, Wang Q, Zou S, Chen X (2021) PFLU and FPFLU: two novel non-monotonic activation functions in convolutional neural networks. Neurocomputing 429:110–117. https://doi.org/10.1016/j.neucom.2020.11.068
Farzad A, Farzad A, Mashayekhi H, Mashayekhi H, Hassanpour H, Hassanpour H (2019) A comparative performance analysis of different activation functions in LSTM networks for classification. Neural Comput Appl 31(7):2507–2521. https://doi.org/10.1007/s00521-017-3210-6
Castaneda G, Morris P, Khoshgoftaar TM (2019) Evaluation of maxout activations in deep learning across several big data domains. J Big Data 6(1):1–35. https://doi.org/10.1186/s40537-019-0233-0
Apicella A, Donnarumma F, Isgrò F, Prevete R (2021) A survey on modern trainable activation functions. Neural Netw 138:14–32. https://doi.org/10.1016/j.neunet.2021.01.026
Gu J et al (2018) Recent advances in convolutional neural networks. Pattern Recogn 77:354–377. https://doi.org/10.1016/j.patcog.2017.10.013
Kim Y (2014) Convolutional Neural Networks for Sentence Classification. In: Presented at the Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Liao Z (2020) Trainable activation function in image classification. Assoc Adv Artific Int
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Presented at the Proceedings of the 27th International Conference on International Conference on Machine Learning, Haifa, Israel
Cao J, Pang Y, Li X, Liang J (2018) Randomly translational activation inspired by the input distributions of ReLU. Neurocomputing 275:859–868. https://doi.org/10.1016/j.neucom.2017.09.031
Maas AL (2013) Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning (ICML)
Nwankpa C, Ijomah W, Gachagan A, Marshall S (2018) Activation functions: comparison of trends in practice and research for deep learning. arXiv:1811.03378
You W, Shen C, Wang D, Chen L, Jiang X, Zhu Z (2020) An intelligent deep feature learning method with improved activation functions for machine fault diagnosis. IEEE Access 8:1975–1985. https://doi.org/10.1109/ACCESS.2019.2962734
Macêdo D, Zanchettin C, Oliveira ALI, Ludermir T (2019) Enhancing batch normalized convolutional networks using displaced rectifier linear units: a systematic comparative study. Expert Syst Appl 124:271–281. https://doi.org/10.1016/j.eswa.2019.01.066
Hendrycks D Gimpel K (2017) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. In: Presented at the ICLR
Misra D (2019) Mish: a self regularized non-monotonic activation function. arXiv:1908.08681
Xu B, Huang R, Li M (2016) Revise saturated activation functions. In: CoRR
Cai C, Xu Y, Ke D, Su K (2015) Deep neural networks with multistate activation functions. Comput Intell Neurosci 2015:721367–721410. https://doi.org/10.1155/2015/721367
da Gecynalda S, Gomes S, Ludermir TB, Lima LM (2011) Comparison of new activation functions in neural network for forecasting financial time series. Neural Comput Appl 20:417–439
Burhani H, Feng W, Hu G (2015) Denoising autoencoder in neural networks with modified elliott activation function and sparsity-favoring cost function. In: 2015 3rd International Conference on Applied Computing and Information Technology/2nd International Conference on Computational Science and Intelligence, IEEE, pp 343–348. https://doi.org/10.1109/ACIT-CSI.2015.67
Koçak Y, Üstündağ Şiray G (2021) New activation functions for single layer feedforward neural network. Expert Syst Appl 164:113977. https://doi.org/10.1016/j.eswa.2020.113977
Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS
Duch W, Jankowski N (1997) New neural transfer functions. Neural Comput Surv 7:639–658
Li X, Roth D (2002) Learning question classifiers. In: COLING 2002: The 19th International Conference on Computational Linguistics https://cogcomp.seas.upenn.edu/Data/QA/QC/
Lakshmipathi N (2019) IMDB dataset of 50K movie reviews. https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. Adv Neural Inform Process Syst 649–657
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543. https://nlp.stanford.edu/projects/glove/
Jin H, Li Z, Tong R, Lin L (2018) A deep 3D residual CNN for false-positive reduction in pulmonary nodule detection. Medical Phys 45(5):2097–2107. https://doi.org/10.1002/mp.12846
Henriksson J et al (2021) Performance analysis of out-of-distribution detection on trained neural networks. Inform Softw Technol 130:106409. https://doi.org/10.1016/j.infsof.2020.106409
Mittapalli PS, Thanikaiselvan V (2021) Multiscale CNN with compound fusions for false positive reduction in lung nodule detection. Artific Intell Med 113:102017–102017. https://doi.org/10.1016/j.artmed.2021.102017
Papadopoulos A-A, Rajati MR, Shaikh N, Wang J (2021) Outlier exposure with confidence control for out-of-distribution detection. Neurocomputing 441:138–150. https://doi.org/10.1016/j.neucom.2021.02.007
Hendrycks D Gimpel K (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: Presented at the ICLR
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. RHKE was supported by the University of Canterbury Aho Hīnātore | UC Accelerator PhD Scholarship.
Author information
Contributions
Credit author contributions: Conceptualization RHKE, Data curation RHKE, Formal Analysis RHKE PDD KM, Funding acquisition PDD, Investigation RHKE HL KM, Methodology RHKE PDD, Project administration PDD, Resources PDD, Software RHKE, Supervision PDD HL, Validation RHKE KM, Visualization RHKE PDD, Writing – original draft RHKE, Writing – review & editing PDD HL KM.
Ethics declarations
Conflict of interest
The authors declare they have no conflict of interest.
Ethical approval
Not applicable.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Emanuel, R.H.K., Docherty, P.D., Lunt, H. et al. The effect of activation functions on accuracy, convergence speed, and misclassification confidence in CNN text classification: a comprehensive exploration. J Supercomput 80, 292–312 (2024). https://doi.org/10.1007/s11227-023-05441-7