1 Introduction

The nature of the internet has led to a growing abundance of textual data. Researchers have used neural network approaches to process large datasets that would be difficult to analyse manually [1,2,3]. Neural networks use a structured arrangement of functions to analyse certain regions of data and enable recognition or prediction tasks. Convolutional Neural Networks (CNNs) are a subset of neural networks that can be used for text classification tasks. While Recurrent Neural Networks, Capsule Neural Networks and Transformers are all popular models in text classification, CNNs remain useful for detecting patterns and features such as key phrases [1]. CNNs have been successfully applied to several problems involving computer vision [4,5,6], speech recognition [7,8,9], and natural language processing [10,11,12]. However, some CNNs have yielded incorrect classifications with high confidence [13,14,15], and can be prone to overfitting. To overcome these issues, a variety of settings within a typical CNN architecture have been proposed [16,17,18]. These settings include the number of hidden layers, the number, size and stride of filters, the activation functions, and the loss functions.

It is not always possible to determine the best CNN approach for a specific application a priori. In particular, activation functions have been shown to have a noticeable effect on the accuracy of a neural network [19,20,21,22]. Many novel activation functions have been created to provide improved results for specific applications [23,24,25]. However, an activation function that suits one application is not guaranteed to suit another [26]. Even theoretically similar activation functions can perform differently in practice [22]. There are several papers comparing the performance of different activation functions on image, audio and other non-text-based datasets [27,28,29]. However, findings in these papers may not necessarily apply to text classification.

There are several theories as to why certain features of different activation functions may be advantageous. In particular, [30] theorise that bounded functions have greater numerical stability than unbounded functions. Furthermore, [29] claim that symmetric activation functions lead to improved suppression of adversarial samples, reducing high confidence false predictions. In the context of rectifier-based activations, [31] purport that non-monotonic functions are particularly strong, as they introduce negative activation values and non-zero derivative values corresponding to negative inputs.

Assessment of the efficacy of activation functions on text classification has been undertaken, with varied results [22, 32, 33]. However, there is no thorough exploration of the effect of activation functions across both text classification datasets and CNN structures. It is difficult to predict the efficacy of various activation functions based on the performance in a proximal context [34]. Therefore, there are no generally accepted optimal criteria for selecting activation functions across text classification tasks. In this paper, sixteen activation functions are assessed across three different text classification datasets and six different CNN structures to determine the best activation functions.

2 Background

CNN architectures for text classification tasks can include convolutional layers, activation function layers, max pooling layers, batch normalisation layers, and fully connected layers. Convolutional layers are generally used to extract features in the raw data, and thus, often appear early in the network architecture. Activation function layers apply non-linear transformations to outputs from convolutional and fully connected layers. Max pooling layers effectively condense the output of convolutional layers and can reduce computational burden. Batch normalisation layers can be added to re-distribute the output values of a layer. This normalisation can help prevent overfitting but may negatively affect a network’s performance [35]. Fully connected layers add information from the previous layers using simple linear functions. The output layer of a text classification CNN is commonly a fully connected layer, with softmax activation functions (Eq. 1). Softmax activations allow for an output of dependent probabilities, corresponding to the likelihood that each class matches the input.

$$f\left( {\varvec{x}} \right)_{i} = \frac{{e^{{x_{i} }} }}{{\mathop \sum \nolimits_{j = 1}^{K} e^{{x_{j} }} }}$$
(1)

where ‘\({\varvec{x}}\)’ is a vector of the previous layer’s outputs, ‘\(K\)’ is the vector length, ‘\(j\)’ is the index of a vector element, and ‘\(i\)’ is the index of the corresponding softmax output element.
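As an illustration of Eq. 1, the softmax output can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in this study: the function name `softmax` and the example input values are illustrative assumptions, and the subtraction of the maximum is a common numerical safeguard that does not change the result of Eq. 1.

```python
import numpy as np

def softmax(x):
    """Eq. 1: exponentiate each element and divide by the sum of exponentials."""
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum leaves the ratio unchanged but avoids overflow.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

# Illustrative previous-layer outputs for a three-class problem (K = 3).
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

# The outputs are dependent probabilities that sum to one; the largest
# entry identifies the class the network considers most likely.
predicted_class = int(np.argmax(probs))
```

Because the outputs share a common denominator, raising one class probability necessarily lowers the others, which is the sense in which the probabilities are dependent.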

The stochastic gradient descent parameter identification method is classically applied to train CNNs. Stochastic gradient descent iterates from an initial network parametrisation using the loss function (\(L\)) derivative (\(\nabla L\)), the current trainable parameter set (\(w_{old}\)), and a predefined learning rate (\(\eta\)) to obtain the new trainable parameter set (\(w_{new}\)) (Eq. 2). The loss function calculates how close the network outputs (\(\hat{y}\)) are to the desired outputs (\(y\)). Cross entropy loss or its multi-class counterpart (Eq. 3, within which ‘\(N\)’ is the number of outputs) are common loss functions for CNNs with softmax outputs. Network training is considered successful if the convergence criterion is met prior to the maximum number of training iterations.

$$w_{new} = w_{old} - \eta \nabla L$$
(2)
$$L\left( {\hat{y}, y} \right) = - \mathop \sum \limits_{k = 1}^{N} y^{\left( k \right)} \log \hat{y}^{\left( k \right)}$$
(3)
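A single update of Eq. 2 using the loss of Eq. 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the training code of this study: the three logits are treated directly as the trainable parameters, the learning rate of 0.5 is arbitrary, and the gradient expression \(\hat{y} - y\) is the standard result for a softmax output combined with cross entropy loss. The function names and the small constant `eps` are illustrative.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def cross_entropy(y_hat, y):
    """Eq. 3: L = -sum_k y^(k) log(y_hat^(k)), with y a one-hot target."""
    eps = 1e-12  # guards against log(0); not part of Eq. 3
    return -np.sum(y * np.log(y_hat + eps))

def sgd_step(w_old, grad, eta):
    """Eq. 2: w_new = w_old - eta * grad(L)."""
    return w_old - eta * grad

# Toy setting (assumption): the logits themselves are the parameters.
w = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])  # desired output: class 1

loss_before = cross_entropy(softmax(w), y)
grad = softmax(w) - y          # gradient of Eq. 3 w.r.t. the logits
w = sgd_step(w, grad, eta=0.5)
loss_after = cross_entropy(softmax(w), y)
```

With uniform initial outputs the loss starts at \(\log 3\), and one step in the negative gradient direction reduces it.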

Since words must be converted to numbers for use as CNN inputs, word embeddings are a common pre-processing step in text classification problems. Word embeddings are vectors of numbers that correspond to words. They can be pre-trained on large volumes of text to maintain aspects of a word’s context. CNNs for text classification are typically shallower than CNNs for other applications [35], and by utilising certain word embeddings, acceptable results can be achieved with a single convolutional layer [36].

3 Methods

This research considered how activation functions perform in CNN text classification tasks with different architectures and datasets. Sixteen diverse activation functions were tested (Table 1). This is not an exhaustive list of potential functions. However, the list represents the specific or approximate form of most popular types. Several rectifier-based, piecewise, continuous, bounded, unbounded, symmetric, monotonic and/or non-monotonic functions were included. An attempt was made to include a 1-D radial basis function (RBF), but the deeper networks using the 1-D RBF would not converge for the datasets tested. Thus, the function was abandoned.

Table 1 The sixteen activation functions tested, where ‘\(x\)’ is the function input

Trainable activation functions such as ‘maxout’ were excluded from analysis. Trainable activation functions have their own parameters that must be trained with a network’s usual parameters. While trainable activation functions often perform well [37], they can be mathematically equivalent to a network with a different structure and non-trainable activation functions [34]. A network with trainable activation functions is also more difficult to train [34]. When effort is taken to determine if marginal performance gains using maxout functions are due to the increased number of trainable parameters or the activation function itself, accuracy was found to correlate with memory usage [33]. Thus, conducting a direct but equitable comparison with trainable activation functions is difficult.

The activation functions were grouped into five categories, based on the function features. The rectifier-based functions are all unbounded. Rectifier-based functions can be continuous and piecewise (Fig. 1a) or non-monotonic (Fig. 1b). Of the non-rectifier-based functions, some were tanh-based (Fig. 1c), some were sigmoid-based (Fig. 1d), and some were neither (Fig. 1e). All non-rectifier-based functions were monotonic, all but G-Swish were bounded, and all but HardTanh and PTanh were continuous.
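The features used for grouping (boundedness and monotonicity) can be checked numerically. The sketch below uses the widely published forms of five well-known functions; these are assumptions about the standard definitions and may differ in detail from the exact forms in Table 1. The helper `describe`, the grid limits, and the bound of 1 are illustrative choices, and a check on a finite grid is only a heuristic.

```python
import numpy as np

# Commonly published forms (assumed; Table 1 holds the forms actually tested).
ACTIVATIONS = {
    "ReLU": lambda x: np.maximum(0.0, x),
    "Swish": lambda x: x / (1.0 + np.exp(-x)),
    "Sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "Tanh": np.tanh,
    "HardTanh": lambda x: np.clip(x, -1.0, 1.0),
}

def describe(f, lo=-10.0, hi=10.0, n=2001):
    """Heuristic feature check on a finite grid of inputs."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    # Monotonic (non-decreasing) if no step on the grid goes downwards.
    monotonic = bool(np.all(np.diff(y) >= -1e-12))
    # Treated as bounded if outputs stay within [-1, 1] on the grid.
    bounded = bool(np.max(np.abs(y)) <= 1.0 + 1e-9)
    return {"monotonic": monotonic, "bounded": bounded}

features = {name: describe(f) for name, f in ACTIVATIONS.items()}
```

Under these assumed definitions, ReLU is monotonic and unbounded, Swish is non-monotonic and unbounded, and Sigmoid, Tanh and HardTanh are monotonic and bounded.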

Fig. 1 Activation functions tested. A Piecewise, rectifier-based activation functions. B Non-monotonic, rectifier-based activation functions. C Tanh-based activation functions. D Sigmoid-based activation functions. E Other continuous monotonic activation functions

Fig. 2 Testing procedure for each activation function, network structure and dataset combination

As CNNs can have variable performance across different datasets [51, 52], three common text classification datasets were chosen to diversify the testing. The TREC-6 [53], IMDb [54], and AG News [55] datasets were used as they include a range of text lengths, data sizes and class numbers. The TREC-6 dataset is a six-class question classification dataset where questions are labelled as relating to either an abbreviation, description and abstract concept, entity, human being, location, or numeric value. The IMDb dataset consists of 50,000 movie reviews that have been classed as positive or negative reviews. The AG News dataset consists of titles and descriptions of news articles that can be categorised as world, sports, business, or science-and-technology. These datasets are often used for text classification method testing and are already split into allotted testing and training datasets.

Text was split with a plain English tokenizer into a list of words for each instance of text data. Datasets had a varying number of words for the text instances, so a maximum number of words was chosen to decrease computational load. When the number of words exceeded the maximum number, the excess words were ignored. For some datasets, only a portion of the allotted training data was used, and the remainder was added to the testing data. Table 2 summarizes dataset information. Words were converted to word embeddings using pre-trained 50-dimensional GloVe embeddings, trained on 6 billion words from Wikipedia and Gigaword [56].
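The pre-processing steps described above (tokenise, truncate to a maximum number of words, look up embeddings) can be sketched as follows. This is a minimal sketch under stated assumptions: `tokenize` is a hypothetical whitespace-based stand-in for the plain English tokenizer, the three-dimensional `toy_vectors` stand in for the 50-dimensional GloVe vectors, and mapping unknown words to zero vectors is an assumption not stated in the text.

```python
import numpy as np

def tokenize(text):
    """Hypothetical stand-in for the plain English tokenizer."""
    return text.lower().split()

def embed(text, vectors, max_words, dim):
    """Return a (words x dim) matrix, ignoring words beyond max_words."""
    tokens = tokenize(text)[:max_words]
    # Assumption: words missing from the embedding table map to zeros.
    rows = [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    return np.array(rows, dtype=float).reshape(len(rows), dim)

# Toy 3-dimensional embeddings (illustrative values, not GloVe).
toy_vectors = {
    "what": np.array([0.1, 0.2, 0.3]),
    "is": np.array([0.0, 0.5, 0.1]),
    "glove": np.array([0.7, 0.1, 0.4]),
}

# With max_words=3 the fourth word ("exactly") is ignored.
matrix = embed("What is GloVe exactly", toy_vectors, max_words=3, dim=3)
```

The resulting matrix has one row per retained word, which is the form in which text is passed to the first convolutional layer.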

Table 2 Characteristics of the different datasets and how they were truncated

The datasets were classified using six different network structures. Shallow networks (Table 3) and deeper networks (Table 4) were explored. Small, medium and large CNNs for each depth were designed with approximately 5,000, 25,000, and 125,000 trainable parameters, respectively. For consistency, softmax activations (Eq. 1) were used in the output layers. The hidden activation layers exclusively contained the activation function being tested. To equitably test activation functions known to be subject to the vanishing gradient problem (such as tanh and sigmoid), a batch normalization layer was included in the deeper network structure across all activation function tests.

Table 3 The shallow network structure, based on [36]
Table 4 The deep network structure

To train a particular network, training data was split into batches of 100. Batches of data were padded with zeros to be equal length. Training was undertaken with stochastic gradient descent (Eq. 2), and the learning rates noted in Table 5. Multi-class cross entropy loss was applied (Eq. 3). Convergence was reached when the average loss changed less than a given tolerance across five iterations of all training data. The tolerance was 0.01 for all tests. While it is common to split a dataset to include a validation set to test for overfitting and stop training before convergence, early stopping was not utilised in this study. The number of iterations to convergence was a metric of interest and early stopping would invalidate those results. Furthermore, some activation functions may have been more prone to overfitting and using a validation set would have obscured that phenomenon.
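The convergence criterion can be sketched as a check on the history of average losses, one value per iteration over all training data. The text does not state exactly how the change across five iterations was measured, so the sketch assumes one plausible reading: the spread (maximum minus minimum) of the five most recent average losses must fall below the tolerance. The function name `has_converged` and the example loss values are illustrative.

```python
def has_converged(epoch_losses, tol=0.01, window=5):
    """Assumed reading: the last `window` average losses vary by less than tol."""
    if len(epoch_losses) < window:
        return False  # not enough history to judge convergence
    recent = epoch_losses[-window:]
    return max(recent) - min(recent) < tol

# Illustrative loss history: a fast initial drop followed by a plateau.
history = [1.0, 0.5, 0.3, 0.299, 0.298, 0.297, 0.296]

still_training = has_converged(history[:5])  # spread of 0.702 over five values
converged = has_converged(history)           # spread of 0.004 over five values
```

Under this reading, the number of iterations to convergence is simply the length of the history at the first point where the check passes.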

Table 5 Learning rates used to train different networks on different datasets

The class with the highest associated softmax probability was chosen as the network output. Because of random elements in neural networks, such as weight initialization, five successfully converged network training runs were recorded for each activation function of interest. In cases where a trial failed, it was repeated. A trial was classified as failed when the average loss became infinite, or the network converged but did not appear to have fit well to the training data. For example, if a converged network yielded unusually low accuracy on the training data (< 70%), it was counted as a failed trial. The number of failed trials required to reach five successful trials was recorded. If no successful convergences were achieved after 15 attempts, no more attempts were made for that test. The testing procedure is shown in Fig. 2.
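The repeated-trial procedure can be sketched as a loop. This is a sketch of the logic described above, not the code used in this study: `train_once` is a hypothetical callable that trains one network and returns its final average loss and training accuracy, and `flaky` and `hopeless` are stand-ins that imitate training outcomes without training anything. The sketch also treats a non-finite loss (infinite or NaN) as divergence, which is slightly broader than the infinite-loss condition in the text.

```python
import math

def run_trials(train_once, n_success=5, give_up_after=15, min_train_acc=0.70):
    """Repeat training until five successes, or 15 attempts with none."""
    successes, n_failed, attempts = [], 0, 0
    while len(successes) < n_success:
        # Give up only if 15 attempts have produced no success at all.
        if not successes and attempts >= give_up_after:
            break
        attempts += 1
        result = train_once(attempts)
        diverged = not math.isfinite(result["final_loss"])
        poor_fit = result["train_acc"] < min_train_acc
        if diverged or poor_fit:
            n_failed += 1
        else:
            successes.append(result)
    return successes, n_failed

def flaky(attempt):
    # Hypothetical trainer: diverges on the second and fourth attempts.
    if attempt in (2, 4):
        return {"final_loss": math.inf, "train_acc": 0.0}
    return {"final_loss": 0.2, "train_acc": 0.95}

def hopeless(attempt):
    # Hypothetical trainer: converges but never fits the training data well.
    return {"final_loss": 0.9, "train_acc": 0.55}

ok_runs, ok_failed = run_trials(flaky)
bad_runs, bad_failed = run_trials(hopeless)
```

The text does not specify a cap on attempts once at least one trial has succeeded, so the sketch continues until five successes are recorded in that case.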

Since high confidence misclassification is a key CNN problem, a metric is needed to compare the activation function’s effect on the level of confidence assigned to misclassifications. Such a metric would need to summarise the network’s tendency for high confidence misclassification with brevity, for ease of comparison, and suit a range of datasets. In the literature, Free-Response Receiver Operating Characteristic (FROC) curves have been used for similar purposes [57,58,59]. Contour plots of the confidence scores [13], and total variation distance [60] have also been used in the literature. However, these existing metrics appeared unsuitable for comparing multiple candidate activation functions. While the prediction probability from a softmax layer does not directly correspond to confidence in isolation, the prediction probability tends to be lower for misclassified data when compared to correctly classified data [61]. Therefore, a larger difference between the probability of correctly classified and misclassified samples could indicate a network more adept at recognising misclassification. Understanding this contrast in probability, a novel metric, Positive Confidence Difference (PCD) was defined. PCD is calculated as the difference between the mean probability of true positive predictions (\(P_{TP}\)) and false positive predictions (\(P_{FP}\)) (Eq. 4).

$$PCD = \overline{{P_{TP} }} - \overline{{P_{FP} }}$$
(4)
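Eq. 4 can be sketched as follows. The sketch assumes one reading of the definition: each sample's prediction probability is the largest softmax output, a true positive is a sample whose predicted class matches its label, and a false positive is a sample whose predicted class does not. The function name and the four example samples are illustrative, and returning NaN when either group is empty is a choice made for this sketch.

```python
import numpy as np

def positive_confidence_difference(probs, labels):
    """Eq. 4: mean probability of correct predictions minus that of incorrect ones."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)   # class with the highest softmax probability
    confidence = probs.max(axis=1)     # probability assigned to that class
    correct = predicted == labels
    if correct.all() or not correct.any():
        return float("nan")            # PCD is undefined without both groups
    return float(confidence[correct].mean() - confidence[~correct].mean())

# Four illustrative two-class softmax outputs and their true labels.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 0, 1, 1])

# Correct predictions have probabilities 0.9, 0.8 and 0.7 (mean 0.8);
# the single incorrect prediction has probability 0.6.
pcd = positive_confidence_difference(probs, labels)
```

In this example the PCD is 0.8 − 0.6 = 0.2; a larger value indicates that misclassified samples tend to receive lower prediction probabilities than correctly classified samples.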

The number of iterations, overall accuracy and PCD for each activation function in each CNN architecture were averaged across five trials. High accuracy, low iterations and high PCDs would imply positive performance. To enable broad recommendations, the activation functions that yielded either preferable accuracy, iteration, or PCD performance for a particular dataset, and network structure combination, were recorded.

4 Results

All activation functions were successfully implemented in a CNN structure and convergence was typically achieved. No activation function failed in every application. This outcome provides a degree of confidence in implementation. Table 6 shows the mean accuracy across the three datasets and six structures considered. Tables 7 and 8 show the number of iterations to convergence, and the PCD, respectively. Table 9 shows the incidence of failed training. The HardTanh activation function in the Deep Large network analysis of the IMDb data failed to converge in 15 attempts. Thus, this point is omitted from Tables 6, 7, 8 and 9.

Table 6 Average accuracy (and range: max–min) across the various activation functions, architectures, and datasets
Table 7 Average iterations (and range) to convergence across the various activation functions, architectures, and datasets
Table 8 Average positive confidence difference (PCD), (and range) across the various activation functions, architectures, and datasets
Table 9 Incidence of failed convergence across the various activation functions, architectures, and datasets

5 Discussion

The results show the Symmetric Multi-State Activation Function (or symMSAF) was often one of the most accurate functions (Table 6) and had no failed trials (Table 9). It produced high PCD values (Table 8) but was often slow to converge (Table 7). The results imply symMSAF is a good choice of activation function for precise, albeit slower, text classification CNNs. The results for symMSAF were consistent with the sigmoid activation function. Thus, in contrast to other applications [27,28,29], it seems that the sigmoidal activation functions are effective in the text-classification tasks in this analysis.

Penalized Tanh (PTanh) performed well in all three metrics, converged in all cases and was rarely a poor performer. If symMSAF is deemed too slow for a particular application, PTanh represents the best compromise of speed to convergence and good accuracy. A similar conclusion about PTanh was reached by [22] in their study. Generalized Swish (G-Swish) was similar and performed particularly well in convergence speed. However, G-Swish occasionally required re-trialling and yielded low PCD values. The ReLU networks converged quickly, but often had poor accuracy and PCD (Table 8). Furthermore, the dying ReLU problem (where nodes iterate to zero) led to several failed trials. HardTanh, Threshold and Cloglogm were not robust activation functions for the text classification tasks tested. All three activation functions had many failed trials, with HardTanh encountering problems on a third of the tests, and completely failing to converge in one network structure and dataset combination.

In some cases, the activation functions showed particular efficacy in certain test case groupings. For example, RLReLU was the most accurate activation function for all Deep Small network tests (Table 6), but did not perform particularly well for any other test. The PCD metric seemed most dependent on activation function and least dependent on the dataset and network structure, implying a close relationship between activation function and network confidence. While CNN accuracy was affected by activation function, it seemed to be dominated by network structure (Table 6). Furthermore, the shallow networks seemed consistently more accurate than the deeper network structures. It is possible that the addition of the batch normalisation layer had an adverse effect on the accuracy [35].

Implementing each CNN over five trials with differing initial values allowed for more confidence in the results. It also allowed an analysis of the brittleness of the CNN approaches. The mean range in accuracy within an activation function and architecture for the AG dataset was only 0.7% (Table 6). TREC and IMDb datasets yielded higher ranges (3.8% and 1.4%, respectively). While these values do not imply a particularly impactful level of brittleness, the ranges observed can absorb some of the inter-activation function distinctions observed in accuracy. However, in most datasets and architectures, the distinctions in accuracy were 2–3× greater than the ranges observed. The distinctions in inter-architecture accuracy were larger than the distinctions in inter-activation function accuracy. Average ranges observed for iterations across AG, TREC, and IMDb were 14.8, 16.1 and 20.0, respectively. These ranges do imply a level of brittleness. However, since the convergence criterion was based on change in loss, changes in initial conditions may be expected to lead to significantly different convergence paths, and thus iterations. PCD had ranges of 2%, 5.1% and 0.9% for AG, TREC and IMDb, respectively. These values were typically much smaller than the distinctions across activation functions.

To maintain consistency, each dataset and network structure combination kept the same training parameters across activation functions. In particular, layer features and structure were consistent within each depth, and the number of trainable parameters was proximal across depths. This consistency allowed for an equitable comparison between activation functions within each combination. However, it meant that none of the activation functions or CNN structures were optimised for the specific task. To enable reasonable convergence, learning rate was adapted across CNN structures (Table 5), but was consistent across activation functions. Thus, the conclusions of this paper are more meaningful in terms of comparing activation functions than CNN structure.

Adjusting the learning rate for specific activation functions could have improved their convergence success. However, adjusting learning rates for each specific activation function and task would have somewhat invalidated comparison of convergence speed. Prioritising test consistency also meant CNN accuracies did not attain peak values observed in the literature. Consistency seemed more important than peak performance to achieve equipoise in a comparison.

The novel (to the authors’ knowledge) PCD metric offered insight into the propensity of a CNN to make incorrect predictions with high confidence. Ultimately, the PCD is similar to the FROC curve as it allows a critical analysis of classification characteristics. However, it differs in several important ways. Specifically, it offers a simple, interpretable value that is cognizant of the strength of misclassification, rather than the rate of misclassification. Thus, the PCD metric could be used to determine which networks have a higher chance of successfully coupling with strategies intended to reduce the rate of false positives. This could be useful in specific applications where false positives are particularly undesirable, and data is sufficiently abundant to allow rejection of ambiguous outcomes. Hendrycks and Gimpel [61] showed that prediction probabilities from softmax layers were consistently lower for incorrect classifications, and thus could be used to help identify misclassified samples. Table 8 showed convincing trends in the PCD performance of certain activation functions. GELU, Mish, Swish and ReLU activation functions yielded false positives with relatively strong prediction probabilities (Table 8) and should thus be used with caution in cases where false positives are particularly deleterious. Furthermore, almost none of the rectifier-based functions produced competitive PCD values. SymMSAF, Sigmoid and PTanh had a relatively low confidence in false positives and could thus suit output adaptation to mitigate false positive risk.
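One possible strategy for rejecting ambiguous outcomes is to discard predictions whose largest softmax probability falls below a threshold. The sketch below is a hypothetical illustration of that idea and was not part of the testing in this study: the function name `accept_confident`, the threshold of 0.65, and the four example samples are all assumptions chosen for illustration.

```python
import numpy as np

def accept_confident(probs, threshold):
    """Boolean mask of samples whose top softmax probability meets the threshold."""
    probs = np.asarray(probs, dtype=float)
    return probs.max(axis=1) >= threshold

# Four illustrative two-class softmax outputs and their true labels.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 0, 1, 1])

predicted = probs.argmax(axis=1)
accepted = accept_confident(probs, threshold=0.65)

# Fraction of samples retained, and accuracy among the retained samples.
coverage = float(accepted.mean())
accepted_accuracy = float((predicted[accepted] == labels[accepted]).mean())
```

In this toy example the only misclassified sample is also the least confident one, so it is rejected at the cost of reduced coverage. A network with a high PCD would be expected to behave more like this example than a network with a low PCD.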

In image classification CNNs, rectifier-based activations often perform well [27, 33, 43]. In contrast, Table 6 indicated that rectifier-based activation functions did not exhibit notably improved accuracy in text classification, although they did seem to exhibit reasonably favourable convergence speed (Table 7). Although both image and text data can be fed into a CNN as a matrix of numbers, the underlying nature of the data leads to systematic differences in optimal activation functions and CNN structure. This may be due to the different manifestations of patterns across data types, the less scalable nature of word embeddings, and the orientation of features occurring differently across text and image classification.

6 Conclusions and recommendations

While previous studies noted that rectifier-based activation functions performed well in CNNs for image classification, this study found that the same did not hold for text classification. Text classification CNN efficacy was affected by activation function selection. Cloglogm, Threshold, HardTanh, and G-Swish were the least robust functions with the most failed trials and are not recommended for text classification. Accuracy was most affected by network structure, but activation functions also impacted classification performance. To obtain the best accuracy in most CNN text classification tasks, a shallow network structure with a large number of trainable parameters is recommended. Among the activation functions tested, the symMSAF activation function showed the most preferable performance. In particular, symMSAF maintained high accuracy, and maintained the greatest distinction between the prediction probabilities of true and false positives. Overall, the bounded activation functions, and in particular the sigmoid-based functions, performed better than the unbounded activation functions, especially the rectifier-based functions.