1. Introduction
Speech is one of the most common and significant means by which humans communicate and express their emotions, cognitive states, and intentions to each other. Speech is produced by a natural biological mechanism in which the lungs expel air that is converted into speech as it passes through the vocal cords and organs including the tongue, teeth, and lips [1]. In general, a speech and voice recognition system can be used for gender identification. A natural voice recognition system is the human ear, which can efficiently distinguish gender from voice and speech based on attributes such as frequency and loudness. In a similar way, a machine can be taught to do the same by selecting the right features from voice data and feeding them to a machine learning algorithm.
Gender recognition is a technique for determining the gender category of a speaker by processing speech signals. Speech signals taken from recorded speech can be used to acquire acoustic attributes such as duration, intensity, frequency and filtering [2]. Applications where gender recognition can be useful include speech emotion recognition, human-to-machine interaction, sorting of telephone calls by gender, automatic salutations, muting sounds for a gender, and audio/video categorization with tagging (see [1,3,4,5] and the references therein).
As technology advances rapidly, machine learning has seen major developments and has become widely established as a research field. Machine learning is a subset of artificial intelligence which utilizes algorithms and data to teach computers to make decisions on specific problems in fields such as finance, banking, and medicine [6,7,8]. Along this line, several studies on gender recognition and identification by voice using machine learning and data mining techniques have been conducted [1,9,10,11]. However, the development of an accurate prediction model for gender recognition by voice is still considered a rather difficult and challenging task. The authors of [12,13] conducted an extensive experimental analysis and pointed out the difficulties of this classification problem, since speech signals are highly time-varying and have very high randomness. The difficulty is mainly due to the lack of available labeled data for efficiently training a supervised classifier, which has hampered progress in the field. Furthermore, training a classifier that makes accurate predictions often requires a large amount of labeled data. Finding sufficient labeled data is often expensive and time-consuming because it requires human effort, whereas finding unlabeled data is in general significantly easier. To address the problem of insufficient labeled data, semi-supervised learning (SSL) algorithms constitute an appropriate methodology for exploiting the hidden information in the unlabeled set, with the aim of building more accurate classifiers. In the literature, several classes of SSL algorithms have been proposed and evaluated, each based on different methodologies and techniques related to the link between the distribution of labeled and unlabeled data (see [14,15,16] and the references therein). Self-labeled algorithms probably constitute the most popular and widely utilized class of SSL algorithms. The algorithms of this class follow an iterative procedure, augmenting an initial labeled dataset using their own predictions on a large pool of unlabeled data. Triguero et al. [14] proposed an in-depth taxonomy of self-labeled algorithms based on their main characteristics and conducted an exhaustive study of their classification efficacy on various datasets.
Ensemble learning (EL), developed over the last decades, is another way of obtaining higher classification accuracy. The main objective of this methodology is to combine several prediction models in order to build a more accurate model than any single one. Furthermore, algorithms which hybridize SSL and EL approaches constitute a recent methodology in which the two approaches can benefit each other, building more robust classification algorithms and leading to even better classification results [17,18].
In this work, we propose a new ensemble-based self-labeled algorithm, called iCST-Voting, for gender recognition by voice. This algorithm combines the individual predictions of three of the most popular and efficient self-labeled methods, i.e., Co-training, Self-training, and Tri-training, each utilizing an ensemble of classifiers as base learner. Our experimental results reveal the efficiency of this algorithm compared against state-of-the-art self-labeled algorithms.
The remainder of this paper is organized as follows: Section 2 presents a synopsis of related work on gender recognition by voice. Section 3 presents a brief description of the self-labeled algorithms. Section 4 presents the proposed classification algorithm. Section 5 presents the datasets and our experimental results. Finally, Section 6 presents our concluding remarks and some directions for future research.
2. Related Work
During the last decades, machine learning models and data mining techniques have been widely utilized for gender recognition by voice. These prediction models can identify the gender of a person by utilizing various features such as the length of the vocal folds, gait and speech. More specifically, the acoustic properties acquired from voice and speech signals like duration, intensity and frequency can be used as features to recognize the gender of the speaker. A number of studies have been carried out in recent years; some useful outcomes of them are briefly presented below.
Maka et al. [10] used 630 speakers, 438 males and 192 females, in their experiments on the gender identification problem in different acoustical environments (indoor and outdoor auditory scenes). In addition, for the evaluation stage each sentence was mixed with several types of background noise. They found that non-linear smoothing increases the classification accuracy by 2% and the recognition accuracy obtained was 99.4%.
Bisio et al. [4] developed an Android SPEech proCessing plaTform as smaRtphone Application (SPECTRA) for gender, speaker and language recognition by utilizing multiple unsupervised support vector machine classifiers. An interesting and innovative point in this work is the dynamic training with the features extracted from every user who has SPECTRA installed on their personal Android smartphone. This can lead to more robust classifiers with higher classification accuracy, resulting in better recognition performance.
Pahwa et al. [1] proposed a recognition system to determine gender using speech samples of 46 speakers. In particular, they extracted one of the most dominant and most researched speech features, the Mel coefficients, together with their first and second order derivatives. Their proposed model consists of a support vector machine and a neural network classifier combined using a stacking methodology. The classification accuracy obtained from their numerical experiments was 93.48%.
Pribil et al. [19] proposed a two-level Gaussian Mixture Model (GMM) algorithm to recognize age and gender. Their proposed classifier was first verified for detection of four age categories (child, young, adult, senior) and for recognizing the gender for all but children’s voices in the Czech and Slovak languages. The prediction accuracy on gender identification was above 90%. In a similar work, Pribil et al. [20] developed a two-level GMM classifier to detect age and gender. The classification accuracy achieved on gender recognition was 97.5%. Furthermore, the obtained gender and age classification accuracy results were compared with the results achieved by the conventional listening test, which is a method for evaluating the quality of synthetic speech [21].
Buyukyilmaz et al. [9] utilized a multilayer perceptron deep learning model using the acoustic properties of voice and speech to identify the voice gender. The dataset they utilized for their experiments consisted of 3168 recorded samples of human voices. Their classification model achieved 96.74% accuracy. Additionally, they designed a web page which detects the gender of a voice by utilizing the obtained model.
Zvarevashe et al. [11] proposed a gender voice recognition technique utilizing feature selection through random forest recursive feature elimination together with the gradient boosting machines (GBMs) algorithm for gender classification. Acoustic features were collected from a public gender voice dataset including 1584 males and 1584 females. The GBMs algorithm obtained an accuracy of 97.58% without feature selection, while with feature selection it achieved almost 100%.
3. On Semi-supervised Self-Labeled Classification Algorithms
In this section, we present a short description of the most popular self-labeled classification algorithms proposed in the literature.
Suppose that $\{x_1, \dots, x_m\}$ is a set of instances, where each example $x_i$ is an n-dimensional vector of n features belonging to a class y and $x_i$ is the i-th instance of m total instances. Then, let us assume that a training set is composed of a labeled set L of instances where y is known and of an unlabeled set U of instances where y is unknown with and also a test set T that consists of unseen instances where y is unknown. In general, the basic objective of self-labeled algorithms is to classify unlabeled data based on the most confident predictions in order to augment the initial training set.
The self-labeled methods are divided into two main categories, Self-training [22] and Co-training [23]. In Self-training, one classifier is initially trained on a labeled training set and is used to make predictions for the examples of an unlabeled set, where the most confident predictions are added to the labeled training set. Next, the classifier is re-trained on the new enlarged labeled set and this procedure is repeated until some stopping criterion is met, such as having no unlabeled examples left. The advantage of the Self-training algorithm is its simplicity, while it is at the same time one of the most efficient self-labeled algorithms. However, one disadvantage is the possible incorporation of noisy examples into the labeled training set. Co-training is a multi-view algorithm, meaning that the feature space is divided into two conditionally independent views. Two classifiers are initially trained separately on each view using the labeled set and then, following an iterative procedure, each classifier adds its most confident predictions to the training set of the other. Most self-labeled methods are influenced by these two algorithms, while some of them are also based on ensemble techniques. Democratic-Co learning, SETRED, Tri-training, Co-Forest and Co-Bagging are some other efficient self-labeled algorithms proposed in the literature.
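The Self-training loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the `NearestCentroid` base learner, its distance-based confidence score, the `threshold` of 0.7 and the `max_iter` limit are all hypothetical choices made only to keep the example self-contained.

```python
from collections import defaultdict
import math


class NearestCentroid:
    """Tiny stand-in base learner (an assumption, not the paper's choice)."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
            counts[yi] += 1
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_with_confidence(self, x):
        # Hypothetical confidence: softmax over negative centroid distances.
        weights = {c: math.exp(-math.dist(x, m)) for c, m in self.centroids.items()}
        label = max(weights, key=weights.get)
        return label, weights[label] / sum(weights.values())


def self_training(L_X, L_y, U_X, threshold=0.7, max_iter=20):
    """Iteratively move confidently predicted unlabeled examples into L."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    model = NearestCentroid().fit(L_X, L_y)
    for _ in range(max_iter):
        if not U_X:
            break  # stopping criterion: no unlabeled examples left
        remaining, added = [], 0
        for x in U_X:
            label, conf = model.predict_with_confidence(x)
            if conf >= threshold:
                L_X.append(x)
                L_y.append(label)
                added += 1
            else:
                remaining.append(x)
        if added == 0:
            break  # no confident prediction in this pass: stop early
        U_X = remaining
        model = NearestCentroid().fit(L_X, L_y)  # re-train on the enlarged set
    return model, L_X, L_y
```

Low-confidence examples stay in the unlabeled pool, which is one simple way to limit the incorporation of noisy examples mentioned above.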
Democratic-Co learning [24] is a single-view algorithm which utilizes multiple classifiers to predict the labels of unlabeled examples, incorporating a majority voting strategy and a mean confidence measurement for the majority and the minority of classifiers which disagree. SETRED [25] is a self-labeled algorithm which incorporates data editing in the self-training framework in order to learn actively from the self-labeled examples. More specifically, data editing is a method which improves the quality of the training set by identifying and eliminating the mislabeled examples acquired from a self-training procedure with the help of local information in a neighborhood graph. The Tri-training algorithm [26] is a single-view algorithm which utilizes three classifiers that teach each other based on a majority voting strategy, thereby avoiding the measurement of the labeling confidence of the classifiers, a process which is sometimes time-consuming and quite complicated. In particular, three classifiers are trained on data subsets generated through bootstrap sampling from the initial training set, and are then used to label the instances of the unlabeled set. If two classifiers agree on the label of an unlabeled example, then the example is labeled for the third classifier too. Motivated by the previous works, Li & Zhou [27] proposed the Co-Forest algorithm. This algorithm utilizes random trees which are trained on bootstrap data from the dataset, assigning a few unlabeled examples to each random tree. The final decision is made by a majority vote. Two advantages compared with the other self-labeled algorithms are that no physical connection between the attributes is required, because of the random selection of features from the basic feature vector, and that its performance fluctuates less when only a small number of labeled examples is provided. The Co-Bagging algorithm [28] utilizes several base classifiers trained on bootstrap data created by random resampling with replacement from the training set. Each sample contains around two thirds of the training set. One advantage is that this algorithm works well for unstable learning algorithms, where a small change in the input training data can cause a significantly large change in the output hypothesis.
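The "around two thirds" figure for a bootstrap sample follows from a standard argument, sketched here under the assumption that each sample S is built by drawing m instances uniformly with replacement from a training set of m instances (the symbols S and m are introduced only for this illustration):

```latex
P(x_i \in S) \;=\; 1 - \left(1 - \frac{1}{m}\right)^{m}
\;\xrightarrow{\; m \to \infty \;}\; 1 - e^{-1} \;\approx\; 0.632
```

That is, each bootstrap sample is expected to contain roughly 63% of the distinct training instances, with the remainder being repeated draws.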
4. iCST-Voting
During the last decade, research has focused on incorporating ensemble learning techniques in the self-labeled framework in order to build powerful classifiers. Recently, Livieris et al. [29] utilized an ensemble of classifiers as base learner to increase the efficiency of semi-supervised self-labeled methods. In another study, Livieris et al. [17] proposed the CST-Voting algorithm, which exploits the individual predictions of three of the most efficient self-labeled algorithms, i.e., Co-training, Self-training and Tri-training, utilizing simple majority voting.
Motivated by these works, we attempt to go a step further by hybridizing these approaches, improving the classification efficiency of CST-Voting by incorporating an ensemble of classifiers as base learner in all its component self-labeled algorithms. This is the main novelty of the present work over CST-Voting [17], and it results in increased prediction accuracy of iCST-Voting over CST-Voting. A high-level description of this framework, which is composed of two phases (training and voting), is presented in Table 1.
In the training phase, the self-labeled algorithms which constitute iCST-Voting are trained using an ensemble of supervised classifiers E as base learner and the same labeled L and unlabeled U datasets (Steps 1–3). Subsequently, the trained classifiers $C_{Co}$, $C_{Self}$ and $C_{Tri}$ are constructed using the Co-training, Self-training and Tri-training algorithms, respectively.
Next, in the voting phase, each trained self-labeled classifier is applied to each unlabeled example x of the test set T. Let $h_{Co}(x)$, $h_{Self}(x)$ and $h_{Tri}(x)$ be the hypotheses of the classifiers $C_{Co}$, $C_{Self}$ and $C_{Tri}$ on x, respectively (Steps 6–8). The final hypothesis on x is defined by combining the individual predictions $h_{Co}(x)$, $h_{Self}(x)$ and $h_{Tri}(x)$ using a majority voting methodology (Step 9).
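The voting phase can be illustrated with a short sketch. The function names `majority_vote` and `icst_vote` are hypothetical, and the three trained self-labeled classifiers are assumed to be plain callables that map an example to a class label; this is not the Java/WEKA implementation used in the experiments.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted most often.

    With three voters and two classes a tie cannot occur; in other settings
    ties go to the label that was seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


def icst_vote(classifiers, x):
    """Combine the hypotheses of the trained self-labeled classifiers on x."""
    return majority_vote([clf(x) for clf in classifiers])
```

Because the final label only depends on the three individual hypotheses, the component classifiers can be trained independently during the training phase.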
5. Numerical Experiments
In this section, we present a series of experiments in order to evaluate the performance of the proposed algorithm iCST-Voting for gender classification from voice. The implementation was written in Java, using the WEKA 3.9 Machine Learning Toolkit [30].
Our numerical experiments took place in two distinct phases. In the first phase (Section 5.2), we evaluate the performance of iCST-Voting against its component self-labeled algorithms (Self-training, Co-training and Tri-training) and the state-of-the-art self-labeled algorithms SETRED, Co-Bagging, Democratic-Co learning and Co-Forest. In the second phase (Section 5.3), we compare the performance of the proposed algorithm iCST-Voting against classical supervised algorithms.
Table 2 reports the configuration parameters of all evaluated self-labeled algorithms. In our initial experiments, we tried several choices of parameter values; however, the classification performance was usually worse, and only in a few cases was there a marginal improvement. The performance of the classification algorithms was evaluated using the performance metrics F-measure ($F_1$) and accuracy (Acc).
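For reference, the two metrics can be computed from predicted and true labels as in the sketch below. It assumes the standard binary definitions (accuracy as the fraction of correct predictions, F-measure as the harmonic mean of precision and recall); which class is treated as positive and whether the scores are averaged over classes is not stated here, so the `positive` argument is an assumption of this illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive):
    """Return (accuracy, F-measure) for one class treated as positive."""
    tp = fp = fn = correct = 0
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct += 1
        if p == positive and t == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # false positive
        elif t == positive:
            fn += 1  # false negative
    acc = correct / len(y_true)
    denom = 2 * tp + fp + fn
    # F1 = 2PR / (P + R), which simplifies to 2TP / (2TP + FP + FN).
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1
```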
5.1. Dataset
The efficiency of all self-labeled algorithms was evaluated using the Voice gender dataset and the Deterding dataset [31].
Voice gender dataset (https://www.kaggle.com/primaryobjects/voicegender/home). This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. It consists of 3168 recorded voice samples, collected from male and female speakers. The voice samples were collected from
- The Harvard-Haskins Database of Regularly-Timed Speech.
- Telecommunications & Signal Processing Laboratory Speech Database at McGill University.
- VoxForge Speech Corpus.
- Festvox CMU-ARCTIC Speech Database at Carnegie Mellon University.
Each voice sample is stored as a .WAV file, which is then pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0–280 Hz (human vocal range).
Deterding dataset (https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Vowel+Recognition+-+Deterding+Data)). This dataset consists of the steady-state portion of 11 vowels in British English, spoken in the context of h*d. The recorded speech samples were low-pass filtered at 4.7 kHz before being digitised at 10 kHz with a 12-bit resolution, and the steady-state portion of the vowel in each utterance was partitioned into six 512-sample Hamming-windowed segments. Next, linear predictive analysis was performed and the linear prediction reflection coefficients were calculated and used to generate 10 log area parameters. These parameters were recorded and constituted a 10-dimensional input space. In total, the dataset consists of 990 recorded voice samples, 528 from male and 462 from female speakers.
All algorithms were evaluated using the stratified 10-fold cross-validation and in order to study the influence of the amount of labeled data, five different ratios (R) of the training data were used, i.e., 10%, 20%, 30%, 40% and 50%.
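One possible way to set up this protocol is sketched below. It assumes that, within each cross-validation round, a class-stratified fraction R of the training folds keeps its labels (the set L) while the labels of the remaining training instances are hidden (the set U); the exact splitting procedure used in the experiments is not described here, and the function names and the fixed `seed` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal round-robin within each class
    return folds


def split_labeled(train_idx, labels, ratio, seed=0):
    """Split training indices into a labeled part (fraction `ratio` of each
    class) and an unlabeled part whose labels are treated as hidden."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    labeled, unlabeled = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(ratio * len(indices))
        labeled.extend(indices[:cut])
        unlabeled.extend(indices[cut:])
    return labeled, unlabeled
```

In each round one fold would serve as the test set T, and the remaining nine folds would be passed to `split_labeled` with R set to 0.1, 0.2, 0.3, 0.4 or 0.5.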
5.2. Performance Evaluation of iCST-Voting against State-of-the-Art Self-Labeled Algorithms
Next, we focus on the experimental analysis evaluating the classification performance of iCST-Voting against the most efficient and frequently utilized self-labeled algorithms. The proposed algorithm iCST-Voting utilizes an ensemble classifier as base learner, which combines the individual predictions of the sequential minimal optimization (SMO) algorithm [32], the k-Nearest Neighbor (kNN) algorithm [33] and the C4.5 decision tree algorithm [34] using majority voting. These supervised classifiers are probably among the most effective and popular machine learning algorithms for classification problems [35]. Additionally, in order to present a complete performance evaluation of iCST-Voting, Self-training, Co-training and Tri-training utilized as base learner the ensemble classifier used by iCST-Voting, similar to [29]. The configuration parameters for all supervised classifiers were set as in [14] and are also reported in Table 3 for completeness. It is worth noting that SMO was implemented using the Pearson VII function-based universal kernel instead of the classical polynomial kernel, which significantly improved its performance.
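For readers unfamiliar with the Pearson VII function-based universal kernel (PUK), the sketch below evaluates its commonly cited form for two feature vectors. The parameter names `omega` and `sigma` and their values of 1.0 are illustrative assumptions; the values actually used in the experiments are those reported in Table 3.

```python
import math


def puk_kernel(x, z, omega=1.0, sigma=1.0):
    """Pearson VII universal kernel between two feature vectors.

    K(x, z) = 1 / (1 + (2 * ||x - z|| * sqrt(2**(1/omega) - 1) / sigma)**2) ** omega
    """
    dist = math.dist(x, z)  # Euclidean distance ||x - z||
    base = 1.0 + (2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma) ** 2
    return 1.0 / base ** omega
```

The kernel equals 1 for identical vectors and decays with distance; `sigma` controls the width of the peak and `omega` the shape of its tails.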
Table 4 presents the performance evaluation of the iCST-Voting against the state-of-the-art self-labeled algorithms Self-training, Co-training, Tri-training, SETRED, Co-Bagging, Democratic-Co learning, Co-Forest on the voice gender dataset. Notice that the highest classification performance for each labeled ratio and performance metric is highlighted in bold. It is worth mentioning that the aggregated results demonstrate that iCST-Voting is the most efficient and robust method, independent of the utilized ratio of labeled instances in the training set.
Table 5 reports the performance of Self-training, Co-training, Tri-training, SETRED, Co-Bagging, Democratic-Co learning, Co-Forest and iCST-Voting on the Deterding dataset. As mentioned above, the accuracy measure of the best performing algorithm is highlighted in bold. Observations similar to those for the previous benchmark can be made. Clearly, iCST-Voting was by far the most efficient and robust method, demonstrating 1.21%–5.66% better classification accuracy, independent of the utilized ratio of labeled instances in the training set.
The statistical comparison of several classification algorithms over multiple datasets is fundamental in machine learning and is frequently performed by means of a statistical test [17,18]. We are interested in testing the hypothesis that all the algorithms perform equally well for a given level of significance, based on their classification accuracy, and in highlighting significant differences between our proposed algorithm and the classical self-labeled algorithms; we therefore utilized the non-parametric Friedman aligned ranking (FAR) test [36]. Additionally, in order to identify which algorithms report significant differences, the Finner test [37] is applied as a post-hoc procedure. It is worth mentioning that the control algorithm for the post-hoc test is determined by the best (lowest) ranking obtained in each FAR test. Furthermore, the adjusted p-value with Finner’s test ($p_F$) is presented based on the corresponding control algorithm at the $\alpha$ level of significance, and the post-hoc test rejects the hypothesis of equality when the value of $p_F$ is less than the value of $\alpha$.
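To make the post-hoc step concrete, the sketch below computes Finner-adjusted p-values from the unadjusted p-values of the comparisons against the control algorithm. It assumes the commonly used step-down form, in which the i-th smallest of the k-1 p-values is adjusted to 1 - (1 - p_i)^((k-1)/i) and monotonicity is then enforced; the function name is hypothetical and this is not the code used to produce Table 6.

```python
def finner_adjusted(p_values):
    """Finner step-down adjusted p-values for comparisons against a control.

    `p_values` holds the unadjusted p-values of the k-1 comparisons.
    The adjusted values are returned in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order, start=1):
        value = 1.0 - (1.0 - p_values[i]) ** (m / rank)
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

An algorithm is then considered significantly different from the control when its adjusted p-value falls below the chosen level of significance.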
Table 6 presents the statistical analysis performed by nonparametric multiple comparison procedures for Self-training, Co-training, Tri-training, CST-Voting and iCST-Voting. Table 6 illustrates that the proposed iCST-Voting algorithm reports the highest probability-based ranking, presenting statistically better results and outperforming the other self-labeled algorithms.
5.3. Performance Evaluation of iCST-Voting against Classical Supervised Algorithms
In the sequel, we evaluate the classification performance of iCST-Voting against the classical supervised classification algorithms: SMO, kNN and C4.5. Moreover, we compare the performance of iCST-Voting against the ensemble of classifiers (voting) which combines the individual predictions of the supervised classifiers utilizing a majority voting strategy.
Table 7 and Table 8 report the performance of the supervised algorithms SMO, kNN, C4.5 and Voting against iCST-Voting on the Voice gender dataset and the Deterding dataset, respectively, trained with different amounts of labeled data. As mentioned above, the highest classification performance for each labeled ratio and performance metric is highlighted in bold. The aggregated results show that iCST-Voting was by far the most efficient algorithm, since it exhibits the highest classification performance, independent of the utilized ratio of labeled instances in the training set. More specifically, iCST-Voting outperforms all classical supervised algorithms on both performance metrics and both datasets.
Table 9 presents the performance of the supervised algorithms SMO, kNN, C4.5, Voting and iCST-Voting on both classification benchmarks. It is worth noting that all supervised classifiers were trained using 100% of the training data, while iCST-Voting utilized only 50% of the training data. On the Voice gender dataset, the proposed algorithm iCST-Voting presents the best performance, outperforming all supervised algorithms. On the Deterding dataset, the best performance is achieved by kNN, while the proposed algorithm iCST-Voting slightly outperforms SMO and C4.5.
6. Conclusions
In this work, we utilized a semi-supervised algorithm, called iCST-Voting, for gender recognition by voice. The proposed algorithm constitutes an ensemble of the most popular self-labeled algorithms, i.e., Self-training, Co-training and Tri-training, utilizing an ensemble of classifiers as base learner. The contribution of our approach compared to other related approaches is that we utilize an ensemble of classifiers as base learner instead of the single learners normally used in self-labeled algorithms. Our preliminary numerical results and the presented statistical analysis demonstrate the efficiency of the proposed algorithm for gender recognition by voice compared against state-of-the-art self-labeled algorithms. Moreover, it presents competitive and sometimes better classification performance than classical supervised algorithms. Therefore, we conclude that reliable, stable and robust prediction models could be developed by the adaptation of ensemble techniques in the semi-supervised learning framework.
To examine possible scalability issues, we ran a set of experiments using two large datasets, the Pneumonia dataset [38] and the CT Medical dataset [39], and found no degradation in prediction accuracy with our proposed approach; only an expected small increase in training time was observed. Therefore, we can safely say that our proposed approach performs equally well even when the input data increases drastically. Moreover, we acknowledge the limitations on the generalizability of the presented results due to the use of only two freely available datasets. We intend to investigate this further in the near future.
Our future work is focused on improving the prediction accuracy of our framework by combining self-labeled algorithms with more sophisticated and theoretically motivated ensemble learning methodologies. Another interesting direction is extending our framework to handle big data with traditional techniques and platforms such as [40,41]. Finally, since our numerical experiments are quite promising, we intend to expand our experiments and apply the proposed algorithm to several audio datasets for speaker recognition.