CMU Sphinx Audio to Text
Delano de Ruiter
DelanodeRuiter@hotmail.com
Master information studies: Data Science
Faculty of Science
University of Amsterdam
2.2 Sentiment Analysis
Sentiment analysis deals with extracting subjective information from language. There are roughly two approaches [22]: the linguistic approach and the machine learning approach. The linguistic approach uses a dictionary of words with semantic scores. Every token of a text is matched against the dictionary and a score is calculated from the matching words. A dictionary has to be created first for this approach; this can be done manually.
In the machine learning approach a classification algorithm is presented with a series of feature vectors of previous data and, in the case of supervised classification, attached labels. Feature extraction is the step taken to go from text to a numerical representation. A machine learning algorithm is trained on data, and a held-out test set is used to measure its performance. The machine learning approach is usually more adaptable and more accurate [22]. For this study textual data is available as transcripts of speech.

2.3 Speech Sentiment Analysis
Sentiment analysis on speech can be done by detecting emotion from acoustic features [7] and by extracting textual sentiment from the audio [3]. Both approaches can be combined to give a single prediction. Textual sentiment has greater predictive power than acoustic features [7, 9].
Another approach is classification on the transcript of speech [3]. A number of classification algorithms were used, of which a Support Vector Machine (SVM) performed best. This work states that the transcripts represent just 44% of what was originally said, yet it is still able to cluster successfully on sentiment. A limitation of this work is the use of an artificially generated data set.
Differentiating speakers and performing sentiment analysis on each individual speaker is a viable approach [19]. A shortcoming of this work is, again, the artificially generated data set. An alternative argument is to keep the conversational structure and perform sentiment analysis on the whole conversation [3].
Besides the multiple options for transcripts, there are also multiple approaches to sentiment analysis on transcripts. For text, either the linguistic approach or the machine learning approach is used in research [1]. Both approaches are successful in predicting sentiment. The problem with a linguistic method is the manual evaluation and judgment of the lexicon.
In short, some research has been done on speech sentiment analysis, but more should be done using real data and a machine learning classification approach. Difficulties with audio sentiment analysis are: speech is often less structured, contains pauses and breaks, and does not always follow the rules of grammar. Also, the speech recognizer makes errors, which creates noise in the transcripts.

2.4 Sentiment Visualization
Sentiment visualization has matured from pie and bar charts to extensive visualizations and has become a notable topic of research [13]. Frequently used techniques in the literature are clustering and classification; comparison and creating an overview are also mentioned often [13].
An information need is not always present during exploration [6]. Therefore, a visual approach is helpful in the exploration of sentiment. A visual approach can help guide the user and has a positive effect on engagement. The results suggest that users spend more time performing tasks when using scatter plots, and that words distinctive for their respective class are found more easily.

2.5 Corpus Comparison
Comparing word frequencies in corpora can be done with statistical significance testing [8]. Commonly used tests are Pearson's χ² test, the log-likelihood ratio test, and the Wilcoxon rank-sum test. However, according to some research the χ² test is not suitable for text [12], because common words defeat the null hypothesis too easily. The log-likelihood ratio test is also problematic, since it is based on the assumption of independent samples, whereas words in text are not independent [8].
Word frequencies in transcripts can be tested for significant differences between the 'sale' and 'no sale' classes. The Wilcoxon rank-sum test can be applied to the total word frequencies in both classes. Individual words can also be tested on their frequency across multiple conversations.

3 METHODOLOGY
The approach presented in this paper applies automatic speech recognition (ASR) to recorded phone calls. This results in a transcript of all spoken words in the conversation. Three speech models are used to transcribe, and their performance is compared. The models are:
– CMU Sphinx: a model is trained with the use of the open-source toolkit CMU Sphinx.
– Google Speech API: a cloud solution with an existing model for Dutch.
– Kaldi-NL: an existing model for Kaldi is used on a server.
Text-based sentiment analysis is performed on the transcripts and results in a 'sale' or 'no sale' prediction. Classification of telephone calls that convert or not is used to find key differences between the two classes.
Corpus differences between the 'sale' and 'no sale' classes are visualized and tested for word frequency differences. The visualizations and measures are then shown in the results section.
The first subsection describes the creation of a model; after that the other two models are described. Then the approach to sentiment analysis is reported, followed by the approach to the visualizations.

3.1 Creating a speech recognition model with CMU Sphinx
Creating or adapting a model for a specific domain or acoustic environment might provide better performance than an existing model. Two such open-source systems are Kaldi and CMU Sphinx [2, 17]. Kaldi is only available for Linux, whereas CMU Sphinx is also available for Mac and Windows. Therefore, CMU Sphinx is chosen for this research.
A speech recognition model is trained on data consisting of speech audio with a transcript.
The model of this research is Dutch. For the best performance the acoustic environment should match between training and testing. The sampling rate should also be the same2.

3.1.1 Data. The Dutch audio used for training the model is the Spoken Wikipedia Corpora3. Wikipedia contains articles about individual topics, which makes for a diverse set of words. The Dutch Spoken Wikipedia Corpora contains 79 hours of word-aligned spoken language. The time alignment is not done for every word, but every word in the speech audio needs a transcript. Misalignment can occur when the speech contains utterances that are not in the transcript; the recognizer then hears audio and uses the wrong word for that utterance. This is solved by taking the time stamp of a word near the beginning of a sentence and one at the end of the sentence and taking all words in between. This creates sentence-length audio clips accompanied by the text of the sentence.

3.1.2 Preprocessing Audio. In CMU Sphinx audio clips must be roughly the length of one sentence. The audio clips are stored per article, so these are cut to the appropriate length. The reason for the one-sentence length is that longer audio clips might get out of sync with the text. When the words are no longer aligned with the speech, the wrong words are recognized for that speech. This can also be a problem when some words are spoken but missing from the text.
Audio clips cannot simply be cut at a set interval, because then words are cut in half. The audio clips must be cut on the time stamps of the words spoken in the clip and contain the words between those time stamps. The audio is cut on the time stamp and the words spoken in that clip are saved. Audio is cut using FFMPEG4.
CMU Sphinx uses the audio format '.WAV'. FFMPEG is used to convert audio formats. It is also necessary to have uniformity in the sample rate. Other sample rates can be used, but the training data must have the same sample rate as the recognized speech. The audio should also be mono, with a single channel.

3.1.3 Training the CMU Sphinx speech model. CMU Sphinx has a Dutch dictionary and language model available5. The acoustic model is trained on 13 hours of speech, which is usually far too little [20]. However, this model is still useful for adapting with the data discussed above. Adaptation of a model is suitable for increasing the accuracy of the model and adapting it to an acoustic environment. This means no model has to be built from scratch. With the language model, dictionary and training data in place, the model is trained.
The steps taken in training the model are as follows. An acoustic feature file is generated for every individual audio clip. The next step is collecting statistics from the adaptation data. After that, adaptation of the HMM is done. Two methods that are frequently used are Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR). A combination of the two is most successful [18]. Adaptation is therefore done using a combination of the two methods.
The trained model is able to transcribe phone call conversations it has not been trained on. The other two existing models can also be used to transcribe the same phone call conversations. These models can then be compared on errors and speed. How the other two models are used to transcribe is discussed below.

3.2 Speech Recognition on a server with Kaldi-NL
A model for Dutch is created with the Kaldi speech toolkit at the University of Twente6. The Kaldi toolkit, however, is only available on Linux. To make it available on Windows and Mac, a container is set up with Docker. Containerization solves the problem of system dependencies by providing the container with all its dependencies and abstracting it from the operating system. A port is opened on the Docker container to communicate with the host computer. A bind mount serves the purpose of sharing files between host and container. A decode script is called in the container environment with the audio and output directories as arguments.
Preparation of the transcription involves segmentation. The recognition is done with the use of a neural network. Rescoring is done as a last step to improve the recognition rate. The transcripts are saved as text for every phone call.

3.3 Cloud speech recognition with Google Speech API
Google Speech API7 is a cloud solution for speech recognition. A connection is made to the API with a Python script. The audio is sent to the cloud and a transcription is retrieved. A Dutch model is selected for this. Google Speech API has a synchronous and an asynchronous process. The synchronous process uses local files of at most one minute. The asynchronous process can handle longer audio files, but these must be stored in the Google cloud. Since phone calls are often longer than one minute, the asynchronous process is used to transcribe.
Although a Dutch model is available, there is no phone call acoustic model for Dutch. The acoustic environment of a phone call differs from, for example, a microphone recording. The effect of a mismatch in acoustic environment is worse recognition performance.

3.4 Speech Recognition Comparison
The evaluation of speech recognition is done with the textdistance8 package in Python. The Jaccard index and WER are calculated for the transcripts of the three speech systems in comparison to a manually adjusted transcript. One hour of phone calls is manually transcribed for evaluation.

3.5 Sentiment Analysis
The purpose of sentiment analysis on phone calls is creating insight into positive and negative calls. A positive call is a sale and a negative call is no sale. This is a binary classification problem. Three algorithms are used to predict the correct class.

2 https://cmusphinx.github.io/wiki/tutorialadapt/
3 https://nats.gitlab.io/swc/
4 https://ffmpeg.org/
5 https://sourceforge.net/projects/cmusphinx/files/
6 https://github.com/opensource-spraakherkenning-nl
7 https://cloud.google.com/speech-to-text/docs/
8 https://pypi.org/project/textdistance/
The algorithms are capable of finding distinguishing terms between classes. These terms are visualized and a comparison is made between the corpora. The data used for classification is discussed in the data subsection.

3.5.1 Data. The data for sentiment analysis are phone call conversations between an advisor and a customer. The subject of these phone calls is solar panels. The language spoken is Dutch. The conversations are about the placement of solar panels on roofs and everything related to it. Conversations do not follow guidelines and can go in many different directions.
Attributes of the phone calls are: sale or no sale, date, duration, customer name, advisor name, and direction of the call. Classification is done on the sale or no sale attribute.
Two groups of phone calls are made. The first consists of sales intakes, which are the first calls with a customer. The second is a random sample, in which some calls can be sales intakes or follow-up calls about sales. For both groups a thousand calls are selected.
The two thousand selected calls are automatically transcribed by the speech recognition system. Calls are roughly ten minutes long, which makes the total duration 166 hours for both groups. Table 1 shows the distribution between 'sale' and 'no sale' conversations.
Parameters are tuned for the best classification performance. The ROC curve gives insight into the true positive rate and the false positive rate. The true positive rate is the proportion of positives that are correctly identified as positive. The false positive rate is the proportion of negatives that are incorrectly identified as positive. With the use of the ROC curve the best tuned model is selected. The ROC curve for SVM is shown in figure 1.
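The construction of a ROC curve from classifier scores can be sketched in plain Python. The labels and scores below are toy values (1 = 'sale', 0 = 'no sale'), not results from this study; the paper's actual models are tuned and evaluated on the transcribed calls.

```python
# Toy sketch: compute ROC points (FPR, TPR) by sweeping the decision
# threshold over sorted classifier scores, then the area under the curve.

def roc_points(labels, scores):
    """Return (fpr, tpr) pairs for every score threshold, highest first."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label == 1:
            tp += 1  # a positive ('sale') call ranked above the threshold
        else:
            fp += 1  # a negative ('no sale') call ranked above the threshold
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]              # 1 = 'sale', 0 = 'no sale'
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical decision scores
pts = roc_points(labels, scores)
print(auc(pts))  # prints the area under the curve (8/9 ≈ 0.889 here)
```

Selecting the best-tuned model then amounts to comparing such curves (or their areas) across parameter settings.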
The odds ratio is a measure of association between two groups. The log-odds ratio is calculated by dividing the relative word occurrence in 'sale' by the relative word occurrence in 'no sale' and taking the log.
With the use of Scattertext, terms can also be queried and the transcripts that contain a term can be found. When, for example, a word is predominantly positive or negative, the phone calls that contain this word can be found.

3.8 Corpus Comparison
To determine a difference in word frequency between both conversation classes a statistical test is used. According to the literature the Wilcoxon rank-sum test is suitable for comparing word frequencies between two corpora [8, 12].
The Wilcoxon rank-sum test serves to test the significance of word frequencies in 'sale' and 'no sale' conversations. The test is performed on words with their total frequency in the 'sale' or 'no sale' group. The test is also performed on individual words and their frequency in a thousand individual conversations.
The reason that the test is not done on every word for every conversation is that this would introduce the multiple comparisons problem. This problem is avoided here by not testing all words individually.
Classification delivers a list of 1500 words. For SVM the list contains 750 positive and 750 negative coefficients. The positive and negative words are both tested: the word occurrence for 'sale' and 'no sale' is counted and tested. Positive and negative coefficients are split in the test to measure a difference between the two.

4 RESULTS
The results section follows the structure: speech recognition evaluation, classification performance, sentiment visualization, and corpus comparison. Measurements are given of speech recognition and classification performance. Visualizations are shown in multiple figures. The significance of word differences is given in tables. Findings will be stated and problems encountered will be discussed.

4.1 Speech Recognition Evaluation
The speech recognition models compared in this paper are: Google Speech API for Dutch, the open-source Kaldi-NL speech model, and a model trained in CMU Sphinx. These models work differently and have different attributes. First the performance measures are given; after that, other differences are stated.
Different speech recognition models make different errors. The systems are compared on word error rate (WER) and Jaccard index, as mentioned in the related literature. Table 2 gives the performance measures on phone call conversations.
As can be seen in table 2, Kaldi-NL has the lowest error rate and captures most of the original conversation. Besides these metrics there are other differences between the systems. Some differences are: speed of computation, access and ease of use, additions to the model, and format of the output. The models are each discussed in the sections below.

4.1.1 CMU Sphinx. A problem encountered during training is that the training data mostly consists of a few speakers. The Spoken Wikipedia Corpus has 145 speakers, but a few top speakers contributed a large part of the corpus. When training on just a few people, the model fits to those speakers. Dictation for many speakers requires training data from many people. Therefore the performance of this system is not optimal.
There is no segmentation to handle speakers talking at the same time. The speed of dictation also lags behind the other systems. For recognition CMU Sphinx uses '.WAV' files. These are uncompressed and can be multiple times larger than the '.mp3' files that the other systems are able to use for recognition. There are other factors that contribute to speed, but this is difficult to compare, since Google Speech API is only available in the cloud, which is not comparable to local recognition.

4.1.2 Kaldi-NL. One limitation of Kaldi is its availability on Linux only. This problem was solved by running Kaldi in a Docker container. A connection can be made to the container from any device that has network capabilities.
One way the performance of Kaldi increases is segmentation. Individual speakers are partitioned into segments before recognition is started. The problem of speakers speaking at the same time is solved this way.
It is difficult to compare the speed of a local system, a server, and a cloud solution because of different hardware. An advantage of using Kaldi in Docker, however, is scalability. A Docker environment can scale to multiple machines in a cluster, so the speed of dictation is easily adjusted by the scale of the cluster.

4.1.3 Google Speech API. Google Speech API is a cloud solution. The advantage is ease of implementation. The drawback of an API connection is the limited possibility of adjustments. The Dutch model of Google does not have diarization and there is also no phone call acoustic model available; these features are available for the English model.
The output of the dictation is significantly shorter for Google than for the other speech systems. The number of words in the evaluation is just 26% of Kaldi-NL. However, the words that are in the output capture 22.8% of the spoken text, as can be seen in Table 1. This suggests that Google only outputs words it is confident about above a certain threshold.
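The evaluation in section 3.4 computes WER and the Jaccard index with the textdistance package. As a rough stdlib-only sketch of what these two metrics measure on word level (the Dutch example sentences are invented, not transcripts from the data set):

```python
# Sketch of the two transcript metrics: WER as word-level edit distance
# divided by reference length, and the Jaccard index on word sets.

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / len(ref)

def jaccard(reference, hypothesis):
    """Jaccard index: overlap of the word sets of both transcripts."""
    a, b = set(reference.split()), set(hypothesis.split())
    return len(a & b) / len(a | b)

ref = "de zonnepanelen worden op het dak geplaatst"
hyp = "de zonnepanelen worden op een dak geplaatst"
print(round(wer(ref, hyp), 3), round(jaccard(ref, hyp), 3))  # prints: 0.143 0.75
```

Note that WER penalizes word order and repeated errors, while the Jaccard index only measures vocabulary overlap, which is why both are reported in table 2.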
REFERENCES
[1] Birgitta Ojamaa, Päivi Kristiina Jokinen, and Kadri Muischnek. 2015. Sentiment analysis on conversational texts. NODALIDA (2015).
[2] Christian Gaida, Patrick Lange, Rico Petrick, Patrick Proba, Ahmed Malatawy, and David Suendermann-Oeft. 2014. Comparing Open-Source Speech Recognition Toolkits. DHBW (2014).
[3] Souraya Ezzat, Neamat El Gayar, and Moustafa M. Ghanem. 2010. Investigating Analysis of Speech Content through Text Classification. International Conference of Soft Computing and Pattern Recognition (Dec. 2010), 105–110. https://doi.org/10.1109/SOCPAR.2010.5686000
[4] G V Garje, Apoorva Inamdar, Apeksha Bhansali, Saif Ali Khan, and Harsha Mahajan. 2016. Sentiment Analysis: Classification and Searching Techniques. IRJET 3, 4 (April 2016), 2796–2798.
[5] Mark Gales and Steve Young. 2007. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing 1, 3 (Jan. 2007), 195–304. https://doi.org/10.1561/2000000004
[6] Eduardo Graells-Garrido, Mounia Lalmas, and Ricardo Baeza-Yates. 2016. Sentiment Visualisation Widgets for Exploratory Search. Social Personalization Workshop (Jan. 2016).
[7] David Griol, José Manuel Molina, and Zoraida Callejas. 2019. Combining speech-based and linguistic classifiers to recognize emotion in user spoken utterances. Neurocomputing 326–327 (Jan. 2019), 132–140. https://doi.org/10.1016/j.neucom.2017.01.120
[8] Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. 2014. Significance testing of word frequencies in corpora. Literary and Linguistic Computing 31, 2 (Dec. 2014), 374–397. https://doi.org/10.1093/llc/fqu064
[9] Jia Sun, Weiqun Xu, Yonghong Yan, Chaomin Wang, Zhijie Ren, Pengyu Cong, Huixin Wang, and Junlan Feng. 2016. Information Fusion in Automatic User Satisfaction Analysis in Call Center. IHMSC (Aug. 2016). https://doi.org/10.1109/IHMSC.2016.49
[10] Jason Kessler. 2017. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations (April 2017).
[11] Eva Kiktova and Jozef Juhar. 2015. Comparison of Diarization Tools for Building Speaker Database. Information and Communication Technologies and Services 13, 4 (Nov. 2015). https://doi.org/10.15598/aeee.v13i4.1468
[12] Adam Kilgarriff. 2011. Comparing Corpora. International Journal of Corpus Linguistics 6, 1 (Nov. 2011), 97–113. https://doi.org/10.1075/ijcl.6.1.05kil
[13] Kostiantyn Kucher, Carita Paradis, and Andreas Kerren. 2017. The State of the Art in Sentiment Visualization. Computer Graphics Forum 37, 1 (June 2017), 71–96. https://doi.org/10.1111/cgf.13217
[14] Veton Këpuska and Gamal Bohouta. 2017. Comparing Speech Recognition Systems (Microsoft API, Google API and CMU Sphinx). IJERA 7, 3 (March 2017). https://doi.org/10.9790/9622-0703022024