Open Access Article

From Customer’s Voice to Decision-Maker Insights: Textual Analysis Framework for Arabic Reviews of Saudi Arabia’s Super App

Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

Authors to whom correspondence should be addressed.

Appl. Sci. 2024, 14(16), 6952; https://doi.org/10.3390/app14166952

Submission received: 1 July 2024 / Revised: 21 July 2024 / Accepted: 23 July 2024 / Published: 8 August 2024

(This article belongs to the Section Computing and Artificial Intelligence)


Figure 1. The Arabic Super App User Experience Framework (ASA-UXF).
Figure 2. Sentiment distribution of the sample dataset.
Figure 3. Sentiment distribution of the Mrsool dataset.
Figure 4. The most used words in the Mrsool dataset: (a) word frequencies and (b) word cloud.
Figure 5. Monthly number of reviews combined with users’ sentiments for Mrsool.
Figure 6. The most used words for (a) positive, (b) negative, and (c) neutral sentiment in Mrsool.
Figure 7. The extracted (a) topic 0, (b) topic 1, (c) topic 2, and (d) topic 3 using BTM.
Figure 8. Topics’ distribution in the Mrsool dataset.
Figure 9. Topic-based sentiment analysis of the Mrsool dataset.
Figure 10. The elbow method for (a) topic 0, (b) topic 1, (c) topic 2, and (d) topic 3.
Figure 11. Clustering topics using (a) Euclidean distance and (b) cosine similarity.
Figure 12. Clustering topics using (a) POS with Euclidean distance and (b) POS with cosine similarity.
Figure 13. Topics’ distribution in the Careem dataset.
Figure 14. Topic-based sentiment analysis of the Careem dataset.


Abstract

Recently, business sectors have focused on offering a wide variety of services through utilizing different modern technologies such as super apps in order to fulfill customers’ needs and create a satisfactory user experience. Accordingly, studying the user experience has become one of the most popular trends in the research field due to its essential role in business prosperity and continuity. Thus, many researchers have dedicated their efforts to exploring and analyzing the user experience across social media, blogs, and websites, employing a variety of research methods such as machine learning to mine users’ reviews. However, there are limited studies concentrated on analyzing super app users’ experiences and specifically mining Arabic users’ reviews. Therefore, this paper aims to analyze and discover the most important topics that affect the user experience in the super app environment by mining Arabic business sector users’ reviews in Saudi Arabia using biterm topic modeling, CAMeL sentiment analyzer, and doc2vec with k-means clustering. We explore users’ feelings regarding the extracted topics in order to identify the weak aspects to improve and the strong aspects to enhance, which will promote a satisfactory user experience. Hence, this paper proposes an Arabic text annotation framework to help the business sector in Saudi Arabia to determine the important topics with negative and positive impacts on users’ experience. The proposed framework uses two approaches: topic modeling with sentiment analysis and topic modeling with clustering. As a result, the proposed framework reveals four important topics: delivery and payment, customer service and updates, prices, and application. The retrieved topics are thoroughly studied, and the findings show that, in most topics, negative comments outweigh positive comments. These results are provided with general analysis and recommendations to help the business sector to improve its level of services.

Keywords: machine learning; super app; user experience; topic modeling; biterm topic model; sentiment analysis; clustering; data mining; Arabic data annotation

1. Introduction

In the current era, smartphones are the most widely used portable devices owing to their light weight and their capacity to host a large number of useful applications. In 2024, the number of global smartphone users is expected to reach 7.13 billion, meaning that 90.33% of the global population will utilize mobile applications [1]. All user segments, both business and consumer, use a range of smartphone applications for various purposes and objectives. Consequently, many users spend a substantial part of their day using smartphone applications: on average, 4 to 5 hours per day. Saudi Arabians in particular use smartphone apps for 5.3 hours a day, placing them third in the world for smartphone application use in 2023, after Brazil and Indonesia [2]. This resulted in Saudi Arabia’s app market revenue growing to reach USD 1866 million in 2024, a 13.2% rise from the year before, as indicated by Statista [3]. Equally important, 90% of consumers may decide not to use an application again after a negative user experience, resulting in a 35% loss of sales worth USD 1.4 trillion [4]. This reinforces the importance of user experience, which concerns the user’s interaction with the product or service in terms of emotion, feeling, value, and importance [5].

In the same context, Gartner Inc. (Stamford, CT, USA), a technology research and consulting firm in the United States, announced that the super app had become one of the top ten emerging technologies for 2022 [6]. According to a January 2022 survey, a successful super app may reach 71 million users on day one in the US, resulting in an annual expenditure of over USD 2.23 billion [7]. To clarify, super applications, or super apps, consolidate several providers’ services into a single mobile app, eliminating the need for numerous sign-ins and passwords [8]. This characteristic makes companies more likely to offer these integrated services to suit a wide range of customer demands. As a result, researchers have focused on examining super apps from a variety of perspectives, including user behavior, privacy, security, and user experience [9,10,11,12].

Although super apps offer a good user experience by merging several services into a single mobile application, customer satisfaction and experience can still be affected by a variety of factors. A poor user experience causes substantial losses for any business, and its impact is magnified in a super app, which combines many services in one place: the losses extend beyond the owner of the super app to every service provider it hosts. To obtain a competitive advantage, companies and decision-makers must therefore prioritize creating a pleasant experience for users and improving service quality. Mining customer reviews is regarded as a valuable way of discovering user experience elements and concerns [13]. Customer reviews are an important aspect of any online service or product since they provide useful information about the service’s quality and efficacy. In today’s competitive environment, these reviews can help businesses to find areas for improvement, providing a good source of information about the pros and cons of their services or products so that these can be improved quickly and effectively. To enhance the overall experience and usefulness of super applications, decision-makers could use reviews to estimate user experience; every USD 1 invested in enhancing the customer experience is reported to yield a return of USD 10 to 100, thus increasing revenue [14].

All of these facts highlight how crucial it is to analyze users’ experiences, especially those of super apps, to comprehend their needs and to avoid them having poor experiences. Hence, analyzing user experience will help to retain users and ensure their loyalty, which improves the application’s reputation and increases its profits.

Considering the significance of the user’s experience, several techniques have evolved to analyze it through textual reviews. One of these approaches is opinion mining or sentiment analysis, comprising numerous levels including document, sentence, and aspect analysis, focused on determining what parts of a review are negative, positive, or neutral [14]. Additionally, the topic modeling approach is used to analyze and extract the main ideas from a document; however, further studies are needed for its application in Arabic texts [15,16]. Meanwhile, the clustering approach uses the data distribution’s underlying structure to generate criteria for grouping data with similar characteristics [17].

All of these techniques have provided solutions to real-world problems. However, their application to Arabic still requires further exploration [15,16]. The Arabic language is the world’s sixth most spoken language, with over 400 million speakers, most of whom live in the Middle East and North Africa [18,19]. The Arabic language has three distinct categories: (1) classical Arabic, the language of the holy book of Islam (The Qur’an); (2) modern standard Arabic, which is a simplified variant of classical Arabic grammar that incorporates modern terminology; and (3) dialectal (colloquial) Arabic, which refers to all spoken varieties used in daily conversation and differs among Arab countries and among areas within the same country [20]. Several machine learning techniques therefore face challenges when applied to Arabic because of the language’s inherent ambiguities, dialectal diversity, morphological richness, and intricate morphosyntactic agreement rules, as well as the scarcity of Arabic resources [21]. As a result, the Arabic language remains an increasingly active research topic. While 59% of the current research analyzes social media content, especially on Twitter (https://twitter.com/ (accessed on 19 December 2021)) (now called X), just 4% of studies analyze business-related content [22]. This presents an opportunity for researchers to explore business-related content [22].

This research introduces a novel method to explore user experience through focusing on users’ reviews in the business sector super app environment by analyzing Arabic users’ reviews in the Kingdom of Saudi Arabia (KSA) using machine learning (ML) approaches. The main contributions of this research are as follows:

We propose a comprehensive framework that uses users’ reviews to provide decision-makers with a clear view of users’ opinions regarding their super app in a visualized way that will be easy and quick to understand. This is accomplished by introducing two machine learning approaches, namely, topic modeling with sentiment analysis and topic modeling with clustering.
To the best of our knowledge, this is the first research study to provide a framework for analyzing the Arabic user reviews of a business sector super app in the context of Saudi Arabia utilizing topic modeling, sentiment analysis, and clustering approaches.
We create an Arabic dataset collected from the Apple and Google Play Stores for some of the most well-known super applications found in the KSA.

The objectives of this paper are (1) to extract the most important topics that affect super applications in the field of business and (2) to measure the extent of users’ satisfaction and dissatisfaction with each extracted topic to explore the areas that require improvement.

The remaining sections of this work are structured as follows: Section 2 explores the relevant research; Section 3 discusses the methodology utilized to attain the proposed objectives; Section 4 highlights the experimental design and findings; Section 5 tests the proposed framework using the testing dataset; Section 6 presents the results and discussion; and finally, Section 7 offers this work’s conclusions and future research direction.

2. Related Work

In recent years, there has been a growing interest in using machine learning methods to analyze user experience by mining user reviews. Sentiment analysis, topic modeling, and clustering are the three main approaches that have been widely employed in this field.

2.1. Sentiment Analysis

Sentiment analysis aims to detect and extract public sentiment and viewpoints from text found in reviews, blogs, social media, news, forums, and other online sources. Businesses and brands therefore employ sentiment analysis to comprehend the social sentiment surrounding their product, service, or brand on social media or throughout the Internet. Generally, it is regarded as a machine learning task that examines texts for polarity, such as “negative”, “positive”, or “neutral”, as well as for more specific emotions, such as happiness, sadness, anger, interest, etc. [23].

Several researchers have used this method to assess user experience using reviews. Pilliang et al. [24] offered a framework for a recommendation system that the government could use to promote its Gov2Go super application in the Indonesian community, which replaced thousands of applications to support electronic governance. This methodology employed sentiment analysis based on expectation confirmation theory (ECT), which examines user satisfaction, to provide suggestions that take advantage of the user’s past contact with the product or service. They used long short-term memory (LSTM) and naïve Bayes (NB), resulting in accuracy rates of 97.49% and 94.22%, respectively. Aji et al. [25] conducted another study on super applications using sentiment analysis. They focused on the Fintech sector, which relates to financial technology that utilizes smartphones, software, and modern technology to provide financial services and handle or transfer money in a simpler, easier, and smarter way. The authors used sentence-level sentiment analysis on reviews collected for the OVO app in Tegal City in Indonesia to determine whether it is a good form of Fintech. They applied support vector machine (SVM) with particle swarm optimization (PSO), achieving an accuracy of 82.33%. While the preceding studies dealt with other languages, Al-Hagree et al. [26] worked with the Arabic language and assessed user opinion of eight mobile banking applications in Yemen using sentiment analysis. Using NB, SVM, K-nearest neighbor (KNN), and decision tree (DT), accuracy rates of 89.65%, 68.85%, 87.89%, and 56.02%, respectively, were obtained. In the same context, Hadwan et al. [27] reviewed six government applications (Tetaman, Tawakkalna, Sehhaty, Sehhah, Tabaud, and Mawid) that were introduced by the Saudi government during the COVID-19 pandemic to evaluate and assess user opinions and satisfaction using a sentiment analysis approach.
They used SVM, DT, NB, and KNN, with the KNN classifier achieving the best accuracy of 78.46%. In subsequent work [28], they improved on these results by optimizing the size of the dataset and utilizing several machine learning approaches. They balanced the data via the synthetic minority oversampling technique (SMOTE) and employed logistic regression (LR), random forest (RF), SVM, and bagging for classification. The highest accuracy, 94.38%, was attained using the SVM method. Likewise, Banjabi et al. [29] used sentiment analysis to analyze online government services using two approaches, namely, a lexicon-based and a manual annotation approach, to assess customer satisfaction with the Saudi Ministry of Commerce (MOC)’s commercial e-services. They gathered reviews from the “Commercial Report” app in the Apple Store, as well as Twitter mentions of customer support and official MOC accounts. Then, they cleaned the collected data. To apply the manual annotation technique, the researchers annotated the data as positive, neutral, or negative, yielding 1.9% positive, 91.2% neutral, and 6.8% negative. During the annotation process, they collected 140 negative and 140 positive terms to be used with the lexicon-based technique. They then tested these two techniques using three classifiers, NB, SVM, and LR, and found that the annotation method outperformed the lexicon-based method. The SVM classifier achieved the best accuracy of 94%. Meanwhile, Al-Smadi et al. [30] employed aspect-level sentiment analysis to examine customer opinions by providing annotated Arabic comments of restaurant corpora for four aspects: cleanliness, pricing, service, and food quality. They utilized a dataset of 3020 Arabic comments from the website of Jordan’s Jeeran restaurant, which were manually annotated using a crowdsourcing method.
They also preprocessed data, extracted features, and constructed eight classifier models (NB and SVM) for each aspect, followed by sixteen tests, four for each aspect, to assess them. As a result, SVM outperformed NB in every experiment in relation to pricing, cleanliness, food quality, and service, with accuracy rates of 84.47%, 76.95%, 77.05%, and 74.80%, respectively.

2.2. Topic Modeling

Topic modeling is an unsupervised learning approach used to analyze text and other unlabeled data [31]. It automatically infers the latent themes in a collection of documents, allowing for the organization, comprehension, and summarization of massive amounts of textual material. In addition, many researchers employ topic models to evaluate data since these models offer an understandable representation of documents that can be used in various natural language processing (NLP) tasks [32]. Kang et al. [33] determined the research areas that have been investigated in the field of biochemistry over the past 20 years and analyzed how these areas have evolved using the latent Dirichlet allocation (LDA) technique for topic modeling. Meanwhile, Sarker et al. [34] combined non-negative matrix factorization (NMF), a type of topic modeling approach, with deep neural networks in a recommendation system to study the relationship between users and products. In the same recommendation system domain, Rajendran et al. [35] integrated topic modeling with browser history and Wikipedia data to tailor users’ content through topic extraction, while Tushev et al. [36] employed keyATM, a topic modeling method, to extract significant information from app reviews. Abuzayed et al. [37] conducted an experimental study to evaluate an emerging topic modeling technique called BERTopic alongside NMF and LDA using the Arabic language. Their findings showed that BERTopic outperformed LDA and NMF.

To improve user experiences, Singh et al. [38] mined 104,109 reviews of ten Indian sites to learn about visitors’ experiences and evaluate data to enhance service quality. The data were obtained from the Indian version of the TripAdvisor website utilizing the Scrapy framework. They then preprocessed and tokenized the data before performing lexicon-based sentiment analysis with the AFINN Lexicon dictionary. They used NMF to derive the topics. Their results showed that five stars was the most popular rating, with the Taj Mahal being the most visited place. The topic modeling grouped topics based on the ten locations, resulting in various themes for each site. The extracted topics were mostly about activities, views, and provided services. The researchers in [39,40] studied the hospitality industry and employed topic modeling approaches to assess user experience. Singh et al. [39] investigated the primary themes of interest for Korean accommodation users in South Korea by analyzing 104,161 English reviews of online travel agencies obtained from Agoda, Booking.com, and hotel websites in 2018. These opinions were organized by accommodation location and type. Furthermore, they employed the LDA topic model to generate 14 topics and validate them in four phases. Meanwhile, Hu et al. [40] aimed to identify the causes of consumer complaints in an attempt to assist managers of hotels in New York City to enhance customer satisfaction, service quality, and revenue. They collected 27,864 English reviews from 315 different-grade hotels on the TripAdvisor website. The text was then preprocessed to employ the structural topic model (STM) to compare the topics included in positive and negative reviews. 
As a result, they identified 10 negative topics expressing customer dissatisfaction, which can be described as issues related to the location in low-end hotels, issues related to services and prices in high-end hotels, and issues related to booking, room type, and cancellation in all hotels. Also, Kiatkawsin et al. [41] used LDA to compare Airbnb feedback from Singapore and Hong Kong. They believed in the ability of topic modeling to uncover underlying themes from customer opinions. To acquire the dataset, they gathered Airbnb customer reviews in English from Singapore and Hong Kong. Following that, they removed any extraneous data from the dataset before stemming to the root. They obtained 74,829 reviews for Singapore and 104,803 reviews for Hong Kong. The ideal number of topics retrieved for Hong Kong was 12, whereas for Singapore it was 5, based on the grid search approach. They used two validation techniques to analyze the topics, namely, labeling the keywords within the topics, and then validating these labels using the top reviews for every topic. The authors concluded that the smaller number of themes did not result in a loss of knowledge. In terms of scope, the obtained topics appeared to be similar in both locations since they concentrated on unit, evaluation, location, and listing management.

By combining the previous techniques, topic modeling and sentiment analysis, Twil et al. [42] evaluated 39,200 English-language tourism opinions of Marrakech city collected over a 12-year period (2008–2020) from the TripAdvisor website to target Morocco’s tourist industry. They employed a topic-based sentiment analysis approach and collaborated with tourism experts to validate the derived topic labels. The data were cleaned, stemmed, and tokenized. They then employed term frequency-inverse document frequency (TF-IDF) vectorization to transform textual data into vectors of numbers that were used as the input for the LDA. As a result, they obtained four topics identified by experts as citizen behavior, Jamaâ-el-Fna’s atmosphere, shopping experience, and overall tourism experience. Furthermore, they used the TextBlob and VADER lexicon-based techniques to analyze the words in every topic dimension. The cross-tabulation approach was used to measure each topic’s valence, which was either positive or negative. The researchers then used the same extracted LDA terms to compare the joint sentiment-topic (JST) model, which performs sentiment analysis at the topic level, with the proposed model, utilizing both quantitative and qualitative methodologies. As a result, the lexicon-based models beat JST, with VADER and TextBlob achieving 72.6% and 77.3% accuracy rates, respectively, compared to 69.6% for JST.

2.3. Clustering

Clustering is an unsupervised machine learning approach that identifies and groups similar data points in large datasets without relying on predefined labels or outcomes. It divides objects into clusters such that the objects within a cluster are more similar to each other than to objects in other clusters. It is commonly used as a data analysis tool to identify patterns or interesting trends in data, such as groupings of consumers based on their behavior. Clustering may be used in a wide range of applications, including e-commerce, cybersecurity, mobile data processing, user modeling, health analytics, and behavioral analytics [23].

Several researchers have used clustering algorithms to mine user reviews. Lobo et al. [43] used public feedback on 46 commercially available applications found on the Apple and Google Play Stores to identify user experience concerns that will influence future app development for stroke caregiving. These researchers preprocessed and filtered the collected data to obtain 13,368 English reviews. Moreover, they applied the TF-IDF method and k-means clustering to extract 26 clusters classified into seven user experience dimensions: usefulness, usability, desirability, accessibility, findability, value of the app, and credibility. This study highlighted various user experience concerns that resulted from the app developers’ failure to grasp users’ demands. Furthermore, the study demonstrated the use of a participatory design method to encourage better knowledge of user demands, hence reducing difficulties and ensuring long-term usage. Also, Marzijarani et al. [44] used 564 English user reviews from two hotels on the TripAdvisor website for text summarization. These researchers cleaned the data and used part of speech (POS) tagging to obtain the total sentence similarity, which was calculated by multiplying sentiment similarity based on nouns and content similarity based on adjectives. Then, they applied k-means and Gaussian mixture model (GMM) clustering. As a result, GMM outperformed k-means when k = 5 in terms of summary information and sentence variety. Moreover, Booth et al. [45] mined 20,461 English user reviews that remained after removing unnecessary data. The researchers gathered the dataset from seven wellbeing and mental health chatbots to discover user experience issues. Then, they used TF-IDF vectorization to obtain the word weights for each review, which were used as the input for k-means clustering to identify comparable reviews based on their content. Using the elbow method, they found the optimal number of clusters to be k = 6. Most of the clusters’ top ten words represented positive sentiments, while negative words were rare. These findings can be utilized to offer recommendations for mental health app developers.
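To make the elbow method mentioned above more concrete, the following is a minimal sketch, not the procedure of any cited study. It runs a plain k-means on small numeric vectors and records the within-cluster sum of squares (inertia) for each k; the "elbow" is the k after which the inertia stops dropping sharply. The function names, the fixed random seed, and the toy 2-D points are assumptions made for illustration; the cited studies clustered TF-IDF vectors instead.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, iterations=50, seed=0):
    """Run Lloyd's k-means and return the final within-cluster sum of squares."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct data points (illustrative choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        new_centroids = [
            [sum(dim) / len(members) for dim in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_curve(points, max_k):
    """Inertia for k = 1..max_k; plot these values to look for the elbow."""
    return {k: kmeans_inertia(points, k) for k in range(1, max_k + 1)}


# Toy data: three well-separated groups of 2-D points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]
curve = elbow_curve(toy_points, max_k=5)
```

In practice the curve would be plotted and inspected visually; choosing the elbow automatically is a separate design decision that this sketch leaves open.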

By combining the previous techniques for sentiment analysis, topic modeling, and clustering, Majesty et al. [46] evaluated mobile banking application feedback using LDA topic modeling and NB sentiment analysis. These researchers gathered 6194 Indonesian comments from the mobile banking app available in the Google Play Store in 2019 and manually annotated them, grouping them into 3021 negative and 3173 positive comments. They built and verified the NB model using the k-fold cross-validation technique with k-values of 3, 4, and 5. Meanwhile, the LDA model was built using the bigram, trigram, and corpus dictionary. The LDA model was then trained and assessed based on its coherence score. Consequently, the NB model performed best in the third test with k = 5, achieving 86.76% accuracy, 93.47% recall, and 92.48% precision. For the LDA model, the best coherence score was attained on 20 clusters: 0.480691 for the positive class and 0.366344 for the negative class. The authors of this study were able to identify the most essential parts of the text in both positive and negative comments. As a result, the application’s advantages were its simplicity, ease of use, and helpfulness, while the application’s shortcomings included its inability to deliver OTP codes, issues with the registration and login processes, and lost connections. Also, Alejandro et al. [47] used document clustering, topic modeling, and opinion-mining techniques to assess and characterize public opinion based on Twitter reviews of Uber customer service, using a dataset of 215,387 English-language tweets from the year 2020. Several topic models were used, including the NMF, latent semantic indexing (LSI), and LDA models, where the optimal topic number was determined using the Gibbs sampling approach. They then used different document clustering approaches, including k-means clustering and the genetic with local convergence method, to merge tweets into one of the specified topics.
LDA and NMF beat the LSI, obtaining seven topics: Uber account/Uber app, Uber drivers, money and payments, Uber Eats, opinions about customer service, time and cancellation, and interaction with Uber support. Using the elbow technique, they discovered that k = 7 represented the optimal clustering value, which was consistent with the number of LDA topics. Finally, they studied and discussed the attitudes and emotions associated with each extracted topic.

In general, machine learning techniques like sentiment analysis, topic modeling, and clustering have been proven to be effective tools in analyzing user experience. By utilizing these methods, businesses can gain valuable insights into how users perceive their services and make informed decisions to improve the user experience. Previous studies have predominantly employed LDA for topic modeling, SVM and NB for sentiment analysis, and k-means for clustering as seen in Table 1. However, there is a noticeable lack of research that combines these methodologies to uncover the crucial factors influencing user experience and explore user sentiments in relation to these factors. Notably, most of these studies have mainly focused on the English language. Therefore, this study aims to address this research gap by conducting an Arabic text mining study, specifically targeting the super application environment in the Kingdom of Saudi Arabia.

3. Materials and Methods

This section presents the overall framework methodology for analyzing and discovering the most important topics that influence the user experience in the super app environment through mining customers’ reviews. The scope of this research was restricted to mining Arabic-language customers’ reviews of popular business super applications in Saudi Arabia. Figure 1 shows the Arabic Super App User Experience Framework (ASA-UXF). The main steps were collecting and annotating the dataset written in Arabic using the CAMeL tool, preprocessing the dataset, applying the topic modeling to identify the most significant topics, embedding reviews, clustering reviews under each topic obtained, and visualizing the result. The following subsections introduce and explain these stages in depth.

3.1. Data Processing

Data processing allows for the efficient and precise management of massive volumes of information. Through a series of steps, it involves receiving, cleaning, filtering, storing, and manipulating data to transform raw, unstructured text data into a machine-readable format. These steps were accomplished as follows.

3.1.1. Data Collection

The experimental dataset contained 20,311 user reviews posted from 2022 to 2023, which were collected by scraping reviews of the Mrsool super application (https://mrsool.co/ (accessed on 6 May 2023)). Mrsool provides delivery services for a variety of stores, such as food and drink stores, pharmacies, digital card stores, supermarkets, and more, within the Kingdom of Saudi Arabia. It is noteworthy that Mrsool ranks among the most popular and most downloaded mobile apps in Saudi Arabia. The data collected from the Apple Store and Google Play Store reflect recent user experiences, which can be visualized and used as a basis for recommendations. All of the data obtained from the Google Play Store and Apple Store were from Saudi Arabia. For the purposes of scraping reviews, two tools were used: the Apple Store Scraper tool (https://github.com/facundoolano/app-store-scraper (accessed on 6 May 2023)) and the Google Play Scraper tool (https://github.com/JoMingyu/google-play-scraper (accessed on 6 May 2023)). These ready-made tools use the app name and ID, which can be found in the store URL, to scrape the contents of the app’s page. The gathered data were saved in comma-separated values (CSV) and Excel (xlsx) files.
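As a rough illustration of the collection step, the sketch below shows how reviews might be pulled with the Python google-play-scraper package cited above and written to a CSV file. Only the Google Play side is sketched. The app identifier passed to the fetch function, the review count, and the selection of columns are assumptions made for illustration and are not taken from the study; the column names follow the keys that package returns, to the best of our knowledge.

```python
import csv

# Columns kept from each scraped review (an illustrative selection).
FIELDS = ["userName", "content", "score", "thumbsUpCount",
          "reviewCreatedVersion", "at"]


def fetch_google_play_reviews(app_id, count=200):
    """Fetch the newest Arabic reviews from the Saudi store for one app.

    Requires network access and the third-party google-play-scraper package,
    so the import is kept local to this function.
    """
    from google_play_scraper import Sort, reviews

    batch, _continuation_token = reviews(
        app_id,          # hypothetical placeholder: the ID found in the store URL
        lang="ar",
        country="sa",
        sort=Sort.NEWEST,
        count=count,
    )
    return batch


def save_reviews_csv(rows, path):
    """Write the selected fields of each review to a CSV file; return row count."""
    # utf-8-sig adds a BOM so that Excel displays Arabic text correctly.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            writer.writerow({key: row.get(key, "") for key in FIELDS})
    return len(rows)
```

A call such as `save_reviews_csv(fetch_google_play_reviews("<app id>"), "reviews.csv")` would then produce the CSV file, where `<app id>` stands for the identifier taken from the store URL.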

3.1.2. Data Annotation

The web scrapers provided us with the essential details of each review, such as the user’s name, the review text, the posting date, the number of thumbs up, the app’s current version, and the rating. It was observed that certain users assigned a score that did not match the content of their review, which would make an annotation process based on the rating misleading. Table 2 highlights sample reviews from the collected dataset as examples of inconsistencies between the user’s chosen score and the review text. For this reason, the “Rating” attribute that the Apple Store and Google Play Store attach to the user’s review was considered insufficiently accurate. As a result, we labeled the reviews with a positive, negative, or neutral sentiment using the CAMeL tool and added a new attribute to the dataset named “Sentiment”.

The CAMeL tool is an open-source Python toolkit for Arabic NLP. It offers various utilities for the Arabic language, such as preprocessing, dialect identification, morphological modeling, sentiment analysis, part of speech (POS) tagging, and named entity recognition. The main use of the CAMeL tool in the proposed framework is to apply sentiment analysis tasks and to estimate users’ feelings and opinions. It is also used to help explore the content of the clusters and optimize the clustering results. The tool provides an Arabic sentiment analyzer that classifies Arabic sentences as negative, positive, or neutral using fine-tuned AraBERT and multilingual BERT (mBERT) transformer models [48].

To avoid relying on rating scores that users had assigned erroneously, we invited three external annotators to voluntarily evaluate 350 reviews that were extracted randomly from the dataset by assigning a score to each review based on their personal opinions. One of the annotators is a graduate of King Abdulaziz University, and the others are graduates of Jeddah University. Each of them evaluated the same 350 reviews individually, and a majority vote was used to determine whether each review was negative, neutral, or positive. Figure 2 depicts the distribution of polarity ratings in the sample dataset. Figure 3 shows the final sentiment distribution after employing the CAMeL tool on the remaining dataset. To assess the accuracy of the sentiment polarity scores, a confusion matrix was calculated by comparing the CAMeL tool’s predicted values to those provided by the annotators. The CAMeL tool predicted the class labels with an accuracy of 88%.
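
The validation procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy labels are invented, the function names are hypothetical, and the handling of a three-way tie (returning no label) is our assumption, since the text does not say how such ties were resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None when the top labels tie."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: ties are left unresolved in this sketch
    return ranked[0][0]


def confusion_counts(gold, predicted):
    """Count (gold, predicted) label pairs, skipping reviews without a gold label."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g is not None)


def accuracy(gold, predicted):
    """Share of labeled reviews where the prediction matches the gold label."""
    counts = confusion_counts(gold, predicted)
    correct = sum(n for (g, p), n in counts.items() if g == p)
    return correct / sum(counts.values())


# Toy data: three annotator labels per review (hypothetical, not from the dataset).
annotations = [
    ("positive", "positive", "neutral"),
    ("negative", "negative", "negative"),
    ("neutral", "negative", "negative"),
    ("positive", "neutral", "negative"),  # three-way tie
]
gold = [majority_vote(votes) for votes in annotations]
predicted = ["positive", "negative", "neutral", "positive"]  # stand-in for tool output

print(gold)
print(confusion_counts(gold, predicted))
print(accuracy(gold, predicted))
```

The off-diagonal entries of the pair counts show which classes the tool confuses, which a single accuracy figure hides.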

3.1.3. Data Preprocessing

Customers’ reviews are mainly unstructured and can include extraneous data, which would negatively impact machine learning performance. Hence, cleaning the data is a crucial step in enhancing the effectiveness of machine learning models. It eliminates unimportant data that add unnecessary complexity and degrade the model’s performance, thereby improving textual analysis. Thus, to obtain an efficient preprocessing method, the following steps were applied in the ASA-UXF framework:

We removed unnecessary additional data such as non-Arabic characters, numbers, punctuation, emojis, symbols, extra whitespace, empty rows, and duplicate reviews. Furthermore, reviews with a length of fewer than two words were removed from the dataset, considering that one word does not construct a useful sentence, which may confuse the model.
We applied tokenization to break up reviews into smaller portions (words) to aid in analyzing and comprehending these reviews.
We removed stop words, that is, very frequent words such as “in”, “on”, and “from” that do not add any semantic information that is useful for the model, using the NLTK package (https://www.nltk.org/search.html?q=stopwords) (accessed on 3 June 2023). In addition, proper nouns such as the application name and the name of God (Allah), which are irrelevant to comprehending the topic’s underlying concept and may confuse the model, were removed before further analysis, as done by the researchers in [41,42].
We normalized reviews to transform the text into a standard format. Additionally, we removed Tatweel (character elongation) and repeated words and characters that are used to emphasize the opinion. We also removed Tashkeel, the diacritical marks used in Arabic script. These steps were applied with the help of the pyArabic tool (https://github.com/linuxscout/pyarabic/tree/master (accessed on 3 June 2023)).
We applied lemmatization to reduce words to their base form (lemma). Different lemmatizers were examined as shown in Table 3, and to obtain accurate data in a shorter time, the Qalsadi Lemmatizer was chosen to be applied in the framework.
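
The cleaning, normalization, tokenization, and stop word steps above can be approximated with the standard library alone. The sketch below is not the pyArabic or NLTK implementation used in the framework: the Unicode ranges, the rule that collapses a character repeated three or more times, the three-word stop list, and the function names are all assumptions made for illustration, and lemmatization is omitted.

```python
import re

# Unicode ranges below are assumptions for this sketch, not the pyArabic API.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")  # Arabic diacritical marks
TATWEEL = re.compile(r"\u0640")                   # elongation character
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")    # digits, Latin, emojis, punctuation
REPEATED = re.compile(r"(.)\1{2,}")               # same character 3+ times in a row

# Tiny illustrative stop list ("in", "on", "from"); the paper uses NLTK's list.
STOP_WORDS = {"في", "على", "من"}


def preprocess(review):
    """Clean one review and return its tokens."""
    text = TASHKEEL.sub("", review)
    text = TATWEEL.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    text = REPEATED.sub(r"\1", text)  # collapse emphatic repetition to one character
    tokens = text.split()             # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def clean_corpus(reviews):
    """Preprocess all reviews, dropping duplicates and reviews under two words."""
    seen, cleaned = set(), []
    for review in reviews:
        tokens = tuple(preprocess(review))
        if len(tokens) >= 2 and tokens not in seen:
            seen.add(tokens)
            cleaned.append(list(tokens))
    return cleaned


print(preprocess("التطبيق راااائع جدااا 👍 100%"))
```

Removing diacritics and Tatweel before the stop word lookup matters: a stop word written with Tashkeel would otherwise not match the list.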

Table 4 shows a sample of the reviews from the experimental dataset before and after applying the preprocessing steps.

3.1.4. Document Embedding

The proposed framework uses Doc2vec, also known as the paragraph vector, which was introduced by Le and Mikolov [49], to embed reviews. Doc2vec is an unsupervised learning method that creates vector representations for texts of varying lengths, such as documents and sentences. The main benefit of the Doc2vec embedding method here is that it helps the clustering step group positive and negative reviews, owing to the document embedding’s ability to capture the context and semantic meaning of texts.

3.2. Topic Modeling

To achieve the main objective of this study, the biterm topic model (BTM) was used to extract the main topics that affect user experience in the super app environment, utilizing user reviews that are considered a type of short text. BTM was proposed by Yan et al. [50] to learn topics from short texts by explicitly modeling the biterms’ generation in the entire corpus, which helps in solving the problem of sparse patterns at the document level. Unlike traditional topic models that learn topics from single-word co-occurrence patterns at the document level, BTM learns topics from all of the generated biterms (two-word co-occurrence patterns) in the entire corpus.

Considering the comment “application prices are expensive”, ignoring the stop word “are”, its biterms would be “application prices”, “application expensive”, and “prices expensive”. This type of two-word co-occurrence pattern (biterms) demonstrates the correlation in documents and teaches the model the appropriate topic since it is depicted as a combination of correlated words. To find the joint probability of biterm b = (w_{i}, w_{j}), we used Equation (1), and to find the probability of the entire corpus, we used Equation (2):

P (b) = \sum_{z} P (z) P (w_{i} | z) P (w_{j} | z) = \sum_{z} θ_{z} ϕ_{i | z} ϕ_{j | z}

(1)

P (B) = \prod_{(i, j)} \sum_{z} θ_{z} ϕ_{i | z} ϕ_{j | z}

(2)

where z is the topic, b is the biterm that includes two words (w_{i}, w_{j}), θ is the topic distribution of the whole collection, ϕ is the topic word distribution, and B is the biterm set.
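
As a small worked illustration of Equations (1) and (2), the sketch below enumerates the biterms of the example comment and evaluates their probabilities. The two-topic values of θ and ϕ are invented toy numbers, and the function names are hypothetical; this is not the BTM training procedure, only the likelihood it optimizes.

```python
import math
from itertools import combinations


def biterms(tokens):
    """All unordered word pairs (biterms) of one short document."""
    return list(combinations(tokens, 2))


def biterm_probability(biterm, theta, phi):
    """Equation (1): P(b) = sum over topics z of theta[z] * phi[z][w_i] * phi[z][w_j]."""
    w_i, w_j = biterm
    return sum(theta[z] * phi[z][w_i] * phi[z][w_j] for z in range(len(theta)))


def corpus_probability(biterm_set, theta, phi):
    """Equation (2): product of P(b) over every biterm in the corpus."""
    return math.prod(biterm_probability(b, theta, phi) for b in biterm_set)


# The example comment after removing the stop word "are".
tokens = ["application", "prices", "expensive"]
B = biterms(tokens)

# Toy parameters for two topics (assumed values, for illustration only).
theta = [0.5, 0.5]
phi = [
    {"application": 0.2, "prices": 0.4, "expensive": 0.4},
    {"application": 0.6, "prices": 0.2, "expensive": 0.2},
]

print(B)
print(biterm_probability(("prices", "expensive"), theta, phi))
print(corpus_probability(B, theta, phi))
```

Because biterms are pooled over the whole corpus, even a three-word review contributes three co-occurrence observations, which is how BTM sidesteps document-level sparsity.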

3.3. Clustering

After embedding the reviews and obtaining their vectors, another unsupervised learning algorithm was used to group these reviews, known as k-means clustering, which is regarded as one of the most popular and powerful data mining algorithms [51]. By identifying the highest degree of similarity inside a cluster and the largest degree of dissimilarity across distinct clusters, this algorithm can identify the cluster structure within a dataset [51]. During training, the k-means algorithm moves the k cluster centroids iteratively in order to minimize the following cost function:

\underset{S}{\arg\min} \sum_{i = 1}^{k} \sum_{x \in S_{i}} {‖ x - μ_{i} ‖}^{2}

(3)

where S = \{S_{1}, S_{2}, …, S_{k}\} is the cluster set, k is the cluster number, and μ_{i} is the centroid of S_{i}.
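
To make the cost function in Equation (3) concrete, the following is a minimal pure-Python sketch of the standard iterative (Lloyd) procedure on toy two-dimensional points. It is not the implementation used in the framework, which operates on Doc2vec vectors; the random initialization, the seed, and all function names are assumptions.

```python
import random


def squared_distance(x, mu):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: squared_distance(x, centroids[i]))
            for x in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if the cluster is empty)."""
    moved = []
    for i, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == i]
        moved.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else old)
    return moved


def cost(points, labels, centroids):
    """Equation (3): sum of squared distances of points to their cluster centroid."""
    return sum(squared_distance(x, centroids[label]) for x, label in zip(points, labels))


def kmeans(points, k, iterations=100, seed=0):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = random.Random(seed).sample(points, k)  # assumed initialization
    for _ in range(iterations):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
    return assign(points, centroids), centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids, cost(points, labels, centroids))
```

Each assignment step and each update step can only lower the cost or leave it unchanged, which is why the loop terminates.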

3.4. Visualization

Different visualization methods were used in the proposed framework to enhance the perception and understanding of the dataset structure, obtained topics, and cluster results. Presenting the extracted information in a clear, comprehensive, and accessible manner gives business owners immediate insights into users’ opinions. The visualization techniques employed in this work include word clouds, bar charts, tmplots, principal component analysis (PCA), pie charts, and sunburst charts.

3.5. Evaluation Method

To evaluate the performance of the BTM topic model in the ASA-UXF framework, the semantic coherence score introduced by Mimno et al. [52] was used to assess the quality of the extracted topics. This evaluation metric, shown in Equation (4), is well aligned with human judgments of coherence, and it does not rely on an external reference corpus; rather, it solely uses word co-occurrence data obtained from the corpus being modeled. Ideally, the model would already have taken into consideration all of these co-occurrence data. To assess a topic set’s overall quality, we computed its average coherence score, which is calculated using Equation (5):

C (z; V^{(z)}) = \sum_{t = 2}^{T} \sum_{l = 1}^{t - 1} \log \frac{D (V_{t}^{(z)}, V_{l}^{(z)}) + 1}{D (V_{l}^{(z)})}

(4)

\frac{1}{K} \sum_{k} C (Z_{k}; V^{(Z_{k})})

(5)

where z is the topic, V^{(z)} = (V_{1}^{(z)}, …, V_{T}^{(z)}) is the list of the T top words of topic z, D (v) is the document frequency of word v, D (v, v ’) is the co-document frequency of words v and v ’, and K is the number of topics.
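
The coherence score can be computed directly from document frequencies. The sketch below follows the form given by Mimno et al., in which each top word is paired only with the words ranked above it; the toy corpus, the topic word lists, and the function names are assumptions, and every top word is assumed to occur in at least one document so that the denominator is never zero.

```python
import math


def document_frequency(doc_sets, *words):
    """Number of documents that contain all of the given words."""
    return sum(all(w in doc for w in words) for doc in doc_sets)


def topic_coherence(top_words, docs):
    """Equation (4): sum of log((D(v_t, v_l) + 1) / D(v_l)) over word pairs with l < t."""
    doc_sets = [set(doc) for doc in docs]
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            joint = document_frequency(doc_sets, top_words[t], top_words[l])
            single = document_frequency(doc_sets, top_words[l])
            score += math.log((joint + 1) / single)
    return score


def average_coherence(topics, docs):
    """Equation (5): mean coherence over all K topics."""
    return sum(topic_coherence(words, docs) for words in topics) / len(topics)


# Toy tokenized corpus (hypothetical reviews).
docs = [
    ["price", "expensive"],
    ["price", "expensive", "delivery"],
    ["price", "delivery"],
    ["delivery", "fast"],
    ["price"],
]
topics = [["price", "expensive", "delivery"], ["delivery", "fast"]]

print(topic_coherence(topics[0], docs))
print(average_coherence(topics, docs))
```

Scores are at most slightly above zero and usually negative; values closer to zero indicate that a topic's top words co-occur more often, which is why a score such as −103.24 is compared against other configurations rather than read in isolation.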

Perplexity is a popular evaluation metric in topic models that is used to assess the model’s ability to anticipate new documents. It refers to the model’s uncertainty in predicting the next word for a given batch of documents. Further, with decreasing perplexity, the model’s ability to forecast new documents increases. In general, models with lower degrees of perplexity perform better in predicting the next word [53]. To calculate the perplexity of the unseen comments for the testing dataset, Equation (6) was applied:

Perplexity = \exp \left( - \frac{\sum_{d = 1}^{m} \log p (w_{d})}{\sum_{d = 1}^{m} N_{d}} \right)

(6)

where w is a collection of comments, m is the total number of comments, p (w_{d}) is the likelihood of a word appearing in a comment, and N_{d} is the word count of the comment.
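
Equation (6) reduces to a one-line computation once the per-comment log-likelihoods are available. In the sketch below those log-likelihoods are supplied by hand for a toy model that gives every word probability 0.1; in practice they would come from the trained topic model. The function name is hypothetical.

```python
import math


def perplexity(log_likelihoods, word_counts):
    """Equation (6): exp of the negative total log-likelihood per word."""
    return math.exp(-sum(log_likelihoods) / sum(word_counts))


# Toy model: every word has probability 0.1, so a comment of N words
# has log-likelihood N * log(0.1).
word_counts = [3, 2]
log_likelihoods = [n * math.log(0.1) for n in word_counts]

print(perplexity(log_likelihoods, word_counts))
```

In this toy case the perplexity equals 10, the reciprocal of the per-word probability, which matches the usual reading of perplexity as the effective number of equally likely word choices.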

The silhouette score is a popular and common metric used to assess the quality of data clustering outcomes such as k-means. This score depends on distance metrics such as “Manhattan”, “cosine similarity” or “Euclidean” to assess the similarity of each data point to the assigned cluster and how it is different from other clusters. In detail, a positive score indicates good clustering results, a score of zero indicates overlapping clusters, and a negative score indicates poor clustering results [54]. Equation (7) was used to evaluate clustering performance:

silhouette score = (b - a) / \max (a, b)

(7)

where for every data point, b is the mean nearest-cluster distance, and a is the mean intra-cluster distance.
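
Equation (7) can be evaluated point by point and then averaged. The sketch below does so in pure Python with the Euclidean distance as the default metric; the convention of scoring a single-member cluster as zero, the toy points, and the function names are assumptions, and at least two clusters are required.

```python
import math


def mean(values):
    return sum(values) / len(values)


def silhouette_scores(points, labels, distance=math.dist):
    """Equation (7) for every point: (b - a) / max(a, b)."""
    scores = []
    for i, (x, own) in enumerate(zip(points, labels)):
        same = [distance(x, y) for j, (y, label) in enumerate(zip(points, labels))
                if label == own and j != i]
        if not same:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        a = mean(same)  # mean intra-cluster distance
        b = min(        # mean distance to the nearest other cluster
            mean([distance(x, y) for y, label in zip(points, labels) if label == other])
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return scores


def silhouette_score(points, labels, distance=math.dist):
    """Average silhouette over all points."""
    return mean(silhouette_scores(points, labels, distance))


points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # clusters that cut across the groups
```

Passing a different `distance` function is how the same score can be reported under several metrics.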

The Euclidean distance metric is a commonly used distance measure. It is based on the Pythagorean theorem and represents the shortest distance between two data points. Many machine learning methods employ the Euclidean distance as the default distance metric to compare the similarity of two recorded vectors [55]. It can be calculated using Equation (8):

D = {(\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2})}^{\frac{1}{2}}

(8)

where n is the number of dimensions, x_{i} and y_{i} are the i^{th} coordinates of the two data points, and D represents the distance between these data points.

The cosine similarity metric is a method for measuring the similarity of two vectors (two reviews) by computing the cosine of their angle [56,57]. To compute the cosine similarity between two vectors, A and B, Equation (9) was used:

cos_sim (A, B) = \cos (θ) = \frac{\sum_{i = 1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i = 1}^{n} A_{i}^{2}} \sqrt{\sum_{i = 1}^{n} B_{i}^{2}}}

(9)

where θ represents the angle between A and B in the n-dimensional space, and A_{i} and B_{i} are the i^{th} components of reviews A and B, respectively.
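
Equations (8) and (9) translate into a few lines each. The sketch below uses toy vectors and hypothetical function names, and it assumes neither vector is all zeros, since the cosine is undefined in that case.

```python
import math


def euclidean_distance(x, y):
    """Equation (8): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def cosine_similarity(a, b):
    """Equation (9): dot product divided by the product of the vector norms."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    norm_b = math.sqrt(sum(bi ** 2 for bi in b))
    return dot / (norm_a * norm_b)


short_review = (1.0, 2.0)
long_review = (2.0, 4.0)  # same direction, twice the magnitude

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(euclidean_distance(short_review, long_review))
print(cosine_similarity(short_review, long_review))
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))
```

The two measures can disagree: vectors that point in the same direction have a cosine similarity of 1 even when their Euclidean distance is large, so the choice of metric can change which reviews are treated as close.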

3.6. Finding the Optimal Parameters

To find the optimal number of topics and clusters that yield good results, two algorithms were used in this study: grid search and the elbow method.

Grid search is a methodical technique to look for hyperparameters across the search space, and it generates all potential combinations regardless of the element’s influence in the optimization process. Thus, this method offers certain assurances when it assigns every parameter an equal chance of influencing the optimization process. However, it may require a large amount of computation and time when involving several parameters, each with multiple values [58,59].
The elbow method helps unsupervised methods such as k-means clustering establish the optimal number of clusters to put the data into, because the algorithm cannot determine this number itself and it must be set in advance. Accordingly, the elbow method is one of the methods used to define the best number of clusters. It operates by calculating the WCSS (within-cluster sum of squares), that is, the sum of the squared distances between the sample points and their cluster’s centroid, for a range of k-values. To determine the optimal k-value, the k-WCSS curve is plotted, and the downward inflection point (the elbow) is located [60].
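
Both search procedures can be sketched briefly. In the code below, the parameter names `topics` and `alpha`, the toy scoring function, and the WCSS values are all invented for illustration; in the framework the score would be the average coherence of a trained model. Locating the elbow by the largest second difference of the curve is our assumption of one simple numerical rule, whereas the text describes reading the elbow off the plotted curve.

```python
from itertools import product


def grid_search(param_grid, score):
    """Score every combination in the grid and return the best (params, score)."""
    names = list(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        result = score(params)
        if best is None or result > best[1]:
            best = (params, result)
    return best


def elbow_point(wcss_by_k):
    """Return the k where the WCSS curve bends most (largest second difference).

    Assumes consecutive, evenly spaced k-values.
    """
    ks = sorted(wcss_by_k)
    bends = {
        ks[i]: wcss_by_k[ks[i - 1]] - 2 * wcss_by_k[ks[i]] + wcss_by_k[ks[i + 1]]
        for i in range(1, len(ks) - 1)
    }
    return max(bends, key=bends.get)


def toy_score(params):
    """Stand-in for training a model and measuring its quality (hypothetical)."""
    return -(params["topics"] - 4) ** 2 - params["alpha"]


best_params, best_score = grid_search(
    {"topics": [2, 3, 4, 5, 6], "alpha": [0.1, 0.5]}, toy_score
)
print(best_params, best_score)

wcss = {1: 1000.0, 2: 300.0, 3: 250.0, 4: 220.0, 5: 200.0}  # toy values
print(elbow_point(wcss))
```

The grid here has 5 × 2 = 10 combinations; each added parameter multiplies that count, which is the computational cost noted above.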

4. Results

This section provides a general analysis that gives an overview of user reviews of super apps. It then reports the experiments conducted in this work, which combined topic modeling with sentiment analysis and with clustering to detail the sentiments associated with each of the topics covered in the dataset. These experiments can be summarized as belonging to two approaches: topic modeling with sentiment analysis and topic modeling with clustering techniques. Furthermore, various evaluations of these techniques are reported, along with the clustering optimization process.

4.1. General Analysis

After the data preprocessing stage, the experimental dataset of Mrsool comprised 8562 user reviews with 9319 words. The top ten most used words in the Mrsool dataset were visualized using the bar chart and word cloud as presented in Figure 4a,b, respectively. For the Mrsool dataset, “request”, “delivery”, “delegator”, “price”, “customer”, “support”, “program”, “service”, “bad”, and “account” are the most commonly repeated words. The repetition of these words gives an idea of the most important topics that we can find in the comments. As an example, the word “price” indicates the possibility of finding a latent topic affecting the user experience centered around this word. Likewise, concerning the word “delivery”, there may be comments regarding “delivery problems”, “delivery speed”, “delayed delivery”, and so forth, which can be discovered.

The data annotation in this study offers a deeper examination of the labeled reviews using the CAMeL tool, which classified the reviews as positive, negative, or neutral. Figure 5 shows the monthly volume of reviews of the Mrsool super application, broken down by user sentiment, throughout the previously mentioned period of 2022 to 2023. This figure presents interesting data that require interpretation. There was a significant increase in the volume of reviews in May and July. This increase was most likely due either to new customers being attracted to the application during summertime or to a problem with the application that prompted users to express their opinions by writing reviews. However, the volume appeared to start decreasing slightly by the end of the year. Although there were several negative comments, they were outnumbered by positive user opinions, which is a good sign that the application is still liked by users.

Furthermore, we extracted the most frequent words for every label, namely, positive, negative, and neutral, as shown in Figure 6a–c, respectively, displaying the top terms in the Mrsool dataset using word cloud visualization. Most of the positive comments revolve around the users’ admiration for the application. For example, “wonderful”, “respectful”, “fast”, “thanks”, “beautiful”, and “excellent” all demonstrated the users’ admiration for the application. As for the negative comments, it seemed that users were angry and dissatisfied when writing words like “bad” and “problem”. Meanwhile, for neutral comments, words such as “account” and “communication” may indicate the obstacles or needs that users experience in relation to the application. Notably, there were general words that were repeated in all sentiments, such as “request”, “delivery”, “delegator”, “price”, and “program”, which need further investigation to discover their contents.

We observed the general statistical information of the Mrsool super application to establish an overview of the most frequent terms in all sentiments. These general analyses do not provide insights into how users feel about the services provided. For example, the words “delivery”, “order”, and “price” were repeated in all sentiments, which makes it difficult to determine whether users were satisfied or angry with the provided service. Hence, the next section of this study will explain in detail the procedures of separating different reviews according to the extracted topics, with the aim of classifying the sentiment information of each topic.

4.2. Topic Modeling and Sentiment Analysis

Topic modeling is a text mining technique commonly used to uncover the semantic components of the provided text. This statistical modeling technique is used in the NLP field, such as in sentiment analysis. In this proposed framework, the BTM topic modeling approach for short text is applied to the Mrsool dataset to discover the latent topics that affect user experience. In order to apply BTM topic modeling, the dataset underwent the necessary data preprocessing and cleaning steps. The cleaned reviews were converted into a suitable structure for BTM topic modeling. Then, different experiments were conducted to find the optimal number of topics. By using the grid search method, multiple hyperparameters such as the topic number were assigned to obtain the highest coherence score with the appropriate number of topics. Table 5 illustrates the average coherence score of Equation (5) for different topic numbers and hyperparameters, which resulted in the optimal number of topics being identified as four, with an average coherence score of −103.24.

By using the interactive Plotly tool, the four topics are visualized in Figure 7. Notably, all the groups of topics in the presented figure are clearly separated from one another. As a result, the topics have no commonalities, and we can be confident that our proposed topic modeling is suitable for analysis. Thus, our earlier investigation revealed that the optimal number of topics in the dataset is four non-overlapping latent topics. However, BTM does not expressly provide topic names. As in exploratory factor analysis, the analyst must examine and categorize the underlying theme of each topic’s keywords. Thus, the top 15 keywords were analyzed, and labels were assigned to the latent topics based on our subjective interpretation. Table 6 displays the four most frequently discussed topics in the dataset, including topic word clouds, topic descriptions, and topic names. For the first topic, the keywords were “delivery”, “account”, “payment”, “request”, “customer”, “money”, “right”, “support”, “delegator”, “amount”, “complaint”, “day”, “I say”, “pass”, and “something”, which represented delivery and payment problems. Meanwhile, for the second topic, the keywords were “customer”, “software”, “update”, “delegator”, “not very”, “very”, “account”, “support”, “problem”, “solution”, “communication”, “response”, “service”, and “request”, representing customer support and application updates. As for the third topic, the keywords were “price”, “location”, “request”, “delegator”, “customer”, “value”, “delivery”, “expensive”, “kilo”, “store”, “lot”, “program”, “something”, “very”, “pass”, and “restaurant”, relating to the delivery prices. Finally, for the fourth topic, the keywords were “request”, “delivery”, “dealing”, “thanks”, “work”, “delegator”, “advice”, “service”, “excellent”, “program”, “support”, “respectable”, “preferred”, “customer”, and “fast”, which generally represented users’ opinions regarding the Mrsool app.

Moreover, using the pie chart of the Plotly tool helped in finding the proportion of each extracted topic from the Mrsool dataset as shown in Figure 8. From the chart, the most discussed topic by users in the dataset is the application, at a percentage of 29.5%, which makes the topic of application the most important topic from the user’s perspective, which should be dealt with carefully to improve user satisfaction. This is followed by the prices topic, at a percentage of 29.4%, making it the second most popular topic after the topic of application. Naturally, this topic carries a lot of weight from the users’ point of view, among others that must be paid attention to. The next most popular topic is customer support and updates, at a percentage of 23.3%, which makes it no less important than other topics, as it assesses the level of actual and effective communication with users. It shows the extent to which customer service and technical support respond to users’ needs by providing updates, responding to customers, and communicating with them when problems occur. The least popular topic is the delivery and payment topic, at a percentage of 17.8%, which may be used in helping users when there are problems related to payment and delivery, in addition to compensation and coupons.

As stated earlier, the main benefit of the sentiment analysis method is extracting insights from textual material using both positive and negative words, which makes using this method crucial for expanding the business and strengthening the bond between customers and the application by revealing their feelings and rating their opinions. Hence, based on the researchers’ recommendations and the great benefit obtained when combining topic modeling and sentiment analysis methods [41,61], we proposed the ASA-UXF framework, which comprises the BTM topic model combined with sentiment analysis utilizing the CAMeL tool sentiment analyzer. To provide decision-makers with a more comprehensive understanding of the users’ sentiments and opinions on each extracted topic, we employed the CAMeL sentiment analyzer to assess all reviews and saved the sentiment result in a dataframe column named “Sentiment”, and then we employed the BTM topic model and saved the extracted latent topic for every review in another column named “Topic”. After that, we used pie visualization to relate the identified topics to their sentiment analysis for all reviews. As shown in Figure 9, it is clear how users feel about every topic. Regarding prices, it seems that many users are dissatisfied, and the same applies to delivery and payment, customer service, and support. In contrast, regarding the application topic, the majority show positive feelings towards it. Consequently, it may be stated that most users believe that the application is good, but problems related to the other topics may cause users to refrain from using this application and replace it with alternative applications. To avoid such occurrences, developers and business owners must consider and respond to user requirements.
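
Once every review carries a “Topic” and a “Sentiment” value, the per-topic sentiment breakdown is a simple cross-tabulation. The sketch below uses plain dictionaries rather than a dataframe; the rows are invented toy data and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def sentiment_share_by_topic(rows):
    """For each topic, return the share of reviews carrying each sentiment label."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["Topic"]][row["Sentiment"]] += 1
    return {
        topic: {label: n / sum(by_label.values()) for label, n in by_label.items()}
        for topic, by_label in counts.items()
    }


# Toy rows standing in for labeled reviews (hypothetical values).
rows = [
    {"Topic": "prices", "Sentiment": "negative"},
    {"Topic": "prices", "Sentiment": "negative"},
    {"Topic": "prices", "Sentiment": "negative"},
    {"Topic": "prices", "Sentiment": "positive"},
    {"Topic": "application", "Sentiment": "positive"},
    {"Topic": "application", "Sentiment": "positive"},
    {"Topic": "application", "Sentiment": "neutral"},
]

shares = sentiment_share_by_topic(rows)
for topic, by_label in shares.items():
    print(topic, by_label)
```

Normalizing within each topic, rather than over the whole dataset, keeps small topics comparable with large ones when the shares are plotted.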

4.3. Topic Modeling and Clustering

Based on the assumption that reviews within each topic can be grouped into positive and negative clusters depending on the vectors acquired from the document embedding, the second approach applied the k-means clustering algorithm to help classify these reviews into positive and negative groups. The main idea behind using the k-means clustering algorithm is its ability to handle unstructured data, which constitute between 80% and 90% of the world’s generated data, without the additional costs that structured data entail [62]. In addition, the k-means clustering algorithm is a valuable tool for obtaining insights from textual data, thanks to its simplicity, effectiveness, and efficiency with regard to computation time. Hence, in the first step in building such a model, the neutral comments obtained from the CAMeL tool were eliminated, and only positive and negative comments were used to avoid confusing the clusters with neutral reviews, based on the assumption that the reviews will be automatically divided into positive and negative categories without manual involvement. After k-means clustering, the sentiment polarity score obtained from the first approach using the CAMeL tool was compared with the clustering result via sunburst chart visualization to determine whether the assumption was true.

Before choosing the optimal number of clusters k, which is the hyperparameter of the k-means algorithm, it is important to mention that Doc2Vec embedding vectors were used as the input into the k-means algorithm. The expectation was that the semantic meaning of the sentiment words would be captured by the embedding and then grouped appropriately by k-means. In the k-means algorithm, we must first predefine the value of k, which can be discovered using the elbow method. In Figure 10, we can see an elbow form on the graph, which helps in selecting the k-value at which the elbow is produced. This can be referred to as the elbow point. Beyond this point, raising the value of k does not result in a substantial reduction in the WCSS. As a result, it can be determined that the optimal number of clusters is k = 2 for all extracted topics, which provides a good indication in support of our assumption.

Table 7 presents the experiments that applied k-means clustering fed with Doc2Vec vectors of the reviews, showing the silhouette score computations using the Euclidean distance and cosine similarity metrics, in addition to the reduced cluster results visualized using PCA and scatter plots for every extracted topic in the Mrsool dataset. The clustering process appears to have been successful using both metrics, given the good results achieved and the way that the data points are well separated from each other in the graph, which was confirmed by the elbow method. We note that the clustering process using the Euclidean distance metric (Figure 11a) showed better results than that using the cosine similarity metric (Figure 11b). The next step was to use the sentiment polarity scores obtained from the CAMeL tool to help interpret the clusters’ content and discover whether the clustering fulfilled its desired purpose of dividing reviews into positive and negative clusters. By looking at the sunburst chart shown in Figure 11, which contains three layers of data, namely, the topic number, clusters, and sentiment scores, it can be seen that all clusters involved both positive and negative comments in varying proportions for every topic. This observation could be due to the clustering method having depended on the context or syntax of the comments rather than on the positive and negative words, given the varying length of the comments and the various POSs they contain.

Optimization

For the sake of improving the clustering outcomes, this research work utilized the POS tagging method offered by the CAMeL tool, which achieved 95.5% accuracy in predicting the correct POS. Furthermore, POS tagging is a widely used technique in NLP that involves classifying words in a text (corpus) according to a certain part of speech, based on the word’s meaning and context. Additionally, it is used to characterize the lexical words in a text by categorizing them as verbs, nouns, adjectives, and adverbs [63]. This method allows models to better comprehend and learn human language by improving their understanding of word structure and meaning. Therefore, the key rationale for employing this step is to restrict the clustering process to words that convey sentiments. Hence, only the nouns and adjectives were employed in this experiment.

After applying POS tagging with the aim of improving the model’s performance, Table 8 demonstrates that the optimization process had an influence on the cosine similarity metric in the first two topics: delivery and payment and customer support and updates. However, the performance of these topics remained steady when utilizing the Euclidean distance metric. A noticeable improvement was found in the topic of prices, in which it appeared that the Euclidean metric increased by 1%, while the cosine similarity increased by 20%, where the first metric continued to outperform the second by 70% to 58%. Furthermore, it appeared that the clustering process for the application topic decreased slightly in both metrics, with the Euclidean distance continuing to outperform cosine similarity by 74% to 46%.

Considering the results summarized in Table 7, it remained for us to explore whether the POS technique led to forcing the clusters to classify reviews into positive and negative groups. From Figure 12, despite the optimized results, it appears that the clustering process was dependent on other factors within the topic, which included both positive and negative sentiments. Hence, we conclude that using the first approach combining topic modeling and sentiment analysis achieved the research objectives, and this is what we observed during the testing phase.

5. Testing

To ensure that the framework is usable by other super applications belonging to the business domain, we tested the BTM model to verify its effectiveness, and the results were satisfactory. In this testing process, we used the Careem Super application (https://www.careem.com/) (accessed on 6 May 2023), which provides food delivery services, rides, laundry, house cleaning, spa services, and more, as a testing dataset. We collected 7218 user reviews from the entire year of 2023. The dataset was preprocessed to provide 2708 user reviews. As a result, the perplexity score reduced from 891.748 in the experimental dataset (training data) to 9.895 in the testing dataset, indicating that the model achieved promising results.

Consequently, based on applying Careem’s reviews in the pretrained model, application received the most reviews (33.5%), followed by prices (29.9%), delivery and payment (18.6%), and the customer support and updates (18%) as shown in Figure 13. Using the topic modeling and sentiment analysis approach, Figure 14 reveals a convergence between the users’ positive and negative opinions on most of the topics. We can observe that the number of positive comments regarding application was slightly greater compared to the number of negative comments, while the other topics were dominated by negative feedback. These results require those responsible for the super application to listen to users’ concerns and respond to their demands urgently.

6. Discussion

Many studies have used clustering techniques to analyze users’ sentiments, whether they are positive or negative [64,65,66,67]. However, to the best of our knowledge, there is no Arabic study addressing this method, which makes this work innovative. Although the clustering results were intriguing, they did not achieve their intended function of classifying reviews as negative or positive. This may be due to the difficulty of the Arabic language, containing many dialects and grammatical rules, which makes the analysis process difficult compared to other languages. Likewise, the effect of using the type of lemmatization or stemming may have an impact, as some studies have proven its effects on the model result [68,69]. Nevertheless, this study succeeded in extracting the most important topics that affect users’ experiences in the Mrsool and Careem super applications, supported by the level of user satisfaction or dissatisfaction for every extracted topic. The results of this study show that such applications face some difficulties and obstacles in satisfying their users. Users believe that these apps’ delivery and payment, customer support and updates, and prices are unsatisfactory. Therefore, we recommend that developers constantly test updates before releasing them to avoid errors. Customer service must respond to customer inquiries and complaints as quickly as possible. Likewise, prices must remain constant and appropriate for the services provided.

7. Conclusions

The growth of the application industry has accelerated because applications are readily available on mobile devices. This acceleration has led to the emergence of super applications and their entry into the Middle East. The user experience has a major impact on whether this industry grows or declines. Therefore, this study collected Arabic users’ reviews of the Mrsool and Careem super applications in Saudi Arabia, preprocessed the data, and extracted the most important topics affecting the user experience using two approaches: topic modeling with sentiment analysis and topic modeling with clustering. The four topics obtained, namely delivery and payment, customer service and updates, prices, and the application, were analyzed in depth to determine users’ sentiments regarding each topic. The study presents a general analysis, statistics, machine learning algorithms, an evaluation, results, a discussion, and visualizations. The ASA-UXF framework can be generalized to other super applications. We believe that applying this framework can help businesses to quickly identify concerns and problems related to such applications and find suitable solutions, especially because the visualizations provide a quick overview of the data. Paying close attention to customers’ reviews can help businesses improve the performance and quality of their applications, build user loyalty, and attract new customers. Furthermore, the published dataset will be useful for other researchers working in this domain. Although this study achieved its main goals, the topic modeling with clustering approach still has a limitation that needs to be resolved: the clusters could not differentiate between positive and negative comments. We therefore intend to improve this approach by standardizing sentence length and examining other clustering algorithms, such as DBSCAN, that can filter out noise.
Our future work will also focus on developing an interactive API for the proposed framework.
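
To illustrate why a density-based algorithm such as DBSCAN is a candidate for the planned improvement, the following is a minimal pure-Python sketch of the algorithm, not the implementation used or planned in this study. The toy 2-D points stand in for Doc2Vec review vectors, and the values of `eps` and `min_pts` are hypothetical. Unlike k-means, points that do not belong to any dense region receive the label `-1` (noise) instead of being forced into a cluster.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks points in no dense region."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
POINTS = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(POINTS, eps=0.5, min_pts=3)
print(labels)
```

In this toy run the isolated point is labelled as noise while the two dense groups form separate clusters. Whether the same behaviour helps on real review embeddings would need to be evaluated on the dataset.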

Author Contributions

Conceptualization, B.A., M.A., M.K. and F.A.; methodology, B.A. and M.K.; software, B.A.; validation, B.A.; formal analysis, B.A., M.A., M.K. and F.A.; investigation, B.A., M.A., M.K. and F.A.; resources, B.A.; data curation, B.A.; writing—original draft preparation, B.A.; writing—review and editing, B.A., M.A., M.K. and F.A.; visualization, B.A.; supervision, M.A. and M.K.; project administration, B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset can be obtained from: https://www.kaggle.com/datasets/bodoorayeedalrayani/careem-and-mrsool-dataset (accessed on 6 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Statista. Number of Smartphone Users Worldwide from 2014 to 2029. 2024. Available online: https://www.statista.com/forecasts/1143723/smartphone-users-in-the-world (accessed on 5 March 2024).
2. Data.ai. The State of Mobile 2023: How to Navigate This Uncertain Year. 2023. Available online: https://www.data.ai/en/go/state-of-mobile-2023/ (accessed on 6 March 2024).
3. Statista. App—Saudi Arabia. 2024. Available online: https://www.statista.com/outlook/amo/app/saudi-arabia (accessed on 6 March 2024).
4. Muhammed, R. 50+ Eye-Opening UX Statistics That Prove UX Matters! 2023. Available online: https://www.wowmakers.com/blog/ux-statistics/ (accessed on 7 March 2024).
5. Berni, A.; Borgianni, Y. From the definition of user experience to a framework to classify its applications in design. Proc. Des. Soc. 2021, 1, 1627–1636.
6. Perri, L. What’s New in the 2022 Gartner Hype Cycle for Emerging Technologies. Gartner Insights. 2022. Available online: https://www.gartner.com/en/articles/what-s-new-in-the-2022-gartner-hype-cycle-for-emerging-technologies (accessed on 17 May 2023).
7. Statista. Super Apps—Statistics & Facts. 2024. Available online: https://www.statista.com/topics/10296/super-apps/ (accessed on 9 March 2024).
8. Diaz Baquero, A.P. Super Apps: Opportunities and Challenges. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2021.
9. Ota, F.K.C.; Meira, J.A.; Frank, R.; State, R. Towards Privacy Preserving Data Centric Super App. In Proceedings of the 2020 Mediterranean Communication and Computer Networking Conference (MedComNet), Arona, Italy, 17–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–4.
10. Roa, L.; Correa-Bahnsen, A.; Suarez, G.; Cortés-Tejada, F.; Luque, M.A.; Bravo, C. Super-app behavioral patterns in credit risk models: Financial, statistical and regulatory implications. Expert Syst. Appl. 2021, 169, 114486.
11. Roa, L.; Rodríguez-Rey, A.; Correa-Bahnsen, A.; Arboleda, C.V. Supporting financial inclusion with graph machine learning and super-app alternative data. In Intelligent Systems and Applications, Proceedings of the 2021 Intelligent Systems Conference (IntelliSys), Amsterdam, The Netherlands, 2–3 September 2021; Springer: Cham, Switzerland, 2022; Volume 2, pp. 216–230.
12. Airlangga, M.C.; Sulasikin, A.; Nugraha, Y.; Husna, N.L.R.; Aminanto, M.E.; Kurniawan, F.; Kanggrawan, J.I. Understanding Citizen Feedback of Jakarta Government Super App: Leveraging Deep Learning Models. In Proceedings of the 2023 IEEE International Smart Cities Conference (ISC2), Bucharest, Romania, 24–27 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
13. Nakamura, W.T.; de Oliveira, E.C.; de Oliveira, E.H.; Redmiles, D.; Conte, T. What factors affect the UX in mobile apps? A systematic mapping study on the analysis of app store reviews. J. Syst. Softw. 2022, 193, 111462.
14. Jain, P.K.; Pamula, R.; Srivastava, G. A systematic literature review on machine learning applications for consumer sentiment analysis using online reviews. Comput. Sci. Rev. 2021, 41, 100413.
15. Kwon, H.J.; Ban, H.J.; Jun, J.K.; Kim, H.S. Topic modeling and sentiment analysis of online review for airlines. Information 2021, 12, 78.
16. Rafea, A.; GabAllah, N.A. Topic detection approaches in identifying topics and events from arabic corpora. Procedia Comput. Sci. 2018, 142, 270–277.
17. Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295.
18. Ethnologue. What Are the Top 200 Most Spoken Languages? 2023. Available online: https://www.ethnologue.com/insights/ethnologue200/ (accessed on 12 March 2024).
19. UNESCO. World Arabic Language Day. 2023. Available online: https://www.unesco.org/en/world-arabic-language-day (accessed on 12 March 2024).
20. Ramzy, M.; Ibrahim, B. User satisfaction with Arabic COVID-19 apps: Sentiment analysis of users’ reviews using machine learning techniques. Inf. Process. Manag. 2024, 61, 103644.
21. Badaro, G.; Baly, R.; Hajj, H.; El-Hajj, W.; Shaban, K.B.; Habash, N.; Al-Sallab, A.; Hamdi, A. A survey of opinion mining in Arabic: A comprehensive system perspective covering challenges and advances in tools, resources, models, applications, and visualizations. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 2019, 18, 1–52.
22. Nassif, A.B.; Elnagar, A.; Shahin, I.; Henno, S. Deep learning for Arabic subjective sentiment analysis: Challenges and research opportunities. Appl. Soft Comput. 2021, 98, 106836.
23. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160.
24. Pilliang, M.; Akbar, H.; Firmansyah, G. Sentiment analysis for super applications in Indonesia: A case study of Gov2Go App. In Proceedings of the 2022 3rd International Conference on Electrical Engineering and Informatics (ICon EEI), Virtual Conference, 19–20 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 80–85.
25. Aji, S.; Hidayatun, N.; Faqih, H. The sentiment analysis of Fintech users using support vector machine and particle swarm optimization method. In Proceedings of the 2019 7th International Conference on Cyber and IT Service Management (CITSM), Jakarta, Indonesia, 6–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; Volume 7, pp. 1–5.
26. Al-Hagree, S.; Al-Gaphari, G. Arabic Sentiment Analysis Based Machine Learning for Measuring User Satisfaction with Banking Services’ Mobile Applications: Comparative Study. In Proceedings of the 2022 2nd International Conference on Emerging Smart Technologies and Applications (eSmarTA), Ibb, Yemen, 25–26 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–4.
27. Hadwan, M.; Al-Hagery, M.; Al-Sarem, M.; Saeed, F. Arabic sentiment analysis of users’ opinions of governmental mobile applications. Comput. Mater. Contin. 2022, 72, 4675–4689.
28. Hadwan, M.; Al-Sarem, M.; Saeed, F.; Al-Hagery, M.A. An improved sentiment classification approach for measuring user satisfaction toward governmental services’ mobile apps using machine learning methods with feature engineering and SMOTE technique. Appl. Sci. 2022, 12, 5547.
29. Banjabi, D.; Almezeini, N. Customer Satisfaction Toward Commercial E-Services in Saudi Arabia: A Sentiment Analysis. In Proceedings of the 2023 International Symposium on Networks, Computers and Communications (ISNCC), Doha, Qatar, 23–26 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
30. Al-Smadi, F.; Al-Shboul, B.; Al-Darras, D.; Al-Qudah, D. Aspect-Based Sentiment Analysis of Arabic Restaurants Customers’ Reviews Using a Hybrid Approach. In Proceedings of the 14th International Conference on Management of Digital EcoSystems, Venice, Italy, 19–21 October 2022; pp. 123–128.
31. Vayansky, I.; Kumar, S.A. A review of topic modeling methods. Inf. Syst. 2020, 94, 101582.
32. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131.
33. Kang, H.J.; Kim, C.; Kang, K. Analysis of the trends in biochemical research using latent dirichlet allocation (LDA). Processes 2019, 7, 379.
34. Sarker, M.R.I.; Matin, A. A hybrid collaborative recommendation system based on matrix factorization and deep neural network. In Proceedings of the 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), Dhaka, Bangladesh, 27–28 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 371–374.
35. Rajendran, D.P.D.; Sundarraj, R.P. Using topic models with browsing history in hybrid collaborative filtering recommender system: Experiments with user ratings. Int. J. Inf. Manag. Data Insights 2021, 1, 100027.
36. Tushev, M.; Ebrahimi, F.; Mahmoud, A. Domain-specific analysis of mobile app reviews using keyword-assisted topic models. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; pp. 762–773.
37. Abuzayed, A.; Al-Khalifa, H. BERT for Arabic topic modeling: An experimental study on BERTopic technique. Procedia Comput. Sci. 2021, 189, 191–194.
38. Singh, S.; Chauhan, T.; Wahi, V.; Meel, P. Mining tourists’ opinions on popular Indian tourism hotspots using sentiment analysis and topic modeling. In Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1306–1313.
39. Sutherland, I.; Sim, Y.; Lee, S.K.; Byun, J.; Kiatkawsin, K. Topic modeling of online accommodation reviews via latent dirichlet allocation. Sustainability 2020, 12, 1821.
40. Hu, N.; Zhang, T.; Gao, B.; Bose, I. What do hotel customers complain about? Text analysis using structural topic model. Tour. Manag. 2019, 72, 417–426.
41. Kiatkawsin, K.; Sutherland, I.; Kim, J.Y. A comparative automated text analysis of airbnb reviews in Hong Kong and Singapore using latent dirichlet allocation. Sustainability 2020, 12, 6673.
42. Ali, T.; Omar, B.; Soulaimane, K. Analyzing tourism reviews using an LDA topic-based sentiment analysis approach. MethodsX 2022, 9, 101894.
43. Lobo, E.H.; Abdelrazek, M.; Frølich, A.; Rasmussen, L.J.; Islam, S.M.S.; Kensing, F.; Grundy, J. Detecting user experience issues from mHealth apps that support stroke caregiver needs: An analysis of user reviews. Front. Public Health 2023, 11, 1027667.
44. Marzijarani, S.B.; Sajedi, H. Opinion mining with reviews summarization based on clustering. Int. J. Inf. Technol. 2020, 12, 1299–1310.
45. Booth, F.; Potts, C.; Bond, R.; Mulvenna, M.D.; Ennis, E.; Mctear, M.F. Review mining to discover user experience issues in mental health and wellbeing chatbots. In Proceedings of the 33rd European Conference on Cognitive Ergonomics, New York, NY, USA, 4 October 2022; pp. 1–5.
46. Permana, M.E.; Ramadhan, H.; Budi, I.; Santoso, A.B.; Putra, P.K. Sentiment analysis and topic detection of mobile banking application review. In Proceedings of the 2020 Fifth International Conference on Informatics and Computing (ICIC), Gorontalo, Indonesia, 3–4 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
47. Moreno, A.; Iglesias, C.A. Understanding Customers’ Transport Services with Topic Clustering and Sentiment Analysis. Appl. Sci. 2021, 11, 10169.
48. Obeid, O.; Zalmout, N.; Khalifa, S.; Taji, D.; Oudah, M.; Alhafni, B.; Inoue, G.; Eryani, F.; Erdmann, A.; Habash, N. CAMeL tools: An open source python toolkit for Arabic natural language processing. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 7022–7032.
49. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 1188–1196.
50. Yan, X.; Guo, J.; Lan, Y.; Cheng, X. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 1445–1456.
51. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 21 June–18 July 1967; Volume 1, pp. 281–297.
52. Mimno, D.; Wallach, H.; Talley, E.; Leenders, M.; McCallum, A. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–31 July 2011; pp. 262–272.
53. Heinrich, G. Parameter Estimation for Text Analysis; Technical Report; Citeseer: Princeton, NJ, USA, 2005.
54. Shahapure, K.R.; Nicholas, C. Cluster quality analysis using silhouette score. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 747–748.
55. Ghazal, T.M. Performances of k-means clustering algorithm with different distance metrics. Intell. Autom. Soft Comput. 2021, 30, 735–742.
56. Mandal, A.; Chaki, R.; Saha, S.; Ghosh, K.; Pal, A.; Ghosh, S. Measuring similarity among legal court case documents. In Proceedings of the 10th Annual ACM India Compute Conference, New York, NY, USA, 16–18 November 2017; pp. 1–9.
57. Hanifi, M.; Chibane, H.; Houssin, R.; Cavallucci, D. Problem formulation in inventive design using Doc2vec and Cosine Similarity as Artificial Intelligence methods and Scientific Papers. Eng. Appl. Artif. Intell. 2022, 109, 104661.
58. Alibrahim, H.; Ludwig, S.A. Hyperparameter optimization: Comparing genetic algorithm against grid search and bayesian optimization. In Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC), Krakow, Poland, 28 June–1 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1551–1559.
59. Marinov, D.; Karapetyan, D. Hyperparameter optimisation with early termination of poor performers. In Proceedings of the 2019 11th Computer Science and Electronic Engineering (CEEC), Colchester, UK, 18–20 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 160–163.
60. Yuan, C.; Yang, H. Research on K-value selection method of K-means clustering algorithm. J 2019, 2, 226–235.
61. Salloum, S.A.; AlHamad, A.Q.; Al-Emran, M.; Shaalan, K. A survey of Arabic text mining. In Intelligent Natural Language Processing: Trends and Applications; Springer: Cham, Switzerland, 2018; pp. 417–431.
62. Dialani, P. The Future of Data Revolution will be Unstructured Data. Analytics Insight. 2020. Available online: https://www.analyticsinsight.net/insights/the-future-of-data-revolution-will-be-unstructured-data (accessed on 28 February 2023).
63. Chiche, A.; Yitagesu, B. Part of speech tagging: A systematic review of deep learning and machine learning approaches. J. Big Data 2022, 9, 10.
64. Tseng, S.C.; Lu, Y.C.; Chakraborty, G.; Chen, L.S. Comparison of sentiment analysis of review comments by unsupervised clustering of features using LSA and LDA. In Proceedings of the 2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST), Morioka, Japan, 23–25 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
65. Kumar, A.V.; Meera, K. Sentiment Analysis Using K Means Clustering on Microblogging Data Focused on Only the Important Sentiments. In Proceedings of the 2022 10th International Conference on Emerging Trends in Engineering and Technology-Signal and Information Processing (ICETET-SIP-22), Nagpur, India, 29–30 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5.
66. Heriswan, D.D.; Sari, Y.A.; Furqon, M. Clustering Public Opinions Related to Quarantine during COVID-19 on Twitter Using K-DENCLUE Algorithms. In Proceedings of the 6th International Conference on Sustainable Information Engineering and Technology, Malang, Indonesia, 13–14 September 2021; pp. 252–257.
67. Jacob, S.S.; Vijayakumar, R. Sentimental analysis over twitter data using clustering based machine learning algorithm. J. Ambient. Intell. Humaniz. Comput. 2021, 1–12.
68. Alhawarat, M.O.; Abdeljaber, H.; Hilal, A. Effect of stemming on text similarity for Arabic language at sentence level. PeerJ Comput. Sci. 2021, 7, e530.
69. Atwan, J.; Wedyan, M.; Bsoul, Q.; Hammadeen, A.; Alturki, R. The use of stemming in the Arabic text and its impact on the accuracy of classification. Sci. Program. 2021, 2021, 1367210.

Figure 1. The Arabic Super App User Experience Framework (ASA-UXF).

Figure 2. Sentiment distribution of the sample dataset.

Figure 3. Sentiment distribution of the Mrsool dataset.

Figure 4. The most used words in the Mrsool dataset: (a) word frequencies and (b) word cloud.

Figure 5. Monthly number of reviews combined with users’ sentiments for Mrsool.

Figure 6. The most used words for (a) positive, (b) negative, and (c) neutral sentiment in Mrsool.

Figure 7. The extracted (a) topic 0, (b) topic 1, (c) topic 2, and (d) topic 3 using BTM.

Figure 8. Topic distribution in the Mrsool dataset.

Figure 9. Topic-based sentiment analysis of the Mrsool dataset.

Figure 10. The elbow method for (a) topic 0, (b) topic 1, (c) topic 2, and (d) topic 3.

Figure 11. Clustering topics using (a) Euclidean distance and (b) cosine similarity.

Figure 12. Clustering topics using (a) POS with Euclidean distance and (b) POS with cosine similarity.

Figure 13. Topics’ distribution in the Careem dataset.

Figure 14. Topic-based sentiment analysis of the Careem dataset.

Table 1. Summary of machine learning approaches to analyze user experience.

Ref	Year	Methodology	Environment	Super App	Country	Language	Results	Limitations
[24]	2022	Sentiment analysis	Gov2Go app	Yes	Indonesia	English	97% for LSTM, 94.22% for NB	Data only collected from the Google Play Store
[25]	2019	Sentiment analysis	Ovo app	Yes	Indonesia	Indonesian	82.89% for SVM	Small dataset
[26]	2022	Sentiment analysis	Eight banking services’ applications	No	Yemen	Arabic	89.65% for NB	Data only collected from the Google Play Store, using an unbalanced dataset
[27]	2022	Sentiment analysis	Tawakkalna, Tetaman, Tabaud, Sehhaty, Mawid, and Sehhah apps	Yes	Saudi Arabia	Arabic	78.46%	Using an unbalanced dataset
[28]	2022	Sentiment analysis	Tawakkalna, Tetaman, Tabaud, Sehhaty, Mawid, and Sehhah apps	Yes	Saudi Arabia	Arabic converted to English	94.38% for SVM	Using Google Translate to translate Arabic reviews
[29]	2023	Sentiment analysis	Saudi Ministry of Commerce’s X account and Commercial report app	No	Saudi Arabia	Arabic	94% for SVM	Unbalanced dataset
[30]	2022	Aspect-based sentiment analysis	Jeeran restaurant’s website	No	Jordan	Arabic	84.47% for SVM	Unbalanced dataset
[38]	2021	Lexicon-based sentiment analysis and topic modeling	TripAdvisor website	No	India	English	Different sentiment scores were obtained based on star rating and location; different numbers of topics were calculated for ten Indian locations	Uses the AFINN lexicon, which relies on simple text classifications and contains too few emotional words, making it harder to detect negative reviews
[39]	2020	Topic modeling	Online Travel Agencies	No	South Korea	English	14 topics for LDA	Includes specific accommodation types in certain locations; the cultural differences in the country’s accommodation types prevent the generalization of the research results to other countries
[40]	2019	Topic modeling	TripAdvisor website	No	New York	English	10 topics for STM	Using an old dataset for one city
[41]	2020	Topic modeling	Airbnb website	No	Singapore and Hong Kong	English	5 topics for Singapore, 12 topics for Hong Kong using LDA	Using topic modeling alone cannot capture users’ sentiments
[42]	2022	Lexicon-based sentiment analysis and topic modeling	TripAdvisor website	No	Morocco	English	77.3%, 4 topics for LDA	Using 12-year-old data
[44]	2020	Clustering	TripAdvisor website	No	Seattle and Manhattan	English	5 clusters for GMM	Using clustering alone cannot capture users’ sentiments
[46]	2020	Topic modeling, sentiment analysis, and clustering	Mobile banking app	No	Indonesia	Indonesian	5 topics for LDA, 86.76% for NB, 20 clusters for k-means	The visualization showed overlapping between extracted topics, indicating the extraction of similar topics
[47]	2021	Topic modeling and clustering	Uber customer service’s X account	No	Not mentioned	English	7 topics for LDA, 7 clusters for k-means	This study does not provide a geographical analysis, although users’ opinions differ from one region to another

Table 2. Example reviews from the collected dataset whose star ratings are misleading, with the sentiment assigned by the CAMeL tool.

Translated Review	Rating	Sentiment
We thank you all for the excellent, fast, and clean service	3	positive
Thanks to him	1	positive
My order was for 99 riyals, and the delivery offer was for 1 riyal. After the order was completed, my account became 115. I did not like manipulation and the last time I ordered from Mrsool	5	negative
The delivery price is very high	5	negative
The worst app, their prices are high and the food is delivered cold	5	negative
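
The disagreement that Table 2 illustrates can be checked mechanically. The sketch below flags reviews whose star rating implies a different polarity from the analyzer's label. The rows are copied from Table 2, but the rating-to-sentiment thresholds in `rating_to_sentiment` are an assumption made for illustration, not a rule taken from this study.

```python
# Rows copied from Table 2: (translated review, star rating, analyzer sentiment).
TABLE_2 = [
    ("We thank you all for the excellent, fast, and clean service", 3, "positive"),
    ("Thanks to him", 1, "positive"),
    ("My order was for 99 riyals, and the delivery offer was for 1 riyal. "
     "After the order was completed, my account became 115. I did not like "
     "manipulation and the last time I ordered from Mrsool", 5, "negative"),
    ("The delivery price is very high", 5, "negative"),
    ("The worst app, their prices are high and the food is delivered cold", 5, "negative"),
]


def rating_to_sentiment(rating):
    """Hypothetical mapping: 4-5 stars positive, 1-2 stars negative, 3 neutral."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


# Keep every review where the rating-implied polarity differs from the label.
mismatches = [
    (text, rating, sentiment)
    for text, rating, sentiment in TABLE_2
    if rating_to_sentiment(rating) != sentiment
]

for text, rating, sentiment in mismatches:
    print(f"{rating} stars but labelled {sentiment}: {text[:40]}")
```

Under this assumed mapping, every example in Table 2 is flagged, which is consistent with the table's purpose of showing that star ratings alone can mislead.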

Table 3. Comparison between a set of stemming and lemmatization types.

Reviews	Reviews Translated into English	Tokenize Reviews into Words	Arabic Light Stemmer	Snowball Stemmer	ISRI Stemmer	Farasa Lemmatization	Qalsadi Lemmatizer
	I liked the delivery, it was very fast
	The food always arrives cold
	Drivers are slow in delivering orders

Table 4. Sample of the reviews before and after the preprocessing steps.

Original Reviews	Reviews Translated into English	Preprocessed Reviews
	The new update is bad. Please reconsider and return the application as it was before
	As they like changed the order submission system and make it automatic
	What is the new update, everything is pending

Table 5. Finding the optimal topic number using the grid search method.

Number of Topics	Alpha	Beta	Iteration	Coherence Score
3	0.1	0.01	1000	−104.60
4	0.3	0.007	500	−103.24
5	10	0.007	500	−105.68
6	1	0.01	1000	−105.93
7	0.3	0.01	1000	−109.06
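
The selection step implied by Table 5 can be sketched in a few lines. The rows are copied from the table. The sketch assumes that a higher (less negative) coherence score indicates a better model, which is consistent with the four topics reported in Table 6. The `GridResult` type is a hypothetical helper introduced only for this illustration.

```python
from typing import NamedTuple


class GridResult(NamedTuple):
    """One row of Table 5 (hypothetical container for this illustration)."""
    num_topics: int
    alpha: float
    beta: float
    iterations: int
    coherence: float


TABLE_5 = [
    GridResult(3, 0.1, 0.01, 1000, -104.60),
    GridResult(4, 0.3, 0.007, 500, -103.24),
    GridResult(5, 10, 0.007, 500, -105.68),
    GridResult(6, 1, 0.01, 1000, -105.93),
    GridResult(7, 0.3, 0.01, 1000, -109.06),
]

# Assumption: the best configuration is the one with the highest coherence.
best = max(TABLE_5, key=lambda row: row.coherence)
print(best)
```

With the values in Table 5, this picks the four-topic configuration (alpha 0.3, beta 0.007, 500 iterations).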

Table 6. The four extracted topics from the Mrsool dataset using BTM topic modeling.

Topic Number	Topic Name	Topic Description
0	Delivery and Payment	Users expressed delivery and payment issues.
1	Customer support and updates	Users expressed problems with updating and difficulties that they face in contacting technical and customer support.
2	Prices	Users expressed their views on delivery prices concerning location.
3	Application	Users expressed their opinions in general regarding the Mrsool app.

Table 7. Experimental result of the Mrsool dataset using k-means within Doc2Vec.

Topic Number	Topic Name	Evaluation Metric	Silhouette Score of Reviews Using Doc2Vec and K-Means
0	Delivery and Payment	Euclidean distance	0.68
0	Delivery and Payment	Cosine similarity	0.63
1	Customer support and updates	Euclidean distance	0.72
1	Customer support and updates	Cosine similarity	0.67
2	Prices	Euclidean distance	0.69
2	Prices	Cosine similarity	0.38
3	Application	Euclidean distance	0.76
3	Application	Cosine similarity	0.63
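
For readers unfamiliar with the metric in Table 7, the following pure-Python sketch shows how a silhouette score can be computed under either Euclidean or cosine distance. It is not the code used in this study. The four toy vectors and their cluster labels are hypothetical stand-ins for Doc2Vec review vectors and k-means assignments. For each point, `a` is the mean distance to its own cluster and `b` is the mean distance to the nearest other cluster; the point's score is `(b - a) / max(a, b)`.

```python
from math import dist, sqrt


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def silhouette_score(vectors, labels, distance):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, v in enumerate(vectors):
        # Mean distance from point i to each cluster's members, excluding itself.
        mean_to = {}
        for c in clusters:
            ds = [distance(v, w) for j, w in enumerate(vectors)
                  if labels[j] == c and j != i]
            if ds:
                mean_to[c] = sum(ds) / len(ds)
        own = labels[i]
        if own not in mean_to:
            scores.append(0.0)  # singleton cluster: conventionally scored 0
            continue
        a = mean_to[own]
        b = min(m for c, m in mean_to.items() if c != own)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of two vectors each.
VECTORS = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
LABELS = [0, 0, 1, 1]

euclidean_score = silhouette_score(VECTORS, LABELS, dist)
cosine_score = silhouette_score(VECTORS, LABELS, cosine_distance)
print(f"Euclidean: {euclidean_score:.2f}, cosine: {cosine_score:.2f}")
```

Scores close to 1 indicate compact, well-separated clusters. A high score does not by itself show that the clusters correspond to positive and negative sentiment.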

Table 8. Experimental result of the tagged dataset using k-means within Doc2Vec.

Topic Number	Topic Name	Evaluation Metric	The Result before the Optimization	The Result after the Optimization
0	Delivery and Payment	Euclidean distance	0.68	0.68
0	Delivery and Payment	Cosine similarity	0.63	0.73
1	Customer support and updates	Euclidean distance	0.72	0.72
1	Customer support and updates	Cosine similarity	0.67	0.74
2	Prices	Euclidean distance	0.69	0.70
2	Prices	Cosine similarity	0.38	0.58
3	Application	Euclidean distance	0.76	0.74
3	Application	Cosine similarity	0.63	0.46
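
The effect of the optimization in Table 8 is easier to read as a change per row. The short sketch below computes that change from the values copied from the table. The rounding to two decimals matches the precision of the reported scores, and the variable names are illustrative only.

```python
# Rows copied from Table 8: (topic number, distance metric, before, after).
TABLE_8 = [
    (0, "Euclidean distance", 0.68, 0.68),
    (0, "Cosine similarity", 0.63, 0.73),
    (1, "Euclidean distance", 0.72, 0.72),
    (1, "Cosine similarity", 0.67, 0.74),
    (2, "Euclidean distance", 0.69, 0.70),
    (2, "Cosine similarity", 0.38, 0.58),
    (3, "Euclidean distance", 0.76, 0.74),
    (3, "Cosine similarity", 0.63, 0.46),
]

# Change in silhouette score per row, rounded to the table's precision.
deltas = [
    (topic, metric, round(after - before, 2))
    for topic, metric, before, after in TABLE_8
]

improved = [(topic, metric) for topic, metric, delta in deltas if delta > 0]
worsened = [(topic, metric) for topic, metric, delta in deltas if delta < 0]

for topic, metric, delta in deltas:
    print(f"topic {topic}, {metric}: {delta:+.2f}")
```

On these values, the cosine-based scores rise for topics 0, 1, and 2, while both scores for topic 3 fall after the optimization.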

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alrayani, B.; Kalkatawi, M.; Abulkhair, M.; Abukhodair, F. From Customer’s Voice to Decision-Maker Insights: Textual Analysis Framework for Arabic Reviews of Saudi Arabia’s Super App. Appl. Sci. 2024, 14, 6952. https://doi.org/10.3390/app14166952
