1. Introduction
Volatile organic compounds (VOCs) are chemical pollutants that are frequently encountered in the air we breathe [1]. VOCs can be classified according to their volatility. The World Health Organization (WHO) classifies indoor pollutants into three categories: very volatile organic compounds (VVOCs), VOCs, and semi-volatile organic compounds (SVOCs). VVOCs are characterized by extremely high volatility, so they are found predominantly in the air in gaseous form rather than bound to materials or surfaces, and this high volatility makes them difficult to measure accurately. SVOCs, on the other hand, are less volatile and are more likely to be found in materials or on surfaces than in the air. In indoor environments, the concentration of VVOCs is therefore typically higher than that of SVOCs. Although VVOCs are difficult to measure accurately, they play a significant role in indoor air quality. Personal exposure to VOCs, which are a significant health concern, has been evaluated by the U.S. Environmental Protection Agency (EPA) in studies conducted in various U.S. cities. In general, the EPA’s research underscores the importance of understanding the different VOC categories according to their volatility levels. Attempts to model the concentrations and ratios of these compounds are therefore common in statistical science, and it is important to model them well. Beyond this application, the evaluation of fractions is essential in a wide range of disciplines, from engineering to healthcare. Distributions supported on the unit interval are the traditional tools for modeling the behavior of such random variables. Because the measurements are restricted to the support (0, 1), however, model selection should look beyond the traditional models to find one whose assumptions suit the data.
In this regard, researchers have proposed several statistical models for phenomena on the support (0, 1). The foundational work in this field is the beta distribution, introduced by Bayes [2], which models these types of data. Subsequently, a variety of unit interval distributions have been explored, such as the unit distribution [3], the unit Johnson distribution [4], the four-parameter unit interval distribution [5], the double bounded Kumaraswamy distribution [6], the Topp-Leone distribution by Topp and Leone [7], and the unit gamma distribution by Consul and Jain [8]. The alpha-unit model [9], which is based on the standard normal distribution, is another contemporary unit distribution.
It is worth noting that the models mentioned above were constructed using several transformation methodologies: the cumulative distribution function and quantile transformation methodology, reciprocal transformation, exponentiation, the conditional distribution methodology, the T-X family approach, and the epsilon function studied by Dombi et al. [10]. Using these strategies, researchers have developed only a limited number of unit interval distributions. To account for the vast range of data found in nature and measured within the unit interval, additional models are required. In recent years, researchers have introduced further distributions for phenomena that take values on a bounded support, thereby enriching the statistical literature. These include the quantile distribution by Smithson and Shou [11]; the unit log-X gamma distribution by Altun and Hamedani [12]; the unit interval distribution by Nakamura et al. [13]; the unit inverse Gaussian distribution by Ghitany et al. [14]; the unit-Gompertz, unit-Lindley, and unit-Weibull distributions by Mazucheli et al. [15,16,17]; and the unit Johnson distribution by Gündüz et al. [18]. Additionally, Altun [19] studied the log-weighted exponential distribution, Biswas and Chakraborty [20] proposed a new method for constructing unit interval distributions and derived a number of unit distributions, Afify et al. [21] worked on a unit interval distribution, Korkmaz and Korkmaz [22] studied the unit log-log distribution, Fayomi et al. [23] studied the unit-power Burr X distribution, Krishna et al. [24] studied the unit Teissier distribution using a conditional distribution approach, and Bakouch et al. [25] derived the unit exponential distribution. These recent models have become important alternatives to the beta distribution, a popular tool for modeling observations measured in the unit interval.
In this study, we derive a unit Maxwell-Boltzmann distribution (UMBD) from the Maxwell-Boltzmann distribution (MBD) by adopting the exponential transformation methodology. We then use the UMBD to model and analyze pollutant concentration data, which are the primary focus of this study.
The rest of this paper is organized as follows: Section 2 provides an overview of the Maxwell-Boltzmann distribution. Section 3 derives the unit Maxwell-Boltzmann distribution and investigates several of its characteristics, such as the survival function, hazard rate function (hrf), reversed hazard rate function (rhrf), moment-generating function, and its characterization. Section 4 investigates various estimators of the unit Maxwell-Boltzmann distribution’s parameter. A Monte Carlo simulation study is described in Section 5 to evaluate the efficiency of the estimators provided in this paper. In Section 6, several examples of data modeling using the UMBD are provided to evaluate its performance on real-world data and its effectiveness compared to alternatives. The conclusion of the work is provided in Section 7.
3. Unit Maxwell-Boltzmann Distribution and Its Principal Characteristics
The purpose of this section is to derive the pdf and cdf of the UMBD. We also examine key statistical properties of the UMBD, such as the hrf, rhrf, moments, moment-generating function, characteristic function, skewness and kurtosis coefficients, mean, variance, and median.
Definition 2. Let a random variable follow the MBD with parameter θ. Then, the pdf of the transformed random variable Y is given by Equation (4) and the corresponding cdf by Equation (5), where θ is a positive real-valued parameter. The pdf and cdf of the UMBD are plotted in Figure 2 to show the shape of the model for a range of θ values. As Figure 2 shows, the UMBD is a useful model for left- or right-skewed data defined on the support (0, 1). Next, we examine the UMBD’s characteristics, including its survival function, hrf, rhrf, mode, moments, variance, kurtosis and skewness coefficients, and characteristic and moment-generating functions.
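For readers who want to experiment numerically, the following Python sketch encodes one plausible form of the UMBD. It rests on two assumptions that the text above does not confirm: that the MBD uses the common scale parameterization with pdf sqrt(2/π) x² exp(−x²/(2θ²))/θ³ for x > 0, and that the exponential transformation is Y = exp(−X). The function names `umbd_pdf` and `umbd_cdf` are illustrative only.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def umbd_pdf(y, theta):
    """Assumed UMBD density: Y = exp(-X), X ~ Maxwell-Boltzmann with scale theta."""
    if not 0.0 < y < 1.0:
        return 0.0
    x = -math.log(y)  # back-transform to the Maxwell-Boltzmann scale
    return SQRT_2_OVER_PI * x * x * math.exp(-x * x / (2.0 * theta ** 2)) / (theta ** 3 * y)


def umbd_cdf(y, theta):
    """Assumed UMBD cdf: P(Y <= y) = P(X >= -log y)."""
    if y <= 0.0:
        return 0.0
    if y >= 1.0:
        return 1.0
    z = -math.log(y) / theta
    # Maxwell-Boltzmann survival function, written with erfc for accuracy in the tail.
    return math.erfc(z / math.sqrt(2.0)) + SQRT_2_OVER_PI * z * math.exp(-z * z / 2.0)


if __name__ == "__main__":
    # Midpoint-rule check that the assumed density integrates to about one on (0, 1).
    m = 20000
    total = sum(umbd_pdf((k + 0.5) / m, 1.0) for k in range(m)) / m
    print("integral of pdf:", total)
    print("cdf at 0.5:", umbd_cdf(0.5, 1.0))
```

If the authors' parameterization of the MBD differs, only the two formulas inside the functions would need to change; the numerical check stays the same.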
Assume Y is a UMBD-distributed random variable with parameter θ. When pdf (4) and cdf (5) are taken into account, the UMBD’s survival function on (0, 1) is one minus the cdf (5). By the definition of the hrf, the hrf of the UMBD, given in Equation (7), is the ratio of the pdf (4) to the survival function, and the reversed hazard rate of Y is the ratio of the pdf (4) to the cdf (5). The hrf and rhrf of the UMBD are plotted in Figure 3 for various values of the parameter θ to illustrate their behavior. Figure 3 shows that the UMBD’s hrf takes bathtub or increasing forms, depending on the value of θ, while its reversed hazard rate decreases for all values of θ.
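The hazard quantities can be explored with a short Python sketch. It assumes, without confirmation from the text, that the UMBD arises as Y = exp(−X) with X following a Maxwell-Boltzmann distribution with scale θ; all function names are illustrative. The hrf is computed as pdf divided by survival and the rhrf as pdf divided by cdf, which are the general definitions.

```python
import math

C = math.sqrt(2.0 / math.pi)


def pdf(y, theta):
    """Assumed UMBD density for 0 < y < 1."""
    x = -math.log(y)
    return C * x * x * math.exp(-x * x / (2.0 * theta ** 2)) / (theta ** 3 * y)


def cdf(y, theta):
    """Assumed UMBD cdf: the Maxwell-Boltzmann survival function at -log(y)."""
    z = -math.log(y) / theta
    return math.erfc(z / math.sqrt(2.0)) + C * z * math.exp(-z * z / 2.0)


def survival(y, theta):
    """S(y) = 1 - F(y): the Maxwell-Boltzmann cdf at -log(y)."""
    z = -math.log(y) / theta
    return math.erf(z / math.sqrt(2.0)) - C * z * math.exp(-z * z / 2.0)


def hrf(y, theta):
    """Hazard rate: pdf divided by survival function."""
    return pdf(y, theta) / survival(y, theta)


def rhrf(y, theta):
    """Reversed hazard rate: pdf divided by cdf."""
    return pdf(y, theta) / cdf(y, theta)


if __name__ == "__main__":
    ys = (0.1, 0.3, 0.5, 0.7, 0.9)
    for theta in (0.5, 1.0, 3.0):
        print("theta =", theta)
        print("  hrf :", [round(hrf(y, theta), 3) for y in ys])
        print("  rhrf:", [round(rhrf(y, theta), 3) for y in ys])
```

Printing both functions on a coarse grid for a few θ values gives a quick way to compare the shapes against Figure 3.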
Proposition 1. If the random variable Y follows the UMBD with parameter θ, then H(y) satisfies
Proof. Given that Y is a continuous random variable and considering the general definition of the hrf and Equation (7), one can promptly write
Hence, the proposition statement follows. □
Proposition 2. Suppose a random variable Y follows the UMBD with parameter θ, and H(y) is its hrf given by Equation (7). Then, the following equation holds.
Proof. Necessity: Given that Y follows the UMBD with the probability density function defined by Equation (4), one can express the logarithm of this pdf as follows:
If the above equality is differentiated with respect to y, the following is obtained:
Using this equation and considering Proposition 1, we can write
After some simplification, this equation is reduced to Equation (11).
Sufficiency: Given that Equation (11) holds, upon integration, we can express it as
If Equation (16) is integrated from 0 to y, it is obtained that
which, after simplification and under the accompanying conditions, yields the cdf. Thus, the proof is finished and the resulting function is verified as the cdf of the UMBD. □
Lemma 3. Suppose r is an integer. The rth raw moment of the UMBD is
Proof. Considering pdf (4) and following the definition of the rth moment, we have
Using Equation (20), the first two raw moments of the UMBD can be expressed as
Thus, the variance of the UMBD is
Furthermore, the moment generating function and the characteristic function of the UMBD are as follows:
and
respectively. □
Proposition 4. The mode of UMBD is .
Proof. By considering the pdf of the UMBD, we have
Considering
for
, we can write from (26) as
Hence, solving Equation (27) with respect to y yields the mode of the UMBD stated in the proposition. □
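As a numerical illustration of the mode, the sketch below assumes (this is not confirmed by the text) that the UMBD is the distribution of Y = exp(−X) with X Maxwell-Boltzmann with scale θ. Under that assumption, setting the derivative of the log-density to zero gives the quadratic (log y)² + θ² log y − 2θ² = 0, whose negative root leads to the candidate mode exp(−θ(θ + sqrt(θ² + 8))/2). This expression is derived here and may differ from the one in Proposition 4 if the authors use another parameterization; the code compares it with a brute-force grid search.

```python
import math


def umbd_log_pdf(y, theta):
    """Log-density of the assumed UMBD form, for 0 < y < 1."""
    x = -math.log(y)
    return (0.5 * math.log(2.0 / math.pi) - 3.0 * math.log(theta)
            + 2.0 * math.log(x) - x * x / (2.0 * theta ** 2) + x)


def umbd_mode(theta):
    """Candidate closed-form mode under the assumed parameterization."""
    return math.exp(-theta * (theta + math.sqrt(theta ** 2 + 8.0)) / 2.0)


def grid_mode(theta, m=100000):
    """Brute-force maximizer of the log-density over a midpoint grid on (0, 1)."""
    grid = ((k + 0.5) / m for k in range(m))
    return max(grid, key=lambda y: umbd_log_pdf(y, theta))


if __name__ == "__main__":
    for theta in (0.5, 1.0, 2.0):
        print(theta, umbd_mode(theta), grid_mode(theta))
```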
4. Inference
The goal of this section of the study is to investigate the solution to the problem of estimating the unknown parameter of the UMBD. Here, maximum likelihood, least-squares, weighted-least-squares, maximum spacing, and moments estimation techniques are employed to accomplish this aim.
Let Y_1, ..., Y_n be an independent and identically distributed sample from the UMBD with a one-dimensional parameter θ, and let y_1, ..., y_n represent a realization of it. The likelihood function of θ is
and the logarithmic likelihood function is
The maximum likelihood estimator (MLE) of θ can be obtained by differentiating Equation (29) with respect to θ and setting the resulting derivative equal to zero. The derivative of Equation (29) with respect to θ is
Hence, solving Equation (30) gives the MLE of the parameter θ explicitly.
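The likelihood steps can be tried out with the following Python sketch. It assumes, without confirmation from the text, that the UMBD is the law of Y = exp(−X) with X Maxwell-Boltzmann with scale θ. Under that assumption the score equation −3n/θ + Σ(log y_i)²/θ³ = 0 has the explicit solution sqrt(Σ(log y_i)²/(3n)); this may not match the authors' expression if their parameterization differs. The helper names `sample_umbd` and `umbd_mle` are hypothetical.

```python
import math
import random


def sample_umbd(n, theta, rng):
    """Draw Y = exp(-X), where X = theta * sqrt(Z1^2 + Z2^2 + Z3^2) is Maxwell-Boltzmann."""
    out = []
    for _ in range(n):
        x = theta * math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
        out.append(math.exp(-x))
    return out


def umbd_mle(ys):
    """Closed-form maximizer of the assumed log-likelihood: sqrt(sum(log(y)^2) / (3n))."""
    return math.sqrt(sum(math.log(y) ** 2 for y in ys) / (3.0 * len(ys)))


if __name__ == "__main__":
    rng = random.Random(2024)
    data = sample_umbd(5000, 1.0, rng)
    print("MLE of theta:", umbd_mle(data))
```

With a large simulated sample the estimate should land close to the θ used for simulation, which gives a quick sanity check of the assumed form.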
The least-squares estimator (LSE) of θ can be obtained by minimizing the utility function
with respect to the parameter θ, where the measurements enter in ascending order (as order statistics). For further information about the least squares estimation methodology, see Günay & Yilmaz [27]. Here, the utility function includes nonlinear functions and has to be minimized using a numerical method. We use the “fmincon” routine of Octave [28] to minimize it.
Similarly to the LSE, the weighted LSE (WLSE) of θ is obtained by minimizing the utility function
To construct the maximum product of spacings estimator (MPSE) of the parameter θ, we followed the steps in [27] and obtained the utility function
Subsequently, this objective function is numerically maximized with respect to θ to yield the MPSE of θ. To maximize the function, one can use the “fmincon” function of Octave [28].
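The three numerical estimators can be sketched in Python as follows. Several things here are assumptions rather than the authors' method: the UMBD cdf is taken to be that of Y = exp(−X) with X Maxwell-Boltzmann with scale θ; the objective functions use the textbook forms (squared distance between the cdf at the i-th order statistic and i/(n+1), the weights (n+1)²(n+2)/(i(n−i+1)) for the WLSE, and the mean log-spacing of the cdf for the MPSE); and a plain golden-section search on a hypothetical bracket [0.1, 5] stands in for the “fmincon” routine.

```python
import math
import random

C = math.sqrt(2.0 / math.pi)


def umbd_cdf(y, theta):
    """Assumed UMBD cdf for 0 < y <= 1."""
    z = -math.log(y) / theta
    return math.erfc(z / math.sqrt(2.0)) + C * z * math.exp(-z * z / 2.0)


def sample_umbd(n, theta, rng):
    """Draw Y = exp(-X) with X Maxwell-Boltzmann (norm of three scaled normals)."""
    return [math.exp(-theta * math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))))
            for _ in range(n)]


def golden_min(f, lo, hi, tol=1e-6):
    """Golden-section search for a minimizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def lse_objective(theta, ys_sorted):
    n = len(ys_sorted)
    return sum((umbd_cdf(y, theta) - i / (n + 1.0)) ** 2
               for i, y in enumerate(ys_sorted, start=1))


def wlse_objective(theta, ys_sorted):
    n = len(ys_sorted)
    return sum((n + 1.0) ** 2 * (n + 2.0) / (i * (n - i + 1.0))
               * (umbd_cdf(y, theta) - i / (n + 1.0)) ** 2
               for i, y in enumerate(ys_sorted, start=1))


def neg_mps_objective(theta, ys_sorted):
    """Negative mean log-spacing, so that minimizing it maximizes the MPS criterion."""
    vals = [0.0] + [umbd_cdf(y, theta) for y in ys_sorted] + [1.0]
    spacings = (max(v - u, 1e-300) for u, v in zip(vals, vals[1:]))  # guard against ties
    return -sum(math.log(s) for s in spacings) / (len(ys_sorted) + 1.0)


def fit(objective, ys, lo=0.1, hi=5.0):
    """Minimize the chosen objective over theta in the bracket [lo, hi]."""
    ys_sorted = sorted(ys)
    return golden_min(lambda t: objective(t, ys_sorted), lo, hi)


if __name__ == "__main__":
    data = sample_umbd(500, 1.0, random.Random(7))
    for name, obj in (("LSE", lse_objective), ("WLSE", wlse_objective),
                      ("MPSE", neg_mps_objective)):
        print(name, fit(obj, data))
```

Golden-section search is used only because it needs no external packages; any bounded scalar optimizer would serve the same purpose, provided the objective is unimodal on the bracket.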
Based on n observations and the first moment of the UMBD given by Equation (20), an estimator of θ based on the method of moments can be formally derived by equating the first moment to the sample mean and solving the resulting equation with respect to θ. Thus, the method of moments estimator (MOME) of the parameter θ is obtained analytically.
5. Simulation Study
In this section, the performance of the MLE, LSE, WLSE, MPSE, and MOME estimators is numerically assessed. The mean squared error (MSE) and bias are used as criteria to compare these estimators; both are computed as averages over the repetitions of the simulation.
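A Monte Carlo evaluation of bias and MSE can be sketched as follows. The sketch uses the usual definitions, bias as the average of (estimate − θ) and MSE as the average of (estimate − θ)² over the repetitions. It also assumes, without confirmation from the text, that the UMBD is the law of Y = exp(−X) with X Maxwell-Boltzmann with scale θ, and it uses the closed-form maximum likelihood estimate implied by that assumption as the example estimator. All function names are hypothetical.

```python
import math
import random


def sample_umbd(n, theta, rng):
    """Draw Y = exp(-X) with X Maxwell-Boltzmann (norm of three scaled normals)."""
    return [math.exp(-theta * math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))))
            for _ in range(n)]


def umbd_mle(ys):
    """MLE under the assumed parameterization: sqrt(sum(log(y)^2) / (3n))."""
    return math.sqrt(sum(math.log(y) ** 2 for y in ys) / (3.0 * len(ys)))


def bias_and_mse(estimator, theta, n, reps, rng):
    """Monte Carlo bias = mean(est - theta) and MSE = mean((est - theta)^2)."""
    errors = [estimator(sample_umbd(n, theta, rng)) - theta for _ in range(reps)]
    bias = sum(errors) / reps
    mse = sum(e * e for e in errors) / reps
    return bias, mse


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (50, 100, 500, 1000):
        print(n, bias_and_mse(umbd_mle, 1.0, n, 1000, rng))
```

Any other estimator with the same call signature (a list of observations in, a number out) can be passed to `bias_and_mse` in place of `umbd_mle`.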
Table 1, Table 2, Table 3 and Table 4 exhibit the results of the simulations, based on 1000 replicates, for sample sizes of n = 50, 100, 500, and 1000.
Based on the results provided in Table 1, Table 2, Table 3 and Table 4, it can be concluded that all estimators perform well in estimating the parameter θ, with sufficiently small bias and MSE values. Additionally, we ran another simulation to examine how the estimators perform asymptotically. We set the value of the parameter θ to 1 without loss of generality. The bias and MSE values of each estimator are calculated for different sample sizes n from 1000 simulations. The simulated results are displayed in Figure 4. One can see from Figure 4 that the bias and MSE values of the estimators decrease as the sample size n increases. Therefore, the simulation results suggest that all estimators are consistent and asymptotically unbiased.
6. Applications
Air pollution (Ap) is a multifaceted environmental problem that comes from various sources. The main sources of Ap include factories, refineries, vehicle emissions, energy production plants (especially those that rely on coal or oil), and other activities that release pollutants into the atmosphere [29]. Many of these pollutants are also sources of greenhouse gas emissions.
The health effects of Ap are significant; long-term exposure is linked to chronic conditions such as asthma, cardiovascular diseases, and premature death [30]. Major pollutants that cause such health problems include VOCs. In addition to affecting human health, Ap also affects agricultural productivity by disrupting plant biochemical reactions and causing soil degradation through acid rain [31].
In general, Ap can be considered a global problem. Comprehensive strategies that reduce the sources of Ap offer a win-win: they lessen its effects on health and the environment while increasing general well-being and sustainability, thereby contributing to environmental health and quality of life in the long term.
We studied four environmental datasets to model the concentrations of harmful air pollutants, such as carbon monoxide (CO), sulfate particles, and benzo(a)pyrene, that are monitored on a continuous basis. Concentrations of these pollutants are reported once every hour, 24 h a day, 365 days a year.
6.1. Competing Models
In this section, we compare the fit of the UMBD with those of the unit Topp-Leone (UTL), unit log-Lindley (ULL), unit log-weighted exponential (ULWE), and unit Kumaraswamy (UKw) distributions, all of which are used for modeling bounded data. To reveal the potential of the UMBD, these models are compared through analyses conducted on four environmental datasets. The pdfs of the competing models are given in Table 5.
To determine which distributions can model the relevant dataset, the Anderson-Darling, Cramér-von Mises, and Kolmogorov-Smirnov (KS) tests were used. Additionally, to identify the distribution that best models the data among the candidate models, information criteria, namely the Akaike Information Criterion (AIC), Corrected Akaike Information Criterion (AICc), Bayesian Information Criterion (BIC), and Hannan-Quinn Information Criterion (HQIC), were examined.
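The four information criteria can be computed from a fitted model's maximized log-likelihood with the short Python sketch below. It uses the standard textbook definitions, since the exact expressions used by the authors are not shown here; `loglik` is the maximized log-likelihood, `k` the number of estimated parameters, and `n` the sample size, and the function name is illustrative.

```python
import math


def information_criteria(loglik, k, n):
    """Standard AIC, AICc, BIC and HQIC; smaller values indicate a better trade-off."""
    aic = 2.0 * k - 2.0 * loglik
    aicc = aic + 2.0 * k * (k + 1.0) / (n - k - 1.0)  # small-sample correction
    bic = k * math.log(n) - 2.0 * loglik
    hqic = 2.0 * k * math.log(math.log(n)) - 2.0 * loglik
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "HQIC": hqic}


if __name__ == "__main__":
    # Hypothetical values: a one-parameter model fitted to 20 observations.
    print(information_criteria(loglik=10.0, k=1, n=20))
```

Because all four criteria share the term −2 × log-likelihood, they differ only in how strongly they penalize the number of parameters, which matters when a one-parameter model such as the UMBD is compared with two-parameter competitors.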
6.2. Datasets
6.2.1. Dataset I
Carbon monoxide (CO) is a colorless, odorless gas that can have toxic effects on the human body. CO can originate from various sources, both natural and anthropogenic. Common sources include fires, vehicle exhaust, gasoline-powered engines, and fossil fuel heating systems. High levels of CO pose serious risks to human health: they can exacerbate symptoms of heart disease, leading to issues like chest pain, and may cause vision problems and reduce physical and mental capabilities in otherwise healthy individuals. The first dataset contains the concentration of the air pollutant CO in Alberta, Canada, measured at the Edmonton Central (downtown) Monitoring Unit (EDMU) station during 1995. Measurements are listed in Myrick [34] for the period 1976–1995 as 0.19, 0.20, 0.20, 0.27, 0.30, 0.37, 0.30, 0.25, 0.23, 0.23, 0.26, 0.23, 0.19, 0.21, 0.20, 0.22, 0.21, 0.25, 0.25, and 0.19.
Table 6 shows that the observed data are positively skewed and leptokurtic. In this regard, we studied the goodness-of-fit (GoF) statistics, reported in Table 7, and found that the proposed UMBD is the only suitable choice for the analysis of the CO concentration data. Moreover, the information criteria reported in Table 8 indicate that the UMBD outperforms the other models with the least loss of information.
6.2.2. Dataset II
The second dataset measures the benzo(a)pyrene (BaP) concentration in air. This chemical compound is also predominantly manmade. BaP is a polycyclic aromatic hydrocarbon (PAH) with a high molecular weight [35]. Natural sources of BaP include volcanic eruptions and forest fires. Other sources are the incomplete combustion of organic materials, such as in vehicle emissions [36]. Human metabolism of BaP has significant carcinogenic effects [37]. Surface water, tap water, precipitation, groundwater, wastewater, and sewage sludge are all sources of BaP. The second dataset contains the annual average concentration of the pollutant BaP (ng/m³) from air quality monitoring. Data were reported from the Edmonton Central (downtown) Monitoring Unit (EDMU) location in Alberta, Canada, in 1995 [34]. Measurements are reported as 0.22, 0.20, 0.25, 0.15, 0.38, 0.18, 0.52, 0.27, 0.27, 0.27, 0.13, 0.15, 0.24, 0.37, and 0.20.
From Table 9, it is evident that Dataset II is positively skewed and leptokurtic. The GoF statistics in Table 10 indicate that the UMBD is a good choice for this environmental air pollution phenomenon. This claim is further supported by Table 11, which suggests that the UMBD is the model with the least loss of information. Figure 5 also shows that the UMBD yields a good fit.
6.2.3. Dataset III
Sulfate particles are particles that contain sulfur. They are found among airborne particles smaller than one micron and can be released from natural and manmade sources, such as industrial processes, coal burning, cement production, vehicle emissions, and sea salt [38,39,40]. Exposure to this pollutant has been associated with numerous health problems, including reduced lung function, more frequent respiratory symptoms and illnesses (like childhood bronchitis and cough), and even premature death [41]. Therefore, understanding the concentration of sulfate particles indoors or outdoors is vital for assessing their impact on air quality. The third dataset contains the concentration of sulfate in Calgary from 31 different periods during 1995. Measurements are taken from [34] and are listed as 0.048, 0.013, 0.040, 0.082, 0.073, 0.732, 0.302, 0.728, 0.305, 0.322, 0.045, 0.261, 0.192, 0.357, 0.022, 0.143, 0.208, 0.104, 0.330, 0.453, 0.135, 0.114, 0.049, 0.011, 0.008, 0.037, 0.034, 0.015, 0.028, 0.069, and 0.029.
The analysis of the third dataset, displayed in Table 12, indicates that the data are skewed and leptokurtic. Moreover, Table 13 illustrates that the proposed model is a good alternative to the competing models. Furthermore, the proposed model emerges as a strong candidate because it yields the smallest information criterion values, as shown in Table 14.
6.2.4. Dataset IV
The fourth dataset contains the concentration of the pollutant CO in Alberta, Canada, measured at the Calgary northwest (residential) monitoring unit (CRMU) station during 1995. Measurements are listed in [34] for the period 1976–1995 as 0.16, 0.19, 0.24, 0.25, 0.30, 0.41, 0.40, 0.33, 0.23, 0.27, 0.30, 0.32, 0.26, 0.25, 0.22, 0.22, 0.18, 0.18, 0.20, and 0.23.
From Table 15, it is evident that Dataset IV exhibits positive skewness and leptokurtic behavior. The GoF statistics of the proposed model, shown in Table 16, are the smallest among the competing models. According to Table 17, the information criteria of the proposed model also take their minimum values; thus, the proposed single-parameter model incurs the least loss of information. Moreover, Figure 6 shows that the proposed model yields a good fit.
7. Conclusions
In this study, we have introduced a flexible single-parameter unit distribution, called the UMBD, for modeling datasets bounded to the interval (0, 1). We have investigated the moments of the distribution and related measures, such as variance, skewness, and kurtosis. We have obtained the survival function, hrf, and rhrf of the UMBD and illustrated their behaviors through graphs. Moreover, we have obtained the moment generating function and the mode of the UMBD. We have investigated the inference problem for the parameter of the UMBD from the perspectives of maximum likelihood, least squares, weighted least squares, method of moments, and maximum product of spacings methodologies. We have also performed simulation studies to assess the estimation performance and empirical behavior of the obtained estimators. Additionally, we have presented analyses of four practical datasets to demonstrate data modeling with the UMBD. We believe that the UMBD will be beneficial to data modelers and researchers from different fields and that this work will inspire the derivation of other unit distributions.
The results we obtained indicate that the UMBD would not be appropriate for modeling lifetime data with decreasing-increasing-decreasing (modified bathtub) or unimodal hazard rate shapes. These shortcomings can be addressed in a subsequent study. Additional estimation techniques, such as those based on the Bayesian perspective, may also be put forth. Furthermore, we suggest adding a shape parameter to this model and looking into the existence of its information matrix.