Abstract
This paper tackles the problem of Opportunistic Spectrum Access (OSA) in Cognitive Radio (CR) networks. The main challenge for a Secondary User (SU) in OSA is to learn the availability of existing channels in order to select and access the one with the highest vacancy probability. To reach this goal, we propose a novel Multi-Armed Bandit (MAB) algorithm, called \(\epsilon\)-UCB, that enhances the spectrum learning of a SU and decreases the regret, i.e., the reward lost by selecting suboptimal channels. We corroborate with simulations that the regret of the proposed algorithm has a logarithmic behavior: within a finite number of time slots, the SU can estimate the vacancy probabilities of the targeted channels and select the best one for transmission. We then extend \(\epsilon\)-UCB to multiple priority users, where a SU can selfishly estimate and access the channels according to its priority rank. The simulation results show the superiority of the proposed algorithms in both single-user and multi-user cases compared to existing MAB algorithms.
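To make the learning loop concrete, here is a minimal simulation sketch of UCB-style channel selection for OSA. It assumes i.i.d. Bernoulli channel availabilities and uses the classic UCB1 index of [6], \(B_i = X_i + \sqrt{\alpha \ln (t)/T_i}\), as an illustrative stand-in for the exact \(\epsilon\)-UCB update; the vacancy probabilities and parameters below are arbitrary.

```python
import numpy as np

# Minimal sketch of a UCB-style channel-selection loop for OSA.
# Assumptions (illustrative, not the paper's exact epsilon-UCB update):
# channels are i.i.d. Bernoulli with unknown vacancy probabilities mu,
# and the index is the classic UCB1 form B_i = X_i + sqrt(alpha*ln(t)/T_i).

rng = np.random.default_rng(0)
mu = np.array([0.9, 0.8, 0.5, 0.3])     # unknown vacancy probabilities
K, horizon, alpha = len(mu), 10_000, 2.0

counts = np.zeros(K)                     # T_i(t): times channel i was sensed
means = np.zeros(K)                      # X_i: empirical vacancy estimates
regret = 0.0

for t in range(1, horizon + 1):
    if t <= K:                           # sense each channel once to initialize
        i = t - 1
    else:                                # select the channel with the highest index
        index = means + np.sqrt(alpha * np.log(t) / counts)
        i = int(np.argmax(index))
    reward = rng.binomial(1, mu[i])      # 1 if the channel is vacant at slot t
    counts[i] += 1
    means[i] += (reward - means[i]) / counts[i]
    regret += mu.max() - mu[i]           # expected loss versus the best channel

print(f"estimated vacancies: {np.round(means, 3)}, cumulative regret: {regret:.1f}")
```

With these settings the cumulative regret grows roughly logarithmically in the horizon, in line with the behavior reported for \(\epsilon\)-UCB.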
Availability of Data and Materials
The authors declare that all the data and materials in this manuscript are available from the authors.
Notes
A SU in the context of OSA can be regarded as the agent in the classic MAB problem, and the frequency channels correspond to the different arms.
The variable \(S_i(t)\) may also represent the reward of the \(i\)-th channel at slot \(t\).
According to [24], the Chernoff–Hoeffding theorem is stated as follows: let \(X_1,\ldots ,X_n\) be random variables in \(\{0, 1\}\) with \(E[X_i]= \mu\) for all \(i\), and let \(S_n = \sum _{i=1}^{n} X_i\). Then, for all \(a\ge 0\), \(P\{S_n \ge n \mu +a\} \le \exp \left( \frac{-2 a^2}{n}\right)\) and \(P\{S_n \le n \mu -a\} \le \exp \left( \frac{-2 a^2}{n}\right)\).
According to [24], the Chernoff–Hoeffding theorem is stated as follows: let \(X_1, \ldots ,X_n\) be random variables in \([0,1]\) with \(E[X_i]= \mu\) for all \(i\), and let \(S_n = \sum _{i=1}^{n} X_i\). Then, for all \(a\ge 0\), we have \(P\{S_n \ge n \mu +a\} \le \exp \left( \frac{-2 a^2}{n}\right)\) and \(P\{S_n \le n \mu -a\} \le \exp \left( \frac{-2 a^2}{n}\right)\).
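As a quick numerical illustration of this bound (a minimal Monte Carlo check with arbitrary parameters, not part of the original theorem statement):

```python
import numpy as np

# Monte Carlo sanity check of the Chernoff-Hoeffding bound above:
# for X_1..X_n in [0,1] with E[X_i] = mu and S_n = sum_i X_i,
# P{S_n >= n*mu + a} <= exp(-2*a^2/n).  Parameters are arbitrary.

rng = np.random.default_rng(1)
n, mu, a, trials = 100, 0.5, 10.0, 200_000

S = rng.binomial(1, mu, size=(trials, n)).sum(axis=1)  # Bernoulli(mu) sums
empirical = np.mean(S >= n * mu + a)                   # ~0.03 for these values
bound = np.exp(-2 * a**2 / n)                          # exp(-2) ~ 0.135

print(f"empirical tail: {empirical:.4f} <= bound: {bound:.4f}")
```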
References
Marcus, M., Burtle, C. J., Franca, B., Lahjouji, A., & McNeil, N. (2002). Federal Communications Commission (FCC): Spectrum Policy Task Force. ET Docket, 02-135.
Watkins, C. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, Cambridge, UK.
Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.
Burtini, G., Loeppky, J., & Lawrence, R. (2015). A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757.
Kaufmann, E., Cappé, O., & Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, La Palma, Canary Islands.
Maillard, O., Munos, R., & Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback–Leibler divergences. In Annual conference on learning theory, Budapest, Hungary.
Scott, S. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26, 639–658.
Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, Granada, Spain.
Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International conference on algorithmic learning theory, Lyon, France.
Agrawal, S., & Goyal, N. (2013). Further optimal regret bounds for Thompson sampling. In Artificial intelligence and statistics, Scottsdale, USA.
Agrawal, S., & Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit. In Annual conference on learning theory, Edinburgh, Scotland.
Gai, Y., & Krishnamachari, B. (2011). Decentralized online learning algorithms for opportunistic spectrum access. In Global communications conference, Texas, USA.
Torabi, N., Rostamzadeh, K., & Leung, V. C. (2012). Rank-optimal channel selection strategy in cognitive networks. In Global communications conference, California, USA.
Rosenski, J., Shamir, O., & Szlak, L. (2016). Multi-player bandits-a musical chairs approach. In International conference on machine learning, New York, USA.
Avner, O., & Mannor, S. (2014). Concurrent bandit and cognitive radio networks. In European conference on machine learning and principles and practice of knowledge discovery in databases, Nancy, France.
Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2019). Distributed algorithm to learn OSA channels availability and enhance the transmission rate of secondary users. In International symposium on communications and information technologies, HoChiMinh, Vietnam.
Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2018). Opportunistic spectrum access in cognitive radio for tactical network. In European conference on electrical engineering and computer science, Bern, Switzerland.
Modi, N., Mary, P., & Moy, C. (2017). QoS driven channel selection algorithm for cognitive radio network: Multi-user multi-armed bandit approach. IEEE Transactions on Cognitive Communications and Networking, 3, 1–6.
Tekin, C., & Liu, M. (2011). Online learning in opportunistic spectrum access: A restless bandit approach. In International conference on computer communications, Shanghai, China.
Cauchy, A. (1889). Sur la convergence des séries. Oeuvres complètes, Sér. 2(2), 267–279.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2019). All-powerful learning algorithm for the priority access in cognitive network. In European signal processing conference, A Coruña, Spain.
Author information
Contributions
All authors contributed to the analytical and numerical results. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1
In this appendix, we derive an upper bound on the term \(\mathbb {A}= \epsilon _t \times Prob\) in \(\epsilon\)-UCB, where \(Prob\) is the probability that the index of a suboptimal channel \(i\) exceeds the index of the best channel (channel 1):
\[ Prob = P\left\{ B_1(t-1, T_1(t-1)) \le B_i(t-1, T_i(t-1)) \right\}. \]
The index \(B_i(t, T_i(t))\) of the \(i\)-th channel is the sum of the exploitation factor \(X_i(T_i(t))\), i.e., the empirical mean reward of channel \(i\) after \(T_i(t)\) selections, and the exploration factor \(A_i(t, T_i(t))\):
\[ B_i(t, T_i(t)) = X_i(T_i(t)) + A_i(t, T_i(t)). \]
Substituting this decomposition into \(Prob\), we obtain:
\[ Prob = P\left\{ X_1(T_1(t-1)) + A_1(t-1, T_1(t-1)) \le X_i(T_i(t-1)) + A_i(t-1, T_i(t-1)) \right\}. \]
By taking the minimum value of \(X_1({T_1(t-1)}) + A_1(t-1, T_1(t-1))\) and the maximum value of \(X_i({T_i(t-1)}) + A_i(t-1, T_i(t-1))\) at each time slot, we can upper bound \(Prob\),
where \(S_i \ge l\) fulfills the condition \(T_i(t-1) \ge l\).
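Following the classic UCB1 analysis of Auer et al. [6] (the summation ranges below are our reading of this step, with \(S_1\) and \(S_i\) denoting the possible numbers of past selections), the resulting chain of bounds is:
\[ Prob \le P\left\{ \min_{0<S_1<t}\left(X_1(S_1)+A_1(t,S_1)\right) \le \max_{l \le S_i<t}\left(X_i(S_i)+A_i(t,S_i)\right)\right\} \le \sum_{S_1=1}^{t-1}\sum_{S_i=l}^{t-1} P\left\{X_1(S_1)+A_1(t,S_1) \le X_i(S_i)+A_i(t,S_i)\right\}. \]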
Each probability inside this double sum can be upper bounded by decomposing the event into three cases.
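Following the decomposition used in [6], the event \(\{X_1(S_1)+A_1(t,S_1) \le X_i(S_i)+A_i(t,S_i)\}\) implies that at least one of the following three inequalities, which we read as Eq. (32), holds:
\[ X_1(S_1) \le \mu_1 - A_1(t,S_1), \qquad X_i(S_i) \ge \mu_i + A_i(t,S_i), \qquad \mu_1 < \mu_i + 2 A_i(t,S_i). \]
Indeed, if all three fail, then \(X_1(S_1)+A_1(t,S_1) > \mu_1 \ge \mu_i + 2A_i(t,S_i) > X_i(S_i)+A_i(t,S_i)\), a contradiction.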
Using the ceiling operator \(\lceil \cdot \rceil\), let \(l=\lceil \frac{4 \alpha \ln (n)}{\Delta _{i}^2}\rceil\), where \(\Delta _{i} = \mu _1-\mu _i\). Then, for \(S_i\ge l\), the inequality \(\mu _1<\mu _i+2 A_i({t,S_i})\) in Eq. (32) becomes false.
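Indeed, assuming the usual confidence width \(A_i(t,S_i)=\sqrt{\frac{\alpha \ln (t)}{S_i}}\), which is the form consistent with this choice of \(l\), we have for \(t \le n\) and \(S_i \ge l\):
\[ \mu_i + 2 A_i(t,S_i) = \mu_i + 2\sqrt{\frac{\alpha \ln (t)}{S_i}} \le \mu_i + 2\sqrt{\frac{\alpha \ln (n)\,\Delta_i^2}{4\alpha \ln (n)}} = \mu_i + \Delta_i = \mu_1, \]
so \(\mu_1 < \mu_i + 2A_i(t,S_i)\) cannot hold.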
Based on Eq. (32), only the first two events remain for \(S_i \ge l\), and we obtain inequality (33).
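Under the same decomposition, inequality (33) reads:
\[ P\left\{X_1(S_1)+A_1(t,S_1) \le X_i(S_i)+A_i(t,S_i)\right\} \le P\left\{X_1(S_1) \le \mu_1 - A_1(t,S_1)\right\} + P\left\{X_i(S_i) \ge \mu_i + A_i(t,S_i)\right\}. \]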
The two probabilities on the right-hand side can be controlled using the Chernoff–Hoeffding bound (see Footnote 4) [24].
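With the assumed width \(A_i(t,S_i)=\sqrt{\frac{\alpha \ln (t)}{S_i}}\), applying the bound of Footnote 4 with deviation \(a = S_i A_i(t,S_i)\) (respectively \(a = S_1 A_1(t,S_1)\)) gives:
\[ P\left\{X_i(S_i) \ge \mu_i + A_i(t,S_i)\right\} \le \exp \left(-2 S_i A_i(t,S_i)^2\right) = t^{-2\alpha}, \qquad P\left\{X_1(S_1) \le \mu_1 - A_1(t,S_1)\right\} \le t^{-2\alpha}. \]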
The two inequalities above, together with inequality (33), lead to:
Finally, we obtain:
Appendix 2
This appendix derives an upper bound on \(Z\), which contributes to the regret upper bound of \(\epsilon\)-UCB:
where \(a\) is a constant chosen as \(a= \frac{\mu _1 + \mu _i}{2}= \mu _1-\frac{\Delta _i}{2}= \mu _i+\frac{\Delta _i}{2}\), with \(\Delta _i=\mu _1-\mu _i\). After the learning period, where \(T_i(t-1)\ge l\), we have \(T_1(t-1) \gg T_i(t-1)\). Then \(Z\) can be upper bounded using the midpoint \(a\).
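A natural reading of this step, following the midpoint argument of the \(\epsilon_n\)-greedy analysis in [6], is:
\[ Z \le P\left\{X_i(T_i(t-1)) \ge \mu_i + \tfrac{\Delta_i}{2}\right\} + P\left\{X_1(T_1(t-1)) \le \mu_1 - \tfrac{\Delta_i}{2}\right\}. \]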
Using the Chernoff–Hoeffding bound [24], we can upper bound each of these two terms.
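Applying Footnote 4 with deviation \(a = T_i(t-1)\,\Delta_i/2\) for channel \(i\), and symmetrically for channel 1, yields:
\[ P\left\{X_i(T_i(t-1)) \ge \mu_i + \tfrac{\Delta_i}{2}\right\} \le \exp \left(-\frac{T_i(t-1)\,\Delta_i^2}{2}\right), \qquad P\left\{X_1(T_1(t-1)) \le \mu_1 - \tfrac{\Delta_i}{2}\right\} \le \exp \left(-\frac{T_1(t-1)\,\Delta_i^2}{2}\right). \]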
According to the proof provided in Appendix 1, we have \(l=\lceil \frac{4 \alpha \ln (n)}{\Delta _{i}^2}\rceil\) with \(\alpha =2\), i.e., \(l=\lceil \frac{8 \ln (n)}{\Delta _{i}^2}\rceil\).
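Substituting \(T_i(t-1) \ge l\) (and \(T_1(t-1) \gg T_i(t-1)\)) into the bounds above gives, for each term:
\[ \exp \left(-\frac{T_i(t-1)\,\Delta_i^2}{2}\right) \le \exp \left(-\frac{8 \ln (n)}{\Delta_i^2}\cdot \frac{\Delta_i^2}{2}\right) = \exp \left(-4\ln (n)\right) = n^{-4}, \]
so, under this reading, \(Z\) decays as \(O(n^{-4})\) after the learning period.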
About this article
Cite this article
Almasri, M., Mansour, A., Moy, C. et al. Distributed Competitive Decision Making Using Multi-Armed Bandit Algorithms. Wireless Pers Commun 118, 1165–1188 (2021). https://doi.org/10.1007/s11277-020-08064-w