1 Introduction

During the COVID-19 pandemic, the Internet of Medical Things (IoMT) has emerged as an effective and sustainable solution that enhances the productivity and efficacy of the healthcare system by streamlining clinical processes, information and workflows. The global telemedicine industry is now predicted to grow by 19.3% per year over this period, instead of the 15% per year predicted earlier, owing to the pandemic [1]. This overwhelming escalation of the IoMT market signifies the proliferation of diversified e-health use-cases and extensive applications with strict Quality of Service (QoS) requirements, which in turn has placed unparalleled stress on wireless networks. For example, data processing in emergency services demands a low-latency and highly reliable network to support precise operations, in contrast to video consultation services that require high bandwidth to facilitate a better user experience [2], as depicted in Fig. 1. Therefore, serving all well-defined e-health use-cases and applications over a single Radio Access Technology (RAT) is infeasible.

Fig. 1 QoS requirements of potential 5G enabled IoMT services

In this context, the 5G Heterogeneous Network (HetNet) has emerged as a potential panacea that symbiotically integrates various radio access technologies to maintain differentiated service provisioning. Moreover, the 5G HetNet provides multi-dimensional benefits such as ubiquitous personalised services, efficient resource utilisation and enhanced scalability. However, an optimal association that remains robust against deviations and benefits the users is indispensable in such a composite and dynamic environment. Various RAT selection strategies have been proposed in the existing literature, such as Multi-Attribute Decision Making (MADM), game theoretic approaches and optimisation models. However, these approaches suffer from computational complexity and scalability issues. Moreover, these solutions are inadequate for maintaining QoS levels while optimising battery usage.

As a solution, an intelligent RAT selection scheme based on Double Deep Reinforcement Learning (DDRL) is developed that comprehensively respects the user QoS requirements through multiparametric optimisation. In coherence with the highlighted research gaps, the proposed scheme is developed with the following objectives: (i) to satisfy the differentiated demands of health services, (ii) to optimise battery usage, and (iii) to ensure seamless connectivity within the IoMT network. To fulfil the multi-objective requirements defined above, the following contributions are made in this work:

  (i) An intelligent model exploiting the concepts of DDRL has been developed to attain an optimal RAT association policy.

  (ii) A principled Software-Defined Wireless Networking (SDWN)-Edge based framework has been developed to accommodate multi-service capabilities in the autonomic IoMT network.

  (iii) A multi-objective utility function is formulated to ensure QoS provisioning and energy level maintenance.

  (iv) The performance of the proposed scheme is evaluated through rigorous simulations, which indicate significant improvement over existing schemes.

The rest of this paper is structured as follows. Section 2 illustrates the state of the art of RAT selection solutions. For better understanding of the proposal, a system model is discussed in Sect. 3. Section 4 elaborates the proposed mathematical approach to achieve a fine-grained association policy for the formulated problem. Next, Sect. 5 expounds the results assessing the performance of the proposed approach. Subsequently, the paper is concluded.

2 Related Work

Numerous network selection schemes have been proposed in the existing literature. Among them, MADM has emerged as a popular RAT selection method that accounts for different network parameters. For example, Desogus et al. proposed a traffic-specific solution that selects a network on the basis of a reputation derived from user feedback in a multi-service scenario. The network reputation is computed through a multi-criteria method in the context of stringent traffic requirements to improve QoS and the data delivery process. This approach defines utility functions for each fundamental criterion in accordance with the service type to compute a score for the considered network. In 2019, Monteiro et al. proposed a network selection mechanism for HetNets called Context-Aware Network Selection (CANS) that considers distinct parameters such as user preferences, cost, local network and mobile terminal capabilities, user's displacement speed, device state, bandwidth usage and device battery level. The authors developed a prototype system to validate the proposed approach and observed a reduction in the energy consumption of the device. A novel flexible network selection scheme proposed in [5] integrates fuzzy logic with the MADM technique to facilitate context-aware network selection in multi-service 5G heterogeneous wireless networks. The proposal includes four processes, namely network screening, a fuzzy process, and hierarchical and integrated attribute analysis, which minimise the ping-pong effect to enhance user satisfaction and network robustness. Yadav et al. (2018) proposed a MADM-based network selection strategy to establish seamless connectivity for biovital data transmission with a reduced number of handovers. However, this method fails when there is a lack of consensus among the evaluation criteria as the number of parameters and network interdependencies increases.

Another potential solution utilised for RAT selection is game theory, as presented in [7], which proposed a matching game theoretic approach to ensure stable association between connected Internet of Things (IoT) devices and RATs in a 5G composite environment. To attain stable association, characteristic preference functions have been defined for both IoT devices and RATs that ensure improved energy efficiency, while respecting each RAT's capacity and a reduced cost of connectivity in an autonomic IoT environment. Authors in [8] leveraged the concept of evolutionary game theory to select a suitable network that ensures load balancing in a 5G heterogeneous network. Enhanced network load balancing is achieved with a utility function defined as an aggregate of the available network capacity and related parameters. Authors in [9] proposed a coherent distributed algorithm called early acceptance that exploits matching game theory to achieve balanced association in 5G HetNets. Moreover, early acceptance guarantees faster and more energy-efficient network selection in comparison to centralised RAT selection approaches. However, game theoretic approaches turn out to be deficient in terms of stability and adaptability with respect to the scenario.

To address the drawbacks of the aforementioned methods, artificial intelligence based algorithms have emerged as a significant approach that guarantees optimal RAT selection in a highly composite environment on the basis of stored experience samples. Authors in [10] proposed a generic mathematical framework based on a reinforcement learning model that facilitates an optimal RAT selection policy in a smart city environment. The authors presented a well-defined reward function that considers the IoT device characteristics and RAT peculiarities to maximise the throughput while reducing energy consumption and cost of connectivity, in accordance with the generated environment-specific alerts. Authors in [11] jointly optimise network selection and power control in OFDMA-based uplink HetNets by employing a multi-agent DQN method that facilitates enhanced user satisfaction and energy-efficient network selection. Moreover, the proposed approach maximises the overall network utility with guaranteed QoS in accordance with the environment dynamics. Mollel et al. proposed an offline intelligent network selection scheme based on DDRL to reduce frequent handovers and facilitate QoS provisioning. Specifically, the proposed algorithm leverages the historical user trajectory and network topology information to attain an optimal base station selection policy, accounting for the number of handovers and the system throughput.

Ma et al. presented a multi-agent DQN network selection algorithm based on Nash Q-learning that converges to an optimal policy maximising system throughput while reducing blocking probability with guaranteed QoS, with the help of an explicit reward function. Specifically, the proposed scheme leverages AHP and the Gray method to compute the user preferences and discrete-time Markov chains to model the network selection problem. Chkirbene et al. developed a novel network selection framework based on a DRL method that dynamically adapts over a heterogeneous health network to facilitate medical data delivery. The proposed scheme reduces energy consumption, latency, monetary cost and distortion while guaranteeing user QoS.

Nonetheless, the existing RAT selection approaches mainly concentrate on optimising the system performance in a single-service scenario, neglecting energy optimisation and the cost-of-connectivity constraint, as highlighted in Table 1. Therefore, an intelligent RAT selection approach is proposed to guarantee the diverse requirements of patients in a multi-service IoMT network.

Table 1 Comparison between existing research and the proposed scheme

3 System Model and Problem Formulation

To better comprehend the proposed RAT selection solution, the corresponding system model is presented in this section. Subsequently, the multi-objective problem is formulated, which aims at optimising the system performance while associating users with optimal networks.

3.1 System Model

The system under investigation considers an IoMT network, where each patient is equipped with Wireless Body Area Network (WBAN) sensors and devices to monitor the physiological status, as shown in Fig. 2. The measurements collected from the WBAN are transferred to the health cloud through patient i's Patient Trusted Node (PTN), i.e. a smartphone, characterised by a battery capacity \({b_i}^{max}\). This monitoring involves computation-intensive processes and immediate feedback. Therefore, an edge server with computing and receiving capabilities is deployed to handle a service request characterised by a packet length \(l_{i}\), classified on the basis of the biovital measurements collected from the sensors associated with the patient. The core of this edge server is a classifier model trained with an e-health database that classifies the patient's biovital measurements into a health risk level, namely low, medium or high, and accordingly maps them to the respective patient-specific services, viz. pervasive monitoring, video consultation, and emergency services. Furthermore, M randomly distributed RATs, indexed by j (j = 1,...,M), are considered to be connected to a central entity called the SDWN controller. Consequently, the energy consumed by the patient trusted node to transmit a packet of length \(l_{i}\) over RAT j [15, 16] is given as

$$\begin{aligned} E_{i,j}= p_{i,j}*D_{i,j} \end{aligned}$$
(1)

where \(p_{i,j}\) is the power consumption and the transmission delay \(D_{i,j}\) is defined as

$$\begin{aligned} D_{i,j}= \dfrac{l_{i}}{DR_{i,j}} \end{aligned}$$
(2)

where \(DR_{i,j}\) is the transmission rate between the PTN and RAT j. In addition, the proposed solution accounts for the cost of connectivity with respect to different RATs, which is defined as

$$\begin{aligned} C_{i,j}=l_{i}*c_{j} \end{aligned}$$
(3)

where \(c_{j}\) is the cost of connectivity per packet for RAT j.
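For concreteness, a minimal sketch of this per-transmission cost model is given below; the function names and numeric values are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of the per-transmission cost model in Eqs. (1)-(3).
# The numeric values below are hypothetical placeholders.

def transmission_delay(packet_length_bits, data_rate_bps):
    """D_ij = l_i / DR_ij  (Eq. 2)."""
    return packet_length_bits / data_rate_bps

def transmission_energy(tx_power_watts, delay_s):
    """E_ij = p_ij * D_ij  (Eq. 1)."""
    return tx_power_watts * delay_s

def connectivity_cost(packet_length_bits, cost_per_unit_length):
    """C_ij = l_i * c_j  (Eq. 3), with c_j the per-packet-length cost of RAT j."""
    return packet_length_bits * cost_per_unit_length

# Example: a 12 kbit packet over a RAT offering 2 Mbit/s at 0.2 W (assumed values)
d = transmission_delay(12_000, 2_000_000)   # 0.006 s
e = transmission_energy(0.2, d)             # 0.0012 J
c = connectivity_cost(12_000, 1e-6)
```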

Fig. 2 Proposed system model

The proposed solution must ensure an acceptable QoS and battery level to guarantee a better user experience; as a consequence, none of the patient nodes remains disconnected. Hence, the prime objective of the proposed work is to optimise energy consumption while guaranteeing acceptable QoS levels and cost of connectivity. These objectives are captured through the utility function discussed below. For the efficient functioning of an IoMT network, the patient trusted nodes are required to: (i) optimise the energy efficiency, (ii) transfer the biovital information in a timely manner, and (iii) alleviate the connectivity cost. Inspired by the above-cited prerequisites, a multi-objective utility function is formulated that aims at achieving a trade-off between energy consumption, data rate, delay, PLR and cost of connectivity, and the same is defined as follows

$$\begin{aligned} SU_{i,j}=\sum _{j}\left( \alpha _{DR}*{DR_{i,j}}+ \alpha _{D} * \dfrac{1}{D_{i,j}}+ \alpha _{EC} * \dfrac{1}{EC_{i,j}} + \alpha _{PLR}* \dfrac{1}{PLR_{i,j}}+ \alpha _{C}* \dfrac{1}{C_{i,j}}\right) \end{aligned}$$
(4)

where \(\alpha _{DR}\), \(\alpha _{D}\), \(\alpha _{EC}\), \(\alpha _{PLR}\) and \(\alpha _{C}\) are the weighting coefficients that signify the relative importance of the crucial parameters, namely data rate, delay, energy consumption, PLR and cost of connectivity. To capture service heterogeneity, the Analytic Hierarchy Process (AHP) is applied to compute the weights of these decision attributes to support the diversified service preferences. For example, an emergency response service poses the highest demand on network delay, while the strict bandwidth requirements of video consultation services cannot be neglected; for pervasive remote monitoring, a lower-cost network is preferred. Moreover, while reporting an event through a given RAT, the internal variable of the PTN (i.e. its battery level) is affected. In particular, no further services can be requested once the battery is depleted, which is expressed in the form of the following constraint

$$\begin{aligned} \sum _{j} EC_{i,j} \leqslant b_{i}^{max} \end{aligned}$$
(5)

Hence, the utility function for the PTN can be modified as

$$\begin{aligned} O_{i,j}=SU_{i,j}*H(b_{i}^{max}-\sum _{j}EC_{i,j}) \end{aligned}$$
(6)

The Heaviside function \(H(\cdot )\) in the above equation sets the preference function to zero when the PTN runs out of battery.
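A minimal sketch of Eqs. (4)-(6) is given below; the weight and measurement dictionaries are illustrative placeholders (in the proposal the weights come from AHP), and the function evaluates a single PTN-RAT pair rather than the full sum over candidate RATs.

```python
import numpy as np

def service_utility(m, w):
    """SU_ij terms of Eq. (4): reward data rate, penalise delay, energy, PLR and cost."""
    return (w["DR"] * m["DR"]
            + w["D"] / m["D"]
            + w["EC"] / m["EC"]
            + w["PLR"] / m["PLR"]
            + w["C"] / m["C"])

def constrained_utility(su, battery_max, total_energy):
    """O_ij per Eq. (6): the Heaviside step zeroes the utility once the battery is exhausted."""
    return su * np.heaviside(battery_max - total_energy, 0.0)

# Illustrative (assumed) weights and measurements for one PTN-RAT pair
w = {"DR": 0.3, "D": 0.3, "EC": 0.2, "PLR": 0.1, "C": 0.1}
m = {"DR": 2e6, "D": 6e-3, "EC": 1.2e-3, "PLR": 1e-3, "C": 0.012}
print(constrained_utility(service_utility(m, w), battery_max=10.0, total_energy=0.5))
```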

3.2 Problem Formulation

For the efficient functioning of the considered IoMT network, there is a need to optimise the connectivity between the patient trusted node and the e-health cloud while satisfying the service-specific QoS demands. Therefore, a multi-objective optimisation problem has been formulated to achieve a balance between battery consumption minimisation and QoS provisioning. To this end, a weighted sum scheme is adopted, which incorporates the multiple objectives into a single objective function \(O_{i,j}\), defined as the system utility; the formulated problem aims at maximising this objective function in the considered IoMT network.

$$\begin{aligned}&\beta : max \{O_{i,j}\} \end{aligned}$$
(7)
$$\begin{aligned}&C1:\sum _{j} EC_{i,j} \leqslant b_{i}^{max} \end{aligned}$$
(8)

In the above stated problem \(\beta\), the constraint defined in C1 ensures that the transmission energy of each patient trusted node must not exceed its maximum battery limit. The formulated problem clearly implies that QoS level maintenance and energy optimisation are mutually coupled. Therefore, the optimal solution for the formulated problem is proposed in the upcoming section.

4 iMnet: Intelligent RAT Selection Framework for 5G Enabled IoMT Network

As a solution to the above formulated problem, an intelligent approach incorporating the benefits of a Multi-Layer Perceptron (MLP) classifier and double deep reinforcement learning is investigated in this section to attain an optimal RAT selection policy that maximises the system utility with guaranteed QoS levels. Initially, this section introduces the edge computing layer. Subsequently, a brief description of the DDRL framework is presented, followed by the definition of its fundamental elements. The key terms considered during the mathematical analysis and their descriptions are summarised in Table 2.

Table 2 Symbols and their conventions
Fig. 3 Proposed intelligent network selection framework

Table 3 Risk and service classification in accordance with biovital parameters

4.1 Edge Computing Layer

The proposed solution distributes the logic among the IoMT network, the edge server and the SDWN controller. Initially, the IoMT layer gleans physiological information from the patient trusted nodes present in the IoMT network, whereas the radio-node layer encompasses the fundamental network technologies and allows standardised communication between the PTNs and the health cloud, as presented in Fig. 3. In 5G HetNets, more than one RAT is likely to be shortlisted in accordance with the PTN-specific services. As a result, the network cognitive engine implements an MLP classifier to decide whether a RAT is acceptable for the demanded application. This act of stratifying RATs into valid and invalid sets is termed invalid action reduction [17, 18], which eliminates insignificant actions for quick convergence. The external data (comprising the sensor measurements) and internal data (comprising RAT-specific measurements) extracted from the infrastructure layer are forwarded to the edge computing layer for service identification and valid RAT identification. The service cognitive engine and the network cognitive engine in the edge computing layer implement MLP classifiers to classify, respectively, the patient-specific biovital measurements into well-defined services S (as presented in Table 3) and the RATs into valid and invalid sets in the composite IoMT network. The MLP classifiers for both engines consider the real-time physiological measurements \(M_{s}\) and the patient-specific network measurements \(M_{n}\), and are defined as

$$\begin{aligned} MLP_s =\, & f_s (M_s:u_s (\tau _s:\nu _s)) \end{aligned}$$
(9)
$$\begin{aligned} MLP_n = \,& f_n (M_n:u_n (\tau _n:\nu _n)) \end{aligned}$$
(10)

where \(f_s ()\) and \(f_n ()\) denote the activation functions for the service and network cognitive engines respectively, whereas \(u_s ()\) and \(u_n ()\) are the corresponding parameter sets. The layer sizes are represented as \(\tau _s\) and \(\tau _n\), while \(\nu _s\) and \(\nu _n\) are the variable parameters for the service and network cognitive engines respectively. The MLP classifier implemented in the service cognitive engine has an input layer of size 5x5 with tanh activation and an output layer of size 3x3 with softmax activation, whereas the network cognitive engine implements an MLP with an input layer of size 5x5, tanh activation and an output layer of size 2x2 with a softmax activation function; a hedged configuration sketch of the two classifiers is given below. To gain better insight into the proposed approach, a brief introduction to DDRL is presented in the following subsection.
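The sketch below illustrates one possible configuration of the two cognitive-engine classifiers using scikit-learn's MLPClassifier; the feature vectors, hidden-layer width and training data are illustrative assumptions, since the text only fixes the tanh hidden activation and softmax outputs (three service classes for \(MLP_s\), valid/invalid for \(MLP_n\)).

```python
from sklearn.neural_network import MLPClassifier

# Service cognitive engine: biovital measurements M_s -> {low, medium, high} risk/service
mlp_s = MLPClassifier(hidden_layer_sizes=(25,), activation="tanh", max_iter=500)

# Network cognitive engine: per-RAT measurements M_n -> {valid, invalid}
mlp_n = MLPClassifier(hidden_layer_sizes=(25,), activation="tanh", max_iter=500)

# Training calls (feature matrices and labels are placeholders):
# mlp_s.fit(M_s_train, service_labels)    # e.g. HR, SpO2, BP, respiration, temperature
# mlp_n.fit(M_n_train, validity_labels)   # e.g. DR, D, PLR, EC, C per candidate RAT
```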


4.2 Software-Defined Wireless Networking Layer

The SDWN layer models the SDWN controller as a DDRL agent which observes a state and chooses an action on the basis of an \(\epsilon\)-greedy policy. Subsequently, the DDRL agent receives a reward r aligned with the maximisation of the objective function. Consequently, the agent learns to converge to an optimal network selection policy according to the received reward. Before discussing these fundamental elements of DDRL, a brief introduction to DDRL is presented below:

Reinforcement learning is a feedback-based artificial intelligence technique that enables automated decision making in a complex and dynamic environment. Specifically, the agent performs suitable actions on the basis of previously learned experiences with the help of a reward function and the current state [19]. However, reinforcement learning fails to attain an optimal strategy due to dimensional limitations. As a solution, deep reinforcement learning emerged as an exemplary method that implements both target and online networks based on deep neural networks to optimise the overall performance. To compute the approximate Q-values, the DQN defines a parameterised value function \(Q_{\phi }: S\times A \rightarrow \mathbb {R}\), which accepts the state s as input and produces a Q-value for all feasible actions, where \(\phi\) signifies the parameters related to the Q-values. The DQN is trained through mini-batches \((s,a,r,s')\), selected randomly from the experience buffer D to reduce the correlation among the training examples. This training of the DQN updates the weights \(\phi\) so as to minimise the loss function defined as

$$\begin{aligned} L(\phi ) = E[(y_i^{DQN}-Q_i(s,a_i;\phi ))^2] \end{aligned}$$
(11)

where the target value \(y_i^{DQN}\) is defined as

$$\begin{aligned} y_i^{DQN}=r+ \gamma max_{a^{'}\epsilon A} Q_i(s^{'},a_i^{'};\phi ^{'}) \end{aligned}$$
(12)

where \(\phi ^{'}\) denotes the target network weights. The action defined in (12) is selected through the \(\epsilon\)-greedy policy from the online network \(Q(s_i,a_i;\phi )\). The over-optimistic estimation elucidated in Eq. (12) results in an inaccurate derived policy. Therefore, the Double DQN algorithm is implemented, which leverages two different DQNs to decouple action selection and action evaluation [20], and the target function for the same is defined as

$$\begin{aligned} y_{i}^{DDQN} = r + \gamma \, Q\big (s^{'} ,\arg \max _{a^{'} \in A} Q_{i} (s^{'} ,a^{'} ;\phi );\phi ^{'} \big ) \end{aligned}$$
(13)

Furthermore, both the online and target networks employ the next state \(s^{'}\) to compute the optimal Q-values. Subsequently, the target value \(y^{DDQN}_{i}\) is calculated with the defined values of the discount factor \(\gamma\) and the reward value r.
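A minimal sketch of the Double DQN target in Eq. (13) is shown below. It assumes each network exposes a `q_values(state)` callable returning a score per action; restricting the argmax to valid actions mirrors the invalid action reduction described earlier and is an assumption of this sketch.

```python
import numpy as np

def ddqn_target(reward, next_state, online_q, target_q, gamma, valid_actions):
    """y = r + gamma * Q_target(s', argmax_{a' in valid} Q_online(s', a'))  (Eq. 13)."""
    q_online = online_q(next_state)                              # action selection: online net
    best = valid_actions[int(np.argmax(q_online[valid_actions]))]
    return reward + gamma * target_q(next_state)[best]           # action evaluation: target net
```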

4.3 DDRL Framework Design

As a consequence of the dynamic and complex environment, the RAT selection problem is formulated as a Markov Decision Process (MDP) \(M=(s,a,r,\gamma )\). Based on the observed state s, the SDWN controller executes an action a, for which a reward r is returned in order to improve the learned policy. These elements are defined below:

  1. State: The patient-specific service S and the network measurements of the valid RATs with respect to the PTN comprise the state space. The valid RAT measurements include data rate (DR), delay (D), PLR, energy consumption (EC) and cost of connectivity (C). Therefore, the per-RAT state is

    $$\begin{aligned} s_t^j=\{S, DR(j), D(j), PLR(j), EC(j), C(j)\}, \quad j \in \text {valid RATs} \end{aligned}$$
    (14)

    and \(s_t\) represents the overall state of the considered system [17],

    $$\begin{aligned} s_{t}=\bigcup _{\forall j \in \text {valid RATs}} s_{t}^{j} \end{aligned}$$
    (15)
  2. Action space: The agent executes an action \(a \in A\) corresponding to every state \(s_{t}\), represented as

    $$\begin{aligned} A_{v}= \{RAT_{1}, RAT_{2},...., RAT_{v}\} \end{aligned}$$
    (16)

    The action space represents the list of valid RATs shortlisted by the network cognitive engine in the dynamic IoMT network.

  3. Reward design: As the IoMT is a composite and complex network, the group of RATs that can meet the PTN-specific QoS demands varies, and accordingly the sets of valid and invalid RATs change. With reference to the formulated problem, the selection of invalid RATs must be avoided in order to learn the optimal RAT selection policy. Therefore, the reward is defined as

    $$\begin{aligned} r={\left\{ \begin{array}{ll} O_{i,j},& \text{if } a\in A_{v} \\ 0 , & \text{if } a \notin A_{v} \\ \end{array}\right. } \end{aligned}$$
    (17)

    After executing each valid action in a time step, the network cognitive engine feeds back the aforementioned reward to the agent for objective function maximisation; a minimal sketch of these MDP elements is given after this list.
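The sketch below shows how the state and reward could be assembled from the definitions above; the dictionary keys and the utility callback are illustrative assumptions.

```python
def build_state(service, valid_rat_measurements):
    """Overall state s_t (Eqs. 14-15): demanded service plus {DR, D, PLR, EC, C} per valid RAT."""
    state = [service]
    for rat, m in sorted(valid_rat_measurements.items()):
        state += [m["DR"], m["D"], m["PLR"], m["EC"], m["C"]]
    return state

def reward(action, valid_actions, objective_value):
    """Reward per Eq. (17): the objective O_ij for a valid RAT, zero otherwise."""
    return objective_value(action) if action in valid_actions else 0.0
```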

5 Results and Discussion

In this section, a series of extensive simulations is carried out to evaluate the performance of the proposed scheme. Primarily, the simulation environment is presented, followed by the service-specific weight calculation. Subsequently, the convergence performance and a comparative analysis of the proposed scheme against other existing schemes, namely the DRL, greedy and random schemes, are presented.

Table 4 Simulation setting
Table 5 Selected hyperparameter for DDRL agent

5.1 Simulation Environment

The simulation environment considers three PTNs that collect physiological information from the patients and transfer it to the health cloud through multiple RATs, namely 5G, LTE and LoRa. To this end, the MIMIC-III database [21] is utilised to acquire the time series of biovital measurements presented in Table 3. To establish the above-mentioned multi-RAT environment, a discrete model is adopted to describe and approximate the network dynamics, namely data rate, delay and PLR, at certain levels. The minimum and maximum level of each network parameter, required to define the joint data rate-delay-PLR state as one of all possible combinations with equal probability, is presented in Table 4. Eventually, the Double Deep Reinforcement Learning model is implemented using the Matlab deep learning toolbox [22]. The simulation parameters and the DDRL agent hyperparameters are summarised in Table 4 and Table 5 respectively. In order to evaluate the performance of the proposed scheme, it has been compared against (i) DRL, (ii) greedy and (iii) random schemes. In the DRL based scheme, the agent selects among both the valid and invalid RATs present in the considered environment. The greedy scheme selects the nearest RAT for each requested service, ignoring the service-specific requirements, whereas the random scheme associates with each RAT with equal probability.
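A hedged sketch of such a discrete multi-RAT environment is given below; the per-RAT ranges and the number of quantisation levels are illustrative placeholders and do not reproduce the values of Table 4.

```python
import random

RATS = {
    # (min, max) levels per parameter -- illustrative placeholders only
    "5G":   {"DR": (5e7, 1e9),  "D": (1e-3, 1e-2), "PLR": (1e-5, 1e-3)},
    "LTE":  {"DR": (5e6, 1e8),  "D": (1e-2, 5e-2), "PLR": (1e-4, 1e-2)},
    "LoRa": {"DR": (3e2, 5e4),  "D": (5e-1, 2.0),  "PLR": (1e-3, 1e-1)},
}

def sample_network_state(levels=4):
    """Draw a joint data rate-delay-PLR state, each parameter quantised to `levels` values."""
    state = {}
    for rat, ranges in RATS.items():
        state[rat] = {
            k: random.choice([lo + i * (hi - lo) / (levels - 1) for i in range(levels)])
            for k, (lo, hi) in ranges.items()
        }
    return state
```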

Fig. 4 Measurement of service preference for (a) Pervasive monitoring (b) Video consultation (c) Robotic care

5.2 Service Preference Measurement

The stringent QoS requirements imposed by the services in the IoMT network must be satisfied to achieve a better user experience. Therefore, the AHP method is employed in this section to model the preferences of the services demanded by the PTNs, as discussed below:

  1. Pervasive Monitoring: This class of services includes the transmission of the biovital measurements, gathered through the various wearable sensors attached to the patient, to the medical specialist, ensuring pervasive monitoring of the patient. Telemonitoring of the patient demands high reliability, low delay and a low-cost network; the weights for the same are defined in Fig. 4a.

  2. Video Consultation: Video consultation in the IoMT network facilitates diagnostic and therapeutic assistance through a video session between the patient and the medical professional. This class of services poses the highest demand on low delay and high reliability. The weights computed through AHP for this class of service are presented in Fig. 4b.

  3. Robotic Care: Robotic care delivers AI based diagnosis and telesurgery for the efficient management of patients. With the aim of providing emergency assistance, this class of service demands highly reliable and low-latency connectivity. The weights for the same are defined in Fig. 4c; an illustrative AHP weight computation is sketched after this list.

Fig. 5 Convergence analysis of the proposed scheme and other existing schemes

Fig. 6 Valid action selection with varying episodes

Fig. 7 RAT access trend attained before and after convergence

Fig. 8 Average reward of the proposed scheme and other existing schemes

5.3 Performance Analysis

The variation of the reward highlighted in Fig. 5 shows that, for the first 35 episodes, the exploration rate is set to 1 so as to obtain an initial estimate of the policy by selecting random actions. Afterwards, the exploration rate is decreased each episode until it equals 0.1, so as to converge quickly to the optimal policy. As the episodes progress, the model learns the dependency trends between states and actions, resulting in an increased reward. The fluctuations seen even after the curve has levelled off are due to the selection of suboptimal actions with a probability of 0.01 (\(\alpha =0.90\)) at each iteration. Furthermore, it can be concluded from Fig. 6 that the proposed scheme efficiently learns valid action selection over the episodes, facilitating quick convergence to significant solutions; a sketch of this exploration schedule is given below.
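The following is a minimal sketch of the exploration schedule described above; the 35-episode warm-up and the 0.1 floor follow the text, while the decay constant is an assumption.

```python
import random

def epsilon(episode, warmup=35, eps_min=0.1, decay=0.95):
    """Fully random actions during warm-up, then a decayed epsilon bounded below by eps_min."""
    if episode < warmup:
        return 1.0
    return max(eps_min, decay ** (episode - warmup))

def select_action(q_values, valid_actions, eps):
    """Epsilon-greedy selection restricted to the valid RATs."""
    if random.random() < eps:
        return random.choice(valid_actions)                   # explore
    return max(valid_actions, key=lambda a: q_values[a])      # exploit
```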

It can be observed from Fig. 7 that after 5 episodes, 5G NR is the preferred choice for each considered service, namely pervasive monitoring, video consultation and robotic care, but the choice of RAT varies for each service between 10 and 30 episodes as the proposed approach follows the exploitation policy to choose actions that return higher rewards. Finally, after 30 episodes, the selection becomes stable for the various differentiated IoMT services.

5.4 Comparative Analysis

As highlighted in Fig. 5, the proposed scheme results in faster convergence due to the incorporation of the invalid action reduction scheme and the DDRL concepts. The invalid action reduction scheme employs function approximation to map only the Q-values of valid RATs, whereas the DDRL approach minimises the overestimation error by executing the argmax operation only on valid actions, resulting in faster convergence. In contrast, the DRL approach experiences the catastrophic interference phenomenon that disturbs stability and leads to slower convergence. The same can be verified from Fig. 8, which highlights the superiority of the proposed scheme in comparison to the DRL, greedy and random schemes by 43%, 147% and 308% respectively. The DDRL approach allows the proposed scheme to incur higher reward values, as it sweeps the defined space to gain maximum reward in fewer episodes, whereas the over-optimistic estimation in the DRL approach affects the learning quality. In addition, the greedy behaviour towards action-value estimates in the greedy scheme and the uninformed RAT selection in the random scheme result in their respective poor performance.

Fig. 9 Convergence performance with respect to varying (a) mini-batch size (b) discount factor (c) learning rate (d) optimisation strategy

5.5 Complexity Analysis

This section presents a detailed time complexity analysis of the suggested solution based on DDRL concepts. The DDRL model incorporates a fully connected deep neural network with three layers: input, hidden, and output. The complexity of the proposed approach is discussed with respect to the edge computing and software-defined wireless networking layers. The edge computing layer comprises two MLP classifiers, namely \(MLP_{s}\) and \(MLP_{n}\), employed in the service cognitive and network cognitive engines respectively, to classify the physiological measurements into services and the RATs into valid and invalid sets. Therefore, the time complexity contributed by the edge computing layer is given as \(O(\eta *( \tau _{si}* \tau _{sh}*\tau _{so}*T_s+ \tau _{ni}* \tau _{nh}*\tau _{no}*T_n))\), where \(\eta\) represents the total number of episodes, and \(\tau _{si}\), \(\tau _{sh}\), \(\tau _{so}\), \(T_s\) respectively denote the input layer neurons, hidden layer neurons, output layer neurons and training data associated with the \(MLP_{s}\) classifier. Similarly, \(\tau _{ni}\), \(\tau _{nh}\), \(\tau _{no}\), \(T_n\) represent the input layer neurons, hidden layer neurons, output layer neurons and training data associated with the \(MLP_{n}\) classifier. As the input layer and hidden layer sizes are the same for both classifiers, the time complexity reduces to \(O(\eta * \tau _{si}* \tau _{sh}*(\tau _{so}*T_s+\tau _{no}*T_n))\) [23].

The number of input layer neurons is given by the state space cardinality, i.e. \(1+v\), whereas the number of output layer neurons is given by the action space cardinality, i.e. v, where v is the number of valid RATs. Additionally, \(H_{fn}\) denotes the number of fully connected hidden layers of the DDQN, the h-th of which contains \({e_{fn}^{h}}\) neurons. As a result, the overall number of weights required to be updated is \((1+v)*{e_{fn}^{1}} + v*{e_{fn}^{H_{fn}}}+{\sum _{h=2}^{H_{fn}}}{e_{fn}^{h-1}}*{e_{fn}^h}\). Consequently, the time complexity of the DDRL algorithm is \(O(X*((1+v)*{e_{fn}^{1}} + v*{e_{fn}^{H_{fn}}}+{\sum _{h=2}^{H_{fn}}}{e_{fn}^{h-1}}*{e_{fn}^h}))\), where X is the complexity related to the neuron weight update. Moreover, the time complexity related to perceiving the experience samples is \(O(\eta *t*i)\). Therefore, the overall time complexity of the proposed approach over \(\eta\) episodes of t time steps is \(O(\eta * \tau _{si}* \tau _{sh}*(\tau _{so}*T_s+\tau _{no}*T_n))\) + \(O(\eta *t*(i+X*((1+v)*{e_{fn}^{1}} + v*{e_{fn}^{H_{fn}}}+{\sum _{h=2}^{H_{fn}}}{e_{fn}^{h-1}}*{e_{fn}^h})))\) [24, 25].

The time complexity analysis of the DRL based approach accounts for both valid and invalid RATs (j in total) and is given as \(O(\eta * \tau _{si}* \tau _{sh}*\tau _{so}*T_s)+O(\eta *t*(i+X*((1+j)*{e_{fn}^{1}}+ j*{e_{fn}^{H_{fn}}}+{\sum _{h=2}^{H_{fn}}}{e_{fn}^{h-1}}*{e_{fn}^h})))\) [26]. Therefore, it can be concluded that the proposed approach ensures reduced complexity compared to the DRL based approach, as the former accounts for only the v valid RATs. The time complexity of the random scheme is \(O(\eta )\) [26], as the actions are sampled randomly from a provided set of actions, whereas the complexity of the greedy scheme is \(O(\eta *t*(1+j)*j)\) [27].

5.6 Convergence Analysis

The convergence of the reinforcement learning algorithm can be assured when the reward function is bounded and the learning rate satisfies the following conditions: (i) \(\sum _{t=0}^{\infty } \alpha ^{(t)}=\infty\), (ii) \(\sum _{t=0}^{\infty } (\alpha ^{(t)})^{2}<\infty\) [28]. In addition, the sample complexity of the reinforcement learning algorithm is \(O\left( \frac{|S||A|}{W}\right)\), where W is related to the hyperparameters associated with Q-learning. Hence, systematic hyperparameter tuning helps guarantee convergence, as discussed in the subsequent section.
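As an illustrative example (a standard Robbins-Monro step-size choice, not a value reported in this work), the harmonic schedule satisfies both conditions, since the harmonic series diverges while the series of squares converges:

$$\begin{aligned} \alpha ^{(t)}=\frac{1}{t+1}, \qquad \sum _{t=0}^{\infty }\frac{1}{t+1}=\infty , \qquad \sum _{t=0}^{\infty }\frac{1}{(t+1)^{2}}=\frac{\pi ^{2}}{6}<\infty \end{aligned}$$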

5.7 Training Efficiency with Different Learning Hyperparameters

Hyperparameter selection plays a pivotal role in the successful operation of a deep learning model. However, it is difficult to predict the best hyperparameter configuration for the considered problem in advance. Consequently, an extensive trial and error method is required to attain the optimal hyperparameter configuration in a given time [29, 30]. Therefore, the convergence performance of the proposed scheme under various hyperparameters is illustrated in Fig. 9. Figure 9a highlights the convergence statistics with varying mini-batch sizes and shows that a mini-batch size of 32 permits faster updating of the neural network weights, resulting in faster convergence to optimal solutions. Figure 9b shows that a low discount factor leads to faster training dynamics than a higher discount factor, which results in inaccuracy and instabilities. Figure 9c depicts that a learning rate of 0.01 allows faster convergence to the best solution, whereas 0.1 and 0.001 respectively push the model to converge quickly to insignificant solutions. The Adam optimisation strategy outperforms the other schemes in Fig. 9d as it integrates the heuristics of both RMSProp and SGDM.

6 Conclusions

The significant growth of the IoMT industry has highlighted the need for an effective communication infrastructure to support extensive IoMT services with strict QoS provisioning. To satisfy these requirements, 5G HetNets have emerged as a potential solution that supports agility and flexibility in a composite and complex environment. However, maintaining an adequate level of QoS while reducing energy consumption in the IoMT network is crucial. As a solution, an intelligent network selection scheme has been proposed to attain an optimal RAT association policy in the complex and dynamic IoMT network. Moreover, an SDWN-Edge enabled framework is leveraged to facilitate flow scheduling differentiation on the basis of the PTNs' QoS requirements. In order to capture the service heterogeneity, the AHP method is employed to calculate the service preference weights. The analytical results, validated through exhaustive simulations, show that the proposed strategy converges to the optimal RAT association policy in fewer episodes. The results obtained in this work indicate that the proposed scheme performs considerably better in terms of the system utility modelled as reward, i.e. by 43%, 147% and 308% in comparison with the DRL, greedy and random schemes respectively.