1 Introduction
Haematoxylin and Eosin (H&E) staining serves as the primary examination method for human tissue samples in histology, standing as the gold standard for evaluating nuclear and cytoplasmic features to detect abnormalities associated with human diseases, including cancer [24]. Despite its technical simplicity, achieving high-quality results with routine staining procedures demands specific resources, such as effective chemical and material management. Furthermore, inherent challenges are often associated with H&E staining, including staining variability, potential artifact formation arising from various steps in the process, and the need for careful tissue handling [23].
Recent advancements in deep learning (DL) have made it possible to achieve label-free virtual H&E staining directly from autofluorescence images acquired on unstained tissue. This approach allows for the rapid generation of clinical-grade H&E-stained images in near real-time, eliminating the need for conventional staining procedures [2, 14]. Among DL techniques, U-Nets [20] and Generative Adversarial Networks (GANs) [7] are widely employed models, along with their variations. Notably, the pix2pix network [9] and CycleGAN [32] are particularly prominent for supervised and unsupervised virtual histological staining, respectively.
Despite the success of GANs in virtual histological staining, one notable challenge, known as hallucination, may persist: the synthetic data generated are irrelevant to the input data distribution, potentially leading to misdiagnosis of medical conditions [3]. To overcome this problem, a recommended strategy is to incorporate extra regularisation terms into the GAN's loss functions, guiding the training process towards more predictable synthesis through more effective constraints. Apart from the common loss functions used in GANs, diverse extra constraints have been applied, including L1 loss [17, 27], total variation (TV) [12, 15, 19, 30], physics-guided loss [22] to minimise the impact of background noise, and the structural similarity index (SSIM) [1, 11]. However, the influence of introducing additional regularisation terms into the training process on the outcomes of virtual histological staining remains largely unexplored.
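To make this concrete, the sketch below illustrates in PyTorch how such an extra regularisation term can be attached to a pix2pix-style generator objective. The helper names (`tv_loss`, `generator_loss`) and the weights `alpha` and `beta` are hypothetical and purely illustrative; this is a minimal sketch rather than the implementation used in this study.

```python
# Minimal sketch (not the exact implementation used here): attaching an
# extra regularisation term to a pix2pix-style generator objective.
import torch
import torch.nn.functional as F

def tv_loss(pred, target=None):
    """Anisotropic total variation on the generated image (target unused).
    Similarity-based terms (SSIM, content, texture, DISTS) would instead
    compare `pred` against `target`."""
    dh = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs().mean()
    dw = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().mean()
    return dh + dw

def generator_loss(disc_logits_fake, fake_he, real_he,
                   alpha=100.0, beta=10.0, extra_reg=tv_loss):
    """Adversarial term + alpha * L1 + beta * extra regularisation."""
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))  # fool the discriminator
    l1 = F.l1_loss(fake_he, real_he)                           # pixel-wise fidelity
    reg = extra_reg(fake_he, real_he)                          # extra constraint
    return adv + alpha * l1 + beta * reg
```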
In the field of computer vision, numerous efforts have been made to capitalise on additional constraints for high-quality image synthesis. One category of such efforts leverages high-level features extracted from pretrained models. For instance, Mahendran et al. introduced the concept of deep features obtained from various layers, representing a diverse range of details [16]. Another regularisation approach based on deep features is texture loss, wherein multi-level features from a pretrained VGG19 are imposed on generated images, and feature correlations at a given level are captured by the Gram matrix [5]. Subsequently, methods such as content and style transfer [6] and perceptual losses [10] were proposed to underscore the varied details embedded in both source and target images, contributing to more realistic image synthesis.
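As an illustration of these deep-feature constraints, the sketch below computes a content loss and a Gram-matrix texture loss from pretrained VGG19 activations. The chosen layers, normalisation, and helper names (`vgg_features`, `gram`) are illustrative assumptions rather than the exact configuration of [5, 6] or of this study, and inputs are assumed to be 3-channel images in [0, 1].

```python
# Sketch of content and Gram-matrix texture losses on pretrained VGG19
# features; layer indices and normalisation details are illustrative.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

_vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def vgg_features(x, layers=(3, 8, 17, 26)):
    """Collect activations at a few ReLU layers (illustrative selection)."""
    feats, out = [], x
    for i, layer in enumerate(_vgg):
        out = layer(out)
        if i in layers:
            feats.append(out)
    return feats

def gram(feat):
    """Feature correlations at one level, as used by Gram-matrix texture losses."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def content_loss(fake, real):
    return sum(F.mse_loss(f, r) for f, r in
               zip(vgg_features(fake), vgg_features(real)))

def texture_loss(fake, real):
    return sum(F.mse_loss(gram(f), gram(r)) for f, r in
               zip(vgg_features(fake), vgg_features(real)))
```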
In this study, we aim to systematically assess the influence of various loss functions on virtual H&E staining. Specifically, we use the pix2pix GAN as the foundational model and integrate five additional regularisation functions, namely the Structural Similarity Index (SSIM) [25], Total Variation (TV) [21], Deep Image Structure and Texture Similarity (DISTS) [4], content, and texture losses. The evaluation of outcomes involves both qualitative examination through visual inspection and quantitative analysis using four commonly employed image similarity metrics: mean squared error (MSE), Peak Signal-to-Noise Ratio (PSNR) [31], Normalised Mutual Information (NMI) [31], and the Feature-based Similarity Index (FSIM) [28].
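For reference, a minimal evaluation sketch is shown below, assuming 8-bit images held as NumPy arrays and scikit-image ≥ 0.19 (which provides `normalized_mutual_information`). FSIM is not part of scikit-image and would require a third-party implementation, so it is omitted here.

```python
# Minimal evaluation sketch: MSE, PSNR and NMI between a virtually stained
# image and its ground-truth H&E counterpart (FSIM omitted; it needs a
# separate implementation such as the `piq` package).
import numpy as np
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             normalized_mutual_information)

def evaluate_pair(virtual_he: np.ndarray, real_he: np.ndarray) -> dict:
    """Both inputs are expected to be uint8 arrays of identical shape."""
    return {
        "MSE": mean_squared_error(real_he, virtual_he),
        "PSNR": peak_signal_noise_ratio(real_he, virtual_he, data_range=255),
        "NMI": normalized_mutual_information(real_he, virtual_he),
    }
```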
4 Discussion and Conclusion
In this study, we have thoroughly evaluated the impact of loss functions on virtual H&E staining with label-free autofluorescence images. Specifically, we examined five additional regularisation terms widely applied in virtual histological staining, each in combination with five different weights. Given the implementation outlined in Section 2.1, it is evident that SSIM, texture, and DISTS losses, when assigned various weights, can produce virtual H&E staining that meets acceptable standards. Importantly, these variations do not hinder diagnostic decision-making. Additionally, content loss with smaller weights also yields satisfactory results. In contrast, content loss with substantial weights and TV loss are incapable of achieving acceptable outcomes in virtual H&E staining. The quantitative results obtained from the similarity metrics, including MSE, NMI, PSNR, and FSIM, align consistently with the quality of the generated images, making them effective for comparison purposes. However, it remains exceptionally difficult to rely solely on these metrics to identify satisfactory virtual staining.
The results illustrate that although SSIM loss is mathematically and conceptually the simplest loss, it is capable of generating reliable images and surpasses all other losses across all quantitative scores. In addition, SSIM loss is much less susceptible to variations induced by different weights than the other loss functions, maintaining relatively consistent performance across the range of weights considered. However, as depicted in Figure 1, SSIM loss struggles to reconstruct small immune cells (green arrows in Figure 1), although this may not affect diagnostic decision-making.
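For reference, SSIM as defined by Wang et al. [25] is computed over local windows from the means, variances, and covariance of the two images, with small constants $C_1$ and $C_2$ stabilising the division; the corresponding loss is typically taken as its complement:

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},
\qquad
\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(\hat{y}, y)
```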
Compared with SSIM loss, content, texture, and DISTS losses are significantly more complicated, as they all rely on deep features to determine similarity. For instance, with the GPU employed for training, SSIM can process 64 images per run, whereas the other three can only handle 16 images. Consequently, the training time per epoch increases dramatically from about 25 seconds for SSIM to over 110 seconds for the other three losses, even with the same batch size of 16.
We also showed that TV loss does not generate acceptable images, which contradicts the outcomes reported in [19]. There may be several reasons for this. First, our autofluorescence images are single-channel, whereas their images were collected using different filters, resulting in multi-channel images at different excitation and emission wavelengths. In addition, our images were collected with a confocal microscope, whereas theirs were acquired with a wide-field microscope, allowing their images to more closely match the true H&E images. Moreover, our GAN model differs from their custom model. Last but not least, our study aims to evaluate the impact of these loss functions; to ensure a fair comparison, we deliberately fixed the other parameters instead of fine-tuning them for optimal images.
As far as the metrics are concerned, all align well with the visual outcomes. In particular, FSIM presents the most significant contrast, whereas NMI produces scores with the least contrast. Therefore, FSIM may be more suitable for identifying noticeable differences. However, it becomes less sensitive to dissimilarity at the micro level; for example, the DISTS-based scores across the weights are too close to reflect the cellular differences in the virtual images. In comparison, MSE and PSNR are more sensitive than FSIM to those subtle changes.
Another important aspect of the evaluation is the hyperparameter β used in Equation 2. We acknowledge that an infinite number of values are eligible for this purpose. Empirically, this value has been set to approximately 10 times the value of α [6, 8, 19, 32]. Accordingly, we selected various values of β no greater than 10, namely 1, 2, 5, and 10. We also evaluated 20 as an additional value to gain an initial impression of whether a large hyperparameter would be beneficial for the reconstruction. Our findings align with the existing literature, indicating that within the range [1, 10] the reconstruction generally improves as the hyperparameter increases, whereas values exceeding 10 appear to negatively impact the quantitative outcomes.
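Equation 2 is defined earlier in the paper and is not reproduced in this section; as a reminder of the roles of α and β, a generic pix2pix-style objective with an extra regularisation term takes a form similar to the following (an illustrative form only, not necessarily the exact Equation 2):

```latex
\mathcal{L}_{G} \;=\; \mathcal{L}_{\mathrm{GAN}}(G, D)
  \;+\; \alpha\, \mathcal{L}_{1}(G)
  \;+\; \beta\, \mathcal{L}_{\mathrm{reg}}(G),
\qquad
\mathcal{L}_{\mathrm{reg}} \in \{\mathrm{SSIM},\ \mathrm{TV},\ \mathrm{DISTS},\ \mathrm{content},\ \mathrm{texture}\}
```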
Future work will extend to other loss functions, such as the Learned Perceptual Image Patch Similarity (LPIPS) [29], and to combinations of these loss functions. In addition, although pix2pix dominates supervised virtual histological staining, novel DL technologies, such as vision transformers and diffusion models [13], have yielded outstanding performance. Therefore, another potential improvement is to adopt those models as the foundation.