1 Introduction
Patients in the intensive care unit (ICU) generally undergo stressful and traumatic experiences stemming from critical illness, medical procedures, pain, and a hostile environment [20]. Even after ICU discharge, these patients are vulnerable and at a high risk of readmission to the hospital and the ICU [70, 80]. Post-ICU patients often suffer from post-intensive care syndrome (PICS), which refers to “new or worsening impairments in physical, cognitive, or mental health arising after critical illness and persisting beyond acute care hospitalization” [55]. PICS is quite common, affecting up to 75% of patients discharged from the ICU [14, 56, 60, 67], and it is associated with reduced quality of life and difficulties in reintegrating into society [19, 24]. Psychological aspects of PICS include depression, anxiety disorders, and post-traumatic stress disorder (PTSD) [33]. While there is growing interest in their prevention and treatment, only a limited set of interventions is available, such as ICU follow-up clinics [12, 33] and ICU diaries [6, 33], and there is a need for more diverse and effective approaches.
Visual art has been widely utilized to promote psychological well-being in clinical environments. While art therapy is commonly understood as a form involving creative activities, in this paper, we use ‘art therapy’ as an umbrella term where art serves as a medium for therapeutic benefits [13, 23]. This includes various forms, such as engaging with existing artwork to stimulate emotions and self-reflections. Art therapy, for example, has been employed as a method to address various forms of psychological disorders, including depression [3, 23], anxiety [23], and PTSD [73]. This approach leverages the unique characteristics of visual art, such as diverse styles that activate interpretation and imagination, and the ability to stimulate the expression of memories and specific emotions [73], which have demonstrated effectiveness in addressing these psychological disorders [3, 23, 73]. In the context of hospitals in general, as well as critical care settings such as the ICU, the use of visual art as a positive distraction has also demonstrated its effectiveness in reducing stress, anxiety, and pain perception [54, 76]. This body of evidence showcases the potential of visual art as an intervention for addressing the psychological aspects of PICS.
Previous studies have suggested the importance of personalization in providing positive distractions to enhance effectiveness [71] while minimizing potential side effects [57]. This indicates that for visual art to serve as a positive distraction, it is crucial that it resonates with the patient’s emotional needs, emphasizing the significance of selecting appropriate art for each patient. Furthermore, to achieve a prolonged effect of positive distraction, a continuous supply of personalized art is necessary, which entails a large number of artworks.
In light of these challenges, it becomes evident that embracing a personalised methodology for facilitating the selection of paintings in art therapy is not merely advantageous, but of paramount importance. The intersection of personalised medicine and art therapy holds the potential to revolutionise the landscape of PICS treatment. In particular, recent advances in Machine Learning-based Visual Art Recommendation Systems (VA RecSys) hold great potential to open a novel avenue for tackling the challenges of artwork selection in a personalised and nuanced manner, for use in art therapy of post-ICU patients and beyond. By integrating these systems, we can bridge the gap between the vast universe of artworks and the unique emotional needs of each patient, thereby supporting experts in the selection process of curated artworks that resonate with the patient’s distinct cognitive and emotional requirements.
In this paper, we set out to explore the potential benefits of integrating Machine Learning (ML) based VA RecSys within the framework of PICS treatment using art therapy. To the best of our knowledge, there are no prior works leveraging VA RecSys in a therapeutic context. Therefore, we formulate the following research question: Can VA RecSys algorithms support the psychological well-being of post-ICU patients through personalized art therapy?
In pursuit of this investigation, we propose approaches to integrate state-of-the-art VA RecSys engines within the context of PICS prevention and reduction, evaluating their potential efficacy and relevance. We explore four VA RecSys engines that have shown superiority in uncovering complex semantic relationships of artwork and have been successfully applied for personalised recommendation tasks [90, 91]. In particular, we trained three uni-modal engines on image and textual data of artworks, and one multimodal engine that fuses both image and text. For our image-based approach we use the popular Residual Neural Network (ResNet) [28], for our text-based approach we adopt both Latent Dirichlet Allocation (LDA) [7] and Bidirectional Encoder Representations from Transformers (BERT) [18], whereas for our fusion approach we use Bootstrapping Language-Image Pre-training (BLIP) [43]. The learned representations are used to derive personalised artwork recommendations for therapy that presumably match patients’ emotional needs.
In sum, this paper makes the following contributions:
• We develop and study four advanced VA RecSys engines using different backbone architectures (ResNet, LDA, BERT, and BLIP) to support PICS prevention and reduction through guided art therapy.
• We conduct a usability test with 4 healthcare experts to assess the appropriateness of VA RecSys engines and a large-scale study with 150 post-ICU patients to assess the efficacy of the proposed VA RecSys engines, as compared to expert-curated recommendations.
• We contextualise our findings and provide guidance about potential strategies to integrate ML-based VA RecSys in a personalised PICS intervention and beyond.
3 Background: Learning Latent Representations of Visual Art
Representation learning is a powerful computational concept that involves automatically uncovering the underlying structure within complex data [
5]. It is a process where an algorithm learns to convert raw data inputs into more compact, meaningful, and feature-rich representations. These representations capture essential patterns, relationships, and characteristics within the data, enabling more effective analysis, understanding, and utilization [
93].
In the context of VA RecSys, representation learning plays a pivotal role in converting intricate visual elements into condensed yet informative forms. This process involves training algorithms on text and/or image modalities, often based on neural networks, to recognise and extract not only distinctive features present in paintings, such as colours, shapes, and textures but also complex concepts embodied within artworks such as the emotional and cognitive reflections they trigger which are not always observable to the naked eye [
90]. The learned representations encode these features in a way that captures the essence of the artwork’s semantics as well as visual identity. This goes beyond mere pixel values, as the algorithms internalise the higher-level characteristics that make each piece of art unique [
91].
Based on the input data source, the representation learning literature follows two notable paths: unimodal and multimodal approaches [11]. As discussed in section 2.3, we draw inspiration from the recent successes of VA RecSys employing NN-based representation learning, whose personalised recommendations capture hidden semantics and benefit users in many ways, such as learning, discovery, enhanced engagement, and a better interaction experience. The key idea of representation learning in this setting is that the textual and visual modalities of paintings are used to learn an embedding space in which similar items lie close to each other, as explained in the following subsections. Figures 2 and 3 summarise the painting representation learning approaches we propose and study for PICS rehabilitation therapy.
3.1 Unimodal VA representation learning
This approach extracts and encodes inherent features of paintings from a single type of data modality (i.e., image or textual description).
3.1.1 Image-based VA representation learning.
Today, image feature extraction techniques predominantly rely on pre-trained Convolutional Neural Network (CNN) architectures, such as AlexNet [
39], GoogLeNet [
74], and VGG [
72]. An exemplar of this trend is the winner of the 2015 ImageNet challenge, ResNet, introduced by He et al.[
28]. ResNet pioneered the integration of residual layers to facilitate the training of very deep CNNs, setting the record with architectures comprising over 100 layers. A prominent version of this architecture is ResNet-50, featuring 50 layers, trained extensively on a vast repository of images from the ImageNet database.
1 Consequently, ResNet-50 has assimilated intricate feature representations across a diverse spectrum of images and has showcased its potential as a superior visual feature extractor compared to other pre-trained models[
4,
32,
42].
To extract latent visual features (image embeddings) from paintings, we employed the ResNet-50 model pre-trained on the ImageNet dataset. By passing each painting image through the network, we derived a convolutional feature map, which yields a feature vector representation. After extracting the features of all \(m\) images in the dataset, we produce a matrix \(\mathbf {A} \in \mathbb {R}^{m \times m}\), in which each entry is the cosine similarity between a pair of image embeddings. Cosine similarity is an effective metric to find item similarities from embedding spaces, which is commonly used in data mining and information retrieval [
58,
91]. This matrix encapsulates the latent visual distribution across all images, serving as a foundation for calculating similarities among paintings for a VA RecSys task discussed in section
4.
3.1.2 Text-based VA representation learning.
Learning latent representations of paintings from their textual descriptions has proven to be a powerful technique to uncover hidden semantic concepts that are embodied across artwork [84, 89, 90]. In this work, we adopt two popular text-based representation learning approaches that have demonstrated success in VA RecSys tasks, namely Latent Dirichlet Allocation (LDA) and Bidirectional Encoder Representations from Transformers (BERT).
LDA. Our first text-based VA RecSys approach is LDA, an unsupervised generative probabilistic model proposed by Blei et al. [
7]. LDA attempts to model a collection of observations as a composite of distinct categories, or topics. In this context, each observation corresponds to a document, and the features are the presence, occurrence, or count of words, while the categories constitute the underlying topics. Notably, the specifics of the topics are not predefined; only the number of topics is chosen beforehand. Each topic is learned as a probability distribution over words, and each document as a mixture of these topics.
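A compact way to state this mixture view, written as a sketch in standard LDA notation rather than as the exact formulation used in this work, is that the probability of observing a word \(w\) in a document \(d\) decomposes over the \(k\) topics:

\[
P(w \mid d) = \sum_{t=1}^{k} P(w \mid t)\, P(t \mid d)
\]

Here \(P(w \mid t)\) is the word distribution of topic \(t\) and \(P(t \mid d)\) is the weight of topic \(t\) in document \(d\); both sets of probabilities sum to one over their first argument.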
The procedure for constructing an LDA model within the VA RecSys framework is as follows. We begin by curating a collection of documents, each containing textual information about individual paintings. Subsequently, a desired number of topics, denoted as \(k\), is determined, and each word \(w\) within the document collection is assigned to a topic. This assignment is guided by \(\theta_i \sim \mathrm{Dir}(\alpha)\), where \(\theta\) signifies the topic distribution for a document \(d\), \(\alpha\) is the parameter of the Dirichlet prior on the per-document topic distributions, \(i \in \{1, \ldots, k\}\), and \(\mathrm{Dir}(\alpha)\) denotes a Dirichlet distribution spanning the \(k\) topics. The learning phase involves computing the conditional probabilities \(P(t \mid d)\) (the likelihood of topic \(t\) given document \(d\)) and \(P(w \mid t)\) (the likelihood of word \(w\) given topic \(t\)). A comprehensive discourse on LDA topic modeling is presented in [7] and [34]. Upon completing the training of the LDA model over the entire textual dataset of \(m\) documents, one per painting, a matrix \(\mathbf {A} \in \mathbb {R}^{m \times m}\) is generated. Each entry \(a(i, j)\) within this matrix corresponds to the cosine similarity between the embeddings of documents \(i\) and \(j\). This matrix encapsulates the latent distribution of topics across all documents, which is utilised for calculating semantic similarities among paintings to derive recommendations.
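To make the similarity matrix concrete, the following is a minimal pure-Python sketch of how an \(m \times m\) cosine similarity matrix could be built from per-painting vectors and then used to rank neighbours. The vectors, the function names (`similarity_matrix`, `recommend`), and the ranking rule are illustrative assumptions, not the implementation used in this work.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Build the m x m matrix A with A[i][j] = cosine similarity of items i and j."""
    m = len(embeddings)
    return [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(m)]
            for i in range(m)]


def recommend(A, query_index, n=2):
    """Return the indices of the n items most similar to the query, excluding the query."""
    candidates = [j for j in range(len(A)) if j != query_index]
    candidates.sort(key=lambda j: A[query_index][j], reverse=True)
    return candidates[:n]


# Hypothetical document-topic vectors for four paintings (k = 3 topics).
doc_topics = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
]
A = similarity_matrix(doc_topics)
top = recommend(A, query_index=0, n=2)
```

The same two functions apply unchanged to image embeddings, since only the source of the vectors differs between the engines.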
BERT. Similarly, for the second approach with BERT, we start by curating documents for each painting. Then the feature learning process goes through three distinct phases. Firstly, we transform each painting document into an embedding representation by leveraging the pre-trained SBERT large language model.
2 This transformation maps sentences and paragraphs into a 384-dimensional dense vector space [
68]. Secondly, we employ the uniform manifold approximation and projection (UMAP) algorithm [
50] to reduce the dimensionality of these embeddings. UMAP, a dimension reduction technique, facilitates the transformation of multi-dimensional data points into a two-dimensional space. This step enhances efficiency while preserving the original embeddings’ overarching structure. Thirdly, we leverage the HDBSCAN algorithm [
9], a soft-clustering technique, to semantically cluster the reduced embeddings. HDBSCAN avoids the misallocation of unrelated documents to clusters, thus enhancing the quality of clustering outcomes.
From these clusters, we extract latent topic representations using a custom class-based term frequency-inverse document frequency (c-TF-IDF) algorithm. This algorithm generates importance scores for words within a topic cluster. The essence of c-TF-IDF lies in its capacity to provide topic descriptions by identifying the most vital words within a cluster. Words boasting high c-TF-IDF scores are selected for each topic, thereby creating topic-word distributions for every document cluster. A more detailed discussion of our topic modeling strategy with BERT can be found in the work of Grootendorst et al. [
25]. Similar to the LDA approach, upon the completion of training the BERT model across the entire textual dataset of size \(m\), we produce a matrix \(\mathbf {A} \in \mathbb {R}^{m \times m}\). Each entry within this matrix quantifies the cosine similarity between a pair of document embeddings. As with the LDA approach mentioned above, this similarity matrix captures the latent distribution of topics throughout all documents. Thus, it can be utilised to compute similarities of paintings for a recommendation task.
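The class-based TF-IDF step can be illustrated with a small sketch. It assumes the commonly stated formulation in which the score of term \(t\) in cluster \(c\) is \(tf_{t,c} \cdot \log(1 + A / tf_t)\), with \(A\) the average number of words per cluster and \(tf_t\) the frequency of \(t\) across all clusters; whether the custom variant used here matches this exactly is an assumption. The cluster labels and tokens are invented for illustration.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster.

    clusters maps a cluster label to a list of tokenised documents.
    Returns a dict: label -> {term: score}.
    """
    # Treat each cluster as one concatenated "class document".
    class_counts = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)
    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_words(scores, label, n=2):
    """Highest-scoring terms of a cluster; ties are broken alphabetically."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical clusters of tokenised painting descriptions.
clusters = {
    "landscape": [["river", "trees", "light"], ["river", "sky", "light"]],
    "portrait": [["woman", "dress", "light"], ["woman", "gaze"]],
}
scores = c_tf_idf(clusters)
```

In this toy input, "light" appears in both clusters and is therefore down-weighted relative to terms that are specific to one cluster.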
3.2 Multimodal VA representation learning
This approach combines information from multiple data sources, like images and associated textual descriptions to create a unified representation space [
65]. This joint embedding enables the exploration of the interconnectedness between the inherent attributes of each modality. The latent features extracted from images and textual descriptions are mapped into the same embedding space, ensuring that semantically similar images and corresponding textual descriptions are brought closer together [
44]. This synergy between textual narratives and visual aesthetics enhances the potential for various applications in interpreting artworks. Among the different approaches in the literature, we use Bootstrapping Language-Image Pre-training (BLIP) [
43], which has demonstrated superior performance in various downstream tasks, including VA RecSys.
BLIP is a vision-language pre-training technique that trains neural networks on paired image and text data. It trains a single model for both understanding tasks, such as matching images with text, and generation tasks, such as producing text from an image, in order to improve the model’s understanding of multimodal relationships. BLIP uses a unified encoder-decoder model that can operate in three modes. The first mode, the unimodal encoder, encodes image and text separately. The second mode, the image-grounded text encoder, uses cross-attention to inject visual information into the text encoder. The third mode, the image-grounded text decoder, replaces bi-directional self-attention layers with causal self-attention layers. During pre-training, BLIP optimizes three objectives: Image-Text Contrastive Loss (ITC), Image-Text Matching Loss (ITM), and Language Modeling Loss (LM). ITC aligns the visual and text transformers by encouraging similar representations for positive image-text pairs and dissimilar representations for negative pairs. ITM classifies whether image-text pairs are positive or negative. LM generates textual descriptions based on images.
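The contrastive idea behind ITC can be sketched in a few lines of pure Python. This is a deliberately simplified symmetric contrastive loss over a batch of matched image and text embeddings; it omits the momentum encoder and soft labels that the published BLIP objective relies on, and the fixed temperature of 0.07 and all function names are assumptions made for illustration.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive pair."""
    imgs = [normalise(v) for v in image_embs]
    txts = [normalise(v) for v in text_embs]
    n = len(imgs)
    sims = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]
    # Image-to-text: each image should pick out its own text among all texts.
    i2t = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Text-to-image: each text should pick out its own image among all images.
    t2i = sum(cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings: aligned pairs versus deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image embedding coincides with its own text embedding and grows large when the pairs are swapped, which is the behaviour the ITC objective is meant to reward.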
For our VA representation learning task, we utilized the pre-trained BLIP model as a multimodal feature extractor. First, we extract multimodal features and use the ITM head to compute ITM scores for each painting, generating probability-matching scores for each image-text pair. Then, we compute a matrix \(\mathbf {A} \in \mathbb {R}^{m \times m}\) where each entry \(A_{ij}\) is the probability matching score between the joint painting embeddings, which can be used to compute similarities for a VA RecSys task. See Figure 3 for an illustration of our multimodal approach to learning latent semantic representations of paintings with BLIP.
7 Discussion
Based on our findings, we can answer the research question posed at the beginning of this paper positively. That is, VA RecSys algorithms can indeed support the rehabilitation of post-ICU patients using art therapy. This has important implications on several fronts, as we discuss below.
7.1 Personalised visual art as PICS intervention
We have explored the potential of art therapy as a PICS intervention and have tested VA RecSys as a means to personalize visual art for this goal. Overall, we found that this comparatively new approach to using art, which combines narrative techniques with personalized recommendations, allowed participants to engage with various healing elements in the artworks. Additionally, our findings show that personalized guided art therapy is effective in temporarily alleviating negative emotions and enhancing positive emotions as well as enhancing mood states. This suggests that by increasing its duration and dosage, it has the potential to address the psychological aspects of PICS as an intervention, which could potentially result in more lasting effects in enhancing the affective state of patients. Importantly, we utilized nature-based artwork in this study. While the use of nature-based artwork to support former ICU patients is a novel approach, previous studies have demonstrated the therapeutic effects of nature-based visuals in various forms, ranging from a real nature view [
78] to static as well as dynamic versions of virtual nature [
31,
49,
83]. The results of our study contribute to the ongoing research efforts in applying nature-based visuals for therapeutic purposes, demonstrating their potential to support the healing process of patients and enhance their psychological well-being. Furthermore, in line with a recent study [
37] that has highlighted the influence of personal characteristics on the impact of visual nature experiences, our study suggests the potential for a higher level of personalization with the support of RecSys.
Finally, we should mention that the process of guided art therapy in this study engaged an expert solely during the preparation phase and otherwise ran without expert involvement. This suggests the potential for developing guided art therapy as an intervention for remote and self-administered use, which could help alleviate the primary constraints of current PICS interventions, known for their high demand on healthcare professionals.
7.2 Crossing boundaries: VA RecSys - From entertainment to therapy
VA RecSys engines originally emerged as a means to enhance user experience in the entertainment field. Particularly, recent approaches boosted by machine learning techniques have undoubtedly demonstrated their potential in supporting users such as museum visitors and art enthusiasts to discover art pieces that are tailored to their personal preferences and interests. Furthermore, their ability to uncover complex semantics embodied within visual art made them powerful tools to support learning and discovery by exposing users to novel content. While the art entertainment industry has benefited from these advancements, our study sheds light on the remarkable potential of VA RecSys to transcend the space of entertainment, emerging as therapeutic tools within the healthcare domain. In the field of art entertainment, users seek diversion, enjoyment or relaxation while service providers strive to not only enhance user engagement and satisfaction but also drive up revenue. VA RecSys has long been at the forefront, seamlessly aligning these dual objectives.
In stark contrast to entertainment, therapy serves a deeper purpose; it is a journey of healing and self-discovery. The use of art in therapy has been proven to provide individuals with a unique medium to express complex emotions, confront traumatic experiences, and embark on the path to recovery. Art therapy, in particular, has emerged as a powerful tool in the hands of trained professionals to address a wide range of psychological and emotional challenges. The key to effective therapy lies in personalization and relevance. Patients seek a therapeutic experience that resonates with their individual needs and experiences. Thus, a careful selection of paintings tailored to the individual patient speaks to their unique circumstances, fostering self-reflection and healing. By introducing VA RecSys to the domain of therapy, we have showcased its remarkable potential to assist professionals not just in curating personalized artworks from a vast selection but also in delivering precise, tailored treatment to patients.
The intent here is not mere engagement as in the domain of entertainment but rather the transformation of the individual’s affective state and well-being. Thus, the adoption of VA RecSys algorithms from entertainment to the context of therapy requires rigorous quality control before being deployed in a system facing patients. As informed by our pilot test in section
5.3, not all top-performing VA RecSys engines in the entertainment domain were found to be appropriate for the purpose of therapy. In particular, recommendations from our text-based engines, BERT and LDA, tended to contain paintings featuring content that evokes negative emotions, with potentially harmful consequences. Therefore, we underscore the importance of acknowledging the risks involved and taking the necessary precautions when adopting these algorithms. In contrast, our image-based and fusion-based engines produced paintings that were deemed appropriate by experts, and our results also indicate that they were even perceived to support healing better than expert-curated recommendations. These promising results may indicate a potential for a Human-in-the-Loop approach wherein experts fine-tune the recommendations generated by VA RecSys engines. While experts play a crucial role in ensuring the quality of paintings, VA RecSys could significantly reduce their workload (e.g., sifting through thousands of individual paintings from a database), which is reportedly a concern [10, 61], thereby enhancing the potential for scaling up guided art therapy to bring benefits to more patients. This is an exciting opportunity for Human-AI collaboration in future work.
7.3 Looking ahead: Potential of VA RecSys in healthcare beyond PICS intervention
Our exploration of VA RecSys in the context of PICS is but a glimpse into the vast potential of this innovation within healthcare. Particularly, in light of our promising results observed in PICS treatment, one natural extension of this approach is to implement VA RecSys-assisted visual art into the ICUs (see Figure
14-a). This could support PICS prevention and the well-being of patients by providing essential emotional support (e.g., reducing fear and anxiety) through recommending personalized art.
The adaptability of VA RecSys-assisted art therapy holds promise in areas far beyond the boundaries of intensive care where the use of visual content as a positive distraction is already active. For instance, the use of projection creates a more relaxing experience in an MRI room where patients can get easily worried and feel discomfort (see Figure
14-b). In the context of residential care, as another example, a virtual window or digital frames are used to support cognitive activation and recovery (see Figure
14-c). The implications of our findings extend the potential of VA RecSys engines in the intersection of AI and healthcare. The role of VA RecSys engines in facilitating the use of visual art as a positive distraction is merely the tip of the iceberg, hinting at a future where technology enables more holistic and personalized care, amplifying our capacity to heal and connect on a profound level.
8 Limitations and Future Work
While our study highlights the potential of VA RecSys within a therapeutic context, we acknowledge certain limitations and chart out promising directions for future research. Firstly, we have observed significant disparities when using VA RecSys in a therapeutic context compared to its conventional application in entertainment. While our study has shed light on these distinctions, it has also highlighted risks associated with therapeutic use, especially when leveraging text-based models. This may partially be attributed to the quality of the text data source.
Data quality, underpinning VA RecSys recommendations, plays a pivotal role in its effectiveness. We have employed artist-curated descriptions of 2,368 paintings from the National Gallery dataset, but it is evident that these descriptions may not fully encapsulate the intricate affective attributes of the artworks. Thus, improving these models for therapeutic purposes can potentially be achieved by curating richer, more comprehensive affective descriptions. Although this entails substantial content curation efforts and a thorough evaluation, we believe it is a worthwhile endeavour. For future work it would also be beneficial to implement tree-based indexing data structures to scale up more efficiently for larger datasets.
Another limitation is that our current preference elicitation method relies on users selecting a single preferred painting, which may oversimplify their preferences. An area ripe for improvement involves allowing users to rate all the sample paintings that provide comprehensive representations of affective states (i.e., calmness, restoration, and cheerfulness), thereby capturing the extent to which they resonate with each affective dimension. By using these ratings as weights and projecting them into the embedding space, we can refine recommendation accuracy and granularity. Thus, the development and evaluation of VA RecSys combining the curation of high-quality data with such preference weighting mechanisms holds potential to improve the current approach in therapeutic settings. Additionally, by deriving more personalised content recommendations that uncover deeper semantics of artworks, this may also extend current VA RecSys approaches, mostly limited to entertainment [89, 90], to benefit other areas such as education, blended learning, and discovery of artistic concepts.
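One way the rating-weighted elicitation could work is sketched below. This is a speculative illustration of the idea rather than a tested design: the ratings act as weights on the embeddings of the sample paintings, the weighted average serves as a preference profile, and catalogue paintings are ranked by cosine similarity to that profile. The three-dimensional embeddings, the painting names, and the function names are all hypothetical.

```python
import math


def weighted_profile(sample_embeddings, ratings):
    """Rating-weighted average of the sample painting embeddings."""
    total = sum(ratings)
    if total == 0:
        raise ValueError("at least one rating must be non-zero")
    dim = len(sample_embeddings[0])
    return [
        sum(r * emb[d] for r, emb in zip(ratings, sample_embeddings)) / total
        for d in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_paintings(profile, catalogue):
    """Catalogue names ordered from most to least similar to the profile."""
    return sorted(catalogue, key=lambda name: cosine(profile, catalogue[name]), reverse=True)


# Hypothetical embeddings of the samples for calmness, restoration, cheerfulness.
samples = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ratings = [5, 1, 2]  # one user's ratings of the three samples
profile = weighted_profile(samples, ratings)

# Hypothetical catalogue of candidate paintings.
catalogue = {
    "lake": [0.9, 0.1, 0.0],
    "garden": [0.1, 0.9, 0.1],
    "festival": [0.1, 0.0, 0.9],
}
ranking = rank_paintings(profile, catalogue)
```

Compared with selecting a single preferred painting, this keeps the secondary preferences in play: a user who mostly favours calmness but also rates cheerfulness moderately still sees cheerful paintings ranked above restorative ones.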
As hinted in section 7.2, one particularly promising avenue is the exploration of collaboration between humans and AI systems in the context of therapy. Here, experts can fine-tune VA RecSys-generated recommendations to align them precisely with individual patient needs. Investigating the dynamics of such collaborative efforts and developing tools to facilitate expert interventions could significantly enhance the therapeutic value of VA RecSys. This exciting direction opens doors to more targeted and personalized therapy experiences, bridging the gap between technology and human expertise. Furthermore, while we have gained valuable insights with the current sample (i.e., former patients with psychological symptoms of PICS), validation with patients exhibiting PICS symptoms is warranted. Finally, one key challenge lies in comprehending the reasoning behind the VA RecSys recommendations, which remains a critical aspect in determining model performance in different contexts. The explanation of machine learning models has been a longstanding challenge in the field of AI. Nevertheless, recent strides have been made in the realm of explainable AI, with emerging techniques and methodologies. Leveraging these innovative approaches to provide more transparent and interpretable explanations for VA RecSys recommendations holds substantial promise. This advancement can facilitate a Human-in-the-Loop approach, empowering experts to refine and enhance therapy efforts with greater precision.