1 Introduction

Medical images are widely used in clinical decision-making; for example, chest x-ray images are used for diagnosing pneumonia and pleural effusion. Interpreting medical images requires extensive expertise and is prone to human error. Given the demand for accurately interpreting large volumes of medical images within short times, an automatic medical imaging report generation model can help alleviate the labor intensity of the task. In this work, we propose a novel medical imaging report generation model focusing on radiology. Specifically, the inputs of the proposed framework are chest x-ray images under different views (frontal and lateral), from which radiology reports are generated. Radiology reports contain information summarized by radiologists and are important for further diagnosis and follow-up recommendations.

The problem setting is similar to image captioning, where the objective is to generate descriptions for natural images. Most existing studies apply similar structures: an encoder based on convolutional neural networks (CNN), and a decoder based on recurrent neural networks (RNN) [11], which capture temporal information and are widely used in natural language processing (NLP). Attention models have been applied in captioning to selectively connect visual contents and semantics [13]. More recently, studies on radiology report generation have shown promising results. To handle paragraph-level generation, a hierarchical LSTM decoder incorporating visual and tag attentions has been applied to generate medical imaging reports [6]. Xue et al. build an iterative decoder with visual attentions to enforce coherence between sentences [14]. Li et al. propose a retrieval model based on extracted disease graphs for medical report generation [7]. Medical report generation differs from image captioning in that: (1) data in the medical and clinical domains is often limited in scale, making it difficult to obtain robust models for reasoning; (2) medical reports are paragraphs rather than single sentences as in image captioning, and conventional RNN decoders such as long short-term memory (LSTM) suffer from vanishing gradients on such long outputs; and (3) generating medical reports requires higher precision when used in practice, especially for medical-related contents such as disease diagnoses.

We choose the widely used Indiana University Chest X-ray radiology report dataset (IU-RR) [1] for this task. In most cases, radiology reports contain descriptive findings in the form of paragraphs, and conclusive impressions in one or a few sentences. To address the challenges mentioned above, we aim to improve both the encoder and decoder in the following aspects:

First, we construct a multi-task scheme consisting of chest x-ray image classification and report generation. This strategy has been shown to be successful because the encoder is forced to learn radiology-related features for decoding [6]. Since the data scale of IU-RR is small, encoder pretraining is important for robust performance. Unlike previous studies that pretrain on ImageNet, which is collected for general-purpose object recognition, we pretrain on large-scale chest x-ray images from the same domain, namely CheXpert [5], to better capture domain-specific image features for decoding. Second, most previous studies using chest x-ray images for disease classification and report generation treat the frontal and lateral images from the same patient as two independent cases [6, 12]. We argue that lateral images contain information complementary to frontal images in the process of interpreting medical images. Such multi-view features should be synthesized selectively rather than contributing equally (via concatenation, mean, or sum) to the final results. Moreover, treating views independently is likely to produce inconsistent results for the same patient. We propose to synthesize multi-view information with a sentence-level attention model, and to enforce the encoder to extract consistent features with a cross-view consistency (CVC) loss.

From the decoder side, we use a hierarchical LSTM (sentence-level and word-level LSTMs) to generate radiology reports. RNN decoders tend to memorize word distributions and patterns that frequently occur in the training data, and thus might produce inaccurate results when the target contents have not been observed, or when multiple patterns share similar distributions given the previous contents. Such limitations significantly hinder the credibility of machine-generated results in practical use, since incorrect medical-related contents can be very misleading, for example, generating “left-sided pleural effusion” while the ground truth is “right-sided pleural effusion”. In addition, the source visual contents are far removed from the word decoder in a hierarchical LSTM, which makes the generation process more difficult. To address these issues, we exploit the semantics conveyed in the textual contents and apply them directly to the word decoder. We first extract frequent medical concepts from the radiology reports and fine-tune the encoder to recognize them. The obtained medical concepts provide explicit information for accurately generating deterministic medical-related contents, such as diagnoses, locations, and observations.

The main contributions of our work are summarized as follows: (1) to the best of our knowledge, we are the first to employ the latest CheXpert dataset to obtain a more robust encoder for radiology report generation; (2) we selectively incorporate multi-view visual contents with sentence-level attentions and enforce consistency between different views; (3) we extract and apply medical concepts to the decoder with word-level attentions to enhance the correctness of the medical-related contents; (4) our integrated framework outperforms the state-of-the-art baselines in the experiments; and (5) we visualize uncertain radiographic observations predicted by the encoder, an added benefit that directs more expert attention to such uncertainties for further analysis in practice.

Fig. 1. Overall framework of the proposed encoder and decoder with attentions. E, D, and \(D'\) denote the encoder, sentence decoder, and word decoder, respectively.

2 Methodology

The proposed framework consists of a multi-view CNN encoder and a concept-enriched hierarchical LSTM decoder, as shown in Fig. 1. We apply a multi-task scheme that includes: (1) radiographic observation classification to pretrain and fine-tune the encoder with large-scale images; (2) medical concept extraction from the reports; and (3) fusion of all information to generate radiology reports. Two datasets are therefore used in this work: CheXpert [5], a large collection of chest x-ray images labeled with 14 common chest radiographic observations, to pretrain the image encoder; and Indiana University Chest X-ray [1], which contains full radiology reports but at a considerably smaller scale, for training and evaluating the report generation task.

2.1 Image Encoder

The encoder uses ResNet-152 [4] as the backbone and extracts visual features for predicting chest radiographic observations and for radiology report generation.
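For illustration, the sketch below shows one way the global and local features could be obtained from a ResNet-152 backbone with torchvision; the 224×224 input size, the resulting 7×7 grid of regions (k = 49, d_v = 2048), and all variable names are assumptions rather than the authors' implementation.

```python
# A minimal sketch (not the authors' code) of extracting global and local
# visual features from a ResNet-152 backbone. In this work the backbone would
# be pretrained on CheXpert rather than left randomly initialized.
import torch
import torchvision.models as models

backbone = models.resnet152()
# Drop the average-pooling and classification layers, keeping the last CNN block.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(2, 3, 224, 224)                     # dummy frontal/lateral pair
feature_map = feature_extractor(images)                  # (2, 2048, 7, 7)

local_feats = feature_map.flatten(2).permute(0, 2, 1)    # local features v: (2, 49, 2048)
global_feats = feature_map.mean(dim=(2, 3))              # global features:  (2, 2048)
```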

Chest Radiographic Observations: The task is formulated as multi-label classification over 14 common radiographic observations following [5]: enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture, support devices, and no finding. Compared with previous studies using encoders pretrained on ImageNet [6, 14], pretraining with images from the same domain yields better results. We add one fully connected layer as the classifier and compute the binary cross entropy (BCE) loss. Additionally, we treat the frontal and lateral images of one patient as one input pair and enforce prediction consistency across views with a mean square error (MSE) loss over the multi-view encoder outputs. The loss function is defined in Eq. 1, where \(y_{i,j}\) is the j-th ground truth entry (\(j\in [1,14]\)) of the i-th sample (\(i\in [1, N]\)), and the frontal-view and lateral-view predictions are denoted as \(\hat{y}_{i_{f}}\) and \(\hat{y}_{i_{l}}\). The encoder outputs global features after average pooling, and local features \({\mathbf {v}\in \mathbb {R}^{k \times d_{v}}}\) from the last CNN block, where k denotes the number of local regions and \(d_{v}\) the feature dimension.

$$\begin{aligned} \mathcal {L}_{I} = -\sum _{v\in \{f,l\}} \sum _{i, j} \left[ y_{i,j}\log \hat{y}^{v}_{i,j} + (1-y_{i,j})\log \left( 1-\hat{y}^{v}_{i,j} \right) \right] + \lambda \sum _{i} \left( \hat{y}_{i_{f}} - \hat{y}_{i_{l}} \right) ^{2} \end{aligned}$$
(1)
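A minimal PyTorch sketch of the multi-task encoder loss in Eq. 1 is given below, assuming the classifier outputs per-observation probabilities for both views; the function name and the value of \(\lambda \) are illustrative, not taken from the paper.

```python
# A sketch of the BCE + cross-view consistency (CVC) loss of Eq. 1.
import torch
import torch.nn.functional as F

def encoder_loss(p_frontal, p_lateral, targets, lam=1.0):
    """p_frontal, p_lateral: (N, 14) predicted probabilities; targets: (N, 14) in {0, 1}."""
    # Binary cross entropy summed over both views (first term of Eq. 1).
    bce = (F.binary_cross_entropy(p_frontal, targets, reduction='sum')
           + F.binary_cross_entropy(p_lateral, targets, reduction='sum'))
    # Cross-view consistency: MSE between the two views' predictions.
    cvc = ((p_frontal - p_lateral) ** 2).sum()
    return bce + lam * cvc
```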

Medical Concepts: The textual reports contain descriptive information related to the visual contents that has not yet been exploited by existing models. For IU-RR, Medical Text Indexer (MTI) terms could potentially be used in a similar manner [6]. However, MTI terms are sometimes noisy and not normalized. Therefore, we use SemRep to extract normalized medical concepts that frequently occur in the training reports. We empirically set the minimum number of occurrences to 80, which yields 69 medical concepts with decent detection accuracy. We fix the pretrained image encoder and add another fully connected layer on top as the concept classifier.
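The concept classifier can be sketched as follows, assuming a frozen encoder that returns global features and a multi-label sigmoid output over the 69 concepts; the feature dimension and the class and variable names are illustrative assumptions.

```python
# A sketch of the medical-concept classifier: frozen encoder + one FC layer.
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    def __init__(self, encoder, feat_dim=2048, num_concepts=69):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # keep the pretrained encoder fixed
            p.requires_grad = False
        self.fc = nn.Linear(feat_dim, num_concepts)

    def forward(self, images):
        with torch.no_grad():
            global_feats = self.encoder(images)  # (N, feat_dim) global features
        # Multi-label concept probabilities \hat{y}^c, trained with BCE.
        return torch.sigmoid(self.fc(global_feats))
```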

2.2 Hierarchical Decoder

Since a conventional RNN decoder is not suitable for paragraph generation, we apply a hierarchical decoder, which has been widely used for paragraph encoding and decoding, to generate radiology reports. The decoder contains two levels: a sentence LSTM decoder that outputs sentence hidden states, and a word LSTM decoder that decodes each sentence hidden state into natural language. In this way, reports are generated sentence by sentence.
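A minimal sketch of this two-level structure is shown below, using LSTM cells and greedy word selection; the dimensions, the fixed sentence/word limits, and the attended visual input are simplified assumptions rather than the exact configuration used in this work.

```python
# A sketch of a sentence-level + word-level hierarchical LSTM decoder.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, visual_dim=2048, hidden_dim=512, embed_dim=256, vocab_size=2000):
        super().__init__()
        self.sent_lstm = nn.LSTMCell(visual_dim + hidden_dim, hidden_dim)
        self.word_lstm = nn.LSTMCell(hidden_dim + embed_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, v_att, start_id=1, max_sents=6, max_words=15):
        """v_att: (N, visual_dim) attended visual feature (recomputed per sentence in the full model)."""
        n = v_att.size(0)
        h_s = c_s = v_att.new_zeros(n, self.sent_lstm.hidden_size)
        report = []
        for _ in range(max_sents):
            # Sentence level: visual feature + previous sentence state -> topic state.
            h_s, c_s = self.sent_lstm(torch.cat([v_att, h_s], dim=1), (h_s, c_s))
            h_w = c_w = torch.zeros_like(h_s)
            word = torch.full((n,), start_id, dtype=torch.long, device=v_att.device)
            sentence = []
            for _ in range(max_words):
                # Word level: topic state + previous word embedding -> next word (greedy).
                h_w, c_w = self.word_lstm(torch.cat([h_s, self.embed(word)], dim=1), (h_w, c_w))
                word = self.word_out(h_w).argmax(dim=1)
                sentence.append(word)
            report.append(torch.stack(sentence, dim=1))
        return torch.stack(report, dim=1)   # (N, max_sents, max_words) token ids
```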

Fig. 2. Different fusion schemes for multi-view image features.

Sentence Decoder with Attentions: The sentence decoder is fed with visual features extracted from the encoder and generates sentence hidden states. Since we have both frontal and lateral features, the choice of fusion scheme is important. As shown in Fig. 2, we propose and compare three fusion schemes: direct concatenation, which simply concatenates the features from both views; early fusion, where the concatenated features are selectively attended by the previous hidden state; and late fusion, which fuses the hidden states produced by two decoders after separate visual-sentence attentions. To generate the sentence hidden state \(\mathbf {h}_{t_{s}}\) at time step \(t_{s}\in [1,N^{s}]\), we compute the visual attention weight \(\alpha _{m}\) for each local region with Eq. 2, where \(\mathbf {v}_{m}\) is the m-th local feature, \(\mathbf {a}=(a_{1},\ldots ,a_{k})\) collects the scores over all k regions, and \(\mathbf {W}_{a}\), \(\mathbf {W}_{v}\) and \(\mathbf {W}_{s}\) are weight matrices.

$$\begin{aligned} a_{m} = \mathbf {W}_{a}\left[ \tanh \left( \mathbf {W}_{v}\mathbf {v}_{m} + \mathbf {W}_{s}\mathbf {h}_{t_{s}-1} \right) \right] \text {,} \quad \alpha _{m}=softmax(\mathbf {a})_{m} \end{aligned}$$
(2)

By attending over all local regions, the attended local feature is calculated as \(\mathbf {v}_{att}=\sum _{m=1}^{k} \alpha _{m}\mathbf {v}_{m}\), and is concatenated with the previous hidden state and fed into the sentence LSTM to compute the current hidden state \(\mathbf {h}_{t_{s}}\).
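The visual-sentence attention of Eq. 2 and the attended feature can be sketched as follows, assuming single-layer linear maps for \(\mathbf {W}_{a}\), \(\mathbf {W}_{v}\) and \(\mathbf {W}_{s}\); the dimensions are illustrative. In the late-fusion variant, one such module would be applied per view and the resulting hidden states fused afterwards.

```python
# A sketch of the visual-sentence attention (Eq. 2) and attended feature.
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, visual_dim=2048, hidden_dim=512, att_dim=256):
        super().__init__()
        self.W_v = nn.Linear(visual_dim, att_dim, bias=False)
        self.W_s = nn.Linear(hidden_dim, att_dim, bias=False)
        self.W_a = nn.Linear(att_dim, 1, bias=False)

    def forward(self, v, h_prev):
        """v: (N, k, visual_dim) local features; h_prev: (N, hidden_dim) previous sentence state."""
        # a_m = W_a tanh(W_v v_m + W_s h_{t_s-1})
        scores = self.W_a(torch.tanh(self.W_v(v) + self.W_s(h_prev).unsqueeze(1)))  # (N, k, 1)
        alpha = torch.softmax(scores, dim=1)                                        # softmax over regions
        # v_att = sum_m alpha_m v_m
        return (alpha * v).sum(dim=1)                                               # (N, visual_dim)
```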

Word Decoder with Attentions: Enriched with the obtained medical concepts, the sentence hidden states are used as inputs to the word LSTM decoder. At each word decoding step \(t_{w}\in [1, N^{w}_{t_{s}}]\), the word hidden state \(\hat{\mathbf {h}}_{t_{w}}\) is used to generate the word distribution over the vocabulary, and the word with the highest score is output. The embedding \(\mathbf {w}_{t_{w}}\) of the predicted word \(\hat{w}_{t_{w}}\) is then fused with the medical concepts to generate the next word hidden state. Given the medical concept embeddings \(\mathbf {c}\in \mathbb {R}^{p\times d_{c}}\) for the p medical concepts and the predicted concept distribution \(\hat{\mathbf {y}}^{c}\), the attention weights over all medical concepts at time step \(t_{w}\) are defined in Eq. 3, where \(\mathbf {c}_{n}\) and \(\hat{y}^{c}_{n}\) denote the embedding and the predicted score of the n-th concept, \(\mathbf {a}^{c}=(a^{c}_{1},\ldots ,a^{c}_{p})\), and \(\mathbf {W}_{a^{c}}\), \(\mathbf {W}_{c}\), and \(\mathbf {W}_{w}\) are the weight matrices to be learned.

$$\begin{aligned} a^{c}_{n} = \mathbf {W}_{a^{c}}\left[ \tanh \left( \hat{y}^{c}_{n}\mathbf {W}_{c}\mathbf {c}_{n} + \mathbf {W}_{w}\hat{\mathbf {h}}_{t_{w}-1} \right) \right] \text {,} \quad \alpha ^{c}_{n}=softmax(\mathbf {a}^{c})_{n} \end{aligned}$$
(3)

Similar to the visual attention model, the attended medical concept feature is calculated as \(\mathbf {c}_{att}=\sum _{n=1}^{p} \alpha ^{c}_{n}\mathbf {c}_{n}\), and is concatenated with the previous word embedding to generate the next word. We use the cross entropy loss \(\mathcal {L}_{W}\) defined in Eq. 4, given the predicted word distribution \(\hat{y}^{w}_{t_{w}}\) and the ground truth \(y^{w}_{t_{w}}\).

$$\begin{aligned} \mathcal {L}_{W} = -\sum ^{N^{s}}_{t_{s}=1} \sum ^{N^{w}_{t_{s}}}_{t_{w}=1} y^{w}_{t_{w}}\log \left( \hat{y}^{w}_{t_{w}} \right) \end{aligned}$$
(4)
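The concept attention of Eq. 3 can be sketched as follows, assuming the predicted concept probabilities scale the projected concept embeddings before the attention scores are computed; the dimensions and names are illustrative assumptions.

```python
# A sketch of the word-level attention over medical concepts (Eq. 3).
import torch
import torch.nn as nn

class ConceptAttention(nn.Module):
    def __init__(self, concept_dim=256, hidden_dim=512, att_dim=256):
        super().__init__()
        self.W_c = nn.Linear(concept_dim, att_dim, bias=False)
        self.W_w = nn.Linear(hidden_dim, att_dim, bias=False)
        self.W_ac = nn.Linear(att_dim, 1, bias=False)

    def forward(self, concepts, y_c, h_prev):
        """concepts: (N, p, concept_dim); y_c: (N, p) predicted concept scores; h_prev: (N, hidden_dim)."""
        # Scale each projected concept embedding by its predicted score.
        weighted = y_c.unsqueeze(-1) * self.W_c(concepts)                                 # (N, p, att_dim)
        scores = self.W_ac(torch.tanh(weighted + self.W_w(h_prev).unsqueeze(1)))          # (N, p, 1)
        alpha = torch.softmax(scores, dim=1)                                              # softmax over concepts
        # c_att is concatenated with the previous word embedding in the word LSTM.
        return (alpha * concepts).sum(dim=1)                                              # (N, concept_dim)
```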

3 Experiment

Data Collection: CheXpert [5] contains 224,316 multi-view chest x-ray images from 65,240 patients, labeled with 14 common radiographic observations. The observation labels (positive, negative, or uncertain) are automatically extracted from the radiology reports using NLP tools. We inherit and visualize the uncertain predictions to direct more expert attention to them in practical use. An alternative dataset is ChestX-ray14 [12]; we chose CheXpert because its labeler is reported to be more reliable than that of ChestX-ray14 [5].

Since neither of the aforementioned datasets releases radiology reports, we use IU-RR [1] to evaluate radiology report generation. For preprocessing, we first removed samples without multi-view images, and concatenated the “findings” and “impression” sections, because in some reports all contents appear in only one of the two sections with the other left blank. We then filtered out reports with fewer than three sentences. In the end, we obtained 3,074 samples with multi-view images, of which 20% (615 samples/1,330 images) are used for testing and the remaining 80% (2,459 samples/4,918 images) for training and validation. For encoder fine-tuning, we extract the same 14 labels as [5] on IU-RR. For report parsing, we converted the texts to tokens and added “\(\langle \)start\(\rangle \)” and “\(\langle \)end\(\rangle \)” to the beginning and end of each sentence, respectively. Low-frequency words (fewer than three occurrences) were dropped, and textual errors, caused by tokens being falsely recognized as confidential information during the original de-identification of IU-RR, were replaced with “\(\langle \)unk\(\rangle \)”.
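The report preprocessing can be sketched as follows; the naive sentence splitting on periods, the special-token strings, and the heuristic for recognizing de-identification artifacts are simplifying assumptions rather than the exact pipeline used here.

```python
# A sketch of the report preprocessing: sentence filtering, special tokens,
# low-frequency word removal, and <unk> replacement of de-identification errors.
from collections import Counter

def clean_token(w, vocab):
    # De-identification artifacts (assumed here to surface as runs of 'x',
    # e.g. 'xxxx') become '<unk>'; other out-of-vocabulary tokens are dropped.
    if set(w) == {'x'}:
        return '<unk>'
    return w if w in vocab else None

def preprocess_reports(reports, min_freq=3):
    """reports: list of strings, each the concatenated 'findings' + 'impression'."""
    tokenized = []
    for report in reports:
        sentences = [s.strip().lower().split() for s in report.split('.') if s.strip()]
        if len(sentences) < 3:                  # keep reports with at least three sentences
            continue
        tokenized.append([['<start>'] + s + ['<end>'] for s in sentences])

    counts = Counter(w for rep in tokenized for sent in rep for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_freq}
    cleaned = [[[t for t in (clean_token(w, vocab) for w in sent) if t is not None]
                for sent in rep] for rep in tokenized]
    return cleaned, vocab
```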

Table 1. Average ROC-AUC (avg-AUC) on radiographic observation classification.

Chest Radiographic Observations: We conducted extensive experiments on the encoder regarding two factors: how to properly pretrain and fine-tune the encoder, and how to leverage the multi-view information. The classification results on radiographic observations are shown in Table 1. In general, pretraining with ImageNet (ImgNet) performs marginally better than models without pretraining (Base), and encoders pretrained on CheXpert (CX) perform the best, indicating that pretraining with large-scale data from the same domain helps. Enforcing cross-view consistency (CX+CVC) further improves the results. We obtained the best result by fusing the multi-view predictions with a max operation (CX+CVC+F).
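The prediction fusion used in CX+CVC+F can be sketched as follows, assuming an element-wise max over the per-view observation probabilities; the thresholding step and names are illustrative.

```python
# A sketch of fusing frontal- and lateral-view predictions with a max operation.
import torch

def fuse_predictions(p_frontal, p_lateral, threshold=0.5):
    """p_frontal, p_lateral: (N, 14) per-view observation probabilities."""
    fused = torch.maximum(p_frontal, p_lateral)    # element-wise max over the two views
    return (fused >= threshold).float(), fused     # hard labels and fused scores
```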

Table 2. Evaluations of generated radiology reports.

Radiology Report Generation: The evaluation metrics are BLEU [9], METEOR [2], and ROUGE [8] scores, all of which are widely used in image captioning and machine translation. We compared the proposed model with several state-of-the-art baselines: (1) a visual-attention-based image captioning model (Vis-Att) [13]; (2) radiology report generation models, including a hierarchical decoder with co-attention (Co-Att) [6], a multimodal generative model with visual attention (MM-Att) [14], and knowledge-driven retrieval-based report generation (KERP) [7]; and (3) the proposed multi-view encoder with hierarchical decoder (MvH), the base model with visual attentions and early fusion (MvH+AttE), MvH with late fusion (MvH+AttL), and late fusion combined with medical concepts (MvH+AttL+MC). MvH+AttL+MC* is an oracle run based on ground-truth medical concepts and is considered the upper bound of the improvement attainable by applying medical concepts. As shown in Table 2, our proposed models generally outperform the state-of-the-art baselines. Compared with MvH, multi-view feature fusion by attention (AttE and AttL) yields better results. Applying medical concepts significantly improves the performance, especially on METEOR, because recall rises when more semantic information is provided directly to the word decoder, and METEOR weights recall more heavily than precision. However, the improvement is limited by prediction errors on the medical concepts, indicating that a better encoder would benefit the whole model by a large margin, as shown by MvH+AttL+MC*.
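Since the paper does not state which toolkits compute these scores, the sketch below uses nltk (for BLEU and METEOR; METEOR requires the wordnet data and pre-tokenized inputs in recent versions) and the rouge-score package purely as illustrative stand-ins.

```python
# A sketch of scoring a generated report sentence against a reference with
# BLEU-4, METEOR and ROUGE-L; run nltk.download('wordnet') once beforehand.
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "there is a right-sided pleural effusion ."
hypothesis = "there is a left-sided pleural effusion ."

ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
bleu4 = sentence_bleu([ref_tokens], hyp_tokens, weights=(0.25, 0.25, 0.25, 0.25))
meteor = meteor_score([ref_tokens], hyp_tokens)          # pre-tokenized inputs
rouge_l = rouge_scorer.RougeScorer(['rougeL']).score(reference, hypothesis)['rougeL'].fmeasure

print(f"BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}  ROUGE-L: {rouge_l:.3f}")
```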

Discussion: As Fig. 3 shows, AttL (and the other baseline models) have difficulty generating abnormalities and locations because, unlike in our proposed model, no explicit abnormality information is involved in word-level decoding. Not all predicted medical concepts necessarily appear in the generated reports. On the other hand, prediction errors from the encoder propagate, such as predicting “right” instead of “right lung”, and affect the generated reports, suggesting that a more accurate encoder would be beneficial. Moreover, since there are no constraints on the sentence decoder during training, our model is likely to generate similar sentence hidden states. In this case, a stacked attention mechanism would help force the decoder to focus on different image sub-regions. In addition, we observe that it is difficult for our model to generate unseen sentences, and sometimes there are syntax errors. Such errors are due to the limited corpus scale of IU-RR, and we expect that exploring unpaired textual data to pretrain the decoder would address such limitations [3].

Fig. 3. An example report generated by the proposed model. The medical concepts marked red are false (positive/negative) predictions. The underlined sentences are abnormality descriptions. Uncertain predictions are visualized using Grad-CAM [10]. (Color figure online)

4 Conclusions

In this paper, we present a novel encoder-decoder model for radiology report generation. The proposed model takes advantage of multi-view information in radiology by applying visual attentions in a late fusion fashion, and enriches the semantics involved in the hierarchical LSTM decoder with medical concepts. Consequently, both the visual and textual contents are better exploited to achieve state-of-the-art performance. The automatic interpretation approach will simplify and expedite the conventional process of generating radiology reports and better assist human experts in decision-making. As a valuable added benefit, uncertain radiographic observations are extracted and visualized by our model, since it is important to direct more expert attention to such uncertainties for further analysis in practice.