
Toward Attribute-Controlled Fashion Image Captioning

Published: 23 September 2024

Abstract

Fashion image captioning is a critical task in the fashion industry that aims to automatically generate product descriptions for fashion items. However, existing fashion image captioning models predict a fixed caption for a particular fashion item once deployed, which does not cater to unique preferences. We explore a controllable way of fashion image captioning that allows the users to specify a few semantic attributes to guide the caption generation. Our approach utilizes semantic attributes as a control signal, giving users the ability to specify particular fashion attributes (e.g., stitch, knit, sleeve) and styles (e.g., cool, classic, fresh) that they want the model to incorporate when generating captions. By providing this level of customization, our approach creates more personalized and targeted captions that suit individual preferences. To evaluate the effectiveness of our proposed approach, we clean, filter, and assemble a new fashion image caption dataset called FACAD170K from the current FACAD dataset. This dataset facilitates learning and enables us to investigate the effectiveness of our approach. Our results demonstrate that our proposed approach outperforms existing fashion image captioning models as well as conventional captioning methods. We further validate the effectiveness of the proposed method on the MSCOCO and Flickr30K captioning datasets and achieve competitive performance.

1 Introduction

Fashion has always been an industry that fascinates individuals around the world. From clothing and accessories to footwear and beauty products, fashion has a universal appeal that transcends geographical and cultural boundaries. With the rise of e-commerce platforms, fashion companies tend to increase their focus on fashion-related research. Extensive efforts have been made in fashion image-related research [9, 12, 40] in recent years, with studies exploring areas such as in-shop fashion clothes retrieval [31, 37, 52], fashion landmark detection [34, 64], and fashion attribute recognition [10, 17, 23]. Alongside this progress, significant research has been dedicated to automating the generation of high-quality descriptions for fashion images [42, 43, 59]. Fashion image captioning is a process that involves generating accurate and engaging sentences that effectively describe input fashion images and specify item attributes, such as style, shape, and fabric. Fashion image captioning helps fashion e-commerce platforms generate descriptions for goods more efficiently. By using automated processes to generate descriptions, companies can save time and resources while still providing accurate and engaging product descriptions. Additionally, fashion image captioning can help consumers better understand the design and attributes of the products, which can lead to increased customer satisfaction and, ultimately, benefits for fashion companies.
Recent advancements in fashion image captioning models have leveraged the encoder–decoder framework. The fashion image encoder encodes the rich visual features of the fashion image, while the fashion caption decoder generates fashion descriptions based on these visual cues. Yang et al. [59] proposed pioneering work in the field of fashion image captioning, establishing the FAshion Image CAptioning Dataset (FACAD) for this task. Nguyen et al. [43] conducted experimental research in the field of fashion image caption generation by integrating visual attention from the backbone network [7] into a Recurrent Neural Network (RNN) language decoder. However, most existing fashion image captioning models leverage visual representations to generate a fixed caption for a given fashion image. For example, in Figure 1, given the fashion image, the pioneering method [59] generates the fixed fashion image caption "soft and cozy this irresistible button front cardi is one you love wearing" in option (a). Such models do not provide the flexibility to automatically change the generated fashion caption for the given image according to individual preferences and requirements, which limits adaptability for industry-level applications.
Fig. 1.
Fig. 1. (a) Existing fashion image captioning methods, such as AFIC [59], generate a fixed caption. The images in (b) and (c) show the controlled captions generated through our proposed method. Our proposed method can generate engaging fashion image captions that suit personal needs by leveraging fashion attributes (in red) and fashion styles (in blue). The bold words in the captions indicate that the generated words are aligned with the given semantic attributes.
To solve this problem, we propose Attribute-Controlled fashion image Captioning (ACC)—a controllable fashion image captioning method that can automatically change and generate fashion captions that suit unique preferences [5]. We propose using semantic attributes as a control signal together with visual information to generate fashion image captions. The semantic attributes consist of fashion attributes and fashion styles. Based on the different preferred semantic attributes given, we can control and generate the desired fashion image captions that interpret the fashion image with the semantic meaning of the given attributes. As illustrated in Figure 1(b) and (c), our proposed method generates different engaging fashion captions by changing the given fashion attributes and/or fashion style. Inspired by conventional image captioning work [32, 63], we adopt the transformer-based encoder–decoder framework [48] to build our controllable fashion image captioning model. We develop a model architecture that simultaneously understands fashion image content and semantic attributes to provide engaging fashion captions. Furthermore, the existing FACAD [59] contains many clothing texture images, which can result in error-prone captions. Therefore, we clean, filter, and construct a new dataset, FACAD170K, to study controllable fashion image captioning. The main contributions of our work are as follows.
(1) Our proposed approach aims to enhance the adaptability of fashion image captioning by incorporating controllability into the method, which allows for dynamic adjustments of the generated fashion captions based on individual preferences and requirements.
(2) We introduce a new model that leverages multi-modal information from both semantic attributes and fashion images to generate fashion captions. The given semantic attributes serve as a control signal to produce the desired fashion image captions that incorporate the semantic information of those attributes.
(3) We introduce a new fashion image captioning dataset, FACAD170K, which will benefit the development of fashion image captioning research and applications. Experimental results demonstrate that our method outperforms existing fashion and conventional image captioning work for fashion caption generation. We make the FACAD170K dataset publicly available.
(4) We further validate the effectiveness of the proposed method on the MSCOCO and Flickr30K captioning datasets. Experimental results show that the proposed model achieves comparable or superior performance to other state-of-the-art conventional image captioning methods.

2 Related Works

2.1 Fashion Image Captioning

Fashion image captioning has attracted significant attention in the field of computer vision for fashion images. Yang et al. [59] introduce a semantic-guided fashion image captioning model, which utilizes a spatial attention mechanism [57], semantic attribute visual features, and a reinforcement learning-based approach to effectively generate descriptive fashion captions with the FACAD. Nguyen et al. [43] experimented with channel and spatial attention features [7] for fashion image captioning. Moratelli et al. [42] propose a transformer-based fashion captioning model that integrates external textual memory through k-nearest neighbor searches, which retrieves fashion contextual information for captioning. These pioneering methods produce static image captions for fashion, lacking the ability to dynamically adjust the generated caption based on individual preferences. In contrast, our proposed method incorporates text-based semantic attribute information to provide additional controllability to the fashion captioning model, which generates preferable fashion captions. Different from our previous work [5], which directly fuses semantic attribute and visual features for controllable fashion image captioning, in this article we further explore and develop various multimodal feature fusion strategies that effectively integrate the visual and semantic attribute features in the decoding stage for better attribute-controlled fashion image caption generation.

2.2 Image Captioning

Image captioning is a cross-modal task combining computer vision [4, 18, 24, 55] and natural language processing [48, 53]. Many methods [26, 44, 50, 54, 57] have been proposed since the emergence of deep learning in recent years. Vinyals et al. [50] proposed to use a deep Convolutional Neural Network (CNN) encoder to extract the visual features of the image and utilized the RNN network as the language decoder to generate caption sequences. Subsequently, the spatial attention mechanism [57] was further investigated for its application in image captioning tasks. More recently, the transformer network [48] with a Multi-Head Attention (MHA) mechanism has been introduced to benefit image caption generation. Huang et al. [26] proposed the attention-on-attention architecture, which utilized a gated attention mechanism to measure the relevance between the attention weights and the queries. Cornia et al. [11] developed a meshed-memory transformer to exploit both low-level and high-level contributions of the visual feature for caption generation. Yan et al. [58] integrated a task-adaptive attention module into the transformer-based model, enabling it to identify task-specific clues and reduce misleading information from improper key-value pairs. Li et al. [33] developed a long short-term graph, which effectively captures short-term spatial relationships and long-term transformation dependencies of visual features for image captioning. Guo et al. [22] introduce the normalized attention module, allowing the transformer network to incorporate the geometry structure of the input objects through geometry-aware self-attention. Pan et al. [44] developed the unified X-linear attention block for image captioning. Luo et al. [41] proposed a dual-level collaborative transformer for image captioning to realize the complementary advantages of region and spatial image features. Moreover, many existing methods [27, 61] incorporate semantic visual features, which allow the captioning model to generate the most relevant attribute words. Jiang et al. [27] built a guiding network to learn an extra guiding vector for caption sentence generation. Deng et al. [13] proposed a syntax-guided hierarchical attention network that incorporates visual features with semantic and syntactic information to improve the interpretability of the captioning model. Li et al. [32] devised a transformer-based framework, which exploits the visual and semantic information simultaneously. Yao et al. [60] proposed to utilize the Graph Convolutional Network (GCN) to integrate semantic and spatial relationships between objects. Guo et al. [21] proposed to integrate semantic priors into the model by exploiting a graph-based representation of both images and sentences. Dong et al. [15] introduced a dual-GCN to model the object relationships over a single image and the relationships over similar images in the dataset for image captioning.

2.3 Controllable Image Captioning

Controllable image captioning generates image descriptions following designated control signals. Deshpande et al. [14] proposed to predict a meaningful summary of the image using part-of-speech tags and subsequently guide the caption generation based on that summary. Chen et al. [8] utilized an abstract scene graph structure to represent user intention at the fine-grained level and control what and how detailed the generated description should be. Chen et al. [6] developed a controllable image captioning model that uses human-like verb-specific semantic roles as the control signal, considering both event-compatible and sample-suitable requirements. Shuster et al. [47] specified 215 different text-based personality traits and were able to generate visual captions based on the different traits. In contrast, our controllable fashion image captioning method controls both text-based fashion attributes and styles, which allows users to inject the preferred semantic meaning into the fashion description.

3 Overall Framework

In this section, we introduce our ACC model. The model consists of a transformer-based attribute controlling encoder, a fashion image encoder with the Feature Enhancement Module (FEM), and an adaptive fashion caption decoder. Figure 2 gives an overview of the proposed architecture.
Fig. 2.
Fig. 2. The overall framework of the proposed method. We utilize semantic attributes as control signals to generate fashion image captions. Given a fashion image and preferable semantic attributes in (a) or (b) as control signals, the proposed model is able to generate the desired fashion caption. Furthermore, we introduce the FEM in the fashion image encoder to enhance visual representation learning. In the caption decoder, we utilize the LocalLSTM and the Adaptive Multimodal Fusion Module (AMFM) to model the local structures over word embeddings and to adaptively fuse multimodal information. LSTM, Long Short-Term Memory.
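To make the overall data flow concrete, the following PyTorch-style skeleton sketches how the three components could be wired together. All class and argument names are illustrative assumptions on our part; the paper does not specify this interface.

```python
import torch
import torch.nn as nn

class ACCModel(nn.Module):
    """Illustrative skeleton of the ACC pipeline described in this section (not the authors' code)."""
    def __init__(self, attr_encoder, image_encoder, caption_decoder):
        super().__init__()
        self.attr_encoder = attr_encoder        # Section 3.1: encodes semantic attributes A -> \hat{A}
        self.image_encoder = image_encoder      # Section 3.2: CNN + transformer encoder + FEM -> \hat{X}
        self.caption_decoder = caption_decoder  # Section 3.3: LocalLSTM + masked MHA + AMFM

    def forward(self, image, attr_tokens, caption_tokens):
        attr_feats = self.attr_encoder(attr_tokens)        # (B, n, d) attribute control features
        visual_feats = self.image_encoder(image)           # (B, N, D) enhanced visual features
        logits = self.caption_decoder(caption_tokens, visual_feats, attr_feats)
        return logits                                      # (B, T, vocab_size) caption word scores
```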

3.1 Attribute Controlling Encoder

We utilize semantic attributes to control and guide the process of caption generation for fashion images. By incorporating semantic attributes, such as fashion attributes and fashion style, we expect that the model can interpret fashion image content with semantic information to impart a higher level of control and allow for a more tailored fashion description generation. Let \(\mathcal{A}=\{a_{1},a_{2},...,a_{n}\}\) be the input semantic attributes, where \(a_{i}\) denotes one text word describing the attribute (e.g., stripe, fuzzy, sleeve, comfort) and \(n\) represents the total number of attributes, which can be flexibly determined depending on personal preference. We use an attribute controlling encoder to encode the semantic attributes \(\mathcal{A}\) into the feature space as \(\hat{\mathbf{A}}\in\mathbb{R}^{n\times d}\), where \(d\) is the feature dimension. Specifically, the set of semantic attributes \(\mathcal{A}\) is first encoded as word embeddings \(\mathbf{T}=[\mathbf{t}_{1};\mathbf{t}_{2};...;\mathbf{t}_{n}]\in\mathbb{R}^{n \times d}\) and then fed to a single-layer transformer encoder block with the MHA module, which can be formulated as
\[\text{Head}_{i}=\text{self-attention}(\mathbf{T}^{q},\mathbf{T}^{k},\mathbf{T}^{v})=\text{softmax}\left(\frac{\mathbf{T}^{q}(\mathbf{T}^{k})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{T}^{v}\]
(1)
\[\hat{\mathbf{A}}=\text{concat}(\text{Head}_{1},\text{Head}_{2},\dots,\text{Head}_{h})\mathbf{W}_{o}\in\mathbb{R}^{n\times d},\]
(2)
where \(\mathbf{T}^{q},\mathbf{T}^{k}\), and \(\mathbf{T}^{v}\) are the query, key, and value matrices projected from \(\mathbf{T}\), respectively. \(\mathbf{W}_{o}\) denotes the output projection matrix that aggregates the information from \(h\) parallel attention heads. \(\top\) represents the transpose operation. To ease the presentation, we omit the Feed-Forward Network and Layer Normalization modules in the equations. \(\hat{\mathbf{A}}\) will be injected into the fashion caption decoder and serve as the control signal.
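As a concrete illustration, a minimal PyTorch sketch of such an attribute controlling encoder is given below. The module and argument names are our own; the 300-dimensional attribute embeddings follow the dimensions reported in Section 6.2, and the projection to the model width is our assumption.

```python
import torch
import torch.nn as nn

class AttributeControllingEncoder(nn.Module):
    """Encodes a set of semantic attribute words into control features \hat{A} (Eqs. (1)-(2)); a sketch."""
    def __init__(self, vocab_size, d_model=512, n_heads=8, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # 300-d attribute embeddings (Section 6.2)
        self.proj = nn.Linear(embed_dim, d_model)          # projection to the model width (our assumption)
        # One transformer encoder block: MHA + feed-forward network + layer normalization.
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, attr_ids):             # attr_ids: (B, n) indices of the attribute words
        t = self.proj(self.embed(attr_ids))  # (B, n, d) word embeddings T
        return self.block(t)                 # (B, n, d) attribute control features \hat{A}
```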

3.2 Fashion Image Encoder

The fashion image encoder is utilized to extract rich and meaningful visual information from fashion images. For the input fashion image \(\mathcal{I}\), we first use a transformer-based image encoder upon a ResNet backbone [24] to extract its visual features. Furthermore, we introduce an FEM that attends to and enhances the visual representation over the channel dimension. Applying an attention mechanism in a channel-wise manner can be viewed as a process of selecting critical visual semantic representations, which further improves the feature representations and contributes to better fashion caption prediction.
Specifically, we use the pre-trained CNN-based network to extract visual features and reshape them to \(\mathbf{X}\in\mathbb{R}^{N\times D}\), where \(N=H\times W\) is the spatial size, \(H\) and \(W\) are the height and width of the visual features, and \(D\) denotes the channel dimension. The visual features \(\mathbf{X}\) are then fed into the transformer encoder to obtain the attended visual representations \(\mathbf{X}^{a}\):
\[\text{Head}_{i}=\text{self-attention}(\mathbf{X}^{q},\mathbf{X}^{k},\mathbf{X}^{v})=\text{softmax}\left(\frac{\mathbf{X}^{q}(\mathbf{X}^{k})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{X}^{v}\]
(3)
\[\mathbf{X}^{a}=\text{concat}(\text{Head}_{1},\text{Head}_{2},\dots,\text{Head}_{h})\mathbf{W}_{o}\in\mathbb{R}^{N\times D}.\]
(4)
In this way, the visual encoder can capture long-range contextual information over spatial dimensions. In addition, the encoder further enhances the visual representation by capturing the long-range contextual information over channel dimensions. To achieve this, we introduce an FEM to aggregate the context from the channel dimension \(D\). Concretely, the FEM first reshapes the extracted visual features from the backbone network with the dimension of \(H\times W\times D\) to \(\mathbf{\bar{X}}\in\mathbb{R}^{D\times N}\). Subsequently, it calculates the channel attention score based on \(\mathbf{\bar{X}}\) using the \(softmax\) function, followed by conducting matrix multiplication to weight the visual features. We present the following steps to capture the channel-level dependencies:
\[\displaystyle\mathbf{E}=\text{softmax}(\mathbf{\bar{X}} \mathbf{\bar{X}}^{\top})\]
(5)
\[\displaystyle\mathbf{X}^{e}=(\mathbf{E}\mathbf{\bar{X}})+ \mathbf{\bar{X}}\in\mathbb{R}^{D\times N}.\]
(6)
Here, \(\mathbf{E}\in\mathbb{R}^{D\times D}\) is a channel attention weight. \(\mathbf{X}^{e}\) is reshaped to the same dimension as \(\mathbf{X}^{a}\) and they are concatenated to obtain the final visual representations \(\hat{\mathbf{X}}=\mathbf{X}^{a}\oplus\mathbf{X}^{e}\in\mathbb{R}^{N\times D}\), where \(\oplus\) denotes the concatenation operator. \(\hat{\mathbf{X}}\) will be input into the fashion caption decoder for fashion description generation.
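The channel-attention computation of Equations (5) and (6), together with its combination with the spatially attended features, can be sketched as follows in PyTorch. This is our own illustration (including the projection after concatenation, which is an assumption), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancementModule(nn.Module):
    """Channel-wise attention of Eqs. (5)-(6); an illustrative sketch."""
    def forward(self, x):                                 # x: (B, N, D) flattened CNN features, N = H*W
        x_bar = x.transpose(1, 2)                         # (B, D, N), i.e., \bar{X}
        e = F.softmax(torch.bmm(x_bar, x_bar.transpose(1, 2)), dim=-1)  # (B, D, D) channel attention E
        x_e = torch.bmm(e, x_bar) + x_bar                 # (B, D, N), weighted features with a residual
        return x_e.transpose(1, 2)                        # reshaped back to (B, N, D)

class FashionImageEncoder(nn.Module):
    def __init__(self, backbone, d_model=2048, n_heads=8):
        super().__init__()
        self.backbone = backbone                          # e.g., a ResNet conv5 feature extractor
        self.self_attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fem = FeatureEnhancementModule()
        self.fuse = nn.Linear(2 * d_model, d_model)       # maps the concatenation back to D (our assumption)

    def forward(self, image):
        feats = self.backbone(image)                      # (B, D, H, W)
        x = feats.flatten(2).transpose(1, 2)              # (B, N, D)
        x_a = self.self_attn(x)                           # spatially attended features X^a, Eqs. (3)-(4)
        x_e = self.fem(x)                                 # channel-enhanced features X^e
        return self.fuse(torch.cat([x_a, x_e], dim=-1))   # concatenated representation \hat{X}
```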

3.3 Adaptive Fashion Caption Decoder

The main goal of our decoder is to predict the fashion caption based on the visual features \(\hat{\mathbf{X}}\) and the semantic attribute features \(\hat{\mathbf{A}}\). One possible solution is to apply the transformer decoder [48]. Suppose we are given \(t\) caption words, which are either predicted at inference or given at training. We can encode them as word embeddings \(\mathbf{Y}=[\mathbf{y}_{1};\mathbf{y}_{2};\dots;\mathbf{y}_{t}]\in\mathbb{R}^{T\times d}\) and apply an MHA module to obtain the sentence representations \(\mathbf{U}\in\mathbb{R}^{T\times d}\), i.e., \(\mathbf{U}={\textrm{MHA}}(\mathbf{Y}^{q},\mathbf{Y}^{k},\mathbf{Y}^{v})\). The MHA module captures the long-term dependencies among the caption words. Moreover, we observe that modeling the local structure of the words is also beneficial for our fashion image captioning task (as shown in Table 3). We propose incorporating a LocalLSTM with MHA to model both the local structure and the global dependencies of the caption words.
Specifically, we segment the long sequence of word embeddings, denoted as \(\mathbf{Y}\), into multiple shorter sequences based on a predefined window size of \(M=3\) (as illustrated in Figure 3). These short sequences of word embeddings are then processed independently by a shared Long Short-Term Memory (LSTM) network (LocalLSTM), which outputs a sequence of hidden representations \(\mathbf{H}=[\mathbf{h}_{1};\mathbf{h}_{2};\dots;\mathbf{h}_{t}]\in\mathbb{R}^{T\times d}\) carrying the information of local structures, that is,
\begin{align}\mathbf{h}_{1},\mathbf{h}_{2},...,\mathbf{h}_{t}=\text{LocalLSTM}(\mathbf{y}_{ 1},\mathbf{y}_{2},...,\mathbf{y}_{t}).\end{align}
(7)
Then, we use a masked MHA to further capture the long-term global dependencies and obtain the sentence representations as
\[\text{Head}_{i}=\text{masked-self-attention}(\mathbf{H}^{q},\mathbf{H}^{k},\mathbf{H}^{v})=\text{softmax}\left(\frac{\mathbf{H}^{q}(\mathbf{H}^{k})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{H}^{v}\]
(8)
\[\mathbf{U}=\text{concat}(\text{Head}_{1},\text{Head}_{2},\dots,\text{Head}_{h})\mathbf{W}_{o}\in\mathbb{R}^{T\times d}.\]
(9)
Fig. 3.
Fig. 3. The illustration of the LocalLSTM module. The word embeddings \(\mathbf{Y}=[\mathbf{y}_{1};\mathbf{y}_{2};\dots;\mathbf{y}_{t}]\) are segmented into multiple shorter sequences, which are then fed into the LocalLSTM module to model and capture local structural information. \(\mathbf{H}\) denotes output hidden representations.
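A minimal sketch of the windowed LSTM processing in Equation (7) is shown below; the padding behavior and tensor shapes are our own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalLSTM(nn.Module):
    """Shared LSTM applied to fixed-size windows of the word embeddings (Eq. (7)); a sketch."""
    def __init__(self, d_model=512, window=3):
        super().__init__()
        self.window = window
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, y):                            # y: (B, T, d) caption word embeddings Y
        b, t, d = y.shape
        pad = (-t) % self.window                     # pad so the length divides the window size
        y = F.pad(y, (0, 0, 0, pad))
        chunks = y.reshape(b, -1, self.window, d)    # (B, T/M, M, d) short sequences
        chunks = chunks.reshape(-1, self.window, d)  # every window is processed independently
        h, _ = self.lstm(chunks)                     # shared LSTM over each window
        return h.reshape(b, -1, d)[:, :t]            # (B, T, d) local hidden representations H
```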
Next, we predict the fashion caption words based on the sentence features \(\mathbf{U}\), visual features \(\hat{\mathbf{X}}\), and semantic attribute features \(\hat{\mathbf{A}}\). Nevertheless, our previous work [5] combined visual features with semantic attribute features through direct fusion or concatenation, which makes it challenging for the decoder to balance both representations for fashion caption generation. Hence, it is essential to investigate and develop an effective feature fusion strategy that seamlessly integrates multimodal features. We explore three feature fusion strategies to combine the visual and semantic attribute features; the evaluation results are shown in Table 5. The three feature fusion strategies are (1) Direct Concatenation (DC) of both features, using Cross-MHA with the sentence features \(\mathbf{U}\) to compute the fashion caption representations \(\mathbf{C}_{dc}\); (2) Cross Fusion, which utilizes two Cross-MHA modules conditioned on the visual features \(\hat{\mathbf{X}}\) and the semantic attribute features \(\hat{\mathbf{A}}\) to compute the fashion caption representations \(\mathbf{C}_{cf}\); and (3) the Adaptive Multimodal Fusion Module (AMFM), shown in Figure 4, which utilizes a gating mechanism to balance the contributions of the multimodal representations, contributing to better fashion caption representations \(\mathbf{C}_{amf}\).
\[(1)\ \mathbf{C}_{dc}=\text{Cross-MHA}(\mathbf{U}^{q},\mathbf{F}^{k},\mathbf{F}^{v})\in\mathbb{R}^{T\times d}\]
(10)
\[\mathbf{F}=[\hat{\mathbf{X}};\hat{\mathbf{A}}]\]
(11)
\[(2)\ \mathbf{C}_{cf}=\mathbf{W}_{cf}([\mathbf{C}_{X};\mathbf{C}_{A}])\in\mathbb{R}^{T\times d}\]
(12)
\[(3)\ \mathbf{C}_{amf}=\mathbf{C}_{X}\odot\mathbf{g}+\mathbf{C}_{A}\odot(1-\mathbf{g})\in\mathbb{R}^{T\times d}\]
(13)
\[\mathbf{g}=\delta(\mathbf{W}_{g}[\mathbf{C}_{X};\mathbf{C}_{A}])\]
(14)
\[\mathbf{C}_{X}=\text{Cross-MHA}(\mathbf{U}^{q},\hat{\mathbf{X}}^{k},\hat{\mathbf{X}}^{v})\]
(15)
\[\mathbf{C}_{A}=\text{Cross-MHA}(\mathbf{U}^{q},\hat{\mathbf{A}}^{k},\hat{\mathbf{A}}^{v}),\]
(16)
where \([;]\) and \(\delta\) denote the concatenation operation and the sigmoid activation function, respectively, and \(\mathbf{W}_{cf}\) and \(\mathbf{W}_{g}\) are projection matrices. The fashion caption representations \(\mathbf{C}\) will be fed into a linear projection layer and a softmax layer for the prediction of caption words. During inference, we can vary the input semantic attributes to compute different fashion caption representations and obtain different caption words.
Fig. 4.
Fig. 4. The structure of AMFM.
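A possible PyTorch realization of the AMFM gate in Equations (13)-(16) is sketched below; it assumes the visual features have already been projected to the decoder width \(d\), and all module names are our own rather than the authors' code.

```python
import torch
import torch.nn as nn

class AdaptiveMultimodalFusion(nn.Module):
    """Gated fusion of visual- and attribute-conditioned cross-attention (Eqs. (13)-(16)); a sketch."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_attr = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, u, visual, attr):
        # u: (B, T, d) sentence features U; visual: (B, N, d) \hat{X}; attr: (B, n, d) \hat{A}
        c_x, _ = self.cross_attn_visual(u, visual, visual)            # C_X, Eq. (15)
        c_a, _ = self.cross_attn_attr(u, attr, attr)                  # C_A, Eq. (16)
        g = torch.sigmoid(self.gate(torch.cat([c_x, c_a], dim=-1)))   # gate g, Eq. (14)
        return c_x * g + c_a * (1 - g)                                # C_amf, Eq. (13)
```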

4 Training Objective

For this fashion image captioning work, we train the model by optimizing the Cross Entropy (XE) loss \(\mathcal{L}_{XE}\) as
\begin{align}\mathcal{L}_{XE}=-\sum_{t=1}^{T}\log(p_{\theta}(\mathbf{y}_{t}^{*}|\mathbf{y}_ {1:t-1}^{*},\mathcal{A},\mathcal{I})).\end{align}
(17)
The captioning model is trained to predict the target Ground Truth (GT) word \(\mathbf{y}_{t}^{*}\) given the preceding words \(\mathbf{y}_{1:t-1}^{*}\), the fashion attributes \(\mathcal{A}\), and the fashion image \(\mathcal{I}\).
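In code, the word-level XE objective of Equation (17) reduces to a standard cross-entropy over the vocabulary. The sketch below assumes padded target tensors and a padding index of our own choosing; it is not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_id=0):
    """Eq. (17): logits (B, T, V) predict the GT words y*_t given y*_{1:t-1}, A, and I.
    The pad_id convention is our own assumption."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B*T, V)
        targets.reshape(-1),                   # (B*T,) ground-truth word indices
        ignore_index=pad_id,                   # ignore padded positions
    )
```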

5 FACAD170K Dataset

The original FACAD [59] was built by crawling about 1 million fashion images from Google and contains many noisy images. As shown in Figure 5(a), FACAD contains many identical single-colored texture images. Many of these identical texture images lead the model to predict similar captions, which affects the training result. Furthermore, some images of different fashion items share the same GT caption. For example, the last two images in Figure 5(a) have the same GT caption, which describes jeans only. Hence, we clean FACAD and choose each fashion item's best front-view image in its color set to form the FACAD170K dataset. The new FACAD170K does not contain texture images. The resolution of the fashion images is \(1{,}560\times 2{,}392\), and each image is labeled with a fashion caption and semantic attributes. The semantic attributes are adopted from [59], which extracts nouns and adjectives that appear in both the captions and the fashion item titles from the source website. The semantic attribute annotations, such as "cotton," "plaid" (fashion attributes), and "comfort" (fashion style), provide some detailed information about a specific item. We keep the maximum length of the GT caption to 25 words and at most 5 semantic attributes for each fashion item. The dataset statistics are shown in Table 1, and some examples are shown in Figure 5(b). We utilize 168,862, 5,000, and 5,000 images for training, validation, and testing, respectively. Fashion items that differ in color but share the same GT caption are contained in the same data split. We make the FACAD170K dataset publicly available at https://github.com/caicch/FACAD170K-dataset.
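For illustration, a minimal loader for an image/caption/attribute triplet dataset of this form might look as follows; the annotation file layout and JSON keys are hypothetical placeholders, not part of the released dataset specification.

```python
import json
from PIL import Image
from torch.utils.data import Dataset

class FACAD170KDataset(Dataset):
    """Minimal loader sketch; the annotation file and its keys are hypothetical placeholders."""
    def __init__(self, ann_file, image_root, transform=None):
        with open(ann_file) as f:
            self.items = json.load(f)          # assumed: list of {"image", "caption", "attributes"}
        self.image_root = image_root
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image = Image.open(f"{self.image_root}/{item['image']}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # Each sample pairs one image with a caption (<= 25 words) and up to 5 semantic attributes.
        return image, item["attributes"], item["caption"]
```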
Fig. 5.
Fig. 5. (a) Examples from FACAD. (b) Examples from the cleaned FACAD170K.
Table 1.
| Total fashion images | Total fashion captions | Average caption length | Average attributes per image | Attribute vocab | Caption vocab |
|---|---|---|---|---|---|
| 178,862 | 126,749 | 20.62 | 4.5 | 984 | 8,808 |
Table 1. Fashion Image Captioning—FACAD170K Dataset Statistics

6 Experiments

In this section, we demonstrate and investigate the effectiveness of the proposed method through ablation studies and empirical evaluations on the abovementioned FACAD170K fashion captioning dataset and the conventional MSCOCO and Flickr30K image captioning datasets.

6.1 Datasets and Evaluation Metrics

In the following experiments, we use the FACAD170K dataset to conduct the training and testing for fashion image captioning. We further evaluate the effectiveness of the proposed method on the conventional MSCOCO [36] and Flickr30K [62] image captioning datasets. MSCOCO consists of more than 120,000 images, and each image is annotated with five GT captions. We follow the Karpathy dataset split [30], where 113k, 5k, and 5k images are used for training, testing, and validation, respectively. The vocabulary size for the MSCOCO caption dataset is 9,487. The Flickr30K dataset consists of 31,000 images, each annotated with 5 captions. Similarly, we use the Karpathy splits, where 29k, 1k, and 1k images are used for training, validation, and testing, respectively. The vocabulary size is 7,000.
We adopt the commonly used automatic evaluation metrics to evaluate the quality of image captions, which include BLEU\(@\)N (B\(@\)N, N = 1,2,3,4) [45], METEOR (M) [3], ROUGE-L (R) [35], CIDEr (C) [49], and SPICE (S) [1]. The BLEU metric evaluates the precision between the candidate and reference sentences, with "N" denoting the n-gram precision. METEOR evaluates uni-gram precision and recall, while ROUGE-L measures similarity by calculating the longest common subsequence between two sentences. Both of these metrics account for sentence fluency through a penalty factor. CIDEr further measures the semantic representation of the sentence by computing the cosine similarity with term frequency-inverse document frequency values. SPICE measures the effectiveness of the captions in recovering objects, attributes, and relations. In all cases, higher metric scores indicate greater accuracy of the generated captions.
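These metrics are typically computed with the pycocoevalcap package; the following sketch shows one common way to score a set of generated captions. It is a generic recipe, not necessarily the evaluation script used by the authors.

```python
# gts / res: dicts mapping an image id to a list of reference / generated caption strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def evaluate_captions(gts, res):
    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)                      # B@1-B@4
    scores.update({f"B@{i + 1}": s for i, s in enumerate(bleu)})
    scores["METEOR"], _ = Meteor().compute_score(gts, res)
    scores["ROUGE-L"], _ = Rouge().compute_score(gts, res)
    scores["CIDEr"], _ = Cider().compute_score(gts, res)
    return scores
```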

6.2 Training Details for Fashion Image Captioning

We fine-tune the conv5 layers to extract the fashion image features, similar to the existing method [59]. The dimensions of the image features, semantic attribute features, and caption features are 2,048, 300, and 512, respectively. The local window size is set to 3. The numbers of encoder and decoder layers are set to 2 and 4, respectively, with 8 attention heads for the best captioning performance. We use the Adam optimizer to train the proposed model with a batch size of 16. The base learning rate is set to \(1\times 10^{-4}\). The model is trained using the XE loss. We use the beam search strategy with a beam size of 5. At the inference stage, the input semantic attributes are optional, and we assume that the given attributes reasonably match the image when they are provided as input to the model.
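Put together, the reported hyperparameters correspond to a training loop of roughly the following shape; `model`, `train_loader`, and `xe_loss` stand in for the components sketched earlier and are placeholders rather than the authors' code.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam with base learning rate 1e-4

for images, attrs, captions in train_loader:                # batch size 16, padded caption tensors
    logits = model(images, attrs, captions[:, :-1])         # teacher forcing on the GT caption
    loss = xe_loss(logits, captions[:, 1:])                 # XE loss of Eq. (17)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference, captions are decoded with beam search (beam size 5); the semantic
# attributes are optional inputs and are assumed to reasonably match the image.
```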

6.3 Quantitative Evaluation for Fashion Image Captioning

Table 2 compares our proposed method to the existing fashion image captioning method and the conventional image captioning methods trained on the FACAD170K dataset. The column Semantic Attribute (SA) indicates whether the model uses semantic attributes for captioning. For all the baselines used for comparison, we re-implement the methods based on their original papers. We report the results using ResNet-101 as the backbone network. The methods compared include SAT [57], AFIC [59], ETA [32], MVT [63], ASET [66], EDK [56], and MFFT [67]. AFIC [59] is the method proposed for fashion image captioning. The other methods were originally proposed for conventional image captioning, and we use them as baselines for the fashion image captioning task. We modify \(\text{MVT}_{fa}\), \(\text{ASET}_{fa}\), and \(\text{EDK}_{fa}\) to take semantic attributes as additional input during training and use them for comparison. \(\text{ACC}_{visual}\) is our proposed method that only takes the image as input to generate the fashion caption without any semantic attributes, while \(\text{ACC}_{visual+fa}\) uses the image and up to 5 semantic attributes for fashion captioning. In Table 2, we can observe that when we only use the image as input, our framework achieves the best performance over all the metrics, which we attribute to the FEM and LocalLSTM modules. Furthermore, our proposed model significantly improves the performance when it takes paired semantic attributes and images as input. The semantic attributes are beneficial for fashion image captioning, in which we can generate better fashion image captions by incorporating the semantic attributes.
Table 2.
| Model | SA | B@1 | B@2 | B@3 | B@4 | M | R | C |
|---|---|---|---|---|---|---|---|---|
| SAT [57] | × | 27.5 | 16.1 | 11.8 | 8.1 | 12.5 | 23.2 | 69.3 |
| ETA [32] | × | 26.5 | 15.7 | 12.0 | 8.6 | 12.7 | 22.2 | 72.4 |
| AFIC [59] | × | 28.2 | 16.9 | 12.5 | 10.5 | 13.0 | 23.9 | 88.7 |
| MVT [63] | × | 27.7 | 16.8 | 12.6 | 10.7 | 13.0 | 24.1 | 92.0 |
| ASET [66] | × | 28.3 | 17.3 | 13.1 | 11.0 | 13.2 | 24.1 | 93.7 |
| EDK [56] | × | 28.1 | 17.4 | 13.0 | 11.0 | 13.2 | 24.2 | 93.2 |
| MFFT [67] | × | 27.3 | 16.9 | 12.7 | 10.6 | 13.1 | 23.5 | 91.4 |
| \(\text{MVT}_{fa}\) [63] | √ | 45.6 | 29.4 | 22.5 | 17.4 | 21.8 | 37.8 | 162.6 |
| \(\text{ASET}_{fa}\) [66] | √ | 45.3 | 30.4 | 22.6 | 18.8 | 21.8 | 38.0 | 168.8 |
| \(\text{EDK}_{fa}\) [56] | √ | 46.1 | 31.0 | 22.6 | 18.5 | 22.0 | 38.1 | 169.5 |
| \(\text{ACC}_{visual}\) | × | 28.5 | 17.7 | 13.5 | 11.4 | 13.3 | 24.5 | 97.0 |
| \(\text{ACC}_{visual+fa}\) | √ | 46.1 | 31.9 | 24.4 | 20.8 | 22.4 | 38.9 | 183.5 |
Table 2. Fashion Image Captioning Performance on the FACAD170K Test Split
All values are in percentages (%), and higher is better. The italic numbers indicate the best result when only taking the fashion image as input, and the bold numbers indicate the best result by including the semantic attribute as input.

6.4 Ablation Analysis for Fashion Image Captioning

In Table 3, we investigate the effectiveness of the various proposed modules in fashion image captioning. \(Baseline\) uses the full transformer-based encoder–decoder architecture and takes only fashion images as input for captioning. \(Baseline+LocalLSTM\) includes the LocalLSTM module to model the local structure of word embeddings during fashion caption training, while \(Baseline+LocalLSTM+FEM\) further involves the FEM to incorporate more informative visual features from fashion images. We can observe that both LocalLSTM and FEM help the fashion image captioning model generate more accurate fashion captions, in which the \(Baseline+LocalLSTM+FEM\) model improves the C and B@4 scores by 8.6 and 1.0 points, respectively.
Table 4 measures the effectiveness of the given semantic attributes for the fashion image captioning model. However, it is hard to evaluate the performance of controlled fashion captions generated from the user's preferred fashion attributes or fashion styles. Hence, we include a semantic attribute multi-label image classifier to model the semantic attribute representations (\(Visual+f_{rep}\)), similar to the work [66], and assume that the modeled semantic attribute features represent the user's preferred fashion attributes. Furthermore, we also evaluate the effectiveness of using the paired semantic attributes as input for fashion image captioning. We test the effectiveness of utilizing 1, 3, and 5 (\(Visual+5f_{a}\)) semantic attributes. We can see that the performance of using the modeled semantic attribute features \(Visual+f_{rep}\) is close to the \(Visual\) model, meaning that the model performs well when using reasonably matched fashion semantic attributes. In addition, we can observe that the evaluation result improves as the number of given semantic attributes increases. This further proves that semantic attributes are beneficial for fashion image captioning tasks.
Table 3.
| Models | B@1 | B@2 | B@3 | B@4 | M | R | C |
|---|---|---|---|---|---|---|---|
| \(Baseline\) | 28.0 | 16.9 | 12.7 | 10.4 | 12.9 | 24.0 | 88.4 |
| \(Baseline+LocalLSTM\) | 28.3 | 17.4 | 13.3 | 11.2 | 13.2 | 24.2 | 94.9 |
| \(Baseline+LocalLSTM+FEM\) | 28.5 | 17.7 | 13.5 | 11.4 | 13.3 | 24.5 | 97.0 |
Table 3. Ablation Study on Different Settings When Taking Fashion Image as Input for Caption Generation
The best scores are denoted in bold.
Table 4.
| Models | B@1 | B@2 | B@3 | B@4 | M | R | C |
|---|---|---|---|---|---|---|---|
| \(Visual\) | 28.5 | 17.7 | 13.5 | 11.4 | 13.3 | 24.5 | 97.0 |
| \(Visual+f_{rep}\) | 28.5 | 17.6 | 13.4 | 11.5 | 13.4 | 24.7 | 95.3 |
| \(Visual+1f_{a}\) | 32.7 | 20.7 | 15.8 | 13.3 | 16.0 | 30.4 | 109.7 |
| \(Visual+3f_{a}\) | 41.2 | 26.3 | 19.9 | 16.6 | 19.2 | 34.3 | 147.8 |
| \(Visual+5f_{a}\) | 46.1 | 31.9 | 24.4 | 20.8 | 22.4 | 38.9 | 183.5 |
Table 4. The Evaluation Result of Utilizing Semantic Attributes for Fashion Captioning
The best scores are denoted in bold.
Furthermore, we investigate the three multimodal feature fusion strategies (Section 3.3) for the fashion captioning model and show the evaluation results in Table 5. We compare the strategies in Equations (10), (12), and (13) and conclude that the proposed AMFM best fuses the visual and attribute features for ACC. The proposed model with AMFM outperforms the other two strategies on nearly all evaluation metrics.
Table 5.
| Fusion methods | B@1 | B@2 | B@3 | B@4 | M | R | C |
|---|---|---|---|---|---|---|---|
| \(\mathbf{C}_{dc}\) | 46.2 | 31.3 | 23.2 | 19.4 | 22.1 | 37.5 | 176.5 |
| \(\mathbf{C}_{cf}\) | 46.5 | 31.0 | 23.1 | 19.6 | 22.3 | 38.6 | 178.5 |
| \(\mathbf{C}_{amf}\) | 46.1 | 31.9 | 24.4 | 20.8 | 22.4 | 38.9 | 183.8 |
Table 5. The Performance of Adopting Various Multimodal Fusion Strategies
The best scores are denoted in bold.

6.5 Qualitative Analysis for Fashion Image Captioning

Figure 6 shows examples of fashion image captions generated by our ACC model and the existing methods [59, 63], as well as the GT captions. Our method lets users describe the fashion image according to their preferences. For instance, in example (a), the fashion expert or user can inject the fashion attributes "logo" and "crewneck" into the model to generate a sentence that focuses more on describing the clothing detail, "iconic logo on this comfy crewneck sweatshirt," or they may inject a fashion style like "cool" to control the fashion caption generation so that it describes the fashion image with a more stylish feeling of coolness, as in "a cool black sweatshirt keep you comfy and chic while kicking back at home." Similar to the examples shown in (b), (c), (d), (e), and (f), we can generate charming and engaging fashion captions by changing the preferred fashion attribute and style words. In particular, in (e), the caption generated from image-only input tends to describe the sneaker with an "Italian" style (e.g., "handcrafted italian sneaker"), but the user can shift the style toward "French," as in "minimal detailing brings maximum versatility to a black french fashion sneaker kicked up a notch," based on their preferences. Besides controlled fashion captions, the \(\text{ACC}_{visual}\) and \(\text{ACC}_{visual+f_{a}}\) models generate more accurate captions that better describe the fashion items. The fashion captions generated by existing methods are aligned with the logic of language but less accurate with respect to the GT captions. We can see that the words predicted (bold words) in the captions generated by \(\text{ACC}_{visual}\) are more accurate with respect to the GT caption.
Fig. 6.
Fig. 6. Sample fashion captions generated by the proposed method and the existing methods MVT [63] and AFIC [59]. By conditioning on text-based semantic attributes, our model is able to provide the flexibility to control the fashion image caption generation in the user's preferred way. The bold words denote correctly predicted words with respect to the GT caption; red and green words represent the emphasized fashion attributes and styles in the generated attribute-controlled captions.

6.6 Ablation Analysis for MSCOCO Dataset

Besides the fashion image captioning evaluation, we further test the effectiveness of the introduced modules on the MSCOCO captioning dataset. Since our proposed ACC model is an end-to-end captioning method, we follow the settings in recent end-to-end transformer-based captioning works [38, 65]. We evaluate the captioning performance by including the LocalLSTM and FEM in the baseline transformer encoder–decoder framework. The model is trained using the XE loss. For a fair comparison, the semantic attribute is not included for the captioning evaluation on the MSCOCO captioning dataset. We use the Swin Transformer [39] (which has fewer parameters) as the backbone network instead of the Vision Transformer [16] backbone used in [38] to ease the training process. The ablative evaluation is shown in Table 6. We can observe that the introduced model outperforms the baseline model by 0.9 and 1.7 points in the B@4 and C scores, respectively. The improved results are aligned with those shown in Table 3 and further prove the effectiveness of the introduced components.
Table 6.
| Models | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| \(Baseline\) | 77.3 | 61.1 | 47.4 | 36.7 | 28.8 | 57.3 | 121.0 | 21.8 |
| \(Baseline+LocalLSTM\) | 77.6 | 61.7 | 48.0 | 37.3 | 28.6 | 57.6 | 122.0 | 21.8 |
| \(Baseline+LocalLSTM+FEM\) | 77.8 | 62.2 | 48.4 | 37.6 | 28.6 | 58.7 | 122.7 | 21.9 |
Table 6. Ablation Study on Different Settings on MSCOCO Image Captioning Dataset
The bold number indicates the best result.

6.7 Quantitative Analysis for Conventional Image Captioning Dataset

We report the performance comparisons between our proposed method and the state-of-the-art models trained using the self-critical loss (SCST) [46] on the MSCOCO captioning test set (Karpathy test split) in Table 7. The models compared include Up-Down [2], GCN-LSTM [60], RFNet [28], ORT [25], AoANet [26], ETA [32], MVT [63], ASET [66], BCAN [29], CPTR [38], S2TC [65], MFFT [67], and IRM [69]. Up-Down explores using bottom-up region features for image captioning with soft attention. GCN-LSTM exploits the pairwise relationship information between region features for image captioning. RFNet and BCAN learn to collaboratively enhance the attention over regional visual features and word features to improve the captioning performance. ORT, AoANet, and MVT utilize transformer-based designs for image captioning. ORT introduces geometric attention to the captioner that additionally injects the spatial relationship over regional visual features. AoANet improves the MHA mechanism to select the relevant features for captioning. MVT incorporates multiple visual features with various backbones as visual inputs. ETA and ASET inject semantic and visual information into transformer-based architectures to boost the captioning performance. CPTR and S2TC study end-to-end image captioning methods with grid or token-based visual features. MFFT proposes to fuse multiple features, such as spatial and semantic information, to assist the caption generation. IRM introduces a retrieval-based method that retrieves semantically related information from sentences for the description generation. Our method achieves superior performance on CIDEr ("C"), one of the most important metrics for image captioning, which evaluates the sentence accuracy and the semantic precision of the predicted sentences. Furthermore, it can be observed that our proposed method achieves competitive performance on most of the evaluation metrics as compared to other methods. The performance emphasizes the effectiveness of our approach.
Table 7.
| Methods | B@1 | B@2 | B@3 | B@4 | R | M | C | S |
|---|---|---|---|---|---|---|---|---|
| Up-Down [2] | 79.8 | - | - | 36.3 | 56.9 | 27.7 | 120.1 | 21.4 |
| GCN-LSTM [60] | 80.5 | - | - | 38.2 | 58.3 | 28.5 | 127.6 | 22.0 |
| RFNet [28] | 79.1 | 63.1 | 48.4 | 36.5 | 57.3 | 27.7 | 121.9 | 21.2 |
| AoANet [26] | 80.2 | - | - | 38.9 | 58.8 | 29.2 | 129.8 | 22.4 |
| ORT [25] | 80.5 | - | - | 38.6 | 58.4 | 28.7 | 128.3 | 22.6 |
| ETA [32] | 81.5 | - | - | 39.3 | 58.9 | 28.8 | 126.6 | 22.7 |
| MVT [63] | 80.8 | - | - | 39.8 | 59.1 | 29.1 | 130.9 | - |
| BCAN [29] | 81.1 | 65.2 | 50.3 | 37.9 | 58.6 | 28.6 | 125.3 | 22.5 |
| CPTR [38] | 81.7 | 66.6 | 52.2 | 40.0 | 59.4 | 29.1 | 129.4 | - |
| S2TC [65] | 81.1 | - | - | 39.6 | 59.1 | 29.6 | 133.5 | 23.2 |
| ASET [66] | 80.6 | - | - | 39.3 | 58.9 | 29.2 | 131.0 | 23.1 |
| MFFT [67] | 81.0 | - | - | 39.6 | 59.1 | 29.2 | 131.1 | 23.0 |
| IRM [69] | 81.2 | - | - | 39.4 | 59.3 | 29.4 | 131.0 | 22.9 |
| Ours | 81.3 | 66.6 | 52.4 | 39.6 | 59.4 | 29.4 | 134.2 | 23.3 |
Table 7. Image Captioning Performance on the MSCOCO Test Split When Using the SCST [46] Loss
All values are in percentages (%), and higher is better. The italic numbers indicate the second best result, and the bold numbers indicate the best result.
We further validate the performance of the proposed method on the Flickr30K captioning test set (Karpathy test split) in Table 8. We compare with recent methods, namely DAN [19], STP [20], IPSG [68], IGCA [51], ASET [66], and IRM [69], which have reported their performance on the Flickr30K captioning dataset. DAN introduces a deliberate residual attention network that generates preliminary captions and then refines them for better captions. STP explores a ruminant decoding framework to refine the initial caption generated by the base decoder and produce a more comprehensive and polished result. IPSG proposes to integrate part-of-speech information into the encoder–decoder framework. IGCA captures global information on the target words, thereby improving the predictions of target words in image captioning. Table 8 shows that our proposed method outperforms other approaches across most evaluation metrics. These improvements highlight the effectiveness of the proposed method.
Table 8.
| Methods | B@1 | B@2 | B@3 | B@4 | M | R | C |
|---|---|---|---|---|---|---|---|
| DAN [19] | 73.8 | 55.1 | 40.3 | 29.4 | 23.0 | - | 66.6 |
| STP [20] | - | - | - | 26.8 | 20.5 | 48.1 | 57.0 |
| IPSG [68] | 69.4 | 49.8 | 35.5 | 25.4 | 25.1 | 53.8 | 46.9 |
| IGCA [51] | 73.3 | - | - | 30.2 | 25.7 | 54.1 | 58.1 |
| ASET [66] | 73.3 | 56.6 | 42.9 | 32.1 | 23.6 | - | 70.9 |
| IRM [69] | 74.3 | - | - | 29.8 | 23.4 | 50.6 | 68.3 |
| Ours | 74.1 | 57.5 | 43.7 | 33.0 | 24.3 | 55.3 | 73.7 |
Table 8. Image Captioning Performance on the Flickr30K Test Split
The bold number indicates the best result.

7 Conclusion

In this article, we propose a novel method for controllable fashion image captioning. We utilize paired semantic attributes together with fashion images to train the proposed method. Furthermore, we introduce the FEM to enhance the fine-tuned visual representation by emphasizing the contextual information over channel dimensions, and we integrate the LocalLSTM module to learn the local structures of the sentence besides modeling global dependencies with MHA. In addition, we conduct extensive experiments and propose to fuse the multimodal visual and semantic attribute information with the AMFM for better attribute-conditioned fashion image caption generation. At the inference stage, our proposed method allows the end user to freely choose the preferred semantic attributes to influence and control the generation of fashion image captions. The generated caption will contain the semantic information of the user-provided semantic attributes. Our proposed method provides the flexibility to control and generate engaging fashion image captions based on personally preferred semantic attributes.

References

[1]
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In 2016 Proceedings of the European Conference on Computer Vision. 382–398.
[2]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077–6086.
[3]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In 2005 Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[4]
Chen Cai, Suchen Wang, Kim-Hui Yap, and Yi Wang. 2024. Top-down framework for weakly-supervised grounded image captioning. Knowledge-Based Systems 287 (2024), 111433. DOI:
[5]
Chen Cai, Kim-Hui Yap, and Suchen Wang. 2022. Attribute conditioned fashion image captioning. In 2022 IEEE International Conference on Image Processing. 1921–1925. DOI:
[6]
Long Chen, Zhihong Jiang, Jun Xiao, and Wei Liu. 2021. Human-like controllable image captioning with verb-specific semantic roles. In 2021 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16841–16851. DOI:
[7]
Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. 2017. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5659–5667.
[8]
Shizhe Chen, Qin Jin, Peng Wang, and Qi Wu. 2020. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. In 2020 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9962–9971.
[9]
Wen-Huang Cheng, Sijie Song, Chieh-Yun Chen, Shintami Chusnul Hidayati, and Jiaying Liu. 2021. Fashion meets computer vision: A survey. ACM Computing Surveys 54, 4, Article 72 (Jul 2021), 41 pages. DOI:
[10]
Charles Corbiere, Hedi Ben-Younes, Alexandre Ramé, and Charles Ollion. 2017. Leveraging weakly annotated data for fashion image retrieval and label prediction. In 2017 Proceedings of the IEEE International Conference on Computer Vision Workshops. 2268–2274.
[11]
Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020). 10578–10587.
[12]
Lavinia De Divitiis, Federico Becattini, Claudio Baecchi, and Alberto Del Bimbo. 2023. Disentangling features for fashion recommendation. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 1s (2023), 1–21.
[13]
Jincan Deng, Liang Li, Beichen Zhang, Shuhui Wang, Zhengjun Zha, and Qingming Huang. 2022. Syntax-guided hierarchical attention network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology 32, 2 (2022), 880–892. https://doi.org/10.1109/TCSVT.2021.3063423
[14]
Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G. Schwing, and David Forsyth. 2019. Fast, diverse and accurate image captioning guided by part-of-speech. In 2019 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10695–10704.
[15]
Xinzhi Dong, Chengjiang Long, Wenju Xu, and Chunxia Xiao. 2021. Dual graph convolutional networks with transformer and curriculum learning for image captioning. In 2021 Proceedings of the 29th ACM International Conference on Multimedia. 2615–2624.
[16]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. DOI:
[17]
Zunlei Feng, Zhenyun Yu, Yongcheng Jing, Sai Wu, Mingli Song, Yezhou Yang, and Junxiao Jiang. 2019. Interpretable partitioned embedding for intelligent multi-item fashion outfit composition. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 2s (2019), 1–20.
[18]
Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3141–3149.
[19]
Lianli Gao, Kaixuan Fan, Jingkuan Song, Xianglong Liu, Xing Xu, and Heng Tao Shen. 2019. Deliberate attention networks for image captioning. In 2019 Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8320–8327.
[20]
Longteng Guo, Jing Liu, Shichen Lu, and Hanqing Lu. 2019a. Show, tell, and polish: Ruminant decoding for image captioning. IEEE Transactions on Multimedia 22, 8 (2019), 2149–2162.
[21]
Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, and Hanqing Lu. 2019b. Aligning linguistic words and visual semantic units for image captioning. In 2019 Proceedings of the 27th ACM International Conference on Multimedia. 765–773.
[22]
Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, and Hanqing Lu. 2020. Normalized and geometry-aware self-attention network for image captioning. In 2020 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10327–10336.
[23]
Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. 2017. Automatic spatially-aware fashion concept discovery. In 2017 Proceedings of the IEEE International Conference on Computer Vision. 1463–1471.
[24]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[25]
Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. Advances in Neural Information Processing Systems 32 (2019).
[26]
Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In 2019 Proceedings of the IEEE/CVF International Conference on Computer Vision. 4634–4643.
[27]
Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, and Wei Liu. 2018a. Learning to guide decoding for image captioning. In 2018 Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No. 1.
[28]
Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. 2018b. Recurrent fusion network for image captioning. In 2018 Proceedings of the European Conference on Computer Vision (ECCV). 499–515.
[29]
Weitao Jiang, Weixuan Wang, and Haifeng Hu. 2021. Bi-directional co-attention network for image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4, Article 125 (Nov 2021), 20 pages.
[30]
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In 2015 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
[31]
Furkan Kinli, Baris Ozcan, and Furkan Kirac. 2019. Fashion image retrieval with capsule networks. In 2019 Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
[32]
Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. 2019b. Entangled transformer for image captioning. In 2019 Proceedings of the IEEE/CVF International Conference on Computer Vision. 8928–8937.
[33]
Liang Li, Xingyu Gao, Jincan Deng, Yunbin Tu, Zheng-Jun Zha, and Qingming Huang. 2022. Long short-term relation transformer with global gating for video captioning. IEEE Transactions on Image Processing 31 (2022), 2726–2738. DOI:
[34]
Yixin Li, Shengqin Tang, Yun Ye, and Jinwen Ma. 2019a. Spatial-aware non-local attention for fashion landmark detection. In 2019 IEEE International Conference on Multimedia and Expo. 820–825.
[35]
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. ACM, 74–81.
[36]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference. 740–755.
[37]
Yujie Lin, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Jun Ma, and Maarten De Rijke. 2019. Explainable outfit recommendation with joint outfit matching and comment generation. IEEE Transactions on Knowledge and Data Engineering 32, 8 (2019), 1502–1516.
[38]
Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, and Jing Liu. 2021a. CPTR: Full transformer network for image captioning. arXiv:2101.10804. DOI:
[39]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
[40]
Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In 2016 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1096–1104.
[41]
Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, and Rongrong Ji. 2021. Dual-level Collaborative Transformer for Image Captioning. 2021 Proceedings of the AAAI Conference on Artificial Intelligence, 2286–2293.
[42]
Nicholas Moratelli, Manuele Barraco, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2023a. Fashion-oriented image captioning with external knowledge retrieval and fully attentive gates. Sensors 23, 3 (2023). DOI:
[43]
Bao T. Nguyen, Om Prakash, and Anh H. Vo. 2021. Attention mechanism for fashion image captioning. In Computational Intelligence Methods for Green Technology and Sustainable Development: Proceedings of the International Conference (GTSD ’20). 93–104.
[44]
Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. In 2020 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10971–10980.
[45]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In 2002 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[46]
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7008–7024.
[47]
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. 2019. Engaging image captioning via personality. In 2019 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12516–12526.
[48]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of 31st Conference on Neural Information Processing Systems. Vol. 30.
[49]
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In 2015 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[50]
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In 2015 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.
[51]
Changzhi Wang and Xiaodong Gu. 2022. Image captioning with adaptive incremental global context attention. Applied Intelligence 52 (2022), 1–23.
[52]
Zhonghao Wang, Yujun Gu, Ya Zhang, Jun Zhou, and Xiao Gu. 2017. Clothing retrieval with visual attention model. In 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 1–4.
[53]
Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2019. R-Transformer: Recurrent Neural Network Enhanced Transformer. CoRR abs/1907.05572. Retrieved from https://doi.org/10.48550/arXiv.1907.05572
[54]
Kejun Wu, You Yang, Qiong Liu, Gangyi Jiang, and Xiao-Ping Zhang. 2023c. Hierarchical independent coding scheme for varifocal multiview images based on angular-focal joint prediction. IEEE Transactions on Multimedia (2023), 1–13. DOI:
[55]
Kejun Wu, You Yang, Qiong Liu, and Xiao-Ping Zhang. 2023b. Focal stack image compression based on basis-quadtree representation. IEEE Transactions on Multimedia 25 (2023), 3975–3988. DOI:
[56]
Ting-Wei Wu, Jia-Hong Huang, Joseph Lin, and Marcel Worring. 2023a. Expert-defined keywords improve interpretability of retinal image captioning. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. 1859–1868. DOI:
[57]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of International Conference on Machine Learning. 2048–2057.
[58]
Chenggang Yan, Yiming Hao, Liang Li, Jian Yin, Anan Liu, Zhendong Mao, Zhenyu Chen, and Xingyu Gao. 2022. Task-adaptive attention for image captioning. IEEE Transactions on Circuits and Systems for Video Technology 32, 1 (2022), 43–51. DOI:
[59]
Xuewen Yang, Heming Zhang, Di Jin, Yingru Liu, Chi-Hao Wu, Jianchao Tan, Dongliang Xie, Jue Wang, and Xin Wang. 2020. Fashion captioning: Towards generating accurate descriptions with semantic rewards. In Computer Vision–ECCV 2020: 16th European Conference, Proceedings, Part XIII 16. 1–17.
[60]
Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In 2018 Proceedings of the European Conference on Computer Vision (ECCV). 684–699.
[61]
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In 2016 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4651–4659.
[62]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78. DOI:
[63]
Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2020. Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology 30, 12 (2020), 4467–4480. DOI:
[64]
Weijiang Yu, Xiaodan Liang, Ke Gong, Chenhan Jiang, Nong Xiao, and Liang Lin. 2019. Layout-graph reasoning for fashion landmark detection. In 2019 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2937–2945.
[65]
Pengpeng Zeng, Haonan Zhang, Jingkuan Song, and Lianli Gao. 2022. S2 transformer for image captioning. In 2022 Proceedings of the International Joint Conferences on Artificial Intelligence, Vol. 5. DOI:
[66]
Jing Zhang, Zhongjun Fang, Han Sun, and Zhe Wang. 2022. Adaptive semantic-enhanced transformer for image captioning. IEEE Transactions on Neural Networks and Learning Systems (2022), 1–12. DOI:
[67]
Jing Zhang, Zhongjun Fang, and Zhe Wang. 2023. Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning. Applied Intelligence 53, 11 (2023), 13398–13414.
[68]
Ji Zhang, Kuizhi Mei, Yu Zheng, and Jianping Fan. 2020. Integrating part of speech guidance for image captioning. IEEE Transactions on Multimedia 23 (2020), 92–104.
[69]
Shanshan Zhao, Lixiang Li, and Haipeng Peng. 2023. Incorporating retrieval-based method for feature enhanced image captioning. Applied Intelligence 53, 8 (2023), 9731–9743.
